NUMA Node to PCI Slot Mapping in Red Hat Enterprise Linux

[Figure: Sandy Bridge I/O Controller to PCI-E Mapping]

Using a few simple commands, you can easily map a PCI slot back to its directly connected NUMA node. This information comes in very handy when implementing technologies commonly leveraged by NFV, such as CPU pinning and SR-IOV.

First, you will need to install hwloc and hwloc-gui, if they are not already installed on your system. hwloc-gui provides the lstopo command, so you will need to install the GUI package even if you are going to run the command on a headless system.

# yum -y install hwloc.x86_64 hwloc-gui.x86_64

Now you can run lstopo. Below is the output from one of my dual-socket, quad-core Xeon systems.

# lstopo
Machine (40GB)
  NUMANode L#0 (P#0 16GB) + Socket L#0 + L3 L#0 (8192KB)
    L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
      PU L#0 (P#0)
      PU L#1 (P#8)
    L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
      PU L#2 (P#1)
      PU L#3 (P#9)
    L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
      PU L#4 (P#2)
      PU L#5 (P#10)
    L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
      PU L#6 (P#3)
      PU L#7 (P#11)
  NUMANode L#1 (P#1 24GB) + Socket L#1 + L3 L#1 (8192KB)
    L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
      PU L#8 (P#4)
      PU L#9 (P#12)
    L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
      PU L#10 (P#5)
      PU L#11 (P#13)
    L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
      PU L#12 (P#6)
      PU L#13 (P#14)
    L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
      PU L#14 (P#7)
      PU L#15 (P#15)
  HostBridge L#0
    PCIBridge
      PCI 8086:10c9
        Net L#0 "enp8s0f0"
      PCI 8086:10c9
        Net L#1 "enp8s0f1"
    PCIBridge
      PCIBridge
        PCIBridge
          PCI 8086:10e8
            Net L#2 "enp5s0f0"
          PCI 8086:10e8
            Net L#3 "enp5s0f1"
        PCIBridge
          PCI 8086:10e8
            Net L#4 "enp4s0f0"
          PCI 8086:10e8
            Net L#5 "enp4s0f1"
    PCIBridge
      PCI 102b:0532
        GPU L#6 "card0"
        GPU L#7 "controlD64"
    PCI 8086:3a22
      Block L#8 "sr0"
      Block L#9 "sda"
      Block L#10 "sdb"
      Block L#11 "sdc"

The first 27 lines of output tell you which cores are in each socket.
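
If you just want a quick summary of which CPUs belong to which NUMA node, lscpu and numactl can show that as well (assuming the numactl package is installed):

# lscpu | grep NUMA
# numactl --hardware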

Lines starting with “HostBridge L#0” list the PCI devices attached to socket 0. On more modern dual-socket systems (think Sandy Bridge), you would have a second HostBridge section (for example “HostBridge L#8”) as well, attached to the other socket.

“The PCI host bridge provides an interconnect between the processor and peripheral components. Through the PCI host bridge, the processor can directly access main memory independent of other PCI bus masters. For example, while the CPU is fetching data from the cache controller in the host bridge, other PCI devices can also access the system memory through the host bridge. The advantage of this architecture lies in its separation of the I/O bus from the processor’s host bus.”

Unfortunately, my lab systems are Nehalem-based machines, which use QPI (QuickPath Interconnect) to share a single host bridge between the CPU sockets. See the image below.

[Figure: Nehalem QPI Architecture]

Nonetheless, we are able to determine which CPU socket is associated with a specific PCI device. For this example, we will focus on the devices below, since they are attached directly to the PCI host bridge and not further down the PCI bus.

  HostBridge L#0
    PCIBridge
      PCI 8086:10c9
        Net L#0 "enp8s0f0"
      PCI 8086:10c9
        Net L#1 "enp8s0f1"

Now using the lspci command I can find the exact devices per NUMA node.

# lspci -nn | grep 8086:10c9
08:00.0 Ethernet controller [0200]: Intel Corporation 82576 Gigabit Network Connection [8086:10c9] (rev 01)
08:00.1 Ethernet controller [0200]: Intel Corporation 82576 Gigabit Network Connection [8086:10c9] (rev 01)
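
You can also ask sysfs which NUMA node the kernel associates with a given device, substituting the PCI address from the lspci output above. On a single-I/O-hub Nehalem system like this one, don't be surprised if the value comes back as -1, which simply means the BIOS did not tie the device to a specific node:

# cat /sys/bus/pci/devices/0000:08:00.0/numa_node
# cat /sys/bus/pci/devices/0000:08:00.0/local_cpulist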

5 thoughts on “NUMA Node to PCI Slot Mapping in Red Hat Enterprise Linux”

  1. Hi Christopher,

    Really nice post.
    I’ve been trying to figure out which NIC is connected to which NUMA node on one of my servers.
    I’ve been running lstopo on 2 different servers:
    ProLiant BL460c Gen9
    —————————–

    NUMANode L#0 (P#0 64GB)
      Socket L#0 + L3 L#0 (30MB)
        L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
          PU L#0 (P#0)
          PU L#1 (P#24)
        L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
          PU L#2 (P#1)
          PU L#3 (P#25)
        L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
          PU L#4 (P#2)
          PU L#5 (P#26)
        L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
          PU L#6 (P#3)
          PU L#7 (P#27)
        L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
          PU L#8 (P#4)
          PU L#9 (P#28)
        L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
          PU L#10 (P#5)
          PU L#11 (P#29)
        L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
          PU L#12 (P#6)
          PU L#13 (P#30)
        L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
          PU L#14 (P#7)
          PU L#15 (P#31)
        L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
          PU L#16 (P#8)
          PU L#17 (P#32)
        L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
          PU L#18 (P#9)
          PU L#19 (P#33)
        L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
          PU L#20 (P#10)
          PU L#21 (P#34)
        L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
          PU L#22 (P#11)
          PU L#23 (P#35)
      HostBridge L#0
        PCIBridge
          PCI 8086:10f8
            Net L#0 "eno49"
          PCI 8086:10f8
            Net L#1 "eno50"
        PCIBridge
          PCI 103c:3239
            Block L#2 "sda"
            Block L#3 "sdb"
        PCIBridge
          PCI 8086:10f8
            Net L#4 "ens1f0"
          PCI 8086:10f8
            Net L#5 "ens1f1"
        PCIBridge
          PCI 102b:0533
            GPU L#6 "card0"
            GPU L#7 "controlD64"
    NUMANode L#1 (P#1 64GB) + Socket L#1 + L3 L#1 (30MB)
      L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
        PU L#24 (P#12)
        PU L#25 (P#36)
      L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
        PU L#26 (P#13)
        PU L#27 (P#37)
      L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
        PU L#28 (P#14)
        PU L#29 (P#38)
      L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
        PU L#30 (P#15)
        PU L#31 (P#39)
      L2 L#16 (256KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16
        PU L#32 (P#16)
        PU L#33 (P#40)
      L2 L#17 (256KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17
        PU L#34 (P#17)
        PU L#35 (P#41)
      L2 L#18 (256KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18
        PU L#36 (P#18)
        PU L#37 (P#42)
      L2 L#19 (256KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19
        PU L#38 (P#19)
        PU L#39 (P#43)
      L2 L#20 (256KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20
        PU L#40 (P#20)
        PU L#41 (P#44)
      L2 L#21 (256KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21
        PU L#42 (P#21)
        PU L#43 (P#45)
      L2 L#22 (256KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22
        PU L#44 (P#22)
        PU L#45 (P#46)
      L2 L#23 (256KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23
        PU L#46 (P#23)
        PU L#47 (P#47)

    And on an older brother:

    ProLiant SL390s G7
    ——————————
    Machine (96GB)
      NUMANode L#0 (P#0 48GB) + Socket L#0 + L3 L#0 (12MB)
        L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
          PU L#0 (P#0)
          PU L#1 (P#12)
        L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
          PU L#2 (P#2)
          PU L#3 (P#14)
        L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
          PU L#4 (P#4)
          PU L#5 (P#16)
        L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
          PU L#6 (P#6)
          PU L#7 (P#18)
        L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
          PU L#8 (P#8)
          PU L#9 (P#20)
        L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
          PU L#10 (P#10)
          PU L#11 (P#22)
      NUMANode L#1 (P#1 48GB) + Socket L#1 + L3 L#1 (12MB)
        L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
          PU L#12 (P#1)
          PU L#13 (P#13)
        L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
          PU L#14 (P#3)
          PU L#15 (P#15)
        L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
          PU L#16 (P#5)
          PU L#17 (P#17)
        L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
          PU L#18 (P#7)
          PU L#19 (P#19)
        L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
          PU L#20 (P#9)
          PU L#21 (P#21)
        L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
          PU L#22 (P#11)
          PU L#23 (P#23)
      HostBridge L#0
        PCIBridge
          PCI 8086:10c9
            Net L#0 "enp4s0f0"
          PCI 8086:10c9
            Net L#1 "enp4s0f1"
        PCIBridge
          PCI 15b3:6746
            Net L#2 "enp5s0"
            Net L#3 "enp5s0d1"
            OpenFabrics L#4 "mlx4_0"
        PCIBridge
          PCI 1002:515e
            GPU L#5 "card0"
            GPU L#6 "renderD128"
            GPU L#7 "controlD64"
        PCI 8086:3a20
          Block L#8 "sda"
          Block L#9 "sdb"
        PCI 8086:3a26

    You can see that the output of ProLiant SL390s G7 “HostBridge L#0” is pretty much like your example.
    “HostBridge L#0” and the NICs under it aren’t directly under any of the NUMA nodes.

    If you’ll look at the newer “ProLiant BL460c Gen9” you’ll see that “HostBridge L#0” is directly under “NUMANode L#0” (the output is indented) and indeed the NICs under it are using NUMA node 0.

    I was able to verify this using:

    [root@192-1 ~]# cat /sys/class/net/ens1f0/device/numa_node
    0

    The thing is… on ProLiant SL390s G7 the same command as above produces ‘-1’.
    [root@s167-1 ~]# cat /sys/class/net/enp4s0f0/device/numa_node
    -1

    It made me wonder whether that NIC (enp4s0f0) even ‘prefers’ one NUMA node over the other.
    The following post mentions that it might be related to BIOS ACPI feature:
    https://software.intel.com/en-us/forums/watercooler-catchall/topic/277658

    From your post: “Lines starting with “HostBridge L#0” list the PCI devices attached to socket 0”
    How can you tell that those PCI devices are indeed attached to socket 0?

    10x!

  2. I don’t understand the last part of this. Which NUMA node is 8086:10c9 “more local” to, and why? On my system, which I think is the same, numa_node in /sys/bus/pci/devices/<>/numa_node is -1 … but you seem to be saying it does have some NUMA locality after all?

  3. We can identify the NUMA node a PCI device is attached to with the command below:
    cat /sys/bus/pci/devices/0000\:84\:00.1/numa_node

  4. Is there any possibility that a PCI device is not mapped to its closest CPU node? If that happens, how can we remap the PCI device so it is mapped to its closest CPU node?
