RHEL OSP 10/11 – OVS+DPDK Tunables


Tunables for Dell R630s for use when deploying OVS+DPDK

# OSP 10/11 DPDK Tunables
#
# R630 NUMA locality – CPUs
# node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22
# 24 26 28 30 32 34 36 38 40 42 44 46
#
# node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23
# 25 27 29 31 33 35 37 39 41 43 45 47
#
#
# R630 NUMA locality – NIC
# node 0 dpdk interface – p3p1
# node 1 dpdk interface – p1p1
#
#
#
# NovaVcpuPinSet (OSP 10+)
# These are the cores that Nova will use for scheduling instances. Pair sibling threads together.
# Using cores from NUMA node 0 only to prevent crossing NUMA boundaries
NovaVcpuPinSet: "'4,6,8,10,12,14,16,18,20,22,28,30,32,34,36,38,40,42,44,46'"
#
# NeutronDpdkCoreList (OSP 10/11) OvsPmdCoreList (OSP 12+)
# This parameter configures a list of CPU cores to be used by the OVS-DPDK Poll Mode Drivers (PMDs).
# The first physical core of each NUMA node should be reserved for host processes and excluded from this list.
NeutronDpdkCoreList: "'2,26,3,27'"
#
# HostIsolatedCoreList (OSP 10/11) IsolCpusList (OSP 12+)
# A list or range of cores (and their sibling threads) to be appended to the tuned cpu-partitioning profile and isolated from the host.
# These cores will be isolated from any host processes
# Assuming you want to isolate nova cores from all system processes, NovaVcpuPinSet + NeutronDpdkCoreList = HostIsolatedCoreList
HostIsolatedCoreList: "'2,3,4,6,8,10,12,14,16,18,20,22,26,27,28,30,32,34,36,38,40,42,44,46'"
#
# HostCpusList (OSP 10/11) & OvsDpdkCoreList (OSP 12+)
# A list of logical cores used by OVS-DPDK processes (dpdk-lcore-mask) for non-datapath operations.
# These cores must be mutually exclusive from the list of cores in NeutronDpdkCoreList/OvsPmdCoreList and NovaVcpuPinSet.
# Allocate the first physical core (and sibling thread) from each NUMA node irrespective of DPDK interface NUMA locality.
HostCpusList: "'0,24,1,25'"
#
# Provide the number of memory channels in the format - [allowed_pattern: "[0-9]+"]:
NeutronDpdkMemoryChannels: "4"
#
# Set the hugepage memory (in MB) pre-allocated for DPDK on each NUMA node/socket:
NeutronDpdkSocketMemory: "'2048,2048'"
#
# An array of filters used by Nova to filter a node. These filters will be applied in the order they are listed,
# so place your most restrictive filters first to make the filtering process more efficient.
NovaSchedulerDefaultFilters: "RamFilter,ComputeFilter,AvailabilityZoneFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,PciPassthroughFilter,NUMATopologyFilter"
#
# Kernel arguments for Compute node
ComputeKernelArgs: "default_hugepagesz=1GB hugepagesz=1G hugepages=32 iommu=pt intel_iommu=on"
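The NUMA locality and sibling-thread layout above is host-specific. The commands below (using the p3p1 interface and core numbers from this example) can be used to confirm it on a given compute node, and to verify the 1G hugepages once the kernel arguments are applied:

# lscpu | grep -i numa
# cat /sys/class/net/p3p1/device/numa_node
# cat /sys/devices/system/cpu/cpu2/topology/thread_siblings_list
# grep -i huge /proc/meminfo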

 

OpenStack: Configuring SR-IOV in RHEL OSP 8


Introduction

This article documents the steps used to configure SR-IOV in OSP 8 (Liberty) on Dell hardware.

Compute Node Configuration

This section will outline the changes needed to configure SR-IOV on each Compute Node.

BIOS Configuration on Dell Compute Nodes

First, you will need to ssh to the iDRAC of each Compute Node. Then, type the command below to enter the racadm command line.

# racadm

Type the command below to enable SRIOV.

# racadm set BIOS.IntegratedDevices.SriovGlobalEnable Enabled
[Key=BIOS.Setup.1-1#IntegratedDevices]
RAC1017: Successfully modified the object value and the change is in
pending state.
To apply modified value, create a configuration job and reboot
the system. To create the commit and reboot jobs, use "jobqueue"
command. For more information about the "jobqueue" command, see RACADM
help.
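As the RAC1017 message notes, the change stays pending until a configuration job runs. On recent iDRAC firmware this can typically be done with a jobqueue command along the lines of the sketch below (verify the exact syntax against your RACADM version), or you can simply power cycle the node as shown later in this section.

# racadm jobqueue create BIOS.Setup.1-1 -r pwrcycle -s TIME_NOW -e TIME_NA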

Type the command below to verify your configuration.

# racadm get BIOS.IntegratedDevices.SriovGlobalEnable
[Key=BIOS.Setup.1-1#IntegratedDevices]
SriovGlobalEnable=Enabled

If the server already has an OS installed, reboot it to apply the new setting. If it does not, use the racadm commands below to power cycle it.

# racadm serveraction powerdown
# racadm serveraction powerup

Grub Configuration on Compute Nodes

Add "intel_iommu=on" to the GRUB_CMDLINE_LINUX line as shown below.

# cat /etc/default/grub
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL_OUTPUT="console"
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8 crashkernel=auto rhgb quiet intel_iommu=on"
GRUB_DISABLE_RECOVERY="true"
audit=1

First, make a backup of /etc/default/grub.

# cp -p /etc/default/grub /etc/default/grub.$(date +%F_%R)

Edit the line below.

GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8 crashkernel=auto rhgb quiet"

Change it to this.

GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8 crashkernel=auto rhgb quiet intel_iommu=on"

Now, rebuild GRUB config as shown below.

# grub2-mkconfig -o /boot/grub2/grub.cfg
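The new kernel arguments only take effect after a reboot. Once the node is back up, you can verify that the IOMMU was actually enabled:

# cat /proc/cmdline
# dmesg | grep -i -e dmar -e iommu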

Specify the number of VFs to Create in rc.local

Add the following line to /etc/rc.d/rc.local, adjusting the device name and the number of VFs as needed. In this instance, our physical adapters each support 32 VFs.

echo 32 > /sys/class/net/<device>/device/sriov_numvfs

For example:

# echo 32 > /sys/class/net/p1p1/device/sriov_numvfs
# echo 32 > /sys/class/net/p3p1/device/sriov_numvfs

Also, ensure the correct SELinux context is restored.

# restorecon -R -v /etc/rc.d/rc.local

Ensure that /etc/rc.d/rc.local is executable.

# chmod +x /etc/rc.d/rc.local
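Once the echo commands have run (by hand as above, or via rc.local at boot), you can confirm the VF count against what the adapter supports and see the VFs listed on the PF. The sriov_totalvfs and sriov_numvfs files and the ip link output are standard kernel interfaces; substitute your own device names.

# cat /sys/class/net/p1p1/device/sriov_totalvfs
# cat /sys/class/net/p1p1/device/sriov_numvfs
# ip link show p1p1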

Whitelist PCI Devices in nova-compute (Compute)

Tell nova-compute which PCI devices are allowed to be passed through. Edit the file /etc/nova/nova.conf:

[DEFAULT]
pci_passthrough_whitelist = [{"vendor_id":"8086","product_id":"154d"}][{"devname":"p1p1","physical_network":"sriov_net1"},{"devname":"p3p1","physical_network":"sriov_net2"}]

This tells Nova that all VFs belonging to the physical interface "p1p1" may be passed through to VMs and belong to the Neutron provider network "sriov_net1", and that all VFs belonging to the physical interface "p3p1" may be passed through for the network "sriov_net2".
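If you need to confirm the vendor_id/product_id pair referenced in the whitelist, lspci -nn prints it in [vendor:product] form for each device:

# lspci -nn | grep -i ethernet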

Restart nova-compute on each compute node with the command shown below.

# systemctl restart openstack-nova-compute

Install and Enable Neutron Sriov-Agent (Compute)

Note that the sriov-nic-agent is not strictly required; however, we are going to install and configure it anyway.

Install the following rpm.

# yum -y install openstack-neutron-sriov-nic-agent

Now, on each compute node edit the file /etc/neutron/plugins/ml2/ml2_conf_sriov.ini:

[securitygroup]
firewall_driver = neutron.agent.firewall.NoopFirewallDriver

[sriov_nic]
physical_device_mappings = sriov_net1:p1p1,sriov_net2:p3p1

Now enable and start the nic agent.

# systemctl enable neutron-sriov-nic-agent.service && systemctl start neutron-sriov-nic-agent.service
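You can confirm that the agent is running with systemctl and, from a node with the Neutron CLI and admin credentials, check that it has registered with the Neutron server:

# systemctl status neutron-sriov-nic-agent.service
# neutron agent-list | grep -i sriov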

Controller Node Configuration

Perform the following steps on each Controller Node. Note we will modify Nova and Neutron config files.

Neutron-Server changes in /etc/neutron/plugins/ml2/ml2_conf.ini (Controller)

The following changes take place in the file /etc/neutron/plugins/ml2/ml2_conf.ini.

Add sriovnicswitch as mechanism driver.

mechanism_drivers =openvswitch,bsn_ml2,sriovnicswitch

Set type_drivers to vlan as shown below.

type_drivers = vlan

Set tenant_network_types to vlan.

tenant_network_types = vlan

Set flat_networks as shown below where "sriov_net1" and "sriov_net2" are the networks we are going to create.

flat_networks =datacentre,sriov_net1,sriov_net2

Add VLAN ranges for the SRIOV networks to the network_vlan_ranges line as shown below.

network_vlan_ranges =datacentre:10:100,datacentre:101:122,sriov_net1:200:300,sriov_net2:200:300
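Taken together, the relevant stanzas of ml2_conf.ini end up looking roughly like the sketch below. The section names shown ([ml2], [ml2_type_flat], [ml2_type_vlan]) are the standard ML2 sections; adjust the sketch to match where these keys already live in your file.

[ml2]
type_drivers = vlan
tenant_network_types = vlan
mechanism_drivers = openvswitch,bsn_ml2,sriovnicswitch

[ml2_type_flat]
flat_networks = datacentre,sriov_net1,sriov_net2

[ml2_type_vlan]
network_vlan_ranges = datacentre:10:100,datacentre:101:122,sriov_net1:200:300,sriov_net2:200:300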

Neutron-Server changes in /etc/neutron/plugins/ml2/ml2_conf_sriov.ini (Controller)

The change below needs to be made in /etc/neutron/plugins/ml2/ml2_conf_sriov.ini on each controller.
In our case, the vendor_id is 8086 and the product_id is 10ed.

[ml2_sriov]
supported_pci_vendor_devs = 8086:10ed
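The 8086:10ed pair is the Intel 82599/X520 Virtual Function. If you want to double-check the IDs in your environment, run lspci -nn on a compute node after the VFs have been created:

# lspci -nn | grep -i "virtual function"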

Modify Neutron-Server Startup

Edit /usr/lib/systemd/system/neutron-server.service. Here we add --config-file /etc/neutron/plugins/ml2/ml2_conf_sriov.ini to the ExecStart line. See example below.

# cat neutron-server.service
[Unit]
Description=OpenStack Neutron Server
After=syslog.target network.target

[Service]
Type=notify
User=neutron
ExecStart=/usr/bin/neutron-server --config-file /usr/share/neutron/neutron-dist.conf --config-dir /usr/share/neutron/server --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugin.ini --config-file /etc/neutron/plugins/ml2/ml2_conf_sriov.ini --config-dir /etc/neutron/conf.d/common --config-dir /etc/neutron/conf.d/neutron-server --log-file /var/log/neutron/server.log
PrivateTmp=true
NotifyAccess=all
KillMode=process

[Install]
WantedBy=multi-user.target

Restart neutron on the controllers via pacemaker. See command below.

# pcs resource restart neutron-server-clone

Configure nova-scheduler (Controller)

On every controller node running nova-scheduler, add PciPassthroughFilter to the scheduler_default_filters parameter.

Also add new lines for the scheduler_available_filters parameter under the [DEFAULT] section in /etc/nova/nova.conf.

[DEFAULT]
scheduler_default_filters = RetryFilter, AvailabilityZoneFilter, RamFilter, ComputeFilter, ComputeCapabilitiesFilter, ImagePropertiesFilter, ServerGroupAntiAffinityFilter, ServerGroupAffinityFilter, PciPassthroughFilter
scheduler_available_filters = nova.scheduler.filters.all_filters
scheduler_available_filters = nova.scheduler.filters.pci_passthrough_filter.PciPassthroughFilter

Now restart nova-scheduler via Pacemaker as shown below.

# pcs resource restart openstack-nova-scheduler-clone
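With the compute and controller changes in place, the end-to-end workflow is to create a provider network on one of the SR-IOV physical networks, create a port with vnic_type direct, and boot an instance using that port. A minimal sketch with the Liberty-era CLI follows; the network name, VLAN ID, subnet, image, and flavor are examples only.

# neutron net-create sriov_vlan200 --provider:network_type vlan --provider:physical_network sriov_net1 --provider:segmentation_id 200
# neutron subnet-create sriov_vlan200 192.0.2.0/24 --name sriov_vlan200_subnet
# neutron port-create sriov_vlan200 --binding:vnic_type direct --name sriov_port1
# nova boot --flavor m1.medium --image rhel7 --nic port-id=<port id from the previous command> sriov-test-instance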

 

Reference

http://docs.openstack.org/liberty/networking-guide/adv-config-sriov.html

NUMA Node to PCI Slot Mapping in Red Hat Enterprise Linux

 


Sandybridge I/O Controller to PCI-E Mapping 

 

Using a few simple commands, you can easily map a PCI slot back to its directly connected NUMA node. This information comes in very handy when implementing NFV-related technologies such as CPU pinning and SR-IOV.

 

First, you will need to install hwloc and hwloc-gui, if they are not already installed on your system. hwloc-gui provides the lstopo command, so you will need to install the gui package even if you are going to run the command on a headless system.

# yum -y install hwloc.x86_64 hwloc-gui.x86_64

Now you can run lstopo. Below is the output from one of my dual-socket, quad-core Xeon systems.

# lstopo
Machine (40GB)
NUMANode L#0 (P#0 16GB) + Socket L#0 + L3 L#0 (8192KB)
L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
PU L#0 (P#0)
PU L#1 (P#8)
L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
PU L#2 (P#1)
PU L#3 (P#9)
L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
PU L#4 (P#2)
PU L#5 (P#10)
L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
PU L#6 (P#3)
PU L#7 (P#11)
NUMANode L#1 (P#1 24GB) + Socket L#1 + L3 L#1 (8192KB)
L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
PU L#8 (P#4)
PU L#9 (P#12)
L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
PU L#10 (P#5)
PU L#11 (P#13)
L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
PU L#12 (P#6)
PU L#13 (P#14)
L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
PU L#14 (P#7)
PU L#15 (P#15)
HostBridge L#0
PCIBridge
PCI 8086:10c9
Net L#0 "enp8s0f0"
PCI 8086:10c9
Net L#1 "enp8s0f1"
PCIBridge
PCIBridge
PCIBridge
PCI 8086:10e8
Net L#2 "enp5s0f0"
PCI 8086:10e8
Net L#3 "enp5s0f1"
PCIBridge
PCI 8086:10e8
Net L#4 "enp4s0f0"
PCI 8086:10e8
Net L#5 "enp4s0f1"
PCIBridge
PCI 102b:0532
GPU L#6 "card0"
GPU L#7 "controlD64"
PCI 8086:3a22
Block L#8 "sr0"
Block L#9 "sda"
Block L#10 "sdb"
Block L#11 "sdc"

The first 27 lines of output tell you which cores are in each socket.

Lines starting with "HostBridge L#0" list the PCI devices attached to socket 0. On more modern dual socket systems (think Sandybridge) you would have a "HostBridge L#8" section as well.
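If you just need the NUMA node for a single device and do not want to walk the lstopo tree, the kernel also exposes it directly in sysfs. Using the 08:00.0 address of the first 82576 port shown at the end of this article (a value of -1 means the firmware did not report locality):

# cat /sys/bus/pci/devices/0000:08:00.0/numa_node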

 

“The PCI host bridge provides an interconnect between the processor and peripheral components. Through the PCI host bridge, the processor can directly access main memory independent of other PCI bus masters. For example, while the CPU is fetching data from the cache controller in the host bridge, other PCI devices can also access the system memory through the host bridge. The advantage of this architecture lies in its separation of the I/O bus from the processor’s host bus.”

 

Unfortunately, my lab systems are Nehalem-based machines, which use QPI to share a single host bridge between CPU sockets. See the image below.

 


Nehalem QPI Architecture

 

Nonetheless, we are able to determine which CPU socket is associated with a specific PCI device. For this example, we will focus on the devices below, since they are both attached directly to the PCI host bridge rather than to a downstream PCI bus.

 

HostBridge L#0
PCIBridge
PCI 8086:10c9
Net L#0 "enp8s0f0"
PCI 8086:10c9
Net L#1 "enp8s0f1"

Now, using the lspci command, I can find the exact devices per NUMA node.

# lspci -nn | grep 8086:10c9
08:00.0 Ethernet controller [0200]: Intel Corporation 82576 Gigabit Network Connection [8086:10c9] (rev 01)
08:00.1 Ethernet controller [0200]: Intel Corporation 82576 Gigabit Network Connection [8086:10c9] (rev 01)