IBM POWER8 100 GigE Adapter Best Practices


Introduction

With the higher network speeds of new network adapters, achieving peak performance requires careful tuning of both the adapters and the workloads that use them. IBM POWER8 servers now support 100 GigE adapters, and this guide will help you understand what performance you can expect and how to maximize utilization. Achieving 100 Gb/s of bandwidth takes careful tuning, and traditional methods of measuring network performance may not show the full potential of the adapters. In addition, the latest adapters can handle a very high number of network packets, depending on the application and the tuning used.

All measurements and tuning below are for TCP/IP traffic. The tuning recommendations apply to Power Systems Scale-Out S812 and Power Systems Scale-Out S82x systems running AIX. In addition to AIX, we also show some results on bare metal Linux (Ubuntu) for comparison.

The sections below cover peak performance, the impact of the number of TCP sockets and message sizes on performance, recommended tuning, and finally how to measure peak performance.

Section 1. Peak performance, single adapter

The following section shows the peak performance of a single adapter port.

Bandwidth

The following measurement results were taken on Power Systems Scale-Out S822, S824, and S822L servers. Measurements on your own system can vary depending on the number of CPUs, the CPU frequency of the machine, and the number of memory DIMMs installed. These measurements were done on machines where all memory DIMM slots were populated, which ensures peak memory performance.

All measurements were made where the partition (LPAR) had direct native adapter access. These results do not apply to virtualization of the adapter to multiple LPARs using VIOS/SEA. However, where noted, we did use PowerVM, which does introduce some virtualization overhead. These measurements were also made under ideal laboratory conditions; your results may vary depending on how the system software and application behave. See Section 2 for more detail on how your application characteristics can affect actual performance.

When using native Linux without PowerVM, we are able to demonstrate link-limited bandwidth. Since our measurements only count actual data transferred, the rate is slightly lower than the 100 GigE speed of the adapter port; the difference is consumed by headers and other data on the cable that support the Ethernet protocol.

Peak performance, single adapter, bare metal environments (BML, Ubuntu, Power Systems Scale-Out S822LC Servers):

MTU 1500: Receive 94 Gb/s, Transmit 94 Gb/s, Duplex 171 Gb/s
MTU 9000: Receive 98 Gb/s, Transmit 98 Gb/s, Duplex 188 Gb/s

When measuring performance on a virtualized system using PowerVM, the 100 GigE adapter does not achieve link-limited bandwidth due to the impact of virtualization in the POWER8 hardware. This is not seen on slower adapters because the peak bandwidth of those adapter ports is below the single-port virtualization limit; it only shows in the results when trying to sustain close to 100 Gb/s. The following are results of running AIX on PowerVM with 100 GigE adapters dedicated to the LPAR. Note that the peak bandwidth is slightly lower than the peaks achieved with bare metal Linux.
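The MTU 9000 (jumbo frame) results assume the interface MTU has been raised from the default of 1500. As a hedged illustration (en0 and eth0 are placeholder interface names; the paper does not show these commands), the MTU is typically changed as follows:

  # AIX: the MTU is an attribute of the enX interface
  chdev -l en0 -a mtu=9000

  # Linux: raise the MTU on the corresponding interface
  ip link set dev eth0 mtu 9000

Every device in the path, including switch ports, must also be configured for jumbo frames.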

Peak performance, single adapter, virtualized environments (AIX 7.2 on PowerVM):

MTU 1500: Receive 88 Gb/s, Transmit 85 Gb/s, Duplex 90 Gb/s
MTU 9000: Receive 97 Gb/s, Transmit 93 Gb/s, Duplex 128 Gb/s

When using virtualization such as PowerVM, we currently see lower bandwidth. With interrupt affinitization, at MTU 1500 AIX reaches around 88 Gb/s and RHEL around 91 Gb/s; at MTU 9000, AIX reaches up to 97 Gb/s.

Latency

The following is the half round-trip latency for the 100 GigE adapter when using a 1-byte message. The difference between bare metal Ubuntu and AIX is that the AIX measurement runs under PowerVM virtualization, which introduces overhead; there are also differences in the TCP/IP implementations and the features they support.

Ubuntu 14.04 BML (Power Systems Scale-Out S822LC Servers): 19.24 usec
AIX 7.2 PowerVM (Power Systems Scale-Out S824 Servers): 26.11 usec

Small message rate

The following are the current small message rates for the 100 GigE adapter. These were measured using 150 concurrent TCP sockets, each passing 1-byte data payloads (messages) back and forth.

Ubuntu 14.04 BML (Power Systems Scale-Out S822LC Servers): 501,809 small RR messages per second at 150 TCP sockets
AIX 7.2 PowerVM (Power Systems Scale-Out S824 Servers): 550,000 small RR messages per second at 150 TCP sockets

The higher small message rate on AIX is because of differences in the TCP implementation and the device drivers.
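The latency and small-message numbers above are request/response (RR) measurements with 1-byte payloads. The paper does not name the tool used for them; as one hedged illustration (the host name is a placeholder, and netperf is our assumption, not necessarily the tool used for these results), a 1-byte TCP request/response test is commonly run as:

  # Single-connection 1-byte request/response test; half round-trip latency
  # can be estimated as 1 / (2 x the reported transactions per second)
  netperf -H server.example.com -t TCP_RR -- -r 1,1

A 150-socket small-message rate of the kind shown above corresponds to running many such request/response connections concurrently.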

Multiple ports per adapter: speedup limitation

Using both ports of the adapter will not give double the speed of a single port. Two ports on the same adapter cannot together exceed the speed of the underlying PCI bus the adapter is plugged into, so the performance limitation comes from the PCI bus, not the adapter. The limit is the 128 Gb/s PCIe bus limit. This limit also applies if you Etherchannel both ports on the same adapter. (Etherchannel is also known as port bonding or LACP on Linux.)

When running multiple 100 GigE adapters, make sure that you have PCIe Gen3 x16 slots available. The adapter will not physically plug into slower x8 and other PCIe slots in the machine. If x16 slots are not available, you may have to move adapters around to free up the higher-speed slots; most slower adapters do not need an x16 slot. In addition, make sure that the PCIe slot is enabled for HDDW addressing. On FSP-based systems you can check this setting from the FSP GUI, known as the ASM (Advanced System Management) interface.

When using multiple high-speed adapters, you also need to ensure that there are enough system resources to support the adapter traffic. Some systems will not support more than two adapters at full speed. You may also run out of CPU if your application consumes a lot of CPU cycles itself, not leaving enough for the network traffic. See more on CPU requirements later in this article.
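As a hedged way to confirm slot placement on a bare metal Linux system (not a procedure from the paper; the PCI address below is a placeholder), lspci reports the negotiated link speed and width of the adapter; a PCIe Gen3 x16 link shows 8 GT/s and x16 in the LnkSta line:

  # Locate the adapter's PCI address, then inspect its link capability and status
  lspci | grep -i ethernet
  lspci -vv -s 0003:01:00.0 | grep -E 'LnkCap|LnkSta'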

Section 2. Traffic characteristics affecting peak performance

If an application or workload does not use the same number of TCP sockets or the same message sizes used in our measurements, its performance will be lower. This section shows the impact of fewer sockets and of message size on actual performance.

As seen in the graph, peak receive bandwidth is not obtained until there are 40 TCP sockets all receiving at the same time. If only 8 sockets are receiving, the bandwidth drops to about 26 Gb/s. For a single TCP socket, for things like FTP, the bandwidth drops to about 4 Gb/s.

The following graph is for small message rates. This measurement is the number of messages exchanged between two machines using multiple TCP sockets when each TCP socket has only one packet in transit. To reach the peak small message rate of 550K messages per second, you need 100 or more TCP sockets active at any one time. With only 20 TCP sockets active the rate drops to 276K per second, and with 1 socket to only 19K per second.
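As a hedged sketch of how this socket-count sensitivity can be reproduced (the paper does not give these exact commands; the host name is a placeholder), a multi-stream benchmark such as iperf can be run with different numbers of parallel TCP connections:

  # On the receiving system
  iperf -s

  # On the sending system: 1, 8, and 40 parallel TCP streams, 60 seconds each
  iperf -c server.example.com -P 1 -t 60
  iperf -c server.example.com -P 8 -t 60
  iperf -c server.example.com -P 40 -t 60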

The following graph shows bandwidth at different TCP socket write sizes. An application or workload has to write at least 32K to the TCP socket at a time to achieve peak bandwidth. If only 4K is written at a time, adapter utilization will peak at around 25 Gb/s.
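The write size can be varied the same way. As a hedged illustration (again not a command from the paper; iperf's -l option sets the length of each socket write, and the host name is a placeholder), comparing 4K and 32K writes shows the effect described above:

  # 40 parallel streams with 4K writes versus 32K writes
  # (with iperf -s still running on the receiving system)
  iperf -c server.example.com -P 40 -l 4K -t 60
  iperf -c server.example.com -P 40 -l 32K -t 60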

Section 3. Adapter tuning options

The following are recommended tuning changes to apply to AIX for peak bandwidth performance.

Adapter tuning options:

queues_rx (default 8, recommended 20): Number of receive queues used by the network adapter for incoming network traffic.
queues_tx (default 2, recommended 12): Number of transmit queues used by the adapter for outbound network traffic.
rx_max_pktx (default 1024, recommended 2048): Receive queue maximum packet count.
tx_send_cnt (default 8, recommended 16): Number of transmit packets chained for adapter processing.

1) To display the current values: lsattr -El entX, where X is the number of the network adapter.
2) To list the settable values of an attribute: lsattr -Rl entX -a <attribute>
3) To change the current value: chdev -l entX -a <attribute>=<value>

(A sample sequence applying these recommended values is sketched at the end of this section.)

For more information see these links:
http://www.ibm.com/support/knowledgecenter/ssw_aix_72/com.ibm.aix.cmds1/chdev.htm
http://www.ibm.com/support/knowledgecenter/ssw_aix_72/com.ibm.aix.cmds3/lsattr.htm

The benefit of using more receive queues is that each queue has a unique MSI-X interrupt, so packets are spread across more queues and thus across more CPU threads, ensuring the CPU is not the bottleneck. This can also help reduce latency for latency-sensitive workloads. However, do not increase the number of queues beyond what is needed for good performance. Increasing the number of queues has these negative impacts:

1. It consumes more memory for receive buffers, as each queue has to have its own receive buffer pool.
2. It spreads the interrupts across more queues, which results in lower interrupt coalescing (i.e., fewer packets per interrupt) and thus higher interrupt overhead.

Single-threaded workloads, such as a single FTP transfer, will only use a single transmit or receive queue because of how TCP connections are hashed to a queue. Multiple queues are needed to ensure good performance as more TCP connections are in use and more CPU threads (applications) are active. Above some point, however, more queues just consume more memory and increase system interrupt overhead with no increase in throughput.
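Here is the sample sequence referenced above, as a hedged sketch (ent0 is a placeholder adapter name; the attribute names and values come from the table in this section):

  # Display the current attribute values for the adapter
  lsattr -El ent0

  # List the allowed values for an attribute before changing it
  lsattr -Rl ent0 -a queues_rx

  # Apply the recommended values; add -P to defer the change to the next
  # reboot if the adapter is currently in use
  chdev -l ent0 -a queues_rx=20 -a queues_tx=12
  chdev -l ent0 -a rx_max_pktx=2048 -a tx_send_cnt=16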

Section 4. Interrupt affinitization

To get the best performance out of any high-speed adapter, POWER8 servers need to be configured so that incoming interrupt processing is handled on cores as close as possible to the PCIe bus the adapter is on. This lowers latency and improves response time for adapter events. The following sections describe how to find the location of the adapters in a system and how to affinitize the interrupt handling to cores close to that PCI bus. Not affinitizing the interrupt handling can cause the peak bandwidth to drop by over 30%.

Determining adapter location

To determine the location code of the adapter, you can either check the LPAR configuration in the HMC or query it from the OS. On AIX, to find the hardware location of the adapter, use:

lscfg -vl entX

where X is the number of the adapter. To find the adapter location code on Linux, use:

iface=ethX; cat /proc/device-tree/`cat /sys/class/net/$iface/device/devspec`/ibm\,loc-code ; echo

where ethX is the interface name for the adapter from ifconfig.

Determining which CPUs are local to the adapter

Each POWER8 system model has a different PCI bus numbering scheme. The following tables list which CPU socket a given PCI bus is attached to. Once you have the location of the adapter, check the tables to determine which range of CPUs to affinitize the interrupts to. Each system can have a different CPU range even if the systems have the same number of cores. The number of CPUs seen by the operating system is the number of cores multiplied by the SMT level: a system with 16 cores in ST mode will show only 16 CPUs, while a system with 32 cores in SMT4 will show 128 CPUs.

Here are the CPU ranges to assign interrupts on for various systems:

Power Systems Scale-Out S812 Servers
  Any CPU will do, because the PCI bus is local to the single CPU socket in the system.

Power Systems Scale-Out S822, S824, and S822L Servers
  Slot location ending in C6 or C7: first half of the CPUs in the system
  Slot location ending in C3 or C5: last half of the CPUs in the system

P850
  Slot location ending in C10 or C12: first quarter of the CPUs in the system
  Slot location ending in C8 or C9: second quarter of the CPUs in the system
  Slot location ending in C3 or C4: third quarter of the CPUs in the system
  Slot location ending in C1 or C2: last quarter of the CPUs in the system

E870/E880
  For E870/E880 systems you will need assistance from IBM.

Multiple LPAR location determination

If multiple LPARs are configured, it is possible that the bus with the adapter is not local to the resources of the LPAR, and therefore no affinitization may be possible. Consult IBM for further information. If you are using DLPAR, no affinitization is possible.

Finding the interrupt numbers for an adapter

entstat -d entX lists the interrupt numbers for the transmit and receive queues.

Binding interrupts: AIX

On AIX the bindintcpu command is used to bind interrupts to CPUs. Man page information:
http://www.ibm.com/support/knowledgecenter/ssw_aix_72/com.ibm.aix.cmds1/bindintcpu.htm

CAUTION: Changing the number of rx and tx queues changes the interrupt numbers. The output listed by entstat will change if the number of transmit and receive queues is changed on the adapter. A reboot may also change the interrupt numbers; after a reboot, entstat may not report the same interrupt numbers as before. If you use bindintcpu in a script, the values may need to be updated.
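As a hedged sketch of the AIX binding step (ent0, the interrupt levels, and the CPU numbers below are placeholders; take the real interrupt numbers from the entstat -d output and the CPU range from the tables above):

  # List the interrupt numbers used by the adapter's transmit and receive queues
  entstat -d ent0 | grep -i interrupt

  # Bind example interrupt levels to CPUs in the adapter-local range
  # (bindintcpu takes the interrupt level followed by one or more CPU numbers)
  bindintcpu 4321 8
  bindintcpu 4322 12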

Binding interrupts: Linux

On Linux, the device driver automatically tries to affinitize interrupts to cores local to the adapter when the system is initialized.
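As a hedged way to verify the resulting placement on Linux (not a step from the paper; eth0 and the IRQ number are placeholders):

  # List the IRQs belonging to the interface and their per-CPU interrupt counts
  grep eth0 /proc/interrupts

  # Show which CPUs a given IRQ is allowed to run on
  cat /proc/irq/123/smp_affinity_list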

Section 5. How to measure peak performance

To measure the peak bandwidth of the 100 GigE adapter, most multi-socket network benchmarks will work; typically iperf and uperf are used. To achieve peak bandwidth, as seen earlier, up to 40 TCP sockets may be needed. On BML, since there is no overhead from the virtualization layer, you can demonstrate link-limited bandwidth with as few as 8 TCP sockets. To achieve peak bandwidth under PowerVM, however, it may take 24 to 40 TCP sockets.

When setting up the traffic profile, you will need a TCP socket write size of 32K or larger. Smaller TCP socket write sizes, the default in many network benchmarks, may not show link-limited bandwidth because of the overhead of processing smaller buffers from the socket layer. On AIX, you will also need to increase the TCP send and receive space sizes to 768K or larger; this larger TCP window size is needed to keep data flowing between the two systems at 100 Gb/s. On Linux, the default settings, which allow tcp_wmem and tcp_rmem to grow to a 4 MB upper limit, will suffice. (A sample setup is sketched below, before the switch tuning notes.)

CPU requirements

Driving 100 Gb/s of network traffic requires a lot of CPU. For 10 GigE adapters we usually recommend between 0.7 and 1 core on AIX; since 100 GigE is 10 times faster, we recommend that at least 7 cores of CPU be available to measure link-limited bandwidth. Linux requires less CPU, in the range of 5 to 6 cores, depending on CPU frequency. These estimates do not cover using SEA/VIOS or KVM with vhost_net.

Multiple client performance

Using more than one client machine generally shows more consistent, and sometimes slightly better, performance. At 100 GigE speeds, with only one client the performance is limited by the slower of the two otherwise identically configured machines. Using multiple clients removes the lowest-common-speed machine from the results, so you end up measuring the server's performance. Using multiple clients requires a 100 GigE switch; see the switch tuning tips below.

Apply latest PTFs

Before making performance measurements, make sure you have applied the latest updates or released code to pick up any recent improvements.
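The sample setup referenced above, as a hedged sketch (the host name is a placeholder; the 768K windows and 32K-or-larger writes come from the guidance in this section, while rfc1323, which enables TCP window scaling for windows above 64 KB, is our addition and is not mentioned in the paper):

  # AIX: raise the TCP send/receive space to 768 KB (786432 bytes) and enable
  # TCP window scaling so that windows larger than 64 KB take effect
  no -o tcp_sendspace=786432 -o tcp_recvspace=786432 -o rfc1323=1

  # Bandwidth run: 40 parallel TCP streams with 64 KB socket writes, 60 seconds
  iperf -s                                        # on the receiving system
  iperf -c server.example.com -P 40 -l 64K -t 60  # on the sending system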

Ethernet switch tuning

With higher network speeds, tuning of the network and switches becomes more important for achieving peak rates. With 100 GigE, our measurements found that if flow control through the network is not set up correctly, performance can suffer. Because 100 GigE networks are 2.5 to 10 times faster than current networks, any stall or delay in network traffic has a much larger negative impact on throughput. At this bandwidth you cannot take flow control through the network for granted, so check that Global Pause is turned on in all switch ports and adapters used.
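As a hedged illustration of checking the adapter side of flow control on Linux (not a command from the paper; eth0 is a placeholder, and the switch-port side must be verified in the switch's own management interface):

  # Show the current pause frame (flow control) settings for the adapter
  ethtool -a eth0

  # Enable receive and transmit pause frames on the adapter
  ethtool -A eth0 rx on tx on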