DTN End Host performance and tuning
1 DTN End Host Performance and Tuning: 100 Gigabit Ethernet & NVMe Disks. Richard Hughes-Jones, Senior Network Advisor, Office of the CTO, GÉANT Association. Cambridge Workshop: Moving My Data at High Speeds over the Network, Prague, 12 Jun 2016
2 The GÉANT DTN. A story of how we explored the hardware and what we found. Almost a technical report of progress. Not really a teach-in, but we do give some hints for tuning. Welcome input.
3 The GÉANT DTN Hardware. A lot of help from Boston Labs (London, UK) and Mellanox (UK & Israel). Supermicro X10DRT-i+ motherboard. Two 6-core 3.4 GHz Xeon E5 v3 processors. Mellanox ConnectX-4 100 GE NIC, 16-lane PCIe, as many interrupts as cores, driver MLNX_OFED. NVMe SSD set, 8-lane PCIe. Fedora 23 with the fc23.x86_64 kernel.
4 Explore the Hardware. How are the peripherals connected? NUMA: which PCIe interface & bus is connected to which CPU socket or node? To which core do the IRQs go?
Tools: dmesg, lspci -tv, lspci -vv (for the PCIe slot), numactl -H. Look in /sys & /proc, e.g. cat /proc/irq/<irq>/smp_affinity
[root@geant_dtn1 mlnx_tuning_scripts]# ./show_irq_affinity.sh enp131sf1
[Output: one line per NIC IRQ (183-194), each giving the hex CPU-affinity mask for that IRQ]
[root@dhcp richard]# lspci -tv
[PCIe device tree, abridged: Mellanox MT27500 Family ConnectX-3 and MT27700 Family ConnectX-4 NICs on one socket's PCIe root; Intel Ethernet Controller 10-Gigabit X540-AT2 ports; several Intel PCIe Data Center SSDs; Xeon E7 v3/Xeon E5 v3 QPI links joining the two sockets]
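The smp_affinity files above hold a hex bitmask of CPU cores, one bit per core. A small sketch (illustrative, not part of the tuning scripts) decoding such a mask into a core list:

```python
def cores_from_affinity(mask_hex: str) -> list[int]:
    """Decode a /proc/irq/<irq>/smp_affinity hex bitmask into CPU core numbers."""
    # Large systems print the mask as comma-separated 32-bit words; strip the commas
    mask = int(mask_hex.replace(",", ""), 16)
    return [core for core in range(mask.bit_length()) if mask & (1 << core)]

print(cores_from_affinity("4"))   # core 2
print(cores_from_affinity("8"))   # core 3
print(cores_from_affinity("c0"))  # cores 6 and 7
```

So an IRQ whose mask reads 4 is pinned to core 2, and 8 to core 3, which is how the per-core spread of NIC interrupts can be read off the listing.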
5 udpmon: UDP Achievable Throughput. Ideal shape: flat portions limited by the capacity of the link, i.e. the available bandwidth on a loaded link. [Plot: received wire rate (Mbit/s) vs spacing between frames (µs), Mumbai-Singapore, for a range of packet sizes up to 1472 bytes] The shape follows 1/t: packet spacing is the most important parameter. Cannot send packets back-to-back; the end host adds NIC setup time on PCI / context switches.
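The 1/t shape follows directly from the arithmetic: for a fixed frame size the offered rate is frame bits divided by the inter-frame spacing, until the link capacity caps it. A sketch of that relation (an assumed helper, not part of udpmon; 100 Gbit/s link assumed):

```python
def achievable_rate_gbps(frame_bytes: int, spacing_us: float, link_gbps: float = 100.0) -> float:
    """Wire rate when frames of frame_bytes are sent every spacing_us microseconds,
    capped by the link capacity (the flat portion of the udpmon curve)."""
    offered = frame_bytes * 8 / (spacing_us * 1e-6) / 1e9  # Gbit/s offered by the sender
    return min(offered, link_gbps)

# 8972-byte jumbo frames: the spacing dominates once the offered load is below line rate
for spacing in (0.72, 1.0, 2.0, 4.0):
    print(f"{spacing:4.2f} us -> {achievable_rate_gbps(8972, spacing):6.2f} Gbit/s")
```

Halving the spacing doubles the rate until the flat portion is reached, which is why the plots are drawn against inter-frame spacing rather than offered load.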
6 udpmon on Boston Lab hosts: Achievable Throughput & Packet loss. Move IRQs from core 11, set affinity to lock udpmon to core 11, node 1. Interrupt coalescence on (3 µs). [Plots (haswell1-2_x_nobackground_5nov15): received wire rate (Mbit/s) and % CPU kernel mode for sender and receiver vs spacing between frames (µs), 8972-byte frames] 96% kernel mode when sending: swapping between user and kernel mode.
7 udpmon on GÉANT DTN: Achievable Throughput & Packet loss. Move IRQs from core 6, set affinity to lock udpmon to core 6, node 1. Interrupt coalescence on (16 µs). [Plots (DTN1-2_noFW_a4_4Jun16): received wire rate (Gbit/s) and % CPU kernel mode for sender and receiver vs spacing between frames (µs), for packet sizes up to 8972 bytes including 7813 bytes] The jumbo-size packet should be highest! Swapping between user and kernel mode; also lost packets in the receiving host.
8 udpmon_send: How fast can I transmit? Sending rate as a function of packet size. Move IRQs from core 6, set affinity to lock udpmon to core 6, node 1. [Plots (pkt_size_geant_dtn1_4may16): send user data rate (Mbit/s) and average send time per packet (µs) vs size of user data in packet (bytes)] Drop of 14.5 Gbit/s from 43 Gbit/s: a step of 0.75 µs in the send time, occurring at a particular user-data size. A second drop of 3.6 Gbit/s: a step of 0.29 µs. Collaborating with Mellanox.
9 Some aspects of the Impact of Reality on Throughput
10 udpmon_send: How fast can I transmit? Turn off checksum offload. Move IRQs from core 6, set affinity to lock udpmon to core 6, node 1. Only ~2 Gbit/s drop, but this can also turn off TCP offload! [Plot (pkt_size_txoffrxoff_geant_dtn1_ax4_4jun16): send user data rate (Mbit/s) vs size of user data in packet (bytes), checksum offload ON vs OFF]
11 udpmon_send: How fast can I transmit? Turn on the firewalls. Move IRQs from core 6, set affinity to lock udpmon to core 6, node 1. Maximum drop of 11 Gbit/s at 7872 bytes. [Plot (pkt_size_dtn1_ax4_5jun16): send user data rate (Mbit/s) vs size of user data in packet (bytes), checksum on, firewall ON vs OFF]
12 udpmon_send: How fast can I transmit? Which CPU core and node? Move IRQs from core 6, set affinity to lock udpmon to core 6, node 1; then run udpmon on core 2 (no IRQs), firewalls ON. Maximum drop of 7.2 Gbit/s at 8972 bytes. [Plot (pkt_size_xsumonfwon_geant_dtn1_ax4_5jun16): send user data rate vs user-data size, core 6 node 1 vs core 2 node 0]
Summary for 7772-byte user data:
Case        Gbit/s
No FW       43.3
FW ON       32.4
Wrong core  25.5
13 udpmon_send on servers at Boston Labs: 5 UDP flows. Set affinity to lock udpmon_send to run on CPU cores on node 1. Send 8972-byte packets with wait time 3.6 µs (~20 Gbit/s per flow); record the time-series every 5 s. 3 flows had no packet loss at ~20 Gbit/s, did not process IRQs, ~45% kernel mode. The other 2 flows had 25-28% packet loss, processed 3-4% softirq, 96% kernel mode. [Plot (udpmon_tseries_sum_24nov): throughput (Gbit/s) and % packet loss per flow vs time during transfer (hr)]
14 TCP Achievable Throughput
15 iperf3: TCP throughput vs TCP buffer size. Distribute IRQs over all cores on node 1; run iperf3 on core 6, node 1. Firewalls OFF, TCP offload on, TCP cubic stack. RTT 0.04 ms, so the delay-bandwidth product is 0.5 MBytes. As expected, throughput rises smoothly to a plateau at 0.7 MBytes, reaching 75 Gbit/s. Throughput is constant after slow start. No TCP retransmitted segments observed (iperf3 and /proc/net/snmp). [Plot (DTN1-2_A6_TCPbuf_31May): BW (Gbit/s) vs buffer size (MByte)]
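The delay-bandwidth product quoted above is just rate × RTT; a quick check (assuming a 100 Gbit/s link and the 0.04 ms RTT of the back-to-back DTNs):

```python
def bdp_mbytes(rate_gbps: float, rtt_ms: float) -> float:
    """Delay-bandwidth product in MBytes: the TCP buffer needed to keep the pipe full."""
    return rate_gbps * 1e9 * rtt_ms * 1e-3 / 8 / 1e6

print(bdp_mbytes(100, 0.04))  # 0.5 MBytes for a 100 Gbit/s path with 0.04 ms RTT
print(bdp_mbytes(75, 0.04))   # buffer needed just to sustain the observed 75 Gbit/s
```

A plateau beginning a little above the BDP is what one would expect: once the buffer exceeds the BDP, the window no longer limits throughput.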
16 iperf3: TCP throughput. Which CPU core and node? Distribute IRQs over all cores on node 1; run iperf3 on core 6, node 1, and repeat on core 1, node 0. Firewalls OFF, TCP offload on, TCP cubic stack. RTT 0.04 ms, delay-bandwidth product 0.5 MBytes. Rises smoothly to a plateau at 0.5 MBytes. Throughput falls by 40 Gbit/s, from 75 to 35 Gbit/s. No TCP retransmitted segments observed (iperf3 and /proc/net/snmp). [Plot: BW (Gbit/s) vs buffer size (MByte), core 6 node 1 vs core 1 node 0]
17 iperf3: TCP throughput. Use cores 6 & 1, on node 1 and node 0. Firewalls OFF, TCP offload on, TCP cubic stack. Rises smoothly to the plateau. Throughput: 75 Gbit/s with both send & receive on node 1; 60 Gbit/s with send on node 0 and receive on node 1; 35 Gbit/s with both send & receive on node 0. Very few TCP retransmitted segments observed. [Plot (DTN1-2_TCPbuf): BW vs buffer size for the core 6 - core 6, core 1 - core 1 and mixed pairings]
18 iperf3: TCP throughput with firewall ON. Run iperf3 on core 6, node 1; TCP offload on, TCP cubic stack. RTT 0.04 ms, delay-bandwidth product 0.5 MBytes. Rises smoothly to a plateau at 0.5 MBytes. Achievable throughput falls by 7.3 Gbit/s. No TCP retransmitted segments observed (iperf3 and /proc/net/snmp). [Plot: BW vs buffer size, no firewall vs with firewall]
19 iperf: TCP throughput, multiple flows. Distribute IRQs over all cores on node 1; run iperf on cores 6-11 for both receive and send. Firewalls ON, TCP offload on, TCP cubic. Total throughput increases 60 → 86 Gbit/s going from 2 to 3 flows, reaches 98 Gbit/s for 4 & 5 flows, then starts to fall. ~0% retransmits with 2 and 3 flows, 1-4% with 4 and 5 flows, 1-3% with 8 and 10 flows. Individual flows can vary by ±5 Gbit/s. [Plots (DTN1-2_A6-11_P2...P8_TCPbuf_5Jun16): BW vs buffer size for 2, 3, 4, 5 and 8 parallel flows]
20 Network Tuning for 100 Gigabit Ethernet.
Hyper-threading: turn off in the BIOS.
Wait states: disable / minimise use of C-states, via the BIOS or at boot time (but I could not find out how in my BIOS!).
Power saving / core frequency: set the governor to performance; set cpufreq to maximum. Depends on the scaling_driver: acpi-cpufreq allows setting cpuinfo_cur_freq to max; intel_pstate does not, but seems fast anyway.
NUMA: check and select CPU cores in the node with the Ethernet interfaces attached: $ numactl -H, $ lspci -tv
21 Network Tuning for 100 Gigabit Ethernet.
IRQs: turn off the irqbalance service (#systemctl stop irqbalance.service); this prevents the balancer from changing the affinity scheme. Set the affinity of the NIC IRQs to use CPU cores on the node with the PCIe slot, 1 per CPU: #cat /proc/irq/<irq>/smp_affinity, #echo 4 > /proc/irq/183/smp_affinity. For UDP it seems best NOT to use the CPU cores used by the apps.
Interface parameters: ensure interrupt coalescence is ON (3 µs, 16 µs, more?): #ethtool -C <i/f> rx-usecs 8. Ensure Rx & Tx checksum offload is ON and tcp-segmentation-offload is ON: #ethtool -K <i/f> rx on tx on
MTU: set the IP MTU to 9000 bytes; best done in files, e.g. ifcfg_ethX: MTU=9000
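The mask written with echo is simply 1 << core in hex; echo 4 pins an IRQ to core 2. A sketch generating one pinning command per NIC IRQ (the IRQ numbers 183+ and cores 6-11 are taken from the slides; purely illustrative):

```python
def affinity_mask(core: int) -> str:
    """Hex smp_affinity mask that pins an IRQ to a single CPU core."""
    return format(1 << core, "x")

# One NIC IRQ per core on the NUMA node holding the PCIe slot (cores 6-11 assumed)
for irq, core in zip(range(183, 189), range(6, 12)):
    print(f"echo {affinity_mask(core)} > /proc/irq/{irq}/smp_affinity")
```

Running the loop prints the echo commands; mask 4 is core 2, mask 40 is core 6, and so on.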
22 Network Tuning for 100 Gigabit Ethernet.
Queues: set txqueuelen, the transmit queue (I used 1000, but 10,000 is recommended). Set netdev_max_backlog, the queue between the interface and the IP stack, say 250,000.
Kernel parameters: net.core.rmem_max, net.core.wmem_max, net.ipv4.tcp_rmem, net.ipv4.tcp_wmem (min / default / max), net.ipv4.tcp_mtu_probing (jumbo frames), net.ipv4.tcp_congestion_control. Better to choose fewer high-speed cores. Best set in the file /etc/sysctl.conf.
See also the Mellanox Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf and ESnet FasterData: https://fasterdata.es.net/network-tuning/
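Those kernel parameters end up as lines in /etc/sysctl.conf. A sketch generating such a fragment; the values are assumptions in the spirit of the ESnet FasterData guidance, not figures from the talk:

```python
# Illustrative /etc/sysctl.conf fragment for a 100GE host; all values are assumptions
settings = {
    "net.core.rmem_max": 536870912,               # max socket receive buffer (bytes)
    "net.core.wmem_max": 536870912,               # max socket send buffer (bytes)
    "net.ipv4.tcp_rmem": "4096 87380 536870912",  # min / default / max receive buffer
    "net.ipv4.tcp_wmem": "4096 65536 536870912",  # min / default / max send buffer
    "net.ipv4.tcp_mtu_probing": 1,                # useful with jumbo frames
    "net.ipv4.tcp_congestion_control": "cubic",
    "net.core.netdev_max_backlog": 250000,        # queue between interface and IP stack
}
fragment = "\n".join(f"{key} = {value}" for key, value in settings.items())
print(fragment)
```

The printed fragment can be appended to /etc/sysctl.conf and applied with sysctl -p.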
23 A Look at the Disk Sub-system
24 NVMe Disks. Non-Volatile Memory express: a scalable host controller interface. Designed for SSDs attached via PCIe: PCIe cards or 2.5" drives. Block IO based on a lockless block layer. A shorter data path bypasses the costly AHCI / SCSI layers. Latency & CPU cycles reduced by >50%: SCSI 6.0 µs and 19,500 cycles vs NVMe 2.8 µs and 9,100 cycles. Parallelism: per-CPU HW queues.
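The >50% figure checks out arithmetically; a quick check using the numbers on the slide:

```python
def reduction(before: float, after: float) -> float:
    """Fractional reduction going from the SCSI path to the NVMe path."""
    return (before - after) / before

print(f"latency: {reduction(6.0, 2.8):.0%}")       # µs per I/O, SCSI -> NVMe
print(f"cycles:  {reduction(19_500, 9_100):.0%}")  # CPU cycles per I/O, SCSI -> NVMe
```

Both the latency and the CPU-cycle cost drop by about 53%, consistent with the claim.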
25 NVMe Disk Performance: RAID0 with 2 disks. IRQs distributed over all cores on both nodes; run disk_test on core 2, node 0. Measure sequential read and write disk-memory rates as a function of file size; 2 disks in RAID0, xfs file system. Drop at file size ~30 GBytes, to 27 Gbit/s read and 15 Gbit/s write. [Plots (DTN2_a2_R-2d_filescan_7Jun16, DTN2_a2_filescan_6Jun16): read and write throughput (Gbit/s and GBytes/s) vs file size (GBytes)]
26 NVMe Disk Performance: 1 NVMe disk. IRQs distributed over all cores on both nodes; run disk_test on core 2, node 0. Measure sequential read and write disk-memory rates as a function of file size; xfs file system. For 1 disk, read < write! [Plots (DTN2_a8_D1-1d_filescan_7Jun16 vs DTN2_a2_R-2d_filescan_7Jun16): read and write throughput (Gbit/s) vs file size (GBytes), 1 disk vs 2-disk RAID0]
27 Disk Tuning: Steps for Discussion. Really a to-do-next list:
RAID with more disks; note the CPU loads.
mkfs.xfs parameters, stripe size 256k? e.g. mkfs.xfs -f -l version=2 -i size=1024 -n size= -d su=256k,sw=22 -L myname
Read ahead: blockdev --setra
Requests: echo 512 > /sys/block/sda/queue/nr_requests
Scheduler: echo deadline > /sys/block/sda/queue/scheduler
Other software RAID products.
Measure data transfers.
28 Summary for Discussion. UDP flows are harder than expected; working with Mellanox. Use of CPU cores seems critical for both UDP and TCP. Shown that TCP performance is good. Explored some of the aspects that impact throughput. Started to understand NVMe disk behaviour, with help from Boston Labs UK. Just starting to run globus and GridFTP. Let's open the discussion.
29 Thank you. Richard Hughes-Jones, Richard.Hughes- GEANT Limited on behalf of the GN4 Phase 1 project (GN4-1). The research leading to these results has received funding from the European Union's Horizon 2020 research and innovation programme under Grant Agreement No (GN4-1).
30 Setup at Boston Labs: 100 Gbit Ethernet NIC. A lot of help from Boston Labs (London, UK). Supermicro X10DRT-P motherboard. Two 10-core 2.3 GHz Intel Xeon E5-2650 v3 Haswell processors. Mellanox ConnectX NIC, 16-lane PCIe, as many interrupts as cores. CentOS 6.7 with the el6 kernel. Initially Hyper-Threading was on: 40 CPUs!
31 What is udpmon? A software package for investigating end host and network performance using UDP/IP frames. Programs work in client-server pairs to: transmit streams of sequenced UDP packets at regular, carefully controlled intervals (frame size and frame transmit spacing can be varied); receive and check the sequence & timing of the packets; identify whether packets were lost in the end host or the network. Allows measurement of: request-response latency; achievable UDP bandwidth, packet loss, packet ordering, jitter; packet dynamics & packet loss patterns; quality of the connection path and its stability.
32 The client-server pairs:
udpmon_bw_mon → udpmon_resp: achievable UDP bandwidth, packet loss, packet ordering, jitter; packet dynamics & packet loss patterns.
udpmon_req → udpmon_resp: request-response latency.
udpmon_send → udpmon_recv: quality of the connection path and its stability; time series of achievable UDP bandwidth and packet loss.
33 Achievable UDP Throughput Measurements. Send a controlled stream of UDP frames spaced at regular intervals, with 64-bit sequence numbers & send time stamp; record the packet receive time.
Sender-receiver exchange:
1. Zero stats, set concurrent lockout; receiver replies OK done.
2. Send data frames at regular intervals (n bytes, wait time, number of packets). The sender records the time to send and the inter-packet time (histogram); the receiver records the time to receive.
3. Get remote statistics. The receiver sends statistics back: no. received, no. lost + loss pattern, no. out-of-order, no. lost in network, CPU load, no. interrupts & SNMP, Tx & Rx times & 1-way delay time.
4. Signal end of test; receiver replies OK done.
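From the 64-bit sequence numbers the receiver can separate loss from reordering. A minimal sketch of that bookkeeping (not udpmon's actual code; sequence numbers assumed to start at 0):

```python
def packet_stats(seqs: list[int]) -> dict:
    """Classify received sequence numbers: count received, out-of-order, and lost packets."""
    received = len(seqs)
    # A packet arriving with a lower sequence number than its predecessor is out of order
    out_of_order = sum(1 for a, b in zip(seqs, seqs[1:]) if b < a)
    expected = max(seqs) + 1 if seqs else 0  # highest sequence number seen defines the stream
    lost = expected - len(set(seqs))
    return {"received": received, "out_of_order": out_of_order, "lost": lost}

# Packet 3 lost in transit; packets 5 and 4 arrive swapped
print(packet_stats([0, 1, 2, 5, 4, 6]))  # {'received': 6, 'out_of_order': 1, 'lost': 1}
```

Distinguishing loss in the network from loss in the end host additionally needs the NIC/SNMP counters mentioned above, which this sketch does not model.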
More informationMotivation CPUs can not keep pace with network
Deferred Segmentation For Wire-Speed Transmission of Large TCP Frames over Standard GbE Networks Bilic Hrvoye (Billy) Igor Chirashnya Yitzhak Birk Zorik Machulsky Technion - Israel Institute of technology
More informationVideo capture using GigE Vision with MIL. What is GigE Vision
What is GigE Vision GigE Vision is fundamentally a standard for transmitting video from a camera (see Figure 1) or similar device over Ethernet and is primarily intended for industrial imaging applications.
More informationPerformance Pack. Administration Guide Version R70. March 8, 2009
Performance Pack TM Administration Guide Version R70 March 8, 2009 2003-2009 Check Point Software Technologies Ltd. All rights reserved. This product and related documentation are protected by copyright
More informationAn Extensible Message-Oriented Offload Model for High-Performance Applications
An Extensible Message-Oriented Offload Model for High-Performance Applications Patricia Gilfeather and Arthur B. Maccabe Scalable Systems Lab Department of Computer Science University of New Mexico pfeather@cs.unm.edu,
More informationComparing TCP performance of tunneled and non-tunneled traffic using OpenVPN. Berry Hoekstra Damir Musulin OS3 Supervisor: Jan Just Keijser Nikhef
Comparing TCP performance of tunneled and non-tunneled traffic using OpenVPN Berry Hoekstra Damir Musulin OS3 Supervisor: Jan Just Keijser Nikhef Outline Introduction Approach Research Results Conclusion
More informationAvid Configuration Guidelines Lenovo P520/P520C workstation Single 6 to 18 Core CPU System P520 P520C
Avid Configuration Guidelines Lenovo P520/P520C workstation Single 6 to 18 Core CPU System P520 P520C Page 1 of 14 Dave Pimm Avid Technology April 25, 2018 1.) Lenovo P520 & P520C AVID Qualified System
More informationPerformance Characteristics on Gigabit networks
Version 4.7 Impairment Emulator Software for IP Networks (IPv4 & IPv6) Performance Characteristics on Gigabit networks ZTI Communications / 1 rue Ampère / 22300 LANNION / France Phone: +33 2 9613 4003
More informationAvid Configuration Guidelines HP Z8 G4 workstation Dual 8 to 28 Core CPU System
Avid Configuration Guidelines HP Z8 G4 workstation Dual 8 to 28 Core CPU System Page 1 of 13 Dave Pimm Avid Technology April 23, 2018 1.) HP Z8 G4 AVID Qualified System Specification: Z8 G4 Hardware Configuration
More informationFPGA Augmented ASICs: The Time Has Come
FPGA Augmented ASICs: The Time Has Come David Riddoch Steve Pope Copyright 2012 Solarflare Communications, Inc. All Rights Reserved. Hardware acceleration is Niche (With the obvious exception of graphics
More informationTCP Tuning for the Web
TCP Tuning for the Web Jason Cook - @macros - jason@fastly.com Me Co-founder and Operations at Fastly Former Operations Engineer at Wikia Lots of Sysadmin and Linux consulting The Goal Make the best use
More informationAnalytics of Wide-Area Lustre Throughput Using LNet Routers
Analytics of Wide-Area Throughput Using LNet Routers Nagi Rao, Neena Imam, Jesse Hanley, Sarp Oral Oak Ridge National Laboratory User Group Conference LUG 2018 April 24-26, 2018 Argonne National Laboratory
More informationINT G bit TCP Offload Engine SOC
INT 10011 10 G bit TCP Offload Engine SOC Product brief, features and benefits summary: Highly customizable hardware IP block. Easily portable to ASIC flow, Xilinx/Altera FPGAs or Structured ASIC flow.
More informationPacketShader: A GPU-Accelerated Software Router
PacketShader: A GPU-Accelerated Software Router Sangjin Han In collaboration with: Keon Jang, KyoungSoo Park, Sue Moon Advanced Networking Lab, CS, KAIST Networked and Distributed Computing Systems Lab,
More informationPerformance Characteristics on Gigabit networks
Version 4.6 Impairment Emulator Software for IP Networks (IPv4 & IPv6) Performance Characteristics on Gigabit networks ZTI / 1 boulevard d'armor / BP 20254 / 22302 Lannion Cedex / France Phone: +33 2 9648
More informationThe Convergence of Storage and Server Virtualization Solarflare Communications, Inc.
The Convergence of Storage and Server Virtualization 2007 Solarflare Communications, Inc. About Solarflare Communications Privately-held, fabless semiconductor company. Founded 2001 Top tier investors:
More informationDXE-810S. Manual. 10 Gigabit PCI-EXPRESS-Express Ethernet Network Adapter V1.01
DXE-810S 10 Gigabit PCI-EXPRESS-Express Ethernet Network Adapter Manual V1.01 Table of Contents INTRODUCTION... 1 System Requirements... 1 Features... 1 INSTALLATION... 2 Unpack and Inspect... 2 Software
More informationClearStream. Prototyping 40 Gbps Transparent End-to-End Connectivity. Cosmin Dumitru! Ralph Koning! Cees de Laat! and many others (see posters)!
ClearStream Prototyping 40 Gbps Transparent End-to-End Connectivity Cosmin Dumitru! Ralph Koning! Cees de Laat! and many others (see posters)! University of Amsterdam! more data! Speed! Volume! Internet!
More informationLearning with Purpose
Network Measurement for 100Gbps Links Using Multicore Processors Xiaoban Wu, Dr. Peilong Li, Dr. Yongyi Ran, Prof. Yan Luo Department of Electrical and Computer Engineering University of Massachusetts
More informationHKG net_mdev: Fast-path userspace I/O. Ilias Apalodimas Mykyta Iziumtsev François-Frédéric Ozog
HKG18-110 net_mdev: Fast-path userspace I/O Ilias Apalodimas Mykyta Iziumtsev François-Frédéric Ozog Why userland I/O Time sensitive networking Developed mostly for Industrial IOT, automotive and audio/video
More informationAn Intelligent NIC Design Xin Song
2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016) An Intelligent NIC Design Xin Song School of Electronic and Information Engineering Tianjin Vocational
More informationCA AppLogic and Supermicro X9 Equipment Validation
CA AppLogic and Supermicro X9 Equipment Validation Opening Statement Last Updated August 8, 2013 The Boston X9 servers described below are compatible and validated with both CA AppLogic 3.1.14 Xen and
More informationRonald van der Pol
Ronald van der Pol Outline! Goal of this project! 40GE demonstration setup! Application description! Results! Conclusions Goal of the project! Optimize single server disk to network I/O!
More informationFreeBSD Network Performance Tuning
Sucon 2004 Zurich, Switzerland Hendrik Scholz hscholz@raisdorf.net http://www.wormulon.net/ Agenda Motivation Overview Optimization approaches sysctl() tuning Measurement NIC comparision Conclusion Motivation
More informationLinux Kernel Hacking Free Course
Linux Kernel Hacking Free Course 3 rd edition G.Grilli, University of me Tor Vergata IRQ DISTRIBUTION IN MULTIPROCESSOR SYSTEMS April 05, 2006 IRQ distribution in multiprocessor systems 1 Contents: What
More informationLow-Overhead Flash Disaggregation via NVMe-over-Fabrics Vijay Balakrishnan Memory Solutions Lab. Samsung Semiconductor, Inc.
Low-Overhead Flash Disaggregation via NVMe-over-Fabrics Vijay Balakrishnan Memory Solutions Lab. Samsung Semiconductor, Inc. 1 DISCLAIMER This presentation and/or accompanying oral statements by Samsung
More informationXilinx Answer QDMA Performance Report
Xilinx Answer 71453 QDMA Performance Report Important Note: This downloadable PDF of an Answer Record is provided to enhance its usability and readability. It is important to note that Answer Records are
More informationOn the cost of tunnel endpoint processing in overlay virtual networks
J. Weerasinghe; NVSDN2014, London; 8 th December 2014 On the cost of tunnel endpoint processing in overlay virtual networks J. Weerasinghe & F. Abel IBM Research Zurich Laboratory Outline Motivation Overlay
More informationAnalysis of CPU Pinning and Storage Configuration in 100 Gbps Network Data Transfer
Analysis of CPU Pinning and Storage Configuration in 100 Gbps Network Data Transfer International Center for Advanced Internet Research Northwestern University Se-young Yu Jim Chen, Joe Mambretti, Fei
More information5-Speed NBASE-T Network. Controller Card
5-Speed NBASE-T Network Controller Card User Manual Ver. 1.00 All brand names and trademarks are properties of their respective owners. Contents: Chapter 1: Introduction... 3 1.1 Product Introduction...
More informationPCIe 10G SFP+ Network Card
PCIe 10G SFP+ Network Card User Manual Ver. 1.00 All brand names and trademarks are properties of their respective owners. Contents: Chapter 1: Introduction... 3 1.1 Product Introduction... 3 1.2 Features...
More informationStacked Vlan: Performance Improvement and Challenges
Stacked Vlan: Performance Improvement and Challenges Toshiaki Makita NTT Tokyo, Japan makita.toshiaki@lab.ntt.co.jp Abstract IEEE 802.1ad vlan protocol type was introduced in kernel 3.10, which has encouraged
More informationPerformance Characteristics on Fast Ethernet and Gigabit networks
Version 2.5 Traffic Generator and Measurement Tool for IP Networks (IPv4 & IPv6) FTTx, LAN, MAN, WAN, WLAN, WWAN, Mobile, Satellite, PLC, etc Performance Characteristics on Fast Ethernet and Gigabit networks
More information