EVPath Performance Tests on the GTRI Parallel Software Testing and Evaluation Center (PASTEC) Cluster

Magdalena Slawinska, Greg Eisenhauer, Thomas M. Benson, Alan Nussbaum
College of Computing, Georgia Institute of Technology, Atlanta, Georgia, or Georgia Tech Research Institute, Atlanta, Georgia
April 9

Abstract

The document presents the performance results of running EVPath mtest/trans_test on the Georgia Tech Research Institute (GTRI) PASTEC (Parallel Software Testing and Evaluation Center) cluster [2] in a two-node setup over the EVPath sockets transport and in a single-node setup over the sockets and enet transports. In certain situations, the bandwidth obtained by EVPath was compared to the throughput achieved with the netperf benchmark [3]. The PASTEC cluster is owned by the School of Electrical and Computer Engineering at the Georgia Institute of Technology, and operated by the Sensors and Electromagnetics Laboratory of the GTRI.

I. EXPERIMENT DESCRIPTION

The experiments were run with EVPath trans_test and, for certain experiments, benchmarked with the independent tool netperf [3].

A. EVPath trans_test

The trans_test program is a part of the EVPath messaging constructor package and can be found at evpath-build/evpath/source/mtest/trans_test.c. It measures the one-way bandwidth delivered by a particular transport. At the beginning, a short start message is sent that initiates timing upon receipt. Next, the actual messages, whose size is specified by the -size parameter, are sent; each message is divided into vectors. After msg_count messages have been sent, the timing is terminated. The bandwidth is calculated from the number of bytes transferred according to the formula:

    bandwidth [Mbps] = size [bytes] × msg_count × 8 / time [sec].    (1)

The experiments were conducted in the following setups:
  - single node over the sockets transport (c4-xx, c3-xx)
  - single node over the enet transport (c4-xx, c3-xx)
  - two nodes over the sockets transport (c4-xx, c3-xx)

The c4-xx nodes are equipped with a 1 gigabit Ethernet NIC; the NICs of the c3-xx nodes are specified in Table I, and the kernel parameter values for the c3-xx nodes in Table II. The processors on c4-xx and c3-xx are different, an Intel(R) Xeon(R) X-series CPU and an Intel(R) Core(TM) i7-3960X CPU @ 3.30GHz, respectively. More detail in this regard is provided in Table III.

The EVPath enet transport is a reliable UDP transport based on the open source package ENet [1], which aims at providing a thin, yet robust network communication layer on top of UDP. The open source ENet package provides optional reliable, in-order delivery of packets. It does not support higher-level networking features such as authentication, encryption, etc.
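As a worked example of Formula (1), the short snippet below computes the delivered bandwidth for a hypothetical run (1 MiB messages, msg_count = 40,000, a measured time of 30 seconds). The numbers are illustrative only and do not come from the experiments in this report, and the division by 10^6 is added here to express the result in Mbps:

    # Hypothetical run: size = 1 MiB, msg_count = 40000, time = 30 s.
    awk 'BEGIN { size = 1048576; msg_count = 40000; time = 30;
                 printf "%.0f Mbps\n", size * msg_count * 8 / time / 1e6 }'
    # prints: 11185 Mbps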

TABLE I: NIC specifications on c3-xx nodes.

    Port    Bandwidth   Vendor and Model                               Driver Ver   Firmware Ver
    em1     1 GbE       Broadcom adapter; model BCM
    p72p1   10 GbE      Myricom adapter; model 10G-PCIE-8B
    p8p1    40 GbE      Mellanox ConnectX-3 (part no. MCX354-FBCT)

TABLE II: The kernel parameter values (via sysctl) on c3-xx nodes.

    Kernel Parameter          Value
    net.core.optmem_max
    net.core.rmem_default
    net.core.rmem_max
    net.core.wmem_default
    net.core.wmem_max
    net.ipv4.tcp_mem
    net.ipv4.tcp_rmem
    net.ipv4.tcp_wmem

TABLE III: The processor specifications for c4-xx and c3-xx nodes as reported by /proc/cpuinfo. The nodes were running in powersave mode using the intel_pstate driver. In this cpuinfo snapshot the CPU cores are downclocked, but they will have ramped up in clock frequency during the tests. Separate tests executed with iperf showed that switching the cores to performance mode did not make an appreciable difference; it previously helped to use performance mode with the older ACPI driver, but the intel_pstate driver seems to be more responsive in terms of dynamic clock adjustment.

    Characteristic     c4-xx                                c3-xx
    vendor_id          GenuineIntel                         GenuineIntel
    cpu family         6                                    6
    model
    model name         Intel(R) Xeon(R) CPU X               Intel(R) Core(TM) i7-3960X CPU @ 3.30GHz
    cpu MHz            (see the comment in the caption)     (see the comment in the caption)
    cache size         KB                                   KB
    siblings
    cpu cores          6                                    6
    cpuid level
    bogomips
    clflush size
    cache_alignment
    address sizes      40 bits physical, 48 bits virtual    46 bits physical, 48 bits virtual

All experiments with trans_test were conducted with the following command:

    ./trans_test -transport $TRANSPORT -size $BYTES -vectors 1 -msg_count $MSG_COUNT \
        -reuse_write_buffers 1 -take_receive_buffer 0 -timeout 60

The $TRANSPORT parameter was used to select the relevant transport, i.e., sockets or enet. The $BYTES parameter, which specifies the size of the message in bytes, varied from 1024 (1 KiB) to 16 MiB. The $MSG_COUNT parameter was determined experimentally: on c4-xx nodes it was selected so that a single test lasted about 30 seconds, while on c3-xx nodes it varied with the message size, yielding test durations of roughly 3 to 10 seconds.
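The report does not include the script that drove these runs; the sketch below shows how such a sweep over message sizes could be wrapped in a shell loop. It is only an illustration: the MSG_COUNT value is a placeholder, whereas in the experiments it was tuned per message size as described above.

    #!/bin/bash
    # Sketch of a single-node message-size sweep (not the authors' actual driver script).
    TRANSPORT=sockets
    for BYTES in 1024 4096 16384 65536 262144 1048576 4194304 16777216; do
        MSG_COUNT=1000   # placeholder; tuned per message size in the actual experiments
        ./trans_test -transport $TRANSPORT -size $BYTES -vectors 1 \
            -msg_count $MSG_COUNT -reuse_write_buffers 1 \
            -take_receive_buffer 0 -timeout 60
    done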

The two-node experiment was executed over ssh:

1) On c4-xx nodes an additional option, -ssh $NODE, was used to connect to the appropriate remote node; the CM_INTERFACE variable was not set, so trans_test selected the default NIC.

2) On c3-xx nodes the -n option was used for trans_test, because the -ssh option did not work due to the lack of reverse DNS for the NICs on c3-xx nodes; the -n option outputs the explicit contact information that needs to be provided to trans_test on the remote node as an input parameter:

    # ./trans_test client
    ./trans_test -transport $TRANSPORT -size $MSG_SIZE -vectors 1 \
        -msg_count $MSG_COUNT -reuse_write_buffers 1 -take_receive_buffer 0 \
        -timeout 60 -n

    # ./trans_test server on a remote node
    ./trans_test -transport $TRANSPORT -size $MSG_SIZE -vectors 1 -msg_count $MSG_COUNT \
        -reuse_write_buffers 1 -take_receive_buffer 0 -timeout 60 \
        -n AAIAAJTJ8o29ZQAAATkCmAILqMA=

3) The selection of a NIC was performed via the environment variable CM_INTERFACE, which preceded the trans_test command. Specifically, it was set to p8p1, p72p1, or em1, i.e., the interface names reported by, e.g., ifconfig -a on c3-xx. For instance:

    CM_INTERFACE=p8p1 ./trans_test -transport $TRANSPORT -size $MSG_SIZE -vectors 1 \
        -msg_count $MSG_COUNT -reuse_write_buffers 1 -take_receive_buffer 0 \
        -timeout 60 -n

Apart from the single-node experiment with netperf on a c3-xx node, each experiment was repeated five times, and the average was calculated and presented in the figures and tables.

B. Benchmark netperf

For runs on c3-xx nodes the independent benchmark netperf [3], a tool to measure socket throughput, was also executed. On a single node netperf was executed only once. Since netperf was installed in user space, the client-server version was used: netserver was executing on the server node and netperf was running on the client node. The netserver was run as:

    ./netserver

The netperf client was executed with the following command (the test_duration parameter indicates the intended duration of the test in seconds):

    ./netperf -H remote-hostname -l test_duration -- -m mesg_size

II. RESULTS

The results are presented in Fig. 1 and Fig. 2, for the single-node experiment and for the two-node experiment, respectively. Table IV and Table V show the performance comparison between EVPath and netperf for c3-xx nodes, for the single-node and the two-node experiment, respectively. It appears that in order to improve the performance results significantly, the network driver parameters need to be modified; specifically, an increase in the receive ring buffer size from the default 1024 to the maximum 8192 in the 40 GbE driver resulted in a performance boost from ~10 Gbps to roughly 20 Gbps for all test programs, i.e., netperf, iperf (the iperf results are not included in this study), and EVPath.
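The receive-ring-buffer change referred to above was made in the 40 GbE NIC driver settings; a minimal sketch of how such a change can be inspected and applied with ethtool is shown below, assuming the p8p1 interface name from Table I (the default of 1024 and the maximum of 8192 are the values reported above and are driver dependent):

    # Show the current and maximum RX/TX ring sizes of the 40 GbE interface.
    ethtool -g p8p1
    # Raise the receive ring from the default (1024 here) to the driver maximum (8192 here).
    # Requires root privileges.
    ethtool -G p8p1 rx 8192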

A. Single-Node Experiment Results

The single-node experiment over the EVPath sockets transport on c4-xx shows that the bandwidth reaches its greatest value, 45,692 Mbps, for the 1 MiB message size and then decreases to 26,000 Mbps for the 16 MiB message, a 43% decrease in comparison to the observed peak bandwidth. In the single-node experiment on c3-xx over the 1 GbE NIC (em1), EVPath achieves its highest bandwidth of 65,000 Mbps for the 1 MiB message size, and it likewise deteriorates by about 41%, to 38,500 Mbps, at the 16 MiB message size in comparison to the observed peak throughput.

Fig. 1: Single-node performance tests for PASTEC over the sockets and enet transports (bandwidth [Mbps] versus message size); averaged over 5 experiments, apart from netperf in Fig. 1a, which was executed only once per NIC and lasted 30 seconds; the EVPath message count varies so that each test lasts about 30 seconds. The c4-xx NIC is 1 GbE; on c3-xx, em1 is 1 GbE, p72p1 is 10 GbE, and p8p1 is 40 GbE.
  (a) A single node over the sockets transport (series: c4-05 EVPath; c3-00 em1/p72p1/p8p1 EVPath; c3-00 em1/p72p1/p8p1 netperf). For numerical values please refer to Table VI, Table VII, Table VIII, Table IX, and Table X. The netperf experiments were run only once per NIC and lasted 30 seconds.
  (b) A single node over the enet transport (series: c4-05 EVPath; c3-00 em1/p72p1/p8p1 EVPath). For numerical values please refer to Table XI, Table XII, Table XIII, and Table XIV.

The observed EVPath trends are consistent with the throughput reported by netperf, i.e., the greatest netperf bandwidth of 54,289 Mbps over em1 on c3-xx was observed for the 1 MiB message size, followed by a 17% bandwidth deterioration for the 16 MiB message. The single-node experiment over the enet transport demonstrates, to some extent, a trend similar to the single-node sockets experiment. However, on c4-xx the highest bandwidth, 146 Mbps, is achieved at the 256 KiB message size, and it then decreases to about 116 Mbps for 16 MiB.

TABLE IV: Average EVPath bandwidth versus netperf bandwidth for a single c3-xx node over 1 GbE, 10 GbE, and 40 GbE. The percentage is calculated according to the formula: (B_EVPath / B_netperf) × 100.

    Message     40 GbE                          10 GbE                          1 GbE
    size        EVPath     netperf     [%]      EVPath     netperf     [%]      EVPath     netperf     [%]
                [Mbps]     [Mbps]               [Mbps]     [Mbps]               [Mbps]     [Mbps]
    1KiB        4,138      12,...               ...,165    13,...               ...,159    13,...
    4KiB        9,055      30,...               ...,066    31,...               ...,058    31,...
    16KiB       30,526     38,...               ...,354    43,...               ...,333    38,...
    64KiB       44,279     41,...               ...,174    43,...               ...,060    42,...
    256KiB      58,724     52,...               ...,074    52,...               ...,249    51,...
    1MiB        64,704     61,...               ...,967    54,...               ...,999    54,...
    4MiB        61,006     51,...               ...,315    51,...               ...,340    50,...
    16MiB       38,561     45,...               ...,505    45,...               ...,516    45,...

On the c3-xx node, for all tested NICs, the enet bandwidth peaked at about 143 Mbps for the 1 MiB message (for 256 KiB it was very close, at about 142 Mbps), and then deteriorated to about 136 Mbps for 16 MiB. The bandwidth deterioration at the 16 MiB message size on the single node over the enet transport is not as dramatic as in the case of the single-node sockets experiment: it is about 20% in comparison to the highest bandwidth observed over the 1 GbE NIC on c4-xx, and 5% compared to the highest bandwidth achieved over 1 GbE, 10 GbE, and 40 GbE on c3-xx.

The bandwidth deterioration is surprising; the anticipated behavior is that the bandwidth should saturate at a certain message size and remain constant as the message size increases. The observed deterioration might result from the fact that, below a certain message size, the measured bandwidth exceeds the maximum memory bandwidth of the chips, which would seem to indicate that some of the data stays within the cache hierarchy and thus some DRAM operations are avoided altogether. Once the data can no longer live within the L3 cache, performance becomes limited by the maximum bandwidth to memory.

The single node over the sockets transport achieves circa 313x greater bandwidth than the single node over enet on c4-xx, i.e., 45,692 Mbps vs. 146 Mbps, and circa 456x on c3-xx, 65,000 Mbps vs. 143 Mbps.

On a single c3-xx node the choice of CM_INTERFACE does not have any impact, as the results obtained for em1, p72p1, and p8p1 are very similar. This is also confirmed by netperf: the obtained throughput does not depend on the selected NIC. The throughput achieved over p8p1 for 1 MiB by netperf is about 14% better than over p72p1 or em1; however, it has to be taken into account that there was only one sample of the netperf run. On a single c3-xx node, the ratio of EVPath to netperf performance for the 1 KiB-64 KiB and 256 KiB-4 MiB message sizes can be read from Table IV; for 16 MiB, EVPath reaches about 85% of the netperf performance (Table IV).

The duration of the tests has an impact on the bandwidth achieved by netperf. In general, netperf reports better bandwidth for the default 10-second test run than for a 30-second run. The netperf tests lasted 30 seconds each, similarly to the EVPath trans_test runs in this setup; although it is not possible to configure EVPath to last precisely 30 seconds, the execution time was chosen empirically, by manipulating the message count parameter for a particular message size, so that each test lasted about 30 seconds. This might be one of the reasons why netperf reports higher bandwidths.

The higher single-node performance of the c3-xx nodes may simply be due to their newer CPUs (see Table III). The CPUs in c3-xx have a significantly higher maximum memory bandwidth than those in c4-xx. The memory speed itself should not be a factor, as the c4-xx and c3-xx memory is nominally the same speed (DDR3 1333MHz), although from different manufacturers. In fact, we believe that the peak bandwidth results exceed the maximum memory bandwidth of those chips, which would seem to indicate that some of the data stays within the cache hierarchy and thus some DRAM operations are avoided altogether. As stated earlier, this might also explain the degradation in results above a certain message size: once the data can no longer live within the L3 cache, performance becomes limited by the maximum bandwidth to memory.
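As an illustration of the comparison formula used in Table IV (and later in Table V), (B_EVPath / B_netperf) × 100, the snippet below computes the percentage for one pair of example values; the numbers are illustrative and are not taken from the tables:

    # Percentage formula from Tables IV and V: (B_EVPath / B_netperf) * 100.
    awk 'BEGIN { b_evpath = 64704; b_netperf = 61704;   # example values, not table data
                 printf "%.1f%%\n", b_evpath / b_netperf * 100 }'
    # prints: 104.9%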

B. Two-Node Experiment Results

Fig. 2: Performance tests for two nodes (c3-01->c3-00 or c3-00->c3-01) for PASTEC over the sockets transport (bandwidth [Mbps] versus message size); averaged over 5 experiments; the duration of the EVPath tests varies, up to about 20 seconds for em1 and 15 seconds for p8p1 and p72p1; the duration of the netperf tests is 3 seconds or 10 seconds.
  (a) Two nodes over the sockets transport over the 1 GbE Ethernet interface (series: c4-05->c4-09 EVPath; c3-01->c3-00 em1 EVPath; c3-01->c3-00 em1 netperf). For numerical values please refer to Table XV, Table XVI, and Table XVII.
  (b) Two nodes over the sockets transport over the p8p1 and p72p1 interfaces (series: c3-00->c3-01 p8p1 EVPath; c3-01->c3-00 p72p1 EVPath; c3-00->c3-01 p8p1 netperf; c3-01->c3-00 p72p1 netperf). For numerical values please refer to Table XVIII, Table XIX, Table XX, and Table XXI.

For 40 GbE the test direction had an impact that was demonstrated by both test programs: the bandwidth achieved in the c3-00->c3-01 direction was twice as high as for c3-01->c3-00. In this report only the c3-00->c3-01 results over 40 GbE are included.

The sockets transport for c4-xx in the two-node setup is stable for all message sizes at about 941 Mbps.

TABLE V: The average EVPath performance versus the average netperf performance in the two-node setup over 1 GbE, 10 GbE, and 40 GbE; c3-00->c3-01. The percentage is calculated according to the formula: (B_EVPath / B_netperf) × 100.

    Message     40 GbE                          10 GbE                          1 GbE
    size        EVPath     netperf     [%]      EVPath     netperf     [%]      EVPath     netperf     [%]
                [Mbps]     [Mbps]               [Mbps]     [Mbps]               [Mbps]     [Mbps]
    1KiB        2,250      8,...                ...,355    9,...
    4KiB        4,738      12,...               ...,409    9,...
    16KiB       10,923     21,...               ...,871    9,...
    64KiB       15,445     21,...               ...,902    9,...
    256KiB      20,698     21,...               ...,868    9,...
    1MiB        19,047     19,...               ...,714    9,...
    4MiB        18,758     19,...               ...,647    9,...
    16MiB       17,891     19,...               ...,556    9,...

For c3-xx nodes, the observed peak bandwidths were 884 Mbps, 9,902 Mbps, and 20,698 Mbps for 1 GbE (em1), 10 GbE (p72p1), and 40 GbE (p8p1), respectively. The results obtained for c3-xx were consistent with the netperf reports of 851 Mbps, 9,901 Mbps, and 21,653 Mbps over em1, p72p1, and p8p1, respectively.

There is about a 100 Mbps difference in the achieved bandwidth between the 1 GbE NICs on c4-xx and c3-xx. The lower performance over 1 GbE is likely due to a lower-quality Broadcom-based NIC in the c3-xx nodes relative to the Intel-chipset-based NIC in c4-xx. The c3-xx nodes use a consumer ASRock motherboard, whereas the c4-xx nodes use server-class Supermicro boards.

For c3-xx nodes, the bandwidth over the 1 GbE NIC deteriorates from 884 Mbps for the 1 KiB message size to 840 Mbps at the 16 KiB message size, and increases again to 849 Mbps for the 16 MiB message size. This behavior is similar to the one reported by netperf. However, we have observed the opposite behavior over the 10 GbE and 40 GbE c3-xx NICs. The bandwidth reported by EVPath increases up to 9,800 Mbps for 10 GbE and stabilizes at the 16 KiB message size, while the netperf performance remains above 9,000 Mbps through the entire message size range. For 40 GbE, netperf demonstrates behavior similar to EVPath, although it achieves its close-to-maximum bandwidth already at the 16 KiB message size and remains in the range 19,000-22,000 Mbps, whereas the EVPath bandwidth stabilizes at the 256 KiB message size and remains in the range from about 18,000 Mbps to less than 21,000 Mbps. Each of the test programs achieves at most about 50% of the available bandwidth over 40 GbE.

The EVPath throughput is 94-98% of the netperf performance starting from the 256 KiB message size (Table V). For c3-xx, the bandwidth over the 10 GbE NIC reported by EVPath is 24% and 55% of the netperf performance for message sizes 1 KiB and 4 KiB, respectively, and roughly on par with netperf for 16 KiB-16 MiB (Table V). For c3-xx, the throughputs over the 1 GbE NIC reported by netperf and EVPath are almost identical, apart from message sizes 1 KiB and 4 KiB, for which EVPath achieves 4+% and 3% better bandwidth than netperf, respectively (Table V).

III. ACKNOWLEDGMENTS

This work was supported in part by the GTRI Independent Research and Development (IRAD) program under the contract I , GTRI-SEAL-S2APO-DO-ATL ( A).

REFERENCES

[1] ENet website. March.
[2] Parallel Software Testing and Evaluation Center at Georgia Tech Research Institute. March.
[3] Rick Jones. Netperf Homepage. March.

IV. APPENDIX

Tables with numerical values.

TABLE VI: The EVPath experiment summary for a single node (c4-05); averaged over 5 samples; sockets transport.

TABLE VII: The EVPath experiment summary for a single node (c3-00); averaged over 5 samples; sockets transport; interface em1.

TABLE VIII: The EVPath experiment summary for a single node (c3-00); averaged over 5 samples; sockets transport; interface p72p1.

TABLE IX: The EVPath experiment summary for a single node (c3-00); averaged over 5 samples; sockets transport; interface p8p1.

TABLE X: The experiment summary for a single node (c3-00); only one sample for each NIC; sockets transport; netperf results; each test ran for 30 sec; the receive and send socket sizes were determined automatically by netperf.

    Message size   Bandwidth em1 [10^6 bits/sec]   Bandwidth p72p1 [10^6 bits/sec]   Bandwidth p8p1 [10^6 bits/sec]
    1KiB           13,429                          13,434                            12,999
    4KiB           31,302                          31,549                            30,609
    16KiB          38,213                          43,558                            38,164
    64KiB          42,095                          43,489                            41,...
    256KiB         51,793                          52,255                            52,338
    1MiB           54,289                          54,256                            61,704
    4MiB           50,704                          51,000                            51,591
    16MiB          45,136                          45,710                            45,597

TABLE XI: The EVPath experiment summary for a single node (c4-05); averaged over 5 samples; enet transport.

TABLE XII: The EVPath experiment summary for a single node (c3-00); averaged over 5 samples; enet transport; interface em1.

TABLE XIII: The EVPath experiment summary for a single node (c3-00); averaged over 5 samples; enet transport; interface p72p1.

TABLE XIV: The EVPath experiment summary for a single node (c3-00); averaged over 5 samples; enet transport; interface p8p1.

TABLE XV: The EVPath experiment summary for two nodes; the test started at c4-05 and ssh-ed to c4-09; averaged over 5 samples; sockets transport.

TABLE XVI: The EVPath experiment summary for two nodes (c3-01 -> c3-00); averaged over 5 samples; sockets transport; interface em1.

TABLE XVII: The netperf experiment summary for two nodes (c3-01 -> c3-00); averaged over 5 samples; the receive and send socket sizes were determined automatically by netperf; sockets transport; interface em1. (Columns: Message size, Time [secs], Bandwidth [Mbps], Std. dev. σ, σ/Bandwidth.)

TABLE XVIII: The EVPath experiment summary for two nodes (c3-01 -> c3-00); averaged over 5 samples; sockets transport; interface p72p1.

TABLE XIX: The netperf experiment summary for two nodes (c3-01 -> c3-00); averaged over 5 samples; the receive and send socket sizes were determined automatically by netperf; sockets transport; interface p72p1. (Columns: Message size, Time [secs], Bandwidth [Mbps], Std. dev. σ, σ/Bandwidth.)

TABLE XX: The EVPath experiment summary for two nodes (c3-00 -> c3-01); averaged over 5 samples; sockets transport; interface p8p1.

TABLE XXI: The netperf experiment summary for two nodes (c3-00 -> c3-01); averaged over 5 samples; the receive and send socket sizes were determined automatically by netperf; sockets transport; interface p8p1. (Columns: Message size, Time [secs], Bandwidth [Mbps], Std. dev. σ, σ/Bandwidth.)
