Data transfer over the wide area network with a large round trip time


17th International Conference on Computing in High Energy and Nuclear Physics (CHEP09)
Journal of Physics: Conference Series 219 (2010) 062056, doi:10.1088/1742-6596/219/6/062056

Data transfer over the wide area network with a large round trip time

H Matsunaga, T Isobe, T Mashimo, H Sakamoto and I Ueda
International Center for Elementary Particle Physics, the University of Tokyo, Tokyo 113-0033, Japan
E-mail: matunaga@icepp.s.u-tokyo.ac.jp

Abstract. A Tier-2 regional center is running at the University of Tokyo in Japan. It receives a large amount of ATLAS experiment data from the Tier-1 center in France. Although the link between the two centers has 10Gbps bandwidth, it is not a dedicated link but is shared with other traffic, and the round trip time is 290ms. It is not easy to exploit the available bandwidth of such a link, a so-called long fat network. We performed data transfer tests with GridFTP in various combinations of parameters, such as the number of parallel streams and the TCP window size. In addition, we have gained experience with the actual data transfer in our production system, where the Disk Pool Manager (DPM) is used as the Storage Element and the data transfer is controlled by the File Transfer Service (FTS). We report the results of the tests and the daily activity, and discuss how the data transfer throughput can be improved.

1. Introduction
In order to analyze the large amount of data produced by the experiments at the Large Hadron Collider (LHC) at CERN, the Worldwide LHC Computing Grid (WLCG) has been created to allow the data analysis to be performed at distributed computing centers around the world, based on a data Grid environment. The International Center for Elementary Particle Physics (ICEPP) of the University of Tokyo is one of the ATLAS collaborating institutes and operates a Tier-2 regional center with the aim of meeting the demands of the Japanese physicists in ATLAS.

In the ATLAS computing model, a Tier-2 site plays a role in user analysis as well as in Monte Carlo simulation production. The data transfer activity at a Tier-2 site is dominated by the replication of (real or simulated) data from the Tier-1 for user analysis, while for simulation production the transfer rate is much lower. Fast data transfer is therefore essential for efficient data analysis. Each ATLAS Tier-2 site is associated with only one Tier-1 site, which is usually geographically close (e.g. in the same country or region), but in the case of the ICEPP Tier-2, the Tier-1 site is CC-IN2P3 in Lyon, France, which is very far from Tokyo. It is well known that it is difficult to achieve a high data transfer rate with the Transmission Control Protocol (TCP) over a long-distance network with a large bandwidth.

The aim of this paper is to study the performance of the network between Japan and Europe, which has a long latency and a large bandwidth, and to understand the current limitations and bottlenecks. We present the results of data transfer tests under varying conditions as well as real-life experience with the Tier-2 production system, and discuss possible improvements.

2. Data transfer over WAN
The standard tool for data transfer between Grid sites, or for downloading from a Grid site, is GridFTP [1]. GridFTP is included in most storage management systems, such as the Disk Pool Manager (DPM) [2] or dCache [3]. It uses TCP as the transport layer protocol. In TCP, the data transfer rate is roughly given by window size / RTT, where RTT is the round trip time. Therefore, to maximize the data transfer rate, one should increase the window size, the number of parallel streams, or the number of concurrently transferred files.

As for the network link, the ICEPP Tier-2 center connects to SINET3 [4], the Japanese National Research and Education Network (NREN), through the University router. SINET3 connects to GEANT2 [5], the European academic network, at MANLAN (Manhattan Landing) in New York City. Although the path is shared with other traffic, the bandwidth is 10Gbps from the ICEPP site to the GEANT2 network. Furthermore, RENATER [6], the French NREN, provides a 10Gbps link from GEANT2 to CC-IN2P3, hence 10Gbps is available over the whole path between ICEPP and CC-IN2P3. The RTT of the path is 290ms, so the Bandwidth-Delay Product (BDP) is 10Gbps x 290ms, about 360MB, which is the window size needed to fully use the 10Gbps bandwidth with a single TCP stream.

3. Test setup
We have set up Linux PCs at CERN and ICEPP. The CERN-ICEPP route differs slightly within Europe from the CC-IN2P3-ICEPP route, and the RTTs are almost the same. The CERN-ICEPP route is also important for us because many Japanese physicists stationed at CERN copy data between the ICEPP regional center and their local resources at CERN. At CERN, the traffic between the test PCs goes through the High Throughput Access Route (HTAR) to bypass the CERN firewall. The HTAR bandwidth is limited to 1Gbps.

The Linux PC at CERN is a dual-CPU Xeon L5400-series server with 32GB of RAM and an Intel 10GbE Network Interface Card (NIC), running SLC 4.7 x86_64. The data area is provided by a hardware RAID (3ware). At ICEPP, the Linux PC is a dual-CPU Xeon 5100-series server with 8GB of RAM and a Chelsio 10GbE NIC, running SLC 4.7 x86_64. The data disk is an external RAID (Infortrend) attached via 4Gb Fibre Channel. This server is the same as the disk servers used at the Tier-2 site. The Linux kernel is the 2.6.9-78.0.13.EL.cernsmp kernel included in SLC 4.7, but for the PC at CERN a vanilla 2.6.27 kernel is also tried because it has improved network code, in particular the CUBIC TCP implementation, in addition to BIC TCP, which is the default in the 2.6.9 kernel. The following parameters are set for the Linux kernel and the NIC:

net.ipv4.tcp_sack=0
net.ipv4.tcp_dsack=0
net.ipv4.tcp_timestamps=0
net.ipv4.tcp_no_metrics_save=1
net.ipv4.tcp_rmem=4096 87380 16777216
net.ipv4.tcp_wmem=4096 87380 16777216
net.core.rmem_max=16777216
net.core.wmem_max=16777216
net.core.netdev_max_backlog=10000
txqueuelen=10000

The first three parameters control TCP extensions (selective acknowledgements and timestamps). Even when they are enabled, no performance improvement is expected in many cases, so we disable them in this test. tcp_no_metrics_save is enabled so that parameters from previous TCP connections are not cached. The maximum TCP window size is enlarged to 16MiB. For the data area we use XFS as the filesystem. The disk access speed depends on the kernel version, but we observe more than 100MB/s for both reading and writing on the CERN server, and roughly 200MB/s for writing and 100MB/s for reading on the ICEPP server. In the following tests, the sender node is always at CERN, while the receiver node is at ICEPP.
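As a rough cross-check of the figures above, the following short Python sketch (not part of the original study) applies the window/RTT estimate from section 2 to the window sizes and stream counts used in the tests; the 290ms RTT, the 1Gbps HTAR bottleneck and the 10Gbps path bandwidth are taken from the text, everything else is illustrative.

# Rough cross-check of the window/RTT throughput estimate: a minimal sketch,
# not part of the original study.
RTT = 0.290                  # round trip time in seconds (Tokyo <-> Europe)
HTAR_LIMIT = 1e9 / 8         # 1Gbps bottleneck on the CERN test route, bytes/s
MiB = 1024 * 1024

def expected_rate(window_bytes, n_streams, rtt=RTT, limit=HTAR_LIMIT):
    """Steady-state TCP throughput estimate: total window / RTT, capped by the link."""
    return min(n_streams * window_bytes / rtt, limit)

for window in (2 * MiB, 4 * MiB, 8 * MiB, 16 * MiB):
    for streams in (1, 2, 4, 8):
        rate = expected_rate(window, streams)
        print(f"window {window // MiB:2d} MiB, {streams} stream(s): {rate / 1e6:6.1f} MB/s")

# Bandwidth-delay product of the full 10Gbps ICEPP - CC-IN2P3 path:
print(f"BDP = {10e9 / 8 * RTT / 1e6:.0f} MB")   # about 360 MB for a single stream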

4. Iperf test
Before performing the disk-to-disk transfer tests, we measure the network throughput in memory-to-memory mode using the iperf [7] program with TCP, in order to check the pure network condition without disk IO. Figure 1 shows the network throughput from CERN to ICEPP measured with iperf for various numbers of parallel streams (1, 2, 4 and 8) and window sizes (2, 4, 8 and 16MiB). To obtain these results we measure multiple times and then take an average, disregarding a small number (a fixed fraction) of the worst measurements, which are most likely due to accidental congestion caused by other traffic. As can be seen, the measured throughput is proportional to the number of streams or the window size below about 100MB/s, and the 1Gbps (125MB/s) limit is reached when the window size and/or the number of streams are large enough. We see no difference between the kernel versions (2.6.9 and 2.6.27) in the iperf results.

In Figure 2, the transfer rate with one stream is shown as a function of time. It is measured by running tcpdump on the receiver node, and t = 0 is defined as the arrival of the first data packet. We see the slow-start phase of TCP during the first several seconds, followed by a constant rate with small fluctuations in the congestion avoidance phase.

Figure 1. Network throughput with iperf from CERN to ICEPP, for varying window sizes and numbers of streams.

Figure 2. One-stream throughput as a function of time. The sender node at CERN runs kernel 2.6.27.
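The averaging procedure described above (repeat the measurement and drop a fixed fraction of the worst results) can be written down compactly. The sketch below is only an illustration: the discard fraction and the sample values are assumptions, not numbers from the paper.

def trimmed_average(throughputs_mb_s, discard_fraction=0.2):
    """Average after discarding the lowest `discard_fraction` of the measurements."""
    ordered = sorted(throughputs_mb_s)              # lowest (worst) values first
    n_drop = int(len(ordered) * discard_fraction)   # number of measurements to discard
    kept = ordered[n_drop:] or ordered              # always keep at least one value
    return sum(kept) / len(kept)

# Hypothetical repeated iperf results (MB/s) for one window/stream configuration;
# one run happened to hit cross traffic and is dropped by the trimming.
samples = [112.0, 118.5, 120.1, 64.3, 119.7]
print(f"trimmed average: {trimmed_average(samples):.1f} MB/s")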

5. GridFTP test
Data transfer from disk to disk between the remote hosts is carried out to simulate the actual use case. We use two versions of GridFTP, from the Globus Toolkit 3.2.1 and 4.2.1: a GridFTP server (version 1.17 or 3.15) runs on the receiver node at ICEPP, and the client, globus-url-copy (version 3.6 or 4.14), is invoked on the sender node at CERN. GSI authentication is used, but the transfer rate is measured only after the authentication phase. The file size is 4GB in most cases and 1GB in some slow cases.

Figure 3 shows the data transfer rates from CERN to ICEPP for Globus Toolkit 3.2.1 or 4.2.1 and Linux kernel 2.6.9 or 2.6.27 at CERN. The number of parallel streams and the TCP window size, which are given as command-line options of globus-url-copy, are also varied as in the iperf tests. Compared with the iperf results, the throughput is clearly worse for the higher-rate points, probably because the disk IO speed is limited (in particular when reading from the slow disk of the CERN server) even with multiple streams, or because the disk speed is not constant. The newer Linux kernel improves the transfer rates to some extent, and with this kernel there is little difference between the GridFTP versions. On the other hand, with the 2.6.9 kernel better performance is seen with the newer GridFTP. Overall, the throughput is limited to around 100MB/s by the slower disk IO on the sender node.

Figure 3. Results of data transfer with GridFTP from CERN to ICEPP, for varying window sizes and numbers of streams. The Linux kernel is 2.6.9 (left) or 2.6.27 (right), and the Globus Toolkit version is 3.2.1 (top) or 4.2.1 (bottom).

Figure 4 shows the throughput of a file transfer with a single stream as a function of time. In this case one can see a drop in the rate in the middle of the transfer, which is soon recovered in the same way as at the start-up of the transfer. The rate fluctuation is larger than in the iperf result, even in the constant phase. Results for a multi-stream transfer are shown in Figure 5. In this measurement most of the 8 streams are well balanced, but the aggregated rate is more unstable than in the one-stream case due to the heavier load on the disk IO. Interestingly, we occasionally see a slowdown at about 10 seconds after the start of the data transfer, which may be caused by the characteristics of the disk (RAID) system.

Figure 4. Throughput vs. time for a single-stream transfer (1 stream, 8MiB window size, kernel 2.6.9, Globus Toolkit 4.2.1). A packet loss occurs during the data transfer.

Figure 5. Data transfer rates per stream in a single file transfer (8 streams, 4MiB window size, kernel 2.6.27, Globus Toolkit 4.2.1). The rate of each stream and their sum scaled by 0.5 are shown.

Figure 6 shows the results for parallel file transfers with the same configuration. For this measurement all files reside in the same filesystem on both the sender and receiver nodes. With 2 or 4 concurrent files the total throughput is nearly 100MB/s, which appears to be limited by the local disk read, and only a small performance difference is seen between the Linux kernel versions.
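A scan like the one in Figure 3 can be scripted around globus-url-copy, whose -p and -tcp-bs options set the number of parallel streams and the requested TCP buffer size. The sketch below is only an illustration of such a driver, not the script used for these measurements: the host name, paths and file size are placeholders, and it assumes a valid Grid proxy and a reachable GridFTP server.

import subprocess
import time

SRC = "file:///data/testfile"                        # local file on the sender node (placeholder)
DST = "gsiftp://receiver.example.org/data/testfile"  # GridFTP server at the receiver (placeholder)
FILE_SIZE_MB = 4096                                  # placeholder file size (4GB)
MiB = 1024 * 1024

for window in (2 * MiB, 4 * MiB, 8 * MiB, 16 * MiB):
    for streams in (1, 2, 4, 8):
        cmd = ["globus-url-copy",
               "-p", str(streams),        # number of parallel TCP streams
               "-tcp-bs", str(window),    # requested TCP buffer (window) size, bytes
               SRC, DST]
        start = time.time()
        subprocess.run(cmd, check=True)   # transfer one file with this setting
        elapsed = time.time() - start
        print(f"window {window // MiB:2d} MiB, {streams} stream(s): "
              f"{FILE_SIZE_MB / elapsed:.1f} MB/s")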

Figure 6. Results for multiple file transfers using Linux kernel 2.6.9 or 2.6.27 on the sender node (32MiB window size, 8 streams, Globus Toolkit 4.2.1).

6. Production system
At the ICEPP Tier-2 site, DPM is used as the Storage Element. In the current configuration it consists of one head node and 13 disk servers. The head node runs several services which manage the name space and the disk pools with a MySQL database backend, while the actual data transfer is performed by the disk servers running the GridFTP server. The hardware and the operating system are the same as for the ICEPP test server described above, but some parameters are different; in particular, the maximum window size is 2MiB. The GridFTP software is provided by DPM, and the version of the GridFTP server is 2.3 (originally included in Globus Toolkit 4.0.3). Five external RAID boxes are attached to each disk server: 2 RAIDs are connected to one 4Gb Fibre Channel port and the other 3 RAIDs to another port. One XFS filesystem (6TB) is created on each RAID array.

In ATLAS, data transfer is managed by the Distributed Data Management (DDM) [8] system, which uses the File Transfer Service (FTS) [9] for the bulk data transfer and registers the files in the catalogs. FTS controls the file transfers by using GridFTP third-party transfers between the Grid sites. In our usual case of data transfer between the CC-IN2P3 and ICEPP sites, the DDM services run at CERN, while FTS and the LCG File Catalog (LFC) [10] are operated at CC-IN2P3. Therefore, compared with the test conditions in the previous sections, the efficiency of the data transfer is lowered by the overhead of the DDM, FTS and DPM services, and also by the GSI authentication for each file transfer. With FTS, a channel is established between the Storage Elements of the remote sites, and for each channel one can set the number of concurrent files and the number of GridFTP streams. Our current settings were determined by a rough optimization and use 10 GridFTP streams per file transfer.

Figure 7 shows a snapshot of the aggregated data traffic measured at the disk servers of the ICEPP Tier-2. At that time there were 6 disk servers at ICEPP and more than 30 dCache disk servers at CC-IN2P3. A peak rate of 500MB/s was observed when large files (3.5GB each) were being transferred and other activities were low at both sites. The data transfer rate depends largely on the ATLAS and WLCG activities; it is bursty rather than constant, and a bulk transfer usually lasts from some minutes to several hours. As of this writing, we have observed 500MB/s data transfers from CC-IN2P3 several times.

Figure 7. Data transfer rate from CC-IN2P3 to ICEPP in the production system. A peak rate of 500MB/s was observed.
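As a back-of-the-envelope check of the production numbers, the sketch below combines the per-stream window/RTT estimate with the channel settings; the 2MiB window, 10 streams and 290ms RTT are from the text, while the efficiency factor and the target rate are illustrative assumptions rather than measured values.

RTT = 0.290                # CC-IN2P3 -> ICEPP round trip time, seconds
WINDOW = 2 * 1024 * 1024   # maximum TCP window on the production disk servers, bytes
STREAMS = 10               # GridFTP streams per file (FTS channel setting)

per_file_ceiling = STREAMS * WINDOW / RTT   # about 72 MB/s if TCP were the only limit
print(f"per-file ceiling: {per_file_ceiling / 1e6:.0f} MB/s")

TARGET = 500e6             # observed peak aggregate rate, bytes per second
EFFICIENCY = 0.5           # assumed fraction of the ceiling achieved per file in practice
files_needed = TARGET / (EFFICIENCY * per_file_ceiling)
print(f"concurrent files needed for {TARGET / 1e6:.0f} MB/s: {files_needed:.0f}")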

7. Conclusions
For the ICEPP Tier-2 site in Tokyo, data transfer is a critical issue because the site mostly receives data from the CC-IN2P3 Tier-1 site in Lyon, France, and also copies data from and to the local resources at CERN. The connection from ICEPP in Tokyo to Europe is not a private or dedicated network, and the bandwidth is 10Gbps with a large RTT (290ms to CC-IN2P3 or CERN). We have tested the network performance between ICEPP and CERN (1Gbps) with and without disk access using test PCs. In the memory-to-memory test, the throughput scales with the window size and the number of streams, and can easily reach the 1Gbps limit with modern hardware and software. In the disk-to-disk test, however, it is difficult to achieve a similar throughput even with multiple streams and multiple files. Judging from the obtained results, faster disk access and/or parallel transfers across different hardware will be very important for good performance. Comparing Linux kernel versions 2.6.9 and 2.6.27, the newer kernel leads to better performance; the reason may be improvements in the network stack, in the local disk access, or both. Concerning the GridFTP version, the difference is smaller than that between kernel versions, but some improvement can be expected.

It has also been demonstrated that the data transfer rate between ICEPP and CC-IN2P3 can exceed 500MB/s with many servers in the production system. This could be increased further with new software and hardware as well as system tuning, but it is still unclear whether the connection has a margin of bandwidth, especially in France, because CC-IN2P3 also sends data to the other French Tier-2 sites (possibly simultaneously) and a part of the path is shared with us. The optimization of system parameters such as the window size should be performed carefully, because the disk servers also serve data to many LAN clients in addition to the WAN data transfer. We will study this LAN access in the future to find the best settings for our DPM disk servers.

Acknowledgments
We would like to thank the National Institute of Informatics (NII), the Information Technology Center of the University of Tokyo, and the Computing Research Center of the High Energy Accelerator Research Organization (KEK) for setting up and managing the network infrastructure. Thanks go to J. Tanaka (ICEPP) for his help in setting up the test PCs and the HTAR route at CERN. We are also grateful to the CC-IN2P3 staff for their cooperation and support for the data transfer in the production system.

References
[1] Globus GridFTP. http://www.globus.org/toolkit/docs/latest-stable/data/gridftp/
[2] Disk Pool Manager. https://twiki.cern.ch/twiki/bin/view/lcg/dpmgeneraldescription
[3] dCache. http://www.dcache.org/
[4] SINET3. http://www.sinet.ad.jp/
[5] GEANT2. http://www.geant2.net/
[6] RENATER. http://www.renater.fr/
[7] iperf. http://sourceforge.net/projects/iperf
[8] Miguel Branco et al., J. Phys.: Conf. Ser. 119, 062017 (2008)
[9] File Transfer Service. https://twiki.cern.ch/twiki/bin/view/egee/fts
[10] LCG File Catalog. https://twiki.cern.ch/twiki/bin/view/lcg/lfcgeneraldescription