Disk-to-Disk network transfers at 100 Gb/s

Journal of Physics: Conference Series 396 (2012) 042006

Artur Barczyk α, Ian Gable β, Marilyn Hay γ, Colin Leavett-Brown β, Iosif Legrand α, Kim Lewall β, Shawn McKee δ, Donald McWilliam γ, Azher Mughal α, Harvey Newman α, Sandor Rozsa α, Yvan Savard β, Randall J. Sobie β, Thomas Tam ɛ, Ramiro Voicu α

α: California Institute of Technology, Pasadena, CA, USA
β: University of Victoria, BC, Canada
γ: BCNET, Vancouver, BC, Canada
δ: University of Michigan, Ann Arbor, MI, USA
ɛ: CANARIE Inc, Ottawa, ON, Canada

E-mail: newman@hep.caltech.edu, rsobie@uvic.ca

Abstract. A 100 Gbps network was established between the California Institute of Technology conference booth at the Super Computing 2011 conference in Seattle, Washington, and the computing centre at the University of Victoria in Canada. A circuit was established over the BCNET, CANARIE and Super Computing (SCinet) networks using dedicated equipment. The small set of servers at the endpoints used a combination of 10GE and 40GE technologies, and SSD drives for data storage. The configuration of the network and of the servers is discussed. We show that the system was able to achieve disk-to-disk transfer rates of 60 Gbps and memory-to-memory rates in excess of 180 Gbps across the WAN. We also discuss the transfer tools, disk configurations and monitoring tools used in the demonstration.

1. Introduction

The ATLAS [1] and CMS [2] experiments located at the LHC [3] have accumulated in excess of 100 Petabytes of data since 2010. The analysis of the data from these experiments follows the LHC Computing Model, which was initially based on a rigid hierarchical structure in which Tier 2 centres exchange traffic primarily with their regional Tier 1 centre. Today the LHC Computing Model is evolving towards an agile peer-to-peer model which makes efficient use of compute and storage resources by exploiting high-bandwidth networks [4]. In this model, the Tier 1 and Tier 2 centres can directly access data from any other centre. This new paradigm is being enabled by 100 Gbps networking technology that is being deployed in the cores of major research and education network providers such as ESnet, GEANT, CANARIE and Internet2. The LHC experiments need to be ready to exploit the current generation of 100 Gbps networks and the coming era of Terabit/s network transfers.

In this paper we show how large data sets can be rapidly transferred using high-speed network technologies. In particular, we present the results of a demonstration staged during the Super Computing 2011 (SC11) conference in Seattle, Washington. A 100 Gbps network was established between the University of Victoria (UVic) computing centre and the California Institute of Technology (Caltech) booth at SC11 in Seattle. A bi-directional throughput of 186 Gbps memory-to-memory and a single-direction throughput of 60 Gbps disk-to-disk (using Solid State Disk (SSD) technology) were achieved during the conference. This throughput was achieved with roughly half of a standard 42U 19-inch rack of Linux servers at each location. We describe the network and the server systems used at the Caltech booth and the UVic computing centre. The results presented in this work were obtained during the one-week period of the SC11 exhibition.

Figure 1. The 100 G circuit established between the UVic Data Centre in Victoria, Canada and the Caltech booth at the Seattle Convention Centre. Further detail of the machines in the Caltech booth is available in Figure 2 and in Section 3.

2. Network Design

A point-to-point 100 Gbps circuit was established between the UVic Data Centre and the Seattle Convention Centre, over a total distance of 212 km, using production segments of the BCNET and CANARIE networks. Figure 1 shows a schematic of the network. A Brocade MLXe-4 with an LR4 100G optic was located in the UVic Data Centre, connecting via BCNET to a Ciena OME 6500 located in the Victoria Transit Exchange (VicTX) in downtown Victoria. From there the circuit was carried across CANARIE and BCNET via an OTU4 link to a second OME 6500 located in the SCinet network at the Seattle Convention Centre, and then to an MLXe-4 (also with an LR4 optic) located in the Caltech conference booth. Each MLXe-4 was equipped with two 8-port 10GE line cards.

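To put the TCP buffer tuning described in Section 3 in context, the following back-of-the-envelope estimate (ours, not from the paper) computes the round-trip time and bandwidth-delay product for the 212 km path, assuming light propagates in fibre at roughly two thirds of c and that 212 km is the one-way fibre distance:

    # Back-of-the-envelope sizing for the Victoria-Seattle path.
    # Assumptions (not from the paper): propagation at ~2/3 c in fibre and
    # a one-way fibre distance equal to the quoted 212 km.
    C = 299_792_458            # speed of light in vacuum, m/s
    FIBRE_FACTOR = 2.0 / 3.0   # assumed propagation speed relative to c

    def bdp_bytes(path_km: float, rate_gbps: float) -> float:
        """Bandwidth-delay product: bytes in flight needed to keep the pipe full."""
        rtt_s = 2.0 * (path_km * 1e3) / (C * FIBRE_FACTOR)
        return rate_gbps * 1e9 * rtt_s / 8.0

    rtt_ms = 2.0 * (212e3 / (C * FIBRE_FACTOR)) * 1e3
    print(f"approx. RTT: {rtt_ms:.1f} ms")                              # ~2.1 ms
    print(f"BDP per 10 Gbps flow: {bdp_bytes(212, 10) / 1e6:.1f} MB")   # ~2.7 MB
    print(f"BDP for 100 Gbps:     {bdp_bytes(212, 100) / 1e6:.1f} MB")  # ~27 MB

A socket buffer comfortably larger than the per-flow product is needed for a single TCP stream to sustain line rate over this path, which is why the default kernel buffer settings were raised.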
3. End Site Systems

The focus of the UVic end system was to provide a stable platform for testing the more experimental end system installed on the exhibition floor at SC. The UVic cluster consisted of 10 identically configured Dell R710 servers, each with an Intel X520 DA network card and six 240 GB OCZ SSDs. The six SSD drives were configured in RAID-0 (see footnote 1) using a hardware RAID controller with the XFS file system. Each RAID controller was configured with a 1 MB stripe size (the maximum available) and a write-through algorithm. Scientific Linux 6.1 [6] was installed on the machines with common kernel TCP optimizations [7]: increased TCP buffers, an increased txqueuelen (10000 from 1000) and the htcp congestion control algorithm. The Linux kernel disk IO scheduler was changed from the typical deadline scheduler to noop. This configuration was found to give optimum performance between servers connected directly to the MLXe-4. The transfers were made using the high-performance data transfer tool FDT [8], developed by Caltech and the Politehnica University of Bucharest. Each host pair was able to achieve a throughput ranging from 9.49 to 9.54 Gbps.

Footnote 1: RAID-0 would not typically be used in a production environment, but given the limited hardware available it was a good choice for maximum performance.

Figure 2. The end system at the Caltech booth on the exhibition floor.

The system deployed to the Caltech booth consisted of a mix of PCIe Gen 2 servers with 10 GE NICs and PCIe Gen 3 servers with 40 GE NICs, supplied by Dell and Supermicro, as shown in Figure 2. Most of the equipment making up the Caltech booth was available only days before the start of the conference, and some of it was delivered after the start of the conference exhibition. Therefore, limited time was available to benchmark the systems' performance before deploying them as part of the demonstration. Three Supermicro systems were configured with 40 GE Mellanox CX3 NICs and 16 OCZ 120 GB SSDs (sc32-sc34 in Figure 2). Four Supermicro storage servers were used with Areca RAID controllers installed and connected to external JBOD chassis. Each JBOD was loaded with 24 x 2 TB disks configured as RAID-0 (sc1-sc4 in Figure 2). Each of these storage servers used a PCIe Gen 2 NIC from Mellanox and was connected to the Dell-Force10 Z9000 switch. The TCP and kernel tuning parameters were identical to those described for UVic. All RAID controllers were configured with large stripe sizes (1 MB), and the XFS file system was used.

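A minimal sketch of the kind of host tuning described above, in the spirit of the ESnet guide [7], is shown below. The buffer sizes, interface name and block device are illustrative placeholders rather than the exact values used in the demonstration, and the script must be run as root.

    # Illustrative host tuning: enlarged TCP buffers, htcp congestion control,
    # a larger txqueuelen and the noop disk I/O scheduler, applied through the
    # standard sysctl / ip / sysfs interfaces. Values are examples only.
    import subprocess

    SYSCTLS = {
        "net.ipv4.tcp_congestion_control": "htcp",
        "net.core.rmem_max": "67108864",              # example 64 MB ceiling
        "net.core.wmem_max": "67108864",
        "net.ipv4.tcp_rmem": "4096 87380 67108864",
        "net.ipv4.tcp_wmem": "4096 65536 67108864",
    }

    def tune_host(iface: str = "eth2", blockdev: str = "sda") -> None:
        for key, value in SYSCTLS.items():
            subprocess.run(["sysctl", "-w", f"{key}={value}"], check=True)
        # raise the transmit queue length from the default 1000 to 10000
        subprocess.run(["ip", "link", "set", "dev", iface, "txqueuelen", "10000"],
                       check=True)
        # switch the block-layer scheduler to noop for the RAID device
        with open(f"/sys/block/{blockdev}/queue/scheduler", "w") as f:
            f.write("noop")

    if __name__ == "__main__":
        tune_host()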
4. Results

The 100G circuit was established from the conference to the UVic data centre on November 13th, with first data flowing on the afternoon of November 14th. The program for the exhibition week proceeded in three distinct phases: first, maximum uni-directional memory-to-memory traffic; then bi-directional memory-to-memory traffic; and finally disk-to-disk throughput from UVic to SC. The evolution of the data transfers can be seen in Figure 3. Notable features include the start of the bi-directional memory-to-memory flow (morning of Nov 15th) and the large disk-to-disk flow starting on the evening of Nov 16th.

Figure 3. Total traffic during the SC week. Traffic in is traffic from UVic to SC and traffic out is traffic from SC to UVic. Total integrated traffic for the week was 4.4 PB. Each coloured band represents the contribution from a single machine.

Once the circuit was established we were able to quickly achieve over 98 Gbps sustained uni-directionally over the circuit with no packet drops. All network switching equipment was remarkably stable and presented no problems in configuration. After attempting maximum bi-directional throughput we observed a decrease in throughput to roughly 60 Gbps in (UVic to SC) and 80 Gbps out (SC to UVic). This decrease was eliminated by changing the load-balancing algorithm of the 12 x 10GE port-channel Link Aggregation Group (LAG) between the Brocade MLXe-4 and the Dell-Force10 Z9000 from hash-based to round robin. After this fix we were able to achieve a stable throughput of 98 Gbps in and 88 Gbps out, for a combined total of 186 Gbps.

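The effect of the load-balancing change can be seen with a toy model (ours, not a measurement from the demonstration): with hash-based distribution each flow is pinned to a single 10GE LAG member, so a handful of large flows that hash onto the same member must share that link, whereas per-packet round robin spreads every flow over all twelve members (at the cost of possible packet reordering).

    # Toy model of hash-based vs round-robin LAG load balancing.
    import random
    from collections import Counter

    MEMBERS = 12      # 12 x 10GE links in the MLXe-4 <-> Z9000 LAG
    FLOWS = 20        # illustrative number of concurrent large flows
    LINK_GBPS = 10.0

    random.seed(7)
    # Model the header hash as a uniform random choice of member per flow.
    per_member = Counter(random.randrange(MEMBERS) for _ in range(FLOWS))
    worst = max(per_member.values())
    print(f"hash-based: busiest member carries {worst} flows, "
          f"each limited to ~{LINK_GBPS / worst:.1f} Gbps")
    print(f"round robin: the {FLOWS} flows share the full "
          f"{MEMBERS * LINK_GBPS:.0f} Gbps of LAG capacity")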
The next focus of experimentation was to use as few 40GE PCIe Gen 3 machines as possible to receive 100 Gbps. As can be seen in Figure 4, data flows were gradually consolidated such that two Supermicro PCIe Gen 3 machines were receiving 30 Gbps each and two PCIe Gen 2 Dell machines were receiving 20 Gbps. The maximum achievable throughput for a PCIe Gen 2 system with a 40 GE NIC was 24 Gbps (this was also demonstrated by the Caltech team at SC10).

Figure 4. The "traffic in" section of the plot shows an ever smaller number of machines receiving memory-to-memory transfers. Near 15:30 we see 100 Gbps being received by only four machines on the show floor. The two machines each receiving 30 Gbps (top right) are PCIe Gen 3 machines with 40 GE Mellanox NICs.

Figure 5. Disk-to-disk transfers from UVic to the SC show floor, Nov 17-18. Total transfer speed peaked above 60 Gbps around 22:00 on Nov 16th. The sharp drops occurring on Nov 16 were the result of Linux kernel panics. The large drop at 4:30 on Nov 17 was the result of the failure of a 16-disk RAID array on a PCIe Gen 3 machine.

The final and most challenging part of the demonstration was to achieve the maximum possible disk-to-disk throughput between the two end systems. Figure 5 shows the achieved disk-to-disk throughput for the 12-hour period starting at 20:30 on November 16th. Each host at UVic was loaded with 10 GB files to roughly 80% capacity, and a script was set up to copy the same series of files repeatedly to the hosts at SC using the FDT client and server with 6 parallel streams. Total disk-to-disk throughput peaked at 60 Gbps.

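For illustration, a driver of the kind described above might look like the sketch below. This is not the script used in the demonstration; the fdt.jar location, host name, directories and the -c/-d/-P options are assumptions based on common FDT usage, so consult the FDT documentation [8] for the exact invocation.

    # Illustrative loop that repeatedly pushes a pre-staged set of 10 GB files
    # to a destination host running the FDT server, using N parallel streams.
    import glob
    import subprocess

    FDT_JAR = "/opt/fdt/fdt.jar"       # assumed install location
    DEST_HOST = "sc-booth-host"        # placeholder FDT server on the show floor
    DEST_DIR = "/data/raid0"           # placeholder destination directory
    STREAMS = 6                        # parallel TCP streams, as in the demo

    def copy_forever(src_pattern: str = "/data/staged/*.dat") -> None:
        files = sorted(glob.glob(src_pattern))    # the pre-loaded 10 GB files
        while True:                               # loop over the same file set
            subprocess.run(
                ["java", "-jar", FDT_JAR,
                 "-c", DEST_HOST, "-d", DEST_DIR,
                 "-P", str(STREAMS), *files],
                check=True,
            )

    if __name__ == "__main__":
        copy_forever()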
A number of Linux kernel panics in the XFS module were encountered early in the transfer. Each machine suffering a panic was rebooted and the transfers were restarted. The Caltech booth machines used the 3.0.1 UltraLight kernel [9] rather than the Red Hat provided 2.6.32-series kernels in order to get improved performance with the latest hardware. The source of the kernel panics was never resolved, but their occurrence was unsurprising given the relatively untested nature of the kernel and file system combination at the time. The frequency of the kernel panics was reduced by dropping to two parallel streams on the systems experiencing the panics. One 16-disk SSD RAID array failed near 04:30 on November 17 because of a drive failure. Because the array was in a RAID-0 configuration (no redundancy), the copy operation was unable to continue. One PCIe Gen 3 server with 16 SSDs was able to sustain a continuous write rate of 12.5 Gbps (orange bar in Figure 5). The performance of the systems degraded after many repeated writes to the same system.

5. Conclusion

The SC11 demonstration achieved its goal of clearing the way to Terabit/sec data transfers by showing that a modest set of systems is able to efficiently utilize a 100 Gbps circuit at near 100% of its capacity. The latest generation of servers (based on the recently released PCIe Gen 3 standard), equipped with 40GE interface cards and RAID arrays of high-speed SSD disks, allowed the team to reach a stable throughput of 12.5 Gbps from network to disk per 2U server. A total disk-to-disk throughput between the two sites of 60 Gbps was achieved, in addition to a total bi-directional memory-to-memory throughput of 186 Gbps. It is important to underline that pre-production systems with relatively few SSDs were used during this demonstration, and no in-depth tuning was performed due to the limited preparation time. We therefore expect these numbers to improve further and approach the 40GE line rate within the next year.
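A rough cross-check of these figures (our arithmetic, not the paper's): at 12.5 Gbps of sustained network-to-disk throughput per 2U server, five such receivers account for the observed 60 Gbps disk-to-disk peak, and eight would be needed to sink a fully loaded 100 Gbps circuit.

    # Illustrative arithmetic: receivers needed to sink a given aggregate rate
    # at 12.5 Gbps of network-to-disk throughput per 2U server.
    import math

    PER_SERVER_GBPS = 12.5
    for target_gbps in (60, 100):
        servers = math.ceil(target_gbps / PER_SERVER_GBPS)
        print(f"{target_gbps} Gbps aggregate -> {servers} servers")
    # 60 Gbps -> 5 servers, 100 Gbps -> 8 servers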
6. Acknowledgements

The generous support in kind of our industrial partners Ciena, Brocade, Dell, Mellanox, Supermicro and Color-Chip is acknowledged. We would like to acknowledge the support of the Natural Sciences and Engineering Research Council of Canada, the National Science Foundation, and the US Department of Energy.

References

[1] The ATLAS Collaboration et al 2008 The ATLAS Experiment at the CERN Large Hadron Collider JINST 3 S08003 doi:10.1088/1748-0221/3/08/S08003
[2] The CMS Collaboration et al 2008 The CMS experiment at the CERN LHC JINST 3 S08004 doi:10.1088/1748-0221/3/08/S08004
[3] Evans L and Bryant P 2008 LHC Machine JINST 3 S08001 doi:10.1088/1748-0221/3/08/S08001
[4] Bos K and Fisk I 2010 The Bos-Fisk Paper, http://lhcone.web.cern.ch/node/19
[5] Newman H 2011 A New Generation of Networks and Computing Models for High Energy Physics in the LHC Era J. Phys.: Conf. Ser. 331 012004 doi:10.1088/1742-6596/331/1/012004
[6] The Scientific Linux Distribution, www.scientificlinux.org
[7] The ESnet Linux Host Tuning Guide, http://fasterdata.es.net/host-tuning/linux/
[8] Maxa Z, Ahmed B, Kcira D, Legrand I, Mughal A, Thomas M and Voicu R 2011 Powering physics data transfers with FDT J. Phys.: Conf. Ser. 331 052014 doi:10.1088/1742-6596/331/5/052014
[9] The UltraLight Linux Kernel, http://ultralight.caltech.edu/web-site/ultralight/workgroups/network/kernel/kernel.html