CMS Data Transfer Challenges and Experiences with 40G End Hosts


CMS Data Transfer Challenges and Experiences with 40G End Hosts
Technology Exchange, Advanced Networking / Joint Techs, Indianapolis, October 2014
Azher Mughal, Dorian Kcira
California Institute of Technology

Caltech Tier2 Center
- Servers are spread across 3 data centers within the campus.
- Total nodes, including storage servers: 300
- CPU cores: 5200
- Storage space: 3.1 PetaBytes raw, 1.5-2 PetaBytes usable (Hadoop storage replication factor: 50%)
- GridFTP servers are the gateway to the rest of the CMS grid traffic.
- Greater use of FDT with CMS PhEDEx and SDN for LHC Run2.
- Data center remodeled in late 2013.

Data Center Connectivity
- 100G uplink to CENIC (NSF CC-NIE award) with 10GE as a backup.
- 40G inter-building connectivity.
- Vendor neutral, switching hardware from Brocade, Dell and Cisco.
- Active ports: 8 x 40GE, ~40 x 10GE, ~500 x 1GE.
- Core switches support OpenFlow 1.0 (OF 1.3 by Q4 2014).
- OSCARS / OESS controller in production since 2009. Peering with Internet2.
- DYNES-connected storage node. NSI ready using the OSCARS NSI Bridge.
[Diagram: GridFTP servers 1-3 on the 10/40G data center fabric spanning the Lauritsen, IPAC and CACR buildings, with a 100G backbone and 40G/10G links to the compute + storage clusters.]

100G LHCONE Connectivity, IP Peerings
- The CC-NIE award helped purchase the 100G equipment, with strong support from CENIC.
- Tier2 connected to the Internet2 LHCONE VRF since April 2014.
- In case of link failure, the primary 100G path fails over to the 10G link.
- Direct IP peering with UFL over AL2S and FLR (IPv4+IPv6).
- Ongoing point-to-point performance analysis experiments with Nebraska through Internet2 Advanced Layer2 Services (AL2S).
[Diagram: Caltech VRF peering with the Internet2 VRF via CENIC R&E and PacWave (IPv4+IPv6), and with the UFL VRF (IPv4).]

perfSONAR LHCONE Tests
- Participating in the US CMS Tier2 and LHCONE perfSONAR mesh tests.
- Separate perfSONAR instances for OWAMP (1G) and BWCTL (10G).
- Uses TCP applications, where RTT plays a major role in single-stream bandwidth throughput. [Not a factor in the other tests with FDT at high throughput shown in this talk.]

[root@perfsonar ~]# ping perfsonar-de-kit.gridka.de
64 bytes from perfsonar-de-kit.gridka.de (192.108.47.6): icmp_seq=1 ttl=53 time=172 ms
[root@perfsonar ~]# ping ps-bandwidth.lhcmon.triumf.ca
64 bytes from ps-bandwidth.lhcmon.triumf.ca (206.12.9.1): icmp_seq=2 ttl=58 time=29.8 ms
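
As a back-of-the-envelope illustration of why the 172 ms path caps a single TCP stream so much harder than the 30 ms one, throughput is bounded by window size divided by RTT; the window sizes in this sketch are assumptions, not measured values:

# Single-stream TCP throughput is roughly bounded by window / RTT.
# The window sizes below are illustrative assumptions, not measurements.

def max_single_stream_gbps(window_bytes, rtt_seconds):
    """Upper bound on one TCP stream's throughput for a given window and RTT."""
    return window_bytes * 8 / rtt_seconds / 1e9

RTT_KIT = 0.172      # Caltech <-> KIT, from the ping output above
RTT_TRIUMF = 0.0298  # Caltech <-> TRIUMF

for window in (4 * 2**20, 32 * 2**20):   # 4 MiB and 32 MiB windows (assumed)
    print(f"{window >> 20} MiB window: "
          f"KIT {max_single_stream_gbps(window, RTT_KIT):.2f} Gbps, "
          f"TRIUMF {max_single_stream_gbps(window, RTT_TRIUMF):.2f} Gbps")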

Capacity vs Application Optimization
CMS Software Components Primer
- PhEDEx: bookkeeping for CMS data sets. Knows the end points and manages high-level aspects of the transfers (e.g. the file router).
- FTS: negotiates the transfers among end sites/points and initiates transfers through the GridFTP servers.
- SRM: selects the appropriate GridFTP server (mostly round-robin).
- GridFTP: the current workhorse of grid middleware for the transfers between end sites; an interface between the storage element and the wide area network.
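
As a rough illustration of how a transfer is handed down through these layers to a concrete GridFTP URL pair, here is a minimal sketch; the host names, port and plain round-robin policy are assumptions for illustration, not the production SRM/FTS logic:

# Minimal sketch of the hand-off from a dataset-level request to a GridFTP
# source/destination URL pair. Host names and the plain round-robin policy
# are illustrative assumptions, not the production configuration.
import itertools

# SRM-style selection: rotate over the site's GridFTP doors.
GRIDFTP_SERVERS = ["gridftp-1.example.edu", "gridftp-2.example.edu",
                   "gridftp-3.example.edu"]
_round_robin = itertools.cycle(GRIDFTP_SERVERS)

def select_gridftp_server():
    """Pick the next GridFTP door, round-robin (the 'mostly round-robin' step)."""
    return next(_round_robin)

def build_transfer(source_site_url, lfn):
    """FTS-style step: turn a logical file name into a concrete src/dst URL pair."""
    dst = f"gsiftp://{select_gridftp_server()}:2811/store/{lfn.lstrip('/')}"
    return source_site_url + lfn, dst

# PhEDEx-level bookkeeping would subscribe a whole block of files; one here:
src, dst = build_transfer("gsiftp://gridftp.source-site.org:2811/store",
                          "/data/run2/file.root")
print(src, "->", dst)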

Capacity vs Application Optimization
- US CMS mandated a 20 Gbps disk-to-disk test rate using PhEDEx load tests (https://twiki.cern.ch/twiki/bin/view/cmspublic/uscmstier2upgrades).
- The software stack is not ready out of the box to achieve higher throughputs among sites; it requires a lot of tuning.
- Benchmarking the individual GridFTP servers to understand the limits.
- GridFTP uses 3 different file checksum algorithms for each file transfer, which consumes a lot of CPU cycles.
- During the first round of tests, FTS (the older version) was limited to 50 transfers in parallel; this limit was removed in the August release.
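
To get a feel for why per-file checksumming eats CPU cycles at these rates, a small timing sketch; adler32 and MD5 are only example algorithms, not necessarily the three that GridFTP computes here:

# Rough timing sketch: CPU cost of checksumming data at transfer rates.
# adler32 and MD5 are example algorithms; the exact set GridFTP computes
# depends on the site configuration.
import hashlib, time, zlib

data = bytes(64 * 2**20)   # 64 MiB of zeros stands in for file data

def rate_gbps(fn):
    t0 = time.perf_counter()
    fn(data)
    dt = time.perf_counter() - t0
    return len(data) * 8 / dt / 1e9

print(f"adler32: {rate_gbps(zlib.adler32):.1f} Gbps per core")
print(f"md5:     {rate_gbps(lambda b: hashlib.md5(b).digest()):.1f} Gbps per core")
# If a checksum runs slower than the per-server transfer rate, it becomes the
# bottleneck unless it is overlapped with I/O or spread across cores.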

Capacity vs Application Optimization
- 10 Gbps GridFTP servers behave very well: steady at 88% of capacity, peaking at 96%, with optimal CPU/memory consumption.
- Dual 10 Gbps: 20G total = 10G local in + 10G WAN out.

Capacity vs Application Optimization
- Increased transfers observed during the LHCONE ANA-100G integration.
- Manual ramp-up was used as a test because remote PhEDEx sites were not subscribing enough transfers to the FTS links (e.g. Caltech - CNAF).
[Plot: manual ramp-up of LHCONE -> ANA 100G traffic.]

Capacity vs Application Optimization
Tier2 traffic flows: peaks of 43G over AL2S, with 20G to CNAF (Bologna) alone.

Testing High Speed Transfers
Logical layout from CERN (USLHCNet) to Caltech (Pasadena).
[Diagram: two 40 Gbps clients (Sandy1, Sandy3) at CERN / Amsterdam feeding a storage server at Caltech over Internet2 AL2S, with a maximum of 80 Gbps of traffic allowed on the path.]

Testing High Speed Transfers: Servers, OS, Transfer Tools
Storage Server:
- RHEL 7.0, Mellanox OFED (2.2-1.0.1)
- SuperMicro (X9DRX+-F), dual Intel E5-2690-v2 (Ivy Bridge), 64GB DDR3 RAM
- Intel NVMe SSD drives (Gen3 PCIe) [1.7 GBytes/sec each in the lab]
- Two Mellanox 40GE VPI NICs
Client Systems:
- SL 6.5, Mellanox OFED
- SuperMicro (X9DR3-F), dual Intel E5-2670 (Sandy Bridge), 64GB DDR3 RAM
- One Mellanox 40GE VPI NIC
Transfer software: RDMA FTP (http://ftp100.cewit.stonybrook.edu/rftp/)
- Operates using RDMA or TCP; written in C
- RDMA mode: breaks source files into parallel data streams
- TCP mode: single stream for each source file
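
A conceptual sketch of the parallel-stream idea behind RFTP's RDMA mode, i.e. splitting one source file into byte ranges that separate streams can carry; this is only an illustration, not RFTP's actual code, and the file path shown is hypothetical:

# Conceptual sketch only: split one source file into N byte ranges so that
# N parallel streams can carry it. Not RFTP's implementation.
import os

def chunk_ranges(path, n_streams):
    """Return (offset, length) pairs covering the file for n_streams workers."""
    size = os.path.getsize(path)
    base, extra = divmod(size, n_streams)
    ranges, offset = [], 0
    for i in range(n_streams):
        length = base + (1 if i < extra else 0)
        ranges.append((offset, length))
        offset += length
    return ranges

# Example (hypothetical path): 8 parallel streams over one file.
# for off, length in chunk_ranges("/data/big.root", 8):
#     ... each worker seeks to `off` and sends `length` bytes ...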

Testing High Speed Transfers
Data written to the SSD drives on the destination server at Caltech.
- After tunings, with RFTP and the large RTT, a single TCP stream was still limited to 360 Mbps, so ~200 streams were used; this achieved a stable 70 Gbps of throughput.
- AL2S is a shared infrastructure; during these transfers the LA-Phoenix segment showed 97.03 Gbps.
- The CPU (E5-2670 v2) is a bottleneck when using TCP at this rate and number of flows (network and I/O processes compete).
[Plots: 97.03 Gbps on the LA-Phoenix AL2S segment; disk writes at 70 Gbps.]
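
A quick sanity check of the stream count, using only the numbers quoted above (our own back-of-the-envelope arithmetic, not taken from the talk):

# Back-of-the-envelope check of the stream count used in the 70 Gbps test.
single_stream_gbps = 0.360   # observed per-stream TCP limit at this RTT
target_gbps = 70.0           # achieved aggregate throughput

streams_needed = target_gbps / single_stream_gbps
print(f"~{streams_needed:.0f} streams needed")   # ~194, matching the ~200 used
print(f"ceiling with 200 streams: {200 * single_stream_gbps:.0f} Gbps")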

OESS SDN Testing Between CMS Grid Sites
Based on the work of Samir Cury (Caltech), using the Open Exchange Software Suite (OESS) APIs for tests on the production network.
- Faster transfers between LHC sites will require a better choice of paths, and sometimes multiple paths across Internet2:
  - avoid overloading paths;
  - make sure transfers do not interfere with each other.
- Load balancing: each site has 2-3 paths and can alternate between them, choosing the least utilized one in each instance (see the sketch below).
[Figure: illustration of the SDN agent choosing between paths.]
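
A minimal sketch of the least-utilized-path choice described above; the path names, utilization figures and the monitoring lookup are placeholders, not the agent's real OESS API calls:

# Minimal sketch of least-utilized path selection between two sites.
# Path names, utilization figures and the lookup function are placeholders;
# a real agent would query OESS / circuit monitoring for live counters.

CANDIDATE_PATHS = {
    ("Caltech", "Nebraska"): ["AL2S-via-LosAngeles", "AL2S-via-ElPaso", "AL2S-via-Seattle"],
}

def current_utilization_gbps(path_name):
    """Placeholder for a query against the monitoring system."""
    sample = {"AL2S-via-LosAngeles": 43.0, "AL2S-via-ElPaso": 12.5, "AL2S-via-Seattle": 27.0}
    return sample[path_name]

def pick_path(src, dst):
    """Choose the least utilized of the 2-3 candidate paths between two sites."""
    paths = CANDIDATE_PATHS[(src, dst)]
    return min(paths, key=current_utilization_gbps)

print(pick_path("Caltech", "Nebraska"))   # -> AL2S-via-ElPaso in this example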

Multipath Plugins for the OpenDaylight SDN Controller
Based on the work of Julian Bunn (Caltech).
- The goal is to support application-level intelligent path selection for high-bandwidth international transfers of LHC data.
- Based on initial work at Caltech with the Floodlight SDN/OpenFlow controller.
- Working with the OpenDaylight Hydrogen release.
- Using a variety of hardware and virtual switchgear, including Brocade, Pica8, HP, Padtec and Mininet.
- A virtual Internet2 testbed is used for the Multipath/ODL tests.

Multipath Plugins for OpenDaylight SDN Controller CERN ODL Testbed featuring Pica8 and HP switches, Sandy Bridge hosts

Multipath Plugins for OpenDaylight SDN Controller
We have developed two plugins for ODL:
- Multipath: implements path calculators (e.g. a modified Dijkstra) and a variety of path selectors, for example:
  - shortest path
  - random path
  - round-robin path
  - maximum available bandwidth path
  - path with fewest flows
  - path with highest capacity
- Multipath.Northbound: provides REST endpoints for proactive control of routing, as well as monitoring. For example:
  - select the path calculator
  - set up flow entries in switches between two hosts (according to the current path calculator)
  - query flow traffic, network topology, link activation
A sketch of one of the path selectors follows below.
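
As an illustration of one of these selectors, here is a sketch of the "maximum available bandwidth" choice over candidate paths; the actual plugin is an OpenDaylight (Java) bundle, and the topology and counters below are made-up placeholders:

# Illustration only: a "maximum available bandwidth" selector over a set of
# candidate paths, in the spirit of the Multipath plugin's selectors.
# Topology and counters are made-up placeholders.

# Each path is a list of links; each link is (capacity_gbps, used_gbps).
CANDIDATE_PATHS = {
    "path-A": [(100, 60), (100, 40), (40, 10)],
    "path-B": [(100, 20), (100, 30), (100, 25)],
}

def available_bandwidth(path_links):
    """A path's headroom is limited by its most loaded link (the bottleneck)."""
    return min(capacity - used for capacity, used in path_links)

def max_available_bandwidth_path(paths):
    return max(paths, key=lambda name: available_bandwidth(paths[name]))

best = max_available_bandwidth_path(CANDIDATE_PATHS)
print(best, available_bandwidth(CANDIDATE_PATHS[best]), "Gbps free")  # path-B, 70 Gbps free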

Multipath Plugins for OpenDaylight SDN Controller These plugins, as well as multilayer dynamic circuits under software control, will be tested at ~1Tbps (with servers using many 40GE interfaces) during SuperComputing 2014

Caltech @ SuperComputing 2014

Conclusions / Moving Forward
- The LHC experiments (and other major science programs) will need to be aware of the impact of their large flows across the R&E networks in the US, Europe and across the Atlantic.
- As shown, with modern-day off-the-shelf equipment, 100GE paths can be fully loaded relatively easily.
- Caltech achieved 20+ Gbps in production mode using its Tier2 in a first set of trials (CC-NIE 100G, Internet2 AL2S, ANA-100); this led to a 43G peak on AL2S.
- We are integrating an OpenDaylight multipath controller (based on a Floodlight controller in the OLiMPS project) with CMS's mainstream data transfer management system, PhEDEx, using OSCARS circuits: either to create parallel paths to the same destination (to help avoid backbone congestion), or to choose paths to destinations by looking at the load on the WAN.
- Benchmarking next-generation CPUs and memory, keeping the software stack tuned, and knowing its limitations under different sets of application requirements.
- SDN multipath demonstrations over the WAN and on the show floor will be showcased during the Supercomputing Conference 2014 in New Orleans.

Acknowledgments
- DYNES: NSF 0958998
- ANSE: NSF 1246133
- OLiMPS: DOE DE-SC0007346
- Cisco Research: 2014-128271
- US LHCNet: Fermilab subaward #608666 under DOE #DE-AC02-07CH11359
- Caltech Tier2: Princeton subaward #00002009 under NSF award #1121038