WLCG Network Throughput WG

Similar documents
perfsonar: A Look Ahead Andrew Lake, ESnet Mark Feit, Internet2 October 16, 2017

Network Analytics. Hendrik Borras, Marian Babik IT-CM-MM

Installation & Basic Configuration

Constant monitoring of multi-site network connectivity at the Tokyo Tier2 center

Introduction to. Network Startup Resource Center. Partially adopted from materials by

Introduction to perfsonar. RIPE SEE5, Tirana, Albania Szymon Trocha Poznań Supercomputing and Networking Center, Poland April 2016

perfsonar Going Forward Eric Boyd, Internet2 Internet2 Technology Exchange September 27 th 2016

NCP Computing Infrastructure & T2-PK-NCP Site Update. Saqib Haleem National Centre for Physics (NCP), Pakistan

The perfsonar Project at 10 Years: Status and Trajectory

CMS Data Transfer Challenges and Experiences with 40G End Hosts

DICE Network Diagnostic Services

perfsonar psui in a multi-domain federated environment

DICE Diagnostic Service

Deployment of a WLCG network monitoring infrastructure based on the perfsonar-ps technology

Philippe Laurens, Michigan State University, for USATLAS. Atlas Great Lakes Tier 2 collocated at MSU and the University of Michigan

perfsonar Deployment on ESnet

High bandwidth, Long distance. Where is my throughput? Robin Tasker CCLRC, Daresbury Laboratory, UK

Connectivity Services, Autobahn and New Services

Network Expectations for Data Intensive Science

ANSE: Advanced Network Services for [LHC] Experiments

Conference The Data Challenges of the LHC. Reda Tafirout, TRIUMF

Connect. Communicate. Collaborate. Click to edit Master title style. Using the perfsonar Visualisation Tools

Intercontinental Multi-Domain Monitoring for LHC with perfsonar

DYNES: DYnamic NEtwork System

Analytics Platform for ATLAS Computing Services

Monitoring and Analysis

SLIDE 1 - COPYRIGHT 2015 ELEPHANT FLOWS IN THE ROOM: SCIENCEDMZ NATIONALLY DISTRIBUTED

SLATE. Services Layer at the Edge. First Meeting of the National Research Platform Montana State University August 7-8, 2017

CERN Network activities update

Network Management & Monitoring

MaDDash: Monitoring and Debugging Dashboard

Developing Networking and Human Expertise in Support of International Science

SOFTWARE DEFINED NETWORKS. Jonathan Chu Muhammad Salman Malik

Wide-Area Networking at SLAC. Warren Matthews and Les Cottrell (SCS Network Group) Presented at SLAC, April

Software-Defined Networking from Serro Solutions Enables Global Communication Services in Near Real-Time

Fermilab WAN Performance Analysis Methodology. Wenji Wu, Phil DeMar, Matt Crawford ESCC/Internet2 Joint Techs July 23, 2008

Traces. The main idea: Use traceroutes to create a graph model of the network, use graph traversal algorithms to determine proximity of nodes

Experiments on TCP Re-Ordering March 27 th 2017

Configuring Cisco IOS IP SLAs Operations

Networks & protocols research in Grid5000 DAS3

Monitoring for IT Services and WLCG. Alberto AIMAR CERN-IT for the MONIT Team

SD-WAN Deployment Guide (CVD)

5 August 2010 Eric Boyd, Internet2 Deputy CTO

Programmable Information Highway (with no Traffic Jams)

Q-Balancer Range FAQ The Q-Balance LB Series General Sales FAQ

ICFA SCIC Network Monitoring Report

CHAPTER 3 GRID MONITORING AND RESOURCE SELECTION

Enabling a SuperFacility with Software Defined Networking

CCIE SP Operations Written Exam v1.0

perfsonar Update Jason Zurawski Internet2 March 5, 2009 The 27th APAN Meeting, Kaohsiung, Taiwan

L1.5VPNs: PACKET-SWITCHING ADAPTIVE L1 VPNs

Active Measurement of Data-Path Quality in a Non-cooperative Internet

Trisul Network Analytics - Traffic Analyzer

SEVONE END USER EXPERIENCE

Delay Controlled Elephant Flow Rerouting in Software Defined Network

GÉANT Mission and Services

Chapter 4. Routers with Tiny Buffers: Experiments. 4.1 Testbed experiments Setup

NGFW Security Management Center

IRNC:RXP SDN / SDX Update

THOUGHTS ON SDN IN DATA INTENSIVE SCIENCE APPLICATIONS

IT Infrastructure. Transforming Networks to Meet the New Reality. Phil O Reilly, CTO Federal AFCEA-GMU C4I Symposium May 20, 2015

10 Reasons your WAN is Broken

please study up before presenting

Best Practice - VPN Performance Testing

Link-Layer Layer Broadcast Protocol for SpaceWire

The new perfsonar: a global tool for global network monitoring

The Abilene Observatory and Measurement Opportunities

Testing & Assuring Mobile End User Experience Before Production Neotys

90 % of WAN decision makers cite their

The Data Replication Bottleneck: Overcoming Out of Order and Lost Packets across the WAN

Installation and Deployment

Internet Technology. 15. Things we didn t get to talk about. Paul Krzyzanowski. Rutgers University. Spring Paul Krzyzanowski

Design Issues in Traffic Management for the ATM UBR+ Service for TCP over Satellite Networks: Report II

Exploring Cloud Security, Operational Visibility & Elastic Datacenters. Kiran Mohandas Consulting Engineer

Scheduling Computational and Storage Resources on the NRP

Not all SD-WANs are Created Equal: Performance Matters

ECE 697J Advanced Topics in Computer Networks

A Deployable Framework for Providing Better Than Best-Effort Quality of Service for Traffic Flows

PRODUCT BRIEF Cubro Vitrum Management Suite PRODUCT BRIEF. 1

Pricing Intra-Datacenter Networks with

QoS on Low Bandwidth High Delay Links. Prakash Shende Planning & Engg. Team Data Network Reliance Infocomm

AGENDA Introduction Pivotal Cloud Foundry NSX-V integration with Cloud Foundry New Features in Cloud Foundry Networking NSX-T with Cloud Fou

SaaS Providers. ThousandEyes for. Summary

Configuring Cisco IOS IP SLAs Operations

IP SLAs Overview. Finding Feature Information. Information About IP SLAs. IP SLAs Technology Overview

Configuring Cisco IOS IP SLA Operations

Presented by: Fabián E. Bustamante

The European DataGRID Production Testbed

Application Layer Switching: A Deployable Technique for Providing Quality of Service

International Networking in support of Extremely Large Astronomical Data-centric Operations

Software-Defined Networking (SDN) Overview

perfsonar ESCC Indianapolis IN

Software Defined Networks and OpenFlow. Courtesy of: AT&T Tech Talks.

Protecting Your SaaS Investment: Monitoring Office 365 Performance

OR /2017-E. White Paper KARL STORZ OR1 FUSION IP. Unified Communication and Virtual Meeting Rooms WHITE PAPER

Scientific data processing at global scale The LHC Computing Grid. fabio hernandez

IT HealthCheck Feature List

Innovative Solutions. Trusted Performance. Intelligently Engineered. Comparison of SD WAN Solutions. Technology Brief

Towards Network Awareness in LHC Computing

NGFW Security Management Center

Transcription:

WLCG Network Throughput WG Shawn McKee, Marian Babik for the Working Group HEPiX Tsukuba 16-20 October 2017

Working Group WLCG Network Throughput WG formed in the fall of 2014 within the scope of WLCG operations with the following objectives: Ensure sites and experiments can better understand and fix networking issues Measure end-to-end network performance and use the measurements to single out on complex data transfer issues Improve overall transfer efficiency and help us determine the current status of our networks Core activities include: Deployment and operations of perfsonar infrastructure Gain visibility into how our networks operate and perform Network analytics Improve our ability to fully utilize the existing network capacity Network performance incidents response team Provide support to help debug and fix network performance issues https://twiki.cern.ch/twiki/bin/view/lcg/networktransfermetrics HEPiX Fall 2017 2

News perfsonar 4.0.1 released on Aug 16 Many pscheduler fixes and improvements New ports/firewall openings needed MA issues seen after deployment perfsonar 4.1 plan getting in shape Bringing new features, but also dropping SL6 support OSG Network Measurement Service Configuration and monitoring now in production Documentation changes New Grafana dashboards New analytical capabilities with ELK+jupyter HEPiX Fall 2017 3

Current perfsonar Deployment https://www.google.com/fusiontables/datasource?docid=1qt4r17heufkvnqhju24niptz66xauyeibwwh5kpa#map:id=3 Initial deployment coordinated by WLCG perfsonar TF Commissioning of the network followed by WLCG Network and Transfer Metrics WG HEPiX Fall 2017 4

LHCONE Mesh 4 April 2017 HEPiX Fall 2017 5

LHCONE Mesh 10 Oct 2017 HEPiX Fall 2017 6

LHCOPN Mesh 4 April 2017 HEPiX Fall 2017 7

LHCOPN Mesh 10 Oct 2017 Bandwidth testing suffers from an issue related to 4.0.1 upgrade, which is being investigated in collaboration with the developers HEPiX Fall 2017 8

perfsonar 4.0.1/4.0.2 Minor feature and bugfix release Full support for CentOS7, Ubuntu 16 and Debian 9 pscheduler Archiver will start working in streaming mode Ability to use custom port (default is 443) New port opening requirement pscheduler uses https/443 as controller channel, which means it needs to be accessible from any site Dynamic list of WLCG perfsonar nodes is available in case opening needs to be restricted New SNMP plugins Collect SNMP data (router utilization) and store them along with the other measurements HEPiX Fall 2017 9

Important Dates October 17, 2017 perfsonar 3.5 end-of-life Web100 kernel end-of-life November 2017 perfsonar 4.0.2 final release Q1 2018 perfsonar 4.1 release; CentOS6 packages will not be released Q3 2018 (6 months after 4.1) perfsonar 4.0 end-of-life CentOS6 no longer supported HEPiX Fall 2017 10

perfsonar 4.1 Plans pscheduler Resource management port pools Pre-emptive scheduling support improving client response time New plugins Network traffic capture Application-level (e.g. http response time) Retirement of bwctl TWAMP support (two-way active measuremnt) ping alternative of owamp routers/switches can participate in the tests Docker support HEPiX Fall 2017 11

Network Measurement Platform The diagram on the right provides a high-level view of how WLCG/OSG is managing perfsonar deployments, gathering metrics and making them available for use. HEPiX Fall 2017 12

Grafana/LHCOPN LHCOPN overview and detailed view (per site) Shows utilization of the two CERN routers Exact same information as shown by netstat.cern.ch HEPiX Fall 2017 13

Grafana/E2E perfsonar Combined view on throughput, latency, traceroutes source to 3 different destinations can be shown both direct and reverse Shows exact same data as you would find in psmad or toolkit web pages HEPiX Fall 2017 14

Grafana/pS Inter-regional Provides an overview of how the site performs in different regions Uses domain name wildcards to match regions Throughput and re-transmissions are done in the same way Granularity of testing is determined by meshes, which are experimentbased HEPiX Fall 2017 15

Grafana/Network Performance Experimental dashboard showing side-by-side comparison of perfsonar data, LHCOPN traffic and FTS transfers Also available with ESNet traffic data HEPiX Fall 2017 16

Improving Transfer Efficiency One of the main goals of the WG to identify and help fix network issues in our infrastructure Reminder: Dedicated support unit exists to help sites and experiments with network-relates issues Analysis of the issues has significantly improved in terms of both speed and accuracy, but only if end-sites have perfsonar Recent cases: T0 to JINR - Resolved by JINR by putting in place new Moscow - Dubna link and fixing routing asymmetries CBPF to CNAF, PIC and IN2P3 MTU stepdown issue, resolved in collaboration with RNP NCP/Pakistan performance IPv4 QoS suspected, IPv6 works fine SARA/IC IPv6 connectivity (not performance) issue resolved in collaboration with GEANT and surfsara Oxford to multiple locations performance improved before root cause was found In addition, several cases that were shown to be not related to network HEPiX Fall 2017 17

Plans perfsonar deployment targeting 4.1 release Upgrade campaign planned All sites will be asked to re-install - should be an easy process since data is already archived centrally We ll take the opportunity to introduce new deployment models and requirements in order to help sites with the planning New documentation in the works Will include details on new deployment models, updated requirements, firewall setup, etc. OSG Network Measurement Platform Commission RabbitMQ transport and new storages Move alerting/notifications into production Analytics Improve Grafana dashboards for better time-series visualization Adding more information to stream/messaging API Traffic data from GEANT, Internet2, etc., if possible? Integrating new plugins and additional attributes that became available (TCP Cwnd) HEPiX Spring 2017 18

Summary We have a working infrastructure in place to monitor and measure our networks perfsonar provides lots of capabilities to understand and debug our networks Work on new applications is underway Notifications/alerting Predictive capabilities Current utilization and capacity planning Evaluating network performance of commercial clouds We (OSG and WLCG) welcome feedback on how to further improve Questions or Comments? HEPiX Fall 2017 19

References Network Documentation https://www.opensciencegrid.org/bin/view/documentation/networ kinginosg (old) https://opensciencegrid.github.io/networking/ (NEW in progress) Measurement Archive (MA) guide http://software.es.net/esmond/perfsonar_client_rest.html Modular Dashboard and OMD Prototypes http://maddash.aglt2.org/maddash-webui https://maddash.aglt2.org/wlcgperfsonar/check_mk OSG Production MaDDash http://psmad.grid.iu.edu/maddash-webui/ Datastore http://psds.grid.iu.edu/esmond/perfsonar/archive/ Mesh-config in OSG https://meshconfig.grid.iu.edu/auth/ (NEW) Monitoring https://psetf.grid.iu.edu/etf/check_mk/ (NEW) Grafana dashboards http://monit-grafana-open.cern.ch/?orgid=16 (NEW) HEPiX Fall 2017 20

Bacjup Backup slides HEPiX Fall 2017 21

Latency and packet loss matters Source Campus Performance is poor when RTT exceeds ~10 ms R&E Backbone Performance is good when RTT is < ~10 ms Destination Campus S D 0.0046% loss (1 out of 22k packets) on 10G link with 1ms RTT: 7.3 Gbps with 51ms RTT: 122Mbps with 88ms RTT: 60 Mbps (factor 80) Regional Regional Switch with small buffers HEPiX Fall 2017

Packet ordering and jitter Source Campus S R&E Backbone Network introducing delays and out of order packets Destination Campus D At 70ms RTT on 10G link, 60 seconds test with 1% re-ordering, 0.2 ms jitter: 8.45 Gbps with 1% re-ordering, 1ms jitter: 1.1 Gbps Regional Regional HEPiX Fall 2017

Throughput predictions Throughput measurements are expensive so done at low frequency. Delays and packet loss rate are cheap. Idea is to use delays and packet loss rate to predict maximum possible throughput. Mathis formula is used to model impact of packet loss and latency on throughput Rate < (MSS/RTT)*(1 / sqrt(p)) MSS segment size RTT round trip time p packet loss Packet (re)ordering and jitter to be added as well HEPiX Fall 2017 24

ATLAS Network Analytics Diagram shows the flow End-to-end+perfSONAR data both available to jointly analyze Kibana can be used to get customized views http://cl-analytics.mwt2.org:5601 More details at: http://tinyurl.com/gt92zwb HEPiX Fall 2017 25

Playing with SDN in ATLAS A group of people in the US from AGLT2, MWT2, SWT2 and NET2 are planning to explore SDN in ATLAS Working with the LHCONE point-to-point effort as well The plan is to deploy Open vswitch on ATLAS production systems at these sites (http://openvswitch.org/ ) IP addresses will be move to virtual interfaces No other changes; verify no performance impact Traffic can be shaped accurately with little CPU cost The advantage is the our data sources/sinks become visible and controllable by OpenFlow controllers like OpenDaylight Follow tests can be initiated to provide experience with controlling networks in the context of ATLAS operations. For more details talk to Rob Gardner or Shawn McKee HEPiX Fall 2017 26

Quick View of Latency Mesh HEPiX Fall 2017 27

Future Directions The WLCG efforts at CERN are being reorganized and this is an opportunity to chart future directions for the working group We have a number of areas (projects; see next slide) we are considering and we need to understand where these efforts should be housed (Stay in WG, move to GDB, to LHCONE) There is a review of the working group scheduled for April 28 th during the WLCG Operations Coordination meeting. HEPiX Fall 2017 28

Possible Future Project Areas Title: LHCONE Traffic engineering Areas: LHCONE, routing, debugging, network orchestration Title: LHCONE L3VPN Looking Glass Areas: LHCONE, monitoring, debugging Title: Integration of network and transfer metrics to optimize experiments workflows Areas: FAX/Phedex, Rucio, perfsonar, DIRAC Title: Advanced notifications/alerting for network incidents Areas: WAN, Advanced Notifications/Alerting, perfsonar, Hadoop/Spark Title: Network performance of the commercial clouds Areas: Clouds, WAN connectivity, WAN performance (perfsonar), establishing and testing network equipment at the cloud provider (VPN) Title: Software Defined Network Production Testbed Areas: WAN, SDN, LHCONE/LHCOPN, Storage/Data nodes HEPiX Fall 2017 29