WLCG Network Throughput WG Shawn McKee, Marian Babik for the Working Group HEPiX Tsukuba 16-20 October 2017
Working Group WLCG Network Throughput WG formed in the fall of 2014 within the scope of WLCG operations with the following objectives: Ensure sites and experiments can better understand and fix networking issues Measure end-to-end network performance and use the measurements to single out on complex data transfer issues Improve overall transfer efficiency and help us determine the current status of our networks Core activities include: Deployment and operations of perfsonar infrastructure Gain visibility into how our networks operate and perform Network analytics Improve our ability to fully utilize the existing network capacity Network performance incidents response team Provide support to help debug and fix network performance issues https://twiki.cern.ch/twiki/bin/view/lcg/networktransfermetrics HEPiX Fall 2017 2
News perfsonar 4.0.1 released on Aug 16 Many pscheduler fixes and improvements New ports/firewall openings needed MA issues seen after deployment perfsonar 4.1 plan getting in shape Bringing new features, but also dropping SL6 support OSG Network Measurement Service Configuration and monitoring now in production Documentation changes New Grafana dashboards New analytical capabilities with ELK+jupyter HEPiX Fall 2017 3
Current perfsonar Deployment https://www.google.com/fusiontables/datasource?docid=1qt4r17heufkvnqhju24niptz66xauyeibwwh5kpa#map:id=3 Initial deployment coordinated by WLCG perfsonar TF Commissioning of the network followed by WLCG Network and Transfer Metrics WG HEPiX Fall 2017 4
LHCONE Mesh 4 April 2017 HEPiX Fall 2017 5
LHCONE Mesh 10 Oct 2017 HEPiX Fall 2017 6
LHCOPN Mesh 4 April 2017 HEPiX Fall 2017 7
LHCOPN Mesh 10 Oct 2017 Bandwidth testing suffers from an issue related to 4.0.1 upgrade, which is being investigated in collaboration with the developers HEPiX Fall 2017 8
perfsonar 4.0.1/4.0.2 Minor feature and bugfix release Full support for CentOS7, Ubuntu 16 and Debian 9 pscheduler Archiver will start working in streaming mode Ability to use custom port (default is 443) New port opening requirement pscheduler uses https/443 as controller channel, which means it needs to be accessible from any site Dynamic list of WLCG perfsonar nodes is available in case opening needs to be restricted New SNMP plugins Collect SNMP data (router utilization) and store them along with the other measurements HEPiX Fall 2017 9
Important Dates October 17, 2017 perfsonar 3.5 end-of-life Web100 kernel end-of-life November 2017 perfsonar 4.0.2 final release Q1 2018 perfsonar 4.1 release; CentOS6 packages will not be released Q3 2018 (6 months after 4.1) perfsonar 4.0 end-of-life CentOS6 no longer supported HEPiX Fall 2017 10
perfsonar 4.1 Plans pscheduler Resource management port pools Pre-emptive scheduling support improving client response time New plugins Network traffic capture Application-level (e.g. http response time) Retirement of bwctl TWAMP support (two-way active measuremnt) ping alternative of owamp routers/switches can participate in the tests Docker support HEPiX Fall 2017 11
Network Measurement Platform The diagram on the right provides a high-level view of how WLCG/OSG is managing perfsonar deployments, gathering metrics and making them available for use. HEPiX Fall 2017 12
Grafana/LHCOPN LHCOPN overview and detailed view (per site) Shows utilization of the two CERN routers Exact same information as shown by netstat.cern.ch HEPiX Fall 2017 13
Grafana/E2E perfsonar Combined view on throughput, latency, traceroutes source to 3 different destinations can be shown both direct and reverse Shows exact same data as you would find in psmad or toolkit web pages HEPiX Fall 2017 14
Grafana/pS Inter-regional Provides an overview of how the site performs in different regions Uses domain name wildcards to match regions Throughput and re-transmissions are done in the same way Granularity of testing is determined by meshes, which are experimentbased HEPiX Fall 2017 15
Grafana/Network Performance Experimental dashboard showing side-by-side comparison of perfsonar data, LHCOPN traffic and FTS transfers Also available with ESNet traffic data HEPiX Fall 2017 16
Improving Transfer Efficiency One of the main goals of the WG to identify and help fix network issues in our infrastructure Reminder: Dedicated support unit exists to help sites and experiments with network-relates issues Analysis of the issues has significantly improved in terms of both speed and accuracy, but only if end-sites have perfsonar Recent cases: T0 to JINR - Resolved by JINR by putting in place new Moscow - Dubna link and fixing routing asymmetries CBPF to CNAF, PIC and IN2P3 MTU stepdown issue, resolved in collaboration with RNP NCP/Pakistan performance IPv4 QoS suspected, IPv6 works fine SARA/IC IPv6 connectivity (not performance) issue resolved in collaboration with GEANT and surfsara Oxford to multiple locations performance improved before root cause was found In addition, several cases that were shown to be not related to network HEPiX Fall 2017 17
Plans perfsonar deployment targeting 4.1 release Upgrade campaign planned All sites will be asked to re-install - should be an easy process since data is already archived centrally We ll take the opportunity to introduce new deployment models and requirements in order to help sites with the planning New documentation in the works Will include details on new deployment models, updated requirements, firewall setup, etc. OSG Network Measurement Platform Commission RabbitMQ transport and new storages Move alerting/notifications into production Analytics Improve Grafana dashboards for better time-series visualization Adding more information to stream/messaging API Traffic data from GEANT, Internet2, etc., if possible? Integrating new plugins and additional attributes that became available (TCP Cwnd) HEPiX Spring 2017 18
Summary We have a working infrastructure in place to monitor and measure our networks perfsonar provides lots of capabilities to understand and debug our networks Work on new applications is underway Notifications/alerting Predictive capabilities Current utilization and capacity planning Evaluating network performance of commercial clouds We (OSG and WLCG) welcome feedback on how to further improve Questions or Comments? HEPiX Fall 2017 19
References Network Documentation https://www.opensciencegrid.org/bin/view/documentation/networ kinginosg (old) https://opensciencegrid.github.io/networking/ (NEW in progress) Measurement Archive (MA) guide http://software.es.net/esmond/perfsonar_client_rest.html Modular Dashboard and OMD Prototypes http://maddash.aglt2.org/maddash-webui https://maddash.aglt2.org/wlcgperfsonar/check_mk OSG Production MaDDash http://psmad.grid.iu.edu/maddash-webui/ Datastore http://psds.grid.iu.edu/esmond/perfsonar/archive/ Mesh-config in OSG https://meshconfig.grid.iu.edu/auth/ (NEW) Monitoring https://psetf.grid.iu.edu/etf/check_mk/ (NEW) Grafana dashboards http://monit-grafana-open.cern.ch/?orgid=16 (NEW) HEPiX Fall 2017 20
Bacjup Backup slides HEPiX Fall 2017 21
Latency and packet loss matters Source Campus Performance is poor when RTT exceeds ~10 ms R&E Backbone Performance is good when RTT is < ~10 ms Destination Campus S D 0.0046% loss (1 out of 22k packets) on 10G link with 1ms RTT: 7.3 Gbps with 51ms RTT: 122Mbps with 88ms RTT: 60 Mbps (factor 80) Regional Regional Switch with small buffers HEPiX Fall 2017
Packet ordering and jitter Source Campus S R&E Backbone Network introducing delays and out of order packets Destination Campus D At 70ms RTT on 10G link, 60 seconds test with 1% re-ordering, 0.2 ms jitter: 8.45 Gbps with 1% re-ordering, 1ms jitter: 1.1 Gbps Regional Regional HEPiX Fall 2017
Throughput predictions Throughput measurements are expensive so done at low frequency. Delays and packet loss rate are cheap. Idea is to use delays and packet loss rate to predict maximum possible throughput. Mathis formula is used to model impact of packet loss and latency on throughput Rate < (MSS/RTT)*(1 / sqrt(p)) MSS segment size RTT round trip time p packet loss Packet (re)ordering and jitter to be added as well HEPiX Fall 2017 24
ATLAS Network Analytics Diagram shows the flow End-to-end+perfSONAR data both available to jointly analyze Kibana can be used to get customized views http://cl-analytics.mwt2.org:5601 More details at: http://tinyurl.com/gt92zwb HEPiX Fall 2017 25
Playing with SDN in ATLAS A group of people in the US from AGLT2, MWT2, SWT2 and NET2 are planning to explore SDN in ATLAS Working with the LHCONE point-to-point effort as well The plan is to deploy Open vswitch on ATLAS production systems at these sites (http://openvswitch.org/ ) IP addresses will be move to virtual interfaces No other changes; verify no performance impact Traffic can be shaped accurately with little CPU cost The advantage is the our data sources/sinks become visible and controllable by OpenFlow controllers like OpenDaylight Follow tests can be initiated to provide experience with controlling networks in the context of ATLAS operations. For more details talk to Rob Gardner or Shawn McKee HEPiX Fall 2017 26
Quick View of Latency Mesh HEPiX Fall 2017 27
Future Directions The WLCG efforts at CERN are being reorganized and this is an opportunity to chart future directions for the working group We have a number of areas (projects; see next slide) we are considering and we need to understand where these efforts should be housed (Stay in WG, move to GDB, to LHCONE) There is a review of the working group scheduled for April 28 th during the WLCG Operations Coordination meeting. HEPiX Fall 2017 28
Possible Future Project Areas Title: LHCONE Traffic engineering Areas: LHCONE, routing, debugging, network orchestration Title: LHCONE L3VPN Looking Glass Areas: LHCONE, monitoring, debugging Title: Integration of network and transfer metrics to optimize experiments workflows Areas: FAX/Phedex, Rucio, perfsonar, DIRAC Title: Advanced notifications/alerting for network incidents Areas: WAN, Advanced Notifications/Alerting, perfsonar, Hadoop/Spark Title: Network performance of the commercial clouds Areas: Clouds, WAN connectivity, WAN performance (perfsonar), establishing and testing network equipment at the cloud provider (VPN) Title: Software Defined Network Production Testbed Areas: WAN, SDN, LHCONE/LHCOPN, Storage/Data nodes HEPiX Fall 2017 29