perfsonar ESCC Indianapolis IN

perfSONAR. ESCC, Indianapolis, IN, July 21, 2009. Supporting Advanced Scientific Computing Research, Basic Energy Sciences, Biological and Environmental Research, Fusion Energy Sciences, High Energy Physics, Nuclear Physics. Joe Metzger, Brian Tierney, ESnet/LBNL

Measurement Recommendations White Paper. The paper has been posted at http://fasterdata.es.net/ps_howto.html. This is a draft; I am expecting additional community input and will continually refine this document.

Measurement Recommendations. Deploy perfSONAR tools at the site border (1 bandwidth system, 1 latency system, and several other services such as utilization and NDT) and near significant network resources. Use them to: find and fix current local problems; identify when they re-occur; set user expectations by quantifying your network services.

Typical Campus Deployment

Soft Network Failures. Soft failures are where basic connectivity functions, but high performance is not possible. TCP was intentionally designed to hide all transmission errors from the user: "As long as the TCPs continue to function properly and the internet system does not become completely partitioned, no transmission errors will affect the users." (From IEN 129, RFC 716.) Some soft failures only affect high-bandwidth, long-RTT flows. Hard failures are easy to detect and fix; soft failures can lie hidden for years.
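Why soft failures hit high-bandwidth, long-RTT flows hardest can be seen from the Mathis et al. TCP throughput bound, rate <= MSS / (RTT * sqrt(p)): the same small loss rate that is invisible on a campus path cripples a cross-country path. A minimal sketch (the RTT and loss values below are illustrative, not from the slides):

```python
import math

def mathis_throughput_mbps(mss_bytes, rtt_s, loss_rate):
    """Upper bound on sustained TCP throughput (Mathis et al. model):
    rate <= MSS / (RTT * sqrt(p)), returned in Mbit/s."""
    return (mss_bytes * 8) / (rtt_s * math.sqrt(loss_rate)) / 1e6

# One packet in 10,000 lost: negligible on a 1 ms LAN path,
# crippling on an 80 ms cross-country path.
lan = mathis_throughput_mbps(1460, 0.001, 1e-4)
wan = mathis_throughput_mbps(1460, 0.080, 1e-4)
print(f"LAN bound: {lan:.0f} Mbit/s, WAN bound: {wan:.1f} Mbit/s")
```

Under this model the same fault is an 80x worse ceiling at 80 ms RTT than at 1 ms, which is why local tests look clean while wide-area transfers crawl.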

Deploying a perfSONAR measurement host in under 30 minutes. Using the pS Performance Toolkit is very simple. Boot from CD, then use the command-line tool to configure: which disk partition to use for persistent data; network address and DNS; user and root passwords. Then use the web GUI to configure: which services to run; remote measurement points for bandwidth and latency tests; Cacti collection of SNMP data from key router interfaces.

pS Performance Toolkit Components. [Diagram: hLS, Apache httpd, config tools, perfAdmin visualization tools, and support applications sit above the perfSONAR web services (pSB MA, PingER MA, SNMP MA) and the low-level measurement tools: perfSONAR-BUOY (bwctl, nuttcp, iperf), owamp, PingER (ping), NDT, NPAD, and Cacti.]

Backup Slides The following slides were from the measurement BOF at Joint Techs.

High Performance Networking. Most of the R&E community has access to 10 Gbps networks. Naive users with the right tools should be able to easily get 200 Mbps per stream between properly maintained systems, and 2 Gbps aggregate rates between significant computing resources. Most users are not experiencing this level of performance: "There is widespread frustration involved in the transfer of large data sets between facilities, or from the data's facility of origin back to the researcher's home institution." (From the BES network requirements workshop: http://www.es.net/pub/esnet-doc/bes-net-req-workshop-2007-Final-Report.pdf) We can increase performance by measuring the network and reporting problems!
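Those target rates translate directly into transfer times, which is where user frustration shows up. A quick sketch (the 10 TB dataset size is illustrative, not from the slides):

```python
def transfer_time_hours(size_bytes, rate_mbps):
    """Hours needed to move size_bytes at a sustained rate in Mbit/s."""
    return size_bytes * 8 / (rate_mbps * 1e6) / 3600

size = 10e12  # a hypothetical 10 TB dataset
# 10 Mbit/s: a typical broken path; 200 and 2000 Mbit/s: the goals above.
for rate in (10, 200, 2000):
    print(f"{rate:>5} Mbit/s -> {transfer_time_hours(size, rate):8.1f} hours")
```

The difference between a soft-failure-limited path and the stated goals is the difference between months and hours for a large dataset.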

Network Troubleshooting is a Multi-Domain Problem. [Diagram: a path from the source campus (S) through a regional network, an education backbone, and the JET network to the destination campus (D).]

Where are common problems? [Diagram: on the same source-campus-to-destination-campus path, bad links or interfaces between domains, and latency-dependent problems inside domains with small RTT.]

Common Soft Failures. Small-queue tail drop: switches not able to handle the long packet trains prevalent in long-RTT sessions and cross traffic at the same time. Unintentional rate limiting: process switching on Cisco 6500 devices due to faults, ACLs, or misconfiguration; security devices. Random packet loss: bad fibers or connectors; low light levels due to amps/interfaces failing; duplex mismatch.

Local testing will not find some problems. [Diagram: on the same multi-domain path, performance is good when RTT is < 20 ms but poor when RTT is > 20 ms, because of a switch with small buffers.]
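The RTT threshold in the diagram is what a bandwidth-delay-product calculation predicts: a queue must absorb on the order of bandwidth x RTT to keep a single TCP flow at line rate. A sketch, assuming a 1 Gbit/s flow and a hypothetical 512 KB per-port switch buffer:

```python
def bdp_bytes(rate_gbps, rtt_ms):
    """Bandwidth-delay product: bytes a sender keeps in flight to fill the path."""
    return int(rate_gbps * 1e9 / 8 * rtt_ms / 1e3)

switch_buffer = 512 * 1024  # assumed small per-port buffer, in bytes
for rtt in (1, 5, 20, 80):  # ms
    need = bdp_bytes(1, rtt)  # demand of a single 1 Gbit/s flow
    verdict = "fits" if need <= switch_buffer else "overflows"
    print(f"RTT {rtt:>3} ms: BDP {need // 1024:>5} KB -> {verdict}")
```

A local test at 1 ms RTT needs ~122 KB and sails through; the same flow at 20 ms RTT needs ~2.4 MB, so the small-buffer switch drops bursts and only the wide-area user sees the failure.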

Addressing the Problem: perfSONAR. Developing an open, web-services-based framework for collecting, managing, and sharing network measurements. Deploying the framework across the science community. Encouraging people to deploy known-good measurement points near domain boundaries.

What is perfSONAR? A collaboration for developing, deploying, and utilizing network measurement tools; an architecture and protocols; a collection of software.

perfSONAR Terminology. perfSONAR Collaboration: collection of groups working on perfSONAR tools. perfSONAR Schemas: OGF standards for perfSONAR communications. perfSONAR Bundle: collection of tools into a release. perfSONAR MDM: a measurement service coordinated by DANTE. perfSONAR-PS: Perl-based tools. pS Performance Toolkit: bootable CD packaging of several tools. perfSONAR Bandwidth Services: active bandwidth probe control (bwctl). perfSONAR Latency Services: active latency probe control (owamp/PingER). perfSONAR Measurement Archives: store and publish results/data. perfSONAR Analysis Tools: data visualization tools. perfSONAR Troubleshooting Services: NDT and NPAD. perfSONAR = all of the above.

perfSONAR Developers: ESnet, GEANT, Internet2, RNP, University of Delaware, FERMI, Georgia Tech, SLAC, ARIES, BELNET, CARNet, CESNET, DANTE, DFN, FCCN, Consortium GARR, GRNET, IST, Poznań Supercomputing Center, RedIRIS, Renater, SURFnet, SWITCH, UNINETT

perfSONAR Deployments. Internet2: University of Michigan, Ann Arbor; Indiana University; Boston University; University of Texas Arlington; Oklahoma University, Norman; Michigan Information Technology Center; William & Mary; University of Wisconsin Madison; Southern Methodist University, Dallas; University of Texas Austin; Vanderbilt University. ESnet: Argonne National Lab; Brookhaven National Lab; Fermilab; National Energy Research Scientific Computing Center; Pacific Northwest National Lab. Also: APAN; GLORIAD; JGN2PLUS; KISTI Korea; Monash University, Melbourne; NCHC, HsinChu, Taiwan; Simon Fraser University (Surrey, Burnaby, and Vancouver campuses).

perfSONAR Deployments (2). GEANT; GARR; HUNGARNET; PIONIER; SWITCH; CCIN2P3; CERN; CNAF; DE-KIT; NIKHEF/SARA; PIC; RAL; TRIUMF; ASCC. Note: these are just the deployments I know about; there are probably more.

perfSONAR JET Deployment. The Joint Engineering Team is developing a perfSONAR deployment plan: 1. Review the network measurement data each network is willing to share, or would like to access. 2. Review the perfSONAR tools and monitoring functions to evaluate which networks will deploy which ones. First deployments will be in the networks with open science missions and at exchange points.

perfSONAR Architecture. Interoperable network measurement middleware (SOA): modular; web-services-based; decentralized; locally controlled. Integrates: network measurement tools and data archives; data manipulation; information services (discovery, topology, authentication and authorization). Based on the Open Grid Forum (OGF) Network Measurement Working Group (NM-WG) schema. We are currently attempting to formalize the specification of perfSONAR protocols in a new OGF WG (NMC-WG); the network topology description is being defined in the Network Markup Language Working Group (NML-WG).

perfSONAR Protocols. Web-services-based protocols for: finding measurement services; exchanging measurement data; scheduling measurements. We are standardizing these protocols in the OGF.

Main perfSONAR Services. Lookup Service: gLS, the global service used to find services; hLS, the home service for registering local perfSONAR metadata. Measurement Archives: SNMP MA, interface data; pSB MA, scheduled bandwidth and latency data. Measurement Points: BWCTL, OWAMP, PingER. Troubleshooting Tools: NDT, NPAD. Topology Service.

Selecting Network Measurements. Router interface data: utilization, errors, discards; border and internal bottleneck links; before and after the security infrastructure. Active bandwidth measurements: identify important paths to measure; do you need to test 10G paths? Latency measurements: identify important paths to measure. LAN/desktop performance: NDT and NPAD.

perfSONAR Software Terminology. There are multiple perfSONAR services (i.e., lookup service, measurement archives, measurement points, authentication, etc.). There are multiple code trains: perfSONAR-PS and perfSONAR MDM. perfSONAR service bundles are integrated, tested releases that may contain services picked from both code trains.

perfSONAR-PS. Primarily written in Perl. Emphasis on ease of deployment and community-driven development and support. Mostly US developers.

perfSONAR MDM. Heavy reliance on Java, with some Perl as well. Emphasis on measurement as a service offering, with security and access restrictions. Mostly European developers.

perfSONAR Bundles. pS Performance Toolkit: based on the perfSONAR-PS code train; a CD-ROM that automates creating a measurement appliance. perfSONAR-PS v3.1: packages of individual perfSONAR services. perfSONAR MDM v3.1: the basis of the LHCOPN perfSONAR MDM service.

Selecting a Bundle or Distribution. Do you need to support NDT and NPAD, or are you looking for a simple measurement appliance? Consider the pS Performance Toolkit. Does your organization have restrictions on OSes and patching for servers supporting external network services? Consider the perfSONAR-PS RPM packages. Is publishing data to restricted groups critical, and are you a member of the eduGAIN federation? Consider the MDM release.

perfSONAR Hardware. Requires dedicated hardware (not virtual servers). Copy somebody else, or try before you buy: ESnet deployed-hardware details are at https://performance.es.net/pmp.html, and a sample host configuration for the pS Performance Toolkit is at http://fasterdata.es.net/ps_howto.html. Find somebody with the class of machine that you're looking for and ask them how it works!

Developing a Measurement Plan. What are you going to measure? Achievable bandwidth: 2-3 regional destinations; 4-8 important collaborators; 4-12 times per day to each destination; 20-second tests within North America, longer to Europe or Asia. Latency: OWAMP to ~10 collaborators over diverse paths; PingER for important collaborators who don't support OWAMP. Interface utilization and errors. What are you going to do with the results? NAGIOS alerts; reports to the user community; a website.
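A measurement plan like the one above amounts to spreading short tests to each destination evenly across the day. A sketch of that scheduling logic (the peer names are hypothetical, and the simple per-destination offset to avoid overlapping tests is one possible choice, not a perfSONAR feature):

```python
def schedule(destinations, tests_per_day, duration_s=20):
    """Spread fixed-length bandwidth tests to each destination evenly
    across a day (86400 s), offsetting destinations by one test
    duration so their tests do not overlap."""
    interval = 86400 // tests_per_day
    slots = []
    for i, dest in enumerate(destinations):
        for n in range(tests_per_day):
            slots.append((n * interval + i * duration_s, dest))
    return sorted(slots)  # (seconds-after-midnight, destination)

# Hypothetical peers, following the "4-12 times per day" guidance.
peers = ["regional-a.example.net", "collab-b.example.org"]
for start, dest in schedule(peers, 4)[:4]:
    print(f"t={start:>6}s  bandwidth test -> {dest}")
```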

Measurement Communities. The pS Performance Toolkit lets you specify which measurement community your measurement host is meant to serve. Sample communities: LHC, DOE-SC-LAB, Internet2, ESnet, Climate, etc. This makes it easier to locate other measurement hosts of interest.

Example: US ATLAS Tier 1 to Tier 2 Center. Data transfer problem: couldn't exceed 1 Gbps across a 10GE end-to-end path that included 5 administrative domains. Used perfSONAR tools to localize the problem and identified the problem device: an unrelated domain had leaked a full routing table to the router for a short time, causing FIB corruption. The routing problem was fixed, but the router started process switching some flows after that. Fix: rebooting the device cleared the symptoms, and better BGP filters were configured to prevent reoccurrence (of one cause of this particular class of soft faults).

Example: NERSC & OLCF. Users were having problems moving data between supercomputer centers; one user was waiting more than an entire workday for a 33 GB input file. perfSONAR measurement tools were installed and regularly scheduled measurements were started. Numerous choke points were identified and corrected, and dedicated wide-area transfer nodes were set up and tuned for wide-area transfers. Now moving 40 TB in less than 3 days.
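The before and after figures are worth sanity-checking: 33 GB in a workday versus 40 TB in 3 days is roughly a hundredfold throughput improvement. Quick arithmetic (assuming an 8-hour workday for the "before" case):

```python
def mbps(size_bytes, seconds):
    """Average transfer rate in Mbit/s."""
    return size_bytes * 8 / seconds / 1e6

before = mbps(33e9, 8 * 3600)       # 33 GB over a full 8-hour workday
after = mbps(40e12, 3 * 24 * 3600)  # 40 TB in 3 days
print(f"before: ~{before:.0f} Mbit/s, after: ~{after:.0f} Mbit/s, "
      f"speedup: ~{after / before:.0f}x")
```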

How to Participate. Deploy perfSONAR, and use it to find and correct the hidden performance problems in your networks.

Firewalls. If your server is behind a firewall, you need to open the following ports. Open to global perfSONAR servers: Lookup Service, tcp/8095. Open to perfSONAR users: SNMP MA, tcp/8065; PingER, tcp/8075; perfSONAR-BUOY, tcp/8085. bwctl: open tcp/4823; in /usr/local/etc/bwctld.conf, set peer_port to a value and open that tcp port, and set iperf_port, thrulay_port, and nuttcp_port to specific ranges and open the tcp/udp ports for those ranges. owamp: open tcp/861; in /usr/local/etc/owampd.conf, set testports to a range and open the udp ports for that range. NDT: open tcp/3001, tcp/3002, tcp/3003, and tcp/7123. NPAD: open tcp/8100 and tcp/8200. Open for local management: Apache HTTP Server, tcp/80 and tcp/443; SSH, tcp/22.
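The fixed single ports above can be kept in one table and rendered into firewall rules mechanically, which avoids transcription errors. A sketch emitting standard iptables ACCEPT rules (the configurable bwctl/owamp test-port ranges are site-specific and deliberately omitted; management ports are also left out):

```python
# (service, protocol, ports) taken from the fixed ports in the slide.
OPEN_PORTS = [
    ("lookup-service", "tcp", [8095]),
    ("snmp-ma", "tcp", [8065]),
    ("pinger", "tcp", [8075]),
    ("perfsonar-buoy", "tcp", [8085]),
    ("bwctl", "tcp", [4823]),
    ("owamp", "tcp", [861]),
    ("ndt", "tcp", [3001, 3002, 3003, 7123]),
    ("npad", "tcp", [8100, 8200]),
]

def iptables_rules(table):
    """Render one iptables ACCEPT rule per service port."""
    return [
        f"iptables -A INPUT -p {proto} --dport {port} -j ACCEPT  # {name}"
        for name, proto, ports in table
        for port in ports
    ]

for rule in iptables_rules(OPEN_PORTS):
    print(rule)
```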

Traceroute Visualizer. Forward-direction bandwidth utilization on the application path from LBNL to INFN-Frascati (Italy). Traffic is shown as bars on those network device interfaces that have an associated MP service (the first 4 graphs are normalized to 2000 Mb/s, the last to 500 Mb/s); link capacity is also provided. Hops: 1 ir1000gw (131.243.2.1); 2 er1kgw; 3 lbl2-ge-lbnl.es.net; 4 slacmr1-sdn-lblmr1.es.net (graph omitted); 5 snv2mr1-slacmr1.es.net (graph omitted); 6 snv2sdn1-snv2mr1.es.net; 7 chislsdn1-oc192-snv2sdn1.es.net (graph omitted); 8 chiccr1-chislsdn1.es.net; 9 aofacr1-chicsdn1.es.net (graph omitted); 10 esnet.rt1.nyc.us.geant2.net (no data); 11 so-7-0-0.rt1.ams.nl.geant2.net (no data); 12 so-6-2-0.rt1.fra.de.geant2.net (no data); 13 so-6-2-0.rt1.gen.ch.geant2.net (no data); 14 so-2-0-0.rt1.mil.it.geant2.net (no data); 15 garr-gw.rt1.mil.it.geant2.net (no data); 16 rt1-mi1-rt-mi2.mi2.garr.net; 17 rt-mi2-rt-rm2.rm2.garr.net (graph omitted); 18 rt-rm2-rc-fra.fra.garr.net (graph omitted); 19 rc-fra-ru-lnf.fra.garr.net (graph omitted); 20; 21 www6.lnf.infn.it (193.206.84.223) 189.908 ms 189.596 ms 189.684 ms.