Monitoring and Control of Large Scale Distributed Systems

Similar documents
An Agent Based, Dynamic Service System to Monitor, Control and Optimize Distributed Systems. June 2007

Internet2 Meeting September 2005

Grid monitoring with MonAlisa

MONitoring Agents using a Large Integrated Services Architecture. Iosif Legrand California Institute of Technology

Monitoring the ALICE Grid with MonALISA

Iosif Legrand. California Institute of Technology

MONALISA: A MONITORING FRAMEWORK FOR LARGE SCALE COMPUTING SYSTEMS

Network Design Considerations for Grid Computing

Data transfer over the wide area network with a large round trip time

Data Transfers Between LHC Grid Sites Dorian Kcira

Multi-threaded, discrete event simulation of distributed computing systems

Cisco Configuration Engine 2.0

MASV Accelerator Technology Overview

High bandwidth, Long distance. Where is my throughput? Robin Tasker CCLRC, Daresbury Laboratory, UK

ANSE: Advanced Network Services for [LHC] Experiments

CHAPTER 3 GRID MONITORING AND RESOURCE SELECTION

DYNES: DYnamic NEtwork System

Scientific data processing at global scale The LHC Computing Grid. fabio hernandez

CS 5520/ECE 5590NA: Network Architecture I Spring Lecture 13: UDP and TCP

Scaling Internet TV Content Delivery ALEX GUTARIN DIRECTOR OF ENGINEERING, NETFLIX

FUJITSU Software Interstage Information Integrator V11

Functional Requirements for Grid Oriented Optical Networks

Networking interview questions

Chapter 4:- Introduction to Grid and its Evolution. Prepared By:- NITIN PANDYA Assistant Professor SVBIT.

Towards Network Awareness in LHC Computing

Networking for Data Acquisition Systems. Fabrice Le Goff - 14/02/ ISOTDAQ

Configuring the Oracle Network Environment. Copyright 2009, Oracle. All rights reserved.

NetAlly. Application Advisor. Distributed Sites and Applications. Monitor and troubleshoot end user application experience.

Operating Systems. 16. Networking. Paul Krzyzanowski. Rutgers University. Spring /6/ Paul Krzyzanowski

BMC Remedyforce Discovery and Client Management. Frequently asked questions

Overview. About CERN 2 / 11

High Throughput WAN Data Transfer with Hadoop-based Storage

Port Usage Information for the IM and Presence Service

Presentation_ID. 2002, Cisco Systems, Inc. All rights reserved.

Chapter 3. Design of Grid Scheduler. 3.1 Introduction

Compact Muon Solenoid: Cyberinfrastructure Solutions. Ken Bloom UNL Cyberinfrastructure Workshop -- August 15, 2005

Network Design. Overview. CDS with Vaults and Streamers CHAPTER

Cisco SRW Port Gigabit Switch: WebView Cisco Small Business Managed Switches

Network Management & Monitoring

Enterprise print management in VMware Horizon

Port Usage Information for the IM and Presence Service

TCP and BBR. Geoff Huston APNIC. #apricot

Constant monitoring of multi-site network connectivity at the Tokyo Tier2 center

SaaS Providers. ThousandEyes for. Summary

Cisco Wide Area Application Services: Secure, Scalable, and Simple Central Management

Veeam Cloud Connect. Version 8.0. Administrator Guide

Fundamental Questions to Answer About Computer Networking, Jan 2009 Prof. Ying-Dar Lin,

Deploy. A step-by-step guide to successfully deploying your new app with the FileMaker Platform

Chapter 13 TRANSPORT. Mobile Computing Winter 2005 / Overview. TCP Overview. TCP slow-start. Motivation Simple analysis Various TCP mechanisms

TCP and BBR. Geoff Huston APNIC

UW-ATLAS Experiences with Condor

High Performance Ethernet for Grid & Cluster Applications. Adam Filby Systems Engineer, EMEA

Cisco Virtual Networking Solution for OpenStack

WHITE PAPER: BEST PRACTICES. Sizing and Scalability Recommendations for Symantec Endpoint Protection. Symantec Enterprise Security Solutions Group

Features. HDX WAN optimization. QoS

System Description. System Architecture. System Architecture, page 1 Deployment Environment, page 4

TCP and BBR. Geoff Huston APNIC

BlackBerry Enterprise Server for IBM Lotus Domino Version: 5.0. Administration Guide

CS 344/444 Computer Network Fundamentals Final Exam Solutions Spring 2007

PROOF-Condor integration for ATLAS

Adaptive Cluster Computing using JavaSpaces

Digital Communication Networks

SOLO NETWORK (11) (21) (31) (41) (48) (51) (61)

Database Services at CERN with Oracle 10g RAC and ASM on Commodity HW

30 Nov Dec Advanced School in High Performance and GRID Computing Concepts and Applications, ICTP, Trieste, Italy

Hardware Requirements

ThousandEyes for. Application Delivery White Paper

TCP Tuning Domenico Vicinanza DANTE, Cambridge, UK

Advanced Computer Networks. Flow Control

Programmable Information Highway (with no Traffic Jams)

Cisco SRW208 8-Port 10/100 Ethernet Switch: WebView Cisco Small Business Managed Switches

Symbols INDEX > 12-14

Cisco Small Business Managed Switches

Service Mesh and Microservices Networking

CSE 4215/5431: Mobile Communications Winter Suprakash Datta

Optimize and Accelerate Your Mission- Critical Applications across the WAN

CMSC 611: Advanced. Interconnection Networks

CTS2134 Introduction to Networking. Module 09: Network Management

SOLO NETWORK (11) (21) (31) (41) (48) (51) (61)

ASPERA HIGH-SPEED TRANSFER. Moving the world s data at maximum speed

Overcoming Obstacles to Petabyte Archives

Europeana Core Service Platform

CERN openlab Summer 2006: Networking Overview

Status of KISTI Tier2 Center for ALICE

1. What is a Computer Network? interconnected collection of autonomous computers connected by a communication technology

Cluster Setup and Distributed File System

Ultra high-speed transmission technology for wide area data movement

System Requirements. Things to Consider Before You Install Foglight NMS. Host Server Hardware and Software System Requirements

How Teridion Works. A Teridion Technical Paper. A Technical Overview of the Teridion Virtual Network. Teridion Engineering May 2017

Small Enterprise Design Profile(SEDP) WAN Design

Cisco SGE Port Gigabit Switch Cisco Small Business Managed Switches

Overview of the Cisco Service Control Value Added Services Feature

Computer Networks. More on Standards & Protocols Quality of Service. Week 10. College of Information Science and Engineering Ritsumeikan University

School of Engineering Department of Computer and Communication Engineering Semester: Fall Course: CENG415 Communication Networks

Chapter 5.6 Network and Multiplayer

ECSE-6600: Internet Protocols Spring 2007, Exam 1 SOLUTIONS

Outline 9.2. TCP for 2.5G/3G wireless

Next Steps Spring 2011 Lecture #18. Multi-hop Networks. Network Reliability. Have: digital point-to-point. Want: many interconnected points

Advanced Computer Networks

Mobile Communications Chapter 9: Mobile Transport Layer

Transcription:

Monitoring and Control of Large Scale Distributed Systems Enrico Fermi International School of Physics Grid and Cloud Computing 26 July 2014 Iosif Legrand Caltech / CERN 1

The MONARC - LHC Computing Model Experiment Tier 1 ~PByte/sec Online System 10 40 to 100 Gbps ~500-1500 MBytes/sec CERN Center PBs of Disk; Tape Robot Tier 0 +1 Tier1 Tier1 Tier1 Tier1 Tier 3 Physics data cache 10 to N X 10 Gbps Workstations Institute Institute Tier 2 Institute Institute 1 to 10 Gbps Tier 4 Tier2 Center Tier2 Center Tier2 Center Tier2 Center Tier2 Center 11 Tier1 and 160 Tier2 and Tier3 Centers Tens of Petabytes by year for each experiment Iosif Legrand July 2014 2

Moving toward a Cloud Model Tier 0 Tier 0 Tier 1 Tier 1 Tier 1 Tier 1 Tier 2 Tier 2 Tier 2 Tier 2 Tier 2 Tier 2 Tier 2 Tier 2 Tier 2 Tier 2 Tier 2 Tier 2 A dynamic scheme to access data requires significantly more knowledge about the data distribution state of the servers and the network infrastructure. The information should be available to large number of jobs in near real-time 3 Iosif Legrand June 2013 Iosif Legrand July 2014 3

Monitoring Information is necessary for System Design, Control, Optimization, Debugging and Accounting Computing Models Modeling & Simulations ACCOUNTING Optimization Algorithms MONITORING ~ REAL TIME Information Create resilient Distributed Systems Control and Operational support ALARMS DEBUGGING Iosif Legrand July 2014 4

Monitoring Distributed Systems An essential part of managing large scale, distributed data processing facilities, is a monitoring system that is able to monitor computing facilities, storage systems, networks and a very large number of applications running on these systems in near-real time. The monitoring information gathered for all the subsystems is essential for design, modelling, debugging, accounting and the development of higher level services, that provide decision support and some degree of automated decisions and for maintaining and optimizing workflow in large scale distributed systems. Iosif Legrand July 2014 5

Network Layers OSI and TCP/IP Layered Models Iosif Legrand July 2014 6 6

Transport Layer Provide logical communication between application processes running on different hosts The transport layer is responsible for process-to-process delivery Datagram messaging service (UDP) It does not add anything to the services of IP except to provide process-toprocess communication instead of host-to-host communication. Reliable, in-order delivery (TCP) Connection set-up Discarding of corrupted packets Retransmission of lost packets Flow control Congestion control Iosif Legrand July 2014 7

SNMP - Simple Network Management Protocol Is a UDP-based network protocol With SNMP you can monitor hardware devices and software applications. It is also possible to control and configure these devices. Common devices managed by SNMP include Computer hosts Routers Switches IP telephones Printers Software application and services ( Data Bases,Web Servers, Batch Systems..) Iosif Legrand July 2014 8

Main SNMP components Three key components: Management Information Base (MIB) SNMP Agent Network Management System (NMS) Network Management System GET / WALK Asynchronous communication TRAP SNMP Agents Iosif Legrand July 2014 9

SNMP Agent An SNMP agent on a managed device exposes status information as variables. Various data can be made available about the device using SNMP for example: System name Free memory Processor usage Uptime Traffic on each interface Can be easily extended to report new parameters SNMP agent can also accept requests from clients to perform active administration and to modify managed devices configuration Agent status information and active administration commands are defined in MIB files Iosif Legrand July 2014 10

SNMP MIB - Management information base files The very basic component of the structure used in case of SNMP is an object. Every information that can be queried through SNMP is looked in terms of an object. They organized in a hieratical structure and every object has an associated unique ID (known as OID). A group of objects form a MIB. MIB File that defines variables that can be read or set on the managed device using SNMP Defines information about the managed device that can be polled for using SNMP clients Defines active management settings that can be set or changed to alter the device configuration using SNMP Iosif Legrand July 2014 11 1 directory(1 ) 1 1 system(1) iso(1) 3 2 org(3) 6 dod(6 ) 1 mgmt(2) 1 internet(1 ) mib-2(1) 3 4 experimental(3 ) interfaces(2 ip(4) 2 ) 4 private(4 ) tcp(6) 6

Trap Mechanism Trap subscriptions -Asynchronous notification. -SNMP agents can be programmed to send trap messages when a certain set of circumstances ( or thresholds) arise. Managed devices agent will send notifications to listeners The Agent sends notifications to all subscribers Iosif Legrand July 2014 12

SNMP History SNMP version 1 was published in 1988 RFC 1157 SNMP version 2 added additional functionality RFC 1441 (1993) SNMP v3 added security features RFC 3410-3415 (1999) Wildly accepted and there are many implementations for most network devices, computers, printers, home electronic devices, and large scale software products. Iosif Legrand July 2014 13

A Simple Monitoring System script script Data Base PROCESS script Generate plots periodically But such an architecture Does NOT scale Can not be used in large scale geographically distributed systems Can not be used to control, manage and optimize complex flows Iosif Legrand July 2014 14

Philosophy: Mobile Code to build Distributed Services Service Lookup Proxy Service Any well suited protocol for the application Lookup Proxy Service CLIENT Dynamic Code Loading Inspired by the JAVA-JINI Technology Services can be used dynamically Remote Services Mobile Agents Smart Proxies Proxy == RMI Stub Proxy == Entire Service Proxy adjusts to the client Act as a true dynamic service and provide the necessary functionally to be used by any other services that require such information (Jini, interface to WSDL / SOAP) mechanism to dynamically discover all the Service Units" remote event notification for changes in the any system lease mechanism for each registered un Iosif Legrand July 2014 15

Mobile Code - AGENTS Mobile code supports a flexible form of distributed computation where the desired nonlocal computations need not be known in advance at the execution site ADVATGAES Take decisions at the run time Flexibility Dynamic Deployment Easy to reconfigure Iosif Legrand July 2014 16 16

The MonALISA Architecture HL services Regional or Global High Level Services, Repositories & Clients Network of JINI-Lookup Services 17 Agents Proxies MonALISA services Secure & Public Secure and reliable communication Dynamic load balancing Scalability & Replication AAA for Clients Distributed System for gathering and analyzing information based on mobile agents: Customized aggregation, Triggers, Actions Distributed Dynamic Registration and Discoverybased on a lease mechanism and remote events Fully Distributed System with no Single Point of Failure Iosif Legrand July 2014 17

MonALISA Service & Data Handling WS Clients and service Web Service WSDL SOAP Data Store Data Cache Service & DB Postgres Lookup Service Data (via ML Proxy) Lookup Service Predicates & Agents Applications Configuration Control (SSL) AGENTS Clients or Higher Level Services Collects any type of information FILTERS / TRIGGERS Monitoring Modules Dynamic Loading Push and Pull 18 Iosif Legrand July 2014 18

Multi-Thread Execution Engine Pool of Threads Execution Engine & Control Timeout ERROR peek Time Error Handling procedures Success Priority Queue for Monitoring tasks Monitor IO Re-Schedule operation Iosif Legrand July 2014 19

Registration / Discovery Admin Access and AAA for Clients 20 Application MonALISA Service Trust keystore Applications MonALISA Service Admin SSL connection MonALISA Service Trust keystore 20 Registration (signed certificate) Lookup Service Lookup Service Discovery Services Proxy Multiplexer Services Proxy Multiplexer AAA services Iosif Legrand July 2014 20 Client (other service) Data Filters & Agents Client authentication Client (other service)

MonALISA collects any type of monitoring information in distributed systems The MonALISA package includes: Local host monitoring (CPU, memory, network traffic, processes and sockets in each state, LM sensors, APC UPSs), log files tailing SNMP generic & specific modules Condor, PBS, LSF and SGE (accounting & host monitoring), Ganglia Ping, tracepath, traceroute, pathload and other network-related measurements TL1, Network devices, Ciena, Optical switches Calling external applications/scripts that return as output the values XDR-formatted UDP messages (ApMon user s defined information). New modules can be easily added by implementing a simple Java interface. Filters can be used to generate new aggregate data. The Service can also react to the monitoring data it receives (actions & alarms). MonALISA can run code as distributed agents for global optimization Iosif Legrand July 2014 21

ApMon Application Monitoring UDP based Library of APIs (C, C++, Java, Perl. Python) that can be used to send any information defined by users or applications to MonALISA services Flexibility, dynamic configuration, high communication performance Automated system monitoring Accounting information APPLICATION App. Monitoring Time;IP;procID parameter1: value parameter2: value... ApMon Config Servlet UDP/XDR Monitoring Data dynamic reloading MonALISA Service MonALISA CPU Usage (%) 70 60 50 40 30 20 10 0 No Lost Packages 0 1000 2000 3000 4000 5000 6000 Messages per second 22 APPLICATION App. Monitoring Mbps_out: 0.52 Status: reading MB_inout: ApMon 562.4 System Monitoring load1: 0.24 processes: 97 pages_in: 83 ApMon Config UDP/XDR Monitoring Data UDP/XDR Monitoring Data MonALISA Service ApMon configuration generated automatically by a servlet / CGI script Iosif Legrand July 2014 22

LISA- Localhost Information Service Agent End To End Monitoring Tool A lightweight Java Web Start application that provides complete monitoring of the end user systems, the network connectivity and can use the MonALISA framework to optimize client applications It is very easy to deploy and install by simply using any browser. It detects the system architecture, the operating system and selects dynamically the binary parts necessary on each system. It can be easily deployed on any system. It is now used on all versions of Windows, Linux, Mac. It provides complete system monitoring of the host computer: CPU, memory, IO, disk, Hardware detection Main components, Audio, Video equipment, Drivers installed in the system A user friendly GUI to present all the monitoring information. Iosif Legrand July 2014 23

LISA- End User / Client Agent Authorization Service discovery Local detection for hardware and software Complete monitoring of the system End to end performance measurements System configuration Act as an active listener the events related with the requests generated by its local applications. The End User Agent implements the secure APIs the applications may use to generate requests and to interact with the Network Service System. It will continuously receive time estimations to complete the request. When it is possible, will help to provide correct system & network configuration 24 to the end systems. Iosif Legrand July 2014 24

Monitoring architecture in ALICE AliEn Job Agent AliEn Job Agent AliEn Job Agent ApMon ApMon ApMon AliEn Job Agent ApMon AliEn Job Agent AliEn CE ApMon ApMon AliEn Job Agent ApMon 25 AliEn CE Cluster Monitor ApMon ApMon MonALISA @Site MonALISA LCG Site LCG Tools Cluster Monitor AliEn SE ApMon ApMon AliEn SE ApMon AliEn TQ ApMon AliEn IS ApMon MonaLisa Repository MonALISA @CERN Long History DB AliEn Optimizers ApMon API Services AliEn Brokers ApMon Iosif Legrand July 2014 25 ApMon MySQL Servers ApMon CastorGrid Scripts ApMon Alerts Actions

Monitoring Grid sites, Running Jobs, Network Traffic, and Connectivity Running Jobs JOBS TOPOLOGY ACCOUNTING Iosif Legrand July 2014 26

Monitoring the Execution of Jobs and the Time Evolution SPLIT JOBS LIFELINES for JOBS Summit a Job Job1 Job 27 Job DAG Job2 Job3 Job 31 Job 32 Iosif Legrand Iosif Legrand July 2014 June 2013 27

Monitoring in ALICE: jobs, resources, services 28 Iosif Legrand July 2014 28

ROUTERS Monitoring Network Topology (L3), Latency, Routers NETWORKS AS Iosif Legrand August 2009 Real 29 Time Topology Discovery & Display Iosif Legrand July 2014 29

ALICE : Global Views, Status & Jobs http://pcalimonitor.cern.ch Iosif Legrand July 2014 30

Real Time Ping Measurements between ALICE Sites Red lines indicate routing problems between the sites Iosif Legrand July 2014 31

Available Bandwidth Measurements. 32 32 Iosif Legrand July 2014 32

Connectivity map between ALIICE siites Red lines indicate routing problemss between the sites Iosif Legrand July 2014 33

Monitoring in ALICE: Xrootd servers Iosif Legrand July 2014 34

IO Traffic IN on the SE- CERN servers Iosif Legrand July 2014 35

USLHCNet: High-speed trans-atlantic network CERN to US FNAL BNL 6 x 10G links 4 PoPs Geneva Amsterdam Chicago New York The core is based on Ciena CD/CI (Layer 1.5) Virtual Circuits Iosif Legrand July 2014 36

Monitoring USLHCNet Topology Topology & Status & Peering Real Time Topology for L2 Circuits 37 Iosif Legrand Iosif Legrand July 2014 June 2013 37

USLHCnet: Accounting for Integrated Traffic Iosif Legrand July 2014 38

Monitoring Links Availability Very Reliable Information AMS-GVA (SURFnet) AMS-GVA (Geant) AMS-NYC (Level3) AMS-NYC (TSystems) Ref @ CERN) CHI-NYC(Level3) GVA-NYC (TSystems) CHI-NYC(NLR1) CHI-GVA(TSystems) CHI-GVA(Level3) CHI-NYC(NLR2) 99.7% 99.6% 99.7% 97.2% 99.4% 95.4% 99.9% 99.5% 96.9% 99.2% P1 Network LINK 39 P1 100% monitoring availability Iosif Legrand July 2014 39

ALARMS and Automatic notifications for USLHCnet Iosif Legrand July 2014 40

Gbps 7 6 5 4 3 Some correlation of BW use with new Lumi. level, Reprocessing for conferences, etc. Short term peaks at 100% of capacity CERN-FNAL Traffic and CMS Instantaneous Luminosity CERN FNAL TRAFFIC Lumi: ~10 33 Lumi: 2X10 33 2 1 Iosif Legrand July 2014 41

ALICE: RTT measurements from CERN to all sites Network conditions can change rapidly Iosif Legrand July 2014 42

Asymmetric Routing South Africa Africa to Europe Europe to Asia Japan Iosif Legrand July 2014 43

Path monitoring for each pair of sites South Africa Africa to Europe Europe to Asia Japan Iosif Legrand July 2014 44

Data Transfer in WAN

MIT Open Courseware Data Networks: Lecture Notes ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-263jdata-communication-networks-fall-2002/ Iosif Legrand July 2014 46

MIT Open Courseware 6.263J Data Networks: Lecture 25 Slide 40

How to efficiently move data in WAN TCP Performance What influences the TCP performance? Available bandwidth Packet Loss Out of order delivery Round-trip Congestion avoidance algorithm TCP setup and tuning Buffers in Switches and routers BW ~ Segment Size RTT * SQRT (Loss Prob) Long distance connections Parallelism Iosif Legrand July 2014 48

FDT Fast Data Transfer FDT is an application for efficient data transfers using parallel streams Easy to use. Written in java and runs on all major platforms. It is based on an asynchronous, multithreaded system which is using the NIO library and is able to: stream continuously a list of files use independent threads to read /write on each physical device transfer data in parallel on multiple TCP streams, when necessary resume a file transfer session allows to "plug-in" external security APIs (SSL, GSI SSH ) Controlled by the MonALISA SC 2009 Iosif Legrand July 2014 49

FDT features The FDT architecture allows to "plug-in" external security APIs and to use them for client authentication and authorization. Supports several security schemes : IP filtering SSH GSI-SSH Globus-GSI SSL User defined loadable modules for Pre and Post Processing to provide support for dedicated MS system, compression FDT can be monitored and controlled dynamically by MonALISA System Iosif Legrand July 2014 50

FDT Throughput TESTS Iosif Legrand July 2014 51

FDT Memory to Memory Tests in WAN ~9.0 Gb/s ~9.4 Gb/s CPUs Dual Core Intel Xenon @ 3.00 GHz, 4 GB RAM, 4 x 320 GB SATA Disks Connected with 10Gb/s Myricom Iosif Legrand July 2014 52

Disk -to- Disk transfers in WAN 1U Nodes with 4 Disks 4U Disk Servers with 24 Disks MB/s NEW YORK GENEVA Reads and writes on 4 SATA disks in parallel on each server Mean traffic ~ 210 MB/s ~ 0.75 TB per hour CERN CALTECH Reads and writes on two 12-port RAID Controllers in parallel on each server Mean traffic ~ 545 MB/s ~ 2 TB per hour Lustre read/ write ~ 320 MB/s between Florida and Caltech Works with xrootd 53 Interface to dcache using the dcap protocol Iosif Legrand July 2014 53

FDT & MonLISA Used at SC 2006 17.7 Gb/s Disk to Disk on 10 Gb/s link used in Both directions from Florida to Caltech 54 Iosif Legrand August 2009 Iosif Legrand July 2014 54

SC2008 Bandwidth Challenge, Austin, TX CIENA Caltech Booths FDT TRANSFERS Ciena, OTU-4 standard link carrying a 100 Gbps payload (or 200 Gbps bidirectional) with forward error correction. 10x10Gbps multiplex over an 80km fibre spool Pick : 199.9 Gb/s Mean 191 Gb/s We transferred : ~ 1PB in ~12 hours Iosif Legrand July 2014 55

FDT on 40Gbps NICs (LAN) 35 Gbps 40 Gbps 4 TCP streams 2 TCP streams 2011 2014 Exactly the same hardware but better drivers and kernel Improved CPU usage Iosif Legrand July 2014 56

FDT used SC2012 on 40Gbps NIC cards Iosif Legrand July 2014 57

FDT used on network tests between CERN to Ottawa (May 2014) Using only two specially configured Linux servers in each location. Each server used in the demonstration was configured with two Mellanox 40GE Network cards. The transfers were performed using only 3 pairs of cards and 25 parallel TCP streams in total. Iosif Legrand July 2014 58

FDT Status The multi-threaded approach in FDT together with the zero copy and the java-nio was a right architecture that works fine with 1, 10 and 40 NIC cards. FDT is internally instrumented (ApMon) and can report a lot of Monitoring information into the MonALISA system. It can also be controlled by MonALISA agents. It is not used for data transfers in HEP Grids, but it is used a lot in Amazon cloud ( ~ several PB /year ) Iosif Legrand July 2014 59

Tuning for high performance data transfers Default TCP send and receive buffer size were initially 64KBytes It should be ~8 32 MB. Auto-tuning of the buffer size were introduced recently and works fine MTU Maximum Transfer Unit Jumbo frames MTU 9000 Firewalls IRQ pinning (also know as IRQ affinity) Congestion control : cubic Iosif Legrand July 2014 60

Network mapping for all ALICE Sites Continuous WAN measurements for 85x85 site matrix MonALISA with FTD Complex topology automatic analysis of network conditions, coupled with SE tests Resulting in Per site list of best set of Storage elements Given to the client for data reading/writing Iosif Legrand July 2014 61

Tuning high performance data transfers The High Speed NICs and TCP congestion control mechanism produces bursty traffic as seen by the network devices. The queue length is in general described by the integral of the difference between the arrival process and the departure process: QL = A t D t t The arrival processes in in general a random process while the departure process can be a deterministic process with a random component. The only case where the arrival rate is proportional with the queue length A t ~ A t t is for the Poisson process. High throughput IO is much better described by chain of bursty arrivals with strong correlations between data queues as well as between the sender and receiver process. Iosif Legrand July 2014 62

Monitoring Network Topology (L3), Latency and Routers for all ALICE sites Real Time Topology Discovery & Display Iosif Legrand July 2014 63

Active Available Bandwidth measurements between all the ALICE grid sites 64 Iosif Legrand October 2010 Iosif Legrand July 2014 64

FDT Throughput measurements among sites Measurements are coordinated by MonALISA services to do not overlap Large variations in the effective throughput Iosif Legrand July 2014 65

Active Available Bandwidth measurements between all the ALICE grid sites 66 Iosif Legrand October 2010 Iosif Legrand July 2014 66

ALICE Network mapping The bandwidth tests, routing, kernel parameters are Available to the site administrators for tuning of local network and host parameters Negotiations with network providers However. the situation is not ideal Network tuning is a difficult task Even well-intended operators sometimes have difficulty responding to inquiries (terminology barrier?) New sites usually need global help from network experts Iosif Legrand July 2014 67

Control in Distributed Systems

Operational Decisions and Actions Based on monitoring information, actions can be taken in ML Service ML Repository Actions can be triggered by Values above/below given thresholds Absence/presence of values Correlation between multiple values Operational actions Alerts e-mail Instant messaging Supervision for Services External commands Event logging Traffic Jobs Hosts Apps Networking Temperature Humidity A/C Power Actions based on local information Sensors ML Service ML Service Local decision s Actions based on global information Global Service Global decision s Iosif Legrand July 2014 69

Control and Optimization Time delays in receiving monitoring data for the control units : give rise to phase lag degenerate system stability and performance Maximize temporal determinism. In general, a time-lag in a feedback loop will result in overshoot and oscillation. These oscillation could fade out, continue or increase to bring the system into an unstable state. Control Services Network DISTRIBUTED RESOURCES Monitoring Information Variable Delay From ms to days FEEDBACK Iosif Legrand July 2014 70

ALICE: Automatic job submission Restarting Services MySQL daemon is automatically Restarted when it runs out of memory Trigger: threshold on VSZ memory usage ALICE Production jobs queue is kept full by the automatic submission Trigger: threshold on the number of aliprod waiting jobs Administrators are kept up-to-date on the services status Trigger: presence/absence of monitored information Iosif Legrand July 2014 71

SeeVOGH (EVO) : Real-Time monitoring for Reflectors and the quality of all possible connections 72 Iosif Legrand July 2014 72

EVO: Creating a Dynamic, Global, Minimum Spanning Tree to optimize the connectivity A weighted connected graph G = (V,E) with n vertices and m edges. The quality of connectivity between any two reflectors is measured every second. Building in near real time a minimumspanning tree with addition constrains w( T ) = ( v, u) T Resilient Overlay Network that optimize real-time communication w(( v, u)) 73 Iosif Legrand June 2013 Iosif Legrand July 2014 73

Dynamic MST to optimize the Connectivity for Reflectors Frequent measurements of RTT, jitter, traffic and lost packages The MST is recreated in ~ 1 S case on communication problems. Iosif Legrand July 2014 74

LISA- Provides an Efficient Integration for Distributed Systems and Applications It is using external services to identify the real IP of the end system, its network ID and AS Discovers MonALISA services and can select, based on service attributes, different applications and their parameters (location, AS, functionality, load ) Based on information such as AS number or location, it determines a list with the best possible services. Registers as a listener for other service attributes (eg. number of connected clients). Continuously monitors the network connection with several selected services and provides the best one to be used from the client s perspective. Measures network quality, detects faults and informs upper layer services to take appropriate decisions MonALISA Application Service MonALISA Application Service Registration MonALISA Application Service Best Service Lookup Service MonALISA Application Service Lookup Service Iosif Legrand July 2014 75 LISA Discovery

SeeVOGH : Optimize how clients connect to the system for best performance and load balancing Iosif Legrand July 2014 76

Distribute Information from protect communities from network attacks Propagate information to all peers and network devices to block the attackers. Iosif Legrand July 2014 77

Monitoring the Topology and Optical Power on Fibers for Optical Circuits Controlling Port power monitoring 78 78 Glimmerglass Switch Example Iosif Legrand Iosif Legrand July 2014 June 2013 78

On-Demand, End to End Optical Path Allocation >FDT A/fileX B/path/ OS path available Configuring interfaces Starting Data Transfer CREATES AN END TO END PATH < 1s Internet Real time monitoring MonALISA Distributed Service System APPLICATION DATA A MonALISA Service OS Agent B LISA AGENT LISA sets up - Network Interfaces - TCP stack - Kernel parameters - Routes LISA APPLICATION use eth1.2, 79 79 79 Optical Switch LISA Agent Detects errors and automatically recreate the path in less than the TCP timeout Iosif Legrand January 2012 Iosif Legrand July 2014 79

Controlling Optical Planes Automatic Path Recovery 200+ MBytes/sec From a 1U Node USLHCnet CERN Geneva Internet2 Starlight CALTECH Pasadena Manlan FDT Transfer 80 Fiber cut simulations The traffic moves from one transatlantic line to the other one FDT transfer (CERN CALTECH) continues uninterrupted TCP fully recovers in ~ 20s Iosif Legrand October 2010 1 2 4 fiber cut emulations 4 Fiber cuts simulations 3 4 80 Iosif Legrand Iosif Legrand July 2014 June 2013 80

APPLICATION >FDT A/fileX B/path/ On-Demand, Dynamic Circuits Channel and Path Allocation 186 Gbps SC 2011 Seattle - UVic. path or channel allocation Configuring interfaces Starting Data Transfer Local VLANs MAP Local VLANs to WAN channels or light paths Recommended to use two NICs -one for management /one for data -- bonding two NICs to the same IP Iosif Legrand July 2014 81

The Need for Planning and Scheduling for Large Data Transfers Sequential In Parallel 2.5 X Faster to perform the two reading tasks sequentially Iosif Legrand July 2014 82

Dynamic priority for FDT Transfers on common segments Priority 4 Priority 2 Priority 8 Iosif Legrand July 2014 83

The Challenge in how to use very large amounts of monitoring information MonALISA collects ~ 6 millions persistent monitoring parameters and ~ 100 millions of volatile parameters per day We can easily access any of these parameters but in general when there problems, finding and understanding them is not so easy Are all of them correct? Do we have all the information we need? In many cases application and services are not ( correctly ) instrumented. Do we collect redundant information? It is important to build correlation engines for most important data flows and in this way to help detect pathological problems for distributed systems Iosif Legrand July 2014 84

Collecting and Storing Very Large Amounts of Monitoring Information CMS Sustain rates of 1000 monitoring values/s and peaks of 8000 values/s ALICE Tens of TB of monitoring data 25 X times more data are collected and used to create the necessary aggregate values ~ 1.2 X 10 11 Monitoring parameters Received in the CMS Dashboard ~1.3 X 10 11 Monitoring parameters Received in the ALICE Repository Iosif Legrand July 2014 85

Load Traffic distribution for all Xrood servers Bad servers Overloaded servers Total of 358 Xrootd servers Load Max ~ 3000 Good servers 86 Iosif Legrand July 2014 86

Good Servers... have problems 87 Iosif Legrand July 2014 87

Trying to find a better way to store and analyze large sets of monitoring information Very large amount of monitoring information is currently collected. The users want more and more monitoring information, but is really difficult to analyze all the data we collect. Deleing older data or keeping only long term mediated values is not really a solution. Wavelets seems to provide an effective way to compress monitoring information and to analyze large, complex time series data. Iosif Legrand July 2014 88

The Fourier Transform Represents a signal into constituent sinusoids of different frequencies F( w) iwt = f ( t) e dt However it does not indicate when different frequencies occur Iosif Legrand July 2014 89

CWT ψ x THE WAVELET TRANSFORM 1 s t ψ * ( τ, s) = Ψ ( τ, s) = x( t) ψ dt Translation (The time location of the window) x Scale Mother Wavelet Scale is generalized local frequency s τ Iosif Legrand July 2014 90

Wavelet Transform Provides the time-frequency representation Capable of providing the time and frequency information simultaneously WT was developed to overcome some resolution related problems of the STFT We pass the time-domain signal from various highpass and low pass filters, which filters out either high frequency or low frequency portions of the signal. This procedure is repeated, every time some portion of the signal corresponding to some frequencies being removed from the signal Amplitude Scale Iosif Legrand July 2014 91

Discrete Wavelet Transform Signal Low-Pass Filter Down- Sample 2X Approximation (A) High-Pass Filter Down- Sample 2X Detail (D) Signal Filter 1 D 1 A single level decomposition puts a signal through 2 complementary S low-pass and high-pass filters The output of the low-pass filter gives the approximation (A) coefficients, A 2 D while the high pass filter gives the 2 detail (D) coefficients A 1 A 3 D 3 D 1 A 1 Filter 2 A 2 Filter 3 D 2 A 3 D 3 Iosif Legrand July 2014 92

Network Traffic in USLHC net 24 H History with high sampling rate 93 Iosif Legrand July 2014 93

Compressing Time Series Data % zeros in the compressed signal Signal Energy Threshold Iosif Legrand July 2014 94

Example of Compression for high sampling network traffic data Sig Energy = 70% Storage = 5% Sig Energy = 90% Storage = 10% Sig Energy = 95% Storage = 18 % Sig Energy = 98% Storage = 42% Iosif Legrand July 2014 95

Self Similarity Structure in the Transformed Space Iosif Legrand July 2014 96

Similar data transfers operations that overlap in the total traffic pattern Main Frequency Iosif Legrand July 2014 97

Detecting Clusters in Multiple Time Series Data Using Wavelets Use a multidimensional grid structure onto data space These multidimensional spatial data objects are represented in an n-dimensional feature space Apply wavelet transform on feature space to find the dense regions in the feature space Apply wavelet transform multiple times which result in clusters at different scales from fine to coarse Iosif Legrand July 2014 98

Two months history for the efficiency on all Alice grid sites High resolution data > 20 000 Values per time series Not so easy to understand Iosif Legrand July 2014 99

Acceding hierarchical clusters for the site efficiency plot Major Advantages: Complexity O(N) Detect arbitrary shaped clusters at different scales Not sensitive to noise, not sensitive to input order Iosif Legrand July 2014 100

Results for cluster classification for sites Iosif Legrand July 2014 101

Monitoring: A listener to the grid orchestra? Can we analyze and use in near real time monitoring data in this form? Iosif Legrand July 2014 102

MonALISA Today Monitoring 60,000 computers > 100 Links Major Nets Tens of Thousands of Grid jobs running concurrently > 14,000 end-to-end network path measurements Using Intelligent Agents Collecting > 6mil persistent parameters in real-time ~ 100 millions of volatile parameters per day Updating ~ 35,000 parameters per second Repository servers 10 mil. users request / year Running 24 X 7 at ~370 Sites Iosif Legrand July 2014 103

The eight fallacies of distributed computing It is fair to say that at the beginning of this project we underestimated some of the potential problems in developing large distributed systems in WAN, and indeed the eight fallacies of distributed computing are very important lessons: 1) The network is reliable. 2) Latency is zero. 3) Bandwidth is infinite. 4) The network is secure. 5) Topology doesn't change. 6) There is one administrator. 7) Transport cost is zero. 8) The network is homogeneous. Iosif Legrand July 2014 104

The MonALISA Experience Unified platform for all the monitoring information Service Oriented Architecture ; Dynamic Discovery Agent model for monitoring modules / filters and actions Functionality to dynamically subscribe for any type of information on the fly Use of a simple and efficient communication approach (problems with RMI, XML. ) Multithread approach is instrumental for the performance and reliability for the system Good Graphical views & representations to present information Simple and efficient approach for storing data Iosif Legrand July 2014 105