Grid monitoring with MonAlisa Iosif Legrand (Caltech) Iosif.Legrand@cern.ch Laura Duta (UPB) laura.duta@cs.pub.ro Alexandru Herisanu (UPB) aherisanu@cs.pub.ro
Agenda What is MonAlisa? Who uses it? A little bit of design Monitoring Traffic and Jobs Network Topology, Latency and Routers Managing Optical Switches and Paths The VRVS and EVO system Summary INFSO-RI-508833 University Politehnica of Bucharest 2
What is MonAlisa? The MonALISA system is designed as an ensemble of autonomous multi-threaded, self-describing agent-based subsystems which are registered as dynamic services, and are able to collaborate and cooperate in performing a wide range of monitoring tasks and to analyze and process this information in a distributed way to provide optimization decisions in large scale distributed applications.
What MonAlisa provides Reliable Registration and Discovery for Services and Applications. Monitoring all aspects of complex systems : System information for computer nodes and clusters Network information : WAN and LAN Monitoring the performance of Applications or services The End User Systems Can interact with any other services to provide in near real-time customized / filtered information based on monitoring data Secure, remote administration for services and applications Agents to supervise applications, to restart or reconfigure them, and to notify other services when certain conditions are detected. The MonALISA framework can be used to develop higher level decision ion services, implemented as a distributed network of communicating agents, to perform global optimization tasks. Powerful Graphical User Interfaces
MonAlisa communities Disun CMS-US sites CMS CDF D0 SAR ABILENE backbone GLORIAD STAR ALICE VRVS System RoEduNET backbone INTERNET2 PIPES OSG VRVS ABILENE It has been used for Demonstrations - at: - CMS-DC04 SC 2005 - Bandwith Challenge Record: 131 Gbps CHEP06 GRID3 Terena Netw 2005 ALICE ICFA Wrksh. 05 SC 2004
A little bit of design ML Client ML Repository
MonAlisa System Services Clients, HL services repositories Proxies MonALISA service Global Services or Clients Dynamic load balancing Scalability & Replication Security Distributed Information System. Fully Distributed Discovery Network of JINI-LUSs Dynamic - based on a lease Secure & Public Mechanism and REN
MonAlisa service and Data Handling Client (other service) Web client WEB Service WSDL SOAP Postgres DB MySQL Monitor Data Stores Data Cache Service & DB Registration Lookup Service Communications via the ML Proxy data Lookup Service Discovery Applications MonALISA Service Predicates & Agents Configuration Control (SSL) Client (other service) Java User defined loadable Modules to write /sent data MDS
Registration, Discovery and AAA MonALISA Service Trust keystore MonALISA Service Admin SSL connection MonALISA Service Trust keystore Registration (signed certificate) Lookup Service Lookup Service Discovery Services Proxy Multiplexer Services Proxy Multiplexer AAA services Client (other service) Client authentication Client (other service)
Monitoring traffic and jobs
Monitoring the execution of jobs SPLIT JOBS LIFELINES for JOBS Summit a Job Job1 Job Job DAG Job2 Job3 Job 31 Job 32
Monitoring SPEED MonAlisa Fast Data Transfer 17Gbps full duplex at SC2006
Monitoring Network Topology, Latency, Routers NETWORKS ROUTERS AS Real Time Topology Discovery & Display
On demand Optical Path
Monitoring and Controlling Optical Planes Controlling Port power monitoring Glimmerglass Switch Example
Agents to Create on Demand an Optical Path or Tree 2 Discovery & Secure Connection ML Demon ML Agent MonALISA 3 1 Runs a ML Demon >ml_pathml_path IP1 IP4 copy file IP4 Optical Switch Control & Monitor the switch ML Agent MonALISA Optical Switch Optical Switch ML Agent MonALISA The time to create a path on demand is less than 1s independent of the location and the number of connections 4 ML proxy services used in Agent Communication
Host Monitoring with UltraLight Many network problems are actually endhost problems: misconfigured or underpowered endsystems Host/System TCP Network Settings Device Information The Information LISA application (see Iosif s talk later) was designed to monitor the endhost and its view of the network. For SC 05 we developed a Perl script to gather the relevant host details related to network performance and integrated the script with ApMon (an API for MonALISA) to allow us to publish this data to a MonALISA repository. Information on the system information, TCP configuration and network device setup was gathered and accessible from one site. Future plans are to coordinate this with LISA and deploy this as part of OSG. The Tier-2 centers are a primary target.
Available Bandwidth Measurements Embedded Pathload module.
Enforces measurement fairness Coordination Service for Available Bandwidth Measurements Avoids multiple probes on shared network segments Dynamic configuration of measure-ment timing Logs events Provides service redundancy by using a master-slave model
Monitoring VRVS Reflectors and Communication Topology
Creating a Dynamic, Global, Minimum Spanning Tree to optimize the connectivity A weighted connected graph G = (V,E) with n vertices and m edges. The quality of connectivity between any two reflectors is measured every 2s. Building in near real time a minimumspanning tree T w( T ) = ( v, u ) T w(( v, u ))
Summary MonaLISA is a fully distributed service system with no single point of failure. It provides reliable registration and discovery of services and applications. MonALISA is interfaced with many monitoring tools and is capable to collect information from different applications It allows to analyze and process information locally, using Filters or Agents that are dynamically deployed to provide customized information to other services or clients or to trigger predefined actions. Can be used to control and monitor any other applications. Agents can be used to supervise applications, to restart or reconfigure them, and to notify other services when certain conditions are detected. Provides a secure administration interface which allows to remotely control (start / stop/ reconfigure / upgrade) distributed services or applications. The Agent system in the MonALISA framework can be used to develop higher level services, implemented as a distributed network of communicating agents, to perform global optimization tasks. It proved to be a stable and reliable distributed service system ~180 Sites running MonALISA