A Performance Monitoring System for Large Computing Clusters


Moreno Marzolla
marzolla@dsi.unive.it
http://www.dsi.unive.it/~marzolla
Dip. Informatica, Università Ca' Foscari di Venezia and Istituto Nazionale di Fisica Nucleare, Padova, BaBar Collaboration

Talk Outline
- Introduction
- Case Study: Monitoring the BaBar Computing Farm
- PerfMC: A prototype of an SNMP-based monitoring application
- Conclusions
Moreno Marzolla, PDP 2003, Genova, Italy

Monitoring
A monitor is a tool used to observe the activities on a system:
- Collect performance statistics
- Analyze the data
- Display the results
Why?
- Identify frequently used portions of a program
- Measure resource utilization to find performance bottlenecks
- Characterize the workload
- Find model parameters, validate models, or develop inputs for a model

Distributed Systems Monitoring
To monitor a distributed system, the monitor itself needs to be (at least partly) distributed.
[Figure: a Monitor and four Observers, on Host 1 through Host 4]

BaBar Case Study
- BaBar is a High Energy Physics experiment studying matter-antimatter asymmetry.
  http://www.slac.stanford.edu/bfroot/
- Only a small fraction of events (10^-5 to 10^-6) carries real information
  - Large amounts of data are needed
  - Large computing facilities are required

BaBar Farm @ INFN Padova
- 170 dual Pentium III 1.26 GHz machines, 1 GB RAM, Red Hat Linux 7.2
  - 130 clients
  - 40 servers
- Tape library with a capacity of ~70 TB uncompressed
- Network switches, UPSes, environmental conditioning systems, ...

Monitoring Requirements
- Hardware status
  - Machine crashes, CPU utilization, disk I/O, network I/O, ...
- Process status
- Environmental conditions
  - Humidity, temperature, UPS status, ...
- Does not need to be a real-time monitor
The monitoring system should also be:
- Reasonably scalable
- Efficient (low resource requirements)
- Flexible and customizable
- Easy to configure
- Able to operate in batch mode (as a regular UNIX dæmon, no GUI)

Some existing monitoring tools
There are many of them:
- MRTG http://people.ee.ethz.ch/~oetiker/webtools/mrtg/
- Ngop http://www-isd.fnal.gov/ngop/
- Netsaint/Nagios http://www.nagios.org/
- Ganglia http://ganglia.sourceforge.net/
- RemStats http://silverlock.dgim.crc.ca/remstats/release/index.html
- Cricket http://cricket.sourceforge.net/
- GxSNMP http://www.gxsnmp.org/
- ... (add your own here)

So, what's wrong?
We examined some publicly available monitoring tools. Unfortunately, many of them were limited in some way:
- Not scalable
- Require their own dæmons (observers) running on the monitored hosts
  - You can't install a dæmon on a network switch or on a tape library
- Hard to configure
  - In many cases we gave up without being able to try the program
- Poorly implemented
  - Heavy use of scripting languages, mixed C/Perl/shell pieces

The Observer
- The Observer must be able to collect statistics on any networked equipment
- We decided to use SNMP (Simple Network Management Protocol)
  - It is implemented by many vendors
  - It is reasonably simple
  - A very good open source implementation is available on Unix/Linux/Win32 platforms
    http://net-snmp.sourceforge.net/

PerfMC: a Performance Monitor for Clusters
Characteristics:
- Written in C
- Asynchronous (nonblocking), parallelized SNMP polling
- Uses SNMPv2 Bulk Get requests
- XML-based configuration file
- The RRDTool package is used to store data and produce graphs
  - Old data have lower resolution than recent data
  - Round Robin Databases have a known, fixed size
  - Graphing capabilities are provided by the library
- Dynamic generation of HTML pages using XSLT stylesheets
- PerfMC has an embedded HTTP server
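The asynchronous, parallelized polling listed above can be illustrated with a minimal Python sketch. PerfMC itself is written in C and its internals are not shown in these slides, so everything here is an assumption made for illustration: `poll_host` is a hypothetical stand-in that fakes an SNMP bulk request, and `max_connections` is a hypothetical cap on the number of requests in flight at once.

```python
import asyncio

async def poll_host(host, oids):
    """Hypothetical stand-in for one SNMP bulk request to `host`."""
    await asyncio.sleep(0.01)              # simulates network latency
    return {oid: 0.0 for oid in oids}      # fake values, no real SNMP traffic

async def poll_all(hosts, oids, max_connections=50, poll=poll_host):
    """Poll every host concurrently, with at most `max_connections` in flight."""
    sem = asyncio.Semaphore(max_connections)

    async def bounded(host):
        async with sem:
            try:
                values = await asyncio.wait_for(poll(host, oids), timeout=5.0)
            except (asyncio.TimeoutError, OSError):
                values = None              # host did not answer in time
            return host, values

    results = await asyncio.gather(*(bounded(h) for h in hosts))
    return dict(results)

def run_poll(hosts, oids, max_connections=50):
    """Synchronous entry point for one polling round."""
    return asyncio.run(poll_all(hosts, oids, max_connections))
```

Because no request blocks the others, one slow or dead host delays only its own slot; the semaphore keeps the number of simultaneous requests bounded regardless of cluster size.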

PerfMC Architecture
[Figure: PerfMC architecture. Components shown: XML configuration file, SNMP poller, in-core status, RRD, XSLT engine with stylesheets, HTTPD, generated HTML pages and graphs, and the monitored hosts (Host 1, Host 2, ..., Host n)]

Example of XML Configuration File
    <?xml version="1.0" standalone="no"?>
    <!DOCTYPE monitor SYSTEM "monitor.dtd">
    <monitor numconnections="50"
             pmclogfile="/monitor/pmc.log"
             httpdlogfile="/dev/null"
             rrddir="/monitor"
             htmldir="/monitor/html"
             pmcverbosity="3">
      <host name="localhost" tag="client">
        <description>This is a sample client machine</description>
        <miblist>
          <!-- list of mibs to monitor -->
        </miblist>
        <archives>
          <!-- RRD layout -->
        </archives>
        <graphs>
          <!-- Graph definitions here -->
        </graphs>
      </host>
    </monitor>
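To show how a configuration of this shape could be read, here is a minimal Python sketch using the standard library's ElementTree. It is not PerfMC's parser (which is written in C and not shown in the slides). The `load_config` function and the returned dictionary layout are hypothetical, and the sample omits the DOCTYPE line because `monitor.dtd` is not available here.

```python
import xml.etree.ElementTree as ET

# Shortened sample modelled on the slide; the DOCTYPE is left out on purpose.
SAMPLE_CONFIG = """<?xml version="1.0"?>
<monitor numconnections="50" pmclogfile="/monitor/pmc.log"
         httpdlogfile="/dev/null" rrddir="/monitor"
         htmldir="/monitor/html" pmcverbosity="3">
  <host name="localhost" tag="client">
    <description>This is a sample client machine</description>
    <miblist>
      <mib id="cpuuser" name=".1.3.6.1.4.1.2021.11.50.0" type="counter"/>
    </miblist>
  </host>
</monitor>
"""

def load_config(xml_text):
    """Return (global settings, list of host entries) from the XML text."""
    root = ET.fromstring(xml_text)
    settings = dict(root.attrib)           # attributes of <monitor>, as strings
    hosts = []
    for host in root.findall("host"):
        hosts.append({
            "name": host.get("name"),
            "tag": host.get("tag"),
            "description": (host.findtext("description") or "").strip(),
            "mibs": [dict(m.attrib) for m in host.findall("miblist/mib")],
        })
    return settings, hosts
```

All attribute values come back as strings, so a caller would still need to convert entries such as `numconnections` to integers.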

Sample XML configuration file (MIB)
    <miblist>
      <mib id='tempmb' name='.1.3.6.1.4.1.2021.13.16.2.1.3.1'/>
      <mib id='tempcpu1' name='.1.3.6.1.4.1.2021.13.16.2.1.3.2'/>
      <!-- .iso.org.dod.internet.private.enterprises.ucdavis.systemstats.sscpurawuser.0 -->
      <mib id='cpuuser' name='.1.3.6.1.4.1.2021.11.50.0' type='counter'/>
      <!-- .iso.org.dod.internet.private.enterprises.ucdavis.systemstats.sscpurawsystem.0 -->
      <mib id='cpusystem' name='.1.3.6.1.4.1.2021.11.52.0' type='counter'/>
      <!-- .iso.org.dod.internet.private.enterprises.ucdavis.systemstats.sscpurawnice.0 -->
      <mib id='cpunice' name='.1.3.6.1.4.1.2021.11.51.0' type='counter'/>
      <!-- .iso.org.dod.internet.mgmt.mib-2.interfaces.iftable.ifentry.ifinoctets.2 -->
      <mib id='net1in' name='.1.3.6.1.2.1.2.2.1.10.2' type='counter'/>
      <!-- .iso.org.dod.internet.mgmt.mib-2.interfaces.iftable.ifentry.ifoutoctets.2 -->
      <mib id='net1out' name='.1.3.6.1.2.1.2.2.1.16.2' type='counter'/>
    </miblist>
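Entries marked type='counter' hold values that only ever increase, so what gets plotted is normally the rate of change between two samples. The slides do not say how PerfMC or its RRD layer handles this; the Python sketch below is an assumption showing the usual calculation, including wraparound of a 32-bit counter. The function name and parameters are hypothetical.

```python
def counter_rate(prev_value, prev_time, curr_value, curr_time, bits=32):
    """Rate of increase per second between two samples of a wrapping counter."""
    elapsed = curr_time - prev_time
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = curr_value - prev_value
    if delta < 0:
        # Assume the counter wrapped around exactly once between the samples.
        delta += 2 ** bits
    return delta / elapsed
```

For example, `counter_rate(1000, 0, 4000, 300)` gives 10.0 units per second. A counter that was reset (for instance after an agent restart) also produces a negative difference and would be misread as a wrap by this sketch.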

Example of XML Status Dump
    <?xml version="1.0"?>
    <hosts>
      <host name="localhost" status="nr">
        <mibs>
          <mib id="availswap" lastupdated="1018016033">1052248.000000</mib>
          <mib id="totalswap" lastupdated="1018016033">1052248.000000</mib>
          <mib id="totalmem" lastupdated="1018016033">917080.000000</mib>
          <mib id="cachedmem" lastupdated="1018016033">7128.000000</mib>
          <mib id="buffermem" lastupdated="1018016033">35052.000000</mib>
          <mib id="sharedmem" lastupdated="1018016033">0.000000</mib>
          <mib id="freemem" lastupdated="1018016033">833800.000000</mib>
          <mib id="cpusystem" lastupdated="1018016033">137587.000000</mib>
          <mib id="cpuuser" lastupdated="1018016033">13581.000000</mib>
          <mib id="tempcpu2" lastupdated="1018016033">24500.000000</mib>
          <mib id="tempcpu1" lastupdated="1018016033">25000.000000</mib>
          <mib id="tempmb" lastupdated="1018016033">33000.000000</mib>
        </mibs>
        <graphs>
          <graph id="hourly.png" title="hourly data"/>
        </graphs>
      </host>
    </hosts>
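A status dump like this is easy to consume from other tools. The Python sketch below, offered only as an illustration, reads a shortened copy of the dump and flags values that have not been refreshed recently. It assumes that `lastupdated` is a Unix timestamp in seconds, which the slides do not state; `parse_status` and `stale_mibs` are hypothetical helpers.

```python
import xml.etree.ElementTree as ET

# Two entries copied from the slide's status dump.
STATUS_DUMP = """<?xml version="1.0"?>
<hosts>
  <host name="localhost" status="nr">
    <mibs>
      <mib id="totalmem" lastupdated="1018016033">917080.000000</mib>
      <mib id="freemem" lastupdated="1018016033">833800.000000</mib>
    </mibs>
  </host>
</hosts>
"""

def parse_status(xml_text):
    """Map host name -> {mib id: (value, lastupdated)}."""
    status = {}
    for host in ET.fromstring(xml_text).findall("host"):
        status[host.get("name")] = {
            mib.get("id"): (float(mib.text), int(mib.get("lastupdated")))
            for mib in host.findall("mibs/mib")
        }
    return status

def stale_mibs(status, now, max_age):
    """Sorted (host, mib id) pairs not updated within `max_age` seconds."""
    return sorted(
        (host, mib_id)
        for host, mibs in status.items()
        for mib_id, (_, updated) in mibs.items()
        if now - updated > max_age
    )
```

A check of this kind is one simple way to notice a host that has stopped answering, since its values keep their old timestamps.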

Sample HTML Output
[Figure: sample HTML output]

Another example
[Figure: another example of the HTML output]

Some Plots
[Figure: plots for a database server and for a client machine]

PerfMC Performance
- Reasonably low CPU utilization (<< 5%)
- Reasonably low network utilization (11 KB/s)
- Not-so-low disk utilization (1.2 MB/s)

Conclusions
- Monitoring a large computing cluster is a highly nontrivial task.
- Many monitoring tools are available, but many of them are not adequate for large distributed systems.
- We are trying to build a general-purpose SNMP- and XML-based monitoring tool.
- A prototype exists and is working well.

Future work
- Alarms are not currently implemented, but are at the top of the to-do list
- At the moment there are no serious problems: PerfMC is running on the production cluster (170 machines)
  - Clearly, it cannot scale forever
- No attention has been paid to security
  - Could easily be extended for SNMPv3
  - Not a priority: our cluster is on a private network

Bibliography
- W. Stallings, SNMP, SNMPv2, SNMPv3 and RMON 1 and 2, third edition, Addison-Wesley, 1999
- R. Jain, The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling, John Wiley and Sons, 1991
- Grid Performance Working Group, http://www-didc.lbl.gov/gridperf/
- RRD Tools Home Page, http://people.ee.ethz.ch/~oetiker/webtools/rrdtool
- BaBar Farm home page (will contain PerfMC), http://bbr-webserv.pd.infn.it:5211/farm/index.html
- BaBar Farm Monitoring Page, http://bbr-monitor.pd.infn.it:5211/monitor/html/index.html

More on SNMP
[Figure: SNMP protocol stack. A management application and the managed resources (SNMP managed objects) exchange GetRequest, GetNextRequest, SetRequest, GetResponse and Trap messages; on both sides SNMP runs over UDP, IP and a network protocol, across a LAN/WAN. Source: W. Stallings, SNMP, SNMPv2, SNMPv3 and RMON 1 and 2, 3rd edition, p. 81]

Round Robin Databases
[Figure: value plotted against time, from old data to recent data; the oldest data are dropped]
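The two properties mentioned earlier in the talk (a known, fixed size, and lower resolution for old data) can be illustrated with a small Python sketch. This is not RRDtool's file format or API. `RoundRobinArchive` is a hypothetical class that keeps a fixed number of rows and averages a fixed number of samples into each row.

```python
from collections import deque

class RoundRobinArchive:
    """Fixed-size archive: each row is the average of `steps_per_row` samples."""

    def __init__(self, rows, steps_per_row=1):
        self.rows = deque(maxlen=rows)     # the oldest row is dropped when full
        self.steps_per_row = steps_per_row
        self._pending = []                 # samples not yet consolidated

    def update(self, value):
        self._pending.append(value)
        if len(self._pending) == self.steps_per_row:
            self.rows.append(sum(self._pending) / self.steps_per_row)
            self._pending = []

# One fine-grained archive for recent data, one coarse archive for older data.
recent = RoundRobinArchive(rows=4, steps_per_row=1)
older = RoundRobinArchive(rows=3, steps_per_row=2)
for sample in range(1, 11):                # samples 1, 2, ..., 10
    recent.update(sample)
    older.update(sample)
```

After the loop, `recent` holds only the last four samples at full resolution, while `older` covers a longer time span with the same kind of fixed storage by averaging pairs of samples.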