Monitoring for IT Services and WLCG. Alberto AIMAR CERN-IT for the MONIT Team

Monitoring for IT Services and WLCG Alberto AIMAR CERN-IT for the MONIT Team 2

Outline Scope and Mandate Architecture and Data Flow Technologies and Usage WLCG Monitoring IT DC and Services Monitoring 3

Scope and Mandate 4

Monitoring - Scope Data Centres Monitoring Monitoring of DCs at CERN and Wigner Hardware, operating system, and services Data Centres equipment, PDUs, temperature sensors, etc. Metrics and logs Experiment Dashboards WLCG Monitoring Sites availability, data transfers, job information, reports Used by WLCG, experiments, sites and users 3

Services Proposed Cover IT DC, Services and WLCG Monitor, collect, visualize, process, aggregate, alarm Metrics and Logs Infrastructure operations and scale Helping and supporting Interfacing new data sources Developing custom processing, aggregations, alarms Building dashboards and reports 6

Data Centres Monitoring (meter) 7

Experiment Dashboards Job monitoring, sites availability, data management and transfers Used by experiments operation teams, sites, users, WLCG 8

WLCG Monitoring - Mandate Regroup monitoring activities hosted by CERN IT Monitoring of Data Centres, WLCG and Experiment Dashboards ETF, HammerCloud testing frameworks Uniform with standard CERN IT practices Management of services, communication, tools Review existing monitoring usage and needs (IT, WLCG, etc.) Investigate, implement established open source technologies Reduce dependencies on in-house software and on few experts Continue support, while preparing the new services 9

Architecture and Data Flow 10

Infrastructure Monitoring Job Monitoring Data mgmt and transfers Data Centres Monitoring Previous Monitoring Data Sources Metrics Manager Lemon Agent XSLS ATLAS Rucio FTS Servers DPM Servers XROOTD Servers CRAB2 CRAB3 WM Agent Farmout Grid Control CMS Connect PANDA WMS ProdSys Nagios VOFeed OIM GOCDB REBUS Transport Flume AMQ Kafka Flume AMQ GLED HTTP Collector z SQL Collector MonaLISA Collector AMQ HTTP GET HTTP PUT Storage &Search HDFS ElasticSearch Oracle ElasticSearch HDFS Oracle ElasticSearch Oracle ElasticSearch Processing & Aggregation Spark Hadoop Jobs GNI Oracle PL/SQL ESPER Spark Oracle PL/SQL ES Queries ESPER Display Access Kibana Jupyter Zeppelin Dashboards (ED) Kibana Zeppelin Real Time (ED) Accounting (ED) API (ED) SSB (ED) SAM3 (ED) API (ED)

Unified Monitoring Data Sources Transport Storage &Search Metrics Manager Lemon Agent XSLS ATLAS Rucio FTS Servers DPM Servers XROOTD Servers CRAB2 CRAB3 WM Agent Farmout Grid Control Flume AMQ z Kafka Hadoop HDFS ElasticSearch InfluxDB Processing & Aggregation CMS Connect PANDA WMS ProdSys Nagios VOFeed OIM GOCDB Spark Hadoop Jobs GNI REBUS Data Access Kibana Grafana Jupyter (Swan) Zeppelin

Flume Kafka sink Flume sinks Data Sources FTS Rucio XRootD Jobs User Data Lemon syslog app log AMQ DB HTTP feed Logs Lemon metrics Flume AMQ Flume DB Flume HTTP Flume Log GW Flume Metric GW Unified Monitoring Architecture Transport Kafka cluster (buffering) * Processing Data enrichment Data aggregation Batch Processing User Jobs Storage & Search HDFS Elastic Search Others (influxdb) Data Access CLI, API User Views ~100 data producers 3.5 TB/day 75k docs/sec 3 days retention in Kafka 13 spark jobs 24/7 13

Data Sources FTS Rucio XRootD Jobs User Data Lemon syslog app log collectd LogStash Unified Data Sources AMQ DB HTTP feed Logs Lemon metrics Metrics (http) Flume AMQ Flume DB Flume HTTP Flume Log GW Flume Lemon Metric GW Flume Metric GW Transport Flume Kafka sink Data is all channeled via Flume via gateways Validated and normalized if necessary (e.g. standard names, date formats) Adding new Data Sources is documented and fairly simple (User Data) 14 Available both for Metrics (IT, WLCG, etc.) and Logs (hw logs, OS logs, syslogs, app logs)

Unified Processing Flume Kafka sink Transport Kafka cluster (buffering) * Flume sinks Flume sinks Processing (e.g. Enrich FTS transfer metrics with WLCG topology from AGIS/GOCdb) User Jobs Flume sinks Proven useful many times 15

Data Processing Stream processing Data enrichment Join information from several sources (e.g. WLCG topology) Data aggregation Over time (e.g. summary statistics for a time bin) Over other dimensions (e.g. compute a cumulative metric for a set of machines hosting the same service) Data correlation Advanced Alarming: detect anomalies and failures correlating data from multiple sources (e.g. data centre topology-aware alarms) Batch processing Reprocessing, data compression, historical data, periodic reports 16

Unified Access Storage & Search Data Access Flume sinks HDFS User Views Flume sinks Elastic Search Flume sinks Influx DB CLI, API Reports Plots Scripts Default dashboards, and can be customized and extended fairly easily Multiple data access methods (dashboards, notebooks, CLI) Dashboards, reports and access via scripts to create new User Views or reports 17

Technologies and Usage Data Sources Storage and Search Data Access 18

MONIT Data Sources FTS Rucio XRootD Jobs User Data Lemon syslog app log collectd LogStash AMQ DB HTTP feed Logs Lemon metrics Metrics (http) MONIT Data is validated and normalized if necessary (e.g. standard names, date formats) Available both for Metrics (IT, WLCG, etc.) and Logs (hw logs, OS, syslogs, app logs) Adding User Data Sources is rather easy Data Sources Components vs Technology Purpose Access Metrics via Collectd Uses collectd plug-ins pre-installed for CERN host and services Log Sources via FLUME Move logs data via a Flume local agent or any HTTP client External Sources via HTTP Endpoint for data from external sources (few, well connected) Open to any source within CERN External Sources via AMQ AMQ Messaging end point Open to sources on WAN 19

MONIT Storage and Search MONIT Storage & Search Elastic Search Influx DB HDFS Searches Dashboards Reports CLI, API Default dashboards, and can be customized and extended fairly easily Provide data access methods (dashboards, notebooks, CLI) Dashboards, reports and access via scripts to create User Views or reports Mainstream and established technologies, do not need experts to develop them Multiple back ends suited for different use cases Storage and Search Technology Purpose Data type Retention Period ElasticSearch Short-term storage and index Raw data, metrics and logs 1 month, with current resources InfluxDB Time series storage Aggregated (1w/raw, 1M/10 m ) 5 years HDFS Long-term archive Raw data, metrics and logs forever 20

MONIT Data Access MONIT Plots Reports User Views CLI, API Scripts Data Access Technology Purpose Data Storage Kibana Full search/filter/discovery of data ElasticSearch Grafana Dashboards optimized for time series plots. Local alarms ElasticSearch, InfluxDB Zeppelin Notebooks for analysis, reports and plots. Native support for Spark HDFS API and CLIs Access from external applications, scripts, Python, etc. HDFS 21

Current Status and Progress 22

Infrastructure(s) - Current Numbers Designed, developed and deployed a new monitoring infrastructure capable of handling CERN IT and WLCG data > 150 VMs, 3.6 TB/day, 6 billion docs /day Maintenance legacy WLCG infrastructure and tools ~ 90 VMs Maintenance legacy ITMON infrastructure and tools (i.e. meter, timber) ~ 150 VMs 23

Infrastructure(s) - Operations Building and tuning the complete infrastructure Supporting existing services Depending on many external services ES, InfluxDB, HDFS Some also new and being set up Securing infrastructure Flume/Kafka/Spark/ES/HDFS Configuring infrastructure (Puppet 4) 24

New WLCG Monitoring 25

Current Data / WLCG WLCG Data Sources FTS XROOTD XROOTD ALICE DDM RUCIO DDM TRACES DDM ACCOUNTING PHEDEX ATLAS JM PANDA/PRODSYS CMS JM - HT CONDOR SAM3 ETF AGIS VOFEED REBUS GOCDB OIM Additional Data Sources ASAP ATLAS CRAB OPS CMS SPACEMON GLIDEINWMS LHCOPN BOINC - LHCATHOME PROTODUNE DAQ WMAGENTS WM ARCHIVE WLCG SPACE ACCOUNTING 26

Current Processing / WLCG Validation and Transformation Fields Verifications (e.g. check timestamp in milliseconds in all doc) Fields Extractions (e.g. extract FTS log link, transfer ID) Fields Computations (e.g. create unique document ID based on other fields Field Normalization: apply common names (e.g. dst_site, dst_country, lowercase, etc.) Enrichment Aggregation Specific Processing Topology Resolution Join raw data with AGIS, VOFeed, REBUS, etc. Binning over time Summary data (e.g. for a given interval) Specific Spark Jobs (e.g. efficiency = success vs. failures) DDM Site avail Job monitoring and accounting FTS, XRootD transfers, rates Sites Availability, profiles (prototype) 27

Current Views: MONIT Portal 28

WLCG Transfers 29

XRootD Transfers 9/25/2017 30

Transfers Overviews (T0, T1, T2) 9/25/2017 31

Site Availability Data 9/25/2017 32

Sites Availability Profiles 9/25/2017 33

Other Data in MONIT for WLCG Examples of other data sources LHCOPN network traffic LHC@HOME (BOINC) WLCG Space Accounting OpenStack at CERN 34

LHCOPN 35

LHCOPN vs WLCG Transfers 9/25/2017 36

WLCG Space Accounting 9/25/2017 37

New IT DC Monitoring 38

New DC Monitoring using Collectd Lemon Agent is the last component in production from the old Lemon/DC Monitoring Moving to collectd collect system and service metrics optimized to handle thousands of metrics modular and portable with hundreds of plugins available easy to develop new plugins in Python/Java/C continuously improving and well documented 39

Data Sources FTS Rucio XRootD Jobs User Data Lemon syslog app log collectd LogStash Collect Data Source AMQ DB HTTP feed Logs Lemon metrics Metrics (http) Flume AMQ Flume DB Flume HTTP Flume Log GW Flume Lemon Metric GW Flume Metric GW Transport Flume Kafka sink Only component to add and use the full MONIT infrastructure All existing IT monitoring will be replaced (meter, notifications, dashbords) 40

Collectd Collectd Data Source Load Plugin Exec Plugin Lemon Plugin Sampling Write_HTTP Plugin HTTP Source Avro Sink Enrichment (e.g. host information, environment) Monitoring Infrastructure Validation Transformation Client on the host 41

Collectd Metrics and Plugins 9/25/2017 42

Replacement Strategy 1. Use an existing collectd plugin (recommended) Straightforward: main logic can be reused Many similarities at API level registermetric() => register_read() storesample() => dispatch() 2. Extend standard collectd plugin Requires development 3. Run lemon sensor using collectd wrapper 43

Alarms GNI (Snow/Email) Collectd Agents Local Alarms HTTP proxy Kafka Spark HTTP proxy Grafana InfluxDB Grafana Alarms = alarm Advanced Alarms 44

Reference and Contact Dashboards (CERN SSO login) monit.cern.ch Feedback/Requests (SNOW) cern.ch/monit-support Documentation cern.ch/monitdocs 45

Additional Info and Backup Slides 47

Notebooks with Zeppelin Extract Data from HDFS or ES 48

Manipulate the data and plot with common languages and tools Python Scala numpy 9/25/2017 49

Notebooks with Swan Extract Data from ES ROOT Python C++ CVMFS 9/25/2017 50