
Monitoring system for geographically distributed datacenters based on OpenStack
Gioacchino Vino
Tutors: Dott. Domenico Elia, Dott. Giacinto Donvito
GARR "Orio Carlini" scholarship 2016-2017, INFN Bari
Workshop GARR, Rome, 30 May 2018

Index
- Motivations
- Project overview
- Next steps
- Parallel derived activity: implementing the monitoring system for the ALICE O2 Facility @ CERN

Motivations for a new monitoring system
The increasing demand for computing resources for scientific purposes is leading to:
- Datacenters growing in complexity and size.
- Adoption of new technologies such as virtualization and cloud computing.
- Cooperation among datacenters in order to accomplish common goals.
Geographically distributed datacenters:
- Goal: increase the computing capability of the overall system.
- Side effect: increased complexity of monitoring and control systems.
Project: develop a monitoring system for geographically distributed datacenters.

Project Overview
Goal: reduce the malfunction time.
Advanced features are required:
- Anomaly detection
- Root Cause Analysis
- Graph data modeling
Fully informative monitoring data are collected:
- Service monitoring (HTTP servers, DBs, ...)
- OpenStack and middleware monitoring
- Hardware monitoring (servers, disks, disk controllers, network devices, PDUs, ...)

Project Testbed: ReCaS Bari
ReCaS Bari Datacenter:
- More than 13,000 cores
- 7.1 PB disk storage, 2.5 PB tape storage
- HPC cluster composed of 20 servers
- Dedicated network links: 2x 10 Gbps to GARR, 20 Gbps to Naples, 20 Gbps to Bologna
- Cloud platform: OpenStack
- Batch system: HTCondor, 184 worker nodes
- 350+ network connections
- Local monitoring system: Zabbix
- Hosts the ALICE and CMS Tier-2 sites

Project Architecture
[Architecture diagram: Syslog, Zabbix, HTCondor and OpenStack data enter through Flume sources, flow through Kafka, and are delivered by Flume sinks to InfluxDB, ElasticSearch, HDFS and Neo4j; collectors run on Mesos.]
Key words: Big Data, Open Source, Graph data modeling, Machine Learning

Project Architecture
Data sources:
- Syslog: system processes and service information; 2-3 million lines collected every day (~500 GB per year).
- Zabbix: computing resource usage, service and OpenStack monitoring; 40k+ values sampled every 10 minutes (8 GB since 19.07.2016).
- HTCondor: scheduler state, completed and running jobs; 750k+ values sampled every 5 minutes (35 GB since 19.07.2016).
- OpenStack: information on servers, images, flavors, volumes, network devices, ...; 50k+ values monitored every hour.

Project Architecture
Metric collectors:
- Syslog is ingested through the Apache Flume syslog source.
- Zabbix, HTCondor and OpenStack metrics are gathered by Python collectors packaged in Docker containers and executed periodically on Apache Mesos (see the sketch below).
Apache Flume: a distributed, highly reliable service for efficiently collecting, aggregating and moving large amounts of data.
Apache Mesos: an open-source project to manage computer clusters.
Docker: a program that performs operating-system-level virtualization, also known as containerization.
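As an illustrative sketch only (not the project's actual collector code), a periodic Python collector could poll a source and publish JSON records to Kafka with the kafka-python client; the collect_metrics() stub, topic name and broker address below are hypothetical:

    import json
    import time

    from kafka import KafkaProducer  # pip install kafka-python

    def collect_metrics():
        """Hypothetical stub: poll the Zabbix/HTCondor/OpenStack APIs here."""
        return [{"host": "wn-01", "metric": "cpu_load", "value": 0.42,
                 "timestamp": int(time.time())}]

    # Serialize each record as JSON before handing it to the brokers
    producer = KafkaProducer(
        bootstrap_servers="kafka:9092",         # assumed broker address
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    for record in collect_metrics():
        producer.send("metrics", value=record)  # assumed topic name
    producer.flush()

Run periodically (e.g. by a scheduler on Mesos), such a collector stays stateless: records persist in Kafka even if a downstream endpoint is temporarily down.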

Project Architecture
Transport layer: Kafka
- Decouples all components.
- Increases the high availability of the system.
Apache Kafka: an open-source stream-processing platform that provides a unified, high-throughput, low-latency way of handling real-time data feeds.

Project Architecture
Endpoints:
- Hadoop Distributed File System (HDFS): long-term storage.
- InfluxDB + Grafana: time-series dashboards.
- ElasticSearch + Kibana: log dashboards.

Project Architecture
Time-series dashboards (see the sink sketch below):
- InfluxDB: a custom high-performance data store written specifically for time-series data.
- Grafana: a dashboard builder for time-series data.
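A minimal sketch of such a sink, assuming the record layout of the collector sketch above and the InfluxDB 1.x Python client (broker address, topic and database name are assumptions):

    import json

    from kafka import KafkaConsumer      # pip install kafka-python
    from influxdb import InfluxDBClient  # pip install influxdb (1.x client)

    influx = InfluxDBClient(host="influxdb", port=8086, database="monitoring")

    consumer = KafkaConsumer(
        "metrics",
        bootstrap_servers="kafka:9092",
        value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    )

    # Turn each Kafka record into an InfluxDB point, tagged by host,
    # so Grafana can plot the metric per host
    for message in consumer:
        record = message.value
        influx.write_points([{
            "measurement": record["metric"],
            "tags": {"host": record["host"]},
            "fields": {"value": float(record["value"])},
        }])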

Project Architecture
Log dashboards (see the indexing sketch below):
- ElasticSearch: a search engine based on Lucene; it provides distributed full-text search over schema-free JSON documents.
- Kibana: an open-source data visualization plugin for ElasticSearch.
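On the log path, a hedged sketch of indexing one parsed syslog line into ElasticSearch with the official Python client (endpoint, index name and document layout are assumptions):

    from datetime import datetime, timezone

    from elasticsearch import Elasticsearch  # pip install elasticsearch

    es = Elasticsearch(["http://elasticsearch:9200"])  # assumed endpoint

    # One parsed syslog line as a schema-free JSON document
    doc = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "host": "wn-01",
        "process": "sshd",
        "message": "Accepted publickey for user from 10.0.0.1",
    }

    es.index(index="syslog", body=doc)  # Kibana dashboards query the "syslog" index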

Project Architecture
First benefit: a single, unified graphical interface for all monitoring data.

Project Architecture
Alarm dispatcher:
- Riemann: aggregates events from your servers and applications with a powerful stream-processing language.
- Processes and filters events.
- Plugins: Email, Slack (see the hedged event sketch below).
Information structure:
- Classical monitoring is not enough: relation information is needed (services, network, virtual-physical servers, ...).
- Stored data: OpenStack data, open connections, other monitoring data.
- Neo4j: high-performance native graph storage and processing.
- Advantage: eliminates the need for joins.
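As a hedged illustration of the alarm path, an application can push an event to Riemann from Python with the third-party bernhard client; Riemann streams can then filter on the state field and route the event to the Email or Slack plugins. The host, service, threshold and address below are invented:

    import bernhard  # pip install bernhard (a Python Riemann client)

    client = bernhard.Client(host="riemann", port=5555)  # assumed address

    # Riemann streams can match on service/state and dispatch email or Slack alerts
    client.send({
        "host": "wn-01",
        "service": "disk_usage",
        "metric": 97.5,
        "state": "critical",  # e.g. above an assumed 95% threshold
        "tags": ["recas-bari"],
    })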

Project Architecture
Information structure, subgraph example:
- Blue nodes: virtual machines.
- Yellow nodes: network interfaces.
- Red nodes: networks.
MATCH (s:server)-[:net_interface]->(n:nic)-[:connected]->(net:network)
WHERE s.name = 'frontend' AND net.name = 'public'
RETURN n.mac
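For illustration, the same query can be issued from Python with the official neo4j driver, parameterizing the server and network names (connection URI and credentials are assumptions):

    from neo4j import GraphDatabase  # pip install neo4j

    driver = GraphDatabase.driver("bolt://neo4j:7687", auth=("neo4j", "secret"))

    QUERY = """
    MATCH (s:server)-[:net_interface]->(n:nic)-[:connected]->(net:network)
    WHERE s.name = $server AND net.name = $network
    RETURN n.mac AS mac
    """

    # MAC of the NIC attaching the 'frontend' server to the 'public' network
    with driver.session() as session:
        for record in session.run(QUERY, server="frontend", network="public"):
            print(record["mac"])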

Project Architecture
Processing unit: Apache Spark, fed from Kafka and connected to Neo4j.
- Log analyzer (streaming and batch).
- Anomaly detection.
- Data correlation.
- Root Cause Analysis.
- Pattern recognition.
- Graph data modeling.
Apache Spark: a fast and general engine for large-scale data processing. A minimal streaming sketch follows.
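A minimal PySpark sketch of the streaming log-analyzer idea: read the syslog topic from Kafka and keep a running count of lines containing "error" (requires the spark-sql-kafka package; broker and topic names are assumptions):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("SyslogErrorCounter").getOrCreate()

    # Stream raw syslog lines from Kafka (the value column arrives as bytes)
    lines = (spark.readStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "kafka:9092")
             .option("subscribe", "syslog")
             .load()
             .selectExpr("CAST(value AS STRING) AS line"))

    # Keep only lines mentioning errors and maintain a running count
    errors = lines.filter(col("line").contains("error")).groupBy().count()

    query = (errors.writeStream
             .outputMode("complete")
             .format("console")
             .start())
    query.awaitTermination()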

Project Machine Learning Algorithms
Features:
- Adaptable to the data size.
- Combination of unsupervised and supervised algorithms (see the sketch below).
- Incremental learning.
- Knowledge sharing.
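As one concrete, hedged example of the unsupervised side, scikit-learn's IsolationForest can flag anomalous metric samples without labeled data; the feature matrix and contamination rate below are invented for illustration:

    import numpy as np
    from sklearn.ensemble import IsolationForest  # pip install scikit-learn

    # Hypothetical feature matrix: one row per sample, columns such as
    # CPU load, memory usage and I/O wait, plus a few injected outliers
    rng = np.random.RandomState(42)
    X = rng.normal(loc=0.5, scale=0.1, size=(1000, 3))
    X[::100] += 2.0

    model = IsolationForest(n_estimators=100, contamination=0.01,
                            random_state=42)
    model.fit(X)

    labels = model.predict(X)  # -1 = anomaly, +1 = normal
    print("anomalies flagged:", int((labels == -1).sum()))

A supervised classifier trained on operator-confirmed incidents could then refine these unsupervised flags, matching the combined approach described above.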

Next steps
- Migrate all components to Mesos.
- Improve the effectiveness of the Machine Learning algorithms.
- Root Cause Analysis algorithm.
- Integration with project management systems (OpenProject, Trello, ...).
- Actions.

Monitoring system for the ALICE O2 Facility
ALICE is a heavy-ion detector designed to study the physics of strongly interacting matter (the Quark-Gluon Plasma) at the CERN Large Hadron Collider (LHC).
During Long Shutdown 2, starting at the end of 2018, ALICE will be upgraded to fully exploit the increase in luminosity. The current computing systems (Data Acquisition, High-Level Trigger and Offline) will be replaced by a single, common O2 (Online-Offline) system. Some detectors will be read out continuously, without physics triggers.
The O2 Facility will compress 3.4 TB/s of raw data down to 100 GB/s of reconstructed data:
- 268 First Level Processors
- 500 Event Processing Nodes
Development of a monitoring system for the ALICE O2 Facility: a Modular Stack solution, with components and tools already used and tested in the MonGARR project (approved by the ALICE O2 TB last February).

Monitoring system for the ALICE O2 Facility
Requirements:
- Capable of handling the O2 monitoring traffic (600 kHz).
- Scalable well beyond 600 kHz.
- Low latency.
- Compatible with CentOS 7.
- Open source, well documented, actively maintained and supported by developers.
- Low storage footprint per measurement.
Goals:
- Real-time dashboards
- Historical dashboards
- Alarm dispatcher

Monitoring system for the ALICE O2 Facility
Architecture:
- Metric collection: Monitoring Library, CollectD
- Transport layer: Apache Flume
- Time-series database: InfluxDB
- Visualization interface: Grafana
- Alarming component: Riemann
- Processing component: Apache Spark

Monitoring system for the ALICE O2 Facility
Next steps:
- System validation using the TPC monitoring data, July 2018.
- New functionalities will be added (new streaming analyses, alarming, log analysis).
- System validation using the ITS monitoring data, December 2018.

THANKS FOR YOUR ATTENTION

Backup
Resource usage for the monitoring system:
- 80 CPUs
- 150 GB RAM
- 3 TB disk (1.5 TB for HDFS in replica 3, 600 GB for the Kafka nodes)
- Non-volatile virtual machine volumes

Backup
Resource usage for the most important hosts:
- ElasticSearch: 4 VCPUs, 8 GB RAM
- InfluxDB: 2 (8) VCPUs, 4 (16) GB RAM

Backup
Apache Mesos cluster:
- 3x Master (2 CPUs, 4 GB RAM, 20 GB disk)
- 2x Slave (4 CPUs, 8 GB RAM, 20 GB disk)
- 1x Load Balancer (2 CPUs, 4 GB RAM, 20 GB disk)
Frameworks: Chronos, Marathon, Spark

Backup
Modular Stack: Flume inner architecture