Analytics Platform for ATLAS Computing Services

Similar documents
Big Data Analytics Tools. Applied to ATLAS Event Data

Monitoring for IT Services and WLCG. Alberto AIMAR CERN-IT for the MONIT Team

Big Data Tools as Applied to ATLAS Event Data

Elasticsearch & ATLAS Data Management. European Organization for Nuclear Research (CERN)

Data services for LHC computing

Network Analytics. Hendrik Borras, Marian Babik IT-CM-MM

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES

Table 1 The Elastic Stack use cases Use case Industry or vertical market Operational log analytics: Gain real-time operational insight, reduce Mean Ti

Monitoring of large-scale federated data storage: XRootD and beyond.

Python Quant Platform

Conference The Data Challenges of the LHC. Reda Tafirout, TRIUMF

MixApart: Decoupled Analytics for Shared Storage Systems. Madalin Mihailescu, Gokul Soundararajan, Cristiana Amza University of Toronto and NetApp

WLCG Transfers Dashboard: a Unified Monitoring Tool for Heterogeneous Data Transfers.

Monitoring system for geographically distributed datacenters based on Openstack. Gioacchino Vino

Striped Data Server for Scalable Parallel Data Analysis

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Machine Learning Software ROOT/TMVA

Big Data Infrastructure at Spotify

The ATLAS Distributed Analysis System

Scheduling Computational and Storage Resources on the NRP

Introduction to Hadoop and MapReduce

2014 年 3 月 13 日星期四. From Big Data to Big Value Infrastructure Needs and Huawei Best Practice

CouchDB-based system for data management in a Grid environment Implementation and Experience

Scientific data processing at global scale The LHC Computing Grid. fabio hernandez

From raw data to new fundamental particles: The data management lifecycle at the Large Hadron Collider

Delving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture

WLCG Network Throughput WG

SQL Server Machine Learning Marek Chmel & Vladimir Muzny

The ATLAS Software Installation System v2 Alessandro De Salvo Mayuko Kataoka, Arturo Sanchez Pineda,Yuri Smirnov CHEP 2015

Evolution of the ATLAS PanDA Workload Management System for Exascale Computational Science

Storage for HPC, HPDA and Machine Learning (ML)

Building a Scalable Recommender System with Apache Spark, Apache Kafka and Elasticsearch

Data Transfers Between LHC Grid Sites Dorian Kcira

Certified Data Science with Python Professional VS-1442

Deep Learning Frameworks with Spark and GPUs

Streamlining CASTOR to manage the LHC data torrent

Big Data Hadoop Stack

Nowcasting. D B M G Data Base and Data Mining Group of Politecnico di Torino. Big Data: Hype or Hallelujah? Big data hype?

National College of Ireland Project Submission Sheet 2015/2016 School of Computing

The Wuppertal Tier-2 Center and recent software developments on Job Monitoring for ATLAS

Next-Generation Cloud Platform

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Introduction to MapReduce (cont.)

Data Analytics and Storage System (DASS) Mixing POSIX and Hadoop Architectures. 13 November 2016

C3PO - A Dynamic Data Placement Agent for ATLAS Distributed Data Management

SLATE. Services Layer at the Edge. First Meeting of the National Research Platform Montana State University August 7-8, 2017

AGIS: The ATLAS Grid Information System

Hadoop Online Training

Modern Data Warehouse The New Approach to Azure BI

About Intellipaat. About the Course. Why Take This Course?

Qunar Performs Real-Time Data Analytics up to 300x Faster with Alluxio

ATLAS Analysis Workshop Summary

The Hadoop Ecosystem. EECS 4415 Big Data Systems. Tilemachos Pechlivanoglou

Big Data Architect.

Accelerating Hadoop Applications with the MapR Distribution Using Flash Storage and High-Speed Ethernet

Construct a High Efficiency VM Disaster Recovery Solution. Best choice for protecting virtual environments

ANSE: Advanced Network Services for [LHC] Experiments

Quobyte The Data Center File System QUOBYTE INC.

FLORIDA DEPARTMENT OF TRANSPORTATION PRODUCTION BIG DATA PLATFORM

Monitoring and Analytics With HTCondor Data

ArcGIS Enterprise: Architecture & Deployment. Anthony Myers

2013 AWS Worldwide Public Sector Summit Washington, D.C.

Kibana, Grafana and Zeppelin on Monitoring data

Specialist ICT Learning

New Oracle NoSQL Database APIs that Speed Insertion and Retrieval

Storage Virtualization. Eric Yen Academia Sinica Grid Computing Centre (ASGC) Taiwan

StorMagic SvSAN: A virtual SAN made simple

Activator Library. Focus on maximizing the value of your data, gain business insights, increase your team s productivity, and achieve success.

Cluster Setup and Distributed File System

Spotfire Data Science with Hadoop Using Spotfire Data Science to Operationalize Data Science in the Age of Big Data

Apache Hadoop 3. Balazs Gaspar Sales Engineer CEE & CIS Cloudera, Inc. All rights reserved.

Science-as-a-Service

BIG DATA COURSE CONTENT

Delivering unprecedented performance, efficiency and flexibility to modernize your IT

VMWARE VREALIZE OPERATIONS MANAGEMENT PACK FOR. NetApp Storage. User Guide

Introducing SUSE Enterprise Storage 5

VMWARE VREALIZE OPERATIONS MANAGEMENT PACK FOR. Nutanix. User Guide

Data Analytics Job Guarantee Program

VMWARE VREALIZE OPERATIONS MANAGEMENT PACK FOR. Nagios. User Guide

@Pentaho #BigDataWebSeries

Monitoring WLCG with lambda-architecture: a new scalable data store and analytics platform for monitoring at petabyte scale.

Oracle Big Data Discovery

Challenges of Big Data Movement in support of the ESA Copernicus program and global research collaborations

Keras: Handwritten Digit Recognition using MNIST Dataset

MapR Enterprise Hadoop

Big Data Analytics. Description:

Evolution of the Logging Service Hands-on Hadoop Proof of Concept for CALS-2.0

Apache Flink: Distributed Stream Data Processing

Deep Dive on SimpliVity s OmniStack A Technical Whitepaper

Introducing Oracle Machine Learning

Achieving Horizontal Scalability. Alain Houf Sales Engineer

TECHNICAL OVERVIEW OF NEW AND IMPROVED FEATURES OF EMC ISILON ONEFS 7.1.1

CERN openlab & IBM Research Workshop Trip Report

Connecting ArcGIS with R and Conda. Shaun Walbridge

Service Oriented Performance Analysis

The webinar will start soon... Elasticsearch Performance Optimisation

Autopsy as a Service Distributed Forensic Compute That Combines Evidence Acquisition and Analysis

Worldwide Production Distributed Data Management at the LHC. Brian Bockelman MSST 2010, 4 May 2010

Hadoop. Introduction / Overview

Intel Distribution for Python* и Intel Performance Libraries

Transcription:

Analytics Platform for ATLAS Computing Services Ilija Vukotic for the ATLAS collaboration ICHEP 2016, Chicago, USA

Getting the most from distributed resources What we want To understand the system To understand interplay of different systems and services Debug systems Run simulation, test models Alerts, sensing services What we need A way to easily get global picture Collect all the data at one place. Be able to cross-reference. Ability to drill down to the most detailed information Programmatic access to all the data Continuously / periodically running services operating on raw / derived information fast enough for a real time feedback

Year ago Data in different storage backends. Mostly Oracle, access closed off. Hard to combine. Oracle data mining tools not used. Each dashboard handmade, each different. Hard to understand/support. Takes days/weeks to get a plot added.

Architecture Main functions Acquisition, filtering and upload of data sources into a repository Hadoop cluster for analysis of multiple data sources to create reduced collections for higher level analytics Serve repository collections in multiple formats to external clients Makes collected sources available for export by external users Host analytics services on the platform such as ElasticSearch, Logstash, Kibana, etc.

Data sources PanDA - a data-driven workload management system for production and distributed analysis processing Rucio - a Distributed Data Management system used to manage accounts, files, datasets, and distributed storage systems. FAX - Federated ATLAS storage system using XRootD protocol. Provides a global namespace, direct access to data from anywhere. PerfSONAR - a widely-deployed test and measurement infrastructure that is used by science networks and facilities around the world to monitor and ensure network performance. FTS - File Transfer Service - the lowest-level data movement service doing point-topoint file transfers. xaod - primary analysis data product.

Resources aianalytics cluster 5 VM nodes Mainly for data ingestion Analytix Hadoop cluster Base load map-reduce Periodic indexing jobs Elasticsearch clusters Runs Kibana Backend for Jupyter Google BigTable Extreme performance Investigating cost Ilija Vukotic, ICHEP 2016, Chicago, USA

Central Flume Collector Listens for JSON messages (from multiple WLCG or ATLAS services) events multiplexed into different memory channels based on header content, and sent to log files and/or HDFS for analysis and/or ElasticSearch for indexing Panda logs FAX summary monitor source channel 1 channel 2 HDFS log files channel 3 ES Rucio logs WAN monitoring

records / s Ilija Vukotic, ICHEP 2016, Chicago, USA

Jupyter an interactive computing environment that enables users to author notebook documents that include: live code, plots, interactive widgets, documentation At University of Chicago Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz x2 Tesla K20c x2 128GB RAM, 4.5TB RAID-5 for /scratch, 1TB RAID-1 for / At CERN SWAN (Service for Web based Analysis) Currently in Beta Code in CERNBox Full set of ML packages installed Root numpy Numpy Scipy Matplotlib Pandas XGboost Scikit Learn Scikit images Theano TensorFlow Keras H5py BLAS / ATLAS / LAPACK libraries Cuda and CuDNN ZLib Ilija Vukotic, ICHEP 2016, Chicago, USA

Data sizes Hadoop-based collections Rucio (42 TB) PanDA Job Archive (2.4 TB) PanDA State change logs (0.5TB) PanDA Logs (4.5 TB) Network data(2-3 GB/hour) FAX cost matrix, traces (20 GB) In Elasticsearch

Analytics Infrastructure - lessons learned Need really good hardware - lot of RAM, SSD caching, large disks Elasticsearch backups while most of the data could be re-indexed, some exists only in ES. Non-negligible learning curve (MR, pig, java, jython) - need a lot of documentation, support, education. It is clear that Elasticsearch indexed data have much more use. Try to index as much as possible of the data.

Monitoring Kibana Very fast, easy to use. A bit counterintuitive to a physicist. Easy to create custom visualizations. Easy to drill-down to even individual records. Can do simple arithmetic only. Limited set of plot types and options. Embeddable visualizations. Ilija Vukotic, ICHEP 2016, Chicago, USA

Monitoring Kibana Rucio account activity

Analytics control center FTS transfers Inter-site network Jobs efficiencies, users, resource utilization Analysis errors Analytics cluster overview PerfSonar Overflow jobs DDM space tokens, biggest users Ilija Vukotic, ICHEP 2016, Chicago, USA

In-depth analysis Examples: Predicting a network link performance Testing different alerting models Modeling data replication algorithms Testing caching models Requirements: order of ten users no expectation of immediate results need to go through much more data than monitoring/alerting new tools/frameworks different hardware (more memory, GPUs) Ilija Vukotic, ICHEP 2016, Chicago, USA

In-depth analysis example xaod accesses analysis Ilija Vukotic, ICHEP 2016, Chicago, USA

Run a model In ES we have detailed monitoring data for all remote data accesses: what file was accessed from where, by whom, how much was read. We use historical data to run different models and parameters of data caching (cache size, cache block size, high/low watermarks, ) and calculate cache hit rate, turnover, etc. Files accessed: 58782 Cache hits: 8828 Data delivered: 39.98 TB Delivered from cache: 3.47 TB

Jupyter-based analyses Understanding event timings for different Processing steps (generation, simulation, reconstruction, ) CPU types Sites Memory usage CPU efficiencies User analysis patterns Network closeness

A network weather service using ES Data sources: PERFSonar - Throughput, latencies, and packet loss data come from OSG network datastore, collected from AMQ. FAX cost Measures remote data access from remote storages to computing site s worker nodes. FTS File transfer service gives all info for each file transferred. Services: Alerts on changes in network performance Prediction of future network performance Simple REST interface and python API ES is able to deliver searches in <100ms @ 100Hz Not limited to ATLAS Ilija Vukotic, ICHEP 2016, Chicago, USA

Machine learning for network awareness optional

Conclusions The new stack of Big Data tools (Hadoop, Flume, Sqoop, Logstash, pig, ES, Kibana, Jupyter) provide a great platform for ATLAS Distributed Computing analytics tasks: Horizontally scalable Performant Simple to develop for Easy to use for an analyzer Fast to make custom dashboards, GUIs Can replace other full fledged services (alerting, reporting, ) Great backend for in-depth analysis, ML