THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES

Vincent Garonne, Mario Lassnig, Martin Barisits, Thomas Beermann, Ralph Vigne, Cedric Serfon
Vincent.Garonne@cern.ch, ph-adp-ddm-lab@cern.ch
XLDB 2013 Workshop Europe

Overview
- Distributed Data Management (DDM) overview
- Architecture
- Relational Database Management System (RDBMS) & DDM experiences
- NoSQL & DDM: ATLAS DDM use cases, operational experiences
- Present & future: Rucio
- Conclusions

DDM Background
- The Distributed Data Management project manages ATLAS data on the grid.
- The current system is Don Quijote 2 (DQ2): 144 petabytes, 600k datasets, 450 million files, 800 active users, 130 sites.
- [Plot: total grid usage (PB) over time, covering data taking and Long Shutdown 1 (LS1).]

DDM Implementation: DQ2
- In production since 2004; many updates over 2007-2011; a common modular framework.
- [Architecture diagram: the production system, the analysis system, data export, end users and the physics meta-data system interact through DQ2 clients with the DQ2 Central Services (repository, content, accounting, notification, replica, subscription, popularity, info & conf, monitoring, deletion, consistency), backed by an Oracle database, and with the Site Services (transfer), built on WLCG middleware: FTS, SRM, LFC, VOMS. Grids served: LHC Computing Grid, Open Science Grid, NorduGrid.]
- RDBMS: Oracle is a critical dependency — proven technology, expertise at CERN, great for enforcing data integrity, and the tool of choice for online transaction processing (OLTP) applications.

Oracle & DDM
- Constant optimization over the years has allowed us to scale: avg. 300 Hz, physical writes 7 MB/s, physical reads 30 MB/s.
- Improvements in I/O, concurrency, front-end deployment, time partitioning, tablespace management, denormalization for performance, etc.
- Developing and maintaining a high-availability service with Oracle requires good knowledge and expertise: bind variables, index hints, the Oracle optimizer, PL/SQL, execution plans, etc.

NoSQL & DDM
- Data warehousing use cases & applications are relevant for NoSQL: lots of data, no transactions and relaxed consistency, multi-dimensional queries.
- Three technologies evaluated: MongoDB, Cassandra, Hadoop.
- Many more are available, but these were chosen with the following in mind: a large community and a wide installed base, in production at several larger companies with respectable data sizes, and potential commercial support.

Data Models
- MongoDB: explicit row key, native datatypes, everything indexable.
- Cassandra: implicit row keys, data is byte streams, column families group the row keys.
- HBase: implicit row key, data is byte streams, row keys group the column families, row keys are sorted.
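The three models can be caricatured as plain Python structures (the field names and values are invented for illustration):

```python
# MongoDB: explicit row key (_id), native datatypes, every field indexable.
mongo_doc = {"_id": "file-001", "bytes": 1024, "site": "CERN"}

# Cassandra: implicit row keys, values are raw byte streams, and a column
# family ("traces" here) groups the row keys.
cassandra_cf = {"traces": {b"file-001": {b"bytes": b"1024", b"site": b"CERN"}}}

# HBase: implicit row key, values are byte streams, row keys group the
# column families, and the row keys are kept sorted.
hbase_table = {b"file-001": {b"cf:bytes": b"1024", b"cf:site": b"CERN"}}
```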

Technology Selection

                            MongoDB                Cassandra                         Hadoop/HBase
Installation/configuration  Download, unpack, run  Download, unpack, configure, run  Distribution, complex config
Buffered read (256 B)       250 000/sec            180 000/sec                       150 000/sec
Random read (256 B)         20 000/sec             20 000/sec                        20 000/sec
Relaxed write (256 B)       10 000/sec             19 000/sec                        9 000/sec
Durable write (256 B)       2 500/sec              9 000/sec                         6 000/sec
Analytics                   Limited MapReduce      Hadoop MapReduce                  MapReduce, Pig, Hive
Durability support          Full                   Full                              Full
Native API                  Binary JSON            Java                              Java
Generic API                 None                   Thrift                            Thrift, REST

Technology: Hadoop
Hadoop is a framework for distributed data processing (not only a database) with many components: HDFS (distributed filesystem), MapReduce (distributed processing of large data sets), HBase (distributed database for structured storage), Hive (SQL frontend), Pig (dataflow language for parallel execution), ...

Use Cases: Log File Aggregation
- HDFS is mounted as a POSIX filesystem via FUSE.
- Daily copies of all the ATLAS DDM log files are aggregated in a single place: 8 months of logs accumulated, already using 3 TB of space on HDFS.
- Python MapReduce jobs analyse the log files via the Streaming API: read from stdin, write to stdout.
- Processing the data takes about 70 minutes, with average I/O at 70 MB/s.
- Potential for a 15% performance increase if rewritten in pure Java: better read patterns and reduced temporary network usage.
- [Diagram: Apache servers write log files through FUSE into HDFS, where MapReduce jobs process them.]
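As a sketch of what such a Streaming job looks like — the log-line format and the error-counting task here are assumptions for illustration, not the actual DDM analysis:

```python
import sys

def map_line(line):
    """Map phase: emit (date, 1) for each ERROR entry.
    Assumes lines shaped like 'YYYY-MM-DD HH:MM:SS LEVEL message...'."""
    parts = line.split()
    if len(parts) >= 3 and parts[2] == "ERROR":
        return parts[0], 1
    return None

def reduce_counts(pairs):
    """Reduce phase: sum the counts per key."""
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return totals

def run_mapper(stdin=sys.stdin, stdout=sys.stdout):
    """Hadoop Streaming contract: read records from stdin and write
    tab-separated key/value pairs to stdout."""
    for line in stdin:
        kv = map_line(line.rstrip("\n"))
        if kv is not None:
            stdout.write(f"{kv[0]}\t{kv[1]}\n")
```

Hadoop then sorts the mapper output by key and feeds it to a reducer script built on the same `reduce_counts` logic.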

Use Cases: Trace Mining
- Client interaction with ATLAS DDM generates traces, e.g. downloading a dataset/file from a remote site: lots of information (25 attributes), time-based.
- One month of traces is 80 GB uncompressed, 25 GB compressed, and can be MapReduced in under 2 minutes.
- Implemented in HBase as distributed atomic counters, at various granularities (minutes, hours, days); the size of the HBase tables is negligible, with an average rate of 300 insertions/s.
- Previously developed in Cassandra; migrated from Cassandra within 2 days thanks to an almost identical column-based data model.
- We get the extra Hadoop benefits for free (a mature ecosystem with many tools); the single Cassandra benefit, high availability, was implemented in Hadoop.
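A minimal sketch of such per-granularity counters, assuming a happybase-style `counter_inc` wrapper around HBase's atomic increment; the row-key layout and column name are invented for illustration, not the production schema:

```python
from datetime import datetime

def counter_keys(event, ts):
    """Row keys for the counters at minute, hour and day granularity,
    laid out as 'event/timestamp' (an illustrative convention)."""
    return [
        f"{event}/{ts.strftime('%Y-%m-%d %H:%M')}",  # minute
        f"{event}/{ts.strftime('%Y-%m-%d %H')}",     # hour
        f"{event}/{ts.strftime('%Y-%m-%d')}",        # day
    ]

def record_trace(table, event, ts):
    """Bump the distributed atomic counters for one trace. `table` is any
    object exposing an HBase-style atomic increment (e.g. happybase's
    Table.counter_inc); HBase serializes concurrent increments server-side,
    so no client read-modify-write cycle is needed."""
    for key in counter_keys(event, ts):
        table.counter_inc(key.encode(), b"t:count")
```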

Use Cases: DQ2Share
- HTTP cache for dataset/file downloads: downloads via the ATLAS DDM tools land on HDFS and are served via Apache.
- Gets all the features of HDFS for free, i.e. one large, reliable disk pool.

Use Cases: Accounting
- Breaks down the usage of ATLAS data contents.
- Historical free-form meta-data queries, e.g. {site, nbfiles, bytes} := {project=data10*, datatype=esd, location=cern*}
- Non-relational periodic summaries; a full accounting run takes about 8 minutes.
- A Pig data pipeline creates the MapReduce jobs: 7 GB of input data, 100 MB of output data.
- Flow: periodic Oracle snapshot → HDFS → Pig → publish via Apache. (Come and see the poster.)
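The accounting query above can be sketched in pure Python — the replica row layout and field names are assumptions for illustration; the production run is the Pig pipeline over the Oracle snapshot on HDFS:

```python
from fnmatch import fnmatch

def account(replicas, **patterns):
    """Sum files and bytes per site over the replica rows whose attributes
    match the given wildcard patterns, e.g.
    account(rows, project='data10*', datatype='esd', location='cern*')."""
    summary = {}
    for row in replicas:
        if all(fnmatch(str(row.get(attr, "")), pattern)
               for attr, pattern in patterns.items()):
            site = summary.setdefault(row["location"], {"nbfiles": 0, "bytes": 0})
            site["nbfiles"] += row["nbfiles"]
            site["bytes"] += row["bytes"]
    return summary
```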

Operational Experiences: Hadoop
- Cloudera distribution: tests and packages the Hadoop ecosystem; straightforward installation via Puppet/YUM.
- But the configuration was not so obvious: many parameters and extensive documentation, but bad default performance.
- Disk failure is common and cannot be ignored: data-centre annual disk replacement rates run up to 13% (Google & CMU, 2011).
- Within one year we had 5 disk failures (a 20% failure rate!), out of which 3 happened at the same time, plus 1 mainboard failure coinciding with a disk failure on another node.
- Worst-case scenario experienced so far: 4 nodes out of 12 dead within a few minutes. Hadoop reported the erroneous nodes, blacklisted them and resynced the remaining ones; no manual intervention was necessary and nothing was lost (though one node is still causing trouble — RAID controller? — the cluster is not affected).

The Next DDM Version: Rucio
- DQ2 will simply not continue to scale: order(s) of magnitude more data with significantly higher trigger rates and event pileup (after LS1), extensibility issues, and computing model and middleware changes.
- Rucio is the new major version, anticipated for 2014, to ensure system scalability, remove unused concepts and support new use cases.
- Rucio exploits commonalities between experiments (not just LHC) and other data-intensive sciences (e.g. cloud/Big Data computing).
- Rucio & databases:
  - Relational database management system: the SQLAlchemy Object Relational Mapper (ORM) supports Oracle, SQLite, MySQL, ... Use cases: real-time data and transactional consistency. Tests with representative data volumes are ongoing, as is an evaluation of synchronization strategies between DQ2 and Rucio.
  - Non-relational structured storage (Hadoop). Use cases: search functionality, real-time stats, monitoring, meta-data, complex analytical reports over large volumes.
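A minimal sketch of the ORM approach with SQLAlchemy's declarative mapping: the same mapped class runs against Oracle, MySQL or (here, for testing) an in-memory SQLite database, with only the engine URL changing between backends. The table and column names below are illustrative, not the actual Rucio schema:

```python
from sqlalchemy import BigInteger, Column, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Dataset(Base):
    """Illustrative Rucio-style mapped table (not the real schema)."""
    __tablename__ = "datasets"
    scope = Column(String(30), primary_key=True)
    name = Column(String(255), primary_key=True)
    bytes = Column(BigInteger)

# Swapping "sqlite://" for an oracle:// or mysql:// URL is the only change
# needed to target another backend.
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Dataset(scope="data10_7TeV", name="period.A.raw", bytes=1024))
    session.commit()
    stored = session.query(Dataset).count()
```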

Conclusions
- ATLAS Distributed Data Management delivered a working system to the collaboration in time for LHC data taking, relying heavily on Oracle.
- NoSQL: the DDM use cases are well covered, and Hadoop proved to be the correct choice: stable, reliable, fast and easy to work with.
- We see Hadoop as complementary to an RDBMS, not as a replacement.
- The DDM team is currently developing and (stress-)testing a new DDM version called Rucio, anticipated for 2014, using both SQL and NoSQL.

Thanks! http://rucio.cern.ch