Survey of data formats and conversion tools


Jim Pivarski, Princeton University DIANA, May 23, 2017

The landscape of generic containers

By "generic" I mean file formats that define general structures that we can specialize for particular kinds of data, like XML and JSON, but here we're interested in binary formats with schemas for efficient numerical storage.

[Figure: a two-axis landscape of formats, "what it means" (flat tables vs. nested structures) against "how it's stored" (row-wise vs. columnar), placing Numpy (recarray, Fortran-order), HDF5 (arrays of numbers, var-length of compounds), TNtuple, SQL '99, ROOT (unsplit and split TTree), SQL '03, Avro/Thrift/ProtoBuf, Parquet (on disk), and Arrow (in memory).]

ROOT: can do anything with the right settings.

HDF5: stores block-arrays well, good for flat ntuples; can use variable-length arrays of compounds to store e.g. lists of particles, but not in an efficient, columnar way (?).

Numpy: usually in-memory, but also has an efficient file format; only for block-arrays (it can hold Python objects, but not efficiently).

Avro et al.: interlingual, binary, deep structures with schemas; row-wise storage is best for streaming and RPC.

Parquet: an extension of the Avro et al. idea with columnar storage, intended for databases and fast querying.

Arrow: an in-memory extension of Parquet, intended for zero-copy communication among databases, query servers, analysis frameworks, etc.
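
To make the flat-vs-nested distinction concrete, here is a minimal sketch (not from the slides, assuming numpy and a recent pyarrow are installed) that stores toy events once as a flat structured array and once as a nested, columnar Parquet table; the field names (met, njets, jets, pt, eta) and file names are illustrative only.

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Flat table: one row per event, fixed-width columns (a Numpy structured array / recarray).
flat = np.array([(56.2, 2), (31.9, 0), (98.1, 3)],
                dtype=[("met", "f4"), ("njets", "i4")])
np.save("events_flat.npy", flat)  # Numpy's own efficient file format

# Nested structure: a variable-length list of jet records per event,
# written columnar on disk as Parquet.
jets = pa.array(
    [[{"pt": 45.0, "eta": 0.3}, {"pt": 22.1, "eta": -1.2}],
     [],
     [{"pt": 80.5, "eta": 2.0}]],
    type=pa.list_(pa.struct([("pt", pa.float32()), ("eta", pa.float32())])))
table = pa.Table.from_pydict({"met": pa.array([56.2, 31.9, 98.1], type=pa.float32()),
                              "jets": jets})
pq.write_table(table, "events_nested.parquet")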

Is there a performance penalty?

Formats differ most in how nested structure is represented:

Avro: whole records are contiguous.
ROOT: each leaf is contiguous, with list sizes in a separate array.
Parquet: each leaf is contiguous, with depth encoded in repetition levels.

Nevertheless, the differences (with gzip/deflate) are only about 15%. Use-cases may have more variation than the choice of format.

Speed depends more on the runtime representation than on the file format. For example, Avro's C library loads into its custom C objects in 113 sec; Avro's Java library does it in 8.3 sec! But if Avro's C library reads through the same row-wise data and fills minimalist objects, it takes 5.4 sec.

Benchmark: 47,407 ttbar Monte Carlo events in TClonesArrays or variable-length lists of custom classes. ROOT 6.06, Avro 1.8.1, Parquet 1.8.1.

format    compression   MB    relative
ROOT      none          399   1.96
ROOT      gzip 1        204   1.00
ROOT      gzip 2        208   1.02
ROOT      gzip 9        202   0.99
Avro      none          237   1.16
Avro      snappy        198   0.97
Avro      deflate       180   0.88
Avro      LZMA          169   0.83
Parquet   none          210   1.03
Parquet   snappy        200   0.98
Parquet   gzip          176   0.86

(I don't have a grand study of all formats.)
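
To illustrate the layout difference, here is a small sketch (not from the slides) that lays out the same jagged data both ways in plain Python; the Parquet part is a simplified view of its Dremel-style encoding for a single nesting level with no nulls.

events = [[1.1, 2.2], [], [3.3]]

# ROOT-style: flattened contents plus a separate array of list sizes.
sizes = [len(e) for e in events]               # [2, 0, 1]
contents = [x for e in events for x in e]      # [1.1, 2.2, 3.3]

# Parquet-style: flattened values plus repetition levels
# (0 = starts a new event, 1 = continues the current list) and
# definition levels (0 marks an empty list, which contributes no value).
values, rep, deflev = [], [], []
for e in events:
    if not e:
        rep.append(0); deflev.append(0)
    else:
        for i, x in enumerate(e):
            rep.append(0 if i == 0 else 1)
            deflev.append(1)
            values.append(x)

print(sizes, contents)          # [2, 0, 1] [1.1, 2.2, 3.3]
print(values, rep, deflev)      # [1.1, 2.2, 3.3] [0, 1, 0, 0] [1, 1, 0, 1]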

What matters is what you'll use it with

ROOT is the best way to access petabytes of HEP data and use tools developed in HEP.
HDF5 is the best way to use tools developed in other sciences, particularly R, MATLAB, and HPC.
Numpy is the best way to use the scientific Python ecosystem, particularly recent machine learning software.
Avro et al. is the best way to use the Hadoop ecosystem, particularly streaming frameworks like Storm.
Parquet is the best way to use database-like tools in the Hadoop ecosystem, such as SparkSQL.
Arrow is in its infancy, but is already a good way to share data between Python (Pandas) DataFrames and R DataFrames.
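
As one concrete example of the last point, a minimal sketch (not from the slides, assuming pandas with Feather support installed) of handing a DataFrame to R through the Arrow-based Feather format; the file name is illustrative.

import pandas as pd

df = pd.DataFrame({"met": [56.2, 31.9, 98.1], "njets": [2, 0, 3]})
df.to_feather("events.feather")   # Arrow-based Feather file on disk

# In R, the same file can then be read with, e.g.:
#   library(feather)
#   df <- read_feather("events.feather")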

Conversions: getting from here to there

[Figure: a graph of conversion tools linking ROOT, Numpy, Python (Pandas) DataFrames, HDF5, SparkSQL DataFrames, R DataFrames, Avro/Thrift/ProtoBuf, Parquet, and Arrow/Feather. The edges include pandas, PyTables, root_numpy, root_pandas, rootpy, ROOTR, rpy2, rhdf5, PySpark, h5spark, spark-root, SparkR, fastparquet, rootconverter, cyavro, rmr2, spark-avro, parquet-mr, and arrow-cpp.]

Zooming in on the ROOT conversions:

root_numpy (ROOT to Numpy): mature! use it!
root_pandas (ROOT to Pandas DataFrames): GitHub repo; uses root_numpy.
ROOTR (ROOT to R DataFrames): part of ROOT.
rootpy (ROOT to HDF5): uses root_numpy and PyTables.
spark-root (ROOT to SparkSQL DataFrames): pure JVM code.
rootconverter (ROOT to Avro, Thrift, ProtoBuf): unmaintained (by me); can be scavenged; a Google Summer of Code project.
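
For instance, a minimal sketch (not from the slides) of the root_numpy and root_pandas conversions; the file name, tree name, and branch names are hypothetical.

from root_numpy import root2array      # ROOT -> Numpy structured array
from root_pandas import read_root      # ROOT -> Pandas DataFrame (built on root_numpy)

arr = root2array("events.root", treename="Events", branches=["met"])
df = read_root("events.root", key="Events", columns=["met"])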

ROOT is now a Spark DataFrame format

Launch Spark with JARs from Maven Central (zero install):

pyspark --packages org.diana-hep:spark-root_2.11:0.1.11

Access the ROOT files as you would any other DataFrame:

df = sqlContext.read \
         .format("org.dianahep.sparkroot") \
         .load("hdfs://path/to/files/*.root")
df.printSchema()

root
 |-- met: float (nullable = false)
 |-- muons: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- pt: float (nullable = false)
 |    |    |-- eta: float (nullable = false)
 |    |    |-- phi: float (nullable = false)
 |-- jets: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- pt: float (nullable = false)

Note the abstract type system!
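
Once loaded, the DataFrame can be queried like any other. A hedged sketch (not from the slides), using the column names from the schema above:

from pyspark.sql import functions as F

(df.where(F.size("muons") >= 2)                       # events with at least two muons
   .select("met", F.size("jets").alias("njets"))
   .groupBy("njets")
   .agg(F.avg("met").alias("avg_met"))
   .show())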

[Slides 20 to 23 have no transcribed text beyond the credits: Tony Johnson (SLAC) and Viktor Khristenko (University of Iowa).]

ROOT I/O implementations

As far as I'm aware, root4j is one of only five ROOT-readers:

standard ROOT (C++)
JsRoot (JavaScript): for web browsers
root4j (Java): can't link Java and ROOT
RIO in GEANT (C++): to minimize dependencies
go-hep (Go): to use Go's concurrency

Most file formats are fully specified in the abstract and then re-implemented in dozens of languages. But ROOT is much more complex than most file formats.

Dreaming...

Serialization mini-languages: DSLs that transform byte sequences to and from abstract data types.

Construct: http://construct.readthedocs.io
PADS: http://pads.cs.tufts.edu
PacketTypes, DataScript, the "Power of Pi" paper...

Even if these languages are too limited to implement ROOT I/O, could it be implemented in a translatable subset of a common language? That is, a disciplined use of minimal language features, allowing for automated translation to C, JavaScript, Java, Go...?
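
To show the flavor of such a mini-language, here is a minimal sketch (not from the slides, assuming the Construct library with its 2.8-style API); the record layout is hypothetical.

from construct import Struct, Int32ul, Float32l, Array, this

# Hypothetical record: a 32-bit count followed by that many little-endian floats.
record = Struct(
    "n" / Int32ul,
    "values" / Array(this.n, Float32l),
)

blob = record.build(dict(n=3, values=[1.0, 2.0, 3.0]))   # data -> bytes
parsed = record.parse(blob)                              # bytes -> data
print(parsed.n, list(parsed.values))                     # 3 [1.0, 2.0, 3.0]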