ORC Files
Owen O'Malley (owen@hortonworks.com, @owen_omalley)
June 2013

Who Am I?
- First committer added to Hadoop in 2006
- First VP of Hadoop at Apache
- Was architect of MapReduce
- Was architect of Security
- Broke the terasort, minute sort, and petasort records
- Working on Hive for the last 1.5 years
  - Working on Hive code
  - Deep engagement with a large Hive user

History
- Originally Hive used MapReduce's FileFormats
  - Text
  - SequenceFile
- RCFile was developed to improve matters
  - Only read required columns (projection)
  - Each column is stored separately
  - Compresses columns separately
  - Avoid deserializing extra columns, e.g. select * from people where age > 99
  - Lazy deserialization

Remaining Challenges
- Stores columns as type-free binary blobs
  - Can't optimize representation by type
  - Can't keep meaningful indexes
- Complex types were not decomposed
- Seeking into the file requires byte-by-byte search
  - Need to find the sync marker
- Compression done using stream-based codecs
  - Decompress the column from the start of the group
- Metadata is added at the beginning of the file
- Lazy deserialization is very slow

Requirements
- Two primary goals
  - Improve query speed
  - Improve storage efficiency
- Support for the Hive type model
  - Complex types (struct, list, map, union)
  - New types (datetime, decimal)
- A single file as output of each task
  - Dramatically simplifies integration with Hive
  - Lowers pressure on the NameNode
- Bound the amount of memory required

File Structure
- Break the file into sets of rows called stripes
  - Default stripe size is 256 MB
  - Large size enables efficient reads of columns
- Footer
  - Contains list of stripe locations
  - Self-describing types
  - Count, min, max, and sum for each column
- Postscript
  - Contains compression parameters
  - Size of compressed footer

Stripe Structure
- Data
  - Composed of multiple streams per column
- Index
  - Required for skipping rows
  - Defaults to every 10,000 rows
  - Position in each stream
  - Min and max for each column
- Footer
  - Directory of stream locations
  - Encoding of each column

File Layout
[Diagram: the file is a sequence of 256 MB stripes; each stripe holds Index Data, Row Data (one or more streams per column, with one column expanded to show its streams, Stream 2.1 through Stream 2.4), and a Stripe Footer; the File Footer and Postscript sit at the end of the file.]

Compression
- Light-weight compression
  - Type specific (int, string, timestamp, etc.)
  - Always enabled
  - Focus on being very inexpensive
- Generic compression
  - Support for none, Snappy, LZO, and Zlib
  - The entire file uses the same compression codec
  - Applied to each stream and footer
  - Applied after the light-weight compression

Integer Column Serialization
- Protobuf-style variable length integers
  - Small integers have small representations
- Run length encoding
  - Uses a byte for the run length
  - Runs can include a very small fixed delta
  - Non-runs use only 1 extra byte per 128 values
- Integer columns have two streams:
  - Present bit stream: is the value non-null?
  - Data stream: the stream of integers
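As a rough illustration (the concrete values are made up, not taken from the slides): a run such as 100, 101, 102, ..., 227 can be stored as a single run of 128 values with a fixed delta of 1 and the varint-encoded base value 100, instead of 128 separate varints, while a stretch of 128 values containing no runs costs only one extra control byte on top of the varints themselves.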

String Column Serialization
- Use a dictionary to uniquify column values
  - Speeds up predicate filtering
  - Improves compression
- Uses the same run length encoding
- String columns have four streams:
  - Present bit stream: is the value non-null?
  - Dictionary data: the bytes for the strings
  - Dictionary length: the length of each entry
  - Row data: the row values, as indexes into the dictionary
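A hypothetical illustration with made-up values: a column containing NY, CA, NY, TX, NY would write the distinct strings CA, NY, TX once into the dictionary data and dictionary length streams, and the row data stream would hold the indexes 1, 0, 1, 2, 1, so repeated strings are stored only once and predicates can be checked against the dictionary entries rather than every row.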

Hive Compound Types
- Create a tree of readers/writers
  - A different class of reader or writer for each type
- Compound types use children for their values

create table Tbl (
  col1 int,
  col2 map<string, struct<col5:string, col6:double>>,
  col7 timestamp)

[Diagram: the resulting column tree, numbered in pre-order: 0 Struct (root), 1 Int (col1), 2 Map (col2), 3 String (map key), 4 Struct (map value), 5 String (col5), 6 Double (col6), 7 Timestamp (col7).]

Compound Type Serialization
- Lists
  - Encode the number of items in each value
  - Uses the same run length encoding
  - Uses the child writer for the values
- Maps
  - Encode the number of items in each value
  - Uses the same run length encoding
  - Uses the key and value child writers

Generic Compression
- Codecs are used to compress blocks of bytes
- Each compression block is prefixed by a 3-byte header
  - Was the codec used or not?
  - What is the compressed length?
- To record a position in a compressed stream:
  - Offset of the start of the compression header
  - Number of uncompressed bytes in the block
  - Also the count from the run length encoding

Column Projection
- Hive puts the list of required column numbers for the current query in the configuration
  - Only has the ability to request top-level columns
  - Can't request the empty list of columns
- ORC uses the stripe footer to find the requested columns' data streams
- Reads of adjacent streams are merged
- The large stripe size means the HDFS reads are a reasonable size
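A minimal sketch of what projection buys, assuming a hypothetical ORC table people(name string, age int, address string):

select name, age from people;  -- only the streams for the name and age columns are read
select * from people;          -- every column's streams must be read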

How Do You Use ORC?
- Added as a storage choice: create table Tbl (col int) stored as orc;
- Do NOT change the SerDe!
  - The ORC input/output formats require OrcSerde
- Control via table properties:
  - orc.compress, default ZLIB
  - orc.compress.size, default 256K
  - orc.stripe.size, default 256MB
  - orc.row.index.stride, default 10K
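A sketch of overriding these properties at table creation time; the table name and columns are made up, while the property keys and defaults are those listed above:

-- orc.compress selects the generic codec (ZLIB is the default)
-- orc.stripe.size is given in bytes: 268435456 bytes = 256 MB, the default
-- orc.row.index.stride = 10000 writes an index entry every 10,000 rows (the default)
create table logs (
  ts timestamp,
  msg string)
stored as orc
tblproperties (
  "orc.compress" = "SNAPPY",
  "orc.stripe.size" = "268435456",
  "orc.row.index.stride" = "10000");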

Managing Memory
- The entire stripe needs to be buffered
- ORC uses a memory manager
  - Scales down buffer sizes as the number of writers increases
  - Configured using hive.exec.orc.memory.pool
  - Defaults to 50% of the JVM heap size
- It may improve performance to provide more memory to insert queries
  - Scaling down the stripe size cuts the disk IO size
  - Use MapReduce's high-RAM jobs
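A minimal sketch of the session-level knobs mentioned above; the numeric values are illustrative assumptions, not recommendations:

-- fraction of the JVM heap the ORC writers may share (0.5 is the default noted above)
set hive.exec.orc.memory.pool=0.5;
-- giving insert tasks a larger container and heap (a high-RAM MapReduce job) keeps stripes at full size
set mapreduce.map.memory.mb=4096;
set mapreduce.map.java.opts=-Xmx3584m;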

Pavan's Trick
- During data ingest, you often need to write to many dynamic partitions
  - Pavan was writing to thousands of partitions
  - Even RCFile's small 4 MB buffers were running out of memory
- Re-wrote the query as (a fuller sketch follows below):
    from staging
    insert into partition(part) where key between x1 and x2
    insert into partition(part) where key between x2 and x3
- Reads the input once, but splits the writing tasks so that each writes to fewer partitions
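A runnable sketch of the same multi-insert pattern; the target table, select list, and key ranges are hypothetical, since the slide elides them:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

from staging
insert into table events partition (part)
  select col1, col2, part where key between 0 and 499
insert into table events partition (part)
  select col1, col2, part where key between 500 and 999;

staging is scanned once, but each insert clause only writes the partitions reached by its key range, so each writer holds fewer open ORC buffers at a time.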

Looking at ORC File Structures
- A useful tool is the ORC file dump
  - Invoked as: hive --service orcfiledump file.orc
  - Shows file information, statistics, and stripe info

Rows: 21000
Compression: ZLIB
Compression size: 10000
Type: struct<i:int,l:bigint,s:string>
Statistics:
  Column 0: count: 21000
  Column 1: count: 21000 min: -214 max: 214 sum: 193017
  Column 2: count: 21000 min: -922 max: 922 sum: 45060
  Column 3: count: 21000 min: Darkness, max: worst

Looking at ORC File Structures
Stripes:
  Stripe: offset: 3 data: 69638 rows: 5000 tail: 85 index: 126
    column 0 section ROW_INDEX start: 3 length 10
    column 1 section ROW_INDEX start: 13 length 38
    column 2 section ROW_INDEX start: 51 length 42
    column 3 section ROW_INDEX start: 93 length 36
    column 1 section PRESENT start: 129 length 11
    column 1 section DATA start: 140 length 22605
    column 2 section PRESENT start: 22745 length 11
    column 2 section DATA start: 22756 length 43426
    column 3 section PRESENT start: 66182 length 11
    column 3 section DATA start: 66193 length 3403
    column 3 section LENGTH start: 69596 length 38
    column 3 section DICTIONARY_DATA start: 69634 length 133

TPC-DS File Sizes
[Chart not transcribed.]

TPC-DS Query Performance
[Chart not transcribed.]

Additional Details
- Metadata is stored using Protocol Buffers
  - Allows addition and removal of fields
- Reader supports seeking to a given row number
- ORC doesn't include checksums; that is already done in HDFS
- User metadata can be added at any time before the ORC writer is closed
- ORC files can be read or written from MapReduce or Pig using HCat

Current Work
- Predicate pushdown: row group skipping
  - Take a subset of the where clause and compare it to the min/max values in the index to skip 10k-row groups
- Make string columns sample the written data and pick their encoding appropriately
  - After 100k values, if the dictionary has more than 80k entries, use a direct encoding
- Add optional dictionary encoding of numbers
- Suppress the present stream if there are no null values
- Improve run length encoding
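A hypothetical query that the row group skipping described above would help, assuming an ORC table events with a ts string column (the names and date are made up):

select count(*) from events where ts > '2013-06-01';
-- with predicate pushdown, any 10,000-row group whose indexed max(ts) falls below
-- '2013-06-01' can be skipped without reading or decompressing its data streams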

Vectorization
- Hive uses Object Inspectors to work on one row at a time
  - Enables a level of abstraction
  - Costs major performance
  - Exacerbated by using lazy SerDes
- The inner loop has many, many method calls
  - Kills performance on modern CPUs
  - Completely trashes the L1 and L2 caches
- Replace the inner loop with direct data access
  - Operate on 1000 rows at once
  - Store primitives in primitive arrays

Vectorization Preliminary Results
- The first rough prototype was able to run queries in memory at 100 million rows per second
- Tasks are getting 10 to 20 times faster
- Involves completely re-writing the engine
- Being developed in a Hive dev branch
- Will have adapters to switch between vectorized and non-vectorized execution until all of the operators are converted
- Thanks to Eric Hanson, Jitendra Pandey, Remus Rusanu, Sarvesh Sakalanaga, Teddy Choi, and Tony Murphy

Future Work
- Support count(*), min, max, and average from metadata
- Extend the index with an optional bloom filter for string columns
- Extend push-down predicates with bloom filters
- Allow MapReduce InputSplits to split stripes
- Writer may reorder rows to improve compression
  - Must preserve the requested sort order

Thanks!
Owen O'Malley
owen@hortonworks.com
@owen_omalley

Comparison

|                              | RC File | Trevni | Parquet | ORC File |
|------------------------------|---------|--------|---------|----------|
| Hive type model              | N       | N      | N       | Y        |
| Separate complex columns     | N       | Y      | Y       | Y        |
| Splits found quickly         | N       | Y      | Y       | Y        |
| Default column group size    | 4MB     | 64MB*  | 64MB*   | 256MB    |
| Files per bucket             | 1       | > 1    | 1*      | 1        |
| Store min, max, sum, count   | N       | N      | N       | Y        |
| Versioned metadata           | N       | Y      | Y       | Y        |
| Run length data encoding     | N       | N      | Y       | Y        |
| Store strings in dictionary  | N       | N      | N       | Y        |
| Store row count              | N       | Y      | N       | Y        |
| Skip compressed blocks       | N       | N      | N       | Y        |
| Store internal indexes       | N       | N      | N       | Y        |