Hadoop File Formats and Data Ingestion. Prasanth Kothuri, CERN


File Formats - not just CSV
- Key factor in Big Data processing and query performance
- Schema evolution
- Compression and splittability
- Data processing: write performance, partial read, full read

Available File Formats
- Text / CSV
- JSON
- SequenceFile: binary key/value pair format
- Avro
- Parquet
- ORC: Optimized Row Columnar format

Avro
- Language-neutral data serialization system
- Write a file in Python and read it in C
- Avro data is described using a language-independent schema
- Avro schemas are usually written in JSON, and the data is encoded in a binary format
- Supports schema evolution: producers and consumers can be at different versions of the schema
- Avro files support compression and are splittable

Avro file structure and example

Sample Avro schema in JSON format (the internal Avro file structure is shown as a diagram on the slide):

{
  "type" : "record",
  "name" : "tweets",
  "fields" : [
    { "name" : "username",  "type" : "string" },
    { "name" : "tweet",     "type" : "string" },
    { "name" : "timestamp", "type" : "long" }
  ],
  "doc" : "schema for storing tweets"
}
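As a minimal sketch (not from the original slides) of the "write in one language, read in another" point, the tweets schema above can be written and read from Python; this assumes the fastavro package is installed and uses an illustrative file name, tweets.avro:

import json
from fastavro import writer, reader, parse_schema

# The same schema as on the slide, as a Python dict.
schema = json.loads("""
{ "type": "record", "name": "tweets",
  "fields": [ {"name": "username",  "type": "string"},
              {"name": "tweet",     "type": "string"},
              {"name": "timestamp", "type": "long"} ],
  "doc": "schema for storing tweets" }
""")

records = [
    {"username": "cern", "tweet": "hello hadoop", "timestamp": 1419116400},
    {"username": "lhc",  "tweet": "beam is back", "timestamp": 1419116500},
]

# Write: the schema is embedded in the file header, so any Avro library
# (Java, C, Python, ...) can read the file back.
with open("tweets.avro", "wb") as out:
    writer(out, parse_schema(schema), records)

# Read: records come back as plain dictionaries.
with open("tweets.avro", "rb") as fo:
    for rec in reader(fo):
        print(rec)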

Parquet - columnar storage format

Key strength is the ability to store nested data in a truly columnar format using definition and repetition levels (1).

The slide illustrates a nested schema with columns X, Y, Z stored two ways:
- Row format: x1 y1 z1 | x2 y2 z2 | x3 y3 z3 | x4 y4 z4 | x5 y5 z5
- Columnar format: x1 x2 x3 x4 x5 | y1 y2 y3 y4 y5 | z1 z2 z3 z4 z5, with each column stored as encoded chunks

(1) Dremel made simple with Parquet - https://blog.twitter.com/2013/dremel-made-simple-with-parquet
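A minimal sketch (not part of the original slides) that writes the X, Y, Z table from the diagram to Parquet with pyarrow, assuming pyarrow is installed; the file name example.parquet is illustrative:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "x": ["x1", "x2", "x3", "x4", "x5"],
    "y": ["y1", "y2", "y3", "y4", "y5"],
    "z": ["z1", "z2", "z3", "z4", "z5"],
})
pq.write_table(table, "example.parquet")

# Each column is stored as its own encoded (and optionally compressed) chunk
# inside a row group, which is what makes column-at-a-time reads cheap.
print(pq.ParquetFile("example.parquet").metadata.row_group(0))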

Optimizations - CPU and I/O
- Statistics for filtering and query optimization
- Projection push down and predicate push down: read only the data you need
- Minimizes CPU cache misses (cache misses cost CPU cycles)
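A minimal sketch (not from the original slides) of both push downs in pyarrow, reusing the illustrative example.parquet file from above and assuming a recent pyarrow version:

import pyarrow.parquet as pq

# Projection push down: only the bytes of column "y" are read from disk.
only_y = pq.read_table("example.parquet", columns=["y"])

# Predicate push down: row-group and page statistics (min/max) let the
# reader skip chunks that cannot possibly match the filter.
matching = pq.read_table("example.parquet", filters=[("x", ">", "x3")])

print(only_y.num_rows, matching.num_rows)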

Encoding
- Delta encoding: e.g. a timestamp column can be encoded by storing the first value and the deltas between subsequent values, which tend to be small because the values are close in time
- Prefix encoding: delta encoding for strings
- Dictionary encoding: for columns with a small set of distinct values, e.g. post codes, IP addresses
- Run-length encoding: for repeating data
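An illustrative pure-Python sketch (not the actual Parquet implementation) of why delta, dictionary, and run-length encoding shrink typical columns:

timestamps = [1419116400, 1419116402, 1419116405, 1419116405, 1419116409]
first = timestamps[0]
deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
print(first, deltas)          # 1419116400 [2, 3, 0, 4] -> small values, few bits each

postcodes = ["CH-1211", "CH-1211", "F-01631", "CH-1211", "F-01631"]
dictionary = sorted(set(postcodes))                  # ['CH-1211', 'F-01631']
encoded = [dictionary.index(p) for p in postcodes]   # [0, 0, 1, 0, 1]
print(dictionary, encoded)

# Run-length encoding of repeating data: (value, run length) pairs.
flags = [1, 1, 1, 1, 0, 0, 1]
rle, prev = [], None
for f in flags:
    if f == prev:
        rle[-1][1] += 1
    else:
        rle.append([f, 1])
        prev = f
print(rle)                    # [[1, 4], [0, 2], [1, 1]]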

Parquet file structure & configuration

Internal structure of a Parquet file: shown as a diagram on the slide.

Configurable Parquet parameters:

Property name                   Default value   Description
parquet.block.size              128 MB          The size in bytes of a block (row group).
parquet.page.size               1 MB            The size in bytes of a page.
parquet.dictionary.page.size    1 MB            The maximum allowed size in bytes of a dictionary before falling back to plain encoding for a page.
parquet.enable.dictionary       true            Whether to use dictionary encoding.
parquet.compression             UNCOMPRESSED    The type of compression: UNCOMPRESSED, SNAPPY, GZIP or LZO.

In summary, Parquet is a state-of-the-art, open-source columnar format that is supported by most Hadoop processing frameworks and is optimized for high compression and high scan efficiency.
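The properties above are Hadoop job configuration settings; as a rough sketch (an assumption, not from the slides), pyarrow exposes analogous knobs when writing from Python, though the parameter names differ and row_group_size is counted in rows rather than bytes:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": list(range(1_000_000))})
pq.write_table(
    table,
    "tuned.parquet",              # illustrative output path
    row_group_size=500_000,       # rows per row group (cf. parquet.block.size)
    data_page_size=1024 * 1024,   # ~1 MB pages (cf. parquet.page.size)
    use_dictionary=True,          # cf. parquet.enable.dictionary
    compression="snappy",         # cf. parquet.compression
)
print(pq.ParquetFile("tuned.parquet").metadata)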

Data Ingestion 11

Flume

Flume is for high-volume ingestion of event-based data into Hadoop, e.g. collecting log files from a bank of web servers and moving the log events from those files into HDFS (clickstream data).

Flume example (diagram on the slide)

Flume configuration

agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1
agent1.sources.source1.type = spooldir
agent1.sources.source1.spoolDir = /var/spooldir
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://hostname:8020/user/flume/logs
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.channels.channel1.type = memory
agent1.channels.channel1.capacity = 10000
agent1.channels.channel1.transactionCapacity = 100

Start the agent:

flume-ng agent --name agent1 --conf $FLUME_HOME/conf

Built-in components:

Category   Components
Source     Avro, Exec, HTTP, JMS, Netcat, Sequence generator, Spooling directory, Syslog, Thrift, Twitter
Sink       Avro, Elasticsearch, File roll, HBase, HDFS, IRC, Logger, Morphline (Solr), Null, Thrift
Channel    File, JDBC, Memory

Sqoop - SQL to Hadoop
- Open-source tool to extract data from a structured data store into Hadoop
- Architecture (diagram on the slide)

Sqoop contd.
- Sqoop schedules MapReduce jobs to carry out imports and exports
- Sqoop always requires the connector and the JDBC driver
- Sqoop requires the JDBC driver for the specific database server; it should be copied to /usr/lib/sqoop/lib
- The command line has the following structure:

  sqoop TOOL PROPERTY_ARGS SQOOP_ARGS

  TOOL          - the operation that you want to perform, e.g. import, export
  PROPERTY_ARGS - parameters entered as Java properties in the format -Dname=value
  SQOOP_ARGS    - all the various Sqoop parameters

Sqoop - how to run sqoop

Example:

sqoop import \
--connect jdbc:oracle:thin:@devdb11-s.cern.ch:10121/devdb11_s.cern.ch \
--username hadoop_tutorial \
-P \
--num-mappers 1 \
--target-dir visitcount_rfidlog \
--table VISITCOUNT.RFIDLOG

Sqoop - how to parallelize

--table table_name
--query 'select * from table_name where $CONDITIONS'

--table table_name
--split-by primary_key
--num-mappers n

--table table_name
--split-by primary_key
--boundary-query 'select range from dual'
--num-mappers n

Hands On 1 - Use Kite SDK to demonstrate copying various file formats to Hadoop

Step 1) Download the MovieLens dataset

curl http://files.grouplens.org/datasets/movielens/ml-latest-small.zip -o movies.zip
unzip movies.zip
cd ml-latest-small/

Step 2) Load the dataset into Hadoop in Avro format

-- infer the schema
kite-dataset csv-schema ratings.csv --record-name ratings -o ratings.avsc
cat ratings.avsc

-- create the schema
kite-dataset create ratings --schema ratings.avsc

-- load the data
kite-dataset csv-import ratings.csv --delimiter ',' ratings

Hands On 1 contd.

Step 3) Load the dataset into Hadoop in Parquet format

-- infer the schema
kite-dataset csv-schema ratings.csv --record-name ratingsp -o ratingsp.avsc
cat ratingsp.avsc

-- create the schema
kite-dataset create ratingsp --schema ratingsp.avsc --format parquet

-- load the data
kite-dataset csv-import ratings.csv --delimiter ',' ratingsp

Step 4) Run a sample query to compare the elapsed time between Avro and Parquet

hive
select avg(rating) from ratings;
select avg(rating) from ratingsp;

Hands On 2 - Use Sqoop to copy an Oracle table to Hadoop

Step 1) Get the Oracle JDBC driver

sudo su -
cd /var/lib/sqoop
curl -L https://pkothuri.web.cern.ch/pkothuri/ojdbc6.jar -o ojdbc.jar
exit

Step 2) Run the sqoop job

sqoop import \
--connect jdbc:oracle:thin:@devdb11-s.cern.ch:10121/devdb11_s.cern.ch \
--username hadoop_tutorial \
-P \
--num-mappers 1 \
--target-dir visitcount_rfidlog \
--table VISITCOUNT.RFIDLOG

Hands On 3 - Use Sqoop to copy an Oracle table to Hadoop, multiple mappers

sqoop import \
--connect jdbc:oracle:thin:@devdb11-s.cern.ch:10121/devdb11_s.cern.ch \
--username hadoop_tutorial \
-P \
--num-mappers 2 \
--split-by alarm_id \
--target-dir lemontest_alarms \
--table LEMONTEST.ALARMS \
--as-parquetfile

Check the size and number of files:

hdfs dfs -ls lemontest_alarms/

Hands On 4 - Use Sqoop to make an incremental copy of an Oracle table to Hadoop

Step 1) Create a sqoop job

sqoop job \
--create alarms \
-- \
import \
--connect jdbc:oracle:thin:@devdb11-s.cern.ch:10121/devdb11_s.cern.ch \
--username hadoop_tutorial \
-P \
--num-mappers 1 \
--target-dir lemontest_alarms_i \
--table LEMONTEST.ALARMS \
--incremental append \
--check-column alarm_id \
--last-value 0

Hands On 4 contd.

Step 2) Run the sqoop job

sqoop job --exec alarms

Step 3) Run sqoop in incremental mode

sqoop import \
--connect jdbc:oracle:thin:@devdb11-s.cern.ch:10121/devdb11_s.cern.ch \
--username hadoop_tutorial \
-P \
--num-mappers 1 \
--table LEMONTEST.ALARMS \
--target-dir lemontest_alarms_i \
--incremental append \
--check-column alarm_id \
--last-value 47354

hdfs dfs -ls lemontest_alarms_i/

Q & A

E-mail: Prasanth.Kothuri@cern.ch
Blog: http://prasanthkothuri.wordpress.com

See also: https://db-blog.web.cern.ch/