Scaling Up Pig. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech. CSE6242 / CX4242: Data & Visual Analytics


http://poloclub.gatech.edu/cse6242 CSE6242 / CX4242: Data & Visual Analytics. Scaling Up: Pig. Duen Horng (Polo) Chau, Assistant Professor, Associate Director, MS Analytics, Georgia Tech. Partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos, Parishit Ram (GT PhD alum; SkyTree), Alex Gray

Pig (http://pig.apache.org) is a high-level language you use instead of writing low-level map and reduce functions, which makes programs easier to write, understand, and maintain. Created at Yahoo!, Pig produces sequences of MapReduce programs (and lets you do joins much more easily).
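For instance, a join that would require a nontrivial mapper/reducer pair in raw MapReduce is a single operator in Pig Latin. A minimal sketch (the relation names, file names, and schemas below are hypothetical, made up for illustration):

```pig
-- Hypothetical inputs; Pig compiles the JOIN into MapReduce jobs for you.
users  = LOAD 'users.txt'  AS (name:chararray, age:int);
clicks = LOAD 'clicks.txt' AS (name:chararray, url:chararray);
joined = JOIN users BY name, clicks BY name;  -- one line instead of custom mappers/reducers
DUMP joined;
```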

Pig (http://pig.apache.org): your data analysis task becomes a data flow sequence (i.e., a series of data transformations): input, data flow, output. You specify the data flow in Pig Latin (Pig's language); Pig then turns the data flow into a sequence of MapReduce jobs automatically!

Pig's 1st benefit: you write only a few lines of Pig Latin. By contrast, the typical MapReduce development cycle is long: write mappers and reducers, compile the code, submit jobs, ...

Pig's 2nd benefit: Pig can automatically perform a sample run on a representative subset of your input data! This helps you debug your code at a smaller scale (much faster!) before applying it to the full data.

What is Pig good for? Batch processing, since it is built on top of MapReduce; it is not for random query/read/write. Pig may be slower than MapReduce programs coded from scratch: you trade some execution speed for ease of use and shorter coding time.

How to run Pig: Pig is a client-side application (it runs on your computer); there is nothing to install on the Hadoop cluster.

How to run Pig: 2 modes. In local mode, Pig runs on your computer (e.g., laptop), which is great for trying out Pig on small datasets. In MapReduce mode, Pig translates your commands into MapReduce jobs; remember, you can have a single-machine cluster set up on your computer. Difference between Pig local and MapReduce mode: http://stackoverflow.com/questions/11669394/difference-between-pig-local-and-mapreduce-mode
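Concretely, the mode is selected with Pig's -x flag at launch time (the script name below is hypothetical):

```
pig -x local max_temp.pig      # local mode: runs on your machine, reads local files
pig -x mapreduce max_temp.pig  # MapReduce mode: submits jobs to the Hadoop cluster
```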

Pig programs: 3 ways to write them. As a script; in Grunt (the interactive shell), which is great for debugging; or embedded in a Java program, using the PigServer class (like JDBC for SQL) or PigRunner to access Grunt.

Grunt (the interactive shell) provides code completion: press the Tab key to complete Pig Latin keywords and functions. Let's see an example Pig program run with Grunt: find the highest temperature by year.

Example Pig program: find the highest temperature by year.

records = LOAD 'input/ncdc/micro-tab/sample.txt'
    AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
    (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
    MAX(filtered_records.temperature);
DUMP max_temp;

Example Pig program: find the highest temperature by year.

grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
    AS (year:chararray, temperature:int, quality:int);
grunt> DUMP records;
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)

Each output line above is called a tuple.

grunt> DESCRIBE records;
records: {year: chararray, temperature: int, quality: int}

grunt> filtered_records = FILTER records BY temperature != 9999 AND
    (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
grunt> DUMP filtered_records;
(1950,0,1)
(1950,22,1)
(1950,-11,1)
(1949,111,1)
(1949,78,1)

In this example, no tuple is filtered out.

grunt> grouped_records = GROUP filtered_records BY year;
grunt> DUMP grouped_records;
(1949,{(1949,111,1),(1949,78,1)})
(1950,{(1950,0,1),(1950,22,1),(1950,-11,1)})

The collection in braces is called a bag: an unordered collection of tuples.

grunt> DESCRIBE grouped_records;
grouped_records: {group: chararray, filtered_records: {year: chararray, temperature: int, quality: int}}

Here, filtered_records is an alias that Pig created.

(1949,{(1949,111,1),(1949,78,1)})
(1950,{(1950,0,1),(1950,22,1),(1950,-11,1)})
grouped_records: {group: chararray, filtered_records: {year: chararray, temperature: int, quality: int}}

grunt> max_temp = FOREACH grouped_records GENERATE group,
    MAX(filtered_records.temperature);
grunt> DUMP max_temp;
(1949,111)
(1950,22)

Run a Pig program on a subset of your data: you saw an example run on a tiny dataset, but how do you do that for a larger dataset? Use the ILLUSTRATE command to generate a sample dataset.

grunt> ILLUSTRATE max_temp;

How does Pig compare to SQL? SQL has a fixed schema; Pig has a loosely defined schema, as in:

records = LOAD 'input/ncdc/micro-tab/sample.txt'
    AS (year:chararray, temperature:int, quality:int);

How does Pig compare to SQL? SQL supports fast, random access (e.g., under 10 ms, though of course that depends on hardware, data size, and query complexity too); Pig is for batch processing.

Pig vs SQL:
1. Pig Latin is procedural, whereas SQL is declarative.
2. Pig Latin allows pipeline developers to decide where to checkpoint data in the pipeline.
3. Pig Latin allows the developer to select specific operator implementations directly rather than relying on the optimizer.
4. Pig Latin supports splits in the pipeline.
5. Pig Latin allows developers to insert their own code almost anywhere in the data pipeline.
http://yahoohadoop.tumblr.com/post/98294444546/comparing-pig-latin-and-sql-for-constructing-data
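Point 4 (splits in the pipeline) can be sketched in Pig Latin; the relation name, file names, and the 100-degree threshold below are hypothetical, for illustration only:

```pig
records = LOAD 'sample.txt' AS (year:chararray, temperature:int);
-- One input stream feeds two downstream branches of the same pipeline.
SPLIT records INTO hot IF temperature >= 100, cold IF temperature < 100;
STORE hot  INTO 'hot_out';
STORE cold INTO 'cold_out';
```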

There is much more to learn about Pig: relational operators; diagnostic operators (e.g., DESCRIBE, EXPLAIN, ILLUSTRATE); utility commands (cat, cd, kill, exec); and more.
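As a quick taste of the diagnostic operators, assuming the max_temp alias from the earlier example is defined in your Grunt session:

```pig
grunt> DESCRIBE max_temp;   -- prints the schema of the alias
grunt> EXPLAIN max_temp;    -- shows the plans Pig would execute
grunt> ILLUSTRATE max_temp; -- runs the pipeline on a small generated sample
```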