Pig Latin: A Not-So-Foreign Language for Data Processing


Pig Latin: A Not-So-Foreign Language for Data Processing
Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins (Yahoo! Research)
Presented by Aaron Moss (University of Waterloo)

We have this problem
> Developers at top web properties are trying to make their products better
> They have massive, un-indexable datasets to work with
> And they want to do ad-hoc analysis on them

and we have this tool
> Map-reduce over a parallel cluster solves the problem quite efficiently
> It takes advantage of the massive parallelism inherent in the data analysis problem

how do we make it simple?
> Map-reduce is hard to use: it doesn't support common operations like join in a reusable fashion
> SQL is unfamiliar: users are developers who are more comfortable with procedural than declarative code

A different kind of question
> This class has largely been concerned with implementing relational database constructs efficiently in a distributed setting
> This paper is concerned with building easy-to-use constructs on top of an efficient distributed implementation

Introducing Pig Latin
> A language designed to provide abstraction over procedural map-reduce without being so declarative as to be opaque
> The authors have built a query translator called Pig for Pig Latin
  Written in Java for the Hadoop map-reduce environment
  Open source and available at http://pig.apache.org

Quick Overview of Map-Reduce
[Diagram: input pairs (k1, v1) ... (kn, vn) each pass through a map function, producing intermediate pairs (k*1, v*1) ... (k*m, v*m); the intermediate pairs are grouped by key, and each reduce call takes a key k* with its list of values (v* ...) and emits output values (v** ...)]
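The dataflow in the diagram can be rendered as a minimal single-process sketch in Python (word count is a standard illustration; the real system runs many map and reduce tasks in parallel across a cluster, and the input data here is made up):

```python
from collections import defaultdict

def map_fn(key, value):
    # (k, v) -> list of intermediate (k*, v*) pairs
    return [(word, 1) for word in value.split()]

def reduce_fn(key, values):
    # (k*, [v*, ...]) -> list of output values (v**, ...)
    return [sum(values)]

inputs = [(1, "pig latin"), (2, "pig runs on hadoop")]

# Group the intermediate pairs by key, as the framework does between phases.
intermediate = defaultdict(list)
for k, v in inputs:
    for k2, v2 in map_fn(k, v):
        intermediate[k2].append(v2)

output = {k: reduce_fn(k, vs) for k, vs in intermediate.items()}
print(output["pig"])  # [2]
```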

Disadvantages of SQL
> Not as inherently parallelizable
> Declarative style uncomfortable for procedural programmers
> Many primitives with complex interactions
  Hard for the programmer to know where the performance bottlenecks are

Advantages of Map-Reduce
> Inherently parallel
> Procedural model
> Only two primitives (map and reduce)
  Makes it clear what the system is doing

Disadvantages of Map-Reduce
> Can be difficult to make queries fit into the two-stage map-then-reduce model
> Can't optimize across the abstraction
> No primitives for common operations
  Projection
  Selection

An Example
> SQL:
  SELECT category, AVG(pagerank)
  FROM urls
  WHERE pagerank > 0.2
  GROUP BY category
  HAVING COUNT(*) > 1000000
> Pig Latin:
  good_urls = FILTER urls BY pagerank > 0.2;
  groups = GROUP good_urls BY category;
  big_groups = FILTER groups BY COUNT(good_urls) > 1000000;
  output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
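The semantics of the four Pig Latin steps can be simulated in plain Python (an illustration of the dataflow, not the Pig runtime; the `urls` records are invented and the 1,000,000 threshold is shrunk to 1 so the toy data produces output):

```python
from collections import defaultdict

# Toy stand-in for the urls relation: (url, category, pagerank) tuples.
urls = [
    ("a.com", "news",   0.9),
    ("b.com", "news",   0.5),
    ("c.com", "sports", 0.1),   # dropped by the filter: pagerank <= 0.2
    ("d.com", "sports", 0.4),
]

# good_urls = FILTER urls BY pagerank > 0.2;
good_urls = [t for t in urls if t[2] > 0.2]

# groups = GROUP good_urls BY category;
groups = defaultdict(list)
for t in good_urls:
    groups[t[1]].append(t)

# big_groups = FILTER groups BY COUNT(good_urls) > 1;  (threshold shrunk)
big_groups = {cat: ts for cat, ts in groups.items() if len(ts) > 1}

# output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
output = {cat: sum(t[2] for t in ts) / len(ts) for cat, ts in big_groups.items()}
print(output)  # only 'news' survives, with average pagerank 0.7
```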

Dataflow Language
> Each line defines a single manipulation
> Similar to a query execution plan
  Lower level than SQL
  This can aid optimization
> User-defined I/O allows Pig to work with data stores that aren't databases
  Queries are read-only (no transactions)
  Queries are ad-hoc (less value in pre-built indices)

Data Model
> Atom: single atomic value, e.g. string, number
> Tuple: sequence of fields of any data type
> Bag: collection of tuples
  Possible duplicates
  Tuples need not have a consistent schema
> Map: collection of data items which can be looked up by an atomic key
  Keys must all have the same type; data items need not
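One possible Python rendering of the four types above (an illustration only; Pig itself is implemented in Java and has its own representations):

```python
atom = "alice"                           # Atom: a single atomic value
t = ("alice", 42)                        # Tuple: a sequence of fields

bag = [("lakers", 1),                    # Bag: collection of tuples,
       ("lakers", 1),                    #      duplicates allowed,
       ("ipod", "nano", 2)]              #      schemas may differ per tuple

m = {"age": 20, "name": ("alice",)}      # Map: atomic keys of one type,
                                         #      data items of any type
```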

Pig Expressions
t = ('quux', {('foo', 1), ('bar', 2)}, ['baz' -> 20])

Expression Type     | Example                | Value for t
Constant            | 'foo'                  | 'foo'
Field (by position) | $0                     | 'quux'
Field (by name)     | f3                     | ['baz' -> 20]
Projection          | f2.$0                  | {('foo'), ('bar')}
Map Lookup          | f3#'baz'               | 20
Function Evaluation | SUM(f2.$1)             | 3
Conditional         | f3#'baz' > 42 ? 1 : 0  | 0
Flattening          | f1, FLATTEN(f2)        | ('quux', 'foo', 1), ('quux', 'bar', 2)

Pig Latin Primitives - Input
> queries = LOAD 'query_log.txt' USING myload() AS (userid, querystring, timestamp);
> Loads from a file
> Using a de-serializer
  Default is plain-text tab-separated values
> Given an optional schema
  If not given, refer to data by position

Pig Latin Primitives - Output
> STORE queries INTO 'q_output.txt' USING mystore();
> Stores a bag to a file
> Using a serializer
  Default is plain-text tab-separated values

Pig Latin Primitives - Per-tuple
> f_queries = FOREACH queries GENERATE userid, f(querystring);
> For each tuple in the bag, generate the tuple described by the GENERATE expression
> May use FLATTEN in the GENERATE expression to generate more than one tuple

Pig Latin Primitives - Per-tuple
> real_queries = FILTER queries BY userid neq 'bot';
> Output each tuple in the bag only if it satisfies the BY expression
> AND, OR, and NOT logical operators
> ==, != operate on numbers; eq, neq on strings

Pig Latin Primitives - Grouping
> grouped_revenue = GROUP revenue BY querystring;
> Outputs a list of tuples, one for each unique value of the BY expression, where the first element is the value of that expression and the second is a bag of tuples matching that expression

Pig Latin Primitives - Grouping
> grouped_data = COGROUP results BY querystring, revenue BY querystring;
> Outputs a list of tuples, one for each unique value of the BY expressions; the first element is that value and the rest are bags of tuples from each input which match it
> Can generate a join by flattening the bags
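The COGROUP-then-flatten idea can be sketched in Python (the `results` and `revenue` records are invented for illustration; this is the semantics, not Pig's distributed implementation):

```python
from collections import defaultdict

# Toy relations keyed by querystring.
results = [("lakers", "nba.com"), ("lakers", "espn.com"), ("ipod", "apple.com")]
revenue = [("lakers", 0.5), ("ipod", 1.2)]

# COGROUP results BY querystring, revenue BY querystring:
# one entry per key -> (bag of matching results, bag of matching revenue)
cogrouped = defaultdict(lambda: ([], []))
for t in results:
    cogrouped[t[0]][0].append(t)
for t in revenue:
    cogrouped[t[0]][1].append(t)

# Flattening both bags within each group yields an equi-join on querystring.
join = [(r, v) for key, (rs, vs) in cogrouped.items() for r in rs for v in vs]
```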

Pig Latin Primitives - Sets
> UNION returns the union of two bags
> CROSS returns the cross product of two bags
> ORDER sorts a bag by the given field(s)
> DISTINCT eliminates duplicate tuples in a bag

Pig, the Pig Latin Compiler
> Lazily builds an execution plan as it sees each command
  Allows optimizations like combining or reordering filters
> Processing is only triggered on STORE
> Logical plan construction is platform-independent, but designed for Hadoop map-reduce

Implementation
> Each (CO)GROUP command is converted into a map-reduce job
> Map assigns keys to tuples by the BY clauses
  Also does initial per-tuple processing
> Reduce handles per-tuple processing up to the next (CO)GROUP
[Diagram: a logical plan LOAD, FILTER, GROUP, FOREACH, LOAD, COGROUP, STORE split into two map-reduce jobs, with each (CO)GROUP's key assignment in a map phase and the following operators in the reduce phase]
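The splitting rule above can be sketched as a toy compiler pass over a linear operator list (a simplification invented for illustration; Pig's actual compiler works on a plan graph with multiple inputs):

```python
# Each (CO)GROUP starts a new map-reduce job: its key assignment runs in the
# map phase, and everything up to the next (CO)GROUP runs in the reduce phase.
plan = ["LOAD", "FILTER", "GROUP", "FOREACH", "GROUP", "STORE"]

jobs, current, phase = [], {"map": [], "reduce": []}, "map"
for op in plan:
    if op == "GROUP":
        if phase == "reduce":          # a boundary closes the previous job
            jobs.append(current)
            current = {"map": [], "reduce": []}
        current["map"].append(op)      # key assignment happens in map
        phase = "reduce"
    else:
        current[phase].append(op)
jobs.append(current)
```

Running this yields two jobs: map [LOAD, FILTER, GROUP] / reduce [FOREACH], then map [GROUP] / reduce [STORE], matching the structure in the diagram.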

Efficiency
> Moving/replicating data between successive map-reduce jobs results in overhead
> When a nested bag is created, then aggregated using a parallelizable operation, Pig can just perform the reduction as tuples are seen instead of materializing the bag
  If this can't be done, spill to disk and use database-style sorting
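The "reduce as tuples are seen" idea can be illustrated with a running average in Python (a sketch of the general technique, with invented data; instead of building the per-group bag of pageranks and then averaging it, fold each tuple into a small (sum, count) state as it arrives):

```python
from collections import defaultdict

stream = [("news", 0.9), ("news", 0.5), ("sports", 0.4)]  # (category, pagerank)

state = defaultdict(lambda: [0.0, 0])   # category -> [running sum, count]
for category, pagerank in stream:       # no per-group bag is ever materialized
    s = state[category]
    s[0] += pagerank
    s[1] += 1

averages = {c: total / n for c, (total, n) in state.items()}
```

Only aggregates that can be computed incrementally this way (sum, count, average, etc.) qualify; an arbitrary function of the whole bag would still force materialization.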

Debugging
> The Pig Pen environment dynamically creates a sandbox dataset to illustrate query flow
> Allows incremental query development without the long execution times of full queries
> Takes random samples of data, attempts to provide full query coverage, synthesizes data where needed

Critique - Strengths
> Approach of making language features translate clearly to low-level constructs
  Allows the programmer to better optimize their work
> Simple, regular data model
> Ability to fit a full language overview in a short paper

Critique - Weaknesses
> Only anecdotal evidence for Pig Latin being easier to use than SQL
> Inadequate coverage of the Pig Pen debugger
  Algorithms are opaque, not apparently published
> No data to back up optimization claims
  Can't prove programmers can optimize better than an automated SQL optimizer
  No comparison of Pig to optimized SQL