Introduction to Data Management CSE 344 (11/29/11)


Lecture 23: Pig Latin

Outline

- Review of MapReduce
- Introduction to the Pig system
- Pig Latin tutorial

Some slides are courtesy of Alan Gates, Yahoo! Research.

MapReduce

- Google: paper published in 2004
- Open source implementation: Hadoop
- MapReduce = high-level programming model and implementation for large-scale parallel data processing
- Main competing alternative: Dryad, from Microsoft

MapReduce Data Model

Files! A file = a bag of (key, value) pairs.

A MapReduce program:
- Input: a bag of (input key, value) pairs
- Output: a bag of (output key, value) pairs

Step 1: the MAP Phase

The user provides the MAP function:
- Input: one (input key, value) pair
- Output: a bag of (intermediate key, value) pairs

The system applies the function in parallel to all (input key, value) pairs in the input file.

Step 2: the REDUCE Phase

The user provides the REDUCE function:
- Input: an intermediate key and a bag of values
- Output: a bag of output values

The system groups all pairs with the same intermediate key, and passes the bag of values to the REDUCE function.
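The two phases above can be sketched in plain Python. This is a minimal single-machine simulation of the model (not Hadoop itself): the user supplies MAP and REDUCE, and the "system" part is the grouping step in between.

```python
from collections import defaultdict

def map_reduce(input_pairs, map_fn, reduce_fn):
    """Simulate the MapReduce data model on a bag of (key, value) pairs."""
    # MAP phase: apply map_fn to every (input key, value) pair
    intermediate = []
    for k, v in input_pairs:
        intermediate.extend(map_fn(k, v))
    # The system groups all pairs with the same intermediate key
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)
    # REDUCE phase: reduce_fn gets an intermediate key and a bag of values
    output = []
    for k, values in groups.items():
        output.extend(reduce_fn(k, values))
    return output

# Word count, the classic example: MAP emits (word, 1), REDUCE sums
def wc_map(_, line):
    return [(word, 1) for word in line.split()]

def wc_reduce(word, counts):
    return [(word, sum(counts))]

result = map_reduce([(0, "a b a"), (1, "b c")], wc_map, wc_reduce)
```

In a real cluster the grouping step is distributed (the shuffle), but the contract between user code and system is exactly this.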

Parallel Execution in a Cluster: MapReduce Phases

See illustration on the white board.

- Data is typically a file in the Google File System (HDFS for Hadoop)
  - The file system partitions the file into chunks
  - Each chunk is replicated on k (typically 3) machines
- Each machine can run a few map and reduce tasks simultaneously
- Each map task consumes one chunk
  - Can adjust how much data goes into each map task using splits
  - The scheduler tries to schedule a map task where its input data is located
- Map output is partitioned across reducers
- Map output is also written locally to disk
- The number of reduce tasks is configurable
- The system shuffles data between map and reduce tasks
- Reducers sort-merge the data before consuming it

Magda Balazinska - CSE 344 Fall 2011

MapReduce

- Computation is moved to the data
- A simple yet powerful programming model
  - Map: every record handled individually
  - Shuffle: records collected by key
  - Reduce: key and an iterator of all associated values
- User provides:
  - input and output (usually files)
  - map Java function
  - key to aggregate on
  - reduce Java function
- Opportunities for more control: partitioning, sorting, partial aggregations, etc.

MapReduce Illustrated

Input:
  Romeo, Romeo, wherefore art thou Romeo?
  What, art thou hurt?

Map output:
  (Romeo, 1) (Romeo, 1) (wherefore, 1) (art, 1) (thou, 1) (Romeo, 1)
  (What, 1) (art, 1) (thou, 1) (hurt, 1)

Shuffle output (records collected by key):
  art, (1, 1)    hurt, (1)    thou, (1, 1)
  Romeo, (1, 1, 1)    wherefore, (1)    what, (1)
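"Map output is partitioned across reducers" is usually done by hashing the intermediate key, so every pair with the same key lands on the same reduce task. A sketch of that routing step (hash partitioning is the common default, e.g. Hadoop's HashPartitioner; the reducer count here is an arbitrary illustration):

```python
NUM_REDUCERS = 3

def partition(key, num_reducers=NUM_REDUCERS):
    # Route an intermediate key to one reduce task. Same key -> same
    # reducer, which is what makes the grouping in the reduce phase correct.
    return hash(key) % num_reducers

# Toy map output being shuffled to reducers
map_output = [("Romeo", 1), ("art", 1), ("thou", 1), ("Romeo", 1)]
partitions = {r: [] for r in range(NUM_REDUCERS)}
for k, v in map_output:
    partitions[partition(k)].append((k, v))
```

Note that Python's built-in `hash` for strings varies between runs; within one run (one "job" here) it is stable, which is all the invariant requires.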

MapReduce Illustrated (continued)

Reduce output:
  art, 2    hurt, 1    thou, 2
  Romeo, 3    wherefore, 1    what, 1

Making Parallelism Simple

- Sequential reads = good read speeds
- In a large cluster, failures are guaranteed; MapReduce handles retries
- Good fit for batch processing applications that need to touch all your data: data mining, model tuning
- Bad fit for applications that need to find one particular record
- Bad fit for applications that need to communicate between processes; MapReduce is oriented around independent units of work

MapReduce (MR) tools

- Open source MapReduce implementation: Hadoop
- One MR query language: Pig Latin
- Query engine: Pig

Graphics taken from: hadoop.apache.org and research.yahoo.com/node/90

What is Pig?

- An engine for executing programs on top of Hadoop
- It provides a language, Pig Latin, to specify these programs
- An Apache open source project: http://hadoop.apache.org/pig/

Why use Pig?

Suppose you have user data in one file, website data in another, and you need to find the top 5 most visited sites by users aged 18-25.

In MapReduce

The dataflow: Load Users, Load Pages, Filter by age, Join on name, Group on url, Count clicks, Order by clicks, Take top 5.

170 lines of code, 4 hours to write.

In Pig Latin

Users = load 'users' as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Jnd = join Fltrd by name, Pages by user;
Grpd = group Jnd by url;
Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;
Srtd = order Smmd by clicks desc;
Top5 = limit Srtd 5;
store Top5 into 'top5sites';

9 lines of code, 15 minutes to write.

Pig System Overview

A Pig Latin program compiles into an ensemble of MapReduce jobs:

A = LOAD 'file1' AS (sid, pid, mass, px:double);
B = LOAD 'file2' AS (sid, pid, mass, px:double);
C = FILTER A BY px < 1.0;
D = JOIN C BY sid, B BY sid;
STORE D INTO 'output.txt';

But can it fly?

Pig Latin Mini-Tutorial

Based entirely on "Pig Latin: A not-so-foreign language for data processing", by Olston, Reed, Srivastava, Kumar, and Tomkins, 2008.

Pig-Latin Overview

- Data model = loosely typed nested relations
- Query model = Relational Algebra (less SQL)
- Execution model:
  - Option 1: run locally (pig -x local)
  - Option 2: run in a Hadoop cluster
- Option 1 is good for debugging

Example: in SQL

Pages(url, category, pagerank)

SELECT category, AVG(pagerank)
FROM Pages
WHERE pagerank > 0.2
GROUP BY category
HAVING COUNT(*) > 10^6
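Each statement in the 9-line Pig Latin program above is a bulk operation over a collection. The same dataflow can be traced in plain Python on toy in-memory data (the rows are hypothetical, purely to show the load-filter-join-group-order-limit pipeline):

```python
from collections import defaultdict

# Toy stand-ins for the 'users' and 'pages' files (hypothetical rows)
users = [("alice", 20), ("bob", 30), ("carol", 22)]          # (name, age)
pages = [("alice", "x.com"), ("carol", "x.com"),
         ("carol", "y.com"), ("bob", "x.com")]               # (user, url)

# Fltrd = filter Users by age >= 18 and age <= 25
fltrd = [(name, age) for name, age in users if 18 <= age <= 25]

# Jnd = join Fltrd by name, Pages by user
names = {name for name, _ in fltrd}
jnd = [(user, url) for user, url in pages if user in names]

# Grpd = group Jnd by url; Smmd = foreach ... generate group, COUNT(Jnd)
clicks = defaultdict(int)
for _, url in jnd:
    clicks[url] += 1

# Srtd = order Smmd by clicks desc; Top5 = limit Srtd 5
top5 = sorted(clicks.items(), key=lambda t: t[1], reverse=True)[:5]
```

Pig's contribution is that the same nine declarative statements run this pipeline in parallel over files too large for one machine.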

Example: in Pig-Latin

Pages(url, category, pagerank)

pages = LOAD 'pages-file.txt' AS (url, category, pagerank);
good_pages = FILTER pages BY pagerank > 0.2;
groups = GROUP good_pages BY category;
big_groups = FILTER groups BY COUNT(good_pages) > 10^6;
result = FOREACH big_groups GENERATE group, AVG(good_pages.pagerank);

Use describe to check the schema. For the last statement, one can also use:

result = FOREACH big_groups GENERATE $0, AVG($1.pagerank);

Types in Pig-Latin

- Atomic: string or number, e.g. 'Alice' or 55
- Tuple: ('Alice', 55, 'salesperson')
- Bag: {('Alice', 55, 'salesperson'), ('Betty', 44, 'manager'), ...}
- Maps: we will not use these

Bags can be nested!

{('a', {1, 4, 3}), ('c', {}), ('d', {2, 2, 5, 3, 2})}

Tuple components can be referenced by number ($0, $1, $2, ...) or by name (url, category, pagerank, ...).

Loading data

Input data = FILES! The LOAD command parses an input file:

pages = LOAD 'pages-file.txt'
        USING myload()
        AS (url, category, pagerank);

The parser is provided by the user. In HW6: myudfs.jar, function RDFSplit3; example.pig shows you how to use it.

pages = LOAD 'pages-file.txt'
        USING PigStorage(',')
        AS (url, category, pagerank);
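The nested types above are what distinguish Pig's data model from flat relations: GROUP does not aggregate, it produces one tuple per key whose second field is a bag, and a later FOREACH aggregates each bag. A small Python sketch of that shape (tuples as Python tuples, bags as lists; the rows are hypothetical):

```python
from collections import defaultdict

# A flat bag of tuples, like pages(url, category, pagerank)
pages = [("a.com", "news", 0.9), ("b.com", "news", 0.4),
         ("c.com", "sports", 0.7)]

# GROUP pages BY category: one tuple per key, whose second field is a *bag*
groups = defaultdict(list)
for t in pages:
    groups[t[1]].append(t)
grouped = [(cat, bag) for cat, bag in groups.items()]
# e.g. ("news", [("a.com", "news", 0.9), ("b.com", "news", 0.4)])

# FOREACH ... GENERATE group, AVG(pages.pagerank): aggregate each bag
result = [(cat, sum(t[2] for t in bag) / len(bag)) for cat, bag in grouped]
```

Keeping the bag around (instead of aggregating immediately, as SQL's GROUP BY does) is what lets Pig pass whole groups to user-defined functions.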

FOREACH

pages_projected = FOREACH pages GENERATE url, pagerank;

Note: you cannot write joins in FOREACH:

FOREACH pages, users ...   -- error

Loading Data in HW6

There is no parser for RDF data, hence:

-- example.pig = in your starter code
register s3n://uw-cse344-code/myudfs.jar
raw = LOAD 's3n://uw-cse344-test/cse344-test-file'
      USING TextLoader AS (line:chararray);
ntriples = FOREACH raw GENERATE
           FLATTEN(myudfs.RDFSplit3(line))
           AS (subject:chararray, predicate:chararray, object:chararray);

FILTER

Remove all queries from Web bots:

good_pages = FILTER pages BY pagerank > 0.8;

JOIN

results: {(querystring, url, position)}
revenue: {(querystring, adslot, amount)}

join_result = JOIN results BY querystring, revenue BY querystring;

join_result: {(querystring, url, position, querystring, adslot, amount)}

In the current implementation, the join attribute remains duplicated.

GROUP BY

revenue: {(querystring, adslot, amount)}

grp_revenue = GROUP revenue BY querystring;
query_revenues = FOREACH grp_revenue GENERATE
                 group AS querystring,
                 SUM(revenue.amount) AS totalrevenue;

grp_revenue: {(group, {(querystring, adslot, amount)})}
query_revenues: {(querystring, totalrevenue)}
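The JOIN and GROUP BY steps above can be traced in Python on toy results/revenue bags (hypothetical rows). Note how the joined tuple keeps both copies of the join attribute, as the slide points out:

```python
from collections import defaultdict

results = [("q1", "a.com", 1), ("q1", "b.com", 2), ("q2", "c.com", 1)]
revenue = [("q1", "top", 10.0), ("q1", "side", 5.0), ("q2", "top", 8.0)]

# JOIN results BY querystring, revenue BY querystring
# Each output tuple: (querystring, url, position, querystring, adslot, amount)
# -- the join attribute appears twice, once from each input
join_result = [r + s for r in results for s in revenue if r[0] == s[0]]

# GROUP revenue BY querystring, then SUM(revenue.amount) per group
grp = defaultdict(list)
for s in revenue:
    grp[s[0]].append(s)
query_revenues = [(q, sum(t[2] for t in bag)) for q, bag in grp.items()]
```

The nested-loop join here is only for clarity; Pig compiles JOIN into a shuffle on the join key, not a cross product.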

Simple MapReduce

input: {(field1, field2, field3, ...)}

map_result = FOREACH input GENERATE FLATTEN(map(*));
key_groups = GROUP map_result BY $0;
output = FOREACH key_groups GENERATE reduce(*);

map_result: {(a1, a2, a3, ...)}
key_groups: {(group, {(a1, a2, a3, ...)})}

Co-Group

results: {(querystring, url, position)}
revenue: {(querystring, adslot, amount)}

grouped_data = COGROUP results BY querystring,
               revenue BY querystring;

grouped_data: {(group, results:{(querystring, url, position)},
                       revenue:{(querystring, adslot, amount)})}

What is the output type in general?

Co-Group (continued)

url_revenues = FOREACH grouped_data GENERATE
               FLATTEN(distributeRevenue(results, revenue));

distributeRevenue is a UDF that accepts search results and revenue information for one query string at a time, and outputs a bag of urls and the revenue attributed to them.

Is this an inner join or an outer join?

Co-Group vs. Join

grouped_data = COGROUP results BY querystring, revenue BY querystring;
join_result = FOREACH grouped_data GENERATE
              FLATTEN(results), FLATTEN(revenue);

The result is the same as JOIN.

Asking for Output: STORE

In example.pig:

STORE count_by_object_ordered INTO '/user/hadoop/example-results'
      USING PigStorage();
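COGROUP's output shape, and the claim that flattening both bags reproduces JOIN, can be checked with a small Python sketch (hypothetical rows; bags as lists):

```python
# Toy stand-ins for the results and revenue bags
results = [("q1", "a.com", 1), ("q1", "b.com", 2), ("q2", "c.com", 1)]
revenue = [("q1", "top", 10.0), ("q2", "top", 8.0), ("q3", "side", 2.0)]

# COGROUP results BY querystring, revenue BY querystring:
# one tuple per key, holding one bag per input relation
# (keys present in either input appear, possibly with an empty bag)
keys = {t[0] for t in results} | {t[0] for t in revenue}
grouped_data = [
    (k,
     [t for t in results if t[0] == k],   # results bag
     [t for t in revenue if t[0] == k])   # revenue bag
    for k in sorted(keys)
]

# FOREACH grouped_data GENERATE FLATTEN(results), FLATTEN(revenue):
# cross product of the two bags per key -> the same tuples as an inner JOIN
join_result = [r + s for _, rs, ss in grouped_data for r in rs for s in ss]
```

This also shows the answer to the slide's question: COGROUP itself keeps key "q3" even though its results bag is empty (outer-join-like), but the FLATTEN step drops any key with an empty bag, which is what makes the final result an inner join.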

Implementation

Over Hadoop! Parse the query:
- Everything between LOAD and STORE -> one logical plan
- Logical plan -> a sequence of MapReduce jobs
- All statements between two (CO)GROUPs -> one MapReduce job