Distributed Data Management Summer Semester 2013 TU Kaiserslautern


1 Distributed Data Management, Summer Semester 2013, TU Kaiserslautern
Dr.-Ing. Sebastian Michel saarland.de
Distributed Data Management, SoSe 2013, S. Michel

2 Lecture 4: PIG/HIVE

3 MapReduce
Remember the slides on the pros and cons of MapReduce, particularly the criticism (too low level, ...). We have seen how to code joins in MR, and how to filter (grep!), group by, ... Now: look at high-level tools on top of MapReduce. Why? Claim: MapReduce is too low level for normal users (developers), and ad-hoc queries require a large effort.

4 Pig & Pig Latin
Pig is a high-level tool for expressing data analysis programs; it originated at Yahoo and is now at Apache. Its compiler transforms a query into a sequence of MapReduce jobs. Pig Latin is a data flow language (not really something like SQL). http://pig.apache.org
Gates et al.: Building a High-Level Dataflow System on top of MapReduce: The Pig Experience. PVLDB 2(2): (2009)

5 Relation of Pig and Hadoop
Pig Latin commands:
    A = LOAD 'input' AS (x, y, z);
    B = FILTER A BY x > 5;
    STORE B INTO 'output';
Pig: parsing and logical optimization, then creation of MapReduce jobs and running them on Hadoop MapReduce.

6 Example
Input, e.g., using the shell (prompt grunt>). Commands like:
    A = LOAD 'input' AS (x, y, z);
    B = FILTER A BY x > 5;
    STORE B INTO 'output';
Pig operates directly over files (and over other sources, if specified by user-defined functions (UDFs)).
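To make the data flow of these three statements concrete, here is a minimal Python sketch of what LOAD, FILTER, and STORE do to the records. It is only an analogy: the tab-separated sample lines and the helper names `load`, `filter_by_x`, and `store` are assumptions for illustration, and Pig itself would compile the script into MapReduce jobs rather than run it in memory.

```python
# In-memory analogy of:
#   A = LOAD 'input' AS (x, y, z); B = FILTER A BY x > 5; STORE B INTO 'output';
# The sample rows are hypothetical; Pig would read them from a file.

def load(lines):
    """LOAD: parse tab-separated lines into (x, y, z) tuples of ints."""
    return [tuple(int(field) for field in line.split("\t")) for line in lines]

def filter_by_x(rows, threshold=5):
    """FILTER ... BY x > 5: keep rows whose first field exceeds the threshold."""
    return [row for row in rows if row[0] > threshold]

def store(rows):
    """STORE: serialize the tuples back into tab-separated lines."""
    return ["\t".join(str(field) for field in row) for row in rows]

input_lines = ["1\t2\t3", "6\t7\t8", "9\t0\t1"]  # hypothetical content of 'input'
A = load(input_lines)
B = filter_by_x(A)
output_lines = store(B)
print(output_lines)
```

Each relation (A, B) is simply the result of applying one operator to the previous relation, which is what makes Pig Latin a data flow language.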

7 (Nested) Data Model
Atom: int, double, chararray, etc. E.g., 'Distributed Data Management', 'Michel'
Tuple: sequence of fields (of any type). E.g., ('Distributed Data Management', 2013, {(1,2,3)})
Bag: collection of tuples (a multiset, i.e., it can contain duplicates). E.g., {('DDM13', 'Infosys13')}
Map: mapping of keys to values. E.g., {'Michel' => {'DDM13'}, 'Deßloch' => {'Infosys13'}}
This nesting violates the First Normal Form of traditional RDBMSs.
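The four data types can be loosely mirrored with built-in Python types. The sketch below is an analogy under that assumption, not Pig's internal representation; the helper `flatten_last_field` is a hypothetical name that imitates what un-nesting a bag means.

```python
# Rough Python analogues of Pig's nested data model (an analogy only).
atom = "Distributed Data Management"                  # atom: one simple value
bag = [(1, 2, 3)]                                     # bag: multiset of tuples, modeled as a list
record = ("Distributed Data Management", 2013, bag)   # tuple: fields of any type, including a bag
courses = [("DDM13", "Infosys13"), ("DDM13", "Infosys13")]    # a bag may hold duplicates
lecturers = {"Michel": ["DDM13"], "Deßloch": ["Infosys13"]}   # map: keys to values

def flatten_last_field(rec):
    """Un-nest a bag stored in the last field: one flat row per tuple in the bag."""
    *prefix, nested = rec
    return [tuple(prefix) + inner for inner in nested]

# The third field of `record` is itself a collection, so the record is not in
# First Normal Form; flattening yields rows with atomic fields only.
flat_rows = flatten_last_field(record)
print(flat_rows)
```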

8 Pig Latin: Example Joins
A: (2,Tie) (4,Coat) (3,Hat) (1,Scarf)
B: (Joe,2) (Hank,4) (Ali,0) (Eve,3) (Hank,2)
    A = LOAD ...;
    B = LOAD ...;
    C = JOIN A BY $0, B BY $1;
Outer joins are also supported.
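The result of joining A on its first field with B on its second field can be traced with a small Python sketch. It uses a simple hash join as a stand-in; this is an assumption for illustration and not a description of how Pig executes the join on Hadoop. The function name `join` and the `left_outer` flag are hypothetical.

```python
from collections import defaultdict

A = [(2, "Tie"), (4, "Coat"), (3, "Hat"), (1, "Scarf")]
B = [("Joe", 2), ("Hank", 4), ("Ali", 0), ("Eve", 3), ("Hank", 2)]

def join(left, left_key, right, right_key, left_outer=False):
    """Equi-join on positional fields, like JOIN left BY $left_key, right BY $right_key."""
    index = defaultdict(list)
    for row in right:                      # build a lookup table on the right input
        index[row[right_key]].append(row)
    width = len(right[0]) if right else 0
    result = []
    for row in left:                       # probe it with every left tuple
        matches = index.get(row[left_key], [])
        if matches:
            result.extend(row + match for match in matches)
        elif left_outer:                   # outer join: pad unmatched tuples with nulls
            result.append(row + (None,) * width)
    return result

C = sorted(join(A, 0, B, 1))               # sorted only to get a stable order
outer = join(A, 0, B, 1, left_outer=True)
print(C)
```

In the inner join, (1,Scarf) and (Ali,0) have no partner and drop out; the left outer variant keeps (1,Scarf) padded with nulls.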

9 Data with Associated Schema
    PARTS = LOAD 'hdfs:///user/hduser/testjoin/parts.txt' AS (id: int, name: chararray);
    PEOPLE = LOAD 'hdfs:///user/hduser/testjoin/people.txt' AS (name: chararray, partsid: int);

10 Pig Latin: Commands (Subset)
LOAD, STORE, DUMP; FILTER; FLATTEN; FOREACH ... GENERATE; GROUP; CROSS; JOIN; ORDER BY; LIMIT
Plus: built-in and user-defined functions. http://wiki.apache.org/pig/piglatin
Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins: Pig latin: a not-so-foreign language for data processing. SIGMOD Conference 2008

11 Example: Word Count
    -- Load the input file from HDFS
    A = LOAD 'hdfs:///user/hduser/gutenberg' AS (line: chararray);
    -- Parse input lines into words
    B = FOREACH A GENERATE FLATTEN(TOKENIZE(line)) AS term;
    -- Remove whitespace-only words (keep terms consisting of word characters)
    C = FILTER B BY term MATCHES '\\w+';
    -- Group by term
    D = GROUP C BY term;
    -- Count, for each group (i.e., for each term), its occurrences
    E = FOREACH D GENERATE group, COUNT($1) AS frequency;
    -- Order by frequency of occurrence
    F = ORDER E BY frequency ASC;
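The same pipeline can be followed step by step in plain Python. This is a sketch under simplifying assumptions: the two sample lines are made up, and `str.split()` only approximates Pig's TOKENIZE, which also splits on a few punctuation characters.

```python
import re
from collections import Counter

lines = ["the cat and the hat", "the end , really"]   # hypothetical input lines (relation A)

# B: FLATTEN(TOKENIZE(line)) turns every line into one row per term
terms = [term for line in lines for term in line.split()]

# C: keep only terms that consist entirely of word characters
words = [term for term in terms if re.fullmatch(r"\w+", term)]

# D + E: group by term and count the members of each group
counts = Counter(words)

# F: order by frequency, ascending
F = sorted(counts.items(), key=lambda pair: pair[1])
print(F[-1])   # the most frequent term comes last
```

Grouping followed by counting is the step that needs a shuffle in MapReduce, which is why Pig compiles GROUP into a job boundary.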


12 Example: Word Count (Cont'd)
Output:
    ...
    (which,2475)
    (it,2553)
    (that,2715)
    (a,3813)
    (is,4178)
    (to,5070)
    (in,5236)
    (and,7666)
    (of,10394)
    (the,20592)
Log excerpt:
    :02:21,062 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 3
    ...
    Counters: Total records written : Total bytes written :
Logically, multiple connected MapReduce jobs form a DAG (directed acyclic graph).
    Job DAG: job_ _ > job_ _0052, job_ _ > job_ _0053, job_ _0053

13 Optimizations
Logical optimization: filter as early as possible, and eliminate unnecessary information (project). Multiple MapReduce jobs (in general, not only here in Pig) offer opportunities to optimize the execution order, subject to the DAG dependencies.
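Respecting DAG dependencies means a job may only start once all jobs it depends on have finished. One way to derive such an order is a topological sort; the sketch below uses Kahn's algorithm. The job names `job_1` to `job_3` are hypothetical placeholders for a three-job chain, and this is not a description of Pig's actual scheduler.

```python
from collections import deque

def execution_order(successors):
    """Return a valid run order for a job DAG given as job -> list of dependent jobs."""
    indegree = {job: 0 for job in successors}
    for dependents in successors.values():
        for job in dependents:
            indegree[job] = indegree.get(job, 0) + 1
    ready = deque(job for job, degree in indegree.items() if degree == 0)
    order = []
    while ready:
        job = ready.popleft()              # every job in `ready` could also run in parallel
        order.append(job)
        for dependent in successors.get(job, []):
            indegree[dependent] -= 1
            if indegree[dependent] == 0:   # all prerequisites done
                ready.append(dependent)
    if len(order) != len(indegree):
        raise ValueError("cycle detected: the job graph is not a DAG")
    return order

job_dag = {"job_1": ["job_2"], "job_2": ["job_3"], "job_3": []}   # hypothetical chain
order = execution_order(job_dag)
print(order)
```

Whenever more than one job sits in the ready queue, the order among them is free, and that freedom is where execution-order optimization can take place.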

14 Pig vs. Native MapReduce
Two sides of the coin (generally). Statement from a Twitter engineer: "typically a Pig script is 5% of the code of native map/reduce written in about 5% of the time. However, queries typically take between % the time to execute that a native map/reduce job would have taken." http://blog.tonybain.com/tony_bain/2009/11/analytics-at-twitter.html

15 Pig Latin vs. SQL
Pig Latin is a data flow programming language: the user specifies operations that are put together to achieve a task. SQL is declarative: the user specifies what the result should be, not how it is implemented.

16 Pig vs. RDBMS
RDBMS: tables with a predefined schema; support for transactions and indices; aims at fast response time.
Pig: schema at runtime (even optional); any source (by applying user-defined functions); no loading/indexing of data as pre-processing, since data is loaded at execution time (usually from HDFS); like MapReduce, aims at throughput, not at super fast short queries.


17 One more: Hive
For structured data. On top of Hadoop (like Pig) and, hence, HDFS. "RDBMS for big data". The query language is similar to SQL (declarative); it is not a data flow language like Pig Latin. Originated from Facebook's effort to analyze their data; now an Apache project.

18 Hive
Data is organized in tables, stored in files.
    CREATE TABLE records (year STRING, temperature INT, quality INT);
    LOAD DATA LOCAL INPATH 'input/.../sample.txt' OVERWRITE INTO TABLE records;

19 Hive QL
    SELECT year, MAX(temperature)
    FROM records
    WHERE temperature != 9999 AND ...
    GROUP BY year;
No full support of the SQL-92 standard.
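Because this query uses only standard SQL constructs, its behavior can be tried out locally with Python's built-in `sqlite3` module. Note the assumptions: SQLite stands in for Hive here, the sample rows are made up, the elided second condition of the WHERE clause is left out, and an ORDER BY is added only to make the printed result deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (year TEXT, temperature INTEGER, quality INTEGER)")

# Hypothetical sample rows; 9999 marks a missing temperature reading.
rows = [("1949", 111, 1), ("1949", 78, 1), ("1950", 0, 1), ("1950", 22, 1), ("1950", 9999, 1)]
conn.executemany("INSERT INTO records VALUES (?, ?, ?)", rows)

result = conn.execute(
    "SELECT year, MAX(temperature) FROM records "
    "WHERE temperature != 9999 "
    "GROUP BY year ORDER BY year"
).fetchall()
print(result)
conn.close()
```

The declarative style is visible here: the query states which rows and aggregates are wanted, while the engine (SQLite locally, MapReduce jobs in Hive) decides how to compute them.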

20 Architecture
[Figure: Hive architecture. Components shown: Thrift Client, CLI, Metastore, Metastore database, Applications, JDBC Client, ODBC Client, Hive Server, Web Interface, Driver, File System, JobClient, Hadoop cluster.]

21 Literature
Alan Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan Narayanam, Christopher Olston, Benjamin Reed, Santhosh Srinivasan, Utkarsh Srivastava: Building a High-Level Dataflow System on top of MapReduce: The Pig Experience. PVLDB 2(2): (2009)
Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins: Pig latin: a not-so-foreign language for data processing. SIGMOD Conference 2008
http://pig.apache.org
http://wiki.apache.org/pig/piglatin
http://hive.apache.org/
