1 Pig Arend Hintze

2 This is so cumbersome!
Instead of programming everything in Java MapReduce or streaming, wouldn't it be wonderful to have a simpler interface?
Problem: break down complex MapReduce tasks into simple commands. Pig does that!

3 Approach
Pig Latin -> high-level language to program tasks; takes input and queues commands on that input to create output
Pig compiler -> takes a script, creates jobs, runs them locally or distributed on a Hadoop cluster (HDFS)

4 Interacting with Pig
Grunt shell -> enter commands directly (start: pig -x local OR: pig -x mapreduce)
pig pigscript.pig -> executes a Pig script
Put your commands inside your Java program (import org.apache.pig.PigServer)
OR: use the graphical web interface!

5 Typical workflow
Load data into an alias:
    alias = LOAD filename AS ( )
Manipulate the alias using relational operators or functions:
    new_alias = pig_command(old_alias)
Dump the alias to the shell, output, or a file in an HDFS directory

6 Pig Latin command categories
Read-write from/to HDFS
Diagnostics
Data types
Expressions and functions
Relational operators

7 Read-Write Operators

8 Example
    data = LOAD 'movie_short.csv' USING PigStorage(',') AS (id,name,year,rating,score);
    res = FILTER data BY (float)rating > 2.0;
    DUMP res;

9 LOADING
    alias = LOAD file [USING function] [AS schema];
Default: assumes the data is tab-delimited
If the data uses a different delimiter, use PigStorage('delimiter')
The schema can have types:
    data = LOAD 'movie_short.csv' USING PigStorage(',') AS (id:int,name:chararray,year:int,rating:float,score:float);

10 SAVING
DUMP just dumps the output to the command line or display
STORE saves the content to a file
LIMIT allows you to specify the number of tuples to be returned

11 Saving
Uses tabs as the delimiter:
    data = LOAD 'movie_short.csv' USING PigStorage(',') AS (id,name,year,rating,score);
    res = FILTER data BY (float)rating > 2.0;
    STORE res INTO 'output';
Uses another delimiter:
    data = LOAD 'movie_short.csv' USING PigStorage(',') AS (id,name,year,rating,score);
    res = FILTER data BY (float)rating > 2.0;
    STORE res INTO 'output' USING PigStorage(',');

12 LIMIT
    data = LOAD 'movie_short.csv' USING PigStorage(',') AS (id,name,year,rating,score);
    res = FILTER data BY (float)rating > 2.0;
    res_limit = LIMIT res 10;
    STORE res_limit INTO 'output' USING PigStorage(',');

13 DIAGNOSTIC

14 Atomic Data Types
    data = LOAD 'movie_short.csv' USING PigStorage(',') AS (id:int,name:chararray,year:int,rating:float,score:float);

15 Complex Data Types

16 Data Types
A field in a tuple or a value in a map can be null or any atomic/complex type (NESTING):
    (John, {(48, Jolly Rd, Okemos), (10, Grand, Lansing)})
Defining a schema:
If you leave out the field type, Pig defaults to bytearray
If you leave out the name, the field is unnamed and you can reference it by its position ($0, $1, $2, and so on)

17 Loading complex data types tuples are tab delimited

18 Expressions 1
Expressions are used in FILTER, FOREACH, GROUP, and SPLIT, as well as in eval functions

19 Expressions 2

20 Built-In Functions case sensitive!

21 PIG the 2nd Arend Hintze

22 Relational operators
FOREACH, FILTER, ORDER BY, SPLIT, UNION, DISTINCT, GROUP, JOIN
Dataset:
    A,1,2,3,m
    B,1,2,3,m
    C,2,2,2,f

23 FOREACH
    data = LOAD 'grades.csv' USING PigStorage(',') AS (name,g1:int,g2:int,g3:int,gender);
    sums = FOREACH data GENERATE name, g1+g2+g3;
    DUMP sums;

24 FILTER
    data = LOAD 'grades.csv' USING PigStorage(',') AS (name,g1:int,g2:int,g3:int,gender);
    goodones = FILTER data BY g2 > 10;
    DUMP goodones;

25 ORDER BY
    data = LOAD 'grades.csv' USING PigStorage(',') AS (name,g1:int,g2:int,g3:int,gender);
    myorder = ORDER data BY name DESC;
    DUMP myorder;

26 SPLIT
    data = LOAD 'grades.csv' USING PigStorage(',') AS (name,g1:int,g2:int,g3:int,gender);
    sums = FOREACH data GENERATE name, g1+g2+g3;
    SPLIT sums INTO high IF $1 > 100, low IF $1 <= 100;
    DUMP low;
    DUMP high;
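To make the FOREACH and SPLIT semantics concrete, here is a rough sketch in plain Python (not Pig itself) that mimics the two statements on the three rows from the "relational operators" slide. It assumes grades.csv contains exactly those rows; the list comprehensions only imitate what Pig computes.

```python
# The three rows shown on the "relational operators" slide
# (assumed here to be the content of grades.csv).
data = [("A", 1, 2, 3, "m"), ("B", 1, 2, 3, "m"), ("C", 2, 2, 2, "f")]

# sums = FOREACH data GENERATE name, g1+g2+g3;
sums = [(name, g1 + g2 + g3) for name, g1, g2, g3, _gender in data]

# SPLIT sums INTO high IF $1 > 100, low IF $1 <= 100;
# Each condition is evaluated independently for every tuple.
high = [t for t in sums if t[1] > 100]
low = [t for t in sums if t[1] <= 100]

print("sums:", sums)
print("high:", high)
print("low: ", low)
```

With this small dataset every sum is 6, so all tuples land in low and high stays empty.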

27 UNION
    data = LOAD 'grades.csv' USING PigStorage(',') AS (name,g1:int,g2:int,g3:int,gender);
    sums = FOREACH data GENERATE name, g1+g2+g3;
    SPLIT sums INTO high IF $1 > 100, low IF $1 <= 100;
    DUMP low;
    DUMP high;
    myu = UNION low, high;
    DUMP myu;

28 DISTINCT
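DISTINCT removes duplicate tuples from a relation, comparing whole tuples rather than single fields. A rough sketch in plain Python (not Pig itself), using hypothetical rows; the sort is only there to make the printed result deterministic, since Pig does not guarantee an output order here.

```python
# Hypothetical relation with one duplicated tuple.
rows = [("A", 1), ("B", 1), ("A", 1), ("A", 2)]

# Rough equivalent of: uniq = DISTINCT rows;
# ("A", 1) and ("A", 2) differ in one field, so both are kept.
uniq = sorted(set(rows))

print(uniq)
```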

29 GROUP
    data = LOAD 'grades.csv' USING PigStorage(',') AS (name,g1:int,g2:int,g3:int,gender);
    genders = GROUP data BY gender;
    DUMP genders;
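The result of GROUP is a relation with one tuple per key: the first field is the group key, the second is a bag holding all original tuples with that key. The following plain-Python sketch (not Pig itself) imitates that shape on the three rows from the "relational operators" slide, assuming they are the content of grades.csv. A Python list stands in for the bag, and the keys are sorted only to make the output deterministic.

```python
from collections import defaultdict

# The three rows shown on the "relational operators" slide
# (assumed here to be the content of grades.csv).
data = [("A", 1, 2, 3, "m"), ("B", 1, 2, 3, "m"), ("C", 2, 2, 2, "f")]

# genders = GROUP data BY gender;
bags = defaultdict(list)
for row in data:
    bags[row[4]].append(row)  # field 4 is gender

# One (group, bag) tuple per distinct gender.
genders = sorted(bags.items())

for group, bag in genders:
    print((group, bag))
```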

30 JOIN (inner join)
    dataa = LOAD 'gradesa.csv' USING PigStorage(',') AS (name,g1:int,g2:int,g3:int,gender);
    datab = LOAD 'gradesb.csv' USING PigStorage(',') AS (name,g1:int,g2:int,g3:int,gender);
    j = JOIN dataa BY name, datab BY name;
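An inner join keeps only the pairs of tuples whose keys match in both inputs, and each output tuple carries the fields of the left tuple followed by the fields of the right one. A plain-Python sketch (not Pig itself) of that behaviour; the rows are hypothetical, since the slides do not show the contents of gradesa.csv and gradesb.csv.

```python
# Hypothetical contents of gradesa.csv and gradesb.csv.
dataa = [("A", 1, 2, 3, "m"), ("B", 1, 2, 3, "m")]
datab = [("B", 4, 5, 6, "m"), ("C", 2, 2, 2, "f")]

# j = JOIN dataa BY name, datab BY name;
# Only names present in both inputs survive; fields are concatenated.
j = [a + b for a in dataa for b in datab if a[0] == b[0]]

print(j)
```

Only B appears in both inputs, so the result has a single tuple with ten fields.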

31 Built-In Functions CASE sensitive!

32 FLATTEN
Flattens a nested data type (bags, for example)
List of bags -> list of all bag elements
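As a rough illustration in plain Python (not Pig itself), flattening a bag produces one output tuple per bag element, with the remaining fields repeated. The row below reuses the nested John example from the "Data Types" slide; the field names name and addresses are assumptions made for this sketch.

```python
# One tuple with a nested bag of address tuples, as on the "Data Types" slide.
people = [("John", [(48, "Jolly Rd", "Okemos"), (10, "Grand", "Lansing")])]

# Rough equivalent of: FOREACH people GENERATE name, FLATTEN(addresses);
# Each bag element becomes its own row; its fields join the outer tuple.
flat = [(name,) + address for name, addresses in people for address in addresses]

for row in flat:
    print(row)
```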

33 WordCount in PIG
    data = LOAD 'file0';
    words = FOREACH data GENERATE TOKENIZE($0) AS wordlist;
    allwords = FOREACH words GENERATE FLATTEN(wordlist), 1;
    grp = GROUP allwords BY $0;
    counts = FOREACH grp GENERATE $0, SUM($1.$1);
    DUMP counts;

34 Step 1:

35 Step 2:

36 Step 3:

37 Step 4:
For the first row/record: (a, {(a,1), (a,1)})
$0 = a
$1 = {(a,1), (a,1)} (the bag of grouped tuples)
$1.$1 = {(1), (1)} (the second field of every tuple in the bag)
=> SUM($1.$1) = 2

38 Step 5:
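The intermediate relations of the WordCount script can be traced statement by statement with a plain-Python sketch (not Pig itself). The two input lines are hypothetical, since the content of file0 is not shown, and str.split() is only an approximation of TOKENIZE, which also splits on a few punctuation characters.

```python
from collections import defaultdict

# data = LOAD 'file0';   (hypothetical content: one field per line)
data = [("a b a",), ("b c",)]

# words = FOREACH data GENERATE TOKENIZE($0) AS wordlist;
words = [(line[0].split(),) for line in data]

# allwords = FOREACH words GENERATE FLATTEN(wordlist), 1;
allwords = [(w, 1) for (wordlist,) in words for w in wordlist]

# grp = GROUP allwords BY $0;   -> (group key, bag of matching tuples)
bags = defaultdict(list)
for t in allwords:
    bags[t[0]].append(t)
grp = sorted(bags.items())

# counts = FOREACH grp GENERATE $0, SUM($1.$1);
# $1 is the bag; $1.$1 picks the second field (the 1) of each tuple in it.
counts = [(key, sum(t[1] for t in bag)) for key, bag in grp]

for name, rel in [("words", words), ("allwords", allwords),
                  ("grp", grp), ("counts", counts)]:
    print(name, "=", rel)
```

The first grouped row here is (a, {(a,1), (a,1)}), the same record discussed under Step 4, and its count comes out as 2.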

39 Macros
Macros provide a way to define reusable code (functions):
    DEFINE <macroname> (<args>) RETURNS <returnvalue> { thecode };
Wordcount example:
    DEFINE wordcount(text) RETURNS counts {
        tokens = FOREACH $text GENERATE TOKENIZE($0) AS terms;
        wordlist = FOREACH tokens GENERATE FLATTEN(terms) AS word, 1 AS freq;
        groups = GROUP wordlist BY word;
        -- built-in function names are case sensitive, so SUM must be uppercase
        $counts = FOREACH groups GENERATE group AS word, SUM(wordlist.freq) AS freq;
    };
