Introduction to Apache Pig and Hive


1 Introduction to Apache Pig and Hive. Pelle Jakovits, 30 September 2014, Tartu

2 Outline: Why Pig or Hive instead of MapReduce; Apache Pig (Pig Latin language, examples, architecture); Hive (Hive Query Language, examples); Disadvantages of scripting languages; Pig vs Hive.

3 You already know MapReduce. MapReduce = Map, GroupBy, Sort, Reduce. Designed for huge-scale data processing. Provides: a distributed file system; high scalability; automatic parallelisation; automatic fault recovery (data is replicated, and failed tasks are re-executed on other nodes).

4 Is MapReduce enough? Hadoop MapReduce is one of the most used frameworks for large-scale data processing. However: writing low-level MapReduce code is slow; optimizing MapReduce code takes a lot of expertise; prototyping is slow; a lot of custom code is required, even for the simplest tasks; more complex MapReduce job chains are hard to manage.


5 Apache Pig. A data flow framework on top of Hadoop MapReduce: retains all its advantages, and some of its disadvantages. Models a scripting language: fast prototyping. Uses the Pig Latin language: similar to declarative SQL; easier to get started with. Pig Latin statements are automatically translated into MapReduce jobs.

6 Pig workflow (diagram; image not included)

7 Pig workflow (diagram; image not included)

8 Advantages of Pig. Easy to program: about 5% of the code and 5% of the time required, compared with writing MapReduce directly. Self-optimizing: Pig Latin statement optimizations; generated MapReduce code optimizations. Can manage more complex data flows: easy to use and join multiple separate inputs, transformations and outputs. Extensible: can be extended with User Defined Functions (UDFs) to provide more functionality.

9 Running Pig. Local mode: everything is installed locally on one machine. Distributed mode: everything runs in a MapReduce cluster. Interactive mode: the Grunt shell. Batch mode: Pig scripts.

10 Pig Latin. Write complex MapReduce transformations using a much simpler scripting language. Not quite SQL, but similar. Lazy evaluation. Compiling is hidden from the user.

11 Pig Latin Example
    I = load '/mydata/images' using ImageParser() as (id, image);
    F = foreach I generate id, detectfaces(image);
    store F into '/mydata/faces';
Input and output are HDFS folders or files: /mydata/images and /mydata/faces. I and F are relations. The right-hand side contains Pig expressions.

12 Relations, Bags, Tuples, Fields. Relation: similar to a table in a relational database; consists of a bag; can have nested relations. Bag: a collection of unordered tuples. Tuple: an ordered set of fields; similar to a row in a relational database; can contain any number of fields, and does not have to match other tuples. Field: a piece of data.

13 Fields. A field consists of either data atoms (int, long, float, double, chararray, boolean, datetime, etc.) or complex data (bag, map, tuple).
Assigning types to fields:
    A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
Referencing fields: by position ($0, $1, $2) or by name (assigned by user schemas):
    A = LOAD 'in.txt' AS (age, name, occupation);

14 Complex data types. Looking into complex, nested data: client.$0, author.age

15 Loading and storing data
LOAD (the user defines the data loader and delimiters):
    A = LOAD 'myfile.txt' USING PigStorage('\t') AS (f1:int, f2:int, f3:int);
STORE:
    STORE A INTO 'output_1.txt' USING PigStorage(',');
    STORE B INTO 'output_2.txt' USING PigStorage('*');
Other data loaders: BinStorage, PigDump, TextLoader. Or create a custom one.

16 FOREACH GENERATE. General data transformation statement. Used to: change the structure of data; apply functions to data; flatten complex data to remove nesting.
    X = FOREACH C GENERATE FLATTEN(A.(a1, a2)), FLATTEN(B.$1);
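
To make the effect of FLATTEN concrete, here is a minimal Python sketch of its semantics, not Pig's implementation. It assumes a hypothetical co-grouped row of the form (key, bag A, bag B); the helper name flatten_bags and the sample values are illustrative only. When several bags are flattened in one GENERATE, the result is the cross product of their tuples.

```python
from itertools import product


def flatten_bags(*bags):
    """Cross product of the bags; each combination becomes one flat tuple."""
    return [sum(combo, ()) for combo in product(*bags)]


# One row of a hypothetical co-grouped relation C: (group key, bag A, bag B)
row = (1, [(1, 2), (1, 4)], [(1, 3)])
_, bag_a, bag_b = row

# Rough analogue of: X = FOREACH C GENERATE FLATTEN(A.(a1, a2)), FLATTEN(B.$1);
a_proj = [(a1, a2) for (a1, a2) in bag_a]  # A.(a1, a2): keep both fields
b_proj = [(t[1],) for t in bag_b]          # B.$1: keep only the second field
flat = flatten_bags(a_proj, b_proj)
```

Here flat holds one un-nested tuple per combination of a tuple from A and a tuple from B; an empty bag would produce no output rows at all.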

17 GROUP ... BY
    A = load 'student' AS (name:chararray, age:int, gpa:float);
    DUMP A;
    (John, 18, 4.0F)
    (Mary, 19, 3.8F)
    (Bill, 20, 3.9F)
    (Joe, 18, 3.8F)
    B = GROUP A BY age;
    DUMP B;
    (18, {(John, 18, 4.0F), (Joe, 18, 3.8F)})
    (19, {(Mary, 19, 3.8F)})
    (20, {(Bill, 20, 3.9F)})
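
The grouping on this slide can be mimicked in a few lines of Python. This is a sketch of the semantics only, with a hypothetical helper group_by; Pig itself compiles GROUP into a MapReduce shuffle rather than an in-memory dictionary.

```python
from collections import defaultdict

# Same tuples as relation A on the slide: (name, age, gpa)
students = [("John", 18, 4.0), ("Mary", 19, 3.8), ("Bill", 20, 3.9), ("Joe", 18, 3.8)]


def group_by(relation, key_index):
    """Return (key, bag) pairs, where each bag holds the full original tuples."""
    bags = defaultdict(list)
    for t in relation:
        bags[t[key_index]].append(t)
    # Sorted by key only to make the output easy to compare with the slide.
    return sorted(bags.items())


grouped = group_by(students, 1)  # analogue of: B = GROUP A BY age;
```

Each result row pairs the group key with a bag of complete tuples, which is why later statements refer to fields as A.gpa inside the group.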

18 JOIN
    A = LOAD 'data1' AS (a1:int, a2:int, a3:int);
    B = LOAD 'data2' AS (b1:int, b2:int);
    X = JOIN A BY a1, B BY b1;
    DUMP A;   -- (1,2,3) (4,2,1)
    DUMP B;   -- (1,3) (2,7) (4,6)
    DUMP X;   -- (1,2,3,1,3) (4,2,1,4,6)
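
As a rough illustration of what the inner join on this slide computes, the following Python sketch builds a hash index on one side and concatenates matching tuples. The function name inner_join is hypothetical, and this is not how Pig executes joins on a cluster.

```python
from collections import defaultdict


def inner_join(left, right, left_key, right_key):
    """Inner equi-join: concatenate every pair of tuples whose keys match."""
    index = defaultdict(list)
    for r in right:
        index[r[right_key]].append(r)
    return [l + r for l in left for r in index.get(l[left_key], [])]


A = [(1, 2, 3), (4, 2, 1)]
B = [(1, 3), (2, 7), (4, 6)]
X = inner_join(A, B, 0, 0)  # analogue of: X = JOIN A BY a1, B BY b1;
```

The tuple (2,7) in B has no partner in A, so it does not appear in X, matching the DUMP X output on the slide.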

19 Union
    A = LOAD 'data' AS (a1:int, a2:int, a3:int);
    B = LOAD 'data' AS (b1:int, b2:int);
    X = UNION A, B;
    DUMP A;   -- (1,2,3) (4,2,1)
    DUMP B;   -- (2,4) (8,9)
    DUMP X;   -- (1,2,3) (4,2,1) (2,4) (8,9)

20 Functions
SAMPLE (X will contain about 1% of the tuples in A):
    A = LOAD 'data' AS (f1:int, f2:int, f3:int);
    X = SAMPLE A 0.01;
FILTER:
    A = LOAD 'data' AS (a1:int, a2:int, a3:int);
    X = FILTER A BY a3 == 3;
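
A small Python sketch may help contrast the two operators. It assumes SAMPLE keeps each tuple independently with the given probability, so the result size is only approximately the requested fraction; the helper names, the seed parameter and the sample data are illustrative.

```python
import random


def sample(relation, fraction, seed=None):
    """Keep each tuple with probability `fraction` (result size varies per run)."""
    rng = random.Random(seed)
    return [t for t in relation if rng.random() < fraction]


def filter_by(relation, predicate):
    """Keep exactly the tuples for which the predicate is true."""
    return [t for t in relation if predicate(t)]


A = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]
X = filter_by(A, lambda t: t[2] == 3)  # analogue of: X = FILTER A BY a3 == 3;
S = sample(A, 0.5, seed=42)            # analogue of: X = SAMPLE A 0.5;
```

FILTER is deterministic, while repeated sampling without a fixed seed can return different tuples each time.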

21 Functions
DISTINCT (removes duplicate tuples):
    X = DISTINCT A;
LIMIT:
    X = LIMIT B 3;
SPLIT:
    SPLIT A INTO X IF f1<7, Y IF f2==5, Z IF (f3<6 OR f3>6);
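
One detail of SPLIT that is easy to miss is that the conditions are evaluated independently, so a tuple may land in several output relations or in none. The Python sketch below illustrates this with a hypothetical split helper and made-up sample data.

```python
def split(relation, **conditions):
    """Route tuples into named outputs; conditions are checked independently."""
    return {name: [t for t in relation if cond(t)] for name, cond in conditions.items()}


A = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]

# Analogue of: SPLIT A INTO X IF f1<7, Y IF f2==5, Z IF (f3<6 OR f3>6);
parts = split(
    A,
    X=lambda t: t[0] < 7,
    Y=lambda t: t[1] == 5,
    Z=lambda t: t[2] < 6 or t[2] > 6,
)
```

With this data the tuple (4,5,6) ends up in both X and Y, and no tuple with f3 equal to 6 reaches Z.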

22 Pig Example
    A = LOAD 'student' USING PigStorage() AS (name, age, gpa);
    DUMP A;
    (John, 18, 4.0F)
    (Mary, 19, 3.8F)
    (Bill, 20, 3.9F)
    (Joe, 18, 3.8F)
    B = GROUP A BY age;
    C = FOREACH B GENERATE group, AVG(A.gpa);
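
For comparison, the same group-then-average computation can be sketched in plain Python. The function name avg_gpa_by_age is hypothetical, and this in-memory version only illustrates what relation C would contain.

```python
from collections import defaultdict

students = [("John", 18, 4.0), ("Mary", 19, 3.8), ("Bill", 20, 3.9), ("Joe", 18, 3.8)]


def avg_gpa_by_age(relation):
    """Analogue of GROUP A BY age followed by GENERATE group, AVG(A.gpa)."""
    gpas = defaultdict(list)
    for _name, age, gpa in relation:
        gpas[age].append(gpa)
    return {age: sum(values) / len(values) for age, values in gpas.items()}


C = avg_gpa_by_age(students)
```

C maps each age to the mean GPA of that group, for example roughly 3.9 for the two 18-year-olds.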

23 Hive. Data warehousing on top of Hadoop. Designed to enable easy data summarization, ad-hoc querying, and analysis of large volumes of data. HiveQL statements are automatically translated into MapReduce jobs.

24 Advantages of Hive. Higher-level query language: simplifies working with large amounts of data. Lower learning curve than Pig or MapReduce: HiveQL is much closer to SQL than Pig Latin is; less trial and error than with Pig.

25 Running Hive. We will look at it more closely in the practice session, but you can run Hive from: the Hive web interface; the Hive shell ($HIVE_HOME/bin/hive for an interactive shell, or run queries directly with $HIVE_HOME/bin/hive -e 'select a.col from tab1 a'); JDBC (Java Database Connectivity), using "jdbc:hive://host:port/dbname". It is also possible to use Hive directly from Python, C, C++ and PHP.

26 HiveQL. The Hive query language provides the basic SQL-like operations: Ability to filter rows from a table using a where clause. Ability to select certain columns from the table using a select clause. Ability to do equi-joins between two tables. Ability to evaluate aggregations on multiple "group by" columns for the data stored in a table. Ability to store the results of a query into another table. Ability to download the contents of a table to a local directory. Ability to store the results of a query in a Hadoop DFS directory. Ability to manage tables and partitions (create, drop and alter). Ability to use custom scripts in a chosen language (for map/reduce).

27 Data units. Databases: namespaces that separate tables and other data units to avoid naming conflicts. Tables: homogeneous units of data which have the same schema; a table consists of the columns specified by its schema. Partitions: each table can have one or more partition keys, which determine how the data is stored. Partitions allow the user to efficiently identify the rows that satisfy certain criteria. It is the user's job to guarantee the relationship between partition name and data! Partitions are virtual columns: they are not part of the data, but are derived on load. Buckets (or clusters): data in each partition may in turn be divided into buckets based on the value of a hash function of some column of the table. For example, the page_views table may be bucketed by userid to sample the data.
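
The bucketing idea can be sketched in Python as "hash the column value, then take it modulo the number of buckets". The CRC32 hash used here is an assumption chosen for determinism and is not Hive's actual hash function; the helper names and sample rows are likewise hypothetical.

```python
import zlib


def bucket_for(value, num_buckets):
    """Map a column value to a bucket number (CRC32 is a stand-in hash)."""
    digest = zlib.crc32(str(value).encode("utf-8"))
    return digest % num_buckets


def bucketize(rows, column, num_buckets):
    """Distribute rows into buckets according to the hash of one column."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_for(row[column], num_buckets)].append(row)
    return buckets


page_views = [
    {"userid": 101, "page_url": "/home"},
    {"userid": 202, "page_url": "/news"},
    {"userid": 101, "page_url": "/about"},
    {"userid": 303, "page_url": "/home"},
]
buckets = bucketize(page_views, "userid", 4)
```

Because all rows of one userid fall into the same bucket, reading a single bucket gives a sample that contains complete histories for a subset of users.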

28 Complex types. Structs: the elements within the type can be accessed using the dot (.) notation. For example, for a column c of type STRUCT {a INT; b INT}, the a field is accessed by the expression c.a. Maps (key-value tuples): the elements are accessed using ['element name'] notation. For example, in a map M comprising a mapping from 'group' -> gid, the gid value can be accessed using M['group']. Arrays (indexable lists): the elements in the array must be of the same type. Elements can be accessed using the [n] notation, where n is a zero-based index into the array. For example, for an array A having the elements ['a', 'b', 'c'], A[1] returns 'b'.

29 Create Table
    CREATE TABLE page_view(viewtime INT, userid BIGINT,
        page_url STRING, referrer_url STRING,
        ip STRING COMMENT 'IP Address of the User')
    COMMENT 'This is the page view table'
    PARTITIONED BY(dt STRING, country STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '1'
    STORED AS SEQUENCEFILE;

30 Load Data. There are multiple ways to load data into Hive tables. The user can create an external table that points to a specified location within HDFS. The user can copy a file into the specified location in HDFS and create a table pointing to this location with all the relevant row format information. Once this is done, the user can transform the data and insert it into any other Hive table.

31 Load example
    hadoop dfs -put /tmp/pv_ txt /user/data/staging/page_view
    FROM page_view_stg pvs
    INSERT OVERWRITE TABLE page_view PARTITION(dt=' ', country='us')
    SELECT pvs.viewtime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip
    WHERE pvs.country = 'US';

32 Loading and storing data. Loading data directly:
    LOAD DATA LOCAL INPATH /tmp/pv_ _us.txt INTO TABLE page_view PARTITION(date=' ', country='us')

33 Storing locally. To write the output to a local disk, for example to load it into an Excel spreadsheet later:
    INSERT OVERWRITE LOCAL DIRECTORY '/tmp/pv_gender_sum'
    SELECT pv_gender_sum.* FROM pv_gender_sum;

34 INSERT
    INSERT OVERWRITE TABLE xyz_com_page_views
    SELECT page_views.*
    FROM page_views
    WHERE page_views.date >= ' ' AND page_views.date <= ' '
      AND page_views.referrer_url like '%xyz.com';

35 Multiple table/file inserts. The output can be sent into multiple tables or even to Hadoop DFS files (which can then be manipulated using HDFS utilities). If, along with the gender breakdown, one needed to find the breakdown of unique page views by age, one could accomplish that with the following query:
    FROM pv_users
    INSERT OVERWRITE TABLE pv_gender_sum
        SELECT pv_users.gender, count_distinct(pv_users.userid)
        GROUP BY pv_users.gender
    INSERT OVERWRITE DIRECTORY '/user/data/tmp/pv_age_sum'
        SELECT pv_users.age, count_distinct(pv_users.userid)
        GROUP BY pv_users.age;
The first insert clause sends the results of the first group by to a Hive table, while the second one sends the results to a Hadoop DFS file.

36 JOIN. Supports LEFT OUTER, RIGHT OUTER and FULL OUTER joins. Can join more than 2 tables at once. It is best to put the largest table on the rightmost side of the join to get the best performance.
    INSERT OVERWRITE TABLE pv_users
    SELECT pv.*, u.gender, u.age
    FROM user u JOIN page_view pv ON (pv.userid = u.id)
    WHERE pv.date = ' ';

37 Aggregations
    INSERT OVERWRITE TABLE pv_gender_agg
    SELECT pv_users.gender, count(distinct pv_users.userid), count(*), sum(distinct pv_users.userid)
    FROM pv_users
    GROUP BY pv_users.gender;

38 Union
    INSERT OVERWRITE TABLE actions_users
    SELECT u.id, actions.date
    FROM (
        SELECT av.uid AS uid FROM action_video av WHERE av.date = ' '
        UNION ALL
        SELECT ac.uid AS uid FROM action_comment ac WHERE ac.date = ' '
    ) actions JOIN users u ON (u.id = actions.uid);

39 Running custom MapReduce
    FROM (
        FROM pv_users
        MAP pv_users.userid, pv_users.date
        USING 'map_script.py'
        AS dt, uid
        CLUSTER BY dt) map_output
    INSERT OVERWRITE TABLE pv_users_reduced
        REDUCE map_output.dt, map_output.uid
        USING 'reduce_script.py'
        AS date, count;

40 Map Example
    import sys
    import datetime

    # Read tab-separated (userid, unixtime) rows from standard input
    for line in sys.stdin:
        line = line.strip()
        userid, unixtime = line.split('\t')
        # isoweekday(): Monday is 1, Sunday is 7
        weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
        print(','.join([userid, str(weekday)]))
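
The query on slide 39 names a 'reduce_script.py' that the slides do not show. The following is a hypothetical sketch of what such a reducer could look like, assuming Hive passes rows to the script as tab-separated (dt, uid) lines and that CLUSTER BY dt delivers all rows with the same dt consecutively. The function names are illustrative, not taken from the lecture.

```python
import sys
from itertools import groupby


def reduce_counts(lines):
    """Yield (dt, count) for each run of consecutive lines sharing the same dt."""
    rows = (line.rstrip("\n").split("\t") for line in lines if line.strip())
    for dt, group in groupby(rows, key=lambda fields: fields[0]):
        yield dt, sum(1 for _ in group)


def main():
    # When used as a streaming script, read rows from stdin and
    # write tab-separated (date, count) rows to stdout.
    for dt, count in reduce_counts(sys.stdin):
        print("\t".join([dt, str(count)]))
```

To use this as a standalone script, one would call main() at the bottom of the file. Relying on groupby keeps memory use constant per key, but it only works because the input is already clustered by dt.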

41 Using UDFs in Hive
    create temporary function my_lower as 'com.example.hive.udf.Lower';
    hive> select my_lower(title), sum(freq) from titles group by my_lower(title);

42 Java UDF
    package com.example.hive.udf;

    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;

    public final class Lower extends UDF {
        // Returns the lower-cased input, or null for a null input
        public Text evaluate(final Text s) {
            if (s == null) { return null; }
            return new Text(s.toString().toLowerCase());
        }
    }

43 Pig vs Hive
                          Pig                         Hive
    Purpose               Data transformation         Ad-hoc querying
    Language              Something similar to SQL    SQL-like
    Difficulty            Medium (trial-and-error)    Low to medium
    Schemas               Yes (implicit)              Yes (explicit)
    Join (distributed)    Yes                         Yes
    Shell                 Yes                         Yes
    Streaming             Yes                         Yes
    Web interface         No                          Yes
    Partitions            No                          Yes
    UDFs                  Yes                         Yes

44 Pig vs Hive. SQL might not be the perfect language for expressing data transformation commands. Pig: mainly for data transformations and processing; unstructured data. Hive: mainly for warehousing and querying data; structured data.

45 Pig & Hive disadvantages. Slow start-up and clean-up of MapReduce jobs: it takes time for Hadoop to schedule MR jobs. Not suitable for interactive OLAP analytics, where results are expected in < 1 sec. Complex applications may require many UDFs, so Pig loses its simplicity over MapReduce.

46 Pig & Hive disadvantages. Updating data is complicated, mainly because of using HDFS: can add records; can overwrite partitions. No real-time access to data: use other means like HBase or Impala. High latency.

47 More related Hadoop projects. HBase: open-source distributed database on top of HDFS; HBase tables only use a single key; tuned for real-time access to data. Cloudera Impala: simplified, real-time queries over HDFS; bypasses job scheduling and removes everything else that makes MR slow.

48 Big Picture. Store large amounts of data in HDFS. Process raw data: Pig. Build schema: Hive. Data queries: Hive. Real-time access to data: HBase. Real-time queries: Impala.

49 Is Apache Hadoop enough? Why use Hadoop for large-scale data processing? It is becoming a de facto standard in Big Data. Collaboration among top companies instead of vendor tool lock-in: Amazon, Apache, Facebook, Yahoo!, etc. all contribute to open-source Hadoop. Tools exist for everything from setting up a Hadoop cluster in minutes and importing data from relational databases to setting up workflows of MR, Pig and Hive.

50 That's all. This week's practice session: processing data with Pig and Hive; a similar data processing exercise as in the previous 2 weeks. Next week's lecture: data processing in Spark.
