APACHE HIVE CIS 612 SUNNIE CHUNG


1 APACHE HIVE CIS 612 SUNNIE CHUNG

2 APACHE HIVE IS
- A data warehouse infrastructure built on top of Hadoop that enables data summarization and ad-hoc queries.
- Initially developed by Facebook.
- Hive stores data in the Hadoop Distributed File System (HDFS).
- Supports a SQL-like query language: HiveQL.
- Compiled HiveQL statements are broken down by the Hive service into MapReduce jobs and executed across a Hadoop cluster.

3 HOW HIVE WORKS?
- Hive structures data into well-understood database concepts such as tables, rows, columns, and partitions.
- It supports primitive types as well as associative arrays, lists, and structs.
- HQL supports DDL and DML.
- HQL has limited equality and join predicates, and has no inserts into existing tables (it can overwrite tables).
- Users can embed custom map-reduce scripts.

4 HIVE
- Data in Hive is organized into tables, which provide structure for unstructured Big Data.
- Hive works with data inside HDFS.
- Tables
  - Data: a file or group of files in HDFS.
  - Schema: metadata stored in a relational database.
  - Each table has a corresponding HDFS directory.
  - Data in a table is serialized.
- Supports primitive column types and nestable collection types: Array and Map (key-value pairs).

5 HIVE DATABASE: Data Model
- Tables
  - Analogous to tables in a relational database.
  - Each table has a corresponding HDFS directory.
  - Hive provides built-in serialization formats which exploit compression and lazy deserialization.
- Partitions
  - Each table can have one or more partitions (horizontal partitions).
  - Example: table T is in the directory /wh/t. If T is partitioned on columns ds and ctry, the data for a given ds value and ctry = US is stored in /wh/t/ds=<ds value>/ctry=US.
- Buckets
  - Data in each partition may in turn be divided into buckets based on the hash of a column in the table.
  - Each bucket is stored as a file in the partition directory.
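
The directory layout described on this slide can be sketched in a few lines of Python. This is an illustrative model only: the function names `partition_dir` and `bucket_index` are hypothetical, and the modulo on an integer key is a simplified stand-in for Hive's own column hash, not its actual implementation.

```python
from typing import List, Tuple


def partition_dir(table_dir: str, partition_values: List[Tuple[str, str]]) -> str:
    """Build the HDFS-style directory for one partition.

    Each partition column contributes one `name=value` path component,
    in the order the columns were declared.
    """
    parts = [f"{name}={value}" for name, value in partition_values]
    return "/".join([table_dir.rstrip("/")] + parts)


def bucket_index(key: int, num_buckets: int) -> int:
    """Pick a bucket for a row from its clustering column.

    Simplified stand-in for Hive's hash: the absolute value of an
    integer key is reduced modulo the bucket count.
    """
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    return abs(key) % num_buckets
```

Every row with the same clustering key lands in the same bucket file, which is what makes sampling and bucket pruning possible later in the query plan.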

6 TABLE SCHEMA EXAMPLE
CREATE TABLE page_view(
  viewtime INT,
  userid BIGINT,
  page_url STRING,
  referrer_url STRING,
  friends ARRAY<BIGINT>,
  properties MAP<STRING, STRING>,
  ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewtime) INTO 32 BUCKETS
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '1'
  COLLECTION ITEMS TERMINATED BY '2'
  MAP KEYS TERMINATED BY '3'
STORED AS SEQUENCEFILE;
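
To show how the three delimiters in the ROW FORMAT clause nest, here is a minimal Python sketch that decodes one delimited text row in the page_view column order. It is an assumption-laden illustration: the function names are hypothetical, the delimiters are passed in as parameters, and the SequenceFile container itself is ignored.

```python
from typing import Dict, List


def parse_array(text: str, item_delim: str) -> List[str]:
    """Split a collection column into its items (empty text -> empty list)."""
    return text.split(item_delim) if text else []


def parse_map(text: str, item_delim: str, key_delim: str) -> Dict[str, str]:
    """Split a map column into key/value pairs."""
    result: Dict[str, str] = {}
    for entry in parse_array(text, item_delim):
        key, _, value = entry.partition(key_delim)
        result[key] = value
    return result


def parse_row(line: str, field_delim: str, item_delim: str, key_delim: str) -> dict:
    """Decode one delimited text row using the page_view column order."""
    (viewtime, userid, page_url, referrer_url,
     friends, properties, ip) = line.split(field_delim)
    return {
        "viewtime": int(viewtime),
        "userid": int(userid),
        "page_url": page_url,
        "referrer_url": referrer_url,
        "friends": [int(f) for f in parse_array(friends, item_delim)],
        "properties": parse_map(properties, item_delim, key_delim),
        "ip": ip,
    }
```

The field delimiter separates columns, the collection delimiter separates items inside an ARRAY or MAP column, and the map-key delimiter separates each key from its value.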

7 HIVE QUERY LANGUAGE
- SQL-like language: HiveQL.
- DDL: to create tables with specific serialization formats.
- DML: load and insert, to load data from external sources and insert query results into Hive tables.
- Does not support updating and deleting rows in existing tables.
- Supports multi-table insert.
- Supports select, project, join and aggregate.
- Supports UNION ALL and sub-queries in the FROM clause.

8 HIVEQL: UDTF, UDAF
- HiveQL can be extended with custom functions:
  - User-defined functions (UDF) for column transformations.
  - User-defined table-generating functions (UDTF).
  - User-defined aggregation functions (UDAF).
- Users can embed custom map-reduce scripts written in any language using a simple row-based streaming interface.

9 WHAT HIVE DOES?
- Hive allows SQL developers to write Hive Query Language (HQL) statements that are similar to SQL statements, but with a more limited set of commands.
- It therefore allows developers to explore and structure massive amounts of data, analyze it, and turn it into business insight.
- Hive queries have very high latency because Hive is based on Hadoop.
- Hive is read-based and not appropriate for write operations.

10 HIVEQL
Running example: Status Meme
- When Facebook users update their status, the updates are logged into flat files in an NFS directory, /logs/status_updates.
- Compute daily statistics on the frequency of status updates based on gender and school.
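
The computation in this running example amounts to a join followed by two grouped counts. The Python sketch below models it in memory. The record shapes are assumptions, since the slide gives no schema: each update is a hypothetical `(userid, ds)` pair and each profile a hypothetical `(gender, school)` pair keyed by user id.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Hypothetical record shapes; the slide does not give a schema.
StatusUpdate = Tuple[int, str]   # (userid, ds)
Profile = Tuple[str, str]        # (gender, school)


def daily_counts(
    updates: Iterable[StatusUpdate],
    profiles: Dict[int, Profile],
    ds: str,
) -> Tuple[Counter, Counter]:
    """Count one day's status updates per gender and per school."""
    by_gender: Counter = Counter()
    by_school: Counter = Counter()
    for userid, update_ds in updates:
        if update_ds != ds:
            continue             # keep only the requested day
        profile = profiles.get(userid)
        if profile is None:
            continue             # inner join: skip users without a profile
        gender, school = profile
        by_gender[gender] += 1
        by_school[school] += 1
    return by_gender, by_school
```

In Hive the same logic would be expressed declaratively, and the join and the two aggregations would be compiled into map-reduce jobs rather than run in a single loop.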

11 ADVANTAGES OF HIVE
- Familiar: hundreds of unique users can simultaneously query the data using a language familiar to SQL users.
- Fast: response times are typically much faster than other types of queries on the same type of huge datasets.
- Scalable and extensible: as data variety and volume grow, more commodity machines can be added to the cluster without a corresponding reduction in performance.
- Informative: familiar JDBC and ODBC drivers allow many applications to pull Hive data for seamless reporting.
- Hive allows users to read data in arbitrary formats, using SerDes and Input/Output formats. (SerDe: the serializer/deserializer API used to move data in and out of tables.)

12 HIVE ARCHITECTURE
- External interfaces:
  - Web UI: management.
  - Hive CLI: run queries, browse tables, etc.
  - API: JDBC, ODBC.
- Metastore: system catalog which contains metadata about Hive tables.
- Driver: manages the life cycle of a HiveQL statement during compilation, optimization and execution.
- Compiler: translates a HiveQL statement into a plan which consists of a DAG of map-reduce jobs.
- Database: a namespace for tables.
- Table: metadata for a table contains the list of columns and their types, owner, storage and SerDe information. It also contains any user-supplied key and value data.
- Partition: each partition can have its own columns, SerDe and storage information.

15 HIVE ARCHITECTURE
- External interfaces: both user interfaces, such as the command line (CLI) and web UI, and APIs such as JDBC and ODBC.
- Thrift is a framework for cross-language services, where a server written in one language (like Java) can also support clients in other languages.
- Metastore is the system catalog. All other components of Hive interact with the metastore.
- The Driver manages the life cycle (statistics) of a HiveQL statement during compilation, optimization and execution.
Figure 1: Hive Architecture

16 COMMAND LINE INTERFACE
- There are several ways to interact with Hive, including some popular graphical user interfaces, but the CLI is sometimes preferable.
- The CLI allows creating tables, inspecting schemas, querying tables, etc.
- All commands and queries go to the Driver, which compiles, optimizes and executes queries, usually with MapReduce jobs.
- Hive doesn't generate MapReduce programs; it uses generic Mapper and Reducer modules.
- Hive communicates with the JobTracker to initiate the MapReduce job.
- Data files to be processed are usually in HDFS, managed by the NameNode.
- Hive uses the Hive Query Language (HQL), which is similar to SQL.

17 HIVE ARCHITECTURE: MetaStore
- The system catalog which contains metadata about the tables stored in Hive.
- This data is specified during table creation and reused every time the table is referenced in HiveQL.
- Contains the following objects:
  - Database: the namespace for tables.
  - Table: metadata for a table contains the list of columns and their types, owners, storage and SerDe information.
  - Partition: each partition can have its own columns, SerDe and storage information.
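
The three metastore objects listed above can be pictured as a small set of nested records. The Python sketch below is only a mental model: the class and field names are hypothetical and do not mirror the metastore's real schema or API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class StorageInfo:
    location: str        # directory holding the data files
    serde: str           # name of the serializer/deserializer
    file_format: str     # e.g. "TEXTFILE" or "SEQUENCEFILE"


@dataclass
class Partition:
    values: Dict[str, str]     # partition column -> value
    storage: StorageInfo       # a partition may have its own storage info
    columns: Optional[List[Tuple[str, str]]] = None   # None: same as table


@dataclass
class Table:
    name: str
    owner: str
    columns: List[Tuple[str, str]]     # (column name, type)
    storage: StorageInfo
    partitions: List[Partition] = field(default_factory=list)

    def columns_for(self, partition: Partition) -> List[Tuple[str, str]]:
        """Return the partition's own columns, or the table's if none are set."""
        return partition.columns if partition.columns is not None else self.columns


@dataclass
class Database:
    name: str                          # namespace for tables
    tables: Dict[str, Table] = field(default_factory=dict)
```

Letting a partition carry its own columns and storage information is what allows older partitions to stay readable after a table's schema or file format changes.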

18 HIVE ARCHITECTURE
Figure 2: Query plan with 3 map-reduce jobs for a multi-table insert query

19 HIVE ARCHITECTURE: Compile
- The compiler converts the string (DDL/DML/query statement) to a plan.
- The parser transforms a query string to a parse tree representation.
- The semantic analyzer transforms the parse tree to a block-based internal query representation.
- The logical plan generator converts the internal query representation to a logical plan.
- The optimizer performs multiple passes over the logical plan and rewrites it in several ways:
  - Combines multiple joins which share the join key into a single multi-way join, and hence a single map-reduce job.
  - Adds repartition operators.
  - Prunes columns early and pushes predicates closer to the table scan operators.

20 HIVE ARCHITECTURE: Compile (continued)
- The optimizer performs multiple passes over the logical plan and rewrites it in several ways:
  - Combines multiple joins which share the join key into a single multi-way join, and hence a single map-reduce job.
  - Adds repartition operators.
  - Prunes columns early and pushes predicates closer to the table scan operators.
  - In the case of partitioned tables, prunes partitions that are not needed by the query.
  - In the case of sampling queries, prunes buckets that are not needed.
- Users can also provide hints to the optimizer to:
  - Add partial aggregation operators to handle large-cardinality grouped aggregations.
  - Add repartition operators to handle skew in grouped aggregations.
  - Perform joins in the map phase instead of the reduce phase.
- The physical plan generator converts the logical plan into a physical plan, consisting of a directed acyclic graph (DAG) of map-reduce jobs.
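
Partition pruning, one of the rewrites listed above, is easy to picture as a filter over partition metadata. The following Python sketch is a simplified illustration under that assumption; `prune_partitions` is a hypothetical name, not a Hive API.

```python
from typing import Callable, Dict, List

PartitionSpec = Dict[str, str]   # partition column -> value


def prune_partitions(
    partitions: List[PartitionSpec],
    predicate: Callable[[PartitionSpec], bool],
) -> List[PartitionSpec]:
    """Keep only the partitions whose column values satisfy the predicate.

    Because the predicate looks only at partition columns, it can be
    evaluated against metadata before any data file is opened.
    """
    return [p for p in partitions if predicate(p)]
```

A query that filters on a partition column then scans only the directories of the surviving partitions, which is why partitioning on commonly filtered columns such as a date pays off.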

21 INPUT DATA
- Hive has no row-level insert, update or delete operations. The only way to put data into a table is to use one of the load operations.
- File formats supported in Hive include TEXTFILE, SEQUENCEFILE, ORC and RCFILE.
- Example: NASDAQ_daily_prices_B.csv, a log file of NASDAQ stock records.
exchange,stock_symbol,date,stock_price_open,stock_price_high,stock_price_low,stock_price_close,stock_volume,stock_price_adj_close
NASDAQ,BBND, ,2.92,2.98,2.86,2.96,483800,2.96
NASDAQ,BBND, ,2.85,2.94,2.79,2.93,884000,2.93
NASDAQ,BBND, ,2.83,2.88,2.78,2.83, ,

22 CREATE TABLE TO HOLD THE DATA:
hive> CREATE TABLE IF NOT EXISTS stocks (
        exchange STRING,
        symbol STRING,
        ymd STRING,
        price_open FLOAT,
        price_high FLOAT,
        price_low FLOAT,
        price_close FLOAT,
        volume INT,
        price_adj_close FLOAT)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

23 HIVE QUERY LANGUAGE: HIVEQL
Create a database:
hive> CREATE DATABASE financials;
or
hive> CREATE DATABASE IF NOT EXISTS financials;
Describe database:
hive> DESCRIBE DATABASE financials;
OK
Financials hdfs://localhost:54310/user/hive/warehouse/financials.db
Use database:
hive> USE financials;
Drop database:
hive> DROP DATABASE IF EXISTS financials;

24 HOW TO LOAD DATA INTO HIVE TABLE
- Use LOAD DATA to import data into a Hive table:
hive> LOAD DATA LOCAL INPATH '/home/sunny/employeedetails.txt' INTO TABLE Employee;
- The LOCAL keyword, as above, loads the data from the local file system.
- Add the keyword OVERWRITE to replace the data already in the table.
- Data can also be inserted into a table with a SELECT statement, for example:
INSERT OVERWRITE TABLE <table_name> SELECT * FROM Employee;
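
The difference between loading with and without OVERWRITE can be modeled as a table directory that either accumulates files or is emptied first. This Python sketch assumes that OVERWRITE replaces the table's current contents; the function `load_data` and the dictionary standing in for HDFS are hypothetical.

```python
from typing import Dict, List


def load_data(table_files: Dict[str, List[str]], table: str, source_file: str,
              overwrite: bool = False) -> None:
    """Model LOAD DATA: attach a file to a table's storage directory.

    With overwrite=True the table's existing files are discarded first;
    otherwise the new file is added alongside them.
    """
    files = table_files.setdefault(table, [])
    if overwrite:
        files.clear()
    files.append(source_file)
```

Note that no row is parsed or validated at load time in this model: loading only moves files, and the schema is applied when the table is read.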

25 MANAGING TABLES
Operation                Command Syntax
See current tables       hive> SHOW TABLES;
Check the table schema   hive> DESCRIBE <table_name>;
Change the table name    hive> ALTER TABLE <table_name> RENAME TO mytab;
Add a column             hive> ALTER TABLE <table_name> ADD COLUMNS (MyID STRING);
Drop a partition         hive> ALTER TABLE <table_name> DROP PARTITION (Age>70);

26 HIVE SUPPORTS THE FOLLOWING:
- WHERE clause
- UNION ALL and DISTINCT
- GROUP BY and HAVING
- LIMIT clause
- Sub-queries, but only in the FROM clause
- JOINs, ORDER BY, SORT BY

27 OUTPUT DATA
- Output data produced by Hive is structured and is typically stored in tables whose metadata is kept in a relational database.
- On a cluster, MySQL or a similar relational database is required.
- The result tables can then be manipulated using HiveQL, much as SQL is used with a relational database.

28 LOAD FILE INTO TABLE:
hive> LOAD DATA LOCAL INPATH '/Users/nqt289/Desktop/NASDAQ_daily_prices_B.csv'
    > OVERWRITE INTO TABLE stocks;
Copying data from file:/users/nqt289/desktop/nasdaq_daily_prices_b.csv
Copying file: file:/users/nqt289/desktop/nasdaq_daily_prices_b.csv
Loading data to table mydb.stocks
Deleted hdfs://localhost:54310/users/nqt289/desktop/nasdaq_daily_prices_b.csv
OK
Time taken: seconds

29 EXAMPLE OF OUTPUT OF HIVE
hive> SELECT * FROM STOCKS WHERE price_open='2.92';
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_ _0003, Tracking URL =
Kill Command = /Users/nqt289/hadoop /bin/../bin/hadoop job -Dmapred.job.tracker=localhost: kill job_ _0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers:
:39:20,577 Stage-1 map = 0%, reduce = 0%
:39:23,597 Stage-1 map = 100%, reduce = 0%
:39:26,625 Stage-1 map = 100%, reduce = 100%
Ended Job = job_ _0003
MapReduce Jobs Launched:
Job 0: Map: 1 HDFS Read: HDFS Write: 5166 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
NASDAQ BBND
NASDAQ BTFG
NASDAQ BJCT
NASDAQ BJCT
Time taken: seconds
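
The log line about reduce tasks being set to 0 reflects that a plain selection needs no grouping, so the work is done entirely by mappers. The Python sketch below models such a map-only job; the name `map_only_filter` and the row dictionaries are hypothetical.

```python
from typing import Callable, Iterable, Iterator, List


def map_only_filter(splits: Iterable[List[dict]],
                    predicate: Callable[[dict], bool]) -> Iterator[dict]:
    """Run a selection as independent map tasks, one per input split.

    Each task emits the rows it keeps directly; there is no shuffle
    or reduce step because no rows need to be grouped or combined.
    """
    for split in splits:        # each split would be handled by one mapper
        for row in split:
            if predicate(row):
                yield row
```

A query with GROUP BY or a join would instead need rows with the same key brought together, which is what introduces the reduce phase.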

30 DEFINITION: ACID
- Atomicity: atomicity requires that each transaction be "all or nothing": if one part of the transaction fails, the entire transaction fails, and the database state is left unchanged. An atomic system must guarantee atomicity in each and every situation, including power failures, errors, and crashes. To the outside world, a committed transaction appears (by its effects on the database) to be indivisible ("atomic"), and an aborted transaction does not happen.
- Consistency: the consistency property ensures that any transaction will bring the database from one valid state to another. Any data written to the database must be valid according to all defined rules, including constraints, cascades, triggers, and any combination thereof. This does not guarantee correctness of the transaction in all ways the application programmer might have wanted (that is the responsibility of application-level code), but merely that programming errors cannot result in the violation of any defined rules.
- Isolation: the isolation property ensures that the concurrent execution of transactions results in a system state that would be obtained if the transactions were executed serially, i.e. one after the other. Providing isolation is the main goal of concurrency control. Depending on the concurrency control method, the effects of an incomplete transaction might not even be visible to another transaction.
- Durability: durability means that once a transaction has been committed, it will remain so, even in the event of power loss, crashes, or errors. In a relational database, for instance, once a group of SQL statements executes, the results need to be stored permanently (even if the database crashes immediately thereafter). To defend against power loss, transactions (or their effects) must be recorded in non-volatile memory.
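
The "all or nothing" requirement of atomicity can be illustrated with a tiny in-memory model. This Python sketch is only an illustration of the definition, not of how Hive or any database implements transactions; `apply_atomically` is a hypothetical name.

```python
import copy
from typing import Callable, Dict, List


def apply_atomically(state: Dict[str, int],
                     operations: List[Callable[[Dict[str, int]], None]]) -> Dict[str, int]:
    """Apply all operations or none of them.

    The operations run against a private copy; the copy replaces the
    original state only if every operation succeeds.
    """
    working = copy.deepcopy(state)
    try:
        for op in operations:
            op(working)
    except Exception:
        return state       # abort: the original state is untouched
    return working         # commit: all changes become visible together
```

Working on a private copy also hints at isolation: nothing outside the function can observe the intermediate, half-applied state.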

31 ACID IN HIVE
ACID for Hive is added manually, with the following use cases in mind:
- A set of inserts and updates is processed once an hour.
- A set of deletes is processed once a day.
- A log of transactions is exported from an RDBMS to reflect new data once an hour.
The delay is not an important issue here given the purpose of Hive; also, the number of transactions committed each time is huge (100 to 500 thousand rows).

32 HIVE ACHIEVEMENTS & FUTURE PLANS
- A first step toward providing a warehousing layer for Hadoop (a web-based map-reduce data processing system).
- Accepts only a subset of SQL; work is under way to subsume SQL syntax.
- Currently works with a rule-based optimizer; plans to build a cost-based optimizer.
- Enhancing the JDBC and ODBC drivers to support interaction with commercial BI tools.
- Working on making it perform better.

33 PROJECTS & TOOLS ON HADOOP
HBase, Hive, Pig, Jaql, ZooKeeper, Avro, UIMA, Sqoop

34 HIVE TUTORIAL

