Apache Hive 3: A new horizon


Evelyn Berry

1 Apache Hive 3: A new horizon

6 DAS 1.0
Smart query log search
  Query log reports: most expensive queries, long-running queries, hot files/tables, space usage by table, etc.
  Query log filter/search: tables not using statistics, queries not optimized by CBO, etc.
Storage optimizations
  Storage heatmap and data layout optimization suggestions
  Storage-level optimization recommendations
  Batch operations
Query optimizations
  Query-level optimization recommendations
  Detailed query-level report
  Admin alerts on expensive queries
Quality of life changes
  Query editor auto-complete
  Data browser (top 20 rows of sample data)
  Specify output destination (S3, CSV, etc.)
  Query kill

7 Hortonworks Data Analytics Studio


13 JDBC Table mapping example

-- Table in Postgres
CREATE TABLE postgres_table (
  id INT,
  name varchar(20)
);

-- Mapped table in Hive
CREATE EXTERNAL TABLE hive_table (
  id INT,
  name varchar(20)
)
STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
  "hive.sql.database.type" = "POSTGRES",
  "hive.sql.jdbc.driver" = "org.postgresql.Driver",
  "hive.sql.jdbc.url" = "jdbc:postgresql://...",
  "hive.sql.dbcp.username" = "jdbctest",
  "hive.sql.dbcp.password" = "",
  "hive.sql.query" = "select * from postgres_table",
  "hive.sql.column.mapping" = "id=id, name=name"
);

14 Instantly analyze Kafka data with millisecond latency
[Diagram: HiveServer2, a Broker, and three Realtime Nodes]

15 Send promotional email to all customers from CA who purchased more than $1000 worth of merchandise today.

create external table sales (
  `__time` timestamp, quantity int, sales_price double,
  customer_id bigint, item_id int, store_id int)
stored by 'org.apache.hadoop.hive.druid.DruidStorageHandler'
tblproperties (
  "kafka.bootstrap.servers" = "localhost:9092",
  "kafka.topic" = "sales-topic",
  "druid.kafka.ingestion.maxRowsInMemory" = "5");

create table customers (
  customer_id bigint, first_name string, last_name string,
  email string, state string);

select email
from customers join sales using customer_id
where to_date(sales.`__time`) = date
  and quantity * sales_price > 1000
  and customers.state = 'CA';

Bloom filter pushdown greatly reduces data transfer
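The Bloom filter pushdown named on this slide can be illustrated outside Hive. The following is a minimal Python sketch, not Hive's or Druid's implementation: the `BloomFilter` class, its sizes, and the sample rows are all assumptions made up for illustration. It builds a filter from the customer IDs that pass the state predicate and uses it to discard sales rows before they would be shipped to the join.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


# Hypothetical sample data: (customer_id, state) and (customer_id, amount).
customers = [(1, "CA"), (2, "NY"), (3, "CA"), (4, "TX")]
sales = [(1, 1200.0), (2, 5000.0), (3, 40.0), (4, 2500.0), (1, 10.0)]

# Build side: only customers that pass the state predicate go into the filter.
ca_filter = BloomFilter()
for customer_id, state in customers:
    if state == "CA":
        ca_filter.add(customer_id)

# Probe side: rows whose key cannot match are dropped before the join.
candidate_sales = [row for row in sales if ca_filter.might_contain(row[0])]

# The exact join still runs afterwards, so false positives are harmless.
ca_ids = {cid for cid, state in customers if state == "CA"}
joined = [row for row in candidate_sales if row[0] in ca_ids]
print(joined)
```

The filter is small compared with the list of keys it summarizes, which is why sending it to the scan side can cut the number of rows transferred.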

16 Ad-hoc / Ingest / Transform
[Diagram: HiveServer2, a Query Coordinator, and three LLAP Nodes]

17 I want a moving average over a sliding window from a stock ticker Kafka stream.

create external table tickers (
  `__time` timestamp, stock_id bigint, stock_sym varchar(4),
  price decimal(10,2), exchange_id int)
stored by 'org.apache.hadoop.hive.kafka.KafkaStorageHandler'
tblproperties (
  "kafka.topic" = "stock-topic",
  "kafka.bootstrap.servers" = "localhost:9092",
  "kafka.serde.class" = "org.apache.hadoop.hive.serde2.JsonSerDe");

create external table moving_avg (
  `__time` timestamp, stock_id bigint, avg_price decimal(10,2))
stored by 'org.apache.hadoop.hive.kafka.KafkaStorageHandler'
tblproperties (
  "kafka.topic" = "averages-topic",
  "kafka.bootstrap.servers" = "localhost:9092",
  "kafka.serde.class" = "org.apache.hadoop.hive.serde2.JsonSerDe");

insert into table moving_avg
select current_timestamp, stock_id, avg(price)
from tickers
where `__timestamp` > to_unix_timestamp(current_timestamp - interval '5' minutes) * 1000
group by stock_id;

Transformation over stream in real time
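To make the window logic of this slide concrete, here is a small Python sketch of the same computation on in-memory data. It is only an illustration under assumptions: the `moving_average` function, the tuple layout, and the sample ticks are hypothetical and are not part of Hive or Kafka.

```python
from collections import defaultdict

WINDOW_MS = 5 * 60 * 1000  # five minutes, matching the window in the slide's query


def moving_average(ticks, now_ms, window_ms=WINDOW_MS):
    """Average price per stock_id over ticks newer than now_ms - window_ms."""
    cutoff = now_ms - window_ms
    totals = defaultdict(lambda: [0.0, 0])  # stock_id -> [sum, count]
    for ts_ms, stock_id, price in ticks:
        if ts_ms > cutoff:  # strict comparison, as in the WHERE clause
            totals[stock_id][0] += price
            totals[stock_id][1] += 1
    return {sid: round(s / n, 2) for sid, (s, n) in totals.items()}


# Hypothetical ticks: (timestamp in ms, stock_id, price).
now = 1_000_000_000
ticks = [
    (now - 6 * 60 * 1000, 1, 100.0),  # older than five minutes: ignored
    (now - 4 * 60 * 1000, 1, 10.0),
    (now - 1 * 60 * 1000, 1, 20.0),
    (now - 2 * 60 * 1000, 2, 7.5),
]
print(moving_average(ticks, now))
```

Each run of the insert statement corresponds to one call of this function with the current time, so rerunning it on a schedule slides the window forward.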

20 Hive managed tables
  ACID on by default
  No SBA (storage-based authorization), Ranger auth only
  Statistics and other optimizations apply
  Spark access via HiveWarehouseConnector
External tables
  No ACID, Text by default
  SBA possible
  Some optimizations unavailable
  Spark direct file access
Note: SBA in HDP 3 requires ACLs in HDFS. ACLs are turned on by default in HDP 3.

21 V1:
CREATE TABLE hello_acid (load_date date, key int, value int)
CLUSTERED BY (key) INTO 3 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

V2:
CREATE TABLE hello_acid_v2 (load_date date, key int, value int);

23 SELECT DISTINCT dest, origin FROM flights;

SELECT origin, count(*) FROM flights GROUP BY origin HAVING origin = 'OAK';

CREATE MATERIALIZED VIEW flight_agg AS
SELECT dest, origin, count(*) FROM flights GROUP BY dest, origin;
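Why a single aggregate can serve both queries on this slide is easy to check on toy data. The sketch below uses Python's built-in sqlite3 module as a stand-in, not Hive: `flight_agg` is an ordinary table here, the `cnt` column alias and the sample flights are assumptions, and the rewritten queries are written by hand rather than produced by an optimizer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE flights (dest TEXT, origin TEXT);
    INSERT INTO flights VALUES
        ('SFO', 'OAK'), ('SFO', 'OAK'), ('LAX', 'OAK'), ('OAK', 'JFK');
    -- Plain table standing in for the materialized view.
    CREATE TABLE flight_agg AS
        SELECT dest, origin, count(*) AS cnt
        FROM flights GROUP BY dest, origin;
    """
)


def rows(sql):
    """Run a query and return its rows in a stable order."""
    return sorted(conn.execute(sql).fetchall())


# Query 1: the distinct pairs are exactly the group keys of the aggregate.
distinct_base = rows("SELECT DISTINCT dest, origin FROM flights")
distinct_mv = rows("SELECT dest, origin FROM flight_agg")

# Query 2: counts per origin are sums of the finer-grained counts.
count_base = rows(
    "SELECT origin, count(*) FROM flights GROUP BY origin HAVING origin = 'OAK'"
)
count_mv = rows(
    "SELECT origin, sum(cnt) FROM flight_agg GROUP BY origin HAVING origin = 'OAK'"
)

print(distinct_base == distinct_mv, count_base == count_mv)
```

Both comparisons hold because the view groups by a superset of the columns each query needs, so the answers can be rolled up from the smaller aggregate instead of rescanning flights.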

25 Example:
CREATE TABLE Persons (
  ID Int NOT NULL,
  Name String NOT NULL,
  Age Int,
  Creator String DEFAULT CURRENT_USER(),
  CreateDate Date DEFAULT CURRENT_DATE(),
  PRIMARY KEY (ID) DISABLE NOVALIDATE
);

CREATE TABLE BusinessUnit (
  ID Int NOT NULL,
  Head Int NOT NULL,
  Creator String DEFAULT CURRENT_USER(),
  CreateDate Date DEFAULT CURRENT_DATE(),
  PRIMARY KEY (ID) DISABLE NOVALIDATE,
  CONSTRAINT fk FOREIGN KEY (Head) REFERENCES Persons(ID) DISABLE NOVALIDATE
);

26 CREATE TABLE AIRLINES_V2 (
  ID BIGINT DEFAULT SURROGATE_KEY(),
  CODE STRING,
  DESCRIPTION STRING,
  PRIMARY KEY (ID) DISABLE NOVALIDATE);

INSERT INTO AIRLINES_V2 (CODE, DESCRIPTION) SELECT * FROM AIRLINES;

ALTER TABLE FLIGHTS ADD COLUMNS (carrier_sk BIGINT);

MERGE INTO FLIGHTS f USING AIRLINES_V2 a
ON f.uniquecarrier = a.code
WHEN MATCHED THEN UPDATE SET carrier_sk = a.id;

27 Symptoms Solution

30 ELAPSED_TIME
EXECUTION_TIME
TOTAL_TASKS
HDFS_BYTES_READ, HDFS_BYTES_WRITTEN
CREATED_FILES
CREATED_DYNAMIC_PARTITIONS

CREATE RESOURCE PLAN guardrail;
CREATE TRIGGER guardrail.long_running WHEN EXECUTION_TIME > 2000 DO KILL;
ALTER TRIGGER guardrail.long_running ADD TO UNMANAGED;
ALTER RESOURCE PLAN guardrail ENABLE ACTIVATE;
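The trigger on this slide pairs a counter, a threshold, and an action. As a rough illustration of that rule shape only, the Python sketch below checks a set of counters against such rules. The `Trigger` class and `fired_actions` function are hypothetical names, and this does not describe how Hive evaluates triggers internally.

```python
from dataclasses import dataclass


@dataclass
class Trigger:
    name: str
    counter: str    # e.g. "EXECUTION_TIME"
    threshold: int
    action: str     # e.g. "KILL"


def fired_actions(triggers, counters):
    """Return (trigger name, action) for each trigger whose counter exceeds its threshold."""
    return [
        (t.name, t.action)
        for t in triggers
        if counters.get(t.counter, 0) > t.threshold
    ]


# Mirrors the slide's rule: WHEN EXECUTION_TIME > 2000 DO KILL.
guardrail = [Trigger("long_running", "EXECUTION_TIME", 2000, "KILL")]

short_query = fired_actions(guardrail, {"EXECUTION_TIME": 1500, "TOTAL_TASKS": 40})
long_query = fired_actions(guardrail, {"EXECUTION_TIME": 2500, "TOTAL_TASKS": 40})
print(short_query, long_query)
```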

31 CREATE RESOURCE PLAN daytime;
CREATE POOL daytime.bi WITH ALLOC_FRACTION=0.8, QUERY_PARALLELISM=5;
CREATE POOL daytime.etl WITH ALLOC_FRACTION=0.2, QUERY_PARALLELISM=20;
CREATE RULE downgrade IN daytime WHEN total_runtime > 3000 THEN MOVE etl;
ADD RULE downgrade TO bi;
CREATE APPLICATION MAPPING tableau IN daytime TO bi;
ALTER PLAN daytime SET DEFAULT POOL = etl;
APPLY PLAN daytime;

33 Without cache With cache

39 External table/ Direct Hive Warehouse Connector Spark

42 Features: Spark access to ACID & column security

43 Features: Spark access to Ranger tables

46 Connector WRITE API
  hive.executeUpdate(sql : String) : Bool
  df.write.format(HIVE_WAREHOUSE_CONNECTOR)
  df.writeStream.format(STREAM_TO_STREAM)

47 a) hive.executeUpdate("INSERT INTO s SELECT * FROM t")

48 df.select("ws_sold_time_sk", "ws_ship_date_sk")
  .filter("ws_sold_time_sk > 80000")
  .write.format(HIVE_WAREHOUSE_CONNECTOR)
  .option("table", "my_acid_table")
  .save()

49 b) df.write.format(HIVE_WAREHOUSE_CONNECTOR).save()

50 val df = spark.readStream.format("socket")....load()
df.writeStream.format(STREAM_TO_STREAM).option("table", "my_acid_table").start()

51 c) df.writeStream.format(STREAM_TO_STREAM).start()

55 SELECT * FROM
  (SELECT AVG(ss_list_price) B1_LP, COUNT(ss_list_price) B1_CNT, COUNT(DISTINCT ss_list_price) B1_CNTD
   FROM store_sales
   WHERE ss_quantity BETWEEN 0 AND 5
     AND (ss_list_price BETWEEN 11 AND ...
          OR ss_coupon_amt BETWEEN 460 AND ...
          OR ss_wholesale_cost BETWEEN 14 AND 14+20)) B1,
  (SELECT AVG(ss_list_price) B2_LP, COUNT(ss_list_price) B2_CNT, COUNT(DISTINCT ss_list_price) B2_CNTD
   FROM store_sales
   WHERE ss_quantity BETWEEN 6 AND 10
     AND (ss_list_price BETWEEN 91 AND ...
          OR ss_coupon_amt BETWEEN 1430 AND ...
          OR ss_wholesale_cost BETWEEN 32 AND 32+20)) B2,
  ...
LIMIT 100;

[Diagram: store_sales scan with the combined OR'ed B1-B6 filters, feeding B1 Filter, B2 Filter, B3 Filter, B4 Filter, B5 Filter, ...]
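The idea in the diagram, one scan of store_sales guarded by the OR of all bucket filters, can be sketched in a few lines of Python. This is an illustration under assumptions, not Hive's optimizer: the `shared_scan` function, the sample rows, and the bucket bounds are hypothetical stand-ins, since the slide's own bounds are partly elided.

```python
# Hypothetical rows: (ss_quantity, ss_list_price).
store_sales = [(1, 12.0), (3, 15.0), (7, 95.0), (9, 99.0), (50, 12.0)]

# Hypothetical per-bucket predicates standing in for the B1 and B2 WHERE clauses.
buckets = {
    "B1": lambda q, p: 0 <= q <= 5 and 11 <= p <= 21,
    "B2": lambda q, p: 6 <= q <= 10 and 91 <= p <= 101,
}


def shared_scan(rows, predicates):
    """Scan rows once and compute (AVG, COUNT) of list price per bucket."""
    acc = {name: [0.0, 0] for name in predicates}  # name -> [sum, count]
    for q, p in rows:
        # Combined OR'ed filter: skip rows that no bucket wants.
        if not any(pred(q, p) for pred in predicates.values()):
            continue
        # Per-bucket filters route the surviving row to its aggregates.
        for name, pred in predicates.items():
            if pred(q, p):
                acc[name][0] += p
                acc[name][1] += 1
    return {n: (s / c if c else None, c) for n, (s, c) in acc.items()}


result = shared_scan(store_sales, buckets)
print(result)
```

Without sharing, each bucket subquery would read store_sales separately; with it, the table is read once and the per-bucket filters only see rows that at least one bucket needs.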

56 SELECT ...
FROM sales JOIN time ON sales.time_id = time.time_id
WHERE time.year = 2014 AND time.quarter IN ('Q1', 'Q2')

59 [Architecture diagram]
DATA PLANE SERVICES: Cluster Lifecycle Manager, Organizational Services, Data Analytics Studio (DAS)
SHARED SERVICES and COMPUTE CLUSTER components: Ranger, API Server, Tiller, Atlas, Metastore, DAS Web Service, Hive Server, Query Coordinators, RDBMS, Registry, Blobstore, Indexer, Query Executors, Ingress Controller or Load Balancer
Legend: Long-running Kubernetes cluster; Ephemeral Kubernetes cluster; Internal Service Endpoint for ReplicaSet or StatefulSet; Inter-cluster communication; Intra-cluster communication
