Interactive Query With Apache Hive


1 Interactive Query With Apache Hive
Ajay Singh, Dec 4, 2014

2 Agenda
- HDP 2.2
- Apache Hive & Stinger Initiative
- Stinger.Next
- Putting It Together
- Q&A

3 HDP 2.2 Generally Available
Hortonworks Data Platform 2.2. YARN is the architectural center of HDP.
[Diagram: HDP 2.2 architecture]
- Governance (data workflow, lifecycle & governance): Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS
- Batch, interactive & real-time data access on YARN, the Data Operating System (cluster resource management): Script (Pig, on Tez), SQL (Hive, on Tez), Java/Scala (Cascading, on Tez), NoSQL (HBase and Accumulo, on Slider), Stream (Storm, on Slider), In-Memory (Spark), Search (Solr), Others (ISV engines)
- Storage: HDFS (Hadoop Distributed File System)
- Security (authentication, authorization, accounting, data protection): Storage: HDFS; Resources: YARN; Access: Hive; Pipeline: Falcon; Cluster: Knox; Cluster: Ranger
- Operations: provision, manage & monitor with Ambari and Zookeeper; scheduling with Oozie
- Deployment choice: Linux, Windows, on-premises, cloud
Key points:
- Enables batch, interactive and real-time workloads
- Provides comprehensive enterprise capabilities
- The widest range of deployment options
- Delivered completely in the OPEN

4 HDP IS Apache Hadoop
There is ONE Enterprise Hadoop: everything else is a vendor derivation.
Releases: HDP 2.2 (October 2014), HDP 2.1 (April 2014), HDP 2.0 (October)
Hortonworks Data Platform 2.2 components:
- Data Management: Hadoop & YARN
- Data Access: Pig, Hive & HCatalog, HBase, Phoenix, Accumulo, Storm, Spark, Solr, Tez, Slider
- Governance & Integration: Falcon, Kafka, Sqoop, Flume
- Operations: Ambari, Oozie, Zookeeper
- Security: Knox, Ranger
* Version numbers are targets and subject to change at time of general availability in accordance with the ASF release process.

5 Complete List of New Features in HDP 2.2
Apache Hadoop YARN:
- Slide existing services onto YARN through Slider
- GA release of HBase, Accumulo, and Storm on YARN
- Support long running services: handling of logs, containers not killed when AM dies, secure token renewal
- YARN Labels for tagging nodes for specific workloads
- Support for CPU Scheduling and CPU Resource Isolation through CGroups
Apache Hadoop HDFS:
- Heterogeneous storage: Support for archival
- Rolling Upgrade (this item applies to the entire HDP Stack: YARN, Hive, HBase, everything; comprehensive Rolling Upgrade is now supported across the HDP Stack)
- Multi-NIC Support
- Heterogeneous storage: Support memory as a storage tier (TP)
- HDFS Transparent Data Encryption (TP)
Apache Hive, Apache Pig, and Apache Tez:
- Hive Cost Based Optimizer: Function Pushdown & Join re-ordering support for other join types: star & bushy
- Hive SQL Enhancements including ACID Support (Insert, Update, Delete) and Temporary Tables
- Metadata-only queries return instantly
- Pig on Tez
- Including DataFu for use with Pig
- Vectorized shuffle
- Tez Debug Tooling & UI
Hue:
- Support for HiveServer 2
- Support for Resource Manager HA
Apache HBase, Apache Phoenix, & Apache Accumulo:
- HBase & Accumulo on YARN via Slider
- HBase HA: replicas update in real-time; fully supports region split/merge; Scan API now supports standby RegionServers
- HBase Block cache compression
- HBase optimizations for low latency
- Phoenix Robust Secondary Indexes
- Performance enhancements for bulk import into Phoenix
- Hive over HBase Snapshots
- Hive Connector to Accumulo
- HBase & Accumulo wire-level encryption
- Accumulo multi-datacenter replication
Apache Storm:
- Storm-on-YARN via Slider
- Ingest & notification for JMS (IBM MQ not supported)
- Kafka bolt for Storm supports sophisticated chaining of topologies through Kafka
- Kerberos support
- Hive update support: Streaming Ingest
- Connector improvements for HBase and HDFS
- Deliver Kafka as a companion component
- Kafka install, start/stop via Ambari
- Security Authorization Integration with Ranger
Apache Slider:
- Allow on-demand create and run different versions of heterogeneous applications
- Allow users to configure different application instances differently
- Manage operational lifecycle of application instances
- Expand / shrink application instances
- Provide application registry for publish and discovery
Apache Spark:
- Refreshed Tech Preview to Spark (available now)
- ORC File support & Hive 0.13 integration
- Planned for GA of Spark: operations integration via YARN ATS and Ambari; security: authentication
Apache Solr:
- Added Banana, a rich and flexible UI for visualizing time series data indexed in Solr
Cascading:
- Cascading 3.0 on Tez distributed with HDP (coming soon)
Apache Falcon:
- Authentication Integration
- Lineage now GA (it's been a tech preview feature)
- Improve UI for pipeline management & editing: list, detail, and create new (from existing elements)
- Replicate to Cloud: Azure & S3
Apache Sqoop, Apache Flume & Apache Oozie:
- Sqoop import support for Hive types via HCatalog
- Secure Windows cluster support: Sqoop, Flume, Oozie
- Flume streaming support: sink to HCat on secure cluster
- Oozie HA now supports secure clusters
- Oozie Rolling Upgrade
- Operational improvements for Oozie to better support Falcon
- Capture workflow job logs in HDFS
- Don't start new workflows for re-run
- Allow job property updates on running jobs
Apache Knox & Apache Ranger (Argus) & HDP Security:
- Apache Ranger: support authorization and auditing for Storm and Knox
- Introducing REST APIs for managing policies in Apache Ranger
- Apache Ranger: support native grant/revoke permissions in Hive and HBase
- Apache Ranger: support Oracle DB and storing of audit logs in HDFS
- Apache Ranger to run on Windows environment
- Apache Knox to protect YARN RM
- Apache Knox support for HDFS HA
- Apache Ambari install, start/stop of Knox
Apache Ambari:
- Support for HDP 2.2 Stack, including support for Kafka, Knox and Slider
- Enhancements to Ambari Web configuration management including: versioning, history and revert, setting final properties and downloading client configurations
- Launch and monitor HDFS rebalance
- Perform Capacity Scheduler queue refresh
- Configure High Availability for ResourceManager
- Ambari Administration framework for managing user and group access to Ambari
- Ambari Views development framework for customizing the Ambari Web user experience
- Ambari Stacks for extending Ambari to bring custom Services under Ambari management
- Ambari Blueprints for automating cluster deployments
- Performance improvements and enterprise usability guardrails

6 Just How Many New Features are in HDP 2.2?
88. Astonishing amount of innovation in the OPEN Apache Community. HDP is Apache Hadoop.
[The slide repeats the feature list from slide 5 as a backdrop.]

7 Apache Hive & Stinger Initiative

8 Hive: Single tool for all SQL use cases
Use cases served through Hive SQL:
- Interactive Analytics
- Batch Reports / Deep Analytics
- ETL / ELT
Data sources:
- OLTP, ERP, CRM Systems
- Unstructured documents, emails
- Server logs
- Sentiment, Web Data
- Sensor, Machine Data
- Geolocation
- Clickstream

9 Hive Scales To Any Workload
- The original developers of Hive.
- More data than existing RDBMS could handle.
- 100+ PB of data under management.
- 15+ TB of data loaded daily.
- 60,000+ Hive queries per day.
- More than 1,000 users per day.

10 Hive Join Strategies
Shuffle Join
  Approach: Join keys are shuffled using map/reduce and joins are performed reduce side.
  Pros: Works regardless of data size or layout.
  Cons: Most resource-intensive and slowest join type.
Broadcast Join
  Approach: Small tables are loaded into memory in all nodes; the mapper scans through the large table and joins.
  Pros: Very fast, single scan through largest table.
  Cons: All but one table must be small enough to fit in RAM.
Sort-Merge-Bucket Join
  Approach: Mappers take advantage of colocation of keys to do efficient joins.
  Pros: Very fast for tables of any size.
  Cons: Data must be bucketed ahead of time.
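The difference between the first two strategies can be sketched in a few lines of Python. This is an illustration of the general technique only, not Hive's implementation; the function names, the in-memory row dictionaries, and the `num_reducers` parameter are all assumptions made for the example.

```python
from collections import defaultdict

def broadcast_join(large, small, key):
    """Map-side join: hash the small table once, then stream the large table."""
    lookup = defaultdict(list)
    for row in small:
        lookup[row[key]].append(row)
    for row in large:                      # single scan of the large table
        for match in lookup.get(row[key], []):
            yield {**match, **row}

def shuffle_join(left, right, key, num_reducers=3):
    """Reduce-side join: route both inputs by key hash, then join per partition."""
    partitions = [([], []) for _ in range(num_reducers)]
    for row in left:
        partitions[hash(row[key]) % num_reducers][0].append(row)
    for row in right:
        partitions[hash(row[key]) % num_reducers][1].append(row)
    for left_part, right_part in partitions:
        # Rows with equal keys always land in the same partition.
        yield from broadcast_join(left_part, right_part, key)
```

Both functions return the same rows; the shuffle version pays for moving every row of both inputs, which is why the slide calls it the most resource-intensive option.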

11 HDP 2.1 Stinger Initiative
Stinger Initiative: DELIVERED. Next generation SQL based interactive query in Hadoop.
- Speed: Hive query performance has improved by 100X to allow for interactive query times (seconds)
- Scale: The only SQL interface to Hadoop designed for queries that scale from TB to PB
- SQL: Support the broadest range of SQL semantics for analytic applications running against Hadoop
[Diagram: Business Analytics and Custom Apps issue SQL to Apache Hive, which runs on Apache MapReduce or Apache Tez over Apache YARN and HDFS (Hadoop Distributed File System), nodes 1 to N. HDP 2.1 areas: Governance & Integration, Data Access, Data Management, Security, Operations.]
Dramatically faster queries speed time to insight: Hive 10 took 100s to 1000s of seconds; Hive 13 takes seconds.
An Open Community at its finest: Apache Hive Contribution
- 1,672 Jira Tickets Closed
- 145 Developers
- 44 Companies
- 360,000 Lines Of Code Added (2.5x)
- 13 Months

12 Stinger Initiative - Key Innovations
Execution Engine (Tez) + File Format (ORCFile) + Query Planner (CBO) = 100X

13 Tez (Speed)
What is it?
- A data processing framework as an alternative to MapReduce
Who else is involved?
- Hortonworks, Facebook, Twitter, Yahoo, Microsoft
Why does it matter?
- Widens the platform for Hadoop use cases
- Crucial to improving the performance of low-latency applications
- Core to the Stinger initiative
- Evidence of Hortonworks leading the community in the evolution of Enterprise Hadoop

14 Comparing: Hive/MR vs. Hive/Tez
    SELECT a.state, COUNT(*), AVG(c.price)
    FROM a
    JOIN b ON (a.id = b.id)
    JOIN c ON (a.itemid = c.itemid)
    GROUP BY a.state
Tez avoids unneeded writes to HDFS.
[Diagram: the Hive/MR plan runs the table scans, JOIN(a, c), and JOIN(a, b) with GROUP BY a.state as separate map/reduce jobs with HDFS writes between them; the Hive/Tez plan runs the same operators as a single graph of map and reduce tasks.]

15 ORCFile: Columnar Storage for Hive
- Columns stored separately
- Knows types: uses type-specific encoders and stores statistics (min, max, sum, count)
- Has light-weight index: skip over blocks of rows that don't matter
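The row-skipping idea on this slide can be sketched as follows. This is a simplified model rather than the ORC reader: `RowGroup`, `build_row_groups`, and `scan_equal` are hypothetical names, and a real file keeps these statistics in its index rather than in Python objects.

```python
from dataclasses import dataclass

@dataclass
class RowGroup:
    rows: list       # values of one integer column in this block
    min_value: int   # statistics recorded when the block is written
    max_value: int

def build_row_groups(values, group_size):
    """Split a column into blocks and record min/max for each block."""
    groups = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        groups.append(RowGroup(chunk, min(chunk), max(chunk)))
    return groups

def scan_equal(groups, target):
    """Return matching values and how many blocks actually had to be read."""
    matches, groups_read = [], 0
    for group in groups:
        if target < group.min_value or target > group.max_value:
            continue  # the statistics prove no row in this block can match
        groups_read += 1
        matches.extend(v for v in group.rows if v == target)
    return matches, groups_read
```

With sorted or clustered data, an equality or range predicate touches only a small fraction of the blocks; with randomly ordered data the min/max ranges overlap and far less is skipped.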

16 ORCFile: Columnar Storage for Hive
- Large block size ideal for map/reduce.
- Columnar format enables high compression and high performance.

17 Query Planner: Cost Based Optimizer in Hive
The Cost-Based Optimizer (CBO) uses statistics within Hive tables to produce optimal query plans.
Why cost-based optimization?
- Ease of Use: Reduces the need for specialists to tune queries.
- Join Reordering: More efficient query plans lead to better cluster utilization.

18 Statistics: Foundations for CBO
Kind of statistics:
- Table Statistics (collected on load, per partition): uncompressed size, number of rows, number of files
- Column Statistics (required by CBO): NDV (Number of Distinct Values), nulls, min, max
Usability: how does the data get statistics?
- Analyze Table command
  - Analyze the entire table
  - Run this command per partition
  - Run for some partitions and the compiler will extrapolate statistics
- Collecting statistics on load
  - Table stats can be collected if you insert via Hive using set hive.stats.autogather=true
  - Not with load data file
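To see why row counts and NDV matter to a cost-based optimizer, consider the textbook equi-join estimate |R join S| = |R| * |S| / max(NDV_R, NDV_S). The sketch below uses that formula to choose which join to run first; it is not Hive's cost model, and `ColumnStats`, `estimate_join_rows`, and `pick_first_join` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

@dataclass
class ColumnStats:
    row_count: int  # table statistic: number of rows
    ndv: int        # column statistic: distinct values of the join key

def estimate_join_rows(left, right):
    """Textbook equi-join cardinality estimate from row counts and NDV."""
    return left.row_count * right.row_count / max(left.ndv, right.ndv)

def pick_first_join(candidates):
    """candidates maps a join name to (left stats, right stats).

    Returns the join with the smallest estimated output, a simple greedy
    heuristic for keeping intermediate results small.
    """
    return min(candidates, key=lambda name: estimate_join_rows(*candidates[name]))
```

Without column statistics the estimate cannot be computed at all, which is why the slide lists NDV as required by the CBO.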

19 HDP 2.1: A Journey to SQL Compliance
Evolution of SQL Compliance in Hive
SQL Datatypes:
- INT/TINYINT/SMALLINT/BIGINT
- FLOAT/DOUBLE
- BOOLEAN
- ARRAY, MAP, STRUCT, UNION
- STRING
- BINARY
- TIMESTAMP
- DECIMAL
- DATE
- VARCHAR
- CHAR
SQL Semantics:
- SELECT, INSERT
- GROUP BY, ORDER BY, HAVING
- JOIN on explicit join key
- Inner, outer, cross and semi joins
- Sub-queries in the FROM clause
- ROLLUP and CUBE
- UNION
- Standard aggregations (sum, avg, etc.)
- Custom Java UDFs
- Windowing functions (OVER, RANK, etc.)
- Advanced UDFs (ngram, XPath, URL)
- JOINs in WHERE Clause
- Sub-queries for IN/NOT IN, HAVING
Legend (color-coded on the slide): Hive 10 or earlier, Hive 11, Hive 12, Hive 13

20 Hive 0.13
"Now this is not the end. It is not even the beginning of the end. But it is, perhaps, the end of the beginning." - Winston Churchill

21 Stinger.Next

22 Stinger.Next: Delivery Themes
Hive 0.14:
- Transactions with ACID allowing insert, update and delete
- Streaming Ingest
- Cost Based Optimizer optimizes star and bushy join queries
Sub-Second (1st Half 2015):
- Sub-Second queries with LLAP
- Hive-Spark Machine Learning integration
- Operational reporting with Hive Streaming Ingest and Transactions
Richer Analytics (2nd Half 2015):
- Toward SQL:2011 Analytics
- Materialized Views
- Cross-Geo Queries
- Workload Management via YARN and LLAP integration

23 Transaction Use Cases
Reporting with Analytics (YES) [Diagram: Analytics Modifications -> Hive]
- Reporting on data with occasional updates
- Corrections to the fact tables, evolving dimension tables
- Low concurrency updates, low TPS
Operational Reporting (YES) [Diagram: OLTP -> Replication -> Hive]
- High throughput ingest from operational (OLTP) database
- Periodic inserts every 5-30 minutes
- Requires tool support and changes in our Transactions
Operational (OLTP) Database (NO) [Diagram: High Concurrency OLTP -> Hive]
- Small Transactions, each doing single line inserts
- High Concurrency: hundreds to thousands of connections

24 Deep Dive: Transactions
Transaction support in Hive with ACID semantics: Hive native support for INSERT, UPDATE, DELETE.
Split into phases:
- Phase 1 [Done]: Hive Streaming Ingest (append)
- Phase 2 [Done]: INSERT / UPDATE / DELETE Support
- Phase 3 [Next]: BEGIN / COMMIT / ROLLBACK Txn
The Hive ACID Compactor periodically merges the delta files in the background.
[Diagram: three states of a table]
1. Original File: the task reads the latest read-optimized ORCFile
2. Edits Made: the task reads the ORCFile and merges the delta file with the edits
3. Edits Merged: the task reads the updated (merged) read-optimized ORCFile
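The three states in the diagram can be modelled with a small merge-on-read sketch. It is a toy model under stated assumptions, not Hive's on-disk format: rows are a dict keyed by a row id, each delta is a list of `(operation, row_id, value)` tuples, and the names `read_with_deltas` and `compact` are hypothetical.

```python
def read_with_deltas(base_rows, deltas):
    """Merge-on-read: apply delta edits, in commit order, over the base rows."""
    current = dict(base_rows)              # row id -> value
    for delta in deltas:
        for op, row_id, value in delta:
            if op == "delete":
                current.pop(row_id, None)
            else:                          # "insert" or "update"
                current[row_id] = value
    return current

def compact(base_rows, deltas):
    """Compaction: fold every delta into a new base, leaving no deltas behind."""
    return read_with_deltas(base_rows, deltas), []
```

A reader sees the same rows before and after compaction; what changes is that the merge work moves from every read to a background job.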

25 Transactions - Requirements
- The table must be declared with the transactional table property
- The table must be in ORC format
- The table must be bucketed

26 Putting It Together

27 Step 1 - Turn On Transactions
Hive Configuration:
    hive.support.concurrency=true
    hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
    hive.compactor.initiator.on=true
    hive.compactor.worker.threads=2
    hive.enforce.bucketing=true
    hive.exec.dynamic.partition.mode=nonstrict

28 Step 2 - Enable Concurrency By Defining Queues
YARN Configuration:
    yarn.scheduler.capacity.root.queues=default,hiveserver
    yarn.scheduler.capacity.root.default.capacity=50
    yarn.scheduler.capacity.root.hiveserver.capacity=50
    yarn.scheduler.capacity.root.hiveserver.queues=hive1,hive2
    yarn.scheduler.capacity.root.hiveserver.hive1.capacity=50
    yarn.scheduler.capacity.root.hiveserver.hive1.user-limit-factor=4
    yarn.scheduler.capacity.root.hiveserver.hive2.capacity=50
    yarn.scheduler.capacity.root.hiveserver.hive2.user-limit-factor=4
[Diagram: cluster capacity split between the Default, Hive1, and Hive2 queues]
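Capacity values are percentages of the parent queue, so the share of the whole cluster that each leaf queue is guaranteed has to be multiplied out. The sketch below does that arithmetic for the values on this slide; the `CAPACITIES` dictionary and the `absolute_capacity` helper are illustrative names, not part of YARN.

```python
# Percent-of-parent capacities copied from the slide's configuration.
CAPACITIES = {
    "root.default": 50,
    "root.hiveserver": 50,
    "root.hiveserver.hive1": 50,
    "root.hiveserver.hive2": 50,
}

def absolute_capacity(queue, capacities=CAPACITIES):
    """Multiply the percentage at each level of the queue path below root."""
    parts = queue.split(".")
    share = 1.0
    for depth in range(2, len(parts) + 1):
        share *= capacities[".".join(parts[:depth])] / 100.0
    return share

for name in ("root.default", "root.hiveserver.hive1", "root.hiveserver.hive2"):
    print(f"{name}: {absolute_capacity(name):.0%} of the cluster")
```

Under this reading, hive1 and hive2 are each guaranteed a quarter of the cluster. A user-limit-factor of 4 then lets a single user grow to roughly four times that guarantee when capacity is idle, subject to the queue's maximum capacity.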

29 Step 3 - Deliver Capacity Guarantees By Enabling YARN Preemption
YARN Configuration:
    yarn.resourcemanager.scheduler.monitor.enable=true
    yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy
    yarn.resourcemanager.monitor.capacity.preemption.monitoring_interval=1000
    yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill=5000
    yarn.resourcemanager.monitor.capacity.preemption.total_preemption_per_round=0.4

30 Step 4 - Enable Tez Execution Engine & Tez Sessions
Hive Configuration:
    hive.execution.engine=tez
    hive.server2.tez.initialize.default.sessions=true
    hive.server2.tez.default.queues=hive1,hive2
    hive.server2.tez.sessions.per.default.queue=1
    hive.server2.enable.doAs=false
    hive.vectorized.groupby.maxentries=10240
    hive.vectorized.groupby.flush.percent=0.1
The hive.server2.tez.* settings enable sessions for the Hive queues.

31 Step 5 - Create Partitioned & Bucketed ORC Tables
    CREATE TABLE IF NOT EXISTS test (id int, val string)
    PARTITIONED BY (year string, month string, day string)
    CLUSTERED BY (id) INTO 7 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ("transactional"="true");
Note:
- Transactions require bucketed tables in ORC format. Tables cannot be sorted.
- transactional=true must be set as a table property
- For performance, table partitioning is recommended but not mandatory
- Partition on filter columns with low cardinality
- For optimal performance stay below 1000 partitions
- Cluster on join columns
- Number of buckets is contingent on dataset size
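How "clustered by (id) into 7 buckets" distributes rows can be sketched as below. This assumes the classic scheme for integer columns, in which the hash of an int is the value itself, masked to be non-negative and taken modulo the bucket count; newer Hive versions may hash differently, and `bucket_for` and `bucket_sizes` are hypothetical helper names.

```python
from collections import Counter

def bucket_for(id_value, num_buckets=7):
    """Bucket index for an int key, assuming hash(int) == int (an assumption)."""
    return (id_value & 0x7FFFFFFF) % num_buckets

def bucket_sizes(ids, num_buckets=7):
    """Count how many rows land in each bucket, to check for skew."""
    counts = Counter(bucket_for(i, num_buckets) for i in ids)
    return [counts.get(b, 0) for b in range(num_buckets)]
```

Rows with the same id always fall in the same bucket, which is what lets a bucketed join match bucket to bucket. Running `bucket_sizes` over a sample of real keys is a quick way to see whether a candidate bucket count spreads the data evenly.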

32 Step 6 - Loading Data into ORC Table
- Sqoop, Flume & Storm support direct ingestion to ORC tables
- Have a text file?
  - Load it to a table stored as textfile
  - Transfer it to the ORC table using a Hive insert statement

33 Step 7 - Compute Statistics
Compute Table Stats:
    analyze table test partition(year,month,day) compute statistics;
Compute Column Stats:
    analyze table test partition(year,month,day) compute statistics for columns;
Note:
- In Hive 0.14, column stats can be calculated for all partitions in a single statement
- To limit computation to a specific partition, specify partition keys
Keep Stats Updated:
- Speed computation by limiting it to partitions that have changed
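One way to limit statistics collection to changed partitions is to generate the statements from a list of partition values. The helper below is a hypothetical sketch: it only builds the HiveQL strings (it does not connect to Hive), and the `analyze_statements` name and the example partition values are assumptions.

```python
def analyze_statements(table, partitions):
    """Build table-level and column-level ANALYZE statements per partition.

    partitions is a list of dicts mapping partition key to value,
    for example {"year": "2014", "month": "12", "day": "04"}.
    """
    statements = []
    for spec in partitions:
        clause = ", ".join(f"{key}='{value}'" for key, value in spec.items())
        prefix = f"ANALYZE TABLE {table} PARTITION({clause}) COMPUTE STATISTICS"
        statements.append(prefix + ";")
        statements.append(prefix + " FOR COLUMNS;")
    return statements

for statement in analyze_statements("test", [{"year": "2014", "month": "12", "day": "04"}]):
    print(statement)
```

The generated statements can then be run through whichever Hive client is already in use.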

34 Sample Code: Sqoop Import To ORC Table
    sqoop import --verbose \
      --connect 'jdbc:mysql://localhost/people' \
      --table persons --username root \
      --hcatalog-table persons \
      --hcatalog-storage-stanza "stored as orc" \
      -m 1
Use HCatalog to import to an ORC table.

35 Sample Code: Flume Configuration For Hive Streaming Ingest
    ## Agent
    agent.sources = csvfile
    agent.sources.csvfile.type = exec
    agent.sources.csvfile.command = tail -F /root/test.txt
    agent.sources.csvfile.batchSize = 1
    agent.sources.csvfile.channels = memorychannel
    agent.sources.csvfile.interceptors = intercepttime
    agent.sources.csvfile.interceptors.intercepttime.type = timestamp
    ## Channels
    agent.channels = memorychannel
    agent.channels.memorychannel.type = memory
    agent.channels.memorychannel.capacity =
    ## Hive Streaming Sink
    agent.sinks = hiveout
    agent.sinks.hiveout.type = hive
    agent.sinks.hiveout.hive.metastore = thrift://localhost:9083
    agent.sinks.hiveout.hive.database = default
    agent.sinks.hiveout.hive.table = test
    agent.sinks.hiveout.hive.partition = %y,%m,%d
    agent.sinks.hiveout.serializer = DELIMITED
    agent.sinks.hiveout.serializer.fieldnames = id,val
    agent.sinks.hiveout.channel = memorychannel

36 Q&A
