Cloudera Impala Headline Goes Here

Size: px

Start display at page:

Download "Cloudera Impala Headline Goes Here"

Isabella Holmes
6 years ago
Views:

1 Cloudera Impala Headline Goes Here JusAn Erickson Senior Product Manager Speaker Name or Subhead Goes Here February 2013 DO NOT USE PUBLICLY PRIOR TO 10/23/12

2 Agenda Intro to Impala Architectural Overview AlternaAve Approaches Status, Timelines, and Futures

3 Why Hadoop? Scalability Simply scales just by adding nodes Local processing to avoid network borlenecks Flexibility All kinds of data (blobs, documents, records, etc) In all forms (structured, semi- structured, unstructured) Store anything then later analyze what you need Efficiency Cost efficiency on commodity hardware Unified storage, metadata, security (no duplicaaon or synchronizaaon)

4 Beyond Batch The Next Stage for Hadoop HADOOP TODAY IS TOO SLOW MapReduce is batch Simple queries can take minutes / tens of minutes CURRENT DATA MANAGEMENT IS TOO COMPLEX OpAmized for rigid schemas & known access parerns Redundant data storage & processes Very expensive systems: $20K- 150K / TB

5 What s Impala? InteracCve SQL Typically 4-35x faster than Hive (observed up to 100x faster) Responses in seconds instead of minutes (someames sub- second) Nearly ANSI- 92 standard SQL queries with HiveQL CompaAble SQL interface for exisang Hadoop/CDH applicaaons Based on industry standard SQL NaCvely on Hadoop/HBase storage and metadata Flexibility, scale, and cost advantages of Hadoop No duplicaaon/synchronizaaon of data and metadata Local processing to avoid network borlenecks Separate runcme from MapReduce MapReduce is designed and great for batch Impala is purpose- built for low- latency SQL queries on Hadoop

6 So what? InteracCve BI/analyCcs BI tools impracacal on Hadoop before Impala Move from 10s of Hadoop users per cluster to 100s of SQL users More and faster value from big data Data processing with Cght SLAs Sub- minute SLAs now possible Cost efficiency Fewer nodes to meet response Ame SLAs

Cloudera Now Powered by Impala BEFORE IMPALA WITH IMPALA USER

000s of nodes Same Flexibility Same Efficiency: Unified Storage

(Hadoop security) Unified Client Interfaces (SQL syntax, ODBC/JDBC,

MapReduce Greater efficiency with Impala: All nodes for all

7 Cloudera Now Powered by Impala BEFORE IMPALA WITH IMPALA USER INTERFACE BATCH PROCESSING REAL- TIME ACCESS Same Scalability to 000s of nodes Same Flexibility Same Efficiency: Unified Storage (HDFS and Hbase) Unified Metadata (Hive metastore) Unified Security (Hadoop security) Unified Client Interfaces (SQL syntax, ODBC/JDBC, Hue) Greater Flexibility with Impala: Both InteracAve SQL and Batch MapReduce Greater efficiency with Impala: All nodes for all processing Impala Provides: InteracAve BI/analyAcs Tighter SLAs Greater efficiency for same SLAs

8 Impala Architecture Two binaries: impalad and statestored Impala daemon (impalad) one Impala daemon on each node with data handles external client requests and all internal requests related to query execuaon State store daemon (statestored) provides name service and metadata distribuaon

9 Impala Architecture: Query ExecuAon Phases Request arrives via ODBC/JDBC/Beeswax/Shell Planner turns request into collecaons of plan fragments Coordinator iniaates execuaon on impalad's local to data During execuaon: intermediate results are streamed between executors query results are streamed back to client subject to limitaaons imposed to blocking operators (top- n, aggregaaon)

10 Impala Architecture: Planner Example: query with join and aggregaaon SELECT state, SUM(revenue) FROM HdfsTbl h JOIN HbaseTbl b ON (...) GROUP BY 1 ORDER BY 2 desc LIMIT 10 TopN Agg Hash Join Hdfs Scan Hbase Scan TopN Agg Exch Agg Hash Join Hdfs Exch Scan Hbase Scan at coordinator at DataNodes at region servers

11 Impala Architecture: Query ExecuAon Request arrives via ODBC/JDBC/Beeswax/Shell SQL App ODBC SQL request Hive Metastore HDFS NN Statestore Query Planner Query Coordinator Query Executor HDFS DN HBase Query Planner Query Coordinator Query Executor HDFS DN HBase Query Planner Query Coordinator Query Executor HDFS DN HBase

12 Impala Architecture: Query ExecuAon Planner turns request into collecaons of plan fragments Coordinator iniaates execuaon on impalad's local to data SQL App ODBC Hive Metastore HDFS NN Statestore Query Planner Query Coordinator Query Executor HDFS DN HBase Query Planner Query Coordinator Query Executor HDFS DN HBase Query Planner Query Coordinator Query Executor HDFS DN HBase

13 Impala Architecture: Query ExecuAon Intermediate results are streamed between impalad s Query results are streamed back to client SQL App ODBC query results Query Planner Query Coordinator Query Executor HDFS DN HBase Hive Metastore Query Planner Query Coordinator Query Executor HDFS DN HBase HDFS NN Statestore Query Planner Query Coordinator Query Executor HDFS DN HBase

14 Impala and Hive Everything Client- Facing is Shared with Hive: Metadata (table definiaons) ODBC/JDBC drivers Hue Beeswax SQL syntax (HiveQL) Flexible file formats Machine pool Internal Improvements: Purpose- built query engine direct on HDFS and HBase No JVM startup and no MapReduce In- memory data transfers NaAve distributed relaaonal query engine

15 AlternaAve Hadoop Query Approaches MapReduce Remote Query Side Storage Query Node MR DN OR Hive MR HDFS Query Node NN Query Node Query Node DN DN DN Query Engine MR DN High- latency MR Separate nodes for SQL/MR Duplicate metadata, security, SQL, MR, etc. Network borleneck Separate nodes for SQL/MR Duplicate metadata, security, SQL, MR, etc. Query subset of data RDBMS rigid schema Duplicate storage, metadata, security, SQL, etc.

16 Comparing Impala to Dremel What is Dremel: columnar storage for data with nested structures distributed scalable aggregaaon on top of that Columnar storage in Hadoop: joint project between Cloudera and TwiRer new columnar format, derived from Doug Cuong's Trevni stores data in appropriate naave/binary types can also store nested structures similar to Dremel's ColumnIO Distributed aggregaaon: Impala Impala plus columnar format: a superset of the published version of Dremel (which didn't support joins and mulaple file formats)

17 Timelines Beta (available since Oct 2012) QuesAons/comments? impala- GA (very early Q2 2013) JDBC driver All CDH4 OSes Columnar format MR/Impala resource isolaaon Perf (joins, aggregaaons, SQL features) Post- GA top items: UDFs Nested data Window funcaons Memory caching

18 Validated Beta Partners POWERED BY IMPALA

Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and thevirtual EDW Headline Goes Here

Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and thevirtual EDW Headline Goes Here Marcel Kornacker marcel@cloudera.com Speaker Name or Subhead Goes Here 2013-11-12 Copyright 2013 Cloudera