Impala: A Modern, Open Source SQL Engine for Hadoop. Yogesh Chockalingam

Agenda: Introduction, Architecture, Front End, Back End, Evaluation, Comparison with Spark SQL

Introduction

Why not use Hive or HBase?
Hive is a data warehousing tool built on top of Hadoop that uses the Hive Query Language (HQL) for querying data stored in a Hadoop cluster. Hive automatically translates HQL queries into MapReduce jobs, which makes execution batch-oriented and high-latency, and it does not support transactions. HBase is a NoSQL database that runs on top of HDFS and provides real-time read/write access, but not a general-purpose SQL interface.

Impala is a general-purpose SQL query engine:
- Works across analytical and transactional workloads.
- High performance: the execution engine is written in C++, runs directly within Hadoop, and does not use MapReduce.
- MPP database support: multi-user workloads.

Creating tables

CREATE TABLE T (...)
PARTITIONED BY (day int, month int)
STORED AS PARQUET
LOCATION '<hdfs-path>';

For a partitioned table, data is placed in subdirectories whose paths reflect the partition columns' values. For example, for day 17, month 2 of table T, all data files would be located in <root>/day=17/month=2/.

Metadata
Table metadata, including the table definition, column names, data types, and schema, is stored in HCatalog.
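A minimal sketch of inspecting that metadata from Impala itself (DESCRIBE and DESCRIBE FORMATTED are standard statements shared with Hive; T is the partitioned table from the previous slide):

DESCRIBE T;            -- column names and types
DESCRIBE FORMATTED T;  -- full definition: location, file format, partition columns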

INSERT / UPDATE / DELETE
The user can add data to a table simply by copying/moving data files into its directory! Impala does NOT support UPDATE and DELETE. This is a limitation of HDFS, which does not support in-place updates; instead, recompute the values and replace the data in the affected partitions. Run COMPUTE STATS <table> after inserts; those statistics will subsequently be used during query optimization.
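A minimal sketch of that workflow against the partitioned table T from earlier (INSERT ... PARTITION, COMPUTE STATS, and SHOW TABLE STATS are standard Impala statements; staging_table is a hypothetical source):

INSERT INTO T PARTITION (day=17, month=2)
SELECT ... FROM staging_table;  -- files land under <root>/day=17/month=2/
COMPUTE STATS T;                -- gather row and column statistics for the optimizer
SHOW TABLE STATS T;             -- verify per-partition row counts and file sizes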

Architecture

I: Impala Daemon
The Impala daemon service is dually responsible for:
1. Accepting queries from client processes and orchestrating their execution across the cluster. In this role it's called the query coordinator.
2. Executing individual query fragments on behalf of other Impala daemons.
The Impala daemons are in constant communication with the statestore to confirm which nodes are healthy and can accept new work. They also receive broadcast messages from the catalog daemon via the statestore to keep track of metadata changes.

II: Statestore Daemon
Handles cluster membership information. Periodically sends two kinds of messages to the Impala daemons:
- Topic update: the changes made since the last topic update message.
- Keepalive: a heartbeat mechanism.
If an Impala daemon goes offline, the statestore informs all the other Impala daemons so that future queries can avoid making requests to the unreachable node.

III: Catalog Daemon
Impala's catalog service serves catalog metadata to Impala daemons via the statestore broadcast mechanism and executes DDL operations on behalf of Impala daemons. The catalog service pulls information from the Hive Metastore and aggregates it into an Impala-compatible catalog structure, which is then passed to the statestore daemon for broadcast to the Impala daemons.

1. A request arrives from a client via the Thrift API.
[Diagram: a SQL app sends an ODBC request to one of several Impala daemons; the Hive Metastore, HDFS NameNode, and statestore are shown alongside.]

2. The planner turns the request into collections of plan fragments; the coordinator initiates execution on remote Impala daemons.
[Diagram: the same components as in step 1.]

3. Intermediate results are streamed between Impala daemons; query results are streamed back to the client.
[Diagram: each Impala daemon runs a query planner, query coordinator, and query executor over an HDFS DataNode and HBase.]

Front-End

Query Plans
The Impala frontend is responsible for compiling SQL text into query plans executable by the Impala backends. Query compilation proceeds as follows: query parsing, semantic analysis, and query planning/optimization. Query planning itself has two phases:
1. Single-node planning
2. Plan parallelization and fragmentation

Query Planning: Single Node
In the first phase, the parse tree is translated into a non-executable single-node plan tree. Example: a query joining two HDFS tables (t1, t2) and one HBase table (t3), followed by an aggregation and an order by with limit (top-n).

SELECT t1.custid, SUM(t2.revenue) AS revenue
FROM LargeHdfsTable t1
JOIN LargeHdfsTable t2 ON (t1.id1 = t2.id)
JOIN SmallHbaseTable t3 ON (t1.id2 = t3.id)
WHERE t3.category = 'Online'
GROUP BY t1.custid
ORDER BY revenue DESC LIMIT 10;

[Single-node plan tree: Agg on top of HashJoin(HashJoin(Scan: t1, Scan: t2), Scan: t3)]
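The plan Impala actually produces can be inspected in impala-shell with EXPLAIN; a minimal sketch (EXPLAIN and the EXPLAIN_LEVEL query option are standard Impala features, but the exact plan text varies with version and available statistics):

SET EXPLAIN_LEVEL=2;  -- request more detailed plan output
EXPLAIN
SELECT t1.custid, SUM(t2.revenue) AS revenue
FROM LargeHdfsTable t1
JOIN LargeHdfsTable t2 ON (t1.id1 = t2.id)
JOIN SmallHbaseTable t3 ON (t1.id2 = t3.id)
WHERE t3.category = 'Online'
GROUP BY t1.custid
ORDER BY revenue DESC LIMIT 10;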

Query Planning: Distributed Nodes
The second planning phase takes the single-node plan as input and produces a distributed execution plan. The goal is to minimize data movement and maximize scan locality, as remote reads are considerably slower than local ones. The parallel join strategy is a cost-based decision driven by column stats and the estimated cost of data transfers:
- Broadcast join: the join is collocated with its left-hand side input; the right-hand side table is broadcast to each node executing the join. Preferred for a small right-hand side input.
- Partitioned join: both tables are hash-partitioned on the join columns. Preferred for large joins.
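A minimal sketch of steering that choice by hand (the [BROADCAST] and [SHUFFLE] join hints are standard Impala syntax; the tables reuse the earlier example):

SELECT t1.custid, SUM(t2.revenue) AS revenue
FROM LargeHdfsTable t1
JOIN [SHUFFLE] LargeHdfsTable t2 ON (t1.id1 = t2.id)      -- force a partitioned join
JOIN [BROADCAST] SmallHbaseTable t3 ON (t1.id2 = t3.id)   -- force a broadcast join
WHERE t3.category = 'Online'
GROUP BY t1.custid;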

Back-End

Executing the Query
Impala's backend receives query fragments from the frontend and is responsible for their execution. High performance comes from several techniques:
- Written in C++ for minimal execution overhead.
- An internal in-memory tuple format that puts fixed-width data at fixed offsets.
- Intrinsic/special CPU instructions for tasks like text parsing and CRC computation.
- Runtime code generation for big loops.

Runtime Code Generation
Impala uses runtime code generation to produce query-specific versions of functions that are critical to performance, for example the function that converts every record to Impala's in-memory tuple format.
- Known at query compile time: the number of tuples in a batch, the tuple layout, column types, etc.
- Generated at query compile time: an unrolled loop that inlines all function calls, eliminates dead code, and minimizes branches.
Code is generated using LLVM.
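A minimal sketch of observing codegen's effect from impala-shell (DISABLE_CODEGEN is a standard Impala query option and PROFILE a standard impala-shell command; the table sales and column amount are hypothetical):

SET DISABLE_CODEGEN=1;                          -- run on the interpreted path
SELECT COUNT(*) FROM sales WHERE amount > 100;
SET DISABLE_CODEGEN=0;                          -- run with LLVM code generation
SELECT COUNT(*) FROM sales WHERE amount > 100;
PROFILE;  -- the profile of the last query includes codegen timings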

Evaluation

Comparison of query response times on single-user runs.

Comparison of query response times and throughput on multi-user runs.

Comparison of the performance of Impala and a commercial analytic RDBMS. https://github.com/cloudera/impala-tpcds-kit

Comparison with Spark SQL

Impala is faster than Spark SQL because it is an engine designed specifically for interactive SQL over HDFS, and its architecture reflects that mission. For example, Impala's always-on daemons are up and waiting for queries around the clock, which is not part of Spark SQL's design.

Thank you!