Hadoop Beyond Batch: Real-time Workloads, SQL-on-Hadoop, and the Virtual EDW
Marcel Kornacker (marcel@cloudera.com)
April 2014

Analytic Workloads on Hadoop: Where Do We Stand?
(The DeWitt Clause prohibits naming the commercial DBMS vendors compared here.)

Hadoop for Analytic Workloads
Hadoop has traditionally been used for offline batch processing: ETL and ELT. The next step is Hadoop for traditional business intelligence (BI) / enterprise data warehouse (EDW) workloads: interactive, concurrent users. The topic of this talk is a Hadoop-based open-source stack for EDW workloads:
- HDFS: a high-performance storage system
- Parquet: a state-of-the-art columnar storage format
- Impala: a modern, open-source SQL engine for Hadoop

Hadoop for Analytic Workloads
Thesis of this talk:
- Techniques and functionality of established commercial solutions are either already available in the Hadoop stack or are rapidly being implemented.
- The Hadoop stack is an effective solution for certain EDW workloads.
- A Hadoop-based EDW solution maintains Hadoop's strengths: flexibility, ease of scaling, and cost effectiveness.

HDFS: A Storage System for Analytic Workloads
Available in HDFS today: high-efficiency data scans at or near hardware speed, both from disk and from memory. On the immediate roadmap:
- co-partitioned tables for even faster distributed joins
- temp-fs: write temp-table data straight to memory, bypassing disk

HDFS: The Details
High-efficiency data transfers:
- Short-circuit reads: bypass the DataNode protocol when reading from local disk, reading at 100+ MB/s per disk.
- HDFS caching: access explicitly cached data without copies or checksumming; memory-resident data is accessed at memory-bus speed, enabling in-memory processing.
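To make the caching feature concrete, here is a minimal sketch of pinning a table into an HDFS cache pool from Impala SQL. The pool and table names are hypothetical, and this ALTER TABLE syntax appeared in Impala releases slightly later than the version discussed in this talk, so treat it as illustrative:

    -- A cache pool is created outside SQL by an administrator,
    -- e.g. with: hdfs cacheadmin -addPool analytics_pool
    -- (hypothetical pool name).

    -- Pin a hot table into the HDFS cache so scans read
    -- memory-resident data without copies or checksumming:
    ALTER TABLE customers SET CACHED IN 'analytics_pool';

    -- Release the memory when the table is no longer hot:
    ALTER TABLE customers SET UNCACHED;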

HDFS: The Details
Coming attractions:
- Affinity groups: collocate blocks from different files to create co-partitioned tables for improved join performance.
- temp-fs: write temp-table data straight to memory, bypassing disk; ideal for iterative, interactive data analysis.

Parquet: Columnar Storage for Hadoop
What it is:
- a state-of-the-art, open-source columnar file format available for (most) Hadoop processing frameworks: Impala, Hive, Pig, MapReduce, Cascading, and others
- offers both high compression and high scan efficiency
- co-developed by Twitter and Cloudera; hosted on GitHub and soon to be an Apache incubator project, with contributors from Criteo, Stripe, Berkeley AMPLab, and LinkedIn
- used in production at Twitter and Criteo

Parquet: The Details
- Columnar storage: column-major instead of the traditional row-major layout; used by all high-end analytic DBMSs.
- Optimized storage of nested data structures, patterned after Dremel's ColumnIO format.
- Extensible set of column encodings: run-length and dictionary encodings in the current version (1.2); delta and optimized string encodings in 2.0.
- Embedded statistics: version 2.0 stores inlined column statistics for further optimization of scan efficiency.
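As a sketch of how Parquet is adopted in practice, the following Impala statements (table names hypothetical) convert an existing text-format table to Parquet; in releases contemporary with this talk the keyword was PARQUETFILE, later shortened to PARQUET:

    -- Create a Parquet copy of an existing table; column encodings
    -- and compression are applied automatically on write.
    CREATE TABLE sales_parquet STORED AS PARQUETFILE
    AS SELECT * FROM sales_text;

    -- Queries that touch only a few columns now scan far less data:
    SELECT cust_id, SUM(amount)
    FROM sales_parquet
    GROUP BY cust_id;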

Parquet: Storage Efficiency (chart slide)

Parquet: Scan Efficiency (chart slide)

Impala: A Modern, Open-Source SQL Engine
- An implementation of an MPP SQL query engine for the Hadoop environment.
- The highest-performance SQL engine for the Hadoop ecosystem; it already outperforms some of its commercial competitors.
- Effective for EDW-style workloads.
- Maintains Hadoop flexibility by using standard Hadoop components (HDFS, HBase, Metastore, YARN).
- Plays well with traditional BI tools: exposes and interacts with industry-standard interfaces (ODBC/JDBC, Kerberos and LDAP, ANSI SQL).

Impala: A Modern, Open-Source SQL Engine
History:
- developed by Cloudera and fully open-source; hosted on GitHub
- released as beta in 10/2012
- 1.0 version available in 05/2013
- current version is 1.2.3, available for CDH4 and the CDH5 beta

Impala from the User's Perspective
- Create tables as virtual views over data stored in HDFS or HBase; schema metadata is stored in the Metastore (shared with Hive, Pig, etc.; the basis of HCatalog). A sketch follows below.
- Connect via ODBC/JDBC; authenticate via Kerberos or LDAP.
- Run standard SQL. The current version supports ANSI SQL-92 (limited to SELECT and bulk insert) minus correlated subqueries, and has UDFs and UDAs.
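As a minimal sketch of the "tables as virtual views" model (paths and columns hypothetical): the table is only schema metadata layered over files that already sit in HDFS, and that metadata lands in the Metastore where Hive and Pig see it too:

    -- Expose existing HDFS files as a SQL table; no data is moved.
    CREATE EXTERNAL TABLE web_logs (
      ts      TIMESTAMP,
      user_id BIGINT,
      url     STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/web_logs';

    -- Standard SQL runs directly over those files (note the LIMIT;
    -- in this version ORDER BY still requires one):
    SELECT user_id, COUNT(*) AS hits
    FROM web_logs
    GROUP BY user_id
    ORDER BY hits DESC
    LIMIT 10;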

Impala from the User's Perspective
2014 roadmap:
- 1.3: admission control
- 1.4: DECIMAL(precision, scale)
- 2.0 or earlier: analytic window functions, ORDER BY without LIMIT, support for nested types (structs, arrays, maps), UDTFs, and disk-based joins and aggregation (an example window query is sketched below)
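For illustration, this is the kind of analytic window query (hypothetical table and columns) that the 2.0 roadmap items would enable:

    -- Rank each order by amount within its region; relies on the
    -- analytic window functions planned for Impala 2.0.
    SELECT region,
           cust_id,
           amount,
           RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk
    FROM orders;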

Impala Architecture
A distributed service:
- the daemon process (impalad) runs on every node that holds data; easily deployed with Cloudera Manager
- each node can handle user requests; a load-balancer configuration is recommended for multi-user environments
Query execution phases:
- a client request arrives via ODBC/JDBC
- the planner turns the request into a collection of plan fragments
- the coordinator initiates execution on remote impalad nodes

Impala Query Execution
The request arrives via ODBC/JDBC.

Impala Query Execution
The planner turns the request into a collection of plan fragments; the coordinator initiates execution on remote impalad nodes.

Impala Query Execution
Intermediate results are streamed between impalads; query results are streamed back to the client.

Impala Architecture: Query Planning
A two-phase process:
- single-node plan: a left-deep tree of query operators
- partitioning into plan fragments for distributed parallel execution: maximize scan locality, minimize data movement, and parallelize all query operators
- cost-based join-order optimization
- cost-based join-distribution optimization
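To look at the two-phase plan for a concrete (hypothetical) query, EXPLAIN prints the fragment tree the planner produces; the exact output varies by release, so the trailing comment only summarizes what to look for:

    -- The plan below is first built as a single-node operator tree,
    -- then cut into fragments for parallel execution.
    EXPLAIN
    SELECT c.state, COUNT(*) AS cnt
    FROM sales s JOIN customers c ON s.cust_id = c.id
    WHERE s.year = 2013
    GROUP BY c.state;

    -- Expect scan fragments placed next to the data, a broadcast or
    -- partitioned join chosen by cost, local pre-aggregation, and an
    -- exchange feeding the final aggregation.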

Impala Architecture: Query Execution
An execution engine designed for efficiency:
- written from scratch in C++; no reuse of decades-old open-source code
- circumvents MapReduce completely
- in-memory execution: aggregation results and the right-hand-side inputs of joins are cached in memory
- example: a join with a 1 TB table that references 2 of 200 columns and 10% of the rows needs to cache 1 TB x 2/200 x 10% = 1 GB across all nodes in the cluster; not a limitation for most workloads

Impala Architecture: Query Execution
Runtime code generation:
- uses LLVM to JIT-compile the runtime-intensive parts of a query
- the effect is the same as custom-coding a query: remove branches; propagate constants, offsets, pointers, etc.; inline function calls
- optimized execution for modern CPUs (instruction pipelines)
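One way to observe the effect of code generation is to disable it for a session and compare timings. This is a sketch: DISABLE_CODEGEN is a real Impala query option, but how it is set (impala-shell command vs. SQL statement) varies by release:

    -- Run a scan-heavy query with JIT compilation turned off...
    SET DISABLE_CODEGEN=1;
    SELECT COUNT(*) FROM sales_parquet WHERE amount > 100;

    -- ...then turn codegen back on and rerun; the compiled version
    -- removes branches and inlines the hot inner-loop functions.
    SET DISABLE_CODEGEN=0;
    SELECT COUNT(*) FROM sales_parquet WHERE amount > 100;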

Impala Architecture: Query Execution (chart slide)

Impala vs. MR for Analytic Workloads
Impala vs. SQL-on-MapReduce: Impala 1.1.1 vs. Hive 0.12 ("Stinger" Phases 1 and 2); file formats: Parquet vs. ORCFile; TPC-DS, 3 TB data set, running on a 5-node cluster.

Impala vs. MR for Analytic Workloads
Impala speedup: interactive queries 8-69x, reports 6-68x, deep analytics 10-58x.

Impala vs. Non-MR for Analytic Workloads
Impala 1.2.3 vs. Presto 0.6 vs. Shark; file formats: RCFile (plus Parquet); TPC-DS, 15 TB data set, running on a 21-node cluster.

Impala vs. Non-MR for Analytic Workloads (chart slide)

Impala vs. Non-MR for Analytic Workloads
Multi-user benchmark: 10 users concurrently; same dataset, same hardware; workload: queries from the interactive group.

Impala vs. Non-MR for Analytic Workloads (chart slide)

Scalability in Hadoop
Hadoop's promise of linear scalability: add more nodes to the cluster and gain a proportional increase in capabilities, so the cluster can adapt to any kind of workload change simply by adding nodes. Scaling dimensions for EDW workloads: response time, concurrency/query throughput, and data size.

Scalability in Hadoop
Scalability results for Impala: tests show linear scaling along all three dimensions. Setup: two clusters (18 and 36 nodes), a 15 TB TPC-DS data set, and 6 interactive TPC-DS queries.

Impala Scalability: Latency (chart slide)

Impala Scalability: Concurrency
Comparison: 10 vs. 20 concurrent users.

Impala Scalability: Data Size
Comparison: 15 TB vs. 30 TB data set.

Summary: Hadoop for Analytic Workloads
Thesis of this talk:
- Techniques and functionality of established commercial solutions are either already available in the Hadoop stack or are rapidly being implemented.
- Impala/Parquet/HDFS is an effective solution for certain EDW workloads.
- A Hadoop-based EDW solution maintains Hadoop's strengths: flexibility, ease of scaling, and cost effectiveness.

Summary: Hadoop for Analytic Workloads
The latest technological innovations add capabilities that originated in high-end proprietary systems:
- high-performance disk scans and memory caching in HDFS
- Parquet: columnar storage for analytic workloads
- Impala: high-performance parallel SQL execution

Summary: Hadoop for Analytic Workloads
Impala/Parquet/HDFS for EDW workloads:
- integrates into the BI environment via standard connectivity and security
- comparable or better performance than commercial competitors
- still has some SQL limitations today, but those are rapidly diminishing

Summary: Hadoop for Analytic Workloads
Impala/Parquet/HDFS maintains traditional Hadoop strengths:
- flexibility: Parquet is understood across the platform and natively processed by the most popular frameworks
- demonstrated scalability and cost effectiveness

The End

Summary: Hadoop for Analytic Workloads
What the future holds:
- further performance gains
- more complete SQL capabilities
- improved resource management and the ability to handle multiple concurrent workloads in a single cluster