
Hadoop Beyond Batch: Real-time Workloads, SQL-on-Hadoop, and the Virtual EDW
Marcel Kornacker (marcel@cloudera.com)
2013-11-12
Copyright 2013 Cloudera Inc. All rights reserved.

Analytic Workloads on Hadoop: Where Do We Stand?
- (The DeWitt clause prohibits naming the DBMS vendor used for comparison.)

Outline: Hadoop Technologies for Analytic Workloads
- HDFS: a storage system for analytic workloads
- Parquet: columnar storage
- Impala: a modern, open-source SQL engine for Hadoop

HDFS: A Storage System for Analytic Workloads
Goal: high efficiency, i.e. data transfer at or near hardware speed.
- Short-circuit reads: bypass the DataNode protocol when reading from local disk -> read at 100+ MB/s per disk.
- HDFS caching: access explicitly cached data without copying or checksumming -> read memory-resident data at memory bus speed (see the sketch below).
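As a concrete illustration of HDFS caching: later Impala releases (newer than the version covered in this talk) expose it through SQL DDL. This is a minimal sketch with a hypothetical table name and cache pool; the pool itself must first be created by an HDFS administrator.

    -- Pin the table's HDFS blocks into the (hypothetical) 'analytics_pool' cache pool.
    ALTER TABLE web_logs SET CACHED IN 'analytics_pool';

    -- Subsequent scans read the cached replicas without extra copies or checksumming.
    SELECT COUNT(*) FROM web_logs;

    -- Release the cached replicas when they are no longer needed.
    ALTER TABLE web_logs SET UNCACHED;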

HDFS: A Storage System for Analytic Workloads
Coming attractions:
- Affinity groups: co-locate blocks from different files -> create co-partitioned tables for improved join performance.
- Temp-fs: write temp-table data straight to memory, bypassing disk -> ideal for iterative, interactive data analysis.

Parquet: Columnar Storage for Hadoop
What it is:
- A state-of-the-art, open-source columnar file format available to (most) Hadoop processing frameworks: Impala, Hive, Pig, MapReduce, Cascading, ...
- Co-developed by Twitter and Cloudera, with contributors from Criteo, Stripe, Berkeley AMPLab, and LinkedIn.
- Used in production at Twitter and Criteo.

Parquet: The Details
- Columnar storage: column-major instead of the traditional row-major layout; used by all high-end analytic DBMSs (see the sketch below).
- Optimized storage of nested data structures: patterned after Dremel's ColumnIO format.
- Extensible set of column encodings: run-length and dictionary encodings in the current version (1.2); delta and optimized string encodings in 2.0.
- Embedded statistics: version 2.0 stores inlined column statistics for further optimization of scan efficiency.
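To make the column-major layout concrete, here is a minimal Impala SQL sketch; the table and column names are hypothetical, and older Impala releases spell the format keyword PARQUETFILE rather than PARQUET.

    -- Convert an existing text-format table into Parquet via CTAS.
    CREATE TABLE sales_parquet STORED AS PARQUET
    AS SELECT * FROM sales_text;

    -- A query that references only two columns reads only those column
    -- chunks from disk; the remaining columns are never touched.
    SELECT product_id, SUM(revenue)
    FROM sales_parquet
    GROUP BY product_id;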

Parquet: Storage Efficiency (chart)

Parquet: Scan Efficiency (chart)

Impala: A Modern, Open-Source SQL Engine
- An implementation of a parallel SQL query engine for the Hadoop environment.
- High-performance, modern execution engine written in C++; developed by Cloudera and fully open source.
- Utilizes standard Hadoop components (HDFS, HBase, Metastore, YARN).
- Exposes/interacts with industry-standard interfaces (ODBC/JDBC, Kerberos and LDAP, ANSI SQL).

Impala: A Modern, Open-Source SQL Engine
History:
- Released as beta in 10/2012.
- Version 1.0 available in 05/2013.
- Current version is 1.2, available for the CDH5 beta (version 1.2.1 for CDH4.5 to be released in a few weeks).

Impala from the User's Perspective
- Create tables as virtual views over data stored in HDFS or HBase; schema metadata is stored in the Metastore (shared with Hive, Pig, etc.; the basis of HCatalog). See the sketch after this list.
- Connect via ODBC/JDBC; authenticate via Kerberos or LDAP.
- Run standard SQL:
  - current version: ANSI SQL-92 (limited to SELECT and bulk insert), minus correlated subqueries, UDFs, UDAs
  - version 2.0: analytic window functions, UDTFs, SQL extensions for nested types (Avro/protocol-buffer data)
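A minimal sketch of that workflow in Impala SQL. The HDFS path, table, and column names are hypothetical and only illustrate the "virtual view over existing files" idea; the DDL copies or moves no data.

    -- Map a schema onto files that already live in HDFS.
    CREATE EXTERNAL TABLE web_logs (
      event_time STRING,
      user_id    BIGINT,
      url        STRING,
      status     INT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/web_logs';

    -- Query it with standard SQL over ODBC/JDBC or from impala-shell.
    SELECT status, COUNT(*) AS hits
    FROM web_logs
    GROUP BY status
    ORDER BY hits DESC
    LIMIT 100;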

Impala Architecture
- Distributed service: a daemon process (impalad) runs on every node with data; easily deployed with Cloudera Manager.
- Each node can handle user requests; a load-balancer configuration is recommended for multi-user environments.
- Query execution phases:
  - the client request arrives via ODBC/JDBC
  - the planner turns the request into a collection of plan fragments
  - the coordinator initiates execution on the remote impalads

Impala Architecture: Query Planning
Two-phase process:
- Single-node plan: a left-deep tree of query operators.
- Partitioning into plan fragments for distributed parallel execution: maximize scan locality and minimize data movement, parallelize all query operators.
- Cost-based join-order optimization in version 1.2.1 and later.
(A sketch of inspecting the resulting plan follows.)
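The fragmented plan can be inspected with EXPLAIN. The query below is a hypothetical example (table and column names invented); the exact plan output depends on the Impala version and on the table statistics available.

    -- Show the single-node plan and its decomposition into distributed
    -- fragments (scans, exchanges, join, aggregation) without running the query.
    EXPLAIN
    SELECT c.region, SUM(o.total) AS revenue
    FROM orders o
    JOIN customers c ON o.customer_id = c.id
    GROUP BY c.region;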

Impala Architecture: Query Execution
- Circumvents MapReduce completely.
- Intermediate results are streamed directly between impalads and never hit disk.
- Query results are streamed back to the client from the coordinator.
- In-memory execution: right-hand-side inputs of joins and aggregation results are cached in memory.
- Example: joining against a 1 TB table while referencing 2 of its 200 columns and 10% of its rows requires caching 1 TB x (2/200) x 10% = 1 GB across all nodes in the cluster -> not a limitation for most workloads. (A sketch of such a query follows.)
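A hypothetical query shape matching that back-of-the-envelope estimate; all table and column names are invented. Only the referenced columns of the filtered build side are held in memory, not the full 1 TB table.

    -- dim_big stands in for the ~1 TB, ~200-column table on the join's build side.
    -- Only a few of its columns are referenced and the predicate keeps ~10% of its
    -- rows, so roughly 1 GB (per the 1 TB x 2/200 x 10% estimate) is cached
    -- across the cluster.
    SELECT f.day, SUM(d.weight) AS total_weight
    FROM fact_events f
    JOIN dim_big d ON f.item_id = d.item_id
    WHERE d.category = 'active'
    GROUP BY f.day;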

Impala vs. MR for Analytic Workloads
Impala vs. SQL-on-MapReduce:
- Impala 1.1.1 / Hive 0.12
- File formats: Parquet / ORCFile
- TPC-DS, 3 TB data set
- Running on a 5-node cluster

Impala vs. MR for Analytic Workloads
Impala speedup:
- Interactive queries: 8-69x
- Report queries: 6-68x
- Deep analytics: 10-58x

Scalability in Hadoop
Dimensions of scalability:
- Response-time scaling
- Concurrency scaling
- Data-size scaling
Hadoop's promise of linear scalability: add more nodes to the cluster and gain a proportional increase in capability -> adapt to any kind of workload change simply by adding more nodes to the cluster.

Impala Scalability: Latency
- Goal: cut response time in half by doubling the number of nodes.
- Test setup: two clusters of 18 and 36 nodes; 15 TB TPC-DS data set.

Impala Scalability: Latency (results chart)

Impala Scalability: Concurrency
- Goal: handle 2x the concurrent users with 2x the nodes (without response-time degradation).
- Test setup: same as before, with 10 and 20 concurrent users.

Impala Scalability: Concurrency (results chart)

Impala Scalability: Data Size
- Goal: query 2x the data size with 2x the hardware (without response-time degradation).
- Test setup: 18- and 36-node clusters with 15 TB and 30 TB TPC-DS data sets.

Impala Scalability: Data Size (results chart)

Summary: Hadoop for Analytic Workloads
- Hadoop has traditionally been used for offline batch processing: ETL and ELT.
- The latest technological innovations add capabilities that originated in high-end proprietary systems:
  - high-performance disk scans and memory caching in HDFS
  - Parquet: columnar storage for analytic workloads
  - Impala: high-performance parallel SQL execution

Summary: Hadoop for Analytic Workloads
- Initial results are encouraging: the Hadoop-based open-source stack offers performance that is competitive with or better than a top-5 proprietary analytic DBMS solution, while maintaining Hadoop's strengths: flexibility, ease of scaling, and cost effectiveness.
- What the future holds:
  - further performance gains
  - more complete SQL capabilities
  - improved resource management and the ability to handle multiple concurrent workloads in a single cluster

The End