Big Data SQL Deep Dive


Big Data SQL Deep Dive
Jean-Pierre Dijcks, Big Data Product Management
DOAG 2016
Copyright 2016, Oracle and/or its affiliates. All rights reserved.

Safe Harbor Statement
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle's products remains at the sole discretion of Oracle.

The Best of Both Worlds: SQL, Big Data SQL

Big Data SQL's Unique Architecture
1. Consolidate metadata across silos
2. Enables seamless blending of any type of data
3. Unique architecture to deliver performance:
   - Optimize end-to-end execution of analytical queries
   - Pushing down smart processing into diverse data tiers

Session Content
1. Using Metadata to Read ANY Data
2. Query Flow
3. IO Optimizations
4. Join Optimizations
5. Certification Update
6. Q&A

Big Data SQL: Using Metadata to Read Data

How Data is Stored in HDFS
data/
  hr/
    salaries.csv
  website/
    2016-01/
      clicks1.json
      clicks2.json
    2016-02/
      clicks3.json
      clicks4.json
Sample rows (CSV):
  Hanks,Spielberg,1000000
  Spielberg,Cameron,2500000
  Cameron,Oprah,125000
  Oprah,Boss,54000000
Sample records (JSON):
  {"custid":1354924,"movieid":1948,"genreid":9,"time":"2012-07-01:00:00:22"}
  {"custid":1083711,"movieid":null,"genreid":null,"time":"2012-07-01:00:00:26"}
  {"custid":1234182,"movieid":11547,"genreid":44,"time":"2012-07-01:00:00:32"}
  {"custid":1010220,"movieid":11547,"genreid":44,"time":"2012-07-01:00:00:42"}

Organize and Describe Data with Hive
hive/
  warehouse/
    data.db/              (Database)
      hr/                 (Table)
        salaries.csv
      website/            (Table)
        month=2016-01/    (Partition)
          clicks1.json
          clicks2.json
        month=2016-02/    (Partition)
          clicks3.json
          clicks4.json
- Information is captured in the Hive Metastore
- HDFS folders become: databases, tables, partitions
- Table includes metadata for parsing files using Java classes:
  - InputFormat defines chunks called splits based on file type
  - RecordReader creates rows out of splits
  - SerDe creates columns

How does Hive read ANY data?
Hive Metastore defines: InputFormat, RecordReader, SerDe
Example: SELECT name FROM my_cust WHERE id = 1
Flow: Any File Type -> Create Splits (InputFormat) -> Create Records (RecordReader) -> Create Attributes (SerDe) -> Select Data (SQL Execution)
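The four stages on this slide can be sketched in a few lines of Python. This is only an illustration of the division of labor, not Hive's actual Java interfaces: the function names, the split size, and the column names (`name`, `manager`, `salary`) are assumptions made for the example; the sample rows come from the earlier HDFS slide.

```python
import csv
import io

# Sample rows from the HDFS slide. The column names are hypothetical;
# the slides do not name the CSV fields.
SALARIES_CSV = (
    "Hanks,Spielberg,1000000\n"
    "Spielberg,Cameron,2500000\n"
    "Cameron,Oprah,125000\n"
    "Oprah,Boss,54000000\n"
)
COLUMNS = ["name", "manager", "salary"]


def input_format(data, split_size):
    """Cut the file into splits of at most split_size lines (InputFormat role)."""
    lines = data.splitlines()
    for start in range(0, len(lines), split_size):
        yield lines[start:start + split_size]


def record_reader(split):
    """Turn one split into raw records, one per line (RecordReader role)."""
    for line in split:
        yield line


def serde(record):
    """Parse a raw record into named, typed attributes (SerDe role)."""
    fields = next(csv.reader(io.StringIO(record)))
    row = dict(zip(COLUMNS, fields))
    row["salary"] = int(row["salary"])
    return row


def select_names(data, min_salary):
    """SELECT name ... WHERE salary >= min_salary, evaluated on parsed rows."""
    result = []
    for split in input_format(data, split_size=2):
        for record in record_reader(split):
            row = serde(record)
            if row["salary"] >= min_salary:
                result.append(row["name"])
    return result


names = select_names(SALARIES_CSV, min_salary=1_000_000)
print(names)
```

The point of the layering is that only `serde` knows the file format; swapping CSV for JSON would change that one function while the split and record stages stay the same.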

Metadata: Extend Oracle External Tables
  CREATE TABLE movielog (
    click VARCHAR2(4000))
  ORGANIZATION EXTERNAL (
    TYPE ORACLE_HIVE
    DEFAULT DIRECTORY DEFAULT_DIR
    ACCESS PARAMETERS (
      com.oracle.bigdata.tablename: logs
      com.oracle.bigdata.cluster=mycluster
    ))
  REJECT LIMIT UNLIMITED;
- New types of external tables:
  - ORACLE_HIVE (leverage Hive metadata)
  - ORACLE_HDFS (specify metadata)
- Access parameters describe how to identify sources and process data on the Hadoop cluster

Access Parameters: HDFS Example
  CREATE TABLE WEB_SALES_CSV (
    WS_SOLD_DATE_SK NUMBER,
    WS_SOLD_TIME_SK NUMBER,
    WS_ITEM_SK NUMBER
  )
  ORGANIZATION EXTERNAL (
    TYPE ORACLE_HDFS
    DEFAULT DIRECTORY DEFAULT_DIR
    ACCESS PARAMETERS (
      com.oracle.bigdata.cluster=orabig
      com.oracle.bigdata.fileformat=textfile
      com.oracle.bigdata.rowformat: DELIMITED FIELDS TERMINATED BY '
      com.oracle.bigdata.erroropt: {"action": "replace", "value": "-1"}
    )
    LOCATION ('/data/tpcds/benchmarks/bigbench/data/web_sales')
  )
  REJECT LIMIT UNLIMITED;
- Access parameters describe source data and processing rules
- Schema-on-Read

Access Parameters: ORACLE_HIVE
  CREATE TABLE WEB_SALES_CSV (
    WS_SOLD_DATE_SK NUMBER,
    WS_SOLD_TIME_SK NUMBER,
    WS_ITEM_SK NUMBER
  )
  ORGANIZATION EXTERNAL (
    TYPE ORACLE_HIVE
    DEFAULT DIRECTORY DEFAULT_DIR
    ACCESS PARAMETERS (
      com.oracle.bigdata.cluster=orabig
      com.oracle.bigdata.tablename: csv.web_sales
      com.oracle.bigdata.erroropt: {"action": "replace", "value": "-1"}
      com.oracle.bigdata.datamode=automatic
    )
  )
  REJECT LIMIT UNLIMITED;
- Access parameters refer to metadata in Hive
- Add processing rules

Viewing Hive Metadata from Oracle Database
ALL_HIVE_DATABASES, ALL_HIVE_TABLES, ALL_HIVE_COLUMNS

Creating Tables: SQL Developer with Hive JDBC
1. Right-click on the Hive table and choose "Use in Oracle Big Data SQL".
2. Review generated columns. Update as needed, focusing on data types and precision.
3. Add optional access parameters. Automatically generate the table or save the DDL.
See: https://blogs.oracle.com/datawarehousing/entry/oracle_sql_developer_data_modeler

Big Data SQL on top of Hive Data
Hive Metastore defines: InputFormat, RecordReader, SerDe
Flow: Any File Type -> Create Splits (InputFormat) -> Create Records (RecordReader) -> Create Attributes (SerDe) -> Convert Data & Smart Scan (Big Data SQL)

Recommended Approach: Use ORACLE_HIVE When Possible
- Oracle Database query execution accesses Hive metadata at describe time
  - Changes to underlying Hive access parameters will not impact the Oracle table
  - One exception: column list and partition lists in 12.2
- Metadata is an enabler for performance optimizations
  - Partition pruning and predicate pushdown into intelligent sources
- Utilize tooling for simplified table definitions
  - SQL Developer and DBMS_HADOOP packages

Big Data SQL: Query Flow

SQL-on-Hadoop Engines Share Metadata, not MapReduce
- Engines sharing the Hive Metastore: Oracle Big Data SQL, SparkSQL, Hive, Impala
- Hive Metastore table definitions: movieapp_log_json, Tweets, avro_log
- Metastore maps DDL to Java access classes

A Big Data SQL Query
[Flow diagram, phases Describe and Fetch. Steps: Query, Get Partitions, Get Block List, Distribute Processing, Project Columns, Apply SIs, Smart? (yes: Push Predicates; no: Fetch Data), Convert Data Types, Filter (Smart Scan), Rows? (yes: Return Results; no: Create SIs)]

Big Data SQL: IO Optimizations

Big Data SQL Performance Features: IO Reduction Features Deliver Compound Results
User Query: 100 TB
1. Partition Pruning: 10 TB
2. Storage Indexing: 1 TB
3. Predicate Pushdown: 100 GB
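The compounding effect is simple arithmetic, and a short sketch makes it explicit. The sizes are the ones on the slide; treating 1 TB as 1000 GB is an assumption made here so that the stages can be compared in one unit.

```python
# Data volume remaining after each stage, in GB (assuming 1 TB = 1000 GB).
stages_gb = [
    ("User query", 100_000),
    ("Partition pruning", 10_000),
    ("Storage indexing", 1_000),
    ("Predicate pushdown", 100),
]


def reduction_factors(stages):
    """Return (stage name, factor) for each stage relative to the previous one."""
    return [
        (name, prev_size / size)
        for (_, prev_size), (name, size) in zip(stages, stages[1:])
    ]


factors = reduction_factors(stages_gb)
total = stages_gb[0][1] / stages_gb[-1][1]

for name, factor in factors:
    print(f"{name}: {factor:.0f}x less IO")
print(f"Combined: {total:.0f}x less IO")
```

Each feature alone gives a tenfold reduction in this example; because each one works on what the previous one left over, the factors multiply rather than add.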

How does Hive read data FASTER?
Example: SELECT name FROM my_cust WHERE id = 1
- Use partition pruning to reduce IO
- Define a more fine-grained schema on files: partitioning
Flow (from Hive Metastore): Create Splits -> Create Records -> Create Attributes -> Select Data

Partition Pruning and a Big Data SQL Query
- Partition pruning reduces the number of partitions that are going to be scanned
- Because only a subset of partitions is scanned, the block list shrinks accordingly
[Flow diagram, IO reduction highlighted: Query, Get Partitions, Get Block List, Distribute Processing, Project Columns; phases Describe and Fetch]
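A minimal sketch of the idea, using the `month=...` partition layout from the earlier Hive slide. The function names and the way the predicate is passed are assumptions for illustration; they are not Big Data SQL or Hive APIs.

```python
# Hive-style partition layout from the earlier slides: partition value -> files.
PARTITIONS = {
    "2016-01": ["clicks1.json", "clicks2.json"],
    "2016-02": ["clicks3.json", "clicks4.json"],
}


def prune_partitions(partitions, predicate):
    """Keep only partitions whose key satisfies the predicate on the partition column."""
    return {key: files for key, files in partitions.items() if predicate(key)}


def block_list(partitions):
    """Flatten the surviving partitions into the list of files to scan."""
    return [
        f"month={key}/{name}"
        for key, files in sorted(partitions.items())
        for name in files
    ]


# Equivalent of: ... WHERE month = '2016-02'
survivors = prune_partitions(PARTITIONS, lambda month: month == "2016-02")
to_scan = block_list(survivors)
print(to_scan)
```

The pruning decision uses only metadata (the partition key), so the files under `month=2016-01` are never opened.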

Recommended Approach: Use Hive Partitions when Possible
- As with Oracle Database, using partitions can dramatically drive down response times for Big Data SQL queries (as it does for Hive)
- Schema design and understanding analytics patterns are still beneficial
- But don't go overboard

Big Data SQL Storage Index
Example: Find revenue for movies in category 9 (Comedy)
- Big Data SQL Storage Index works on HDFS chunks and creates an SI for each chunk
- Min/max value is recorded for columns included in a storage index (max # of columns = 32)
- Storage index provides partition-pruning-like performance for unmodeled data sets
[Diagram: Movies.json in HDFS, with chunks labeled Min 1 / Max 3, Min 12 / Max 35, Min 7 / Max 10]
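The category 9 example can be traced with a small sketch. The min/max pairs are the ones drawn on the slide; the chunk ids, the class name, and the function name are assumptions for illustration and do not reflect how Big Data SQL stores its indexes internally.

```python
from dataclasses import dataclass


@dataclass
class ChunkIndex:
    """Min/max summary of one column for one HDFS chunk."""
    chunk_id: int
    min_value: int
    max_value: int

    def may_contain(self, value):
        # A value outside [min, max] cannot be in the chunk;
        # a value inside the range may or may not be present.
        return self.min_value <= value <= self.max_value


# Min/max genre ids per chunk as shown on the slide; chunk ids are made up.
STORAGE_INDEX = [
    ChunkIndex(0, 1, 3),
    ChunkIndex(1, 12, 35),
    ChunkIndex(2, 7, 10),
]


def chunks_to_scan(index, value):
    """Return the ids of chunks that cannot be ruled out for an equality predicate."""
    return [chunk.chunk_id for chunk in index if chunk.may_contain(value)]


scan = chunks_to_scan(STORAGE_INDEX, 9)
print(scan)
```

For category 9 only the chunk with range 7 to 10 has to be read. The other two are skipped without touching their data, which is why the slide compares the effect to partition pruning.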

Storage Indexing and a Big Data SQL Query
- Storage indexes, applied before scanning data, further trim down the block list, reducing IO beyond what partition pruning already achieved
[Flow diagram, IO reduction highlighted: Query, Get Partitions, Get Block List, Distribute Processing, Project Columns, Apply SIs; phases Describe and Fetch]

Storage Indexes: Things to Think About
Performance Cost vs. Benefits
- An SI is created when a block has not returned any data from predicates
- This is a second read of the data in the query, causing initial query performance to see an impact
Statistical Relevance
- HDFS-based data is not parsed, and it is collected in large blocks (BDA = 256 MB)
- An SI is maintained on that block, making the statistical chance of filter column values not being in the block lower
Recommendation
- Cluster data if possible, or ascertain the presence of natural transaction ordering

How does Parquet Work?
Create and Query Parquet Files: Schema on Write
- Parquet implements a database storage structure with metadata and parsed data elements
- Columns: column projection
- Rows: predicate-based row elimination
- Metadata for blocks: metadata drives database-like scanning behavior
Example: SELECT name FROM my_cust WHERE id = 1
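The two mechanisms named here, column projection and predicate-based row elimination, can be imitated with plain Python data structures. This is a toy model rather than the Parquet file format or any Parquet library: the row-group layout, the function names, and the `my_cust` sample rows are all assumptions made for the example.

```python
def make_row_group(rows):
    """Schema on write: pivot parsed rows into columns and record min/max per column."""
    columns = {key: [row[key] for row in rows] for key in rows[0]}
    stats = {key: (min(values), max(values)) for key, values in columns.items()}
    return {"columns": columns, "stats": stats}


def scan(row_groups, project, filter_col, filter_val):
    """Evaluate SELECT <project> ... WHERE <filter_col> = <filter_val>.

    Returns the projected values and the number of row groups actually read.
    """
    out = []
    groups_read = 0
    for group in row_groups:
        low, high = group["stats"][filter_col]
        if not (low <= filter_val <= high):
            continue  # row elimination: metadata proves no row can match
        groups_read += 1
        keys = group["columns"][filter_col]
        values = group["columns"][project]  # projection: other columns are never touched
        out.extend(v for k, v in zip(keys, values) if k == filter_val)
    return out, groups_read


# Hypothetical contents of my_cust, written as two row groups of two rows each.
rows = [
    {"id": 1, "name": "Ann", "city": "Oslo"},
    {"id": 2, "name": "Bob", "city": "Rome"},
    {"id": 3, "name": "Cy", "city": "Lima"},
    {"id": 4, "name": "Dee", "city": "Kiev"},
]
row_groups = [make_row_group(rows[0:2]), make_row_group(rows[2:4])]

result, groups_read = scan(row_groups, project="name", filter_col="id", filter_val=1)
print(result, groups_read)
```

The cost of this layout is paid in `make_row_group`: the data must be parsed and reorganized once at write time so that every later read can skip work.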

Parquet
Benefits
- Much faster reads because the parsing penalty is no longer incurred
- Schema and data metadata enable parsing, IO skipping, etc., providing optimizations like those done in database files
- Columnar format enables IO optimizations for analytics
Drawbacks
- Schema on write reduces the flexibility of schema on read and makes this just like a database
- Increases the size of data stored, as data is now replicated into optimized formats
- Columnar formats penalize IO that is row focused

A Big Data SQL Query
- Query predicates are handed to the Parquet reader classes, which use columnar formats and metadata to eliminate IO, reading only the subset of data requested
[Flow diagram, IO reduction highlighted: Smart? (yes: Push Predicates; no: Fetch Data); phases Describe and Fetch]

Parquet (or ORC): Things to Think About
Performance Cost vs. Benefits
- In order to get the performance benefits, data has to be parsed (= inserted)
- This is a separate step, costing time and increasing complexity
When to use Parquet (or ORC)
- Do: Data-mart-like constructs that are long lived
- Do: Static data, without too many changes in your schemas
- Don't: ETL-like jobs that transform data in transient pipelines
- Don't: Very frequent changes to structures
- All the rules around schema evolution in databases apply

Smart Scan
- Smart Scan reduces IO that is sent to the database engine
- It works on any data format
- CSV and other dumb formats are most affected
- But if ORC or Parquet deliver extra data, Smart Scan will cut it down to precise size
[Flow diagram: Smart? (no: Fetch Data), Convert Data Types, Filter (Smart Scan)]

A Big Data SQL Query
[Flow diagram, phases Describe and Fetch. Steps: Query, Get Partitions, Get Block List, Distribute Processing, Project Columns, Apply SIs, Prq/ORC? (yes: Push Predicates; no: Fetch Data), Convert Data Types, Filter (Smart Scan), No Data (Create SIs), Return Results]

Big Data SQL: Join Optimizations

Join Optimization with Bloom Filters
A Bloom filter is a low-memory data structure that tests membership in a set
- Correctly indicates when an element is *not* in a set; there could be false positives
- Is used to filter data in BDS cells, especially when joining large fact tables to small dimension tables
- Optimizer will automatically utilize Bloom filters to improve performance
- Can be influenced by PX_JOIN_FILTER/NO_PX_JOIN_FILTER hints
- Bloom filters are created in the database, so data does flow from HDFS to the database, but the impact tends to be very small

Join Optimization with Bloom Filters
Oracle Database (Store): state = CA, floorspace > 75000 -> Create Join Filter (s.store = ss.store)
BDS Cell (Store Sales): list_price > 100 -> Use Join Filter -> Apply Other Filters
- Query joins store and store sales table
- Bloom filter (bit vector) is created based on the join column
- Bloom filter and other filters are applied on the BDS cell: is this store in the bit vector?
- Potential massive reduction in data returned to database for join
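The flow on this slide can be sketched end to end in Python. This is an illustrative toy, not Oracle's implementation: the `BloomFilter` class, its size and hash scheme, and the sample rows are assumptions. Only the predicates (state = CA, floorspace > 75000, list_price > 100) come from the slide.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, false positives are possible."""

    def __init__(self, num_bits=64, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit vector kept in a Python int

    def _positions(self, key):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


# Hypothetical sample data for the two tables.
stores = [
    {"store": 1, "state": "CA", "floorspace": 80000},
    {"store": 2, "state": "CA", "floorspace": 50000},
    {"store": 3, "state": "NY", "floorspace": 90000},
]
store_sales = [
    {"store": 1, "list_price": 150},
    {"store": 1, "list_price": 50},
    {"store": 2, "list_price": 200},
    {"store": 3, "list_price": 300},
]

# Database side: filter the small table and build the join filter.
qualifying = {s["store"] for s in stores
              if s["state"] == "CA" and s["floorspace"] > 75000}
bloom = BloomFilter()
for store_id in qualifying:
    bloom.add(store_id)

# Cell side: apply the Bloom filter and the other filters before sending rows back.
sent = [r for r in store_sales
        if r["list_price"] > 100 and bloom.might_contain(r["store"])]

# Database side: the exact join removes any false positives that slipped through.
joined = [r for r in sent if r["store"] in qualifying]
print(joined)
```

The list `sent` may contain a few rows for stores that do not qualify, since a Bloom filter can report false positives, but it can never drop a row that does qualify. That is why the database still performs the real join on whatever the cells return.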

Join Optimization Example:
  SELECT st.s_manager, st.s_hours, s.ss_list_price, s.ss_sales_price, s.ss_ext_discount_amt
  FROM store_sales s, store_orcl st
  WHERE s.ss_store_sk = st.s_store_sk
    AND st.s_state = 'ca'
    AND st.s_floor_space > 8000000
    AND s.ss_list_price > 100;

Big Data SQL on HBase (and ORC for fun)
Business question: I need a report of all sales by time for each job role (position)
  SQL> SELECT e.position, d.d_year, SUM(s.ss_ext_wholesale_cost)
       FROM store_sales_orc s, emp_hbase e, date_dim d
       WHERE e.rowkey = s.ss_customer_sk
         AND s.ss_sold_date_sk = d.d_date_sk
         AND e.rowkey > 0
       GROUP BY e.position, d.d_year

Big Data SQL on HBase (and ORC for fun)

Big Data SQL: Certification Update

Today's On-Premises Deployment Models

Announcing More Deployment Options for Big Data SQL
