Oracle Big Data SQL High Performance Data Virtualization Explained


Jean-Pierre Dijcks
Oracle
Redwood City, CA, USA

Keywords: Big Data SQL, SQL, Big Data, Hadoop, NoSQL Databases, Relational Databases, Security, Performance

Introduction

This technical session focuses on Oracle Big Data SQL.

Big Data SQL

The key goals of Big Data SQL are to expose data in its original format, stored in Hadoop and NoSQL databases, through high-performance Oracle SQL that is offloaded to storage-resident cells or agents. The architecture of Big Data SQL closely follows that of the Oracle Exadata Storage Server Software and is built on the same proven technology.

Because data in HDFS is stored in an undetermined format (schema on read), SQL queries require constructs that parse and interpret the data so it can be processed as rows and columns. For this, Big Data SQL leverages the standard Hadoop constructs, notably the InputFormat and SerDe Java classes, optionally through Hive metadata definitions. Big Data SQL then layers the Oracle Big Data SQL Agent on top of this generic Hadoop infrastructure, as can be seen in Figure 1.

Figure 1. Architecture leveraging core Hadoop and Oracle together

Because Big Data SQL is based on Exadata Storage Server Software, a number of benefits are instantly available. Big Data SQL can not only retrieve data but can also score Data Mining models at the individual agent, mapping model scoring to an individual HDFS node. Likewise, querying JSON documents stored in HDFS can be done directly with SQL and is executed on the agent itself.

How does Hadoop on the Bottom work?

The general Hadoop constructs (like InputFormat) mentioned above work in concert to impose schema on read on the data stored in HDFS. The Hive Catalog captures the schema definitions and enables Hive to act as a SQL engine. This metadata is what is instrumental in leveraging these Hadoop constructs for Big Data SQL.

Figure 2. How Hive Metadata Enables SQL Access

As can be seen in Figure 2, the InputFormat (a Java class) is used to parse the file into chunks of data, on which the RecordReader then imposes a record format (start of record to end of record). Once that is done, a SerDe imposes attribute (or column) definitions on these records. Note that all of these steps are part of query execution and happen on the fly; the metadata definition (e.g. a table) is captured in the Hive Metastore.

How does Oracle on Top work?

With the picture of Figure 2 in mind, it is easy to see how an Oracle SQL query can leverage these constructs. To enable data-local processing, however, an additional component is required. First, let's draw the picture of what Big Data SQL does when it reads data with the constructs Hive uses; Figure 3 shows this.
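The on-the-fly parsing described above can be sketched in a few lines. This is a toy Python analogue of the InputFormat/RecordReader/SerDe pipeline, not the real Java APIs; the file contents, split size, and column names are illustrative only.

```python
# Toy analogue of Hadoop's schema-on-read pipeline: an "InputFormat" splits
# raw bytes into chunks, a "RecordReader" imposes record boundaries, and a
# "SerDe" imposes column definitions -- all at query time, never at load time.

RAW_HDFS_FILE = b"1001,alice,42\n1002,bob,17\n1003,carol,99\n"

def input_format_split(data: bytes, split_size: int = 16):
    """Cut the file into byte-range splits, like an InputFormat."""
    return [data[i:i + split_size] for i in range(0, len(data), split_size)]

def record_reader(chunks):
    """Rejoin splits and impose record boundaries (newline-delimited here)."""
    return b"".join(chunks).decode().rstrip("\n").split("\n")

def serde(record: str):
    """Impose attribute (column) definitions on one record."""
    user_id, name, score = record.split(",")
    return {"user_id": int(user_id), "name": name, "score": int(score)}

# The "table" only exists while the query runs; no schema is stored with the file.
rows = [serde(r) for r in record_reader(input_format_split(RAW_HDFS_FILE))]
print(rows[0])  # {'user_id': 1001, 'name': 'alice', 'score': 42}
```

The point of the sketch is that the schema lives in metadata (the Hive Metastore in the real system), while the file itself stays in its original format.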

Figure 3. Big Data SQL and Hive Metadata

Partition Pruning and Predicate Pushdown

One of the most effective performance features ever invented is the partitioning of data. Partitioning has been implemented in Hive and can now be used by Big Data SQL to speed up IO by eliminating partitions from a query. Big Data SQL applies the predicates in a query to eliminate partitions from the data selected, very much as the Oracle Database does. Partition pruning is the first step taken, and it affects not just IO but also the number of splits eventually created.

Using the same predicates, much more IO can be avoided when the underlying files are pre-parsed. For example, Parquet files, which contain parsed data (like a database file), can be interrogated intelligently by combining the file metadata with the predicates. This way, database-like selectivity is achieved, driving up query performance. Big Data SQL pushes predicates down to Parquet, ORC, or HBase, for example, and combines this with partitioning to ensure excellent performance.

Storage Indexes

Storage Indexes (SI) provide the same IO-elimination benefits to Big Data SQL as they provide to SQL on Exadata. The big difference is that in Big Data SQL the SI works on an HDFS block (256MB of data on the BDA) and spans 32 columns instead of the usual 8. The SI is fully transparent to both the Oracle Database and the underlying HDFS environment. As with Exadata, the SI is a memory construct managed by the Big Data SQL software and invalidated automatically when the underlying files change.
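The IO elimination a storage index performs can be sketched as a min/max check per block. This is a toy Python model; the block contents, column names, and block count are illustrative, not the real 256MB block size or 32-column limit.

```python
# Toy sketch of a storage index: per-block min/max for a column, kept in
# memory, used to skip whole blocks whose value range cannot match a predicate.

blocks = [  # each "HDFS block" holds rows of (order_id, amount)
    [(1, 10), (2, 40), (3, 25)],
    [(4, 55), (5, 90), (6, 60)],
    [(7, 5), (8, 15), (9, 30)],
]

# Build the in-memory index: (min, max) of `amount` per block.
storage_index = [(min(a for _, a in b), max(a for _, a in b)) for b in blocks]

def scan(predicate_lo):
    """Return matching rows, reading only blocks the index cannot rule out."""
    blocks_read, hits = 0, []
    for b, (lo, hi) in zip(blocks, storage_index):
        if hi < predicate_lo:   # whole block outside WHERE amount >= :lo
            continue            # this block's IO is skipped entirely
        blocks_read += 1
        hits += [row for row in b if row[1] >= predicate_lo]
    return blocks_read, hits

print(scan(50))  # reads 1 of 3 blocks: (1, [(4, 55), (5, 90), (6, 60)])
```

As in the real feature, the index is built from the data itself, lives in memory, and is simply discarded and rebuilt when the underlying blocks change.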

Figure 4. Storage Indexes work on HDFS blocks and speed up IO by skipping blocks

The SI works on data exposed via Oracle external tables of both the ORACLE_HIVE and ORACLE_HDFS types. Fields are mapped to these external tables and the SI is attached to the Oracle columns, so that when a query references the column(s), the SI kicks in when appropriate. In the current version, the SI does not support tables defined with storage handlers (e.g. HBase or Oracle NoSQL Database).

Bloom Filters

To ensure that joins can be pushed down to the Hadoop nodes instead of requiring all data to be database resident, bloom filters are now also supported in Big Data SQL.

Figure 5. An example of Bloom Filtering
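The mechanism behind bloom filtering can be sketched in a few lines of Python. This is a minimal toy filter (filter size, hash count, and all table data are illustrative), showing only the core idea: hash the join keys of the small side into a bitmap, ship the bitmap to storage, and scan the large side keeping only rows that might match.

```python
# Toy bloom filter used to turn a join into a filtered scan.
import hashlib

M = 64  # bits in the filter (illustratively small)

def positions(key, k=3):
    """Derive k bit positions for a key from a hash of its text form."""
    h = hashlib.sha256(str(key).encode()).digest()
    return [h[i] % M for i in range(k)]

def build_filter(keys):
    """Set the bits for every join key on the small (dimension) side."""
    bits = 0
    for key in keys:
        for p in positions(key):
            bits |= 1 << p
    return bits

def maybe_member(bits, key):
    """False positives are possible; false negatives are not."""
    return all(bits >> p & 1 for p in positions(key))

dim_keys = [101, 205, 333]                        # small side of the join
bloom = build_filter(dim_keys)                    # "pushed to the cell"
fact_rows = [(101, "a"), (777, "b"), (333, "c")]  # scanned with local data

# Only rows whose key might be in the dimension table survive the scan;
# the database then performs the exact join on this much smaller set.
survivors = [r for r in fact_rows if maybe_member(bloom, r[0])]
```

Because the filter can report false positives but never false negatives, the exact join still runs afterwards, on a fraction of the rows.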

Bloom filters are used in Exadata, for example, to convert joins into scans, enabling the joining of data to take place in the storage tier. The same is implemented in Big Data SQL, where the bloom filters are pushed to the cell and execute a join on the Hadoop node with local data.

Smart Scan

Within the Big Data SQL Agent, functionality exists similar to that available in the Exadata Storage Server Software. Smart Scans apply the row filters and column projections from a given SQL query to the data streaming from the HDFS DataNodes, reducing the data that flows to the database to fulfill that query. The benefits of Smart Scan for Hadoop data are even more pronounced than for Oracle Database, as the tables are often very wide and very large. Because data is eliminated at the individual HDFS node, queries across large tables now complete within reasonable time limits, enabling data-warehouse-style queries to be spread across data stored in both HDFS and the Oracle Database.

Compound Benefits

The Smart Scan, Predicate Pushdown, Partitioning, and Storage Index features deliver compound benefits. Where Predicate Pushdown, Partitioning, and Storage Indexes reduce the IO done, Smart Scan adds row filtering and column projection when the IO reduction is not granular enough. This latter step remains important because it reduces the data transferred between the systems and acts as a safeguard that keeps the database from being flooded with data.

Update on Availability and Support for Big Data SQL

As Big Data SQL matures as a technology, its applicable platforms expand. The upcoming (at the time of writing) version of Big Data SQL adds a large number of supported configurations, as shown in Figure 6.

Figure 6. Expanded Deployment Options

Figure 6 shows the rounding out of deployment options, with support ranging from the SPARC SuperCluster to the Big Data Appliance, plus various hybrid options, including Oracle Exadata combined with a generic Hadoop cluster running either the Hortonworks or Cloudera distribution.
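Returning to the performance features, their compound effect can be sketched as a toy pipeline: partition pruning first eliminates whole directories, then Smart Scan filters rows and projects columns at the agent so only the requested values travel to the database. All table names, partition keys, and data below are illustrative.

```python
# Toy pipeline for: SELECT name FROM sales
#                   WHERE day = '2017-01-02' AND amount > 50

partitions = {  # each key is a Hive-style partition directory
    "2017-01-01": [{"name": "a", "amount": 10}, {"name": "b", "amount": 80}],
    "2017-01-02": [{"name": "c", "amount": 60}, {"name": "d", "amount": 20},
                   {"name": "e", "amount": 95}],
    "2017-01-03": [{"name": "f", "amount": 70}],
}

# 1. Partition pruning: the partition-key predicate eliminates whole
#    directories before any file IO happens.
pruned = partitions["2017-01-02"]

# 2. Smart Scan at the agent: filter rows and project columns before
#    anything travels to the database tier.
filtered = [row for row in pruned if row["amount"] > 50]
projected = [row["name"] for row in filtered]

rows_stored = sum(len(p) for p in partitions.values())
print(rows_stored, len(pruned), len(filtered), projected)
# 6 rows stored, 3 read after pruning, 2 after filtering; only `name` shipped.
```

Each stage shrinks the data the next stage must touch, which is why the features compound rather than merely add up.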

Virtual Machine to try out Big Data SQL and all other components

Of course, having all these new features in the platform is a lot of fun, but how do you get to try all of this? It's simple: Oracle packages all of these features into a virtual machine called the Oracle Big Data Lite VM. It is updated for each new version of the BDA software and picks up the components in the stack. The VM not only includes all of Oracle Data Integration, including the Big Data add-on, but now also includes Oracle Big Data Discovery. It is the perfect client for a BDA, and the perfect test bed for any of your investigations. You can find the VM here: http://www.oracle.com/technetwork/database/bigdata-appliance/oraclebigdatalite-2104726.html

Contact address:

Jean-Pierre Dijcks
Oracle
500 Oracle Parkway
MS 4op7
Redwood City, CA 94065

Phone: +1 650 607 5394
Email: Jean-Pierre.Dijcks@oracle.com
Internet: www.oracle.com/bigdata