Big Data SQL Deep Dive
2 Big Data SQL Deep Dive Jean-Pierre Dijcks Big Data Product Management DOAG 2016 Copyright 2016, Oracle and/or its affiliates. All rights reserved. 2
3 Safe Harbor Statement The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle's products remains at the sole discretion of Oracle. 3
4 The Best of Both Worlds SQL Big Data SQL 4
5 Big Data SQL's Unique Architecture 1 Consolidate Metadata across Silos 2 Enables seamless Blending of any Type of Data 3 Unique Architecture to deliver Performance: Optimize end-to-end execution of analytical queries Pushing down Smart processing into diverse data tiers 5
6 Session Content Using Metadata to Read ANY Data Query Flow IO Optimizations Join Optimizations Certification Update Q&A 6
7 Big Data SQL Using Metadata to Read Data 7
8 How Data is Stored in HDFS data hr salaries.csv website clicks1.json clicks2.json clicks3.json clicks4.json Hanks,Spielberg, Spielberg,Cameron, Cameron,Oprah, Oprah,Boss, {"custid": ,"movieid":1948,"genreid":9,"time":" :00:00:22 } {"custid": ,"movieid":null,"genreid":null,"time":" :00:00:26 } {"custid": ,"movieid":11547,"genreid":44,"time":" :00:00:32 } {"custid": ,"movieid":11547,"genreid":44,"time":" :00:00:42 } 8
9 Organize and Describe Data with Hive hive data hr warehouse salaries.csv Database data.db website Table hr salaries.csv clicks1.json Table website clicks2.json month= Partition clicks1.json clicks3.json clicks2.json clicks4.json month= Partition clicks3.json clicks4.json Information is captured in Hive Metastore HDFS Folders become: Databases Tables Partitions Table includes metadata for parsing files using Java classes InputFormat defines chunks called splits based on file type RecordReader creates rows out of splits SerDe creates columns 9
10 How does Hive read ANY data? Hive Metastore Defines: SELECT name FROM my_cust WHERE id = 1 InputFormat RecordReader SerDe SQL Execution \n \n \n \n Any File Type Create Splits Create Records Create Attributes Select Data 10
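The three metadata-driven stages above (splits, then records, then columns) can be sketched in plain Python. This is an illustrative toy, not Hive's actual Java classes; the sample bytes and the `my_cust` column layout are hypothetical.

```python
# Illustrative sketch of the Hive read pipeline:
# InputFormat -> RecordReader -> SerDe -> select data.

def input_format(data: bytes, split_size: int):
    """InputFormat: carve the raw file into byte-range chunks ("splits")."""
    return [data[i:i + split_size] for i in range(0, len(data), split_size)]

def record_reader(split: bytes):
    """RecordReader: turn a split into records, here delimited by '\\n'."""
    return [line for line in split.decode().split("\n") if line]

def serde(record: str, columns):
    """SerDe: turn one record into named columns, here CSV-style."""
    return dict(zip(columns, record.split(",")))

raw = b"1,Hanks\n2,Spielberg\n"          # pretend this is a file in HDFS
rows = [serde(r, ["id", "name"])
        for s in input_format(raw, split_size=len(raw))
        for r in record_reader(s)]

# SELECT name FROM my_cust WHERE id = 1
result = [row["name"] for row in rows if row["id"] == "1"]
print(result)  # ['Hanks']
```

Any file type plugs into the same chain: only the three class implementations change, the query layer on top stays the same.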
11 Metadata: Extend Oracle External Tables CREATE TABLE movielog ( click VARCHAR2(4000)) ORGANIZATION EXTERNAL ( TYPE ORACLE_HIVE DEFAULT DIRECTORY DEFAULT_DIR ACCESS PARAMETERS ( com.oracle.bigdata.tablename=logs com.oracle.bigdata.cluster=mycluster )) REJECT LIMIT UNLIMITED; New types of external tables ORACLE_HIVE (leverage hive metadata) ORACLE_HDFS (specify metadata) Access parameters used to describe how to identify sources and process data on the Hadoop cluster 11
12 Access Parameters: HDFS Example CREATE TABLE WEB_SALES_CSV ( WS_SOLD_DATE_SK NUMBER, WS_SOLD_TIME_SK NUMBER, WS_ITEM_SK NUMBER ) ORGANIZATION EXTERNAL ( TYPE ORACLE_HDFS DEFAULT DIRECTORY DEFAULT_DIR ACCESS PARAMETERS ( com.oracle.bigdata.cluster=orabig com.oracle.bigdata.fileformat=textfile com.oracle.bigdata.rowformat: DELIMITED FIELDS TERMINATED BY ' com.oracle.bigdata.erroropt: {"action": "replace", "value": "-1"} ) LOCATION ('/data/tpcds/benchmarks/bigbench/data/web_sales') ) REJECT LIMIT UNLIMITED; Access Parameters describe source data and processing rules Schema-on-Read 12
13 Access Parameters: ORACLE_HIVE CREATE TABLE WEB_SALES_CSV ( WS_SOLD_DATE_SK NUMBER, WS_SOLD_TIME_SK NUMBER, WS_ITEM_SK NUMBER ) ORGANIZATION EXTERNAL ( TYPE ORACLE_HIVE DEFAULT DIRECTORY DEFAULT_DIR ACCESS PARAMETERS ( com.oracle.bigdata.cluster=orabig com.oracle.bigdata.tablename: csv.web_sales com.oracle.bigdata.erroropt: {"action": "replace", "value": "-1"} com.oracle.bigdata.datamode=automatic )) REJECT LIMIT UNLIMITED; Access Parameters refer to metadata in Hive Add processing rules 13
14 Viewing Hive Metadata from Oracle Database ALL_HIVE_DATABASES, ALL_HIVE_TABLES, ALL_HIVE_COLUMNS 14
15 Creating Tables SQL Developer with Hive JDBC Right-click on Hive Table. Use in Oracle Big Data SQL Review generated columns. Update as needed - focusing on data types and precision Add optional access parameters. Automatically generate table or save DDL. See: https://blogs.oracle.com/datawarehousing/entry/oracle_sql_developer_data_modeler 15
16 Big Data SQL on top of Hive Data Hive Metastore InputFormat RecordReader SerDe Big Data SQL \n \n \n \n Any File Type Create Splits Create Records Create Attributes Convert Data & Smart Scan 16
17 Recommended Approach Use ORACLE_HIVE When Possible Oracle Database query execution accesses Hive metadata at describe time Changes to underlying Hive access parameters will not impact the Oracle table one exception: column lists and partition lists in 12.2 Metadata is an enabler for performance optimizations Partition pruning and predicate pushdown into intelligent sources Utilize tooling for simplified table definitions SQL Developer and the DBMS_HADOOP package 17
18 Big Data SQL Query Flow 18
19 SQL-on-Hadoop Engines Share Metadata, not MapReduce Hive Metastore Oracle Big Data SQL SparkSQL Hive Impala Hive Metastore Table Definitions: movieapp_log_json Tweets avro_log Metastore maps DDL to Java access classes 19
20 A Big Data SQL Query Apply SIs Query Get Partitions Get Block List Smart no Fetch Data yes Push Predicates Distribute Processing Convert Data Types Project Columns Rows no Create SIs Return Results Filter (Smart Scan) yes Describe Fetch 20
21 Big Data SQL IO Optimizations 21
22 Big Data SQL Performance Features IO Reduction Features Deliver Compound Results User Query 100 TB Partition Pruning 10 TB Storage Indexing 1 TB Predicate Pushdown 100 GB 22
23 How does Hive read data FASTER? Hive Metastore SELECT name FROM my_cust WHERE id = 1 Use partition pruning to reduce IO \n \n \n \n Define more fine-grained schema on files: Partitioning Create Splits Create Records Create Attributes Select Data 23
24 Partition Pruning and a Big Data SQL Query Query Get Partitions Get Block List Partition pruning reduces the number of partitions that are going to be scanned Because only a subset of partitions is scanned, the block list shrinks equally Distribute Processing Project Columns IO Reduction Describe Fetch 24
25 Recommended Approach Use Hive Partitions when Possible As with Oracle Database, using partitions can dramatically drive down response times for Big Data SQL queries (as it does for Hive) Schema design and understanding analytics patterns are still beneficial But don't go overboard 25
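Partition pruning as described above amounts to eliminating whole folders by name before any file is opened. A minimal sketch, assuming the hypothetical `month=...` folder layout from the earlier Hive slide:

```python
# Conceptual partition pruning: each partition is a folder named after its
# partition-column value, so a predicate on that column can drop whole
# folders (and all their blocks) without reading any data.

partitions = {
    "month=2016-01": ["clicks1.json", "clicks2.json"],
    "month=2016-02": ["clicks3.json", "clicks4.json"],
}

def prune(partitions, column, value):
    """Keep only partitions whose folder name matches the predicate."""
    return {name: files for name, files in partitions.items()
            if name == f"{column}={value}"}

# WHERE month = '2016-02' -> only one folder survives
survivors = prune(partitions, "month", "2016-02")
blocks_to_scan = [f for files in survivors.values() for f in files]
print(blocks_to_scan)  # ['clicks3.json', 'clicks4.json']
```

With half the partitions pruned, the block list handed to the cells shrinks by the same fraction, which is exactly the "shrinks equally" effect on the query-flow slide.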
26 Big Data SQL Storage Index Example: Find revenue for movies in category 9 (Comedy) HDFS Min 1 Max 3 Min 12 Max 35 Big Data SQL Storage Index works on HDFS Chunks and creates an SI for each chunk Min / max value is recorded for columns included in a storage index (max # of columns = 32) Movies.json Min 7 Max 10 Storage index provides partition pruning-like performance for unmodeled data sets 26
27 Storage Indexing and a Big Data SQL Query Query Get Partitions Get Block List Distribute Processing Project Columns Apply SIs Storage Indexes, applied before scanning data, further trim down the block list left after partition pruning, reducing IO even more IO Reduction Describe Fetch 27
28 Storage Indexes Things to Think About Performance Cost vs. Benefits An SI is created when a block has not returned any data for the query's predicates This requires a second read of the data, so initial queries may see a performance impact Statistical Relevance: HDFS-based data is not parsed and is collected in large blocks (BDA = 256MB) An SI is maintained per block, so the statistical chance that a filter column value is absent from a given block is lower Recommendation: Cluster data if possible, or ascertain the presence of a natural transaction ordering 28
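The min/max mechanics on the example slide can be sketched directly. The chunk contents below are made-up numbers standing in for the category-9 lookup; the point is only that a chunk whose [min, max] range cannot contain the predicate value is never scanned:

```python
# Storage-index sketch: one min/max pair per HDFS chunk for an indexed
# column; skip any chunk whose range cannot contain the predicate value.

chunks = [
    {"min": 1,  "max": 3,  "rows": [1, 2, 3]},
    {"min": 7,  "max": 10, "rows": [7, 9, 10]},
    {"min": 12, "max": 35, "rows": [12, 20, 35]},
]

def scan_equals(chunks, value):
    """Scan only chunks whose min/max range might contain `value`."""
    scanned, hits = 0, []
    for c in chunks:
        if c["min"] <= value <= c["max"]:   # SI check before any IO
            scanned += 1
            hits += [r for r in c["rows"] if r == value]
    return hits, scanned

hits, scanned = scan_equals(chunks, 9)
print(hits, scanned)  # [9] found while scanning only 1 of 3 chunks
```

This is why clustered or naturally ordered data matters: random value distributions make every chunk's range wide, so few chunks can ever be skipped.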
29 How does Parquet Work? Create and Query Parquet Files Schema on Write Parquet implements a database storage structure with metadata and parsed data elements Columns Column Projection Select name from my_cust where id = 1 Rows Predicate-based Row Elimination Metadata for blocks Metadata drives database-like scanning behavior 29
30 Parquet Benefits Much faster reads because the parsing penalty is no longer incurred Schema and data metadata enable parsing, IO skipping etc., providing optimizations like those done in Database files Columnar format enables IO optimizations for Analytics Draw-backs Schema on write reduces the flexibility of schema on read and makes this just like a database Increases the size of data stored, as data is now replicated into optimized formats Columnar formats do penalize IO that is row focused 30
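The column-projection benefit can be shown with a toy columnar layout in Python (this is not Parquet's actual encoding, just the storage idea; the table and values are hypothetical):

```python
# Row layout vs. column layout: projecting one column from columnar
# storage touches only that column's values, never the rest of the row.

rows = [(1, "Hanks", 9.99), (2, "Spielberg", 4.99), (3, "Cameron", 7.99)]

# Column-oriented layout: one contiguous list per column.
columns = {
    "id":    [r[0] for r in rows],
    "name":  [r[1] for r in rows],
    "price": [r[2] for r in rows],
}

# SELECT name FROM t WHERE id = 1 against the columnar layout:
# read the "id" column to filter, then only the matching "name" values.
matches = [i for i, v in enumerate(columns["id"]) if v == 1]
names = [columns["name"][i] for i in matches]
print(names)  # ['Hanks'] -- the "price" column is never touched
```

The same layout also shows the row-focused drawback: reassembling one full row now requires a lookup into every column list instead of one contiguous read.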
31 A Big Data SQL Query Smart no yes Push Predicates Query predicates are handed to the Parquet reader classes, which use columnar formats and metadata to eliminate IO only reading the subset of data requested Fetch Data IO Reduction Describe Fetch 31
32 Parquet (or ORC) Things to Think About Performance Cost vs. Benefits In order to get the performance benefits, data has to be parsed (= inserted) This is a separate step, costing time and increasing complexity When to use Parquet (or ORC) Do: Data Mart-like constructs that are long lived Do: Static data, not too many changes to your schemas Don't: ETL-like jobs that transform data in transient pipelines Don't: Very frequent changes to structures All the rules around schema evolution in Databases apply 32
33 Smart Scan Smart no Fetch Data Convert Data Types Smart Scan reduces the IO that is sent to the database engine It works on any data format CSV and other dumb formats are most affected But if ORC or Parquet deliver extra data, Smart Scan will cut it down to the precise size Filter (Smart Scan) 33
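The Smart Scan step above is cell-side filter-and-project: only qualifying rows and requested columns leave the storage tier. A minimal sketch with made-up rows (not Oracle's cell code):

```python
# Cell-side Smart Scan sketch: the storage tier applies the predicate and
# projects the requested columns, so only the qualifying slice of data
# travels to the database engine.

cell_rows = [
    {"store": "CA-01", "list_price": 150, "qty": 2},
    {"store": "NY-07", "list_price": 80,  "qty": 5},
    {"store": "CA-02", "list_price": 120, "qty": 1},
]

def smart_scan(rows, predicate, projection):
    """Filter and project on the cell; return only what the query needs."""
    return [{c: r[c] for c in projection} for r in rows if predicate(r)]

# ... WHERE list_price > 100, projecting (store, list_price)
shipped = smart_scan(cell_rows,
                     predicate=lambda r: r["list_price"] > 100,
                     projection=["store", "list_price"])
print(shipped)  # 2 of 3 rows, and 2 of 3 columns, reach the database
```

This is also why it works on any format: whether the rows came from CSV parsing or from a Parquet reader that already skipped some IO, the cell still trims the result to the precise rows and columns the query asked for.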
34 A Big Data SQL Query Query Get Partitions Apply SIs Get Block List Prq/ORC no yes Push Predicates Return Results Distribute Processing Project Columns Fetch Data Convert Data Types Filter (Smart Scan) No Data Fetch Data Create SIs Describe Fetch 34
35 Big Data SQL Join Optimizations 35
36 Join Optimization with Bloom Filters A Bloom filter is a low-memory data structure that tests membership in a set: it correctly indicates when an element is *not* in a set, but there can be false positives is used to filter data in BDS cells - especially when joining large facts to small dimension tables The Optimizer will automatically utilize Bloom filters to improve performance Can be influenced by the PX_JOIN_FILTER/NO_PX_JOIN_FILTER hints Bloom filters are created in the Database, so data does flow from HDFS to the Database, but the impact tends to be very small 36
37 Join Optimization with Bloom Filters Oracle Database Store state = CA floorspace > Create Join Filter s.store = ss.store BDS Cell Store Sales list_price > 100 Use Join Filter Apply Other Filters Query joins the store and store sales tables Bloom filter (bit vector) is created based on the join column Bloom filter & other filters are applied on the BDS Cell Is this store in the bit vector? Potential massive reduction in data returned to the database for the join 37
38 Join Optimization Example: SELECT st.s_manager, st.s_hours, s.ss_list_price, s.ss_sales_price, s.ss_ext_discount_amt FROM store_sales s, store_orcl st WHERE s.ss_store_sk = st.s_store_sk AND st.s_state='ca' AND st.s_floor_space > AND s.ss_list_price > 100; 38
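The bit-vector mechanics behind this join can be sketched with a minimal Bloom filter (this is the general technique, not Oracle's implementation; sizes, hash count, and the store keys are made up):

```python
# Minimal Bloom-filter sketch for a join filter: the dimension side builds
# a bit vector from its surviving join keys, and fact rows are tested
# against it on the cell before being shipped for the join.
# False positives are possible; false negatives are not.
import hashlib

M = 64  # bits in the filter

def positions(key, k=3):
    """k bit positions for a key, derived from one digest."""
    digest = hashlib.sha256(str(key).encode()).digest()
    return [digest[i] % M for i in range(k)]

def build(keys):
    bits = 0
    for key in keys:
        for p in positions(key):
            bits |= 1 << p
    return bits

def might_contain(bits, key):
    return all(bits >> p & 1 for p in positions(key))

store_keys = [101, 205]                 # stores surviving state/floorspace filters
bloom = build(store_keys)               # built in the database, sent to cells

fact_rows = [(101, 9.99), (333, 4.99), (205, 7.99), (777, 1.99)]
survivors = [r for r in fact_rows if might_contain(bloom, r[0])]
# Every true match survives; most non-matching keys are dropped on the cell.
```

Because false negatives are impossible, the cell can safely discard non-matching fact rows; the occasional false positive simply falls out later in the real hash join inside the database.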
39 Big Data SQL on HBase (and ORC for fun) Business question: I need a report of all sales by time for each job role (position) SQL> SELECT e.position, d.d_year, SUM(s.ss_ext_wholesale_cost) FROM store_sales_orc s, emp_hbase e, date_dim d WHERE e.rowkey = s.ss_customer_sk AND s.ss_sold_date_sk = d.d_date_sk AND e.rowkey > 0 GROUP BY e.position, d.d_year
40 Big Data SQL on HBase (and ORC for fun)
41 Big Data SQL Certification Update 41
42 Today's On-Premises Deployment Models 42
43 Announcing More Deployment Options for Big Data SQL 43
44 44
45
More informationBig Data Hadoop Course Content
Big Data Hadoop Course Content Topics covered in the training Introduction to Linux and Big Data Virtual Machine ( VM) Introduction/ Installation of VirtualBox and the Big Data VM Introduction to Linux
More informationShark: Hive (SQL) on Spark
Shark: Hive (SQL) on Spark Reynold Xin UC Berkeley AMP Camp Aug 21, 2012 UC BERKELEY SELECT page_name, SUM(page_views) views FROM wikistats GROUP BY page_name ORDER BY views DESC LIMIT 10; Stage 0: Map-Shuffle-Reduce
More informationBring Context To Your Machine Data With Hadoop, RDBMS & Splunk
Bring Context To Your Machine Data With Hadoop, RDBMS & Splunk Raanan Dagan and Rohit Pujari September 25, 2017 Washington, DC Forward-Looking Statements During the course of this presentation, we may
More informationBlended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a)
Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a) Cloudera s Developer Training for Apache Spark and Hadoop delivers the key concepts and expertise need to develop high-performance
More informationPagely.com implements log analytics with AWS Glue and Amazon Athena using Beyondsoft s ConvergDB
Pagely.com implements log analytics with AWS Glue and Amazon Athena using Beyondsoft s ConvergDB Pagely is the market leader in managed WordPress hosting, and an AWS Advanced Technology, SaaS, and Public
More informationUsing Map-Reduce to Teach Parallel Programming Concepts
Using Map-Reduce to Teach Parallel Programming Concepts Dick Brown, St. Olaf College Libby Shoop, Macalester College Joel Adams, Calvin College Workshop site CSinParallel.org -> Workshops -> WMR Workshop
More informationI am: Rana Faisal Munir
Self-tuning BI Systems Home University (UPC): Alberto Abelló and Oscar Romero Host University (TUD): Maik Thiele and Wolfgang Lehner I am: Rana Faisal Munir Research Progress Report (RPR) [1 / 44] Introduction
More informationShine a Light on Dark Data with Vertica Flex Tables
White Paper Analytics and Big Data Shine a Light on Dark Data with Vertica Flex Tables Hidden within the dark recesses of your enterprise lurks dark data, information that exists but is forgotten, unused,
More informationIntegration of Apache Hive
Integration of Apache Hive and HBase Enis Soztutar enis [at] apache [dot] org @enissoz Page 1 Agenda Overview of Hive and HBase Hive + HBase Features and Improvements Future of Hive and HBase Q&A Page
More informationData Access 3. Migrating data. Date of Publish:
3 Migrating data Date of Publish: 2018-07-12 http://docs.hortonworks.com Contents Data migration to Apache Hive... 3 Moving data from databases to Apache Hive...3 Create a Sqoop import command...4 Import
More informationHacking PostgreSQL Internals to Solve Data Access Problems
Hacking PostgreSQL Internals to Solve Data Access Problems Sadayuki Furuhashi Treasure Data, Inc. Founder & Software Architect A little about me... > Sadayuki Furuhashi > github/twitter: @frsyuki > Treasure
More informationOracle Warehouse Builder 10g Release 2 Integrating Packaged Applications Data
Oracle Warehouse Builder 10g Release 2 Integrating Packaged Applications Data June 2006 Note: This document is for informational purposes. It is not a commitment to deliver any material, code, or functionality,
More informationBig Data Facebook
Big Data Architectures@ Facebook QCon London 2012 Ashish Thusoo Outline Big Data @ Facebook - Scope & Scale Evolution of Big Data Architectures @ FB Past, Present and Future Questions Big Data @ FB: Scale
More informationDATA INTEGRATION PLATFORM CLOUD. Experience Powerful Data Integration in the Cloud
DATA INTEGRATION PLATFORM CLOUD Experience Powerful Integration in the Want a unified, powerful, data-driven solution for all your data integration needs? Oracle Integration simplifies your data integration
More informationHAWQ: A Massively Parallel Processing SQL Engine in Hadoop
HAWQ: A Massively Parallel Processing SQL Engine in Hadoop Lei Chang, Zhanwei Wang, Tao Ma, Lirong Jian, Lili Ma, Alon Goldshuv Luke Lonergan, Jeffrey Cohen, Caleb Welton, Gavin Sherry, Milind Bhandarkar
More informationBIG DATA ANALYTICS USING HADOOP TOOLS APACHE HIVE VS APACHE PIG
BIG DATA ANALYTICS USING HADOOP TOOLS APACHE HIVE VS APACHE PIG Prof R.Angelin Preethi #1 and Prof J.Elavarasi *2 # Department of Computer Science, Kamban College of Arts and Science for Women, TamilNadu,
More informationDatabricks, an Introduction
Databricks, an Introduction Chuck Connell, Insight Digital Innovation Insight Presentation Speaker Bio Senior Data Architect at Insight Digital Innovation Focus on Azure big data services HDInsight/Hadoop,
More informationRecent Innovations in Data Storage Technologies Dr Roger MacNicol Software Architect
Recent Innovations in Data Storage Technologies Dr Roger MacNicol Software Architect Copyright 2017, Oracle and/or its affiliates. All rights reserved. Safe Harbor Statement The following is intended to
More informationCertified Big Data and Hadoop Course Curriculum
Certified Big Data and Hadoop Course Curriculum The Certified Big Data and Hadoop course by DataFlair is a perfect blend of in-depth theoretical knowledge and strong practical skills via implementation
More informationCertified Big Data Hadoop and Spark Scala Course Curriculum
Certified Big Data Hadoop and Spark Scala Course Curriculum The Certified Big Data Hadoop and Spark Scala course by DataFlair is a perfect blend of indepth theoretical knowledge and strong practical skills
More informationexam. Microsoft Perform Data Engineering on Microsoft Azure HDInsight. Version 1.0
70-775.exam Number: 70-775 Passing Score: 800 Time Limit: 120 min File Version: 1.0 Microsoft 70-775 Perform Data Engineering on Microsoft Azure HDInsight Version 1.0 Exam A QUESTION 1 You use YARN to
More information@Pentaho #BigDataWebSeries
Enterprise Data Warehouse Optimization with Hadoop Big Data @Pentaho #BigDataWebSeries Your Hosts Today Dave Henry SVP Enterprise Solutions Davy Nys VP EMEA & APAC 2 Source/copyright: The Human Face of
More informationAPACHE HIVE CIS 612 SUNNIE CHUNG
APACHE HIVE CIS 612 SUNNIE CHUNG APACHE HIVE IS Data warehouse infrastructure built on top of Hadoop enabling data summarization and ad-hoc queries. Initially developed by Facebook. Hive stores data in
More information