Just add Magic. Enterprise Parquet. Jean-Pierre Dijcks Product Management, Big


2 Just add Magic Enterprise Parquet Jean-Pierre Dijcks Product Management, Big

3 Program Agenda: Context, Enterprise Parquet, Q&A

4 Context

5 Use Cases and Non-Use Cases. The entire presentation focuses on analytics: write once, read many times. The use cases involve file-based data stores, like Hadoop Distributed File System (HDFS) and object stores.

6 [Diagram: metadata sharing]

7 Files Metadata Challenges. Cause issues with: Security (fine-grained access controls in Object Stores and HDFS; Data Governance / Legislative Pressure, GDPR etc.); Performance (performance penalties in Schema on Read cases); Agility (ETL challenges on many changing file formats); Cost (avoid file sprawl and storage explosion through copies to solve).

8 Hive: Adding Metadata to Enable SQL. Example query: SELECT name FROM my_cust WHERE id = 1. The Hive Metastore defines the InputFormat (any file type: create splits), the RecordReader (create records), and the SerDe (create attributes); SQL execution then selects the data.
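The chain on this slide can be sketched in a few lines of Python. This is a toy model, not Hive's Java API: the function names, the sample data, the comma-delimited layout, and the line-based splits are all assumptions made for illustration (Hive splits are normally byte ranges of a file).

```python
from typing import Dict, Iterator, List

# Hypothetical raw file content; Hive would read this from HDFS or an object store.
RAW_FILE = "1,Alice\n2,Bob\n3,Carol\n"


def input_format(data: str, split_size: int = 2) -> Iterator[List[str]]:
    """Create splits: group the file's lines into chunks of split_size lines."""
    lines = data.splitlines()
    for start in range(0, len(lines), split_size):
        yield lines[start:start + split_size]


def record_reader(split: List[str]) -> Iterator[str]:
    """Create records: emit one raw record per line of a split."""
    for line in split:
        yield line


def serde(record: str, columns: List[str]) -> Dict[str, str]:
    """Create attributes: parse a raw record into named columns."""
    return dict(zip(columns, record.split(",")))


def select_name_where_id(data: str, wanted_id: str) -> List[str]:
    """Rough equivalent of: SELECT name FROM my_cust WHERE id = <wanted_id>."""
    columns = ["id", "name"]  # in Hive, this schema would come from the Metastore
    return [
        row["name"]
        for split in input_format(data)
        for record in record_reader(split)
        for row in [serde(record, columns)]
        if row["id"] == wanted_id
    ]


print(select_name_where_id(RAW_FILE, "1"))
```

Note that every record is parsed before the filter is applied, which is the schema-on-read cost the later slides refer to.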

9 How does Apache Parquet Work? Create and Query Parquet Files. Schema on Write: Parquet implements a database storage structure with metadata and parsed data elements. Columns: column projection. Rows: predicate-based row elimination. Example query: SELECT name FROM my_cust WHERE id = 1. Metadata for blocks: metadata drives database-like scanning behavior.

10 How does Parquet Work? Create and Query Parquet Files. Schema on Write: Parquet implements a database storage structure with metadata and parsed data elements. [Diagram labels: Schema on Read, Schema on Write.] Columns: column projection. Rows: predicate-based row elimination. Example query: SELECT name FROM my_cust WHERE id = 1. Metadata for blocks: metadata drives database-like scanning behavior.
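A minimal sketch of column projection and predicate-based elimination, assuming a toy in-memory layout. The block structure, the `id_min`/`id_max` keys, and the sample rows are illustrative assumptions and not the Parquet on-disk format. In real Parquet files the comparable metadata is the per-row-group column statistics.

```python
from typing import Any, Dict, List, Tuple

# Toy columnar layout: each block stores values column by column, plus
# min/max metadata for the "id" column. Names and layout are illustrative only.
BLOCKS: List[Dict[str, Any]] = [
    {"meta": {"id_min": 1, "id_max": 2},
     "columns": {"id": [1, 2], "name": ["Alice", "Bob"], "city": ["Oslo", "Rome"]}},
    {"meta": {"id_min": 3, "id_max": 4},
     "columns": {"id": [3, 4], "name": ["Carol", "Dave"], "city": ["Lima", "Pune"]}},
]


def select_name_where_id(blocks: List[Dict[str, Any]], wanted_id: int) -> Tuple[List[str], int]:
    """Rough equivalent of SELECT name FROM my_cust WHERE id = wanted_id.

    Returns the matching names and the number of blocks actually scanned.
    """
    names: List[str] = []
    blocks_scanned = 0
    for block in blocks:
        meta = block["meta"]
        # Predicate-based elimination: skip blocks whose id range cannot match.
        if not (meta["id_min"] <= wanted_id <= meta["id_max"]):
            continue
        blocks_scanned += 1
        # Column projection: only "id" and "name" are touched; "city" is never read.
        ids = block["columns"]["id"]
        name_col = block["columns"]["name"]
        names.extend(n for i, n in zip(ids, name_col) if i == wanted_id)
    return names, blocks_scanned


print(select_name_where_id(BLOCKS, 1))
```

The second return value makes the IO avoidance visible: a query for an id outside every block's range scans zero blocks.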

11 Apache Parquet, Solving One Performance Angle. Columnar database file format. Schema on Write, so all your data is parsed: you must do ETL; some data elements will not fit. Does track metadata and schema. Columnar IO avoidance: better performance. Need to keep the original data for archive purposes.

12 Enterprise Parquet

13 Enterprise Parquet. Fine-grained, in-file access control; Columnar IO + Original Data; Faster Schema-on-Read. Views of a value: Full 1234, Masked 12XX, Tokenized wxyz. [Diagram labels: SQL, Columnar IO, UnZIP, Full Original Data, Schema on Write, Hive SerDe, Schema on Read, Oracle SerDe, Compress.]
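The three views of a value on this slide (Full 1234, Masked 12XX, Tokenized wxyz) can be illustrated with a small sketch. The slides do not describe how Enterprise Parquet implements masking or tokenization, so everything below is an assumption: the function names, the "keep the first two characters" rule, and the HMAC-based token, which is deterministic but not reversible.

```python
import hashlib
import hmac
import string


def mask(value: str, visible: int = 2, fill: str = "X") -> str:
    """Keep the first `visible` characters and replace the rest with `fill`."""
    return value[:visible] + fill * max(len(value) - visible, 0)


def tokenize(value: str, key: bytes) -> str:
    """Map a value to a same-length lowercase token, deterministic for a given key."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    letters = string.ascii_lowercase
    # Cycle through the digest bytes so long values still get a full-length token.
    return "".join(letters[digest[i % len(digest)] % 26] for i in range(len(value)))


def read_value(value: str, mode: str, key: bytes = b"demo-key") -> str:
    """Return the view of `value` that a given (hypothetical) access mode allows."""
    if mode == "full":
        return value
    if mode == "masked":
        return mask(value)
    if mode == "tokenized":
        return tokenize(value, key)
    raise ValueError(f"unknown access mode: {mode}")


for mode in ("full", "masked", "tokenized"):
    print(mode, read_value("1234", mode))
```

A keyed token keeps equal inputs equal, so joins and group-bys on the tokenized column still work without exposing the original value.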

14 Standard Use Case - Baseline. Two files = Apache Parquet + text file: 2x files on disk, ~1.6-2x footprint. Text.gz (source text in GZIP): slow text-mode ad-hoc queries. Parquet (columnar): columnar queries.

15 Enterprise Parquet Enhancements Overview: Single File, New Access Modes, Less Storage Space. Baseline: Text.gz (source text in GZIP) + Parquet (columnar), two files on disk, ~1.6x - 2x space. Enterprise Parquet: single file on disk, ~1.05x space; Parquet columnar; Parquet partition, backward compatible.

16 Enterprise Parquet Enhancements Overview: Single File, New Access Modes, Less Storage Space. Baseline: Text.gz (source text in GZIP) + Parquet (columnar), two files on disk, ~1.6x - 2x space. Enterprise Parquet: single file on disk, ~1.05x space; Parquet columnar; Parquet partition, backward compatible; Hidden Enhanced Oracle Extras; Enterprise Extra.

17 Enterprise Parquet Enhancements Overview: Single File, New Access Modes, Less Storage Space. Baseline: Text.gz (source text in GZIP) + Parquet (columnar), two files on disk, ~1.6x - 2x space; slow text-mode ad-hoc queries; columnar queries. Enterprise Parquet: single file on disk, ~1.05x space; Parquet columnar; Compatible Mode: Parquet queries.

18 Enterprise Parquet Enhancements Overview: Single File, New Access Modes, Less Storage Space. Baseline: Text.gz (source text in GZIP) + Parquet (columnar), two files on disk, ~1.6x - 2x space; slow text-mode ad-hoc queries; columnar queries. Enterprise Parquet: single file on disk, ~1.05x space; Parquet columnar; Enterprise Extra; Compatible Mode: Parquet queries; Dynamic / Text Mode: true unzip, faster ad-hoc text queries.
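A worked version of the storage arithmetic on these slides. The slides give only multipliers and do not state what they are relative to, so the sketch assumes all three apply to the same baseline size, and the 100 GB figure is a hypothetical input.

```python
def footprint_gb(baseline_gb: float, factor: float) -> float:
    """Disk space used for a given baseline size and space multiplier."""
    return baseline_gb * factor


def relative_saving(old_factor: float, new_factor: float) -> float:
    """Fraction of disk space saved when moving from old_factor to new_factor."""
    return 1.0 - new_factor / old_factor


baseline_gb = 100.0  # hypothetical baseline; not a figure from the slides

two_files_low = footprint_gb(baseline_gb, 1.6)   # lower bound for Text.gz + Parquet
two_files_high = footprint_gb(baseline_gb, 2.0)  # upper bound for Text.gz + Parquet
single_file = footprint_gb(baseline_gb, 1.05)    # single Enterprise Parquet file

print(f"two files:   {two_files_low:.0f} to {two_files_high:.0f} GB")
print(f"single file: {single_file:.0f} GB")
print(f"saving:      {relative_saving(1.6, 1.05):.1%} to {relative_saving(2.0, 1.05):.1%}")
```

Under that assumption, the single file uses roughly 34% to 47.5% less disk space than the two-file baseline, whatever the baseline size is.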

19 Enterprise Parquet Enhancements #1b. Redaction and Masking: New Static Mode Access. [Diagram labels: Text.gz (source text in GZIP); Parquet Columnar; Key; Enterprise Parquet Columnar; Enterprise Extra.] Static Mode: enhanced binary access, redacted / unredacted values...

20 Ingest - Security. Upon ingest, specific fields are marked to redact, with details like: encryption key; redaction pattern; redaction or tokenization columns; etc.
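One way to picture the per-field marking described here is a small rule table that travels with the ingest job. The slides do not show the actual definition format, so the `FieldRule` class, the column names, the `##XX` pattern syntax ('#' keeps a character, any other character replaces it), and the key identifier are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

ALLOWED_ACTIONS = {"redact", "tokenize"}


@dataclass(frozen=True)
class FieldRule:
    """Hypothetical ingest-time marking for one column."""
    column: str
    action: str                    # "redact" or "tokenize"
    pattern: Optional[str] = None  # redaction pattern, e.g. "##XX"
    key_id: Optional[str] = None   # reference to an encryption key, not the key itself


def validate(rules: List[FieldRule]) -> None:
    """Reject rules with an unknown action or a redaction without a pattern."""
    for rule in rules:
        if rule.action not in ALLOWED_ACTIONS:
            raise ValueError(f"{rule.column}: unknown action {rule.action!r}")
        if rule.action == "redact" and rule.pattern is None:
            raise ValueError(f"{rule.column}: redaction needs a pattern")


def rules_by_column(rules: List[FieldRule]) -> Dict[str, FieldRule]:
    """Index the rules by column name for lookup during ingest."""
    return {rule.column: rule for rule in rules}


def apply_pattern(value: str, pattern: str) -> str:
    """Apply a pattern position by position; output stops at the shorter input."""
    return "".join(v if p == "#" else p for v, p in zip(value, pattern))


RULES = [
    FieldRule(column="card_number", action="redact", pattern="##XX", key_id="key-001"),
    FieldRule(column="ssn", action="tokenize", key_id="key-001"),
]

validate(RULES)
print(apply_pattern("1234", rules_by_column(RULES)["card_number"].pattern))
```

Storing a key identifier rather than the key keeps the rule table safe to share, which fits the key management item on the To Do slide.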

21 Ingest - Security

22 Ingest - Automation. Industry-standard and documented files are ingested with a definition based on this standard. Examples: HL7, web logs, trading data, etc.

23 Ingest Automation

24 Query Security. Enterprise Parquet: Redacted, Privileged.

25 Query Security

26 Unzip Parquet File

27 Unzip

28 High Performance Schema-on-Read. Ingest as one schema, read as another schema. Single file acts as many files.
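The idea of "ingest as one schema, read as another" can be sketched by keeping the original records once and applying different schemas at read time. This illustrates the concept only, not how Enterprise Parquet does it. The log format, the schema dictionaries, and the function name are assumptions.

```python
from typing import Dict, List

# Hypothetical original data, stored once (web-log style lines).
STORED_LINES = [
    "2018-05-01 10:15:00 GET /index.html 200",
    "2018-05-01 10:15:02 POST /login 302",
]

# Two read-time schemas over the same stored lines: column name -> field position.
TRAFFIC_SCHEMA = {"date": 0, "status": 4}
REQUEST_SCHEMA = {"method": 2, "path": 3}


def read_as(lines: List[str], schema: Dict[str, int]) -> List[Dict[str, str]]:
    """Parse each stored line at read time and expose only the schema's columns."""
    rows = []
    for line in lines:
        fields = line.split()
        rows.append({name: fields[pos] for name, pos in schema.items()})
    return rows


print(read_as(STORED_LINES, TRAFFIC_SCHEMA))
print(read_as(STORED_LINES, REQUEST_SCHEMA))
```

Both result sets come from the same stored lines, which is the sense in which a single file can act as many files.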

29 High Performance Schema-on-Read: 5x speedup.

30 Summary. Security: access controls inside files. Performance: speed-up like Parquet. Storage: no duplication of data. Compatibility: 100% Parquet compatible.

31 To Do. Finalize file work for GA. Integration: Big Data SQL and other SQL engines; key management and authentication frameworks; Kafka pipelines; Big Data Manager. Autonomous ingest for any format. More GDPR's: the right to be forgotten; autonomous redaction on sensitive elements; data provenance in files. Your requirements here.

32 Questions and Answers


Oracle Big Data SQL High Performance Data Virtualization Explained

Oracle Big Data SQL High Performance Data Virtualization Explained Keywords: Oracle Big Data SQL High Performance Data Virtualization Explained Jean-Pierre Dijcks Oracle Redwood City, CA, USA Big Data SQL, SQL, Big Data, Hadoop, NoSQL Databases, Relational Databases,

More information

Security and Performance advances with Oracle Big Data SQL

Security and Performance advances with Oracle Big Data SQL Security and Performance advances with Oracle Big Data SQL Jean-Pierre Dijcks Oracle Redwood Shores, CA, USA Key Words SQL, Oracle, Database, Analytics, Object Store, Files, Big Data, Big Data SQL, Hadoop,

More information

Big Data SQL Deep Dive

Big Data SQL Deep Dive Big Data SQL Deep Dive Jean-Pierre Dijcks Big Data Product Management DOAG 2016 Copyright 2016, Oracle and/or its affiliates. All rights reserved. 2 Safe Harbor Statement The following is intended to outline

More information

Hive SQL over Hadoop

Hive SQL over Hadoop Hive SQL over Hadoop Antonino Virgillito THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Introduction Apache Hive is a high-level abstraction on top of MapReduce Uses

More information

Do-It-Yourself 1. Oracle Big Data Appliance 2X Faster than

Do-It-Yourself 1. Oracle Big Data Appliance 2X Faster than Oracle Big Data Appliance 2X Faster than Do-It-Yourself 1 Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such

More information

Oracle Big Data SQL. Release 3.2. Rich SQL Processing on All Data

Oracle Big Data SQL. Release 3.2. Rich SQL Processing on All Data Oracle Big Data SQL Release 3.2 The unprecedented explosion in data that can be made useful to enterprises from the Internet of Things, to the social streams of global customer bases has created a tremendous

More information

Stay Informed During and AEer OpenWorld

Stay Informed During and AEer OpenWorld Stay Informed During and AEer OpenWorld TwiIer: @OracleBigData, @OracleExadata, @Infrastructure Follow #CloudReady LinkedIn: Oracle IT Infrastructure Oracle Showcase Page Oracle Big Data Oracle Showcase

More information

Making the Most of Hadoop with Optimized Data Compression (and Boost Performance) Mark Cusack. Chief Architect RainStor

Making the Most of Hadoop with Optimized Data Compression (and Boost Performance) Mark Cusack. Chief Architect RainStor Making the Most of Hadoop with Optimized Data Compression (and Boost Performance) Mark Cusack Chief Architect RainStor Agenda Importance of Hadoop + data compression Data compression techniques Compression,

More information

Oracle Big Data SQL brings SQL and Performance to Hadoop

Oracle Big Data SQL brings SQL and Performance to Hadoop Oracle Big Data SQL brings SQL and Performance to Hadoop Jean-Pierre Dijcks Oracle Redwood City, CA, USA Keywords: Big Data SQL, Hadoop, Big Data Appliance, SQL, Oracle, Performance, Smart Scan Introduction

More information

Oracle Big Data Connectors

Oracle Big Data Connectors Oracle Big Data Connectors Oracle Big Data Connectors is a software suite that integrates processing in Apache Hadoop distributions with operations in Oracle Database. It enables the use of Hadoop to process

More information

Hadoop File Formats and Data Ingestion. Prasanth Kothuri, CERN

Hadoop File Formats and Data Ingestion. Prasanth Kothuri, CERN Prasanth Kothuri, CERN 2 Files Formats not just CSV - Key factor in Big Data processing and query performance - Schema Evolution - Compression and Splittability - Data Processing Write performance Partial

More information

Data Warehouse Tuning. Without SQL Modification

Data Warehouse Tuning. Without SQL Modification Data Warehouse Tuning Without SQL Modification Agenda About Me Tuning Objectives Data Access Profile Data Access Analysis Performance Baseline Potential Model Changes Model Change Testing Testing Results

More information

IBM Big SQL Partner Application Verification Quick Guide

IBM Big SQL Partner Application Verification Quick Guide IBM Big SQL Partner Application Verification Quick Guide VERSION: 1.6 DATE: Sept 13, 2017 EDITORS: R. Wozniak D. Rangarao Table of Contents 1 Overview of the Application Verification Process... 3 2 Platform

More information

Shark: SQL and Rich Analytics at Scale. Michael Xueyuan Han Ronny Hajoon Ko

Shark: SQL and Rich Analytics at Scale. Michael Xueyuan Han Ronny Hajoon Ko Shark: SQL and Rich Analytics at Scale Michael Xueyuan Han Ronny Hajoon Ko What Are The Problems? Data volumes are expanding dramatically Why Is It Hard? Needs to scale out Managing hundreds of machines

More information

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet Ema Iancuta iorhian@gmail.com Radu Chilom radu.chilom@gmail.com Big data analytics / machine learning 6+ years

More information

Apache Hive for Oracle DBAs. Luís Marques

Apache Hive for Oracle DBAs. Luís Marques Apache Hive for Oracle DBAs Luís Marques About me Oracle ACE Alumnus Long time open source supporter Founder of Redglue (www.redglue.eu) works for @redgluept as Lead Data Architect @drune After this talk,

More information

From Single Purpose to Multi Purpose Data Lakes. Thomas Niewel Technical Sales Director DACH Denodo Technologies March, 2019

From Single Purpose to Multi Purpose Data Lakes. Thomas Niewel Technical Sales Director DACH Denodo Technologies March, 2019 From Single Purpose to Multi Purpose Data Lakes Thomas Niewel Technical Sales Director DACH Denodo Technologies March, 2019 Agenda Data Lakes Multiple Purpose Data Lakes Customer Example Demo Takeaways

More information

Oracle 1Z Oracle Big Data 2017 Implementation Essentials.

Oracle 1Z Oracle Big Data 2017 Implementation Essentials. Oracle 1Z0-449 Oracle Big Data 2017 Implementation Essentials https://killexams.com/pass4sure/exam-detail/1z0-449 QUESTION: 63 Which three pieces of hardware are present on each node of the Big Data Appliance?

More information

Tatsuhiro Chiba, Takeshi Yoshimura, Michihiro Horie and Hiroshi Horii IBM Research

Tatsuhiro Chiba, Takeshi Yoshimura, Michihiro Horie and Hiroshi Horii IBM Research Tatsuhiro Chiba, Takeshi Yoshimura, Michihiro Horie and Hiroshi Horii IBM Research IBM Research 2 IEEE CLOUD 2018 / Towards Selecting Best Combination of SQL-on-Hadoop Systems and JVMs à à Application

More information

April Copyright 2013 Cloudera Inc. All rights reserved.

April Copyright 2013 Cloudera Inc. All rights reserved. Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and the Virtual EDW Headline Goes Here Marcel Kornacker marcel@cloudera.com Speaker Name or Subhead Goes Here April 2014 Analytic Workloads on

More information

Open Data Standards for Administrative Data Processing

Open Data Standards for Administrative Data Processing University of Pennsylvania ScholarlyCommons 2018 ADRF Network Research Conference Presentations ADRF Network Research Conference Presentations 11-2018 Open Data Standards for Administrative Data Processing

More information

Introduction to Hive Cloudera, Inc.

Introduction to Hive Cloudera, Inc. Introduction to Hive Outline Motivation Overview Data Model Working with Hive Wrap up & Conclusions Background Started at Facebook Data was collected by nightly cron jobs into Oracle DB ETL via hand-coded

More information

An Introduction to Big Data Formats

An Introduction to Big Data Formats Introduction to Big Data Formats 1 An Introduction to Big Data Formats Understanding Avro, Parquet, and ORC WHITE PAPER Introduction to Big Data Formats 2 TABLE OF TABLE OF CONTENTS CONTENTS INTRODUCTION

More information

Eine für Alle - Oracle DB für Big Data, In-memory und Exadata Dr.-Ing. Holger Friedrich

Eine für Alle - Oracle DB für Big Data, In-memory und Exadata Dr.-Ing. Holger Friedrich Eine für Alle - Oracle DB für Big Data, In-memory und Exadata Dr.-Ing. Holger Friedrich Agenda Introduction Old Times Exadata Big Data Oracle In-Memory Headquarters Conclusions 2 sumit AG Consulting and

More information

I am: Rana Faisal Munir

I am: Rana Faisal Munir Self-tuning BI Systems Home University (UPC): Alberto Abelló and Oscar Romero Host University (TUD): Maik Thiele and Wolfgang Lehner I am: Rana Faisal Munir Research Progress Report (RPR) [1 / 44] Introduction

More information

Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and thevirtual EDW Headline Goes Here

Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and thevirtual EDW Headline Goes Here Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and thevirtual EDW Headline Goes Here Marcel Kornacker marcel@cloudera.com Speaker Name or Subhead Goes Here 2013-11-12 Copyright 2013 Cloudera

More information

microsoft

microsoft 70-775.microsoft Number: 70-775 Passing Score: 800 Time Limit: 120 min Exam A QUESTION 1 Note: This question is part of a series of questions that present the same scenario. Each question in the series

More information

Albis: High-Performance File Format for Big Data Systems

Albis: High-Performance File Format for Big Data Systems Albis: High-Performance File Format for Big Data Systems Animesh Trivedi, Patrick Stuedi, Jonas Pfefferle, Adrian Schuepbach, Bernard Metzler, IBM Research, Zurich 2018 USENIX Annual Technical Conference

More information

This is a brief tutorial that explains how to make use of Sqoop in Hadoop ecosystem.

This is a brief tutorial that explains how to make use of Sqoop in Hadoop ecosystem. About the Tutorial Sqoop is a tool designed to transfer data between Hadoop and relational database servers. It is used to import data from relational databases such as MySQL, Oracle to Hadoop HDFS, and

More information

Stages of Data Processing

Stages of Data Processing Data processing can be understood as the conversion of raw data into a meaningful and desired form. Basically, producing information that can be understood by the end user. So then, the question arises,

More information

CONSOLIDATING RISK MANAGEMENT AND REGULATORY COMPLIANCE APPLICATIONS USING A UNIFIED DATA PLATFORM

CONSOLIDATING RISK MANAGEMENT AND REGULATORY COMPLIANCE APPLICATIONS USING A UNIFIED DATA PLATFORM CONSOLIDATING RISK MANAGEMENT AND REGULATORY COMPLIANCE APPLICATIONS USING A UNIFIED PLATFORM Executive Summary Financial institutions have implemented and continue to implement many disparate applications

More information

Overview. : Cloudera Data Analyst Training. Course Outline :: Cloudera Data Analyst Training::

Overview. : Cloudera Data Analyst Training. Course Outline :: Cloudera Data Analyst Training:: Module Title Duration : Cloudera Data Analyst Training : 4 days Overview Take your knowledge to the next level Cloudera University s four-day data analyst training course will teach you to apply traditional

More information

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Big Data Technology Ecosystem Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Agenda End-to-End Data Delivery Platform Ecosystem of Data Technologies Mapping an End-to-End Solution Case

More information

Shine a Light on Dark Data with Vertica Flex Tables

Shine a Light on Dark Data with Vertica Flex Tables White Paper Analytics and Big Data Shine a Light on Dark Data with Vertica Flex Tables Hidden within the dark recesses of your enterprise lurks dark data, information that exists but is forgotten, unused,

More information

Integrating with Apache Hadoop

Integrating with Apache Hadoop HPE Vertica Analytic Database Software Version: 7.2.x Document Release Date: 10/10/2017 Legal Notices Warranty The only warranties for Hewlett Packard Enterprise products and services are set forth in

More information

Blended Learning Outline: Cloudera Data Analyst Training (171219a)

Blended Learning Outline: Cloudera Data Analyst Training (171219a) Blended Learning Outline: Cloudera Data Analyst Training (171219a) Cloudera Univeristy s data analyst training course will teach you to apply traditional data analytics and business intelligence skills

More information

Turning Relational Database Tables into Spark Data Sources

Turning Relational Database Tables into Spark Data Sources Turning Relational Database Tables into Spark Data Sources Kuassi Mensah Jean de Lavarene Director Product Mgmt Director Development Server Technologies October 04, 2017 3 Safe Harbor Statement The following

More information

sqoop Easy, parallel database import/export Aaron Kimball Cloudera Inc. June 8, 2010

sqoop Easy, parallel database import/export Aaron Kimball Cloudera Inc. June 8, 2010 sqoop Easy, parallel database import/export Aaron Kimball Cloudera Inc. June 8, 2010 Your database Holds a lot of really valuable data! Many structured tables of several hundred GB Provides fast access

More information

Data Storage Infrastructure at Facebook

Data Storage Infrastructure at Facebook Data Storage Infrastructure at Facebook Spring 2018 Cleveland State University CIS 601 Presentation Yi Dong Instructor: Dr. Chung Outline Strategy of data storage, processing, and log collection Data flow

More information

Hive and Shark. Amir H. Payberah. Amirkabir University of Technology (Tehran Polytechnic)

Hive and Shark. Amir H. Payberah. Amirkabir University of Technology (Tehran Polytechnic) Hive and Shark Amir H. Payberah amir@sics.se Amirkabir University of Technology (Tehran Polytechnic) Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 1 / 45 Motivation MapReduce is hard to

More information

Cloudera Kudu Introduction

Cloudera Kudu Introduction Cloudera Kudu Introduction Zbigniew Baranowski Based on: http://slideshare.net/cloudera/kudu-new-hadoop-storage-for-fast-analytics-onfast-data What is KUDU? New storage engine for structured data (tables)

More information

ORC Files. Owen O June Page 1. Hortonworks Inc. 2012

ORC Files. Owen O June Page 1. Hortonworks Inc. 2012 ORC Files Owen O Malley owen@hortonworks.com @owen_omalley owen@hortonworks.com June 2013 Page 1 Who Am I? First committer added to Hadoop in 2006 First VP of Hadoop at Apache Was architect of MapReduce

More information

Microsoft. Exam Questions Perform Data Engineering on Microsoft Azure HDInsight (beta) Version:Demo

Microsoft. Exam Questions Perform Data Engineering on Microsoft Azure HDInsight (beta) Version:Demo Microsoft Exam Questions 70-775 Perform Data Engineering on Microsoft Azure HDInsight (beta) Version:Demo NEW QUESTION 1 You have an Azure HDInsight cluster. You need to store data in a file format that

More information

Enabling Secure Hadoop Environments

Enabling Secure Hadoop Environments Enabling Secure Hadoop Environments Fred Koopmans Sr. Director of Product Management 1 The future of government is data management What s your strategy? 2 Cloudera s Enterprise Data Hub makes it possible

More information

Integration of Apache Hive

Integration of Apache Hive Integration of Apache Hive and HBase Enis Soztutar enis [at] apache [dot] org @enissoz Page 1 Agenda Overview of Hive and HBase Hive + HBase Features and Improvements Future of Hive and HBase Q&A Page

More information

Interactive SQL-on-Hadoop from Impala to Hive/Tez to Spark SQL to JethroData

Interactive SQL-on-Hadoop from Impala to Hive/Tez to Spark SQL to JethroData Interactive SQL-on-Hadoop from Impala to Hive/Tez to Spark SQL to JethroData ` Ronen Ovadya, Ofir Manor, JethroData About JethroData Founded 2012 Raised funding from Pitango in 2013 Engineering in Israel,

More information

Delving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture

Delving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture Delving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture Hadoop 1.0 Architecture Introduction to Hadoop & Big Data Hadoop Evolution Hadoop Architecture Networking Concepts Use cases

More information

How Apache Hadoop Complements Existing BI Systems. Dr. Amr Awadallah Founder, CTO Cloudera,

How Apache Hadoop Complements Existing BI Systems. Dr. Amr Awadallah Founder, CTO Cloudera, How Apache Hadoop Complements Existing BI Systems Dr. Amr Awadallah Founder, CTO Cloudera, Inc. Twitter: @awadallah, @cloudera 2 The Problems with Current Data Systems BI Reports + Interactive Apps RDBMS

More information

Apache Kudu. Zbigniew Baranowski

Apache Kudu. Zbigniew Baranowski Apache Kudu Zbigniew Baranowski Intro What is KUDU? New storage engine for structured data (tables) does not use HDFS! Columnar store Mutable (insert, update, delete) Written in C++ Apache-licensed open

More information

Part 1 Configuring Oracle Big Data SQL

Part 1 Configuring Oracle Big Data SQL Oracle Big Data, Data Science, Advance Analytics & Oracle NoSQL Database Securely analyze data across the big data platform whether that data resides in Oracle Database 12c, Hadoop or a combination of

More information

Accelerating BI on Hadoop: Full-Scan, Cubes or Indexes?

Accelerating BI on Hadoop: Full-Scan, Cubes or Indexes? White Paper Accelerating BI on Hadoop: Full-Scan, Cubes or Indexes? How to Accelerate BI on Hadoop: Cubes or Indexes? Why not both? 1 +1(844)384-3844 INFO@JETHRO.IO Overview Organizations are storing more

More information

Dremel: Interac-ve Analysis of Web- Scale Datasets

Dremel: Interac-ve Analysis of Web- Scale Datasets Dremel: Interac-ve Analysis of Web- Scale Datasets Google Inc VLDB 2010 presented by Arka BhaEacharya some slides adapted from various Dremel presenta-ons on the internet The Problem: Interactive data

More information

Verarbeitung von Vektor- und Rasterdaten auf der Hadoop Plattform DOAG Spatial and Geodata Day 2016

Verarbeitung von Vektor- und Rasterdaten auf der Hadoop Plattform DOAG Spatial and Geodata Day 2016 Verarbeitung von Vektor- und Rasterdaten auf der Hadoop Plattform DOAG Spatial and Geodata Day 2016 Hans Viehmann Product Manager EMEA ORACLE Corporation 12. Mai 2016 Safe Harbor Statement The following

More information

The Reality of Qlik and Big Data. Chris Larsen Q3 2016

The Reality of Qlik and Big Data. Chris Larsen Q3 2016 The Reality of Qlik and Big Data Chris Larsen Q3 2016 Introduction Chris Larsen Sr Solutions Architect, Partner Engineering @Qlik Based in Lund, Sweden Primary Responsibility Advanced Analytics (and formerly

More information

Data Access 3. Managing Apache Hive. Date of Publish:

Data Access 3. Managing Apache Hive. Date of Publish: 3 Managing Apache Hive Date of Publish: 2018-07-12 http://docs.hortonworks.com Contents ACID operations... 3 Configure partitions for transactions...3 View transactions...3 View transaction locks... 4

More information

Sempala. Interactive SPARQL Query Processing on Hadoop

Sempala. Interactive SPARQL Query Processing on Hadoop Sempala Interactive SPARQL Query Processing on Hadoop Alexander Schätzle, Martin Przyjaciel-Zablocki, Antony Neu, Georg Lausen University of Freiburg, Germany ISWC 2014 - Riva del Garda, Italy Motivation

More information

Lecture 7 (03/12, 03/14): Hive and Impala Decisions, Operations & Information Technologies Robert H. Smith School of Business Spring, 2018

Lecture 7 (03/12, 03/14): Hive and Impala Decisions, Operations & Information Technologies Robert H. Smith School of Business Spring, 2018 Lecture 7 (03/12, 03/14): Hive and Impala Decisions, Operations & Information Technologies Robert H. Smith School of Business Spring, 2018 K. Zhang (pic source: mapr.com/blog) Copyright BUDT 2016 758 Where

More information

Impala. A Modern, Open Source SQL Engine for Hadoop. Yogesh Chockalingam

Impala. A Modern, Open Source SQL Engine for Hadoop. Yogesh Chockalingam Impala A Modern, Open Source SQL Engine for Hadoop Yogesh Chockalingam Agenda Introduction Architecture Front End Back End Evaluation Comparison with Spark SQL Introduction Why not use Hive or HBase?

More information

Data warehousing on Hadoop. Marek Grzenkowicz Roche Polska

Data warehousing on Hadoop. Marek Grzenkowicz Roche Polska Data warehousing on Hadoop Marek Grzenkowicz Roche Polska Agenda Introduction Case study: StraDa project Source data Data model Data flow and processing Reporting Lessons learnt Ideas for the future Q&A

More information

HAWQ: A Massively Parallel Processing SQL Engine in Hadoop

HAWQ: A Massively Parallel Processing SQL Engine in Hadoop HAWQ: A Massively Parallel Processing SQL Engine in Hadoop Lei Chang, Zhanwei Wang, Tao Ma, Lirong Jian, Lili Ma, Alon Goldshuv Luke Lonergan, Jeffrey Cohen, Caleb Welton, Gavin Sherry, Milind Bhandarkar

More information

Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a)

Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a) Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a) Cloudera s Developer Training for Apache Spark and Hadoop delivers the key concepts and expertise need to develop high-performance

More information

Hacking PostgreSQL Internals to Solve Data Access Problems

Hacking PostgreSQL Internals to Solve Data Access Problems Hacking PostgreSQL Internals to Solve Data Access Problems Sadayuki Furuhashi Treasure Data, Inc. Founder & Software Architect A little about me... > Sadayuki Furuhashi > github/twitter: @frsyuki > Treasure

More information

Big Spatial Data Performance With Oracle Database 12c. Daniel Geringer Spatial Solutions Architect

Big Spatial Data Performance With Oracle Database 12c. Daniel Geringer Spatial Solutions Architect Big Spatial Data Performance With Oracle Database 12c Daniel Geringer Spatial Solutions Architect Oracle Exadata Database Machine Engineered System 2 What Is the Oracle Exadata Database Machine? Oracle

More information

TIE Data-intensive Programming. Dr. Timo Aaltonen Department of Pervasive Computing

TIE Data-intensive Programming. Dr. Timo Aaltonen Department of Pervasive Computing TIE-22306 Data-intensive Programming Dr. Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Lecturer: Timo Aaltonen timo.aaltonen@tut.fi Assistants Adnan Mushtaq MSc Antti Luoto

More information

Time Series Storage with Apache Kudu (incubating)

Time Series Storage with Apache Kudu (incubating) Time Series Storage with Apache Kudu (incubating) Dan Burkert (Committer) dan@cloudera.com @danburkert Tweet about this talk: @getkudu or #kudu 1 Time Series machine metrics event logs sensor telemetry

More information

Department of Computer Engineering 1, 2, 3, 4,5

Department of Computer Engineering 1, 2, 3, 4,5 Components for writing Parquet Format Files Manas Rathi 1, Pratik Jagtap 2, Pranali Jain 3, Anisha Jain 4, Prof. Subhash Tatale 5 1, 2, 3, 4,5 Department of Computer Engineering 1, 2, 3, 4,5 Vishwakarma

More information

Configuring a Hadoop Environment for Test Data Management

Configuring a Hadoop Environment for Test Data Management Configuring a Hadoop Environment for Test Data Management Copyright Informatica LLC 2016, 2017. Informatica LLC. No part of this document may be reproduced or transmitted in any form, by any means (electronic,

More information

Building Big Data Storage Solutions (Data Lakes) for Maximum Flexibility. AWS Whitepaper

Building Big Data Storage Solutions (Data Lakes) for Maximum Flexibility. AWS Whitepaper Building Big Data Storage Solutions (Data Lakes) for Maximum Flexibility AWS Whitepaper Building Big Data Storage Solutions (Data Lakes) for Maximum Flexibility: AWS Whitepaper Copyright 2018 Amazon Web

More information

Cloudera Introduction

Cloudera Introduction Cloudera Introduction Important Notice 2010-2017 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, and any other product or service names or slogans contained in this document are trademarks

More information

BIG DATA Standardisation - Data Lake Ingestion

BIG DATA Standardisation - Data Lake Ingestion BIG DATA Standardisation - Data Lake Ingestion Data Warehousing & Big Data Summer School 3 rd Edition Oana Vasile & Razvan Stoian 21.06.2017 Data & Analytics Data Sourcing and Transformation Content Big

More information

Best practices for building a Hadoop Data Lake Solution CHARLOTTE HADOOP USER GROUP

Best practices for building a Hadoop Data Lake Solution CHARLOTTE HADOOP USER GROUP Best practices for building a Hadoop Data Lake Solution CHARLOTTE HADOOP USER GROUP 07.29.2015 LANDING STAGING DW Let s start with something basic Is Data Lake a new concept? What is the closest we can

More information

Big Data XML Parsing in Pentaho Data Integration (PDI)

Big Data XML Parsing in Pentaho Data Integration (PDI) Big Data XML Parsing in Pentaho Data Integration (PDI) Change log (if you want to use it): Date Version Author Changes Contents Overview... 1 Before You Begin... 1 Terms You Should Know... 1 Selecting

More information

Creating Connection With Hive. Version: 16.0

Creating Connection With Hive. Version: 16.0 Creating Connection With Hive Version: 16.0 Copyright 2015 Intellicus Technologies This document and its content is copyrighted material of Intellicus Technologies. The content may not be copied or derived

More information

Importing and Exporting Data Between Hadoop and MySQL

Importing and Exporting Data Between Hadoop and MySQL Importing and Exporting Data Between Hadoop and MySQL + 1 About me Sarah Sproehnle Former MySQL instructor Joined Cloudera in March 2010 sarah@cloudera.com 2 What is Hadoop? An open-source framework for

More information

VOLTDB + HP VERTICA. page

VOLTDB + HP VERTICA. page VOLTDB + HP VERTICA ARCHITECTURE FOR FAST AND BIG DATA ARCHITECTURE FOR FAST + BIG DATA FAST DATA Fast Serve Analytics BIG DATA BI Reporting Fast Operational Database Streaming Analytics Columnar Analytics

More information

Shark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker

Shark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Shark Hive on Spark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Agenda Intro to Spark Apache Hive Shark Shark s Improvements over Hive Demo Alpha

More information
