Approaching the Petabyte Analytic Database: What I learned


Disclaimer This document is for informational purposes only and is subject to change at any time without notice. The information in this document is proprietary to Actian and no part of this document may be reproduced, copied, or transmitted in any form or for any purpose without the express prior written permission of Actian. This document is not intended to be binding upon Actian to any particular course of business, pricing, product strategy, and/or development. Actian assumes no responsibility for errors or omissions in this document. Actian shall have no liability for damages of any kind including without limitation direct, special, indirect, or consequential damages that may result from the use of these materials. Actian does not warrant the accuracy or completeness of the information, text, graphics, links, or other items contained within this material. This document is provided without a warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability, fitness for a particular purpose, or non-infringement.

Approaching the Petabyte Analytic Database: What I learned
Keith Bolam, Director of Engineering Projects
November 2018

- One petabyte of data?
- Where does the data come from?
- When do we need to access the data?
- Who is going to be accessing the data?
- Flashback to One Billion Rows: what next?

After the Terabyte we arrive at the newest data size
- Petabyte is becoming a norm in regular conversation
- The human brain has a capacity of about 2.5 petabytes of memories
- Databases do NOT handle a petabyte of data with ease
- Databases need you
2018 Actian Corporation


Where does the data come from?
- OLTP systems
- Social platforms
- Log and time series
- IoT devices

What is one petabyte of data?
Today's iPhones are 128 GB or more, so 8 of them make a terabyte and one petabyte is roughly 8,000 phones. That is not a lot when we think of there being 73,734,000 iPhones in 2011. But what can we do with it on a database?
- On the phones we generally store images, so one record may be 4 MB, maybe 12 MB. A video could be 2-4 GB.
- In a database we are more interested in small data, but lots of it. For images we would be interested in the metadata only.
- IoT devices can generate many GB per day.
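The phone comparison can be checked with a few lines of arithmetic. This is a minimal sketch assuming decimal (SI) units, where 1 GB is 10^9 bytes; the 128 GB phone and the 4 MB image come from the slide, and all variable names are illustrative.

```python
# Back-of-envelope check of the "petabyte in phones" comparison.
# Assumption: decimal (SI) units, so 1 GB = 10**9 bytes.
GB = 10**9
TB = 10**12
PB = 10**15

PHONE_BYTES = 128 * GB                 # phone capacity quoted on the slide

phones_per_tb = TB / PHONE_BYTES       # a little under 8 phones per terabyte
phones_per_pb = PB / PHONE_BYTES       # a little under 8,000 phones per petabyte

IMAGE_BYTES = 4 * 10**6                # a 4 MB photo, as on the slide
images_per_pb = PB // IMAGE_BYTES      # how many such photos fill a petabyte

if __name__ == "__main__":
    print(f"phones per TB: {phones_per_tb}")
    print(f"phones per PB: {phones_per_pb}")
    print(f"4 MB images per PB: {images_per_pb:,}")
```

With binary units (1 GB = 2^30 bytes) the phone count comes out slightly higher, so "about 8,000" is the safe way to quote it.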

When do we need to access data?
- Now
- Frequently
- Ad-hoc?
- A year's time or longer
- Rolling window

Petabyte implications on database analytic queries
1. Try not to allow users access to the whole dataset: THEY DO NOT NEED IT
2. Bring insight from the queries that have been run: USE MONITORING TOOLS
3. You do not need all the data in one place: PUT IT IN SMALLER CHUNKS
4. Put the data into the database in an appropriate way: USE NATURAL CLUSTERING
5. LET USERS ACCESS DATA EARLY: see point 2 above
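One way to picture "smaller chunks" combined with a rolling window is to store the data in per-month chunks and let a query touch only the chunks its date range overlaps. The sketch below is an illustration under that assumption, not a description of how Actian Vector stores data; the function names and the three-year store are hypothetical.

```python
from datetime import date

def months_between(start, end):
    """Return every (year, month) key touched by the inclusive range start..end."""
    keys = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        keys.append((year, month))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return keys

def chunks_to_scan(available_chunks, start, end):
    """Pick only the stored chunks that overlap the requested window."""
    wanted = set(months_between(start, end))
    return sorted(key for key in available_chunks if key in wanted)

# Hypothetical store: three years of data, one chunk per month.
ALL_CHUNKS = months_between(date(2016, 1, 1), date(2018, 12, 31))

# A rolling three-month window only needs the chunks it overlaps.
WINDOW = chunks_to_scan(ALL_CHUNKS, date(2018, 9, 1), date(2018, 11, 30))

if __name__ == "__main__":
    print(f"{len(WINDOW)} of {len(ALL_CHUNKS)} chunks scanned: {WINDOW}")
```

The point of the design is that the cost of a query follows the size of the window, not the size of the whole dataset.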

Who will be accessing the data?
- Data scientists
- AI applications and automated insights
- BI users: enterprise or ad-hoc

Data Scientists:
- Complex exploratory queries
- Few in number
- Long running
- May generate more data than they consume!
Applications:
- Dynamically generated queries
- Potential for poor SQL
- No humans involved to 'tune' SQL
- Rapid request potential
Business User:
- Corporate
- On-demand queries
- Organised generally on date, customer, region, product


Spreading the effort on more nodes or bigger nodes?
Increasing the node size and capability (Azure HDInsight):
  Node  Label    vCPU  Memory
  D12   tiny     4     28 GB
  D13   small    8     56 GB
  D16   starter  16    128 GB
Then they get much bigger and more expensive.
8 Exabytes of Storage
The power of MANY. Increase:
- Nodes
- Cores
- Both cores and nodes
Considerations:
- Bigger nodes = higher cost
- More nodes = greater joining cost
- More cores = greater vectorization capability

Flashback to One Billion Rows what next...

One Billion Rows
- Many devices and systems produce data
- Not all at the same rate
- Our perspective on what is happening is affected by our viewpoint

How did it work out

Some Numbers
- How many records of 80 bytes does it take to make ONE petabyte?
- Time to load 63,430,000,000 (63bn) rows: 22,214.5199 seconds
- Time to load 10m rows: 15.203613 seconds
- Rows per second loaded: 2,855,406
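The figures on this slide can be turned into a rough petabyte estimate. The sketch below uses the row counts and timings quoted in this deck and assumes 80-byte records, decimal units, and a load rate that stays constant at scale; that last assumption is optimistic and is made here only for illustration.

```python
# Load figures quoted in this deck.
ROWS_SMALL, SECS_SMALL = 10_000_000, 15.203613
ROWS_FULL, SECS_FULL = 63_430_000_000, 22_214.519971

# Assumptions for the extrapolation: 80-byte records, 1 PB = 10**15 bytes.
RECORD_BYTES = 80
PB = 10**15

rate_small = ROWS_SMALL / SECS_SMALL        # rows per second, 10m-row load
rate_full = ROWS_FULL / SECS_FULL           # rows per second, 63bn-row load

records_per_pb = PB // RECORD_BYTES         # records needed to reach one petabyte
secs_per_pb = records_per_pb / rate_full    # assumes the rate holds at scale
days_per_pb = secs_per_pb / 86_400

if __name__ == "__main__":
    print(f"10m-row load:  {rate_small:,.0f} rows/s")
    print(f"63bn-row load: {rate_full:,.0f} rows/s")
    print(f"records per PB at 80 bytes: {records_per_pb:,}")
    print(f"estimated load time for one PB: {days_per_pb:.1f} days")
```

Under these assumptions a single petabyte load takes weeks, not hours, which is why reducing the payload before moving it matters so much.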

- Consuming data while moving it helps
- Reducing the payload in the first place is even better
- If we eat while we work, it does get easier
- Leave the REALLY difficult tasks to someone else

Takeaways from this session
Planning:
- Look at the initial payload
- Identify what can be processed up front and never moved
- Consume during data movement
Preparation:
- Scale slowly and steadily
- Onboarding is time and cost sensitive
- Use insights to manage growth
- Enable user groups' access progressively
Performance:
- Users: self-serve can be the most disruptive of queries
- Applications: AI applications pre-defined by the data scientist
- BI/ELT: business reporting, known reports that will be run at scheduled times

How Actian's products can change your business: be on the leading edge of cloud
100GB to 20TB:
- Use Actian Vector on-premise today
- Use Actian Vector on Azure
- Use Actian Vector on AWS
- Use Actian Vector as a Service
20TB to 100TB:
- Use Actian Vector on Azure
- Use Actian Vector on AWS
- Use Actian Vector as a Service
100TB+:
- Use Actian Vector on Azure
- Use Actian Vector on AWS
- Use Actian Vector as a Service

Actian Vector & Dataflow November 2018

Actian Vector: delivering fast, open, enterprise-grade analytics to top customers
- Achieve business insights not possible before
- Connect to all your data sources and systems
- Get to mission-critical production faster

Performance advantage derived through multiple innovations
1. Vectorized Processing: Single Instruction Multiple Data
2. Exploiting Chip Cache: process data in chip, not in RAM
3. Second Gen Columnar: limit I/O; most efficient real-time updates on and off Hadoop
4. Smart Compression: maximize throughput; vectorized decompression in chip; typically 4-6:1 compression ratio
5. Storage Indexes: created automatically, simplifies schema; quickly identify candidate data blocks for solving queries; minimize I/O
6. Multi-core Parallelism: maximize concurrency, parallelism and system resource utilization
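The storage-index idea (quickly identify candidate data blocks, minimize I/O) can be sketched with a generic min/max index per block. This is not Actian's implementation, which the slides do not describe; the `Block` class, `build_blocks`, and `scan_range` are hypothetical names for a common technique.

```python
from dataclasses import dataclass

@dataclass
class Block:
    values: list   # the column values stored in this block
    lo: int        # smallest value in the block
    hi: int        # largest value in the block

def build_blocks(column, block_size):
    """Split a column into fixed-size blocks and record min/max per block."""
    blocks = []
    for i in range(0, len(column), block_size):
        chunk = column[i:i + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks

def scan_range(blocks, lo, hi):
    """Return values in [lo, hi] and the number of blocks actually read."""
    hits, blocks_read = [], 0
    for block in blocks:
        if block.hi < lo or block.lo > hi:
            continue  # skipped using only the min/max metadata
        blocks_read += 1
        hits.extend(v for v in block.values if lo <= v <= hi)
    return hits, blocks_read

if __name__ == "__main__":
    # Naturally clustered data: values arrive in increasing order.
    blocks = build_blocks(list(range(1000)), block_size=100)
    hits, blocks_read = scan_range(blocks, 250, 349)
    print(f"{len(hits)} rows found, {blocks_read} of {len(blocks)} blocks read")
```

The index only helps when similar values sit close together, which is one reason natural clustering at load time pays off.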

Actian Vector: The world's fastest analytic database
- Scans, aggregations, and joins over 1TB, 5TB, and 10TB databases, single user and 20 concurrent users, on the same underlying configurations
- Performance advantage over the competition grows as data scales, query complexity increases, and user concurrency increases
- Independently tested by MCG using the Berkeley AMPLab Big Data Benchmark
- Reported results: 10X faster, 14X faster, 20X faster, 100X faster
Download the reports at https://www.actian.com/analytic-database/vector-cloud/

Benchmarking VectorH vs SQL-in-Hadoop competition: how many times faster is VectorH?

Query  VectorH  HAWQ    SparkSQL  Impala   Hive
Q1     1.34     158.2   155.4     585.4    490.1
Q2     1.29     21.46   74.98     81.81    63.57
Q3     3.15     32.06   62.38     167.7    266.6
Q4     0.18     38.21   68.27     163.18   59.08
Q5     1.94     36.38   146.5     242.5    DNF
Q6     0.19     20.19   5.1       1.81     63.63
Q7     2.37     44.74   180.2     369      721.8
Q8     1.8      48.38   174.6     276.2    625.6
Q9     11.77    766.4   264       1242.9   1077
Q10    1.21     32.97   56.62     69.97    230.5
Q11    1.28     12.48   30.28     35.04    246.1
Q12    0.37     31.75   66.97     45.67    65.78
Q13    3.69     27.97   47.65     180.8    140.7
Q14    1.13     19.47   6.92      13.95    53.23
Q15    1.56     31.58   11.16     15.19    556.5
Q16    1.73     14.17   33.81     47.52    92.51
Q17    1.21     173.2   244.9     581.53   711.7
Q18    1.63     87.08   254.7     1234     454.5
Q19    1.29     24.82   24.89     714.7    1010
Q20    2.47     42.84   31.56     74.25    100.5
Q21    1.99     84.7    1614      880.8    247.7
Q22    2.96     29.44   91.18     34.81    81.11

The benchmark includes two refresh streams that delete and insert 1/1000th of the data. Note that only Hive and Vector can complete these tests. The query times below reflect the time taken to complete the refresh streams and execute the query set after the refresh streams have been executed.

Hive: RF1=34s, RF2=112s, GeoDiff=138.2%
VectorH: RF1=25s, RF2=12.5s, GeoDiff=99.3%

Query  VectorH  Hive
Q1     1.67     608.4
Q2     1.13     80.8
Q3     2.9      335.7
Q4     0.19     205.4
Q5     1.75     DNF
Q6     0.21     128
Q7     2.43     690.7
Q8     1.58     719.8
Q9     12.69    1150
Q10    1.21     334.4
Q11    1.32     218.7
Q12    0.35     170.5
Q13    3.67     143.8
Q14    0.89     130.7
Q15    1.48     596.7
Q16    1.64     101.4
Q17    1.22     891.2
Q18    1.67     594.6
Q19    1.45     1167
Q20    2.42     153.3
Q21    2.14     275.6
Q22    2.95     67.85
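The slide's question, "how many times faster is VectorH?", can be answered per query by dividing each engine's time by the VectorH time. The sketch below uses the first set of figures from this slide and assumes they are elapsed query times in the same unit for every engine; the geometric mean is one reasonable way to summarise the ratios and is not necessarily the summary the benchmark authors used. DNF is represented as `None` and left out of the mean.

```python
from math import exp, log

# Per-query times for Q1..Q22, copied from the slide (None = DNF).
VECTORH = [1.34, 1.29, 3.15, 0.18, 1.94, 0.19, 2.37, 1.8, 11.77, 1.21, 1.28,
           0.37, 3.69, 1.13, 1.56, 1.73, 1.21, 1.63, 1.29, 2.47, 1.99, 2.96]
OTHERS = {
    "HAWQ": [158.2, 21.46, 32.06, 38.21, 36.38, 20.19, 44.74, 48.38, 766.4,
             32.97, 12.48, 31.75, 27.97, 19.47, 31.58, 14.17, 173.2, 87.08,
             24.82, 42.84, 84.7, 29.44],
    "SparkSQL": [155.4, 74.98, 62.38, 68.27, 146.5, 5.1, 180.2, 174.6, 264,
                 56.62, 30.28, 66.97, 47.65, 6.92, 11.16, 33.81, 244.9, 254.7,
                 24.89, 31.56, 1614, 91.18],
    "Impala": [585.4, 81.81, 167.7, 163.18, 242.5, 1.81, 369, 276.2, 1242.9,
               69.97, 35.04, 45.67, 180.8, 13.95, 15.19, 47.52, 581.53, 1234,
               714.7, 74.25, 880.8, 34.81],
    "Hive": [490.1, 63.57, 266.6, 59.08, None, 63.63, 721.8, 625.6, 1077,
             230.5, 246.1, 65.78, 140.7, 53.23, 556.5, 92.51, 711.7, 454.5,
             1010, 100.5, 247.7, 81.11],
}

def speedups(baseline, times):
    """Per-query ratio other/baseline; None where the other engine did not finish."""
    return [None if t is None else t / b for b, t in zip(baseline, times)]

def geometric_mean(ratios):
    """Geometric mean over the queries that completed."""
    done = [r for r in ratios if r is not None]
    return exp(sum(log(r) for r in done) / len(done))

if __name__ == "__main__":
    for engine, times in OTHERS.items():
        ratios = speedups(VECTORH, times)
        print(f"{engine}: Q1 is {ratios[0]:.0f}x, "
              f"geometric mean {geometric_mean(ratios):.0f}x")
```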



Actian Vector for Hadoop: Enterprise class SQL BI & analytics natively in Hadoop
ENTERPRISE GRADE
- Full ANSI SQL 2003 support: leverage existing SQL skills and standard BI tools and apps
- Fully ACID compliant: prevent inaccurate results by bringing transactional integrity to Hadoop
- Update capability: provide the ability to update data in Hadoop without impacting query performance
- Native DBMS security: sleep well with enterprise class authentication, user and role-based security, data protection, and encryption
HIGH PERFORMANCE
- Highly performant: run existing apps faster and grow data without sacrificing performance
- High concurrency: allow simultaneous users and tasks to run without long wait times
- Mature, proven planner and fast optimizer: maximize usage of nodes, CPU, memory and cache with highly intelligent query execution plans
- Native in-Hadoop YARN: optimize usage of low-cost Hadoop infrastructure by automatically managing cluster resources across applications
OPEN
- Cloud: get started quickly with flexible deployment options on premise or across multiple cloud infrastructures
- Hadoop distribution agnostic: avoid vendor lock-in and provide customer flexibility across distributions
- Collaborative architecture: minimize risk by leveraging existing tools and benefitting from cross-industry innovations
- Open data formats: query native Hadoop file formats and allow API access to our own block format

Actian Vector and DataFlow & Spark
[Architecture diagram. Labels: Ubiquitous Analytics, Custom Apps, Streaming, ISVs, DataFlow, Spark, Remote Data, Traditional ETL, SQL, Vector, Cloud Data & Applications, Local Data Sources]
- Actian Vector Spark Connector: Vector serves as a data source to Spark apps
- Actian Vector Spark Loader: ingest data from all available Spark sources using the Spark Loader
- Spark Vector External Tables: using Spark

Processing capability and scale required: example

    drop table if exists sort_10t_x100;
    create table sort_10t_x100 (
        ID UUID NOT NULL WITH DEFAULT,
        _c0 varchar(100)
    ) with PARTITION=(HASH on _c0 25 partitions);

    --Create the EXTERNAL table
    drop table if exists sort_10t;
    create external table sort_10t (_c0 varchar(100))
    USING SPARK WITH
        REFERENCE = 'adl:///user/actian/datasets/sort/10tb/pennyinput_10m-9860000000.1987.one',
        ROWS = 10000000,
        FORMAT = 'CSV',
        options = ('header' = 'false', 'delimeter' = ' ');

    create external table sort_10t_full (_c0 varchar(100))
    USING SPARK WITH
        REFERENCE = 'adl:///user/actian/datasets/sort/10tb/*.one',
        ROWS = 10000000,
        FORMAT = 'CSV',
        options = ('header' = 'false', 'delimeter' = ' ');

    insert into sort_10t_x100 (_c0) select * from sort_10t;
    (10000000 rows in 15.203613 secs)

    insert into sort_10t_x100 (_c0) select * from sort_10t_full;
    (63430000000 rows in 22214.519971 secs)

    select first 2 tid, *, length(_c0) len from sort_10t_x100 order by id desc;
    9e87baa2-e5f3-11e8-b382-000d3a0d785a
    9e87bb37-e5f3-11e8-b382-000d3a0d785a
    (2 rows in 2853.020944 secs)
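The target table above is hash-partitioned on `_c0` into 25 partitions. As a rough illustration of what hash partitioning does, the sketch below maps each key to one of 25 buckets with a stable hash. The hash function Vector actually uses is not given in these slides, so SHA-256 here is a stand-in, and `partition_for` and `distribution` are hypothetical helper names.

```python
import hashlib
from collections import Counter

N_PARTITIONS = 25  # matches "25 partitions" in the table definition

def partition_for(key, n=N_PARTITIONS):
    """Map a key to a partition number in 0..n-1 using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n

def distribution(keys, n=N_PARTITIONS):
    """Count how many keys land in each partition."""
    return Counter(partition_for(k, n) for k in keys)

if __name__ == "__main__":
    counts = distribution(f"row-{i}" for i in range(10_000))
    print(f"partitions used: {len(counts)}")
    print(f"smallest: {min(counts.values())}, largest: {max(counts.values())}")
```

The same key always lands in the same partition, so rows with equal values of the partitioning column stay together while the overall load is spread across nodes and cores.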

Actian DataFlow
- Single platform for end-to-end data access, transformation, preparation, and predictive analysis
- Combines the KNIME (open source data mining platform) drag and drop visual workflow environment
- Eliminates memory constraints and data movement prior to analytic processing
- Desktop, remote server, or clusters, including Hadoop
- Transform, cleanse and analyze terabytes of data into actionable insights at record-breaking speed on commodity hardware

Some of our Vector Technology Partners
[Partner logo diagram. Labels: Data Integration; Business Intelligence & Analysis; Actian X; Actian Vector & Vector in Hadoop; JDBC 4.2; ODBC 3.5]

Thank you!