Approaching the Petabyte Analytic Database: What I learned

Disclaimer

This document is for informational purposes only and is subject to change at any time without notice. The information in this document is proprietary to Actian and no part of this document may be reproduced, copied, or transmitted in any form or for any purpose without the express prior written permission of Actian. This document is not intended to be binding upon Actian to any particular course of business, pricing, product strategy, and/or development. Actian assumes no responsibility for errors or omissions in this document. Actian shall have no liability for damages of any kind including without limitation direct, special, indirect, or consequential damages that may result from the use of these materials. Actian does not warrant the accuracy or completeness of the information, text, graphics, links, or other items contained within this material. This document is provided without a warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability, fitness for a particular purpose, or non-infringement.

Keith Bolam, Director of Engineering Projects
November 2018

- One petabyte of data?
- Where does the data come from?
- When do we need to access the data?
- Who is going to be accessing the data?
- Flashback to One Billion Rows: what next?

After the Terabyte we arrive at the newest data size
- Petabyte is becoming a norm in regular conversation
- The human brain has a capacity of about 2.5 petabytes of memories
- Databases do NOT handle a Petabyte of data with ease
- Databases need you

© 2018 Actian Corporation


Where does the data come from?
- OLTP systems
- Social platforms
- Log and timeseries
- IoT devices

What is one petabyte of data?

Today's iPhones are 128 GB or more, so 8 of them make about a terabyte, and one petabyte is roughly 8,000 phones. 8,000 is not a lot when we think of there being 73,734,000 iPhones in 2011...

But what can we do with it in a database?
- On the phones we generally store images, so one record may be 4 MB, maybe 12 MB. A video could be 2-4 GB.
- In a database we are more interested in small data, but lots of it.
- For images we would be interested in the metadata only.
- IoT devices can generate many GB per day.
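The phone arithmetic above can be checked in a few lines. This is only a back-of-envelope sketch: it assumes decimal units (1 GB = 10^9 bytes, 1 PB = 10^15 bytes), and the variable names are illustrative rather than taken from the talk.

```python
# Back-of-envelope check of the "phones per petabyte" figure.
# Assumption: decimal units (1 GB = 10**9 bytes, 1 PB = 10**15 bytes).
MB = 10**6
GB = 10**9
TB = 10**12
PB = 10**15

phone_bytes = 128 * GB  # phone capacity quoted on the slide

phones_per_tb = TB / phone_bytes  # 7.8125, so about 8 phones per terabyte
phones_per_pb = PB / phone_bytes  # 7812.5, so about 8,000 phones per petabyte

# Image sizes quoted on the slide: 4 MB to 12 MB per record.
images_per_pb_at_4mb = PB // (4 * MB)
images_per_pb_at_12mb = PB // (12 * MB)

print(f"phones per TB: {phones_per_tb}")
print(f"phones per PB: {phones_per_pb}")
print(f"4 MB images per PB: {images_per_pb_at_4mb:,}")
print(f"12 MB images per PB: {images_per_pb_at_12mb:,}")
```

With binary units the phone count comes out at exactly 8 per terabyte, which is where the round figure of 8,000 per petabyte comes from.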

When do we need to access data?
- Now
- Frequently
- Ad-hoc?
- A year's time or longer
- Rolling window

Petabyte implications on database analytic queries
1. Try not to allow users access to the whole dataset: THEY DO NOT NEED IT
2. Bring insight from the queries that have been run: USE MONITORING TOOLS
3. You do not need all the data in one place: PUT IT IN SMALLER CHUNKS
4. Put the data into the database in an appropriate way: USE NATURAL CLUSTERING
5. LET USERS ACCESS DATA EARLY: see point 2 above
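As a rough illustration of "smaller chunks" combined with natural clustering, the sketch below groups rows by calendar month and then selects only the chunks that a rolling-window query needs. The month granularity, the function names and the sample rows are all assumptions made for the example; the talk does not prescribe a particular scheme.

```python
from collections import defaultdict
from datetime import date


def chunk_key(day: date) -> str:
    """Name the chunk a row belongs to (one chunk per calendar month)."""
    return f"{day.year:04d}-{day.month:02d}"


def build_chunks(rows):
    """Group (day, payload) rows into per-month chunks."""
    chunks = defaultdict(list)
    for day, payload in rows:
        chunks[chunk_key(day)].append((day, payload))
    return chunks


def chunks_for_window(chunks, start: date, end: date):
    """Return the keys of the chunks that overlap the window [start, end]."""
    wanted = []
    for key in sorted(chunks):
        year, month = map(int, key.split("-"))
        first = date(year, month, 1)
        # First day of the following month (rolls over into the next year).
        next_first = date(year + (month == 12), month % 12 + 1, 1)
        if first <= end and next_first > start:
            wanted.append(key)
    return wanted


# Hypothetical sample data spread over one year.
rows = [
    (date(2018, 1, 15), "a"),
    (date(2018, 6, 3), "b"),
    (date(2018, 10, 20), "c"),
    (date(2018, 11, 5), "d"),
]
chunks = build_chunks(rows)

# A two-month rolling window touches two of the four chunks.
recent = chunks_for_window(chunks, date(2018, 10, 1), date(2018, 11, 30))
print(recent)
```

The point of the sketch is that a query over a recent window never has to look at the older chunks at all.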

Who will be accessing the data?
- Data Scientists
- AI applications and automated insights
- BI Users: Enterprise or Ad-Hoc

Data Scientists
- Complex exploratory queries
- Few in number
- Long running
- May generate more data than they consume!

Applications
- Dynamically generated queries
- Potential for poor SQL
- No humans involved to 'tune' SQL
- Rapid request potential

Business User
- Corporate
- On-demand queries
- Organised generally on Date, Customer, Region, Product

Spreading the effort on more nodes or bigger nodes?

Increasing the node size and capability (Azure HDInsight):
  D12  tiny      4 vCPU   28 GB
  D13  small     8 vCPU   56 GB
  D16  starter  16 vCPU  128 GB
Then they get much bigger and expensive.

8 Exabytes of Storage

The power of MANY: increase
- Nodes
- Cores
- Both cores and nodes

Considerations
- Bigger nodes = higher cost
- More nodes = greater joining cost
- More cores = greater vectorization capability
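To make the nodes-versus-node-size trade-off concrete, here is a small sizing sketch. It assumes the memory figures on the slide are in GB and uses a made-up target of 64 vCPUs; neither the target nor the helper name comes from the talk.

```python
import math

# Node sizes as listed on the slide; memory is assumed to be in GB.
NODE_TYPES = {
    "D12": {"vcpu": 4, "mem_gb": 28},
    "D13": {"vcpu": 8, "mem_gb": 56},
    "D16": {"vcpu": 16, "mem_gb": 128},
}

TARGET_VCPU = 64  # hypothetical target, chosen only for illustration


def nodes_needed(target_vcpu: int, vcpu_per_node: int) -> int:
    """Smallest node count that reaches the target number of vCPUs."""
    return math.ceil(target_vcpu / vcpu_per_node)


plan = {}
for name, spec in NODE_TYPES.items():
    n = nodes_needed(TARGET_VCPU, spec["vcpu"])
    plan[name] = {
        "nodes": n,
        "total_vcpu": n * spec["vcpu"],
        "total_mem_gb": n * spec["mem_gb"],
    }

for name, p in plan.items():
    print(f"{name}: {p['nodes']} nodes, {p['total_vcpu']} vCPU, {p['total_mem_gb']} GB")
```

The same core count can be reached with 16 small nodes or 4 larger ones. Per the considerations above, the first option means more cross-node joining and the second a higher cost per node.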

Flashback to One Billion Rows what next...

One Billion Rows
- Many devices and systems produce data
- Not all at the same rate
- Our perspective on what is happening is affected by our viewpoint

How did it work out?

Some Numbers
- Time to load 63,430,000,000 (63bn) rows: 22,214.5199 seconds
- Time to load 10m rows: 15.203613 seconds
- Rows per second loaded: 2,855,406
- At 80 bytes of data per record, about 12.5 trillion records are needed to make ONE petabyte (10^15 / 80)
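The figures above can be tied together with a little arithmetic. The sketch below assumes a decimal petabyte (10^15 bytes) and assumes the bulk load rate could be sustained unchanged all the way to a petabyte, which is an extrapolation and not a result from the talk.

```python
PB = 10**15          # decimal petabyte (assumption)
RECORD_BYTES = 80    # record size quoted on the slide

# Load timings quoted on the slide.
bulk_rows = 63_430_000_000
bulk_seconds = 22_214.5199
small_rows = 10_000_000
small_seconds = 15.203613

# Observed throughput for each load.
bulk_rows_per_second = bulk_rows / bulk_seconds      # about 2.86 million rows/s
small_rows_per_second = small_rows / small_seconds   # about 0.66 million rows/s

# Records needed for one petabyte at 80 bytes each.
records_per_pb = PB // RECORD_BYTES

# Extrapolated load time if the bulk rate were sustained (an assumption).
seconds_for_pb = records_per_pb / bulk_rows_per_second
days_for_pb = seconds_for_pb / 86_400

print(f"bulk load:  {bulk_rows_per_second:,.0f} rows/s")
print(f"small load: {small_rows_per_second:,.0f} rows/s")
print(f"records per PB: {records_per_pb:,}")
print(f"days to load one PB at the bulk rate: {days_for_pb:.1f}")
```

On these assumptions a single petabyte of 80-byte records takes on the order of seven weeks to load, which is why reducing the payload before moving it matters so much.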

- Consuming data while moving it helps
- Reducing the payload in the first place is even better
- If we eat while we work, it does get easier
- Leave the REALLY difficult tasks to someone else

Take-aways from this session

Planning, Preparation, Performance
- Look at the initial payload
- Identify what can be processed up front and never moved
- Consume during data movement
- Scale slowly and steadily
- Onboarding is time & cost sensitive
- Use insights to manage growth
- Enable user groups access progressively

Users, Applications, BI/ELT
- Users: self-serve can be the most disruptive of queries
- Applications: AI applications pre-defined by data scientists
- BI/ELT: business reporting, known reports that will be run at scheduled times

How Actian's products can change your business

Be on the leading edge of Cloud

100GB to 20TB
- Use Actian Vector on-premise today
- Use Actian Vector on Azure
- Use Actian Vector on AWS
- Use Actian Vector as a Service

20TB to 100TB
- Use Actian Vector on Azure
- Use Actian Vector on AWS
- Use Actian Vector as a Service

100TB +
- Use Actian Vector on Azure
- Use Actian Vector on AWS
- Use Actian Vector as a Service

Actian Vector & Dataflow November 2018

Actian Vector: Delivering fast, open, enterprise-grade analytics to top customers
- Achieve business insights not possible before
- Connect to all your data sources and systems
- Get to mission-critical production faster

Performance advantage derived through multiple innovations
1. Vectorized Processing
   - Single Instruction Multiple Data
2. Exploiting Chip Cache
   - Process data in chip not in RAM
3. Second Gen Columnar
   - Limit I/O
   - Most efficient real time updates on and off Hadoop
4. Smart Compression
   - Maximize throughput
   - Vectorized decompression in chip
   - Typically 4-6:1 compression ratio
5. Storage Indexes
   - Created automatically, simplifies schema
   - Quickly identify candidate data blocks for solving queries
   - Minimize I/O
6. Multi-core Parallelism
   - Maximize concurrency, parallelism and system resource utilization
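The storage-index idea (identify candidate data blocks, skip the rest) can be illustrated with a generic min/max index. This is a sketch of the general technique only, not Actian's implementation; the block size, class and function names are invented for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """A block of column values plus the min/max summary kept for it."""
    values: List[int]
    lo: int
    hi: int


def make_blocks(values: List[int], block_size: int) -> List[Block]:
    """Split a column into fixed-size blocks and record min/max per block."""
    blocks = []
    for i in range(0, len(values), block_size):
        chunk = values[i:i + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks


def scan_range(blocks: List[Block], low: int, high: int) -> Tuple[List[int], int]:
    """Return matching values and the number of blocks actually read."""
    hits: List[int] = []
    blocks_read = 0
    for block in blocks:
        # Skip blocks whose min/max range cannot contain a match.
        if block.hi < low or block.lo > high:
            continue
        blocks_read += 1
        hits.extend(v for v in block.values if low <= v <= high)
    return hits, blocks_read


# Naturally ordered data: each block covers a narrow range of values.
values = list(range(1, 101))
blocks = make_blocks(values, block_size=10)
hits, blocks_read = scan_range(blocks, 42, 47)
print(f"matched {len(hits)} values, read {blocks_read} of {len(blocks)} blocks")
```

The skipping only pays off when the data is loaded in an order that keeps each block's range narrow, which ties back to the earlier advice to use natural clustering.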

Actian Vector: The world's fastest analytic database
- Scans, aggregations, and joins over 1TB, 5TB, 10TB databases, single user and 20 concurrent users, on same underlying configurations
- Performance advantage over competition grows as data scales, query complexity increases, and user concurrency increases
- Independently tested by MCG using Berkeley AMPLab Big Data Benchmark
- 10X Faster, 14X Faster, 20X Faster, 100X Faster

Download the reports at https://www.actian.com/analytic-database/vector-cloud/

Benchmarking VectorH Vs SQL in Hadoop Competition
How many times faster is VectorH?

Query  VectorH  HAWQ    SparkSQL  Impala   Hive
Q1     1.34     158.2   155.4     585.4    490.1
Q2     1.29     21.46   74.98     81.81    63.57
Q3     3.15     32.06   62.38     167.7    266.6
Q4     0.18     38.21   68.27     163.18   59.08
Q5     1.94     36.38   146.5     242.5    DNF
Q6     0.19     20.19   5.1       1.81     63.63
Q7     2.37     44.74   180.2     369      721.8
Q8     1.8      48.38   174.6     276.2    625.6
Q9     11.77    766.4   264       1242.9   1077
Q10    1.21     32.97   56.62     69.97    230.5
Q11    1.28     12.48   30.28     35.04    246.1
Q12    0.37     31.75   66.97     45.67    65.78
Q13    3.69     27.97   47.65     180.8    140.7
Q14    1.13     19.47   6.92      13.95    53.23
Q15    1.56     31.58   11.16     15.19    556.5
Q16    1.73     14.17   33.81     47.52    92.51
Q17    1.21     173.2   244.9     581.53   711.7
Q18    1.63     87.08   254.7     1234     454.5
Q19    1.29     24.82   24.89     714.7    1010
Q20    2.47     42.84   31.56     74.25    100.5
Q21    1.99     84.7    1614      880.8    247.7
Q22    2.96     29.44   91.18     34.81    81.11

The Benchmark includes two refresh streams that delete and insert 1/1000th of the data. Note that only HIVE & Vector can complete these tests. The below query times reflect the time taken to complete the refresh streams and execute the query set after the refresh streams have been executed.

Hive:    RF1=34s  RF2=112s   GeoDiff=138.2%
VectorH: RF1=25s  RF2=12.5s  GeoDiff=99.3%

Query  VectorH  Hive
Q1     1.67     608.4
Q2     1.13     80.8
Q3     2.9      335.7
Q4     0.19     205.4
Q5     1.75     DNF
Q6     0.21     128
Q7     2.43     690.7
Q8     1.58     719.8
Q9     12.69    1150
Q10    1.21     334.4
Q11    1.32     218.7
Q12    0.35     170.5
Q13    3.67     143.8
Q14    0.89     130.7
Q15    1.48     596.7
Q16    1.64     101.4
Q17    1.22     891.2
Q18    1.67     594.6
Q19    1.45     1167
Q20    2.42     153.3
Q21    2.14     275.6
Q22    2.95     67.85
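One way to read the figures above is as per-query speedup ratios against VectorH. The sketch below does this for the first three queries only, assuming every figure is an elapsed time in the same unit (the slide does not state the unit). The geometric mean is added here as a common way to summarise ratios and is not claimed to be how the slide's GeoDiff value was derived.

```python
import math

# Q1 to Q3 from the first set of figures; unit assumed identical across systems.
times = {
    "VectorH": [1.34, 1.29, 3.15],
    "HAWQ": [158.2, 21.46, 32.06],
    "SparkSQL": [155.4, 74.98, 62.38],
    "Impala": [585.4, 81.81, 167.7],
    "Hive": [490.1, 63.57, 266.6],
}


def speedups(baseline, other):
    """Per-query ratio other/baseline: how many times faster the baseline is."""
    return [o / b for b, o in zip(baseline, other)]


def geometric_mean(xs):
    """Geometric mean, computed via logarithms."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))


summary = {}
for system, t in times.items():
    if system == "VectorH":
        continue
    ratios = speedups(times["VectorH"], t)
    summary[system] = (ratios, geometric_mean(ratios))

for system, (ratios, gm) in summary.items():
    formatted = ", ".join(f"{r:.0f}x" for r in ratios)
    print(f"{system}: {formatted} (geometric mean {gm:.0f}x)")
```

A DNF entry has no ratio, so a full version would need to decide how to treat queries that a system failed to finish.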



Actian Vector for Hadoop: Enterprise class SQL BI & analytics natively in Hadoop

ENTERPRISE GRADE
- Full ANSI SQL 2003 support: leverage existing SQL skills and standard BI tools and apps
- Fully ACID compliant: prevent inaccurate results by bringing transactional integrity to Hadoop
- Update capability: provide ability to update data in Hadoop without impacting query performance
- Native DBMS security: sleep well with enterprise class authentication, user and role-based security, data protection, and encryption

HIGH PERFORMANCE
- Highly performant: run existing apps faster and grow data without sacrificing performance
- High concurrency: allow simultaneous users and tasks to run without long wait times
- Mature, proven planner and fast optimizer: maximize usage of nodes, CPU, memory and cache with highly intelligent query execution plans
- Native in-Hadoop YARN: optimize usage of low-cost Hadoop infrastructure by automatically managing cluster resources across applications

OPEN
- Cloud: get started quickly with flexible deployment options on premise or across multiple cloud infrastructures
- Hadoop distribution agnostic: avoid vendor lock-in and provide customer flexibility across distributions
- Collaborative architecture: minimize risk by leveraging existing tools and benefitting from cross-industry innovations
- Open data formats: query native Hadoop file formats and allow API access to our own block format

Actian Vector and DataFlow & Spark

[Diagram labels: Ubiquitous Analytics; Custom Apps; Streaming; ISVs; Remote Data; Cloud Data & Applications; Local Data Sources; DataFlow; Spark; Traditional ETL; SQL; Vector Cloud]

- Actian Vector Spark Connector: Vector serves as a data source to Spark apps
- Actian Vector Spark Loader: ingest data from all available Spark sources using the Spark Loader
- Actian Vector Spark Connector: Vector external tables using Spark

Processing capability and Scale required: example

    drop table if exists sort_10t_x100;
    create table sort_10t_x100 (
        ID   UUID NOT NULL WITH DEFAULT,
        _c0  varchar(100)
    ) with PARTITION = (HASH on _c0 25 partitions);

    -- Create the EXTERNAL table over a single input file
    -- ('delimeter' is spelled as on the slide; Spark's CSV reader names this option 'delimiter')
    drop table if exists sort_10t;
    create external table sort_10t (_c0 varchar(100))
    USING SPARK WITH
        REFERENCE = 'adl:///user/actian/datasets/sort/10tb/pennyinput_10m-9860000000.1987.one',
        ROWS = 10000000,
        FORMAT = 'CSV',
        options = ('header' = 'false', 'delimeter' = ' ');

    -- Create the EXTERNAL table over all input files
    create external table sort_10t_full (_c0 varchar(100))
    USING SPARK WITH
        REFERENCE = 'adl:///user/actian/datasets/sort/10tb/*.one',
        ROWS = 10000000,
        FORMAT = 'CSV',
        options = ('header' = 'false', 'delimeter' = ' ');

    insert into sort_10t_x100 (_c0) select * from sort_10t;
    (10000000 rows in 15.203613 secs)

    insert into sort_10t_x100 (_c0) select * from sort_10t_full;
    (63430000000 rows in 22214.519971 secs)

    select first 2 tid, *, length(_c0) len from sort_10t_x100 order by id desc;
    9e87baa2-e5f3-11e8-b382-000d3a0d785a
    9e87bb37-e5f3-11e8-b382-000d3a0d785a
    (2 rows in 2853.020944 secs)

Actian DataFlow
- Single platform for end-to-end data access, transformation, preparation, and predictive analysis
- Combines the KNIME (open source data mining platform) drag and drop visual workflow environment
- Eliminates memory constraints, and data movement prior to analytic processing
- Desktop, remote server, or clusters, including Hadoop
- Transform, cleanse and analyze terabytes of data into actionable insights at record-breaking speed on commodity hardware

Some of our Vector Technology Partners
- Data Integration
- Business Intelligence & Analysis
- Actian X, Actian Vector & Vector in Hadoop: JDBC 4.2, ODBC 3.5

Thank you!