GPU-Accelerated Analytics on your Data Lake.

Roxanne Edwards

1 GPU-Accelerated Analytics on your Data Lake.

2 Data Lake

3 Data Swamp

4 ETL Hell DATA LAKE [diagram: data-flow arrows]

5 COMMON DATA LAYER

6 Simplify Data Storage: SCHEMA, METADATA, DATA

7 SQL Warehouse on Data Lake

8 BlazingDB: How it works. DATA LAKE; Compression/Decompression; Filtering (Predicate Pushdown); Aggregations; Transformations; Joins; Sorting/Ordering; Local Disk; HDFS; AWS S3; RAM Cache (Hot); Disk Cache (Medium); HDD; SSD
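The slide lists "Filtering (Predicate Pushdown)" without showing what it means. One common form of the idea, sketched here in Python, is to keep min/max statistics per block of a column and skip any block whose statistics prove that no row can match the filter. The slides do not say how BlazingDB implements pushdown, so the `Block` layout and the function names below are hypothetical and only illustrate the general technique.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """A chunk of one column plus min/max statistics kept as metadata (hypothetical layout)."""
    min_value: int
    max_value: int
    values: List[int]


def make_block(values: List[int]) -> Block:
    # Statistics are computed once, when the block is written.
    return Block(min(values), max(values), list(values))


def scan_less_than(blocks: List[Block], threshold: int) -> Tuple[List[int], int]:
    """Return (values < threshold, number of blocks actually read)."""
    matches: List[int] = []
    blocks_read = 0
    for block in blocks:
        # Pushdown step: the metadata alone shows no value can match, so skip the block.
        if block.min_value >= threshold:
            continue
        blocks_read += 1
        matches.extend(v for v in block.values if v < threshold)
    return matches, blocks_read


blocks = [make_block([1, 5, 9]), make_block([20, 25, 30]), make_block([8, 40, 41])]
matches, blocks_read = scan_less_than(blocks, 10)
print(matches, blocks_read)  # [1, 5, 9, 8] 2
```

Only two of the three blocks are read here; the saving grows with how well the data is clustered on the filtered column.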

9 BlazingDB Multi-nodal Cluster

10 Shared Data Architecture DATA LAKE

11 The Nays: No Ingest; No Duplication; No BlazingDB-Specific ETL; No Consistency Management; No Vendor Lock-in

12 The Yays: Incredibly Fast SQL; Scalable, On-Demand Data Warehouse; Multi-Terabyte Queries; Data Sharing (Across Clusters and Other Tools); High Concurrency

13 DEMO

14 Demo - Architecture: HDFS on Azure; Azure GPU Servers (NC24 V1); 4 Servers

15 Queries: BlazingDB 4 Node Query times (Lower is better) [bar chart: y-axis SECONDS; x-axis QUERIES, Query 1 to Query 5; series Cold, Medium (Disk cache only), Hot]
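The three series in this chart correspond to where the data is found when a query runs: nowhere local (Cold), in the disk cache only (Medium), or in the RAM cache (Hot). The toy Python sketch below shows how such a read path could behave. It is an assumption for illustration: the slides do not describe BlazingDB's actual cache population or eviction policy, and the `TieredCache` class and its methods are hypothetical.

```python
class TieredCache:
    """Toy two-tier cache in front of a slow source (a stand-in for the data lake)."""

    def __init__(self, source):
        self.source = source  # cold: stand-in for HDFS or S3
        self.disk = {}        # medium: stand-in for a local disk cache
        self.ram = {}         # hot: stand-in for the RAM cache

    def read(self, key):
        """Return (data, tier), where tier names the place the data was found."""
        if key in self.ram:
            return self.ram[key], "hot"
        if key in self.disk:
            data = self.disk[key]
            self.ram[key] = data      # promote to RAM for the next read
            return data, "medium"
        data = self.source[key]       # cold read from the data lake
        self.disk[key] = data
        self.ram[key] = data
        return data, "cold"

    def drop_ram(self):
        """Simulate losing the RAM cache while the disk cache survives."""
        self.ram.clear()


cache = TieredCache({"lineitem/part-0": b"column bytes"})
tiers = [cache.read("lineitem/part-0")[1]]      # first read
tiers.append(cache.read("lineitem/part-0")[1])  # repeat read
cache.drop_ram()
tiers.append(cache.read("lineitem/part-0")[1])  # read after RAM is cleared
print(tiers)  # ['cold', 'hot', 'medium']
```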

16 Query 1
Query:
    select l_returnflag, l_linestatus,
        sum(l_quantity) as sum_qty,
        sum(l_extendedprice) as sum_base_price,
        sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
        sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
        avg(l_quantity) as avg_qty,
        avg(l_extendedprice) as avg_price,
        avg(l_discount) as avg_disc,
        count(l_quantity) as count_order
    from lineitem
    where l_shipdate <=
    group by l_returnflag, l_linestatus
    order by l_returnflag, l_linestatus;
[bar chart: SECONDS for Cold, Medium (Disk cache only), Hot]
Data Points: 6 billion row table; Many aggregations/transformations
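To make the shape of Query 1 concrete, the sketch below runs it on a four-row `lineitem` table using Python's built-in `sqlite3` module. This only illustrates the query's semantics (filter, group, aggregate, order); it says nothing about BlazingDB's performance or engine. The ship-date cutoff is missing from the slide, so the value `1998-09-01` used here is an assumption, as are the sample rows and the alias assignment of `sum_base_price` and `sum_disc_price`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """create table lineitem (
           l_returnflag text, l_linestatus text, l_quantity integer,
           l_extendedprice real, l_discount real, l_tax real, l_shipdate text)"""
)
# Made-up sample rows; the last one ships after the cutoff and is filtered out.
conn.executemany(
    "insert into lineitem values (?, ?, ?, ?, ?, ?, ?)",
    [
        ("A", "F", 10, 100.0, 0.5, 0.0, "1995-01-01"),
        ("A", "F", 30, 300.0, 0.0, 0.0, "1996-01-01"),
        ("N", "O", 5, 50.0, 0.0, 0.0, "1997-01-01"),
        ("N", "O", 7, 70.0, 0.0, 0.0, "1999-01-01"),
    ],
)

QUERY_1 = """
select l_returnflag, l_linestatus,
       sum(l_quantity) as sum_qty,
       sum(l_extendedprice) as sum_base_price,
       sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
       sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
       avg(l_quantity) as avg_qty,
       avg(l_extendedprice) as avg_price,
       avg(l_discount) as avg_disc,
       count(l_quantity) as count_order
from lineitem
where l_shipdate <= ?
group by l_returnflag, l_linestatus
order by l_returnflag, l_linestatus
"""

# The cutoff date is an assumed value; ISO dates compare correctly as strings.
rows = conn.execute(QUERY_1, ("1998-09-01",)).fetchall()
for row in rows:
    print(row)
```

Each output row is one (`l_returnflag`, `l_linestatus`) group, so the result stays tiny no matter how many rows the table holds; the cost is in scanning and aggregating the input.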

17 Query 2
Query:
    select lineitem.l_orderkey,
        sum(lineitem.l_extendedprice*(1 - lineitem.l_discount)) as revenue,
        orders.o_orderdate, orders.o_shippriority
    from customer
    inner join orders on customer.c_custkey = orders.o_custkey
    inner join lineitem on lineitem.l_orderkey = orders.o_orderkey
    where customer.c_mktsegment = 'BUILDING'
        and orders.o_orderdate < ' '
        and lineitem.l_shipdate > ' '
    group by lineitem.l_orderkey, orders.o_orderdate, orders.o_shippriority
    order by revenue desc, orders.o_orderdate;
[bar chart: SECONDS for Cold, Medium (Disk cache only), Hot]
Data Points: Join 6B rows to 1.5B rows to 150M rows; Many aggregations/transformations; Order (sorting)
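The following sketch runs Query 2 on a few hand-made rows with Python's built-in `sqlite3` module, to show how the three-way join, the filters, and the revenue ordering combine. The two date literals are blank in the slide; `1995-03-15` is an assumed value for both, and all sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table customer (c_custkey integer, c_mktsegment text);
    create table orders (o_orderkey integer, o_custkey integer,
                         o_orderdate text, o_shippriority integer);
    create table lineitem (l_orderkey integer, l_extendedprice real,
                           l_discount real, l_shipdate text);
    """
)
conn.executemany("insert into customer values (?, ?)",
                 [(1, "BUILDING"), (2, "MACHINERY")])
conn.executemany("insert into orders values (?, ?, ?, ?)", [
    (10, 1, "1995-03-01", 0),
    (11, 2, "1995-03-01", 0),  # wrong market segment
    (12, 1, "1995-04-01", 0),  # ordered after the cutoff
    (13, 1, "1995-02-01", 0),
])
conn.executemany("insert into lineitem values (?, ?, ?, ?)", [
    (10, 100.0, 0.5, "1995-03-20"),
    (10, 200.0, 0.0, "1995-03-25"),
    (10, 999.0, 0.0, "1995-03-01"),  # shipped before the cutoff
    (11, 500.0, 0.0, "1995-03-20"),
    (12, 50.0, 0.0, "1995-04-10"),
    (13, 400.0, 0.0, "1995-03-20"),
])

QUERY_2 = """
select lineitem.l_orderkey,
       sum(lineitem.l_extendedprice*(1 - lineitem.l_discount)) as revenue,
       orders.o_orderdate, orders.o_shippriority
from customer
inner join orders on customer.c_custkey = orders.o_custkey
inner join lineitem on lineitem.l_orderkey = orders.o_orderkey
where customer.c_mktsegment = 'BUILDING'
  and orders.o_orderdate < ?
  and lineitem.l_shipdate > ?
group by lineitem.l_orderkey, orders.o_orderdate, orders.o_shippriority
order by revenue desc, orders.o_orderdate
"""

# Both cutoff dates are assumed values, not taken from the slide.
result = conn.execute(QUERY_2, ("1995-03-15", "1995-03-15")).fetchall()
print(result)  # [(13, 400.0, '1995-02-01', 0), (10, 250.0, '1995-03-01', 0)]
```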

18 Query 3
Query:
    select nation.name,
        sum(lineitem.l_extendedprice * (1 - lineitem.l_discount)) as revenue
    from customer
    inner join orders on customer.c_custkey = orders.o_custkey
    inner join lineitem on lineitem.l_orderkey = orders.o_orderkey
    inner join supplier on lineitem.l_suppkey = supplier.s_suppkey
    inner join nation on supplier.s_nationkey = nation.nation_key
    inner join region on nation.region_key = region.r_regionkey
    where supplier.s_nationkey = nation.nation_key
        and region.r_name = 'ASIA'
        and orders.o_orderdate >= ' '
        and orders.o_orderdate < ' '
    group by nation.name
    order by revenue desc
[bar chart: SECONDS for Cold, Medium (Disk cache only), Hot]
Data Points: Join 6B rows to 1.5B rows to 150M rows (and many small joins); Multiple aggregations/transformations; Order (sorting)

19 Query 4
Query:
    select sum(l_extendedprice) as sum_exprice,
        sum(l_discount) as sum_discount
    from lineitem
    where l_shipdate >= ' '
        and l_shipdate < ' '
        and l_discount >= 0.05
        and l_discount <= 0.07
        and l_quantity < 24
[bar chart: SECONDS for Cold, Medium (Disk cache only), Hot]
Data Points: 6B row table; Multiple aggregations/transformations

20 Query 5
Query:
    select supplier.s_acctbal, supplier.s_suppkey, nation.name,
        part.p_partkey, part.p_mfgr, supplier.s_address,
        supplier.s_phone, supplier.s_comment
    from supplier
    inner join partsupp on supplier.s_suppkey = partsupp.ps_suppkey
    inner join nation on supplier.s_nationkey = nation.nation_key
    inner join region on nation.region_key = region.r_regionkey
    inner join part on part.p_partkey = partsupp.ps_partkey
    where part.p_size = 15
        and part.p_type in (
            'ECONOMY ANODIZED BRASS', 'ECONOMY BRUSHED BRASS', 'ECONOMY BURNISHED BRASS',
            'ECONOMY PLATED BRASS', 'ECONOMY POLISHED BRASS',
            'LARGE ANODIZED BRASS', 'LARGE BRUSHED BRASS', 'LARGE BURNISHED BRASS',
            'LARGE PLATED BRASS', 'LARGE POLISHED BRASS',
            'SMALL ANODIZED BRASS', 'SMALL BRUSHED BRASS', 'SMALL BURNISHED BRASS',
            'SMALL PLATED BRASS', 'SMALL POLISHED BRASS',
            'STANDARD ANODIZED BRASS', 'STANDARD BRUSHED BRASS', 'STANDARD BURNISHED BRASS',
            'STANDARD PLATED BRASS', 'STANDARD POLISHED BRASS')
        and region.r_name = 'EUROPE'
    order by supplier.s_acctbal desc, supplier.s_suppkey, nation.name, part.p_partkey
[bar chart: SECONDS for Cold, Medium (Disk cache only), Hot]
Data Points: Join multiple tables; Many aggregations/transformations; String comparisons

21 Data Pipeline: Common Data Layer; Coming Soon; STORAGE (Data Lake); GPU Data Frame; Apache Arrow; INGEST

22 Questions?
