Using Data Virtualization to Accelerate Time-to-Value From Your Data: Integrating Distributed Data in Real Time

Speaker: Paul Moxon, VP Data Architectures and Chief Evangelist @ Denodo Technologies
Data, Data Everywhere, And Not a Thought to Think

Agile Analytics Architecture
Data Pipeline Problem

Today, 70-80% of the pipeline is spent on data discovery & preparation (data discovery, data extraction, data preprocessing) and only 20-30% on analysis and actions (data analysis, decision making).

Shrinking the preparation stages rebalances this: 50-60% on data preparation and 40-50% on analysis and actions (data analysis, decision making).
Agile Analytics Architecture - Revisited: DATA VIRTUALIZATION
What is Data Virtualization?

"Data virtualization integrates disparate data sources in real time or near-real time to meet demands for analytics and transactional data." (Create a Road Map For A Real-time, Agile, Self-Service Data Platform, Forrester Research, Dec 16, 2015)

Data virtualization works in three steps:
1. CONNECT to disparate data sources: databases & warehouses, cloud/SaaS applications, big data, NoSQL, web, XML, Excel, PDF, Word, and more (via SQL, MDX, web services, and big data APIs; from more structured to less structured).
2. COMBINE related data into views: normalized views of disparate data; discover, transform, prepare, improve quality, and integrate.
3. CONSUME in business applications: publish, govern, share, deliver, and collaborate. Data consumers include enterprise applications, reporting, BI, portals, ESB, mobile, web, and users, through query, search, browse, request/reply, and event-driven interfaces.
How Does It Work?

Denodo's Data Virtualization Platform is layered, reading bottom-up:
- Sources: RDBMS/EDW, S3 bucket, REST web service, Salesforce, multidimensional databases, Hadoop, web sites.
- Base views (source abstraction): client address, client type, company, invoicing, product, service logs, web usage, incidents.
- Combine, transform & integrate: derived views such as customer, invoice, product, service usage, and incident, composed into a Customer 360 view.
- Publish (information self-service): exposed via SQL, SOAP, REST, OData, etc.
Data Virtualization Connects the Users to the Data That They Need

Cliff's Notes version (TL;DR):
1. Data virtualization allows you to connect to any data source.
2. You can combine and transform that data into the format needed by the consumer.
3. The data can be exposed to the consumers in a format and interface that is usable by them. Typically, consumers use the tools that they already use; they don't have to learn new tools and skills to access the data.
4. All of this can be done without copying or moving the data. The data stays in the original sources (databases, applications, files, etc.) and is retrieved, in real time, on demand.
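The four points above can be sketched in miniature with sqlite3 standing in for both the sources and the virtual layer (a minimal illustration with hypothetical customers and orders tables, not Denodo itself): two independent databases are connected, combined into a view, and consumed on demand, with no data copied.

```python
import sqlite3

# Two hypothetical "data sources", each attached as its own in-memory database.
con = sqlite3.connect(":memory:")
con.execute("ATTACH ':memory:' AS crm")
con.execute("ATTACH ':memory:' AS sales")

con.execute("CREATE TABLE crm.customers (id INTEGER, name TEXT)")
con.execute("CREATE TABLE sales.orders (customer_id INTEGER, amount REAL)")
con.executemany("INSERT INTO crm.customers VALUES (?, ?)",
                [(1, "Acme"), (2, "Globex")])
con.executemany("INSERT INTO sales.orders VALUES (?, ?)",
                [(1, 100.0), (1, 50.0), (2, 75.0)])

# The combined view: nothing is copied; the join across the two
# sources runs on demand, at query time.
con.execute("""
    CREATE TEMP VIEW customer_360 AS
    SELECT c.name, SUM(o.amount) AS total
    FROM crm.customers c
    JOIN sales.orders o ON o.customer_id = c.id
    GROUP BY c.name
""")

result = con.execute("SELECT * FROM customer_360 ORDER BY name").fetchall()
print(result)  # [('Acme', 150.0), ('Globex', 75.0)]
```

A TEMP view is used because in SQLite only temporary objects may reference tables across attached databases; in a real deployment the virtual layer plays this role.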
Example Using Microsoft Power BI: Accessing data for reports and dashboards
OK... What About Performance? (The first question that everyone asks)

1. Query delegation: moving the processing to the data.
2. Advanced query rewriting for analytical queries: partial aggregation pushdown, JOIN-UNION reordering, branch pruning, etc.
3. Offloading of processing to an MPP cluster: take advantage of your Hadoop or Spark cluster.
4. Caching: cache data from slow data sources ("temporary materialization"); the cache can be your Hadoop or Spark cluster.
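Technique 1 can be demonstrated in miniature with sqlite3 standing in for a remote source (a sketch with a hypothetical sales table, not Denodo's actual engine): both plans return the same answer, but delegation ships one row across the wire instead of a hundred.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # stands in for a remote data source
con.execute("CREATE TABLE sales (region TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("EMEA", 10.0)] * 3 + [("APAC", 5.0)] * 97)

# Naive federation: pull every row across the wire, then filter and sum locally.
pulled = con.execute("SELECT region, amount FROM sales").fetchall()
naive = sum(amount for region, amount in pulled if region == "EMEA")

# Query delegation: push the predicate and the aggregation to the source,
# so a single row crosses the wire instead of 100.
(delegated,) = con.execute(
    "SELECT SUM(amount) FROM sales WHERE region = 'EMEA'").fetchone()

print(len(pulled), naive, delegated)  # 100 30.0 30.0
```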
Example: Logical Data Warehouse

Total sales by retailer and product during the last month for the brand ACME. The fact table (sales) and the time, retailer, and product dimensions are split between the EDW and the MDM system, but the user issues a single query against the data virtualization platform:

SELECT retailer.name, product.name, SUM(sales.amount)
FROM sales
JOIN retailer ON sales.retailer_fk = retailer.id
JOIN product ON sales.product_fk = product.id
JOIN time ON sales.time_fk = time.id
WHERE time.date < add_months(current_timestamp, -1)
  AND product.brand = 'ACME'
GROUP BY product.name, retailer.name
Query Before Optimization

Without optimization, the platform fetches each table separately and performs the joins and the aggregation itself:

- SELECT sales.retailer_fk, sales.product_fk, sales.time_fk, sales.amount FROM sales (300,000,000 rows)
- SELECT retailer.name, retailer.id FROM retailer (100 rows)
- SELECT product.name, product.id FROM product WHERE product.brand = 'ACME' (10 rows)
- SELECT time.date, time.id FROM time WHERE time.date < add_months(current_timestamp, -1) (30 rows)

The three joins produce 10,000,000 rows, which are then grouped by product.name, retailer.name in the platform.
Step 1: Apply JOIN Reordering to Maximize Delegation

The join between sales and time is delegated to the source, so the time filter is applied there:

- SELECT sales.retailer_fk, sales.product_fk, sales.amount FROM sales JOIN time ON sales.time_fk = time.id WHERE time.date < add_months(current_timestamp, -1) (30,000,000 rows)
- SELECT retailer.name, retailer.id FROM retailer (100 rows)
- SELECT product.name, product.id FROM product WHERE product.brand = 'ACME' (10 rows)

The remaining joins still produce 10,000,000 rows for the GROUP BY product.name, retailer.name.
Step 2: Partial Aggregation Pushdown

The JOIN is on foreign keys (1-to-many) and the GROUP BY is on attributes from the dimensions, so the partial aggregation pushdown optimization can be applied: the source pre-aggregates by the foreign keys.

- SELECT sales.retailer_fk, sales.product_fk, SUM(sales.amount) FROM sales JOIN time ON sales.time_fk = time.id WHERE time.date < add_months(current_timestamp, -1) GROUP BY sales.retailer_fk, sales.product_fk (10,000 rows)
- SELECT retailer.name, retailer.id FROM retailer (100 rows)
- SELECT product.name, product.id FROM product WHERE product.brand = 'ACME' (10 rows)

The joins now produce only 1,000 rows for the final GROUP BY product.name, retailer.name.
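A scaled-down check of this rewrite (hypothetical toy data in sqlite3: 10 products, 100 retailers, and 10,000 fact rows instead of 10 / 100 / 300M): pre-aggregating by the foreign keys at the source and finishing the GROUP BY in the virtual layer gives exactly the same result as the unoptimized plan.

```python
import random
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE product  (id INTEGER PRIMARY KEY, name TEXT, brand TEXT);
CREATE TABLE retailer (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sales    (retailer_fk INTEGER, product_fk INTEGER, amount REAL);
""")
con.executemany("INSERT INTO product VALUES (?, ?, ?)",
                [(i, f"p{i}", "ACME" if i < 5 else "Other") for i in range(10)])
con.executemany("INSERT INTO retailer VALUES (?, ?)",
                [(i, f"r{i}") for i in range(100)])
random.seed(0)
con.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                [(random.randrange(100), random.randrange(10), 1.0)
                 for _ in range(10000)])

# Unoptimized plan: ship all 10,000 fact rows, then join and aggregate.
naive = con.execute("""
    SELECT r.name, p.name, SUM(s.amount)
    FROM sales s
    JOIN retailer r ON s.retailer_fk = r.id
    JOIN product  p ON s.product_fk  = p.id
    WHERE p.brand = 'ACME'
    GROUP BY p.name, r.name ORDER BY 1, 2""").fetchall()

# Pushed-down plan: the source pre-aggregates by the foreign keys
# (at most 100 x 10 = 1,000 rows shipped); the virtual layer finishes.
pushed = con.execute("""
    SELECT r.name, p.name, SUM(partial.total)
    FROM (SELECT retailer_fk, product_fk, SUM(amount) AS total
          FROM sales GROUP BY retailer_fk, product_fk) partial
    JOIN retailer r ON partial.retailer_fk = r.id
    JOIN product  p ON partial.product_fk  = p.id
    WHERE p.brand = 'ACME'
    GROUP BY p.name, r.name ORDER BY 1, 2""").fetchall()

assert naive == pushed  # identical results, far fewer rows moved
```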
Step 3: Choose the Best JOIN Methods

The optimizer selects the right JOIN strategy based on costs from data-volume estimations. Here it picks a NESTED JOIN against the sales source and a HASH JOIN against retailer:

- First it retrieves the ACME products: SELECT product.name, product.id FROM product WHERE product.brand = 'ACME' (10 rows)
- The nested join then delegates the aggregated sales query with the product keys injected as the condition product.id IN (1, 2, …): SELECT sales.retailer_fk, sales.product_fk, SUM(sales.amount) FROM sales JOIN time ON sales.time_fk = time.id WHERE time.date < add_months(current_timestamp, -1) GROUP BY sales.retailer_fk, sales.product_fk (1,000 rows)
- The 1,000 pre-aggregated rows are hash-joined with retailer (100 rows: SELECT retailer.name, retailer.id FROM retailer) and grouped by product.name, retailer.name.
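Both strategies can be mimicked in a few lines (a sketch with made-up ACME products in sqlite3, not the platform's actual operators): the hash join builds a lookup table from the small dimension and probes it, while the nested join fetches the selective dimension first and injects its keys into the query delegated to the fact source.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE product (id INTEGER, name TEXT, brand TEXT)")
con.execute("CREATE TABLE sales (product_fk INTEGER, amount REAL)")
con.executemany("INSERT INTO product VALUES (?, ?, ?)",
                [(1, "anvil", "ACME"), (2, "rocket", "ACME"), (3, "tnt", "Other")])
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [(1, 500.0), (2, 300.0), (3, 99.0)])

# HASH JOIN: ship both small inputs, build a hash table on the dimension,
# probe it with the pre-aggregated fact rows.
dim = dict(con.execute("SELECT id, name FROM product WHERE brand = 'ACME'"))
fact = con.execute(
    "SELECT product_fk, SUM(amount) FROM sales GROUP BY product_fk")
hash_join = sorted((dim[fk], total) for fk, total in fact if fk in dim)

# NESTED JOIN: fetch the selective side first, then push its keys into the
# query delegated to the fact source as an IN list (as the optimizer would).
keys = ",".join(str(i) for i in dim)
nested = sorted((dim[fk], total) for fk, total in con.execute(
    f"SELECT product_fk, SUM(amount) FROM sales "
    f"WHERE product_fk IN ({keys}) GROUP BY product_fk"))

assert hash_join == nested  # same result; they differ only in data movement
```

The hash join moves both (already small) inputs; the nested join avoids even scanning the non-matching fact rows at the source, which is why the cost-based optimizer chooses between them from row-count estimates.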
Leveraging the Power of a Hadoop Cluster

Example: joining Current Sales (68M rows) and Hist. Sales (220M rows) with Customer (2M rows), grouping sales by customer ID (2M rows) and then by state.

1. Partial aggregation pushdown: maximizes source processing and dramatically reduces network traffic.
2. Integration with the cost-based optimizer: based on data-volume estimations and the cost of these particular operations, the CBO can decide to move all or part of the execution tree to the MPP system.
3. On-demand data transfer: the DV platform automatically generates and uploads Parquet files.
4. Integration with local data: the engine detects when data is cached or comes from a local table already in the MPP system.
5. Fast parallel execution: support for Spark, Presto, and Impala for fast analytical processing on inexpensive Hadoop-based solutions.

Execution times by optimization technique:
- Others: ~19 min (simple federation)
- No MPP: 43 sec (aggregation pushdown)
- With MPP: 26 sec (aggregation pushdown + MPP integration, Impala, 4 nodes)
Example Using Zeppelin Analytics Notebook: Accessing data for analytics and ML
Three Key Takeaways

FIRST Takeaway: Data users have access to a vast array of data and the means to process that data to gain insights; the bottleneck is finding, gathering, and preparing the data.

SECOND Takeaway: Up to 80% of a user's time is spent preparing the data rather than analyzing it. Reducing this time increases the valuable analysis and insights that they deliver.

THIRD Takeaway: Data virtualization is a technology that allows a variety of users to quickly and easily find, prepare, and access data, from a vast array of data sources, for their analytical and ML models.
Thanks! www.denodo.com | info@denodo.com

Copyright Denodo Technologies. All rights reserved. Unless otherwise specified, no part of this PDF file may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm, without prior written authorization from Denodo Technologies.