Using Data Virtualization to Accelerate Time-to-Value From Your Data. Integrating Distributed Data in Real Time


Speaker: Paul Moxon, VP Data Architectures and Chief Evangelist, Denodo Technologies

Data, Data Everywhere, And Not a Thought to Think

Agile Analytics Architecture

Data Pipeline Problem
70-80%: Data Discovery & Preparation (Data Discovery, Data Extraction, Data Preprocessing)
20-30%: Analysis, Actions (Data Analysis, Decision Making)

Data Pipeline Problem
50-60%: Data Preparation
40-50%: Analysis, Actions (Data Analysis, Decision Making)

Agile Analytics Architecture - Revisited: Data Virtualization

What is Data Virtualization?
"Data virtualization integrates disparate data sources in real time or near-real time to meet demands for analytics and transactional data."
Source: Create a Road Map For A Real-time, Agile, Self-Service Data Platform, Forrester Research, Dec 16, 2015
1. Connect to disparate data sources
2. Combine related data into views
3. Consume in business applications
[Diagram: the data virtualization layer between data consumers and disparate data sources]
DATA CONSUMERS (analytical to operational): Enterprise Applications, Reporting, BI, Portals, ESB, Mobile, Web, Users
Consumer access: Multiple Protocols, Formats; Query, Search, Browse; Request/Reply, Event Driven; Secure Delivery
CONNECT: Normalized views of disparate data
COMBINE: Discover, Transform, Prepare, Improve Quality, Integrate
PUBLISH: Share, Deliver, Publish, Govern, Collaborate
Source access: SQL, MDX; Web Services; Big Data APIs; Web Automation and Indexing
DISPARATE DATA SOURCES (more structured to less structured): Databases & Warehouses, Cloud/SaaS Applications, Big Data, NoSQL, Web, XML, Excel, PDF, Word...

How Does It Work?
[Diagram: Data Virtualization Platform, from top to bottom]
Publish: SQL, SOAP, REST, ODATA, etc.; Denodo's Information Self Service
Combine, Transform & Integrate: Customer 360, built from Customer, Invoice, Product, Service Usage, Incident
Base View (Source Abstraction): Client Address, Client Type, Company, Invoicing, Product, Service, Logs, Web Usage, Incidents
Sources: RDBMS/EDW, S3 Bucket, REST Web Service, Salesforce, Multidimensional, Hadoop, Web Site

Data Virtualization Connects the Users to the Data That They Need
Cliff Notes version (TL;DR)
1. Data Virtualization allows you to connect to any data source.
2. You can combine and transform that data into the format needed by the consumer.
3. The data can be exposed to the consumers in a format and interface that is usable by them. Typically, consumers keep using the tools that they already use; they don't have to learn new tools and skills to access the data.
4. All of this can be done without copying or moving the data. The data stays in the original sources (databases, applications, files, etc.) and is retrieved, in real time, on demand.
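To make the "combine without copying" idea concrete, here is a minimal sketch in Python. It is not how the Denodo Platform is implemented; the two sources (an in-memory SQLite database and a CSV string), the table and column names, and the function customer_invoice_view are all hypothetical stand-ins. The view holds no data of its own: each call goes back to both sources and combines the rows on demand.

```python
import csv
import io
import sqlite3

# Source 1 (hypothetical): a relational database holding customers.
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
crm.executemany("INSERT INTO customer VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])

# Source 2 (hypothetical): a CSV export of invoices, standing in for a file or web service.
INVOICES_CSV = "customer_id,amount\n1,100.0\n1,50.0\n2,75.0\n"


def customer_invoice_view():
    """Combine both sources on demand; nothing is stored in the view itself."""
    names = dict(crm.execute("SELECT id, name FROM customer"))
    totals = {}
    for row in csv.DictReader(io.StringIO(INVOICES_CSV)):
        cid = int(row["customer_id"])
        totals[cid] = totals.get(cid, 0.0) + float(row["amount"])
    return [
        {"customer": names[cid], "total_invoiced": total}
        for cid, total in sorted(totals.items())
    ]


view_rows = customer_invoice_view()
```

A consumer only sees view_rows; whether the data came from a database, a file, or a service is hidden behind the view.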

Example using Microsoft Power BI: Accessing data for Reports and Dashboards

OK, What About Performance? (The first question that everyone asks)
1. Query delegation: moving the processing to the data
2. Advanced query rewriting for analytical queries: partial aggregation pushdown, JOIN-UNION reordering, branch pruning, etc.
3. Offloading of processing to an MPP cluster: take advantage of your Hadoop or Spark cluster
4. Caching: cache data from slow data sources ("temporary materialization"). The cache can be your Hadoop or Spark cluster.
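The caching item can be illustrated with a small time-to-live cache. This is only a sketch of the general "temporary materialization" idea under simple assumptions, not the Denodo cache: the class CachedView, the slow_source function, and the 60-second lifetime are hypothetical, and the clock is injected so the example runs instantly.

```python
import time


class CachedView:
    """Serve rows from a temporary copy until it is older than ttl_seconds."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch      # callable that queries the slow source
        self._ttl = ttl_seconds
        self._clock = clock
        self._rows = None
        self._loaded_at = None

    def rows(self):
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._ttl:
            self._rows = self._fetch()  # cache miss or expired: go to the source
            self._loaded_at = now
        return self._rows


calls = []


def slow_source():
    calls.append(1)  # count how often the source is actually hit
    return [("ACME", 3)]


fake_now = [0.0]
view = CachedView(slow_source, ttl_seconds=60, clock=lambda: fake_now[0])
view.rows()
view.rows()          # served from the cache, the source is not queried
fake_now[0] = 61.0
view.rows()          # the copy has expired, so the source is queried again
```

Three reads result in two hits on the slow source; the lifetime is the knob that trades freshness against load on the source.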

Example: Logical Data Warehouse
Total sales by retailer and product during the last month for the brand ACME
Data Virtualization Platform views: Time Dimension, Fact table (sales), Retailer Dimension, Product Dimension
Sources: EDW, MDM
    SELECT retailer.name, product.name, SUM(sales.amount)
    FROM sales
    JOIN retailer ON sales.retailer_fk = retailer.id
    JOIN product ON sales.product_fk = product.id
    JOIN time ON sales.time_fk = time.id
    WHERE time.date < ADDMONTH(NOW(),-1) AND product.brand = 'ACME'
    GROUP BY product.name, retailer.name
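For readers who want to run the shape of this query, here is a small sketch using Python's built-in sqlite3 module. The sample rows are made up, the time dimension is named time_dim, and the ADDMONTH(NOW(),-1) expression is replaced by a fixed cutoff date passed as a parameter so that the result is deterministic; all of these are assumptions for illustration only. In a real logical data warehouse the four tables would live in different sources rather than in one database.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE retailer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE product  (id INTEGER PRIMARY KEY, name TEXT, brand TEXT);
    CREATE TABLE time_dim (id INTEGER PRIMARY KEY, date TEXT);
    CREATE TABLE sales (retailer_fk INT, product_fk INT, time_fk INT, amount REAL);

    INSERT INTO retailer VALUES (1, 'North'), (2, 'South');
    INSERT INTO product  VALUES (1, 'Anvil', 'ACME'), (2, 'Rocket', 'ACME'), (3, 'Widget', 'Other');
    INSERT INTO time_dim VALUES (1, '2019-01-15'), (2, '2019-03-20');
    INSERT INTO sales VALUES
        (1, 1, 1, 10.0), (1, 1, 1, 5.0), (2, 2, 1, 7.0),
        (1, 3, 1, 99.0),   -- brand is not ACME, filtered out
        (2, 1, 2, 42.0);   -- date is not before the cutoff, filtered out
""")

# Same structure as the slide's query; ORDER BY is added only to make the output stable.
QUERY = """
    SELECT retailer.name, product.name, SUM(sales.amount)
    FROM sales
    JOIN retailer ON sales.retailer_fk = retailer.id
    JOIN product  ON sales.product_fk = product.id
    JOIN time_dim ON sales.time_fk = time_dim.id
    WHERE time_dim.date < ? AND product.brand = 'ACME'
    GROUP BY product.name, retailer.name
    ORDER BY retailer.name, product.name
"""
rows = db.execute(QUERY, ("2019-03-01",)).fetchall()
```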

Query Before Optimization
Data Virtualization Platform: GROUP BY product.name, retailer.name on top of three JOINs (JOIN output: 10,000,000 rows)
Queries sent to the sources:
    sales (300,000,000 rows):
        SELECT sales.retailer_fk, sales.product_fk, sales.time_fk, sales.amount FROM sales
    retailer (100 rows):
        SELECT retailer.name, retailer.id FROM retailer
    product (10 rows):
        SELECT product.name, product.id FROM product WHERE product.brand = 'ACME'
    time (30 rows):
        SELECT time.date, time.id FROM time WHERE time.date < add_months(current_timestamp, -1)

Step 1: Apply JOIN Re-ordering to Maximize Delegation
Data Virtualization Platform: GROUP BY product.name, retailer.name on top of two JOINs (JOIN output: 10,000,000 rows)
Queries sent to the sources:
    sales joined with time (30,000,000 rows):
        SELECT sales.retailer_fk, sales.product_fk, sales.amount FROM sales JOIN time ON sales.time_fk = time.id WHERE time.date < add_months(current_timestamp, -1)
    retailer (100 rows):
        SELECT retailer.name, retailer.id FROM retailer
    product (10 rows):
        SELECT product.name, product.id FROM product WHERE product.brand = 'ACME'
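The effect of delegating the sales-time join can be sketched with a toy source. The example assumes, as the slide implies, that the fact table and the time dimension sit in the same source (called edw here); the sample rows, the time_dim table name, and the fixed CUTOFF date are hypothetical. It compares how many rows leave the source when the join is done in the virtualization layer versus inside the source.

```python
import sqlite3

# Hypothetical EDW source holding the fact table and the time dimension.
edw = sqlite3.connect(":memory:")
edw.executescript("""
    CREATE TABLE time_dim (id INTEGER PRIMARY KEY, date TEXT);
    CREATE TABLE sales (retailer_fk INT, product_fk INT, time_fk INT, amount REAL);
    INSERT INTO time_dim VALUES (1, '2019-01-15'), (2, '2019-03-20');
    INSERT INTO sales VALUES (1, 1, 1, 10.0), (2, 1, 1, 7.0), (1, 2, 2, 42.0), (2, 2, 2, 8.0);
""")
CUTOFF = "2019-03-01"

# Before: each view is fetched separately and joined in the virtualization layer.
all_sales = edw.execute(
    "SELECT retailer_fk, product_fk, time_fk, amount FROM sales"
).fetchall()
old_days = edw.execute("SELECT id FROM time_dim WHERE date < ?", (CUTOFF,)).fetchall()
old_ids = {day_id for (day_id,) in old_days}
joined_locally = sorted((r, p, a) for (r, p, t, a) in all_sales if t in old_ids)

# After: the sales-time join runs inside the source, so only matching rows travel.
delegated = edw.execute(
    """
    SELECT sales.retailer_fk, sales.product_fk, sales.amount
    FROM sales JOIN time_dim ON sales.time_fk = time_dim.id
    WHERE time_dim.date < ?
    """,
    (CUTOFF,),
).fetchall()

rows_moved_before = len(all_sales) + len(old_days)
rows_moved_after = len(delegated)
```

Both plans produce the same joined rows, but the delegated plan moves fewer of them across the network, which is the same effect as the drop from 300,000,000 to 30,000,000 rows on the slide.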

Step 2: Partial Aggregation Pushdown
The JOIN is on foreign keys (1-to-many) and the GROUP BY is on attributes from the dimensions, so the partial aggregation push-down optimization is applied.
Data Virtualization Platform: GROUP BY product.name, retailer.name on top of two JOINs (JOIN output: 1,000 rows)
Queries sent to the sources:
    sales joined with time, pre-aggregated (10,000 rows):
        SELECT sales.retailer_fk, sales.product_fk, SUM(sales.amount) FROM sales JOIN time ON sales.time_fk = time.id WHERE time.date < add_months(current_timestamp, -1) GROUP BY sales.retailer_fk, sales.product_fk
    retailer (100 rows):
        SELECT retailer.name, retailer.id FROM retailer
    product (10 rows):
        SELECT product.name, product.id FROM product WHERE product.brand = 'ACME'
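Why the push-down is safe can be checked on a toy example: because SUM can be computed in stages, pre-aggregating the fact rows by the join keys in the source and then finishing the GROUP BY on dimension names gives the same totals as aggregating the raw rows. The sketch below uses made-up rows and hypothetical names (edw, retailer_name, product_name, final_group_by); it illustrates the rewrite, not the optimizer itself.

```python
import sqlite3
from collections import defaultdict

# Hypothetical source with the fact table.
edw = sqlite3.connect(":memory:")
edw.executescript("""
    CREATE TABLE sales (retailer_fk INT, product_fk INT, amount REAL);
    INSERT INTO sales VALUES (1, 1, 10.0), (1, 1, 5.0), (2, 1, 7.0), (2, 2, 3.0), (2, 2, 4.0);
""")
# Dimension rows as they would arrive from a second, hypothetical source.
retailer_name = {1: "North", 2: "South"}
product_name = {1: "Anvil", 2: "Rocket"}


def final_group_by(rows):
    """Join fact rows to dimension names and aggregate in the virtualization layer."""
    totals = defaultdict(float)
    for retailer_fk, product_fk, amount in rows:
        totals[(product_name[product_fk], retailer_name[retailer_fk])] += amount
    return dict(totals)


# Without push-down: every fact row is transferred.
raw = edw.execute("SELECT retailer_fk, product_fk, amount FROM sales").fetchall()

# With push-down: the source pre-aggregates by the join keys.
partial = edw.execute(
    "SELECT retailer_fk, product_fk, SUM(amount) FROM sales GROUP BY retailer_fk, product_fk"
).fetchall()

same_result = final_group_by(raw) == final_group_by(partial)
```

The final GROUP BY is still needed after the join, because several key values could map to the same name; the push-down only shrinks what has to be transferred first.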

Step 3: Choose Best JOIN Methods
Selects the right JOIN strategy based on costs derived from data volume estimations.
Data Virtualization Platform: GROUP BY product.name, retailer.name on top of a NESTED JOIN and a HASH JOIN (JOIN output: 1,000 rows)
Queries sent to the sources:
    product (10 rows):
        SELECT product.name, product.id FROM product WHERE product.brand = 'ACME'
    sales joined with time, pre-aggregated (1,000 rows):
        SELECT sales.retailer_fk, sales.product_fk, SUM(sales.amount) FROM sales JOIN time ON sales.time_fk = time.id WHERE time.date < add_months(current_timestamp, -1) GROUP BY sales.retailer_fk, sales.product_fk
        Filter shown on this branch: WHERE product.id IN (1,2, ...)
    retailer (100 rows):
        SELECT retailer.name, retailer.id FROM retailer
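The two join methods can be sketched as follows. This is a simplified illustration with hypothetical sources (mdm, edw) and made-up rows; the filter is written against sales.product_fk, which is an assumption about how the IN list on the slide reaches the fact table. A nested join runs the small branch first and pushes its keys into the query for the large branch; a hash join builds a lookup table on the smaller input and probes it with rows from the other input.

```python
import sqlite3

mdm = sqlite3.connect(":memory:")  # hypothetical source with the dimensions
mdm.executescript("""
    CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT, brand TEXT);
    CREATE TABLE retailer (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO product VALUES (1, 'Anvil', 'ACME'), (2, 'Rocket', 'ACME'), (3, 'Widget', 'Other');
    INSERT INTO retailer VALUES (1, 'North'), (2, 'South');
""")
edw = sqlite3.connect(":memory:")  # hypothetical source with the fact table
edw.executescript("""
    CREATE TABLE sales (retailer_fk INT, product_fk INT, amount REAL);
    INSERT INTO sales VALUES (1, 1, 10.0), (2, 2, 7.0), (1, 3, 99.0), (2, 3, 1.0);
""")

# Nested join: run the small branch first, then push its keys into the large branch.
products = dict(mdm.execute("SELECT id, name FROM product WHERE brand = 'ACME'"))
placeholders = ", ".join("?" for _ in products)
fact_rows = edw.execute(
    "SELECT retailer_fk, product_fk, SUM(amount) FROM sales "
    f"WHERE product_fk IN ({placeholders}) "
    "GROUP BY retailer_fk, product_fk",
    list(products),
).fetchall()

# Hash join: build a hash table on the smaller input (retailer), probe with fact rows.
retailers = dict(mdm.execute("SELECT id, name FROM retailer"))
result = sorted((retailers[r], products[p], total) for r, p, total in fact_rows)
```

The nested join pays off when one branch is tiny (10 products on the slide), since the IN list keeps the other source from returning rows that would be discarded anyway.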

Leveraging the Power of a Hadoop Cluster
1. Partial aggregation push-down: maximizes source processing; dramatically reduces network traffic.
2. Integrated with the Cost Based Optimizer: based on data volume estimation and the cost of these particular operations, the CBO can decide to move all or part of the execution tree to the MPP.
3. On-demand data transfer: the DV Platform automatically generates and uploads Parquet files.
4. Integration with local data: the engine detects when data is cached or comes from a local table already in the MPP.
5. Fast parallel execution: support for Spark, Presto and Impala for fast analytical processing in inexpensive Hadoop-based solutions.
[Diagram: Data Virtualization Platform (group by State, join); 2M rows (sales by customer), group by ID; Current Sales 68 M rows; Customer 2 M rows; Hist. Sales 220 M rows]
System      Execution Time   Optimization Techniques
Others      ~ 19 min         Simple federation
No MPP      43 sec           Aggregation push-down
With MPP    26 sec           Aggregation push-down + MPP integration (Impala 4 nodes)
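As a quick reading aid for the timing table, the relative speedups can be worked out directly from the reported figures. The only assumption is that "~ 19 min" is taken as exactly 19 minutes, so the ratios are approximate.

```python
# Timings as reported on the slide, in seconds ("~ 19 min" is approximate).
timings = {
    "Simple federation": 19 * 60,
    "Aggregation push-down": 43,
    "Aggregation push-down + MPP integration": 26,
}

baseline = timings["Simple federation"]
# Speedup of each technique relative to simple federation, rounded to one decimal.
speedups = {name: round(baseline / seconds, 1) for name, seconds in timings.items()}
```

With these inputs, aggregation push-down alone is roughly 26.5 times faster than simple federation, and adding MPP integration brings that to roughly 43.8 times.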

Example using Zeppelin Analytics Notebook: Accessing data for analytics and ML

Three Key Takeaways
FIRST Takeaway: Data users have access to a vast array of data and the means to process that data to gain insights; the bottleneck is finding, gathering, and preparing the data.
SECOND Takeaway: Up to 80% of a user's time is spent preparing the data and not doing the analysis on that data. Reducing this time increases the valuable analysis and insights that they deliver.
THIRD Takeaway: Data Virtualization is a technology that allows a variety of users to quickly and easily find, prepare, and access data, from a vast array of data sources, for their analytical and ML models.

Thanks!
www.denodo.com info@denodo.com
Copyright Denodo Technologies. All rights reserved. Unless otherwise specified, no part of this PDF file may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm, without the prior written authorization from Denodo Technologies.