From Single Purpose to Multi Purpose Data Lakes. Thomas Niewel Technical Sales Director DACH Denodo Technologies March, 2019


Agenda
- Data Lakes
- Multi-Purpose Data Lakes
- Customer Example
- Demo
- Takeaways

Data Lakes
A data lake is a storage repository that holds a vast amount of raw data in its native format. The data structure and requirements are not defined until the data is needed.
- The current need for sophisticated data-driven intelligence and data science favored this concept for its simplicity and power.
- Hadoop and its ecosystem provided the foundation that data lakes required: vast storage and processing muscle.
- It also favored the concept of ELT over ETL: load the data first, transform it later (maybe).

Data Lakes: Not a Perfect World
Physical nature, based on replication:
- Data lakes require data to be copied to their physical storage.
- Replication extends development cycles and costs.
- Not all data is suitable for replication: real-time needs (cloud and SaaS APIs), large volumes (the existing EDW), laws and restrictions.
Single purpose:
- Usage of the data lake is often monopolized by data scientists.
- A new data silo: no clear path to share insights with business users.
- Lacks the governance, security, and quality that business users are used to (e.g. in the EDW).

The Rise of Logical Architectures
The evolution of analytical architectures.
Source: "Adopt the Logical Data Warehouse Architecture to Meet Your Modern Analytical Needs," Gartner, April 2018

The Multipurpose Data Lake with Data Virtualization
Logical nature:
- Replication is an option, not a necessity.
- Broader data access, shorter development times, better insights.
- Tight integration with big data systems; fast execution with large data volumes.
Multi-purpose:
- Curated access for non-technical users.
- Better governance and access control.
- Better ROI on the investment in the lake.

The Multipurpose Data Lake with Data Virtualization
"A multi-purpose data lake can become an organization's universal data delivery system."
Architecting the Multi-Purpose Data Lake with Data Virtualization, Rick Van der Lans, April 2018

The Virtual Data Lake: Access to All Data Sources
Single access to all data assets, internal and external:
- The physical data lake (usually based on SQL-on-Hadoop systems)
- Other databases (EDW, ODS, applications, etc.)
- SaaS APIs (Salesforce, Google, social media, etc.)
- Files (local, S3, Azure, etc.)

The Virtual Data Lake: Ingesting and Caching
The physical data lake can also be used as Denodo's cache. This makes it possible to quickly load any data accessible to Denodo into the Hadoop cluster.
Caching becomes an alternative to ingestion: ELT processes that preserve lineage and governance.
The load process is based on a direct load to HDFS:
1. Create the target table in the cache system.
2. Generate Parquet files (in chunks) with Snappy compression on the local machine.
3. Upload the Parquet files to HDFS in parallel.
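The three-step load flow above can be sketched as an orchestration skeleton. This is a minimal illustration, not Denodo's actual implementation: the helper names (create_target_table, write_parquet_chunk, upload_to_hdfs) are hypothetical stubs standing in for real DDL, Parquet-writing (e.g. via pyarrow with Snappy compression), and HDFS-client calls.

```python
# Sketch of the cache-load flow: create table, write chunked Parquet
# locally, upload chunks in parallel. All helpers are hypothetical stubs.
from concurrent.futures import ThreadPoolExecutor

def create_target_table(table_name):
    # Step 1: issue the target-table DDL against the cache system (stubbed).
    return f"CREATE TABLE {table_name} STORED AS PARQUET"

def write_parquet_chunk(rows, chunk_id):
    # Step 2: a real pipeline would write a Snappy-compressed Parquet file
    # here (e.g. pyarrow.parquet.write_table); stubbed as a local path.
    return f"/tmp/chunk_{chunk_id}.parquet"

def upload_to_hdfs(path):
    # Step 3: upload one local chunk to HDFS (stubbed).
    return f"hdfs:///cache/{path.split('/')[-1]}"

def load_to_cache(rows, chunk_size=2):
    create_target_table("cached_customers")
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    local = [write_parquet_chunk(c, i) for i, c in enumerate(chunks)]
    with ThreadPoolExecutor(max_workers=4) as pool:  # parallel upload
        return list(pool.map(upload_to_hdfs, local))

print(load_to_cache([{"id": n} for n in range(5)]))  # 5 rows -> 3 chunks
```

The point of the shape is that chunking plus parallel upload keeps the local disk footprint small while saturating the network path to the cluster.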

The Virtual Data Lake: Using the Lake's Processing Engine
The Denodo optimizer provides native integration with MPP systems to deliver one extra key capability: query acceleration.
- Denodo can move processing to the MPP on demand during the execution of a query.
- Parallel power for calculations in the virtual layer.
- Avoids slow on-disk processing when processing buffers don't fit into Denodo's memory (swapped data).
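The routing decision described above can be illustrated with a toy rule: run small intermediate results in memory, ship large ones to the MPP, and fall back to on-disk swapping only when no MPP is available. The thresholds and names here are invented for illustration; they are not Denodo's actual cost model.

```python
# Illustrative only: a toy cost rule in the spirit of the optimizer
# described above. Thresholds are made up, not Denodo's real logic.
def choose_engine(estimated_rows, in_memory_limit_rows=1_000_000,
                  mpp_available=True):
    """Pick where a post-federation operation (e.g. a big GROUP BY) runs."""
    if estimated_rows <= in_memory_limit_rows:
        return "virtual-layer-memory"   # fits in the virtual layer's buffers
    if mpp_available:
        return "mpp-cluster"            # ship work to Spark/Presto/Impala
    return "virtual-layer-disk"         # slow fallback: swap to disk

print(choose_engine(300_000_000))  # -> mpp-cluster
```

A real cost-based optimizer would weigh transfer cost against cluster speed-up rather than use a single row threshold, but the trade-off it arbitrates is the same.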

Example: Scenario
Evolution of sales per ZIP code over the previous years.
Scenario:
- Current data (last 12 months) in the EDW
- Historical data offloaded to a Hadoop cluster for cheaper storage
- Customer master data is used often, so it is cached in the Hadoop cluster
Query plan sketch: union of current and historical sales, joined with customer, grouped by ZIP.
Very large data volumes:
- Current Sales: 100 million rows
- Historical Sales: 300 million rows
- Customer: 2 million rows (cached)
The sales tables have hundreds of millions of rows.

Example: What Are the Options?
1) Simple federation in the virtual layer: move hundreds of millions of rows for processing in the virtual layer.
2) Data shipping: move current sales to Hadoop and process the content in the cluster. Moves 100 million rows.
3) Partial aggregation pushdown (Denodo 6): modifies the execution tree to split the aggregation in two steps:
   1. by customer ID for the JOIN (pushed down to the source)
   2. by ZIP for the final results (in the virtual layer)
   This significantly reduces network traffic, but processing a large amount of data in the virtual layer (the aggregation by ZIP) becomes the bottleneck.
4) Denodo's MPP integration (Denodo 7, next slide).
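The two-step split in option 3 can be shown concretely in a few lines. This is a minimal sketch with invented toy data: the first aggregation (by customer ID) stands in for the work pushed down to each source, so only one row per customer crosses the network; the join with the cached customer table and the final aggregation by ZIP stand in for the work left in the virtual layer.

```python
# Minimal sketch of partial aggregation pushdown with toy data.
from collections import defaultdict

sales = [  # (customer_id, amount) -- stands in for millions of rows
    (1, 100.0), (1, 50.0), (2, 75.0), (3, 20.0), (3, 30.0),
]
customers = {1: "10115", 2: "80331", 3: "10115"}  # id -> ZIP (cached)

# Step 1: aggregate by customer ID "at the source" (the pushed-down part).
by_customer = defaultdict(float)
for cid, amount in sales:
    by_customer[cid] += amount   # one row per customer crosses the network

# Step 2: join with the cached customer table and finish the GROUP BY ZIP.
by_zip = defaultdict(float)
for cid, total in by_customer.items():
    by_zip[customers[cid]] += total

print(dict(by_zip))  # -> {'10115': 200.0, '80331': 75.0}
```

With 300 million sales rows but only 2 million customers, step 1 shrinks the transferred data by two orders of magnitude before the join ever happens, which is where the speed-up in the measurements on the next slide comes from.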

The Virtual Data Lake: Putting the Pieces Together
1. Partial aggregation pushdown: maximizes source processing and dramatically reduces network traffic (only 2 M rows, sales grouped by customer ID, cross the network).
2. Integration with the cost-based optimizer: based on data volume estimations and the cost of these particular operations, the CBO can decide to move all or part of the execution tree to the MPP.
3. On-demand data transfer: Denodo automatically generates and uploads Parquet files.
4. Integration with local data: the engine detects when data is cached or comes from a local table already in the MPP system.
5. Fast parallel execution: support for Spark, Presto, and Impala for fast analytical processing on inexpensive Hadoop-based solutions.
Data volumes in this run: Current Sales 68 M rows, Historical Sales 220 M rows, Customer 2 M rows (cached).
Execution times by optimization technique:
- Simple federation, no MPP: ~10 min
- Aggregation pushdown, no MPP: 43 sec
- Aggregation pushdown + MPP integration (Impala, 8 nodes): 11 sec

The Virtual Data Lake: Conclusions
A virtual data lake improves decision making and shortens development cycles:
- Surfaces all company data from multiple repositories without the need to replicate all of it into the lake.
- Eliminates data silos: allows on-demand combination of data from multiple sources.
A virtual data lake broadens adoption of the lake and improves its ROI:
- Improves governance and metadata management to avoid data swamps.
- Allows controlled access to the lake for non-technical users.
A virtual data lake offers performance for the big data world:
- Leverages the processing power of the existing cluster, controlled by Denodo's optimizer.

Customer Success Story

Customer Case Overview
THE CHALLENGE: Find an agile way to integrate data from existing silos, including the data warehouse, machine data, and others, that reduces business users' dependency on IT and provides quick turnaround and flexibility.
BUSINESS NEED:
- Optimize operational efficiency, automate manufacturing processes, and deliver on-demand services to business consumers
- Find smarter ways to aggregate and analyze data
- An agile solution that enables the monetization of customer-facing data products
- Free business users from IT reliance to become self-sufficient with reporting and analysis
About the customer: founded 1925; annual revenues (FY 2017) of $3.1 B; over 20,000 employees; headquartered in Germany; the world's leading supplier of automation technology and technical education.

Customer Case Overview
SOLUTION: Festo developed a Big Data Analytics Framework to provide a data marketplace that better supports the business:
- Using the Denodo Platform to integrate data from numerous on-premises and cloud systems in real time
- A unified layer for consistent data access and governance across different data silos

Demo

Example
What's the impact of a new marketing campaign in each country?
- Historical sales data is offloaded to a Hadoop cluster for cheaper storage
- Marketing campaigns are managed in an external cloud app
- Country is part of the customer details table, stored in the DW
Architecture diagram: Sources (Sales, Campaign, Customer) → Source Abstraction (base views) → Combine, Transform & Integrate (joins and aggregation) → Consume.

Key Takeaways

Key Takeaways
- Hadoop-based data lakes are the standard approach to modern analytics in most organizations.
- Physical data lakes introduce many complexities (replication, synchronization, governance, etc.) that restrict their use.
- Logical data lakes allow users to access data from all sources, internal and external, growing the value of the data lake approach.
- Data virtualization creates multipurpose data lakes for all kinds of users: data scientists and business users alike.
- Data virtualization introduces governance and access controls to the data lake without impeding power users.

Q&A

Next steps
- Denodo Express: accelerate your fast data strategy with Denodo Express. Try Denodo Express for free.
- Test Drive: test-drive the Denodo Platform on AWS for agile BI and analytics.
- Questions? Please reach out with any questions or requests. Send us an email.