From Single Purpose to Multi-Purpose Data Lakes
Thomas Niewel, Technical Sales Director DACH, Denodo Technologies
March 2019
Agenda
- Data Lakes
- Multi-Purpose Data Lakes
- Customer Example
- Demo
- Takeaways
Data Lakes
A data lake is a storage repository that holds a vast amount of raw data in its native format. The data structure and requirements are not defined until the data is needed.
- The current need for sophisticated data-driven intelligence and data science favored this concept for its simplicity and power.
- Hadoop and its ecosystem provided the foundation that data lakes require: vast storage and processing muscle.
- It also favored the concept of ELT over ETL: load the data first, transform it later (maybe).
Data Lakes: Not a Perfect World
Physical nature:
- Based on replication: data lakes require data to be copied to their physical storage.
- Replication extends development cycles and increases costs.
- Not all data is suitable for replication:
  - Real-time needs: cloud and SaaS APIs
  - Large volumes: existing EDW
  - Laws and restrictions
Single purpose:
- Usage of the data lake is often monopolized by data scientists.
- It becomes a new data silo, with no clear path to share insights with business users.
- It lacks the governance, security, and quality that business users are used to (e.g., in the EDW).
The Rise of Logical Architectures
The Evolution of Analytical Architectures
Source: "Adopt the Logical Data Warehouse Architecture to Meet Your Modern Analytical Needs," Gartner, April 2018
The Multipurpose Data Lake with Data Virtualization
Logical nature:
- Replication is an option, not a necessity.
- Broader data access, shorter development times, better insights.
- Tight integration with big data systems; fast execution with large data volumes.
Multi-purpose:
- Curated access for non-technical users.
- Better governance and access control.
- Better ROI on the investment in the lake.
The Multipurpose Data Lake with Data Virtualization
"A multi-purpose data lake can become an organization's universal data delivery system."
Source: "Architecting the Multi-Purpose Data Lake with Data Virtualization," Rick Van der Lans, April 2018
The Virtual Data Lake: Access to All Data Sources
A single access point to all data assets, internal and external:
- Physical data lake (usually based on SQL-on-Hadoop systems)
- Other databases (EDW, ODS, applications, etc.)
- SaaS APIs (Salesforce, Google, social media, etc.)
- Files (local, S3, Azure, etc.)
The Virtual Data Lake: Ingesting and Caching
- The physical data lake can also be used as Denodo's cache.
- This makes it possible to quickly load any data accessible by Denodo into the Hadoop cluster.
- Caching becomes an alternative to ingestion: ELT processes that preserve lineage and governance.
The load process is based on direct load to HDFS:
1. Creation of the target table in the cache system
2. Generation of Parquet files (in chunks) with Snappy compression on the local machine
3. Parallel upload of the Parquet files to HDFS
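The chunk-and-upload pattern behind steps 2 and 3 can be sketched as follows. This is an illustrative stand-in, not Denodo's actual implementation: a real loader would emit Snappy-compressed Parquet files and push them to HDFS, while this self-contained sketch writes plain files and copies them to a local directory playing the role of the cluster.

```python
# Sketch of the bulk-load pattern: write data in fixed-size chunks locally,
# then upload the chunks in parallel. Plain files and a local directory
# stand in for Parquet/Snappy and HDFS (assumption for illustration only).
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 1000  # rows per generated chunk file

def write_chunks(rows, out_dir):
    """Step 2: generate one file per chunk of rows (stand-in for Parquet)."""
    paths = []
    for i in range(0, len(rows), CHUNK_SIZE):
        path = os.path.join(out_dir, f"part-{i // CHUNK_SIZE:05d}")
        with open(path, "w") as f:
            f.writelines(f"{r}\n" for r in rows[i:i + CHUNK_SIZE])
        paths.append(path)
    return paths

def upload(path, hdfs_dir):
    """Step 3: upload one chunk (stand-in for an HDFS put)."""
    dest = os.path.join(hdfs_dir, os.path.basename(path))
    with open(path) as src, open(dest, "w") as dst:
        dst.write(src.read())
    return dest

def bulk_load(rows, hdfs_dir, workers=4):
    # Step 1 (creating the target table in the cache system) is omitted here.
    with tempfile.TemporaryDirectory() as tmp:
        chunks = write_chunks(rows, tmp)
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(lambda p: upload(p, hdfs_dir), chunks))
```

Chunking keeps memory bounded regardless of table size, and the parallel upload hides per-file transfer latency, which is the point of step 3.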
The Virtual Data Lake: Using the Lake's Processing Engine
Denodo's optimizer provides native integration with MPP systems to deliver one extra key capability: query acceleration.
- Denodo can move processing to the MPP on demand during the execution of a query.
- Parallel power for calculations in the virtual layer.
- Avoids slow on-disk processing when intermediate buffers don't fit into Denodo's memory (swapped data).
Example: Scenario
Evolution of sales per ZIP code over the previous years.
Scenario:
- Current data (last 12 months) in the EDW
- Historical data offloaded to a Hadoop cluster for cheaper storage
- Customer master data is used often, so it is cached in the Hadoop cluster
(Query plan: UNION of current and historical sales, JOIN with customer, GROUP BY ZIP.)
Very large data volumes:
- Current Sales: 100 million rows
- Historical Sales: 300 million rows
- Customer: 2 million rows (cached)
The sales tables together hold hundreds of millions of rows.
Example: What Are the Options?
1) Simple federation in the virtual layer
- Moves hundreds of millions of rows for processing in the virtual layer.
2) Data shipping
- Moves Current Sales to Hadoop and processes the content in the cluster.
- Ships 100 million rows.
3) Partial aggregation pushdown (Denodo 6)
- Modifies the execution tree to split the aggregation in two steps:
  1. By customer ID, for the JOIN (pushed down to the sources)
  2. By ZIP, for the final results (in the virtual layer)
- Significantly reduces network traffic, but processing the large intermediate result in the virtual layer (the aggregation by ZIP) becomes the bottleneck.
4) Denodo's MPP integration (Denodo 7, next slide)
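The two-step rewrite in option 3 can be sketched with plain Python in place of real data sources; the table and column names here are invented for illustration, and each dict stands in for a query result.

```python
# Minimal sketch of partial aggregation pushdown: aggregate by customer at
# the source so only one row per customer crosses the network, then join
# and re-aggregate by ZIP in the virtual layer.
from collections import defaultdict

def aggregate_by_customer(sales):
    """Step 1: GROUP BY customer_id, pushed down to the source.
    Ships one row per customer instead of one row per sale."""
    totals = defaultdict(float)
    for customer_id, amount in sales:
        totals[customer_id] += amount
    return totals

def aggregate_by_zip(per_customer, customers):
    """Step 2: JOIN with the customer table and GROUP BY ZIP,
    executed in the virtual layer on the much smaller data set."""
    by_zip = defaultdict(float)
    for customer_id, total in per_customer.items():
        by_zip[customers[customer_id]] += total
    return dict(by_zip)

sales = [(1, 10.0), (1, 5.0), (2, 7.0), (3, 2.0)]  # (customer_id, amount)
customers = {1: "10115", 2: "10115", 3: "80331"}   # customer_id -> ZIP
result = aggregate_by_zip(aggregate_by_customer(sales), customers)
# result: {"10115": 22.0, "80331": 2.0}
```

The rewrite is valid because SUM is decomposable: summing per-customer subtotals by ZIP gives the same answer as summing the raw sales rows by ZIP, while shipping at most one row per customer.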
The Virtual Data Lake: Putting the Pieces Together
1. Partial aggregation pushdown
- Maximizes source processing; dramatically reduces network traffic: only 2M rows (sales aggregated by customer ID) leave the sources.
2. Integration with the cost-based optimizer
- Based on data volume estimates and the cost of the operations involved, the CBO can decide to move all or part of the execution tree (the GROUP BY ZIP and the JOIN) to the MPP.
3. On-demand data transfer
- Denodo automatically generates and uploads Parquet files.
4. Integration with local data
- The engine detects when data is cached or comes from a table already in the MPP system.
5. Fast parallel execution
- Support for Spark, Presto, and Impala for fast analytical processing on inexpensive Hadoop-based solutions.
Data volumes in this run: Current Sales 68M rows, Historical Sales 220M rows, Customer 2M rows (cached).
Execution times:
- Simple federation (other engines): ~10 min
- Aggregation pushdown, no MPP: 43 sec
- Aggregation pushdown + MPP integration (Impala, 8 nodes): 11 sec
The Virtual Data Lake: Conclusions
A virtual data lake improves decision making and shortens development cycles:
- Surfaces all company data from multiple repositories without the need to replicate everything into the lake.
- Eliminates data silos: allows on-demand combination of data from multiple sources.
A virtual data lake broadens adoption of the lake and improves its ROI:
- Improves governance and metadata management to avoid data swamps.
- Allows controlled access to the lake for non-technical users.
A virtual data lake offers performance for the big data world:
- Leverages the processing power of the existing cluster, controlled by Denodo's optimizer.
Customer Success Story
Customer Case Overview
THE CHALLENGE: Find an agile way to integrate data from existing silos, including the data warehouse, machine data, and others, that reduces business users' dependency on IT and provides quick turnaround and flexibility.
BUSINESS NEED:
- Optimize operational efficiency, automate manufacturing processes, and deliver on-demand services to business consumers
- Find smarter ways to aggregate and analyze data
- An agile solution that enables the monetization of customer-facing data products
- Free business users from IT reliance to become self-sufficient with reporting and analysis
About the customer:
- Founded in 1925
- Annual revenues (FY 2017): $3.1B
- Over 20,000 employees
- Headquartered in Germany
- World's leading supplier of automation technology and technical education
Customer Case Overview
SOLUTION: Festo developed a Big Data Analytics Framework to provide a data marketplace that better supports the business:
- Uses the Denodo Platform to integrate data from numerous on-prem and cloud systems in real time
- Provides a unified layer for consistent data access and governance across different data silos
Demo
Example
What's the impact of a new marketing campaign for each country?
- Historical sales data offloaded to a Hadoop cluster for cheaper storage
- Marketing campaigns managed in an external cloud app
- Country is part of the customer details table, stored in the DW
(Diagram: a query plan joining Sales, Campaign, and Customer with a final GROUP BY, built on Denodo's layers: Sources → Source Abstraction → Base Views → Combine, Transform & Integrate → Consume.)
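The integration view behind this demo can be sketched as plain SQL run against an in-memory SQLite database standing in for the three sources (the Hadoop cluster, the cloud marketing app, and the DW). The table and column names are invented for illustration; in Denodo the same statement would be a view over three base views rather than three local tables.

```python
# Hedged sketch: campaign impact per country as a single federated-style
# query. SQLite (stdlib) stands in for the three underlying systems.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales    (customer_id INTEGER, campaign_id INTEGER, amount REAL);
CREATE TABLE campaign (campaign_id INTEGER, name TEXT);
CREATE TABLE customer (customer_id INTEGER, country TEXT);
INSERT INTO sales    VALUES (1, 100, 10.0), (2, 100, 5.0), (3, 101, 7.0);
INSERT INTO campaign VALUES (100, 'spring_promo'), (101, 'autumn_promo');
INSERT INTO customer VALUES (1, 'DE'), (2, 'DE'), (3, 'US');
""")

# The integration view: revenue per campaign and country.
rows = conn.execute("""
    SELECT cu.country, ca.name, SUM(s.amount) AS revenue
    FROM sales s
    JOIN campaign ca ON ca.campaign_id = s.campaign_id
    JOIN customer cu ON cu.customer_id = s.customer_id
    GROUP BY cu.country, ca.name
    ORDER BY cu.country, ca.name
""").fetchall()
# rows: [('DE', 'spring_promo', 15.0), ('US', 'autumn_promo', 7.0)]
```

The point of the layered architecture is that consumers only see this final view; which system each table actually lives in is hidden behind the source-abstraction layer.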
Key Takeaways
Key Takeaways
- Hadoop-based data lakes are the standard approach to modern analytics in most organizations.
- Physical data lakes introduce many complexities (replication, synchronization, governance, etc.) that restrict their use.
- Logical data lakes allow users to access data from all sources, internal and external, growing the value of the data lake approach.
- Data virtualization creates multipurpose data lakes for all kinds of users: data scientists and business users alike.
- Data virtualization brings governance and access controls to the data lake without impeding its power users.
Q&A
Next Steps
- Denodo Express: Accelerate your fast data strategy with Denodo Express. Try Denodo Express for free.
- Test Drive: Test-drive the Denodo Platform on AWS for agile BI and analytics. Take Denodo for a Test Drive.
- Questions? Please reach out with any questions or requests. Send us an email.