Modern Data Warehouse: The New Approach to Azure BI
History
[Timeline figure: On-Premises SQL Server; Big Data Solutions; Technical Barriers; Modern Analytics Platform]
What is a modern data warehouse?
Source: Russom, P. (2013). The Modern Data Warehouse: What Enterprises Must Have Today and What They'll Need in the Future. TDWI.
Data Analysis Paradigm Shift
OLD WAY: Structure -> Ingest -> Analyze
NEW WAY: Ingest -> Analyze -> Structure
This addresses the two biggest reasons why many EDW projects fail:
- Too much time spent modeling when you don't know all of the questions your data needs to answer
- Wasted time spent on ETL where the net effect is a star schema that doesn't actually show value
Data lake is the center of a big data solution
A storage repository that holds a vast amount of raw data in its native format until it is needed.
- Inexpensively store unlimited data
- Collect all data "just in case"
- Store data with no modeling
- Schema on read
- Complements EDW
- Frees up expensive EDW resources
- Quick user access to data
- ETL Hadoop tools
- Easily scalable
- Active archive (federated queries)
- Data Science workspaces
- Areas of curated data
- Supports structured, semi-structured and unstructured data
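To make "schema on read" concrete, here is a minimal sketch in Python. It assumes raw events were landed as JSON lines with no upfront modeling; the sample events, the field names, and the helper read_with_schema are hypothetical and only illustrate the idea that the schema is chosen when the data is queried, not when it is ingested.

```python
import io
import json

# Hypothetical raw events, stored exactly as they arrived (no upfront modeling).
RAW_EVENTS = io.StringIO(
    '{"user": "a", "amount": "12.50", "channel": "web"}\n'
    '{"user": "b", "amount": "7", "device": "ios"}\n'
    '{"user": "a", "amount": "3.25"}\n'
)


def read_with_schema(lines, schema):
    """Apply a schema at read time: select fields and cast them (None if absent)."""
    for line in lines:
        record = json.loads(line)
        yield {
            field: cast(record[field]) if field in record else None
            for field, cast in schema.items()
        }


# The schema is driven by the question being asked, not fixed at ingestion time.
spend_schema = {"user": str, "amount": float}

totals = {}
for row in read_with_schema(RAW_EVENTS, spend_schema):
    totals[row["user"]] = totals.get(row["user"], 0.0) + row["amount"]

print(totals)
```

A different question (for example, events per channel) would reuse the same raw data with a different schema, which is what lets modeling be deferred until the questions are known.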
Data Lake layers
- Raw data layer: Raw events are stored for historical reference. Also called staging layer or landing area.
- Cleansed data layer: Raw events are transformed (cleaned and mastered) into directly consumable data sets. The aim is to standardize how files are stored in terms of encoding, format, data types and content (e.g. strings). Also called conformed layer.
- Application data layer: Business logic is applied to the cleansed data to produce data ready to be consumed by applications (e.g. DW application, advanced analysis process). Also known by many other names: workspace, trusted, gold, secure, production ready, governed.
- Sandbox data layer: Optional layer to experiment in. Also called exploration layer or data science workspace.
Still need data governance so your data lake does not turn into a data swamp!
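The flow between layers can be sketched in a few lines of Python. This is an illustration only: the sample CSV content, the column names, and the functions cleanse and total_by_customer are hypothetical, and a real lake would read and write files in storage rather than in-memory strings.

```python
import csv
import io

# Raw layer: events kept as received (hypothetical sample with mixed casing,
# stray whitespace, and one malformed amount).
raw_layer = "customer,amount\n ALICE ,10.5\nbob,abc\nAlice,4.5\n"


def cleanse(raw_text):
    """Cleansed layer: uniform data types and string formats; skip unparseable rows."""
    rows = []
    for record in csv.DictReader(io.StringIO(raw_text)):
        try:
            amount = float(record["amount"])
        except ValueError:
            continue  # a real pipeline would route rejects to a quarantine area
        rows.append({"customer": record["customer"].strip().lower(), "amount": amount})
    return rows


def total_by_customer(cleansed_rows):
    """Application layer: business logic applied to the cleansed data."""
    totals = {}
    for row in cleansed_rows:
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount"]
    return totals


cleansed_layer = cleanse(raw_layer)
application_layer = total_by_customer(cleansed_layer)
print(application_layer)
```

The raw text is never modified, so the cleansed and application layers can be rebuilt at any time if the cleansing rules or the business logic change.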
Data platform continuum
[Diagram: deployment options On-premises, Hybrid cloud, and Off-premises, placed on two scales: shared, lower cost vs dedicated, higher cost; and higher administration vs lower administration]
SMP vs MPP
SMP - Symmetric Multiprocessing
- Multiple CPUs used to complete individual processes simultaneously
- All CPUs share the same memory, disks, and network controllers (scale-up)
- All SQL Server implementations up until now have been SMP
- Mostly, the solution is housed on a shared SAN
MPP - Massively Parallel Processing
- Uses many separate CPUs running in parallel to execute a single program
- Shared Nothing: Each CPU has its own memory and disk (scale-out)
- Segments communicate using high-speed network between nodes
Microsoft SMP options
[Diagram columns: On-premises, Cloud]
On-premises SMP (Data Warehouse Fast Track or custom)
- Full SQL Server surface area
- Known, deployed, owned by customer
- 5 TB to 145+ TB compute; 5 TB to 1.2 PB+ storage
Cloud SMP (SQL Server 2016 Fast Track for Azure VMs)
- Full SQL Server surface area
- Known, deployed by customer, hosted by Microsoft
- Certified VM sizes include GS5 (32 cores, 448 GB memory, 64 TB)
- Certified to 16 TB storage
Relational: Azure SQL Data Warehouse, SQL Server in Azure VMs, SQL Server 2016 Fast Track for Azure VMs
Beyond relational: Azure Data Lake, Azure HDInsight, Azure Marketplace
Remaining diagram labels: SQL Server 2016 Data Warehouse Fast Track; Analytics Platform System; Third-party Hadoop distributions (Hadoop, Cloudera, Hortonworks, MapR); PolyBase Insights; Integrate with non-relational data; Language translation: SQL Server 2016 PolyBase; Flexibility
Options to store and process data
MPP Architecture
- Control Node: Interacts with apps & connections; coordinates activities of the compute nodes.
- Compute Nodes: Provide the computational engines to process data.
- Distributions: Every row of data is stored in a distribution. The method of distributing data is critical to achieving good performance.
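Why the distribution method matters can be shown with a small Python sketch of hash distribution. The figure of 60 distributions matches Azure SQL DW; everything else is an assumption for illustration: MD5 stands in for the engine's internal hash function, and the order ID and country code columns are hypothetical.

```python
import hashlib
from collections import Counter

NUM_DISTRIBUTIONS = 60  # Azure SQL DW spreads each table across 60 distributions


def hash_distribution(key, num_distributions=NUM_DISTRIBUTIONS):
    """Map a distribution-key value to a distribution using a stable hash.

    MD5 is used only for illustration; the engine's real hash function differs.
    """
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_distributions


def rows_per_distribution(keys):
    """Count how many rows land in each distribution for a column of key values."""
    return Counter(hash_distribution(k) for k in keys)


# A high-cardinality key (hypothetical order IDs) spreads rows across distributions.
even = rows_per_distribution(range(60_000))

# A low-cardinality key (hypothetical country codes) piles rows onto a few of them.
skewed = rows_per_distribution(["US"] * 50_000 + ["NZ"] * 10_000)

print("order_id: distributions used =", len(even), "max rows =", max(even.values()))
print("country:  distributions used =", len(skewed), "max rows =", max(skewed.values()))
```

With the low-cardinality key, at most two distributions hold all the rows, so most compute capacity sits idle while one distribution does the bulk of the work. That skew is the performance problem a well-chosen distribution key avoids.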
PolyBase
Query relational and non-relational data with T-SQL.
PolyBase vs U-SQL:
- PolyBase is interactive, while U-SQL is batch.
- PolyBase extends T-SQL onto data via views, while U-SQL operates natively on the data, virtualizes access to other SQL data sources (no metadata needed), and supports more formats (e.g. JSON) and libraries/UDOs.
When to consider a Virtual Machine
Consider when you want to:
- Closely resemble a traditional DW implementation
- Run an SMP DB larger than Azure SQL DB supports
- Quickly migrate an existing solution to the cloud
- Run the software or DB platform of your choice with full feature parity
- Run all aspects of SQL Server (SSIS, SSAS MD, MDS)
- Have full control & administer all aspects
When to consider a SQL DB
Consider when you want to:
- Create a new DW solution
- Run a small to medium-sized DW workload (up to 4 TB currently)
- Take advantage of PaaS & reduced administration effort
- Optionally utilize automatic tuning features
When to consider an Azure SQL DW
Consider when you want to:
- Run a large DW solution (1-4 TB+)
- Scale up/down, or pause, based on demand
- Integrate with multistructured data
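The guidance for the three options (Virtual Machine, SQL DB, SQL DW) can be condensed into a rough rule of thumb. The Python sketch below is a deliberate simplification, not a sizing tool: the function suggest_platform and its parameters are hypothetical, the thresholds come from the sizes quoted on these slides, and the overlap between 1 and 4 TB is resolved here by an assumed tie-break (prefer SQL DW only when elasticity or multistructured data is needed).

```python
def suggest_platform(
    size_tb,
    needs_full_sql_server=False,   # e.g. SSIS, SSAS MD, MDS, or full admin control
    needs_elastic_compute=False,   # scale up/down or pause based on demand
    has_multistructured_data=False,
):
    """Return a first-guess Azure option for a DW workload (illustrative only)."""
    if needs_full_sql_server:
        return "SQL Server in an Azure VM"
    if size_tb > 4:
        # Beyond the SQL DB size quoted on the slide.
        return "Azure SQL DW"
    if size_tb >= 1 and (needs_elastic_compute or has_multistructured_data):
        # Assumed tie-break for the overlapping 1-4 TB range.
        return "Azure SQL DW"
    return "Azure SQL DB"


print(suggest_platform(0.5))
print(suggest_platform(2, needs_elastic_compute=True))
print(suggest_platform(10))
print(suggest_platform(3, needs_full_sql_server=True))
```

In practice the choice also depends on factors the slides do not quantify, such as concurrency, cost, and existing skills, so treat the output as a starting point for discussion.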
Knowing the various big data solutions
[Diagram: options ordered from CONTROL to EASE OF USE, with reduced administration toward the ease-of-use end: IaaS Clusters -> Managed Clusters -> Big Data as-a-service]
Big data analytics:
- Azure Marketplace (HDP, CDH, MapR): any Hadoop technology, any distribution
- Azure HDInsight: workload optimized, managed clusters
- Azure Databricks: frictionless & optimized Spark clusters
- Azure Data Lake Analytics: data engineering in a job-as-a-service model
Big data storage:
- Azure Data Lake Store
- Azure Storage
Azure Databricks
[Architecture diagram labels]
Data sources: IoT / streaming data, Cloud storage, Data warehouses, Hadoop storage
Collaborative Workspace: Data Engineer, Data Scientist, Business Analyst
Deploy Production Jobs & Workflows: Multi-stage pipelines, Job scheduler, Notification & logs
Optimized Databricks Runtime Engine: Databricks I/O, Apache Spark, Serverless
Outputs: Machine learning models, BI tools, Data exports, Data warehouses, REST APIs
Enhance Productivity. Build on secure & trusted cloud. Scale without limits.
Evolving to a Modern Data Warehouse
Realise business value from the data
Common Data Service for Analytics
CDS for Analytics Resources and Video Links
https://powerbi.microsoft.com/en-us/blog/coming-soon-to-power-bicommon-data-service-for-analytics/
https://www.youtube.com/watch?v=xaa5c1bowpe
https://www.youtube.com/watch?v=1vq0hlnz06a
Resources
https://azure.microsoft.com/en-us/blog/technical-reference-implementation-for-enterprise-bi-andreporting/
https://www.sqlchick.com/entries/2017/1/9/defining-the-components-of-a-modern-data-warehouse-aglossary
http://www.jamesserra.com/archive/2014/12/the-modern-data-warehouse/
https://skylandtech.net/2014/09/22/a-modern-data-warehouse-architecture-part-1-add-a-data-lake/
Thank you