"Data warehousing has reached the most significant tipping point since its inception. The biggest, possibly most elaborate data management system in IT is changing."
Gartner, The State of Data Warehousing in 2012
Data sources (diagram): increasing data volumes; real-time data; new data sources & types; non-relational data; cloud-born data
Traditional data warehouse flow (diagram):
Original Data -> ETL Tool (SSIS, etc.): Extract, Transform, Load -> Transformed Data -> EDW (SQL Server, Teradata, etc.)
Consumers and targets: Data Marts, Data Lake(s), BI Tools, Dashboards, Apps

Added in later builds of the same diagram:
Original Data and Streaming data -> Ingest (EL) -> Scale-out Storage & Compute (HDFS, Blob Storage, etc.) -> Transform & Load
Information production with data hubs (diagram):
Stages: Connect & Collect -> Transform & Enrich -> Publish
Data Sources (Import From) -> Data Connector: import from source to Hub (ingest) -> Data Hub (Storage & Compute) -> Data Connector: export from Hub to data store (move to data mart, etc.)
Data Connector: import/export among Hubs (move data among Hubs)
Services on the hub: Coordination & Scheduling, Monitoring & Mgmt, Data Lineage
Consumers and targets: Data Marts, Data Lake(s), BI Tools, Dashboards, Apps
Example Scenario: Data warehouse sales to Azure pipeline
Raw sales (custom view on top of DW tables) -> Hive processing -> Sales by category by day

Sample rows:
OrderDate  Company                     Category     Qty Ordered  Unit Price  Sales Order
6/1/2004   Action Bicycle Specialists  Accessories  1716         22.0393     SO71784
6/1/2004   Action Bicycle Specialists  Bikes        2288         864.0452    SO71784
6/1/2004   Action Bicycle Specialists  Clothing     2340         26.8155     SO71784
6/1/2004   Action Bicycle Specialists  Components   598          329.8538    SO71784
6/1/2004   Aerobic Exercise Company    Components   338          133.8744    SO71915
6/1/2004   Action Bicycle Specialists  Accessories  910          25.1057     SO71938
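The "sales by category by day" step runs as Hive processing in the scenario. As a rough illustration of what that aggregation computes, here is a minimal Python sketch over the sample rows from the slide. It is not the Hive script from the demo. The function name, the tuple layout, and the choice to compute an amount as quantity times unit price are assumptions made for illustration.

```python
from collections import defaultdict
from decimal import Decimal

# Sample rows copied from the slide:
# (OrderDate, Company, Category, QtyOrdered, UnitPrice, SalesOrder)
RAW_SALES = [
    ("6/1/2004", "Action Bicycle Specialists", "Accessories", 1716, Decimal("22.0393"), "SO71784"),
    ("6/1/2004", "Action Bicycle Specialists", "Bikes", 2288, Decimal("864.0452"), "SO71784"),
    ("6/1/2004", "Action Bicycle Specialists", "Clothing", 2340, Decimal("26.8155"), "SO71784"),
    ("6/1/2004", "Action Bicycle Specialists", "Components", 598, Decimal("329.8538"), "SO71784"),
    ("6/1/2004", "Aerobic Exercise Company", "Components", 338, Decimal("133.8744"), "SO71915"),
    ("6/1/2004", "Action Bicycle Specialists", "Accessories", 910, Decimal("25.1057"), "SO71938"),
]


def sales_by_category_by_day(rows):
    """Group raw sales rows by (OrderDate, Category) and total them.

    Hypothetical helper: the amount is assumed to be qty * unit price,
    which the slide does not state explicitly.
    """
    totals = defaultdict(lambda: {"qty": 0, "amount": Decimal("0")})
    for order_date, _company, category, qty, unit_price, _sales_order in rows:
        bucket = totals[(order_date, category)]
        bucket["qty"] += qty
        bucket["amount"] += qty * unit_price
    return dict(totals)


if __name__ == "__main__":
    for (day, category), total in sorted(sales_by_category_by_day(RAW_SALES).items()):
        print(day, category, total["qty"], total["amount"])
```

The two Accessories rows and the two Components rows collapse into one row each, so six raw rows become four aggregated rows for 6/1/2004.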
Data Factory Walkthrough
New-AzureDataFactory -Name DW-Demo2 -Location West-US
New-AzureDataFactory -Name HaloTelemetry -Location West-US
Walkthrough diagram, built up over four slides:
Step 1: Azure Data Factory with two data stores: On Premises SQL Server and Azure Blob Storage.
Step 2: Datasets New Sales and Aggregated Sales, each shown as a "view of" a data store (AdventureWorksLTDW2014 on the On Premises SQL Server; Azure Blob Storage).
Step 3: Pipeline with the activity "Copy NewSales to Blob Storage", which reads New Sales and produces Cloud New Sales.
Step 4: Second pipeline with an HDInsight activity "Aggregate", which reads Cloud New Sales and produces AggregatedSales.
Other labels on these slides: New User View, New User Activity, OnPrem SSIS package.

Dataset availability:
"availability": { "frequency": "Day", "interval": 6 }
Activity (e.g. Hive): Hourly; slices 12-6, 6-12, 12-6; output: AggregatedSales
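Slice dependencies (diagram):
Inputs: Hourly Sales From DW (hourly slices: 12-1, 1-2, 2-3); Dataset2, daily other source (daily slices: Monday, Tuesday, Wednesday)
Activity: Hive Activity
Output: Daily Sales, Dataset3 (daily slices: Monday, Tuesday, Wednesday)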
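The slice idea above can be sketched in a few lines of Python: an availability setting (frequency plus interval) cuts a time range into fixed windows, and a daily output slice is ready only when every hourly input slice inside its window has been produced. This is an illustrative model, not Data Factory's scheduler. The function names and the UNIT table are hypothetical. The slide's JSON reads Day with interval 6, while the timeline labels (12-6, 6-12) look like six-hour windows, so the example uses Hour with interval 6 as an assumption.

```python
from datetime import datetime, timedelta

# Hypothetical mapping from the "frequency" string to a base duration.
UNIT = {"Hour": timedelta(hours=1), "Day": timedelta(days=1)}


def slice_windows(availability, start, end):
    """Cut [start, end) into consecutive (slice_start, slice_end) windows."""
    step = UNIT[availability["frequency"]] * availability["interval"]
    windows = []
    cursor = start
    while cursor + step <= end:
        windows.append((cursor, cursor + step))
        cursor += step
    return windows


def upstream_slices(output_window, input_availability):
    """Input slices that fall inside one output slice."""
    out_start, out_end = output_window
    return slice_windows(input_availability, out_start, out_end)


def is_ready(output_window, input_availability, completed_inputs):
    """An output slice can run once all of its input slices are complete."""
    needed = upstream_slices(output_window, input_availability)
    return all(window in completed_inputs for window in needed)


if __name__ == "__main__":
    day_start, day_end = datetime(2004, 6, 1), datetime(2004, 6, 2)

    # Assumed reading of the slide: six-hour slices (12-6, 6-12, ...).
    for window in slice_windows({"frequency": "Hour", "interval": 6}, day_start, day_end):
        print(window)

    hourly = {"frequency": "Hour", "interval": 1}
    daily_slice = (day_start, day_end)
    done = set(upstream_slices(daily_slice, hourly)[:-1])  # last hour still missing
    print(is_ready(daily_slice, hourly, done))
```

With one hourly slice still missing, the daily slice is reported as not ready, which matches the diagram's picture of the Hive activity waiting on its hourly inputs.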
Is my data successfully getting produced?
Is it produced on time?
Am I alerted quickly of failures?
What about troubleshooting information?
Are there any policy warnings or errors?
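These monitoring questions can be made concrete with a small check per data slice. The sketch below is a hypothetical model in Python, not the Data Factory monitoring API. The status strings ("Ready", "Failed", "Pending"), the max_latency policy, and the function name are all assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def slice_health(slice_end, max_latency, now, status, produced_at=None):
    """Classify one slice against a latency policy.

    slice_end    end of the slice's time window
    max_latency  how long after slice_end the data may arrive (assumed policy)
    status       "Ready", "Failed", or "Pending" (hypothetical values)
    produced_at  completion time, required when status is "Ready"
    """
    deadline = slice_end + max_latency
    if status == "Failed":
        return "failed"    # should raise an alert immediately
    if status == "Ready":
        return "on time" if produced_at <= deadline else "late"
    # Not produced yet: overdue only once the deadline has passed.
    return "overdue" if now > deadline else "pending"


if __name__ == "__main__":
    end = datetime(2004, 6, 2)
    policy = timedelta(hours=2)
    print(slice_health(end, policy, datetime(2004, 6, 2, 1), "Ready", datetime(2004, 6, 2, 1)))
    print(slice_health(end, policy, datetime(2004, 6, 2, 3), "Pending"))
```

A real deployment would feed this from whatever slice status the service reports and route "failed" and "overdue" results to an alerting channel.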
Easily move data to my existing data marts for consumption by my existing BI tools.
Data stores shown: Azure DB; SQL Server on premises; Oracle; Files; Azure Blob content
Coordination: rich scheduling; complex dependencies; incremental rerun
Authoring: JSON & PowerShell/C#
Management: lineage; data production policies (late data, rerun, latency, etc.)
Hub: Azure Hub (HDInsight + Blob storage)
Activities: Hive, Pig, C#
Data Connectors: Blobs, Tables, Azure DB, On Prem SQL Server, Oracle
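To illustrate "complex dependencies" and "incremental rerun", the sketch below walks a dataset dependency graph to find what has to be reproduced when an upstream dataset changes. It is a simplified Python model, not how Data Factory implements reruns. The graph uses dataset names from the walkthrough (NewSales, CloudNewSales, AggregatedSales) as one reading of the diagram, and the function name is hypothetical.

```python
from collections import deque

# Assumed reading of the walkthrough diagram: dataset -> datasets built from it.
DEPENDENTS = {
    "NewSales": ["CloudNewSales"],           # copy activity to Blob Storage
    "CloudNewSales": ["AggregatedSales"],    # HDInsight aggregation
    "AggregatedSales": [],
}


def datasets_to_rerun(changed, dependents):
    """Return downstream datasets in the order they should be reproduced."""
    order = []
    seen = {changed}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for downstream in dependents.get(current, []):
            if downstream not in seen:
                seen.add(downstream)
                order.append(downstream)
                queue.append(downstream)
    return order


if __name__ == "__main__":
    print(datasets_to_rerun("NewSales", DEPENDENTS))
    print(datasets_to_rerun("AggregatedSales", DEPENDENTS))
```

Only the datasets downstream of the change are returned, so a change to CloudNewSales would trigger the aggregation again without repeating the copy from the on-premises source.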
Contact me: ChristianCote@IA-TechConsulting.com