"Data warehousing has reached the most significant tipping point since its inception. The biggest, possibly most elaborate data management system in IT is changing." Gartner, The State of Data Warehousing in 2012
Data sources: increasing data volumes, real-time data, new data sources & types, non-relational data, cloud-born data
Traditional flow (Extract, Transform, Load): Original Data -> ETL Tool (SSIS, etc.) -> Transformed Data -> EDW (SQL Server, Teradata, etc.) -> Data Marts -> BI Tools, Dashboards, Apps
Added alongside it, Data Lake(s): Original Data and Streaming data -> Ingest (EL) -> Scale-out Storage & Compute (HDFS, Blob Storage, etc.) -> Transform & Load
Information production: Connect & Collect -> Transform & Enrich -> Publish
Flow: Data Sources (import from) -> Ingest -> Data Hub (Storage & Compute) -> move data among Hubs -> move to data mart, etc. -> BI Tools, Data Marts, Data Lake(s), Dashboards, Apps
Components: Data Connectors (import from source to Hub, import/export among Hubs, export from Hub to data store); Coordination & Scheduling; Monitoring & Mgmt; Data Lineage
Coordination: rich scheduling, complex dependencies, incremental rerun
Authoring: JSON & PowerShell/C#
Management: lineage, data production policies (late data, rerun, latency, etc.)
Hub: Azure Hub (HDInsight + Blob storage)
Activities: Hive, Pig, C# (custom), Azure ML
Data Connectors: Blobs, Tables, Azure DB, On Prem SQL Server, Oracle, PostgreSQL, Sybase, DB2, MySQL
Example Scenario: Data warehouse sales to Azure pipeline
Raw sales (custom view on top of DW tables) -> Hive processing -> Sales by category by day

OrderDate  Company                     Category     Qty Ordered  Unit Price  Sales Order
6/1/2004   Action Bicycle Specialists  Accessories  1716         22.0393     SO71784
6/1/2004   Action Bicycle Specialists  Bikes        2288         864.0452    SO71784
6/1/2004   Action Bicycle Specialists  Clothing     2340         26.8155     SO71784
6/1/2004   Action Bicycle Specialists  Components   598          329.8538    SO71784
6/1/2004   Aerobic Exercise Company    Components   338          133.8744    SO71915
6/1/2004   Action Bicycle Specialists  Accessories  910          25.1057     SO71938
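To make the "sales by category by day" step concrete, here is a minimal Python sketch of the kind of aggregation the Hive processing could perform on the sample rows. It is not the Hive script from the demo. The function name and the formula (sales amount = quantity ordered x unit price) are assumptions made for illustration only.

```python
from collections import defaultdict

# Sample rows from the slide:
# (order_date, company, category, qty_ordered, unit_price, sales_order)
ROWS = [
    ("6/1/2004", "Action Bicycle Specialists", "Accessories", 1716, 22.0393, "SO71784"),
    ("6/1/2004", "Action Bicycle Specialists", "Bikes", 2288, 864.0452, "SO71784"),
    ("6/1/2004", "Action Bicycle Specialists", "Clothing", 2340, 26.8155, "SO71784"),
    ("6/1/2004", "Action Bicycle Specialists", "Components", 598, 329.8538, "SO71784"),
    ("6/1/2004", "Aerobic Exercise Company", "Components", 338, 133.8744, "SO71915"),
    ("6/1/2004", "Action Bicycle Specialists", "Accessories", 910, 25.1057, "SO71938"),
]


def sales_by_category_by_day(rows):
    """Sum the sales amount per (order_date, category) pair."""
    totals = defaultdict(float)
    for order_date, _company, category, qty, unit_price, _order in rows:
        # Assumption: sales amount = quantity ordered * unit price.
        totals[(order_date, category)] += qty * unit_price
    return dict(totals)


if __name__ == "__main__":
    for (day, category), amount in sorted(sales_by_category_by_day(ROWS).items()):
        print(day, category, round(amount, 2))
```

The six sample rows collapse to four (day, category) groups, because Accessories and Components each appear on two sales orders.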
Data Factory Walkthrough
New-AzureDataFactory -Name DW-Demo -Location West-US
New-AzureDataFactory -Name HaloTelemetry -Location West-US
New-AzureDataFactoryLinkedService -Name "HDInsightLinkedService" -DataFactory "DW-Demo" -File HDIResource.json
New-AzureDataFactoryLinkedService -Name "DW_BlobStorage" -DataFactory "DW-Demo" -File BlobResource.json
Azure Data Factory walkthrough diagram (built up over several slides)
Data stores: On Premises SQL Server (New User View), Azure Blob Storage
Table: New Sales (AdventureWorksLTDW2014, On Premises SQL Server)
Pipeline 1: AdventureWorksDWSalesViewPipelineOnPrem. Copy New Sales to Blob Storage: New Sales view (On Premises SQL Server) -> Cloud New Sales (file in blob, Azure Blob Storage)
Pipeline 2: HiveAggregateData. Hive on HDInsight: Cloud New Sales -> Aggregate -> AggregatedSales
Pipeline 3: HivePipelineOnPrem. Copy Aggregated Sales to DW: Aggregated Sales (external table file, Azure Blob Storage) -> DW staging table (On Premises SQL Server)
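The three pipelines are chained through the datasets they read and write. As a rough illustration of how dataset-based dependencies imply a run order, here is a small Python sketch. It is not how Data Factory is implemented. The dataset identifiers (`NewSales`, `CloudNewSales`, `AggregatedSales`, `DWStagingTable`) and the helper `run_order` are hypothetical names derived from the diagram labels.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Pipeline names come from the slides; dataset identifiers are hypothetical.
PIPELINES = {
    "AdventureWorksDWSalesViewPipelineOnPrem": {
        "inputs": ["NewSales"],
        "outputs": ["CloudNewSales"],
    },
    "HiveAggregateData": {
        "inputs": ["CloudNewSales"],
        "outputs": ["AggregatedSales"],
    },
    "HivePipelineOnPrem": {
        "inputs": ["AggregatedSales"],
        "outputs": ["DWStagingTable"],
    },
}


def run_order(pipelines):
    """Order pipelines so that each runs after the producers of its inputs."""
    # Map every dataset to the pipeline that produces it.
    producer = {
        dataset: name
        for name, spec in pipelines.items()
        for dataset in spec["outputs"]
    }
    # A pipeline depends on the producers of its inputs; datasets with no
    # producer (such as the on-premises source view) are external inputs.
    graph = {
        name: {producer[d] for d in spec["inputs"] if d in producer}
        for name, spec in pipelines.items()
    }
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    for step, name in enumerate(run_order(PIPELINES), start=1):
        print(step, name)
```

Declaring inputs and outputs, rather than an explicit sequence, lets the order be derived from the data itself, which is the idea the diagram conveys.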
# Deploy Table
New-AzureDataFactoryTable -DataFactory DW-Demo -File AdventureWorksLTDW2014SalesView.json
# Deploy Pipeline
New-AzureDataFactoryPipeline -DataFactory DW-Demo -File AdventureWorksDWSalesViewPipelineOnPrem.json
# Start Pipeline
Set-AzureDataFactoryPipelineActivePeriod -Name AdventureWorksDWSalesViewPipelineOnPrem -DataFactory DW-Demo -StartTime "06/27/2015 12:00:00"
"availability": { "frequency": "Day", "interval": 6 }
Activity (e.g. Hive), Hourly: slices 12-6, 6-12, 12-6 (AggregatesSales)
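The availability setting determines how an active period is cut into time slices. The following Python sketch shows that idea under stated assumptions. The function `time_slices` and the unit table are hypothetical, and the example uses a frequency of "Hour" with an interval of 6 so that it reproduces the six-hour 12-6, 6-12 labels in the diagram. It is not the service's actual scheduling logic.

```python
from datetime import datetime, timedelta

# Hypothetical mapping from frequency names to a base duration.
UNIT = {
    "Minute": timedelta(minutes=1),
    "Hour": timedelta(hours=1),
    "Day": timedelta(days=1),
}


def time_slices(start, end, frequency, interval):
    """Split [start, end) into consecutive slices of `interval` x `frequency`.

    Only complete slices are returned; a trailing partial slice is dropped.
    """
    step = UNIT[frequency] * interval
    slices = []
    current = start
    while current + step <= end:
        slices.append((current, current + step))
        current += step
    return slices


if __name__ == "__main__":
    day = time_slices(datetime(2015, 6, 27), datetime(2015, 6, 28), "Hour", 6)
    for begin, finish in day:
        print(begin.strftime("%H:%M"), "-", finish.strftime("%H:%M"))
```

Each slice is a unit of work: an activity runs once per slice, which is what makes incremental reruns of a single window possible.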
Hourly: Sales From DW (slices 12-1, 1-2, 2-3)
Daily: other source, Dataset2 (slices Monday, Tuesday, Wednesday)
Hive Activity -> Daily: Daily Sales, Dataset3 (slices Monday, Tuesday, Wednesday)
Is my data successfully getting produced? Is it produced on time? Am I alerted quickly of failures? What about troubleshooting information? Are there any policy warnings or errors?
ADF Pricing Per Month

Automation/Coordination Layer (Coordination, Scheduling, Management): Automation & Management, Data Transformation & Movement

                 Cloud                 Cloud             On Premises           On Premises
                 0 (6)-100 activities  100+ activities   0 (6)-100 activities  100+ activities
Low Frequency    $0.3164               $0.2531           $0.7909               $0.6327
High Frequency   $0.5263               $0.4218           $1.3182               $1.0545

Note: prices may change at GA. Low Frequency: first 5 activities are free.

Execution Layer (Data Storage & Processing): resources used to execute activities in a pipeline: HDInsight (hrs), Compute/VM (hrs), Data Transfer (GB)
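As a worked example of the coordination-layer rates, here is a hedged Python sketch that estimates a monthly charge from the figures on the slide. Several points are assumptions and not confirmed by the slide: that each rate is charged per activity per month, that the 100+ rate applies only to activities beyond the first 100, and that the free allowance of 5 activities applies to low frequency only. The function `monthly_cost` is hypothetical, and execution-layer resources (HDInsight, compute, data transfer) are not included.

```python
# Rates from the slide: (rate up to 100 activities, rate beyond 100 activities).
RATES = {
    ("low", "cloud"): (0.3164, 0.2531),
    ("low", "onprem"): (0.7909, 0.6327),
    ("high", "cloud"): (0.5263, 0.4218),
    ("high", "onprem"): (1.3182, 1.0545),
}
FREE_LOW_FREQUENCY = 5  # "Low Frequency: first 5 activities are free."
TIER_LIMIT = 100


def monthly_cost(frequency, location, activities):
    """Estimate the monthly coordination charge for a number of activities.

    Assumption: graduated tiers, i.e. the 100+ rate applies only to the
    activities above the first 100.
    """
    first_rate, second_rate = RATES[(frequency, location)]
    free = FREE_LOW_FREQUENCY if frequency == "low" else 0
    in_first_tier = max(0, min(activities, TIER_LIMIT) - free)
    in_second_tier = max(0, activities - TIER_LIMIT)
    return in_first_tier * first_rate + in_second_tier * second_rate


if __name__ == "__main__":
    print(round(monthly_cost("low", "cloud", 10), 4))
    print(round(monthly_cost("high", "onprem", 120), 4))
```

Because the slide notes that prices may change at GA, the rate table is kept separate from the calculation so that it can be updated without touching the logic.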
Contact me: ChristianCote@IA-TechConsulting.com
Thank You! local PASS Community & Sponsors!