Oskari Heikkinen New capabilities of Azure Data Factory v2
Oskari Heikkinen
Lead Cloud Architect at BIGDATAPUMP
Microsoft P-TSP, Azure Advisors
Numerous projects on Azure
Worked with the Microsoft Data Platform since 2011
oskari.heikkinen@bigdatapump.com | @oskarialex | +358 40 561 8481
https://www.linkedin.com/in/oskariheikkinen/
Agenda
- Brief history of integration on Azure
- Azure Data Factory v1
- Azure Data Factory v2
- Comparison
- Demo time!
Brief history of Integration on Azure
Until October 2014: SQL Server Integration Services (SSIS) was the only Microsoft solution for data movement and transformation
October 2014: Azure Data Factory v1 public preview
August 2015: Azure Data Factory v1 GA
September 2017: Azure Data Factory v2 public preview
Azure Data Factory v1
Azure Data Factory is the data integration service in Azure:
- Ingest data from data stores
- Transform data by pushing down commands/queries to e.g. Hadoop, Data Lake Analytics or SQL databases (Data Factory itself does not transform data)
- Publish output data to data stores
A Data Factory workflow is implemented as one or more pipelines, which orchestrate and automate data movement and transformation. It supports several on-premises and cloud sources and offers monitoring capability.
Azure Data Factory v1 The Diagram View of Data Factory provides a pane for monitoring a data factory and its assets.
Azure Data Factory v1
On-premises integration scenario: direct connection to data source
Diagram: SQL Server and Oracle sources on the customer network are accessed by the Data Management Gateway (SQL Server over TDS/TLS, port 1433; Oracle over SQL*Net, port 1521); the gateway connects to Azure through a proxy server over ExpressRoute or a VPN connection through a VNet.
On-premises integration scenario: flat file integration
Diagram: SQL Server and Oracle sources on the customer network export to a file share, which the Data Management Gateway reads over SMB 3.0 (port 445); the gateway connects to Azure through a proxy server over ExpressRoute or a VPN connection through a VNet.
On-premises integration scenario: OData integration
Diagram: the data source on the customer network is exposed through an OData interface, which the Data Management Gateway calls over HTTPS (port 443); the gateway connects to Azure through a proxy server over ExpressRoute or a VPN connection through a VNet.
Solution planning for example scenario
Use case: the team has developed an architecture for real-time analytics but is missing batch processing. We need to produce Power BI reports at one-hour intervals. The data is currently saved to Data Lake Store in real time. The report should be able to show data dynamically for different days/hours/weeks.
Logical thought process:
- What is the business need? We don't need stream analytics; batch-based processing is enough.
- Azure Data Factory vs. SSIS? Data Factory processes data in slices, so no custom solution is needed when there are problems in the incoming stream (reruns, alerts).
- Data Factory is chosen. What will I use for data transformation?
- The source is Data Lake Store, the target is SQL DB (the data model is simple, the amount of data is small).
- If I choose data copy plus a stored procedure in SQL DB, the solution is not scalable (it moves unnecessary data to SQL DB before processing). Processing should happen on top of Data Lake Store.
- Data Lake Analytics is faster to spin up than HDInsight or Databricks. I choose Data Lake Analytics and build the code.
Architecture for example scenario
- Data Lake Analytics filters hourly data and creates pre-calculations
- Copy the pre-calculated results to the database
- Run a stored procedure to make the history comparison (between hours) and move data from Staging to DW
Diagram: IoT devices at the customer edge send data to the cloud; Data Factory orchestrates Data Lake Analytics on top of Data Lake Store and loads the results into Azure SQL Database.
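The copy step above (pre-calculated results from Data Lake Store into the SQL Database staging table) could be expressed roughly as the following Copy activity. This is a sketch of the ADF v2 JSON shape only; the dataset names (DLSResults, SqlStaging) are hypothetical.

```json
{
  "name": "CopyResultsToStaging",
  "type": "Copy",
  "inputs":  [ { "referenceName": "DLSResults", "type": "DatasetReference" } ],
  "outputs": [ { "referenceName": "SqlStaging", "type": "DatasetReference" } ],
  "typeProperties": {
    "source": { "type": "AzureDataLakeStoreSource" },
    "sink":   { "type": "SqlSink" }
  }
}
```

The stored-procedure step would follow as a separate activity in the same pipeline, chained after the copy.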
Azure Data Factory v2
Azure Data Factory v2: Object types
Linked Service: a connection object for data sources, data destinations and the compute resources required
Dataset: defines the structure of the data
Activity: a single task in a pipeline; there are three types of activities: control, data movement and data transformation
Pipeline: a set of activities orchestrated sequentially and/or in parallel to execute the whole end-to-end logic
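These objects reference each other in the v2 JSON model: a dataset points at the linked service that holds the connection, and activities in a pipeline point at datasets. A minimal dataset sketch (the names SqlDataset, SqlLinkedService and the table are hypothetical):

```json
{
  "name": "SqlDataset",
  "properties": {
    "type": "AzureSqlTable",
    "linkedServiceName": {
      "referenceName": "SqlLinkedService",
      "type": "LinkedServiceReference"
    },
    "typeProperties": { "tableName": "dbo.Staging" }
  }
}
```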
Azure Data Factory v2: Integration Runtime
Provides integration and transformation capabilities over different network environments. Enables:
- Data movement
- SSIS package execution
- Activity dispatch
Integration Runtime types and capabilities:

IR type     | Public network                   | Private network
Azure       | Data movement, activity dispatch | -
Azure-SSIS  | SSIS package execution           | SSIS package execution
Self-hosted | Data movement, activity dispatch | Data movement, activity dispatch
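A linked service is bound to a specific Integration Runtime through a connectVia reference; a sketch for an on-premises SQL Server reached through a self-hosted IR (the names OnPremSqlServer and MySelfHostedIR and the connection string are hypothetical):

```json
{
  "name": "OnPremSqlServer",
  "properties": {
    "type": "SqlServer",
    "typeProperties": {
      "connectionString": "Server=myserver;Database=mydb;Integrated Security=True"
    },
    "connectVia": {
      "referenceName": "MySelfHostedIR",
      "type": "IntegrationRuntimeReference"
    }
  }
}
```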
Azure Data Factory v2: Running the pipeline
Using a trigger:
- Schedule trigger
- Tumbling window trigger (similar to slices in v1)
On demand:
- PowerShell:
Invoke-AzureRmDataFactoryV2Pipeline -DataFactory $df -PipelineName "Adfv2Pipeline" -ParameterFile .\PipelineParameters.json
- REST API:
https://management.azure.com/subscriptions/mysubid/resourcegroups/myresourcegroup/providers/microsoft.datafactory/factories/mydatafactory/pipelines/copypipeline/createrun?api-version=2017-03-01-preview
- .NET:
client.Pipelines.CreateRunWithHttpMessagesAsync(resourceGroup, dataFactoryName, pipelineName, parameters)
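A trigger is itself defined as a JSON object attached to one or more pipelines; a minimal hourly schedule-trigger sketch (the trigger name and start time are hypothetical):

```json
{
  "name": "HourlyTrigger",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Hour",
        "interval": 1,
        "startTime": "2017-12-01T00:00:00Z",
        "timeZone": "UTC"
      }
    },
    "pipelines": [
      {
        "pipelineReference": {
          "type": "PipelineReference",
          "referenceName": "Adfv2Pipeline"
        },
        "parameters": {}
      }
    ]
  }
}
```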
Comparison

Functionality      | Data Factory v1 | Data Factory v2
Parameters         | No              | Yes: key-value pairs that can be defined at the start of a run (trigger and on-demand execution)
Pipeline runs      | No              | Yes: a single instance of a pipeline execution
Activity runs      | No              | Yes: an instance of an activity execution within a pipeline
Trigger runs       | No              | Yes: an instance of a trigger execution
Scheduling         | No              | Schedule trigger or execution via an external scheduler
Run SSIS packages  | No              | Yes, with the Integration Runtime
On-demand Spark    | No              | Yes, both HDInsight and Databricks
Control flow       | No              | Yes
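The v2 parameters row can be illustrated with a sketch of a pipeline declaring a parameter (the name Adfv2Pipeline and the parameter inputPath are hypothetical); an activity inside the pipeline could then consume it with the expression @pipeline().parameters.inputPath, and a run would supply a value via the trigger or a parameter file:

```json
{
  "name": "Adfv2Pipeline",
  "properties": {
    "parameters": {
      "inputPath": { "type": "String", "defaultValue": "2017/12/01" }
    },
    "activities": []
  }
}
```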
Demo time!
- Create Azure Data Factory v2
- Create a pipeline for transferring data from on-premises to Azure Data Lake Store
- Use Azure Databricks for machine learning
- Push the predictions to Azure SQL Data Warehouse
- [Visualize with Power BI]