Transitioning From SSIS to Azure Data Factory
Meagan Longoria, Solution Architect, BlueGranite
Meagan Longoria
Solution Architect, BlueGranite
/meaganlongoria, @mmarie
- Microsoft Data Platform MVP: I enjoy contributing to and learning from the Microsoft data community.
- Blogger: I blog about business intelligence, data visualization, and consulting at DataSavvy.me.
- Contributor to a new book: I had the pleasure of writing a chapter for Let Her Finish Series: Voices from the Data Platform.
- Owner of an English Bulldog: My Twitter account is business intelligence with a side of bulldog.
About You
- SSIS developers?
- Data Factory developers?
- Accidentally walked into the room and decided not to leave?
What Is Integration Services?
- Microsoft Integration Services is a platform for building enterprise-level data integration and data transformation solutions.
- Basically: a data migration and ETL tool.
- It is a component of SQL Server that has existed since SQL Server 2005.
https://docs.microsoft.com/en-us/sql/integration-services/sql-server-integration-services
SSIS Overview
Inside an SSIS Solution
- 1 or more projects containing 1 or more packages
- 0 or more project-level connection managers
- 0 or more project parameters
- Hierarchy: Solution > Project(s) > Package(s) > Task(s)
Inside an SSIS Package
Each package contains:
- Control Flow
- Data Flow
- Connection Managers
Diagram: Package; Control Flow (Task, Data Flow Task); Data Flow (Source, Transformation, Destination)
Inside an SSIS Package, Continued
The following objects extend the functionality of a package:
- Parameters
- Variables
- Event Handlers
- Configurations
- Logging and Log Providers
Example SSIS Package
ADF Overview
What Is Azure Data Factory?
- ADF is a cloud-based data integration service that allows you to create data-driven workflows in the cloud for orchestrating and automating data movement and data transformation.
- Basically: a data orchestration tool.
- It is a Platform-as-a-Service offering in Azure that was released in 2015 and updated in 2017.
Inside an Azure Data Factory V1 Solution
- Linked Service: defines connection properties
- Datasets: pointers to the data you want to process or have processed, sometimes defining the schema
- Pipelines: combine datasets and activities and define an execution schedule
- Data Management Gateway: allows ADF to retrieve data from an on-premises data source
Inside an Azure Data Factory V1 Solution, Continued
Diagram: Pipeline (Dataset, Dataset > Activity > Activity > Dataset, Dataset)
- Activities consume and produce datasets
- Pipelines are logical groupings of activities
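Because V1 activities consume and produce datasets, the run order follows from which activity produces a dataset that another activity consumes. The following is a minimal Python sketch of that idea, not ADF code: the activity and dataset names are hypothetical, and the standard-library `graphlib` module stands in for whatever the service does internally.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each activity lists the datasets it consumes and produces.
activities = {
    "BuildAggregate": {"inputs": ["FactSales"], "outputs": ["SalesSummary"]},
    "CopyBlobToStaging": {"inputs": ["RawBlob"], "outputs": ["StagingTable"]},
    "LoadWarehouse": {"inputs": ["StagingTable"], "outputs": ["FactSales"]},
}

def execution_order(activities):
    """Order activities so every dataset is produced before it is consumed."""
    # Map each dataset to the activity that produces it.
    producer = {ds: name for name, a in activities.items() for ds in a["outputs"]}
    # An activity depends on the producers of its inputs. Datasets with no
    # producer (such as RawBlob) are treated as external and add no dependency.
    graph = {
        name: {producer[ds] for ds in a["inputs"] if ds in producer}
        for name, a in activities.items()
    }
    return list(TopologicalSorter(graph).static_order())

order = execution_order(activities)
print(order)
```

Note that nothing in the sketch states the order directly; it is implied entirely by the dataset names each activity shares with its neighbors.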
Example ADF V1 Pipeline
New In Azure Data Factory V2
- Integration Runtime: compute infrastructure used to provide integration capabilities across networks; can be on-premises, managed, or IaaS
- Control Flow: activity dependencies, parameters, foreach loops, activity outputs
- Trigger-based flows: on demand (coming soon) or at a certain time
- Monitoring: pipeline runs rather than just activities (SDK only right now)
- SSIS in Azure
A Few Notes on ADF V2
- Brand new SDKs for ADF V2: .NET, PowerShell, Python (future: Java)
- Only available in East US and East US 2
- For now, must be created programmatically; cloud-based GUI designer coming soon
Compare Solutions
Similarities Between SSIS & ADF
Some Things Are The Same
- Both are developed using Visual Studio
- Both can copy data to and from Azure
- Both can fire up HDInsight clusters and run Hive and Pig scripts
- Both use role-based security
- Both can trigger alerts upon encountering an error
- Both have logging
- Both can be automated (Biml, PowerShell, .NET)
Dependencies/Order
[Figure panels: SSIS, ADF V1, ADF V2]
Orchestration
- ADF V1 is a data orchestration tool
- ADF V2 is a data orchestration and integration tool
- SSIS can be used as a data orchestration tool
- Have you ever seen anyone use SSIS just to execute stored procedures?
Differences Between SSIS & ADF
Not Better, Not Worse, Just Different
- ADF is usually used for ELT (as opposed to ETL)
- ADF V1 is built around the concept of timeslices; V2 has options
- ADF scheduling is in the pipeline; SSIS needs SQL Agent or Azure Automation (or another tool)
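To make the timeslice idea concrete: a V1 pipeline's active period is cut into fixed windows, and each window is processed as its own unit of work. The Python below is only a conceptual sketch of that slicing, not the ADF scheduler; the dates and the one-hour interval are made up for illustration.

```python
from datetime import datetime, timedelta

def timeslices(start, end, interval):
    """Yield (slice_start, slice_end) windows covering [start, end)."""
    current = start
    while current < end:
        # Clamp the last window so it never runs past the end of the period.
        yield current, min(current + interval, end)
        current += interval

# Hypothetical active period: three hours, sliced hourly.
slices = list(
    timeslices(
        datetime(2017, 10, 1, 0),
        datetime(2017, 10, 1, 3),
        timedelta(hours=1),
    )
)
for slice_start, slice_end in slices:
    print(slice_start.isoformat(), "to", slice_end.isoformat())
```

An SSIS package, by contrast, has no built-in notion of such windows; it simply runs whenever an external scheduler starts it.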
ADF Gaps
- No built-in transformations: use SSIS, C#, or Spark
- V1 has no built-in error handling; V2 has an On Failure activity
- V1 has no GUI in VS; V2 will have a GUI in Azure but no source control connectivity
- V1: config files in VS cause multiple copies of files per environment
- Can't execute more frequently than every 15 minutes with the native scheduler
- V1: not as many data sources as SSIS. Sharpen your C# skills to get around this!
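The V2 failure path can be pictured as a dependency that fires only when an upstream activity fails. The Python below is a toy runner written for illustration under that assumption, not ADF itself: the `depends_on` and `conditions` keys, the activity names, and the status strings are all hypothetical.

```python
def run_pipeline(activities):
    """Run activities in listed order, honoring each dependency's condition."""
    status = {}
    for act in activities:
        deps = act.get("depends_on", [])
        # An activity runs only if every upstream status matches its condition.
        if all(status.get(d["activity"]) in d["conditions"] for d in deps):
            try:
                act["run"]()
                status[act["name"]] = "Succeeded"
            except Exception:
                status[act["name"]] = "Failed"
        else:
            status[act["name"]] = "Skipped"
    return status

def failing_copy():
    raise RuntimeError("copy failed")

# Hypothetical pipeline: the alert runs only when the copy fails.
pipeline = [
    {"name": "CopyData", "run": failing_copy},
    {"name": "LoadWarehouse", "run": lambda: None,
     "depends_on": [{"activity": "CopyData", "conditions": ["Succeeded"]}]},
    {"name": "SendAlert", "run": lambda: None,
     "depends_on": [{"activity": "CopyData", "conditions": ["Failed"]}]},
]
status = run_pipeline(pipeline)
print(status)
```

This mirrors what SSIS developers do with a failure precedence constraint: the success branch is skipped and the failure branch runs.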
ADF Strengths
- PaaS requires no infrastructure
- Easier to scale out; great with Big Data
- ADF JSON is easier to source control than the GUI-created XML of SSIS
- Updated more frequently than SSIS (new features!)
- Native support for zip/unzip
- Dynamic partitioning for folder and file name
- Data lineage
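Dynamic partitioning means the folder and file name are derived from the time window being processed, so each run lands in its own location. A rough Python sketch of the idea follows; the template, the placeholder names, and the path are assumptions for illustration rather than ADF syntax.

```python
from datetime import datetime

def partitioned_path(template, slice_start):
    """Fill date-part placeholders in a path template from the slice start."""
    return template.format(
        Year=f"{slice_start:%Y}",
        Month=f"{slice_start:%m}",
        Day=f"{slice_start:%d}",
        Hour=f"{slice_start:%H}",
    )

# Hypothetical template and slice start time.
path = partitioned_path(
    "sales/{Year}/{Month}/{Day}/sales_{Hour}.csv",
    datetime(2017, 10, 1, 14),
)
print(path)  # sales/2017/10/01/sales_14.csv
```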
ADF Lessons Learned
Lessons Learned From Loading SQL DW
- No transforms means get ready to write a bunch of SQL
- Time required for deployment to Azure can vary by a couple of hours
- ADF is in UTC; be careful of Daylight Saving Time
- In V1, one-time pipelines don't auto-execute on deployment and don't appear in the Monitor & Manage app
- ADF cannot natively move, only copy
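The UTC warning is easy to demonstrate: a load meant to run at the same local time all year maps to different UTC hours in winter and summer. The sketch below uses Python's standard `zoneinfo` module; the time zone and dates are arbitrary examples, and on systems without a time zone database (some Windows installs) the `tzdata` package is needed.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def local_to_utc(year, month, day, hour, tz_name):
    """Convert a local wall-clock hour to the equivalent UTC datetime."""
    local = datetime(year, month, day, hour, tzinfo=ZoneInfo(tz_name))
    return local.astimezone(timezone.utc)

# The same 6:00 local start time, before and after the clocks change.
winter = local_to_utc(2017, 1, 15, 6, "America/Denver")
summer = local_to_utc(2017, 7, 15, 6, "America/Denver")
print(winter.hour, summer.hour)  # 13 12
```

A schedule fixed at 13:00 UTC would therefore drift to 7:00 local time in summer unless it is adjusted.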
Lessons Learned From Loading SQL DW, Continued
- Be sure to use Service Principal auth with ADLS
- Beware the missing JRE when converting to ORC files
- Deploy with PowerShell so you don't have to re-deploy datasets and affect their availability
- Automate with Biml!
- Check out Gerhard's ADF monitor made in Power BI: https://github.com/gbrueckl/azure.datafactory.powerbimonitor
Lessons Learned: Final Thoughts
- ADF solutions may contain 5 or more different languages. Don't be afraid to mix and match technologies for the best fit.
- For now, embrace the custom activity.
- ADF solutions can better handle different types of data (big/small/tabular/semi-structured)
Thank You
Learn more from Meagan Longoria: @mmarie, DataSavvy.me