Asanka Padmakumara ETL 2.0: Data Engineering with Azure Databricks
Who am I?
Asanka Padmakumara, Business Intelligence Consultant
More than 8 years in BI and Data Warehousing
A regular speaker at data-related events
Blog: asankap.wordpress.com
LinkedIn: linkedin.com/in/asankapadmakumara
Twitter: @asanka_e
Where am I from?
Your feedback is important to me!
ETL 2.0?
Not a real term or concept, but a new way of handling ETL.
Why do we need ETL 2.0? Traditional ETL tools have several limitations:
Difficult to handle large data volumes
Unable to handle real-time data extraction
Difficult to handle unstructured data
Platform dependent, with dedicated hardware
Performance issues
Less flexibility
ETL Options in the Cloud
Azure SSIS Integration Runtime
Azure Data Factory (Data Flow)
HDInsight
PolyBase
Azure Data Lake Analytics
Databricks
Power BI Dataflow
http://www.jamesserra.com/archive/2019/01/what-product-to-use-to-transform-my-data/
Microsoft recommends Databricks for:
Advanced analytics on big data
Real-time analytics
Modern data warehousing
Microsoft recommends Databricks for Advanced Analytics on Big Data
Microsoft recommends Databricks for Real-time Analytics
Microsoft recommends Databricks for Modern data warehousing
What is Azure Databricks?
An Apache Spark-based analytics platform on Azure
Designed in collaboration with the founders of Spark
Fully managed Spark clusters on Azure
Azure Databricks: Main Features
One-click deployment
Auto-scaling / auto-termination
Optimized connectors to Azure storage platforms
Azure AD integration
Enterprise-grade security
Azure Databricks Ecosystem
Apache Spark
Distributed data processing engine with in-memory data processing
Supports Java, Python, Scala, R, and SQL
Used for: data integration, machine learning, stream processing, and interactive analytics
Demo 1: Walkthrough of Azure Databricks
Data Engineering using Databricks
1. DataFrame API
Untyped API: Columns, Rows
Same concept as a SQL table or an Excel spreadsheet
Immutable
Partitioned across multiple nodes
1. DataFrame API

from pyspark.sql.functions import desc

flightdata2015 = spark.read.option("inferSchema", "true") \
    .option("header", "true") \
    .csv("/data/flight-data/csv/2015-summary.csv")

flightdata2015.groupBy("DEST_COUNTRY_NAME").sum("count") \
    .withColumnRenamed("sum(count)", "destination_total") \
    .sort(desc("destination_total")) \
    .limit(5) \
    .show()
2. Datasets API
Strongly typed collection (case class)
Not available in Python and R
Slightly slower than DataFrames
Allows lambda functions
Use when type safety is needed, or when DataFrames do not support the required operations

case class Flight(
  DEST_COUNTRY_NAME: String,
  ORIGIN_COUNTRY_NAME: String,
  count: BigInt
)
val flightsDF = spark.read.parquet("/data/flight-data/parquet/2010-summary.parquet/")
val flights = flightsDF.as[Flight]
3. SQL API
Allows executing SQL queries against big data (sketch below)
ANSI SQL:2003 standard
Uses the Hive metastore to maintain tables
Spark SQL is fully compatible with HiveQL
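A minimal sketch of the SQL API, building on the flightdata2015 DataFrame from the earlier demo: registering a temporary view makes the DataFrame queryable with plain SQL (the view name is illustrative).

# Register the DataFrame from the earlier demo as a temporary view,
# then query it with plain SQL.
flightdata2015.createOrReplaceTempView("flights_2015")

spark.sql("""
    SELECT DEST_COUNTRY_NAME, SUM(count) AS destination_total
    FROM flights_2015
    GROUP BY DEST_COUNTRY_NAME
    ORDER BY destination_total DESC
    LIMIT 5
""").show()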
Comparing the APIs https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
Transformations and Actions
Transformation: instructions to modify a DataFrame; creates a new DataFrame. Examples: filter, where, join
Action: an event that triggers transformations. Three main kinds: view data, collect data, write data. Examples: count, collect, show, take(n) (see the sketch below)
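A minimal sketch of the transformation/action split, reusing the flightdata2015 DataFrame from the earlier demo: the transformations below only build a plan, and nothing runs until the action fires.

# Transformations only extend the logical plan; nothing executes here.
us_flights = flightdata2015 \
    .filter("DEST_COUNTRY_NAME = 'United States'") \
    .select("ORIGIN_COUNTRY_NAME", "count")

# The count() action makes Spark optimize the plan and run it on the cluster.
print(us_flights.count())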
Which language to choose?
Scala: the native language of Spark; faster compared to the other languages; more difficult syntax; less community help
Python: one of the most widely used languages; better community; good set of libraries such as PySpark; comparably slower than Scala for large data
R: lots of data science libraries; Databricks supports RStudio
SQL: allows SQL against big data; ANSI SQL:2003; uses the Hive metastore to maintain tables; tools like Power BI and Tableau can connect to SQL tables via JDBC
Demo 2: Read and Transform Data
Extract invoice data, transform it, and load it into an Invoice Transaction fact and a People dimension (E-T-L).
Security
Authentication and authorization via Azure AD-based security
Data security (table level): read, write, and modify permissions on databases, tables, views, and functions (a sketch follows)
Compute security (cluster level): permission levels are No Permissions, Can Attach To, Can Restart, Can Manage
Code security (workspace and notebook level): permission levels are Can Read, Can Run, Can Edit, Can Manage
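A minimal sketch of table-level security, assuming table access control is enabled on the cluster; the table and group names are illustrative placeholders.

# Grant read permission on a table to a workspace group (hypothetical names).
spark.sql("GRANT SELECT ON TABLE invoice_transaction_fact TO `bi-analysts`")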
Source Control and Scheduling
Source control:
Built-in basic version control via notebook revisions; select and restore any previously saved version of a notebook, and add a comment to a version
Supports GitHub, Bitbucket Cloud, and Azure DevOps
Scheduling:
Jobs can run a notebook, execute a JAR, or run spark-submit
Schedules range from monthly down to every minute
Can be configured to send an email on job start, job success, or job failure
Your feedback is important to me!
Q & A
More about this topic
Distributed computing? Why not MapReduce?
Handles large batch processing, but not real-time streams
Disk-based rather than in-memory
Comprehensive, but not easy to use
Slow at sharing data between operations
Good when size matters, but not when speed does
Iterative operations are painful
So the UC Berkeley AMPLab team created Spark, contributed it to Apache, and founded the Databricks company.
Extract
Default source types: CSV, JSON, Parquet, ORC, JDBC/ODBC connections, plain-text files
Hundreds of connectors from the community and Microsoft: Azure Cosmos DB, Azure SQL Data Warehouse, MongoDB, Cassandra, etc.
Supports parallel reading (depending on the source)
Small files can be uploaded directly to DBFS
DBFS: a distributed file system on Databricks clusters; files persist to Azure Blob Storage; can mount Azure Blob Storage and Azure Data Lake Store Gen1 (see the sketch below)
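A minimal sketch of mounting Azure Blob Storage onto DBFS and reading from the mount; the storage account, container, secret scope, and paths are illustrative placeholders (dbutils is available in Databricks notebooks).

# Mount an Azure Blob Storage container onto DBFS (hypothetical names).
dbutils.fs.mount(
    source = "wasbs://invoices@mystorageacct.blob.core.windows.net",
    mount_point = "/mnt/invoices",
    extra_configs = {
        "fs.azure.account.key.mystorageacct.blob.core.windows.net":
            dbutils.secrets.get(scope = "etl-secrets", key = "storage-key")
    }
)

# Files under the mount are now readable like any DBFS path.
invoices = spark.read.option("header", "true").csv("/mnt/invoices/raw/")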
Transform
pyspark.sql library (Python) and Spark SQL (Scala, SQL)
User-Defined Functions (UDFs) in Python or Scala: define your own functions, scoped to the session; executed row by row over a DataFrame; prefer Scala over Python for performance (a PySpark UDF sketch follows)
Custom libraries written in Python, Java, Scala, and R
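A minimal sketch of a session-scoped PySpark UDF; the function name and cleaning logic are illustrative. Note that a UDF like this runs row by row, which is why built-in functions (or Scala UDFs) are preferred for performance.

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Hypothetical UDF: normalize country names to title case.
@udf(returnType = StringType())
def clean_country(name):
    return name.strip().title() if name else None

flights_clean = flightdata2015.withColumn(
    "DEST_COUNTRY_NAME", clean_country("DEST_COUNTRY_NAME"))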
Load
Save modes: append, overwrite, errorIfExists, ignore
Does not support update
If the destination does not support truncate, it recreates the table/file
Supports parallel writing (depending on the destination); a write sketch follows
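A minimal sketch of loading with explicit save modes, continuing from the flights_clean DataFrame above; the output path, JDBC URL, table name, and credentials are illustrative placeholders.

# Overwrite the curated Parquet output; "overwrite" is one of the four save modes.
flights_clean.write.mode("overwrite").parquet("/mnt/invoices/curated/flights")

# Append to a JDBC destination (URL, table, and credentials are placeholders).
flights_clean.write.mode("append").format("jdbc") \
    .option("url", "jdbc:sqlserver://myserver.database.windows.net;database=dw") \
    .option("dbtable", "dbo.InvoiceTransactionFact") \
    .option("user", "etl_user") \
    .option("password", dbutils.secrets.get(scope = "etl-secrets", key = "sql-password")) \
    .save()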
DataFrame API
Untyped API: Columns, Rows
Same concept as a SQL table or an Excel spreadsheet
Immutable
Distributed across multiple nodes; partitions break a DataFrame across the cluster of machines
Can define the schema manually or take it from the source (see the sketch below)
Great for data scientists who have worked with Python pandas or R DataFrames
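A minimal sketch of defining a schema manually instead of relying on inferSchema, using the flight-data CSV from the earlier demo.

from pyspark.sql.types import StructType, StructField, StringType, LongType

# Define the schema up front instead of paying for inferSchema on a second pass.
flight_schema = StructType([
    StructField("DEST_COUNTRY_NAME", StringType(), True),
    StructField("ORIGIN_COUNTRY_NAME", StringType(), True),
    StructField("count", LongType(), True),
])

flights = spark.read.schema(flight_schema) \
    .option("header", "true") \
    .csv("/data/flight-data/csv/2015-summary.csv")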
SQL API
Allows executing SQL queries against big data; ANSI SQL:2003 standard
Uses the Hive metastore to maintain tables; Spark SQL is fully compatible with HiveQL
Objects: databases; tables (managed and unmanaged); views (local temp views and global temp views)
Familiar to BI analysts (a managed vs. unmanaged sketch follows)
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
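A minimal sketch of the managed vs. unmanaged distinction, reusing the flights_2015 temp view and the curated path from the earlier sketches; table names and the path are illustrative.

# Managed table: Spark owns both the metadata and the data location.
spark.sql("CREATE TABLE flights_managed AS SELECT * FROM flights_2015")

# Unmanaged (external) table: Spark owns only the metadata; the data stays
# at the given path (illustrative placeholder).
spark.sql("""
    CREATE TABLE flights_external (
        DEST_COUNTRY_NAME STRING,
        ORIGIN_COUNTRY_NAME STRING,
        count BIGINT
    )
    USING parquet
    LOCATION '/mnt/invoices/curated/flights'
""")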
Transformations and Actions
Transformation: instructions to modify a DataFrame; creates a new DataFrame; divided into narrow transformations and wide transformations. Examples: filter, where, join
Lazy evaluation: transformations build a logical plan; the plan is executed only by an action
Action: an event that triggers transformations. Three main kinds: view data, collect data, write data. Examples: count, collect, show, take(n) (see the explain() sketch below)
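A minimal sketch showing a narrow and a wide transformation in one plan, reusing flightdata2015 from the earlier demo: explain() prints the physical plan without executing it, and the Exchange operator it shows is the shuffle introduced by the wide groupBy.

# filter is narrow (no data movement); groupBy is wide (requires a shuffle).
plan = flightdata2015.filter("count > 10").groupBy("DEST_COUNTRY_NAME").sum("count")

# Prints the physical plan (including the Exchange for the shuffle) without running it.
plan.explain()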