Asanka Padmakumara ETL 2.0: Data Engineering with Azure Databricks
Who am I?
Asanka Padmakumara, Business Intelligence Consultant
More than 8 years in BI and Data Warehousing
A regular speaker at data-related events
Blog: asankap.wordpress.com
LinkedIn: linkedin.com/in/asankapadmakumara
Twitter: @asanka_e
Where am I from?
Your feedback is important to me!
ETL 2.0?
Not a real term or concept, but a new way of handling ETL.
Why do we need ETL 2.0? Traditional ETL tools have several limitations:
Difficult to handle large data volumes
Unable to handle real-time data extraction
Difficult to handle unstructured data
Platform dependent, with dedicated hardware
Performance issues
Less flexibility
ETL Options in the Cloud
Azure SSIS Integration Runtime
Azure Data Factory (Data Flow)
HDInsight
PolyBase
Azure Data Lake Analytics
Databricks
Power BI Dataflow
http://www.jamesserra.com/archive/2019/01/what-product-to-use-to-transform-my-data/
Microsoft recommends Databricks for:
Advanced analytics on big data
Real-time analytics
Modern data warehousing
Microsoft recommends Databricks for Advanced Analytics on Big Data
Microsoft recommends Databricks for Real-time Analytics
Microsoft recommends Databricks for Modern data warehousing
What is Azure Databricks?
An Apache Spark-based analytics platform on Azure
Designed in collaboration with the founders of Spark
Fully managed Spark clusters on Azure
Azure Databricks: Main Features
One-click deployment
Auto-scaling / auto-termination
Optimized connectors to Azure storage platforms
Azure AD integration
Enterprise-grade security
Azure Databricks Ecosystem
Apache Spark
Distributed data processing engine with in-memory data processing
Supports Java, Python, Scala, R, and SQL
Used for: data integration, machine learning, stream processing, and interactive analytics
Demo 1: Walkthrough of Azure Databricks
Data Engineering using Databricks
1. DataFrame API
Untyped API: Columns, Rows
Same concept as a SQL table or an Excel spreadsheet
Immutable
Partitioned across multiple nodes
1. DataFrame API

from pyspark.sql.functions import desc

flightdata2015 = spark.read.option("inferSchema", "true") \
    .option("header", "true") \
    .csv("/data/flight-data/csv/2015-summary.csv")

flightdata2015.groupBy("DEST_COUNTRY_NAME").sum("count") \
    .withColumnRenamed("sum(count)", "destination_total") \
    .sort(desc("destination_total")) \
    .limit(5) \
    .show()
2. Datasets API
Strongly typed collection (case class)
Not available in Python and R
Slightly slower than DataFrames
Allows lambda functions
Use when type safety is needed, or when DataFrames do not support the required operations

case class Flight(
  DEST_COUNTRY_NAME: String,
  ORIGIN_COUNTRY_NAME: String,
  count: BigInt
)
val flightsDF = spark.read.parquet("/data/flight-data/parquet/2010-summary.parquet/")
val flights = flightsDF.as[Flight]
3. SQL API
Allows executing SQL queries against big data (sketch below)
ANSI SQL:2003 standard
Uses the Hive metastore to maintain tables
Spark SQL is fully compatible with HiveQL
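A minimal sketch of the SQL API, building on the flightdata2015 DataFrame from the earlier demo: registering a temporary view makes the DataFrame queryable with plain SQL (the view name is illustrative).

# Register the DataFrame from the earlier demo as a temporary view,
# then query it with plain SQL.
flightdata2015.createOrReplaceTempView("flights_2015")

spark.sql("""
    SELECT DEST_COUNTRY_NAME, SUM(count) AS destination_total
    FROM flights_2015
    GROUP BY DEST_COUNTRY_NAME
    ORDER BY destination_total DESC
    LIMIT 5
""").show()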
Comparing the APIs https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
Transformations and Actions
Transformation: instructions to modify a DataFrame; creates a new DataFrame. Examples: filter, where, join
Action: an event that triggers transformations. Three main kinds: view data, collect data, write data. Examples: count, collect, show, take(n) (see the sketch below)
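A minimal sketch of the transformation/action split, reusing the flightdata2015 DataFrame from the earlier demo: the transformations below only build a plan, and nothing runs until the action fires.

# Transformations only extend the logical plan; nothing executes here.
us_flights = flightdata2015 \
    .filter("DEST_COUNTRY_NAME = 'United States'") \
    .select("ORIGIN_COUNTRY_NAME", "count")

# The count() action makes Spark optimize the plan and run it on the cluster.
print(us_flights.count())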
Which language to choose?
Scala: the native language of Spark; faster compared to the other languages; more difficult syntax; less community help
Python: one of the most widely used languages; better community; good set of libraries such as PySpark; comparably slower than Scala for large data
R: lots of data science libraries; Databricks supports RStudio
SQL: allows SQL against big data; ANSI SQL:2003; uses the Hive metastore to maintain tables; tools like Power BI and Tableau can connect to SQL tables via JDBC
Demo 2: Read and Transform Data
Extract invoice data, transform it, and load it into an Invoice Transaction fact and a People dimension (E-T-L).
Security
Authentication and authorization via Azure AD-based security
Data security (table level): read, write, and modify permissions on databases, tables, views, and functions (a sketch follows)
Compute security (cluster level): permission levels are No Permissions, Can Attach To, Can Restart, Can Manage
Code security (workspace and notebook level): permission levels are Can Read, Can Run, Can Edit, Can Manage
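A minimal sketch of table-level security, assuming table access control is enabled on the cluster; the table and group names are illustrative placeholders.

# Grant read permission on a table to a workspace group (hypothetical names).
spark.sql("GRANT SELECT ON TABLE invoice_transaction_fact TO `bi-analysts`")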
Source Control and Scheduling
Source control:
Built-in basic version control via notebook revisions; select and restore any previously saved version of a notebook, and add a comment to a version
Supports GitHub, Bitbucket Cloud, and Azure DevOps
Scheduling:
Jobs can run a notebook, execute a JAR, or run spark-submit
Schedules range from monthly down to every minute
Can be configured to send an email on job start, job success, or job failure
Your feedback is important to me!
Q & A
More about this topic
Distributed computing? Why not MapReduce?
Handles large batch processing, but not real-time streams
Disk-based rather than in-memory
Comprehensive, but not easy to use
Slow at sharing data between operations
Good when size matters, but not when speed does
Iterative operations are painful
So the UC Berkeley AMPLab team created Spark, contributed it to Apache, and founded the Databricks company.
Extract
Default source types: CSV, JSON, Parquet, ORC, JDBC/ODBC connections, plain-text files
Hundreds of connectors from the community and Microsoft: Azure Cosmos DB, Azure SQL Data Warehouse, MongoDB, Cassandra, etc.
Supports parallel reading (depending on the source)
Small files can be uploaded directly to DBFS
DBFS: a distributed file system on Databricks clusters; files persist to Azure Blob Storage; can mount Azure Blob Storage and Azure Data Lake Store Gen1 (see the sketch below)
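A minimal sketch of mounting Azure Blob Storage onto DBFS and reading from the mount; the storage account, container, secret scope, and paths are illustrative placeholders (dbutils is available in Databricks notebooks).

# Mount an Azure Blob Storage container onto DBFS (hypothetical names).
dbutils.fs.mount(
    source = "wasbs://invoices@mystorageacct.blob.core.windows.net",
    mount_point = "/mnt/invoices",
    extra_configs = {
        "fs.azure.account.key.mystorageacct.blob.core.windows.net":
            dbutils.secrets.get(scope = "etl-secrets", key = "storage-key")
    }
)

# Files under the mount are now readable like any DBFS path.
invoices = spark.read.option("header", "true").csv("/mnt/invoices/raw/")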
Transform
pyspark.sql library (Python) and Spark SQL (Scala, SQL)
User-Defined Functions (UDFs) in Python or Scala: define your own functions, scoped to the session; executed row by row over a DataFrame; prefer Scala over Python for performance (a PySpark UDF sketch follows)
Custom libraries written in Python, Java, Scala, and R
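A minimal sketch of a session-scoped PySpark UDF; the function name and cleaning logic are illustrative. Note that a UDF like this runs row by row, which is why built-in functions (or Scala UDFs) are preferred for performance.

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Hypothetical UDF: normalize country names to title case.
@udf(returnType = StringType())
def clean_country(name):
    return name.strip().title() if name else None

flights_clean = flightdata2015.withColumn(
    "DEST_COUNTRY_NAME", clean_country("DEST_COUNTRY_NAME"))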
Load
Save modes: append, overwrite, errorIfExists, ignore
Does not support update
If the destination does not support truncate, it recreates the table/file
Supports parallel writing (depending on the destination); a write sketch follows
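A minimal sketch of loading with explicit save modes, continuing from the flights_clean DataFrame above; the output path, JDBC URL, table name, and credentials are illustrative placeholders.

# Overwrite the curated Parquet output; "overwrite" is one of the four save modes.
flights_clean.write.mode("overwrite").parquet("/mnt/invoices/curated/flights")

# Append to a JDBC destination (URL, table, and credentials are placeholders).
flights_clean.write.mode("append").format("jdbc") \
    .option("url", "jdbc:sqlserver://myserver.database.windows.net;database=dw") \
    .option("dbtable", "dbo.InvoiceTransactionFact") \
    .option("user", "etl_user") \
    .option("password", dbutils.secrets.get(scope = "etl-secrets", key = "sql-password")) \
    .save()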
DataFrame API
Untyped API: Columns, Rows
Same concept as a SQL table or an Excel spreadsheet
Immutable
Distributed across multiple nodes; partitions break a DataFrame across the cluster of machines
Can define the schema manually or take it from the source (see the sketch below)
Great for data scientists who have worked with Python pandas or R DataFrames
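A minimal sketch of defining a schema manually instead of relying on inferSchema, using the flight-data CSV from the earlier demo.

from pyspark.sql.types import StructType, StructField, StringType, LongType

# Define the schema up front instead of paying for inferSchema on a second pass.
flight_schema = StructType([
    StructField("DEST_COUNTRY_NAME", StringType(), True),
    StructField("ORIGIN_COUNTRY_NAME", StringType(), True),
    StructField("count", LongType(), True),
])

flights = spark.read.schema(flight_schema) \
    .option("header", "true") \
    .csv("/data/flight-data/csv/2015-summary.csv")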
SQL API
Allows executing SQL queries against big data; ANSI SQL:2003 standard
Uses the Hive metastore to maintain tables; Spark SQL is fully compatible with HiveQL
Objects: databases; tables (managed and unmanaged); views (local temp views and global temp views)
Familiar to BI analysts (a managed vs. unmanaged sketch follows)
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
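A minimal sketch of the managed vs. unmanaged distinction, reusing the flights_2015 temp view and the curated path from the earlier sketches; table names and the path are illustrative.

# Managed table: Spark owns both the metadata and the data location.
spark.sql("CREATE TABLE flights_managed AS SELECT * FROM flights_2015")

# Unmanaged (external) table: Spark owns only the metadata; the data stays
# at the given path (illustrative placeholder).
spark.sql("""
    CREATE TABLE flights_external (
        DEST_COUNTRY_NAME STRING,
        ORIGIN_COUNTRY_NAME STRING,
        count BIGINT
    )
    USING parquet
    LOCATION '/mnt/invoices/curated/flights'
""")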
Transformations and Actions
Transformation: instructions to modify a DataFrame; creates a new DataFrame; divided into narrow transformations and wide transformations. Examples: filter, where, join
Lazy evaluation: transformations build a logical plan; the plan is executed only by an action
Action: an event that triggers transformations. Three main kinds: view data, collect data, write data. Examples: count, collect, show, take(n) (see the explain() sketch below)
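A minimal sketch showing a narrow and a wide transformation in one plan, reusing flightdata2015 from the earlier demo: explain() prints the physical plan without executing it, and the Exchange operator it shows is the shuffle introduced by the wide groupBy.

# filter is narrow (no data movement); groupBy is wide (requires a shuffle).
plan = flightdata2015.filter("count > 10").groupBy("DEST_COUNTRY_NAME").sum("count")

# Prints the physical plan (including the Exchange for the shuffle) without running it.
plan.explain()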