Asanka Padmakumara. ETL 2.0: Data Engineering with Azure Databricks

1 Asanka Padmakumara ETL 2.0: Data Engineering with Azure Databricks

2 Who am I?
- Asanka Padmakumara
- Business Intelligence Consultant
- More than 8 years in BI and data warehousing
- A regular speaker at data-related events
- Blog: asankap.wordpress.com
- LinkedIn: linkedin.com/in/asankapadmakumara

3 Where am I from?

4 Your feedback is important to me!

5 ETL 2.0?
- Not a real term or concept
- A new way of handling ETL
- Why the need for ETL 2.0?
  - Difficult to handle large data volumes
  - Unable to handle real-time data extraction
  - Difficult to handle unstructured data
  - Platform dependent
  - Dedicated hardware
  - Performance issues
  - Less flexibility

6 ETL Options in the Cloud
- Azure-SSIS Integration Runtime
- Azure Data Factory (Data Flow)
- HDInsight
- PolyBase
- Azure Data Lake Analytics
- Databricks
- Power BI Dataflows

7 Microsoft recommends Databricks for
- Advanced analytics on big data
- Real-time analytics
- Modern data warehousing

8 Microsoft recommends Databricks for Advanced Analytics on Big Data

9 Microsoft recommends Databricks for Real-time Analytics

10 Microsoft recommends Databricks for Modern data warehousing

11 What is Azure Databricks?
- Apache Spark-based analytics platform on Azure
- Designed with the founders of Spark
- Fully managed Spark clusters on Azure

12 Azure Databricks: Main Features
- One-click deployment
- Auto scaling and auto termination
- Optimized connectors to Azure storage platforms
- Azure AD integration
- Enterprise-grade security

13 Azure Databricks Ecosystem

14 Apache Spark
- Distributed data processing engine
- In-memory data processing
- Supports Java, Python, Scala, R, and SQL
- Used in:
  - Data integration
  - Machine learning
  - Stream processing
  - Interactive analytics

15 Demo 1: Walkthrough of Azure Databricks

16 Data Engineering using Databricks

17 1. DataFrame API
- Untyped API: columns, rows
- Same concept as a SQL table or an Excel spreadsheet
- Immutable
- Partitioned across multiple nodes

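18 1. DataFrame API
    from pyspark.sql.functions import desc
    # Read the CSV with a header row and let Spark infer the column types.
    flightData2015 = (
        spark.read
        .option("inferSchema", "true")
        .option("header", "true")
        .csv("/data/flight-data/csv/2015-summary.csv")
    )
    # Top 5 destination countries by total flight count.
    (flightData2015
        .groupBy("DEST_COUNTRY_NAME")
        .sum("count")
        .withColumnRenamed("sum(count)", "destination_total")
        .sort(desc("destination_total"))
        .limit(5)
        .show())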

19 2. Datasets API
- Strongly typed collection (class)
- Not available in Python and R
- Slightly slower than DataFrames
- Allows lambda functions
- Use when type safety is needed
- Use when DataFrames do not support the required operations
    import spark.implicits._  // provides the encoder needed by .as[Flight]
    case class Flight(
      DEST_COUNTRY_NAME: String,
      ORIGIN_COUNTRY_NAME: String,
      count: BigInt
    )
    val flightsDF = spark.read.parquet("/data/flight-data/parquet/2010-summary.parquet/")
    // Convert the untyped DataFrame into a typed Dataset[Flight].
    val flights = flightsDF.as[Flight]

20 3. SQL API
- Allows executing SQL queries against big data
- ANSI SQL:2003 standard
- Uses the Hive metastore to maintain tables
- Spark SQL is compatible with most of HiveQL

21 Comparing the APIs

22 Transformations and Actions
- Transformation
  - Instructions to modify a DataFrame
  - Creates a new DataFrame
  - Examples: filter, where, join
- Action
  - Event that triggers the transformations
  - 3 main kinds of actions: view data, collect data, write data
  - Examples: count, collect, show, take(n)

23 Which language to choose?
- Scala
  - Native language of Spark
  - Faster compared to other languages
  - Difficult syntax
  - Less community help
- Python
  - One of the most used languages
  - Better community
  - Good set of libraries such as PySpark
  - Comparatively slower than Scala (in the case of large data)
- R
  - Lots of data science libraries
  - Databricks supports RStudio
- SQL
  - Allows SQL against big data
  - ANSI SQL:2003
  - Uses the Hive metastore to maintain tables
  - Tools like Power BI and Tableau can use JDBC to connect to SQL tables

24 Demo 2: Read and Transform Data
[Diagram labels: E, T, L; Invoice, Transaction, Fact Invoice, People, Dimension]

25 Security
- Authentication and authorization via Azure AD-based security
- Data security
  - Table-level security
  - Provides read, write, and modify permissions
  - On databases, tables, views, and functions
- Compute security
  - Cluster-level security
  - Levels: No Permissions, Can Attach To, Can Restart, Can Manage
- Code security
  - Workspace-level security
  - Notebook-level security
  - Levels: Can Read, Can Run, Can Edit, Can Manage

26 Source Control and Scheduling
- Source control
  - Built-in basic version control
  - Notebook revisions
  - Select and restore any previously saved version of a notebook
  - Add a comment to a version
  - Supports GitHub, Bitbucket Cloud, and Azure DevOps
- Scheduling
  - A job can run a notebook, execute a JAR, or run spark-submit
  - Schedule intervals range from monthly down to every minute
  - Can send an email on job start, job success, and job failure

27 Your feedback is important to me!

28 Q & A

29 More about this topic

30 Distributed computing? Why not MapReduce?
- Large batch processing, but not real-time streaming
- Disk-based, not in-memory
- Comprehensive, but not easy to use
- Slow at sharing data between operations
- Suitable if size matters, but not speed
- Iterative operations are painful
- So the UC Berkeley AMPLab team created Spark -> contributed it to Apache -> founded the company Databricks

31 Extract
- Default source types: CSV, JSON, Parquet, ORC, JDBC/ODBC connections, plain-text files
- Hundreds of connectors by the community and Microsoft: Azure Cosmos DB, Azure SQL Data Warehouse, MongoDB, Cassandra, etc.
- Supports parallel reading (depending on the source)
- Small files can be uploaded directly to DBFS
- DBFS
  - A distributed file system on Databricks clusters
  - Files persist to Azure Blob Storage
  - Can mount Azure Blob Storage and Azure Data Lake Store Gen1

32 Transform
- PySpark SQL library (Python)
- Spark SQL (Scala, SQL)
- Python or Scala user-defined functions (UDFs)
  - Define your own functions, limited to the session
  - Execute row by row over a DataFrame
  - Scala outperforms Python
- Custom libraries written in Python, Java, Scala, and R

33 Load
- Save modes: append, overwrite, errorifexists, ignore
- Does not support update
- If the destination does not support truncate, the table or file is recreated
- Supports parallel writing (depending on the destination)
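The four save modes on the Load slide can be illustrated with a small pure-Python model. This is only a sketch of the semantics, not Spark code: the `save` function, the in-memory `store` dictionary, and the `fact_invoice` name are hypothetical stand-ins for a writer and its destination. In PySpark the mode is normally selected on the writer, for example with `df.write.mode("append")`.

```python
from typing import Dict, List


def save(store: Dict[str, List[dict]], name: str, rows: List[dict],
         mode: str = "errorifexists") -> None:
    """Toy model of save-mode semantics against an in-memory 'destination'."""
    exists = name in store
    if mode == "append":
        # Add the new rows to whatever is already there.
        store.setdefault(name, []).extend(rows)
    elif mode == "overwrite":
        # Replace the existing contents entirely.
        store[name] = list(rows)
    elif mode == "ignore":
        # Write only if the destination does not exist yet; otherwise do nothing.
        if not exists:
            store[name] = list(rows)
    elif mode == "errorifexists":
        # Refuse to touch an existing destination.
        if exists:
            raise FileExistsError(f"{name} already exists")
        store[name] = list(rows)
    else:
        raise ValueError(f"unknown save mode: {mode}")


store: Dict[str, List[dict]] = {}
save(store, "fact_invoice", [{"id": 1}])                    # first write succeeds
save(store, "fact_invoice", [{"id": 2}], mode="append")     # now holds ids 1 and 2
save(store, "fact_invoice", [{"id": 3}], mode="ignore")     # no-op, target exists
after_ignore = list(store["fact_invoice"])
save(store, "fact_invoice", [{"id": 4}], mode="overwrite")  # now holds only id 4

try:
    save(store, "fact_invoice", [{"id": 5}])                # default mode
    raised = False
except FileExistsError:
    raised = True
```

None of the modes modifies individual existing rows, which matches the slide's note that update is not supported.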

34 DataFrame API
- Untyped API: columns, rows
- Same concept as a SQL table or an Excel spreadsheet
- Immutable
- Distributed across multiple nodes
- Partitions break a DataFrame up across the cluster of machines
- The schema can be defined manually or taken from the source
- Great for data scientists who have worked with Python pandas or R data frames
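To make the idea of partitions concrete, here is a minimal pure-Python sketch of a grouped sum computed partition by partition and then merged. It is a toy model, not how Spark is implemented: the helper names, the round-robin split, and the sample rows are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List


def split_into_partitions(rows: List[dict], num_partitions: int) -> List[List[dict]]:
    """Deal rows out round-robin (an assumed scheme, chosen for simplicity)."""
    parts: List[List[dict]] = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        parts[i % num_partitions].append(row)
    return parts


def partial_sum(partition: List[dict], key: str, value: str) -> Counter:
    """Sum `value` per `key` within one partition, as one worker would."""
    totals: Counter = Counter()
    for row in partition:
        totals[row[key]] += row[value]
    return totals


def merge(partials: List[Counter]) -> Dict[str, int]:
    """Combine the per-partition results into the final grouped totals."""
    total: Counter = Counter()
    for p in partials:
        total.update(p)  # Counter.update adds counts key by key
    return dict(total)


rows = [
    {"dest": "X", "count": 10},
    {"dest": "Y", "count": 5},
    {"dest": "X", "count": 7},
    {"dest": "Z", "count": 1},
    {"dest": "Y", "count": 20},
]

partitions = split_into_partitions(rows, num_partitions=2)
totals = merge([partial_sum(p, "dest", "count") for p in partitions])
# Mirror the sort-descending-then-limit step of the earlier DataFrame example.
top2 = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:2]
```

Each partition can be processed independently, which is what lets the work spread across the nodes of a cluster.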

35 SQL API
- Allows executing SQL queries against big data
- ANSI SQL:2003 standard
- Uses the Hive metastore to maintain tables
- Spark SQL is compatible with most of HiveQL
- Object types: Database, Table, View, Global Table, Managed Table, Unmanaged Table, Local Table, Local Temp View, Global Temp View
- Familiar to BI analysts

36 Transformations and Actions
- Transformation
  - Instructions to modify a DataFrame
  - Creates a new DataFrame
  - Narrow transformations and wide transformations
  - Lazy evaluation: transformations build a logical plan, and the plan executes only when an action runs
  - Examples: filter, where, join
- Action
  - Event that triggers the transformations
  - 3 main kinds of actions: view data, collect data, write data
  - Examples: count, collect, show, take(n)
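Lazy evaluation can be sketched with a toy class in plain Python. This is not Spark's API: `LazyFrame`, `with_column`, and the sample rows are hypothetical names used only to show that transformations merely record a plan, while an action executes it.

```python
from typing import Callable, Iterable, List


class LazyFrame:
    """Toy stand-in for a DataFrame: immutable, with a recorded plan of steps."""

    def __init__(self, rows: Iterable[dict], plan: tuple = ()):
        self._rows = list(rows)
        self._plan = plan  # transformations recorded so far, not yet executed

    # Transformations: return a NEW LazyFrame and run nothing.
    def filter(self, predicate: Callable[[dict], bool]) -> "LazyFrame":
        return LazyFrame(self._rows, self._plan + (("filter", predicate),))

    def with_column(self, name: str, fn: Callable[[dict], object]) -> "LazyFrame":
        return LazyFrame(self._rows, self._plan + (("with_column", (name, fn)),))

    # Actions: execute the whole recorded plan.
    def collect(self) -> List[dict]:
        rows = [dict(r) for r in self._rows]  # copy so the source rows stay untouched
        for kind, arg in self._plan:
            if kind == "filter":
                rows = [r for r in rows if arg(r)]
            else:
                name, fn = arg
                rows = [{**r, name: fn(r)} for r in rows]
        return rows

    def count(self) -> int:
        return len(self.collect())


calls: List[str] = []  # records when the predicate actually runs


def is_large(row: dict) -> bool:
    calls.append(row["country"])
    return row["count"] > 100


flights = LazyFrame([
    {"country": "A", "count": 50},
    {"country": "B", "count": 300},
    {"country": "C", "count": 150},
])

# Two transformations: only the plan grows, the predicate is not called.
large = flights.filter(is_large).with_column("double", lambda r: r["count"] * 2)
calls_before_action = len(calls)

# The action triggers execution of the recorded plan.
result = large.collect()
```

Here `flights` is never modified; each transformation returns a new object, mirroring the "creates a new DataFrame" point on the slide.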
