Data Warehousing ETL. Esteban Zimányi Slides by Toon Calders

Similar documents
Data Warehousing Conclusion. Esteban Zimányi Slides by Toon Calders

Decision Support Systems aka Analytical Systems

Data Warehouse and Data Mining

CHAPTER 3 Implementation of Data warehouse in Data Mining

Data Warehouse and Data Mining

Sql Fact Constellation Schema In Data Warehouse With Example

Exam Datawarehousing INFOH419 July 2013

DATA WAREHOUSE EGCO321 DATABASE SYSTEMS KANAT POOLSAWASD DEPARTMENT OF COMPUTER ENGINEERING MAHIDOL UNIVERSITY

ETL and OLAP Systems

Data Mining Concepts & Techniques

Advanced Data Management Technologies Written Exam

Data Warehousing Introduction. Toon Calders

Adnan YAZICI Computer Engineering Department

Database design View Access patterns Need for separate data warehouse:- A multidimensional data model:-

1. Analytical queries on the dimensionally modeled database can be significantly simpler to create than on the equivalent nondimensional database.

An Overview of Data Warehousing and OLAP Technology

Data Warehousing Introduction. Esteban Zimanyi Slides from Toon Calders

collection of data that is used primarily in organizational decision making.

Information Management course

Data Warehouse and Data Mining

Call: SAS BI Course Content:35-40hours

Fig 1.2: Relationship between DW, ODS and OLTP Systems

Data Warehouse and Data Mining

Data Warehouses Chapter 12. Class 10: Data Warehouses 1

BUSINESS INTELLIGENCE. SSAS - SQL Server Analysis Services. Business Informatics Degree

Oracle Database 11g: Data Warehousing Fundamentals

UNIT -1 UNIT -II. Q. 4 Why is entity-relationship modeling technique not suitable for the data warehouse? How is dimensional modeling different?

Data Warehousing and OLAP

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 8 - Data Warehousing and Column Stores

Logical design DATA WAREHOUSE: DESIGN Logical design. We address the relational model (ROLAP)

Data Warehouses. Yanlei Diao. Slides Courtesy of R. Ramakrishnan and J. Gehrke

Evolution of Database Systems

After completing this course, participants will be able to:

Decision Support, Data Warehousing, and OLAP

SQL Server Analysis Services

On-Line Analytical Processing (OLAP) Traditional OLTP

Summary of Last Chapter. Course Content. Chapter 2 Objectives. Data Warehouse and OLAP Outline. Incentive for a Data Warehouse

What is a Data Warehouse?

Syllabus. Syllabus. Motivation Decision Support. Syllabus

Data Warehousing and Data Mining

Data Mining. Data warehousing. Hamid Beigy. Sharif University of Technology. Fall 1396

Improving the Performance of OLAP Queries Using Families of Statistics Trees

Deccansoft Software Services Microsoft Silver Learning Partner. SSAS Syllabus

Data Mining. Data warehousing. Hamid Beigy. Sharif University of Technology. Fall 1394

Data Warehouses and OLAP. Database and Information Systems. Data Warehouses and OLAP. Data Warehouses and OLAP

Call: Datastage 8.5 Course Content:35-40hours Course Outline

Oracle 1Z0-515 Exam Questions & Answers

Aggregating Knowledge in a Data Warehouse and Multidimensional Analysis

Data warehouse design

Data warehouse design

STRATEGIC INFORMATION SYSTEMS IV STV401T / B BTIP05 / BTIX05 - BTECH DEPARTMENT OF INFORMATICS. By: Dr. Tendani J. Lavhengwa

A Novel Approach of Data Warehouse OLTP and OLAP Technology for Supporting Management prospective

Data Mining and Data Warehousing Introduction to Data Mining

REPORTING AND QUERY TOOLS AND APPLICATIONS

Processing of Very Large Data

CS 245: Database System Principles. Warehousing. Outline. What is a Warehouse? What is a Warehouse? Notes 13: Data Warehousing

Data Mining. Data warehousing. Hamid Beigy. Sharif University of Technology. Fall 1394

Dr.G.R.Damodaran College of Science

Chapter 3. The Multidimensional Model: Basic Concepts. Introduction. The multidimensional model. The multidimensional model

Data Warehousing and Decision Support

DATA WAREHOUING UNIT I

DATA MINING AND WAREHOUSING

CSE 544 Principles of Database Management Systems. Fall 2016 Lecture 14 - Data Warehousing and Column Stores

Data Warehousing. Ritham Vashisht, Sukhdeep Kaur and Shobti Saini

Data Warehouse Logical Design. Letizia Tanca Politecnico di Milano (with the kind support of Rosalba Rossato)

On-Line Application Processing

Data Warehousing and Decision Support. Introduction. Three Complementary Trends. [R&G] Chapter 23, Part A

02 Hr/week. Theory Marks. Internal assessment. Avg. of 2 Tests

Advanced Data Management Technologies

Information Management course

Data Warehousing & Mining. Data integration. OLTP versus OLAP. CPS 116 Introduction to Database Systems

CT75 DATA WAREHOUSING AND DATA MINING DEC 2015

1 Dulcian, Inc., 2001 All rights reserved. Oracle9i Data Warehouse Review. Agenda

ETL TESTING TRAINING

Business Intelligence Roadmap HDT923 Three Days

Chapter 4, Data Warehouse and OLAP Operations

Data Mining: Exploring Data. Lecture Notes for Chapter 3

CHAPTER 8 DECISION SUPPORT V2 ADVANCED DATABASE SYSTEMS. Assist. Prof. Dr. Volkan TUNALI

DW Performance Optimization (II)

Data Warehouse and Mining

Warehousing. Data Mining

CS614 - Data Warehousing - Midterm Papers Solved MCQ(S) (1 TO 22 Lectures)

Data Warehousing and Decision Support

Exam /Course 20767B: Implementing a SQL Data Warehouse

This tutorial will help computer science graduates to understand the basic-to-advanced concepts related to data warehousing.

Full file at

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

Data Warehousing and Data Mining. Announcements (December 1) Data integration. CPS 116 Introduction to Database Systems

Data Mining: Exploring Data. Lecture Notes for Data Exploration Chapter. Introduction to Data Mining

Introduction to Data Warehousing

CS377: Database Systems Data Warehouse and Data Mining. Li Xiong Department of Mathematics and Computer Science Emory University

SQL Server 2005 Analysis Services

Advanced Data Management Technologies

Decision Support Systems

A Star Schema Has One To Many Relationship Between A Dimension And Fact Table

Data Analysis and Data Science

Teradata Aggregate Designer

1. Attempt any two of the following: 10 a. State and justify the characteristics of a Data Warehouse with suitable examples.

Data Warehousing 2. ICS 421 Spring Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa

Introduction to DWH / BI Concepts

Transcription:

Data Warehousing ETL Esteban Zimányi ezimanyi@ulb.ac.be Slides by Toon Calders 1

Overview Picture other sources Metadata Monitor & Integrator OLAP Server Analysis Operational DBs Extract Transform Load Refresh Data Warehouse Serve Query/Reporting Data Mining Data Marts ROLAP Server Data Sources Data Storage OLAP Engine Front-End Tools

Until now other sources Metadata Monitor & Integrator OLAP Server Analysis Operational DBs Extract Transform Load Refresh Data Warehouse Serve Query/Reporting Data Mining Data Sources Conceptual modeling (DFM) Data Marts Logical model (star, snowflake) Indices; view materialization Data Storage ROLAP Server Data cube; slice & dice; cube browsing OLAP Engine Front-End Tools

This lecture other sources Metadata Monitor & Integrator OLAP Server Analysis Operational DBs Extract Transform Load Refresh Data Warehouse Serve Query/Reporting Data Mining Data Sources How to fill the DW? How keep it up to date? Conceptual modeling (DFM) Data Marts Logical model (star, snowflake) Indices; view materialization Data Storage ROLAP Server Data cube; slice & dice; cube browsing OLAP Engine Front-End Tools

Summary Data Warehouse architectures Single-, two-, and three-layer architectures ETL process Extract Transform/Cleanse Load Sec. 1.3, 1.4; Ch. 10 of Golfarelli and Rizzi book 5

Single Layer Architecture Operational data Source layer Data warehouse layer Middleware Analysis Reporting tool OLAP tool 6

Single Layer Architecture Not frequently used Reconciled data is not materialized Middleware makes heterogeneity transparent to the user + No data duplication - No separation of analytical and transactional processing 7

Two-Layer Architecture Operational data Source layer Data staging Data warehouse layer ETL tools Data Warehouse Data marts Meta-data Analysis Reporting tool OLAP tool 8

Two-Layered Architecture Introduction of data warehouse layer Data is materialized Historical data management ETL runs regularly to keep data warehouse upto-date - Data duplication + Separation of analytics and operations 9

Source layer Operational data Three-Layer Architecture Data staging Reconciled layer Loading Data warehouse layer ETL tools Reconciled data ETL tools Data Warehouse Data marts Meta-data Analysis Reporting tool OLAP tool 10

Three-Layer Architecture Data from different sources needs to be reconciled Different schemas need to be integrated Source data needs to be cleansed In 3-tier architecture the reconciled data is materialized as well Useful in its own respect Makes ETL more transparent Neither historical, nor dimensional 11

Data Staging Often the data staging area is materialized as well Store helper tables Take burden away from source systems 12

Summary Data Warehouse architectures Single-, two-, and three-layer architectures ETL process Extract Transform/Cleanse Load 13

Extract Relevant data obtained from data sources First load: extract all data Subsequent loads: extract changes In 3-Tier infrastructure: From source data to reconciled database Schema integration, transformation, cleansing From reconciled database to data warehouse De-normalization, introduction of surrogate keys Calculation of derived data 14

Extract Static vs incremental extraction Incremental & immediate Application-assisted Trigger-based extraction Log-based Incremental & delayed Timestamp-based File Comparison 15

ETL tools Application-Assisted Update of data warehouse deeply integrated into the application software Immediate Requires adaptation of existing applications Hard to maintain Application DBMS Database List of Updates Data warehouse 16

ETL tools Trigger-Based Closely related to application-based Triggers to store updates Huge performance hit for database More transparent for application layer Only possible for data maintained in database Application DBMS trigger DB Delta tables Extract Delta Tables Data warehouse 17

ETL tools Log-Based Many database systems keep change logs To ensure durability; in-between checkpoints all transactions to the database are logged Use these change logs for updating data warehouse Transparent to the user Application DBMS Database DB log Data warehouse 18

ETL tools Timestamp-Based Operational database can be changed to keep whole history Different levels of timestamps Timestamp per tuple Temporal database Application DBMS Temporal Database Timestamp Based extraction Data warehouse 19

Timestamp-Based Only last update date may lose information Customer Gender Age Segment Date John M 36 1 D1 Capture D1 Karl M 52 2 D1 Betty F 18 1 D1 20

Timestamp-Based Only last update date may lose information Customer Gender Age Segment Date John M 36 1 D1 Capture D1 Karl M 52 2 D1 Betty F 18 1 D1 Customer Gender Age Segment Date John M 36 2 D2 Betty F 18 1 D1 D2 21

Timestamp-Based Only last update date may lose information Customer Gender Age Segment Date John M 36 1 D1 Capture D1 Karl M 52 2 D1 Betty F 18 1 D1 Customer Gender Age Segment Date John M 36 2 D2 Betty F 18 1 D1 Customer Gender Age Segment Date John M 36 3 D3 Betty F 18 1 D1 Martin M 64 2 D3 D2 D3 22

Timestamp-Based Only last update date may lose information Customer Gender Age Segment Date John M 36 1 D1 Capture D1 Karl M 52 2 D1 Betty F 18 1 D1 Karl? John s segment? Capture Customer Gender Age Segment Date John M 36 2 D2 Betty F 18 1 D1 Customer Gender Age Segment Date John M 36 3 D3 Betty F 18 1 D1 Martin M 64 2 D3 D2 D3 23

Timestamp-Based Based on the application field and update frequency, one timestamp per tuple may be acceptable Type-2 change for attributes that change several times between updates?! If more fine-grained behavior is required we can go to full temporal database Register valid time (start-end) 24

Compare ETL tools File Comparison Easy; works even for flat files Does not capture intermediate changes Application DBMS 1. Database 2. Changes Data warehouse Previous Copy 25

Method Extraction Methods - Overview Maintain full history? Impact on performance operational level Complexity extraction procedures DBMS dependent Requires changes to application layer Maintenance Static No No Low No None Easy Application assisted Trigger based Yes Medium High No Many Difficult Yes Medium Medium Yes None Medium Log based Yes No Low Limited None Easy Timestamp based File comparison (yes) Some Low Limited Some Medium Not all No Medium No None Easy 26

Summary Data Warehouse architectures Single, two-, and three-layer architectures ETL process Extract Transform/Cleanse Load 27

Transform/Cleanse Conversion Different date types Enrichment Combine information of different attributes to create a new one Joining data without key; Entity resolution Correcting errors De-duplication 28

De-duplication & Entity Resolution De-duplication Recognize duplicate entries Entity resolution Recognize if two entities from different sources are actually referring to the same object Not always obvious if there is no key shared between different data sources In most extreme case need for methods to directly compare strings 29

Example: Cleansing customer data John White Downing St. 10 TW1A 2AA London (UK) Fname: John Sname: White Adress: Downing St. 10 ZIP: TW1A 2AA normalizing City: London Country: UK Fname: John Sname: White Adress: 10, Downing St. standardizing ZIP: TW1A 2AA City: London Fname: John Country: United Kingdom Sname: White Adress: 10, Downing St. ZIP: SW1A 2AA correcting City: London Country: United Kingdom Example from Golfarelli and Rizzi book 30

Comparing Strings How to compare strings? Patrick Smith vs. Patrik Smyth Popular distance measure: edit distance Insert, Delete, Overwrite Shortest sequence of operations to turn one string into another Patrick Smith D Patrik Smith R Patrik Smyth 31

Edit Distance Relatively easy to compute via dynamic programming Smyth vs Simte d(s,s) d(smyt,si) S m y t h 0 1 2 3 4 5 S 1 0 1 2 3 4 i 2 1 1 2 3 4 m 3 2 1 2 3 4 t 4 3 2 2 2 3 e 5 4 3 3 3 3 d(smyt,simt) d(smyth,simte) 32

Edit Distance Compute content of a cell as follows: Example: d(smyth,simte): d(smyth,simt) + insert e at the end or delete h + d(smyt,simte) or d(smyt,simt) + overwrite h with e 33

Edit Distance Compute content of a cell as follows: Example: d(smyth,simte) = min( d(smyth,simt) + 1 d(smyt,simte) + 1, d(smyt,simt) + 1 ) 34

Edit Distance Compute content of a cell as follows: Example: d(smyt,simt): d(smyt,sim) + insert t at the end or delete t + d(smy,simt) or d(smy,sim) 35

Edit Distance Compute content of a cell as follows: Example: d(smyt,simt) = min( d(smyt,sim) + 1 d(smy,simt) + 1, d(smy,sim) + 0 ) 36

Edit Distance Compute content of a cell as follows: if string 1 [k] = string 2 [l] M[k,l] = min(m[k-1,l-1], M[k-1,l]+1, M[k,l-1]+1) else: M[k,l] = min(m[k-1,l-1]+1, M[k-1,l]+1, M[k,l-1]+1) 37

Summary Data Warehousing architectures Single, two-, and three-layer architectures ETL process Extract Transform/Cleanse Load 38

Populating Dimension Tables Type 1 and type 2 changes May require dimension table to be loaded in the data staging area Introduction of surrogate keys Permanently maintain mapping tables in the staging area In case of a type-1 change: change the value in the dimension table In case of a type-2 change: introduce a new tuple with a new surrogate key 39

Populating Fact Tables Always update dimension tables before fact table (if possible) Referential integrity Not always possible: introduce dummy value in dimension tables if dimension key lookup fails Update materialized views Can be optimized if measures are distributive 40

Computation of Aggregations Part of the data warehouse loading involves computation of aggregations materialized views Order in which they are computed can be important Sort-based or Hash-based See, e.g.: Agarwal et al. On the computation of multidimensional aggregates. VLDB 1996 41

Sort-Based Aggregation A B C count 1 5 6 8 2 1 4 9 1 8 6 10 1 5 6 6 3 3 3 5 2 1 4 8 SELECT A, B, C, sum(count) FROM R GROUP BY A, B, C; 42

Sort-Based Aggregation A B C count 1 5 6 8 2 6 6 9 1 8 6 10 SORT 1 7 5 6 3 3 3 5 2 1 4 8 SELECT A, B, C, sum(count) FROM R GROUP BY A, B, C; 43

Sort-Based Aggregation A B C count 1 5 6 8 1 5 6 6 1 8 6 10 2 1 4 8 2 1 4 9 3 3 3 5 SELECT A, B, C, sum(count) FROM R GROUP BY A, B, C; SCAN 44

Sort-Based Aggregation A B C count 1 5 6 8 1 5 6 6 1 8 6 10 2 1 4 8 2 1 4 9 3 3 3 5 SELECT A, B, C, sum(count) FROM R GROUP BY A, B, C; SCAN A B C SUM 1 5 6 8 45

Sort-Based Aggregation A B C count 1 5 6 8 1 5 6 6 1 8 6 10 2 1 4 8 2 1 4 9 3 3 3 5 SELECT A, B, C, sum(count) FROM R GROUP BY A, B, C; SCAN A B C SUM 1 5 6 14 46

Sort-Based Aggregation A B C count 1 5 6 8 1 5 6 6 1 8 6 10 2 1 4 8 2 1 4 9 3 3 3 5 SELECT A, B, C, sum(count) FROM R GROUP BY A, B, C; SCAN A B C SUM 1 5 6 14 1 8 6 10 47

Sort-Based Aggregation A B C count 1 5 6 8 1 5 6 6 1 8 6 10 2 1 4 8 2 1 4 9 3 3 3 5 SELECT A, B, C, sum(count) FROM R GROUP BY A, B, C; SCAN A B C SUM 1 5 6 14 1 8 6 10 2 1 4 8 48

Sort-Based Aggregation A B C count 1 5 6 8 1 5 6 6 1 8 6 10 2 1 4 8 2 1 4 9 3 3 3 5 SELECT A, B, C, sum(count) FROM R GROUP BY A, B, C; SCAN A B C SUM 1 5 6 14 1 8 6 10 2 1 4 17 49

Sort-Based Aggregation A B C count 1 5 6 8 1 5 6 6 1 8 6 10 2 1 4 8 2 1 4 9 3 3 3 5 SELECT A, B, C, sum(count) FROM R GROUP BY A, B, C; SCAN A B C SUM 1 5 6 14 1 8 6 10 2 1 4 17 3 3 3 5 50

Sort-Based Aggregation A B C count 1 5 6 8 1 5 6 6 1 8 6 10 2 1 4 8 2 1 4 9 3 3 3 5 SELECT A, B, C, sum(count) FROM R GROUP BY A, B, C; SCAN A B C SUM 1 5 6 14 1 8 6 10 2 1 4 17 3 3 3 5 51

Sort-Based Aggregation Observation: Table sorted on ABC = sorted on AB = sorted on A One sort supports 3 aggregations A B C count 1 5 6 8 1 5 6 6 1 8 6 10 2 1 4 8 2 1 4 9 3 3 3 5 52

Pipe-Sort Sort on ABC Scan sorted relation As long as next tuple is the same as previous Else Update aggregated tuple; add count of current Ship aggregated tuple to disk Pipe tuple to procedure computing aggregation on AB 53

Pipe-Sort Key problem: divide materialized views lattice into pipes, minimizing sorts ABCD BCD AB BC A B D () 54

Pipe-Sort Key problem: divide materialized views lattice into pipes, minimizing sorts sort BCDA BCD sort AB BC A B D sort () 55

Hash-Based Optimization If aggregate tables fit into memory A B C count 1 5 6 8 2 1 4 9 1 8 6 10 1 5 6 6 3 3 3 5 2 1 4 8 56

Hash-Based Optimization If aggregate tables fit into memory A B C count 1 5 6 8 2 1 4 9 1 8 6 10 1 5 6 6 SCAN 3 3 3 5 2 1 4 8 A B C SUM 1 5 6 8 57

Hash-Based Optimization If aggregate tables fit into memory A B C count 1 5 6 8 2 1 4 9 1 8 6 10 1 5 6 6 SCAN 3 3 3 5 2 1 4 8 A B C SUM 1 5 6 8 2 1 4 9 58

Hash-Based Optimization If aggregate tables fit into memory A B C count 1 5 6 8 2 1 4 9 1 8 6 10 1 5 6 6 SCAN 3 3 3 5 2 1 4 8 A B C SUM 1 5 6 8 2 1 4 9 1 8 6 10 59

Hash-Based Optimization If aggregate tables fit into memory A B C count 1 5 6 8 2 1 4 9 1 8 6 10 1 5 6 6 SCAN 3 3 3 5 2 1 4 8 A B C SUM 1 5 6 14 2 1 4 9 1 8 6 10 60

Hash-Based Optimization If aggregate tables fit into memory A B C count 1 5 6 8 2 1 4 9 1 8 6 10 1 5 6 6 SCAN 3 3 3 5 2 1 4 8 A B C SUM 1 5 6 14 2 1 4 9 1 8 6 10 3 3 3 5 61

Hash-Based Optimization If aggregate tables fit into memory A B C count 1 5 6 8 2 1 4 9 1 8 6 10 1 5 6 6 SCAN 3 3 3 5 2 1 4 8 A B C SUM 1 5 6 14 2 1 4 17 1 8 6 10 3 3 3 5 62

Hash-Based Optimization Multiple hash tables may fit at the same time A B C count A B C SUM 1 5 6 8 1 5 6 14 2 1 4 9 2 1 4 17 1 8 6 10 1 5 6 6 3 3 3 5 2 1 4 8 1 8 6 10 3 3 3 5 A C SUM 1 6 24 C SUM 2 4 17 3 3 5 6 24 4 17 3 5 63

Conclusion ETL is often the most time consuming parts of a datawarehousing project Reported to take up to 80% of time Different ways to capture changes Application-assisted Trigger-based extraction Log-based Timestamp-based File Comparison 64

Conclusion Transform/Cleanse involves Normalization Standardization De-duplication Entity resolution Load Surrogate keys (lookup tables in staging area) First update dimension tables, only then fact table 65