Data Warehousing: ETL
Esteban Zimányi (ezimanyi@ulb.ac.be)
Slides by Toon Calders
Overview
[Architecture diagram: Data Sources (Operational DBs, other sources) → Extract, Transform, Load, Refresh → Data Storage (Data Warehouse, Data Marts, Metadata, Monitor & Integrator) → OLAP Engine (OLAP Server, ROLAP Server) → Front-End Tools (Query/Reporting, Analysis, Data Mining)]

Until now
Data Sources and Data Storage: conceptual modeling (DFM); logical models (star, snowflake); indices; view materialization
OLAP Engine and Front-End Tools: data cube; slice & dice; cube browsing

This lecture
How to fill the DW? How to keep it up to date?
Summary
Data Warehouse architectures: single-, two-, and three-layer architectures
ETL process: Extract, Transform/Cleanse, Load
Sec. 1.3, 1.4; Ch. 10 of Golfarelli and Rizzi book
Single-Layer Architecture
[Diagram: Operational data (Source layer) → Middleware (Data warehouse layer) → Analysis, Reporting tool, OLAP tool]
Single-Layer Architecture
Not frequently used; reconciled data is not materialized
Middleware makes heterogeneity transparent to the user
+ No data duplication
- No separation of analytical and transactional processing
Two-Layer Architecture
[Diagram: Operational data (Source layer) → ETL tools → Data staging → Data Warehouse, Data marts, Meta-data (Data warehouse layer) → Analysis, Reporting tool, OLAP tool]
Two-Layer Architecture
Introduction of a data warehouse layer: data is materialized; historical data management
ETL runs regularly to keep the data warehouse up to date
- Data duplication
+ Separation of analytics and operations
Three-Layer Architecture
[Diagram: Operational data (Source layer) → ETL tools → Data staging → Reconciled data (Reconciled layer) → Loading → Data Warehouse, Data marts, Meta-data (Data warehouse layer) → Analysis, Reporting tool, OLAP tool]
Three-Layer Architecture
Data from different sources needs to be reconciled: different schemas need to be integrated, and source data needs to be cleansed
In the three-layer architecture the reconciled data is materialized as well
Useful in its own respect; makes ETL more transparent
The reconciled data is neither historical nor dimensional
Data Staging
Often the data staging area is materialized as well: store helper tables; take burden away from the source systems
Summary
Data Warehouse architectures: single-, two-, and three-layer architectures
ETL process: Extract, Transform/Cleanse, Load
Extract
Relevant data is obtained from the data sources
First load: extract all data; subsequent loads: extract only the changes
In a three-tier infrastructure:
From source data to reconciled database: schema integration, transformation, cleansing
From reconciled database to data warehouse: de-normalization, introduction of surrogate keys, calculation of derived data
Extract
Static vs. incremental extraction
Incremental & immediate: application-assisted; trigger-based; log-based
Incremental & delayed: timestamp-based; file comparison
Application-Assisted
Updates to the data warehouse are deeply integrated into the application software
Immediate
Requires adaptation of existing applications; hard to maintain
[Diagram: Application → DBMS → Database; the application also writes a List of Updates → ETL tools → Data warehouse]
Trigger-Based
Closely related to application-assisted: triggers store the updates in delta tables
Huge performance hit for the database, but more transparent for the application layer
Only possible for data maintained in the database
[Diagram: Application → DBMS with trigger → DB and Delta tables → Extract Delta Tables → ETL tools → Data warehouse]
Log-Based
Many database systems keep change logs to ensure durability; between checkpoints all transactions to the database are logged
Use these change logs for updating the data warehouse
Transparent to the user
[Diagram: Application → DBMS → Database and DB log → ETL tools → Data warehouse]
Timestamp-Based
The operational database can be changed to keep the whole history
Different levels of timestamps: a timestamp per tuple, or a full temporal database
[Diagram: Application → DBMS → Temporal Database → Timestamp-based extraction → ETL tools → Data warehouse]
Timestamp-Based
Keeping only the last update date may lose information

Capture at D1:
Customer  Gender  Age  Segment  Date
John      M       36   1        D1
Karl      M       52   2        D1
Betty     F       18   1        D1

Capture at D2:
Customer  Gender  Age  Segment  Date
John      M       36   2        D2
Betty     F       18   1        D1

Capture at D3:
Customer  Gender  Age  Segment  Date
John      M       36   3        D3
Betty     F       18   1        D1
Martin    M       64   2        D3

What happened to Karl? What was John's segment between captures?
Timestamp-Based
Depending on the application field and update frequency, one timestamp per tuple may be acceptable
But what about a type-2 change for attributes that change several times between extractions?!
If more fine-grained behavior is required, move to a full temporal database: register valid time (start-end)
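The per-tuple timestamp approach can be sketched as follows. This is a minimal illustration, not slide material: the table layout and the names `extract_changes` and `last_update` are assumptions. Note how it exhibits exactly the weaknesses discussed above.

```python
from datetime import datetime

def extract_changes(rows, last_extract_time):
    """Timestamp-based incremental extraction (sketch).

    Returns only the tuples updated after the previous extraction run.
    Intermediate versions overwritten between two runs are lost, and
    deletions are not detected at all (cf. Karl in the example above).
    """
    return [row for row in rows if row["last_update"] > last_extract_time]

# Hypothetical source table with one timestamp per tuple
customers = [
    {"name": "John",  "segment": 3, "last_update": datetime(2024, 3, 1)},
    {"name": "Betty", "segment": 1, "last_update": datetime(2024, 1, 1)},
]

# Extract everything changed since February 1st: only John qualifies
delta = extract_changes(customers, datetime(2024, 2, 1))
```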
File Comparison
Compare the current state of the data against a previously stored copy
Easy; works even for flat files
Does not capture intermediate changes
[Diagram: Application → DBMS → 1. Database, compared against Previous Copy → 2. Changes → ETL tools → Data warehouse]
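The comparison step can be sketched as a diff of two snapshots keyed by a primary key. All names here (`snapshot_diff`, the customer tuples) are illustrative, not from the slides.

```python
def snapshot_diff(previous, current):
    """File-comparison extraction (sketch): diff two snapshots, each a
    dict mapping primary key -> attribute tuple. Unlike timestamp-based
    extraction, deletions are detected, but intermediate changes between
    the two snapshots are still lost."""
    inserted = [k for k in current if k not in previous]
    deleted  = [k for k in previous if k not in current]
    updated  = [k for k in current
                if k in previous and current[k] != previous[k]]
    return inserted, updated, deleted

prev = {"John": ("M", 36, 1), "Karl": ("M", 52, 2)}
curr = {"John": ("M", 36, 2), "Betty": ("F", 18, 1)}

ins, upd, dele = snapshot_diff(prev, curr)
# Betty was inserted, John's segment changed, Karl was deleted
```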
Extraction Methods - Overview

Method               | Full history? | Perf. impact on source | Complexity of extraction | DBMS dependent | Changes to application layer | Maintenance
Static               | No            | No                     | Low                      | No             | None                         | Easy
Application-assisted | Yes           | Medium                 | High                     | No             | Many                         | Difficult
Trigger-based        | Yes           | Medium                 | Medium                   | Yes            | None                         | Medium
Log-based            | Yes           | No                     | Low                      | Limited        | None                         | Easy
Timestamp-based      | (yes)         | Some                   | Low                      | Limited        | Some                         | Medium
File comparison      | Not all       | No                     | Medium                   | No             | None                         | Easy
Summary
Data Warehouse architectures: single-, two-, and three-layer architectures
ETL process: Extract, Transform/Cleanse, Load
Transform/Cleanse
Conversion: different date types
Enrichment: combine information from different attributes to create a new one
Joining data without a shared key; entity resolution
Correcting errors; de-duplication
De-duplication & Entity Resolution
De-duplication: recognize duplicate entries
Entity resolution: recognize whether two entities from different sources actually refer to the same object
Not always obvious if no key is shared between the different data sources
In the most extreme case, methods are needed to directly compare strings
Example: Cleansing customer data
Input: John White, Downing St. 10, TW1A 2AA London (UK)
Normalizing → Fname: John; Sname: White; Address: Downing St. 10; ZIP: TW1A 2AA; City: London; Country: UK
Standardizing → Address: 10, Downing St.; Country: United Kingdom
Correcting → ZIP: SW1A 2AA
Example from Golfarelli and Rizzi book
Comparing Strings
How to compare strings? Patrick Smith vs. Patrik Smyth
Popular distance measure: edit distance, with operations insert, delete, overwrite
The distance is the length of the shortest sequence of operations turning one string into the other
Patrick Smith → (delete c) → Patrik Smith → (overwrite i with y) → Patrik Smyth: edit distance 2
Edit Distance
Relatively easy to compute via dynamic programming: smyth vs. simte

        S  m  y  t  h
     0  1  2  3  4  5
  S  1  0  1  2  3  4
  i  2  1  1  2  3  4
  m  3  2  1  2  3  4
  t  4  3  2  2  2  3
  e  5  4  3  3  3  3

Each cell holds the distance between two prefixes, e.g. d(s,s) = 0 at the top left, d(smyt,si) = 3, d(smyt,simt) = 2, and d(smyth,simte) = 3 at the bottom right
Edit Distance
Compute the content of a cell from its neighbors. Example for d(smyth,simte), take the cheapest of:
d(smyth,simt) + 1 (insert e at the end)
d(smyt,simte) + 1 (delete h)
d(smyt,simt) + 1 (overwrite h with e, since h ≠ e)
So d(smyth,simte) = min( d(smyth,simt) + 1, d(smyt,simte) + 1, d(smyt,simt) + 1 )

Edit Distance
Example for d(smyt,simt), take the cheapest of:
d(smyt,sim) + 1 (insert t at the end)
d(smy,simt) + 1 (delete t)
d(smy,sim) + 0 (last characters match: t = t, no operation needed)
So d(smyt,simt) = min( d(smyt,sim) + 1, d(smy,simt) + 1, d(smy,sim) + 0 )
Edit Distance
Compute the content of a cell as follows:
if string1[k] = string2[l]:
    M[k,l] = min( M[k-1,l-1], M[k-1,l] + 1, M[k,l-1] + 1 )
else:
    M[k,l] = min( M[k-1,l-1] + 1, M[k-1,l] + 1, M[k,l-1] + 1 )
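The recurrence above translates directly into a short implementation; this is a sketch following the slides (insert, delete, and overwrite all cost 1).

```python
def edit_distance(a, b):
    """Edit distance between strings a and b via dynamic programming.

    M[i][j] holds the distance between the prefixes a[:i] and b[:j],
    filled row by row using the recurrence from the slides."""
    m, n = len(a), len(b)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):          # turning a[:i] into "" takes i deletes
        M[i][0] = i
    for j in range(n + 1):          # turning "" into b[:j] takes j inserts
        M[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1   # overwrite needed?
            M[i][j] = min(M[i - 1][j - 1] + cost,     # overwrite / match
                          M[i - 1][j] + 1,            # delete
                          M[i][j - 1] + 1)            # insert
    return M[m][n]
```

For example, `edit_distance("smyth", "simte")` returns 3, matching the bottom-right cell of the matrix, and `edit_distance("Patrick Smith", "Patrik Smyth")` returns 2.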
Summary
Data Warehouse architectures: single-, two-, and three-layer architectures
ETL process: Extract, Transform/Cleanse, Load
Populating Dimension Tables
Handling type-1 and type-2 changes may require the dimension table to be loaded in the data staging area
Introduction of surrogate keys: permanently maintain mapping tables in the staging area
In case of a type-1 change: overwrite the value in the dimension table
In case of a type-2 change: introduce a new tuple with a new surrogate key
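The two change types can be sketched as below. The table layout, the `current` flag, and the helper names are illustrative assumptions, not the slides' notation; `key_map` plays the role of the mapping table kept in the staging area.

```python
dimension = []   # dimension rows: {"sk": surrogate key, "name", "segment", "current"}
key_map = {}     # staging mapping table: natural key -> current surrogate key
next_sk = [1]    # surrogate key counter (list so the function can mutate it)

def load_customer(name, segment, change_type):
    """Apply one change to the customer dimension (sketch)."""
    if name not in key_map:                     # unseen customer: plain insert
        row = {"sk": next_sk[0], "name": name, "segment": segment, "current": True}
        next_sk[0] += 1
        dimension.append(row)
        key_map[name] = row["sk"]
    elif change_type == 1:                      # type 1: overwrite in place
        for row in dimension:
            if row["sk"] == key_map[name]:
                row["segment"] = segment
    else:                                       # type 2: new tuple, new surrogate key
        for row in dimension:
            if row["sk"] == key_map[name]:
                row["current"] = False          # close the old version
        row = {"sk": next_sk[0], "name": name, "segment": segment, "current": True}
        next_sk[0] += 1
        dimension.append(row)
        key_map[name] = row["sk"]

load_customer("John", 1, 2)
load_customer("John", 2, 2)   # type-2 change: John now has two rows, two surrogate keys
```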
Populating Fact Tables
Always update the dimension tables before the fact table (if possible): referential integrity
Not always possible: introduce a dummy value in the dimension table if the dimension key lookup fails
Update materialized views; can be optimized if the measures are distributive
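Why distributive measures help: a SUM view can be refreshed from the newly loaded facts alone, without rescanning the full fact table. A minimal sketch, with illustrative names:

```python
def refresh_sum_view(view, new_facts):
    """Incrementally refresh a materialized SUM view (sketch).

    view: dict mapping group-by key -> running SUM
    new_facts: iterable of (key, measure) pairs from the current load
    Since SUM is distributive, adding the deltas is enough."""
    for key, measure in new_facts:
        view[key] = view.get(key, 0) + measure
    return view

view = {("store1", "2024-01"): 100}
refresh_sum_view(view, [(("store1", "2024-01"), 25),
                        (("store2", "2024-01"), 40)])
# view now holds 125 for store1 and 40 for store2
```

The same scheme works for COUNT; AVG needs SUM and COUNT stored separately, and holistic measures such as MEDIAN cannot be refreshed this way at all.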
Computation of Aggregations
Part of the data warehouse load involves computing the aggregations (materialized views)
The order in which they are computed can be important
Sort-based or hash-based
See, e.g.: Agarwal et al. On the computation of multidimensional aggregates. VLDB 1996
Sort-Based Aggregation
SELECT A, B, C, sum(count) FROM R GROUP BY A, B, C;

Input R:              After SORT on (A,B,C):
A B C count           A B C count
1 5 6 8               1 5 6 8
2 1 4 9               1 5 6 6
1 8 6 10              1 8 6 10
1 5 6 6               2 1 4 8
3 3 3 5               2 1 4 9
2 1 4 8               3 3 3 5

SCAN the sorted table, accumulating the running group and emitting it when (A,B,C) changes:
A B C SUM
1 5 6 14
1 8 6 10
2 1 4 17
3 3 3 5
Sort-Based Aggregation
Observation: a table sorted on ABC is also sorted on AB, and on A
One sort supports three aggregations
Pipe-Sort
Sort on ABC, then scan the sorted relation:
As long as the next tuple has the same (A,B,C) as the previous one, add its count to the current aggregated tuple
Otherwise: ship the finished aggregated tuple to disk, and pipe it to the procedure computing the aggregation on AB
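The pipe idea can be sketched with a single scan routine reused at every level: the ABC output is still sorted on AB, so it feeds the AB aggregation directly, which in turn feeds A. The function name and tuple encoding are illustrative.

```python
def scan_aggregate(rows, key_len):
    """One scan over rows sorted on their first key_len attributes,
    yielding one tuple (key..., sum) per group. The last element of
    each input row is the measure. The output is itself sorted, so it
    can be piped straight into the next, coarser aggregation."""
    cur_key, cur_sum = None, 0
    for row in rows:
        key, count = row[:key_len], row[-1]
        if key != cur_key and cur_key is not None:
            yield cur_key + (cur_sum,)      # group finished: emit it
            cur_sum = 0
        cur_key = key
        cur_sum += count
    if cur_key is not None:
        yield cur_key + (cur_sum,)          # emit the final group

R = [(1, 5, 6, 8), (2, 1, 4, 9), (1, 8, 6, 10),
     (1, 5, 6, 6), (3, 3, 3, 5), (2, 1, 4, 8)]

abc = list(scan_aggregate(sorted(R), 3))    # the single sort ...
ab  = list(scan_aggregate(abc, 2))          # ... feeds the whole pipe
a   = list(scan_aggregate(ab, 1))
```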
Pipe-Sort
Key problem: divide the materialized-view lattice into pipes, minimizing the number of sorts
[Lattice diagram with nodes ABCD, BCD, AB, BC, A, B, D, (); the edges are annotated with sorts, e.g. one sort on BCDA serves a whole pipe, while other nodes require additional sorts]
Hash-Based Optimization
If the aggregate tables fit into memory: SCAN the (unsorted) input once, accumulating each tuple into a hash table keyed on (A,B,C)

Input:        Hash table after the scan:
A B C count   A B C SUM
1 5 6 8       1 5 6 14
2 1 4 9       2 1 4 17
1 8 6 10      1 8 6 10
1 5 6 6       3 3 3 5
3 3 3 5
2 1 4 8
Hash-Based Optimization
Multiple hash tables may fit into memory at the same time, so one scan can feed several aggregations:
ABC SUM: (1,5,6) → 14; (2,1,4) → 17; (1,8,6) → 10; (3,3,3) → 5
AC SUM:  (1,6) → 24; (2,4) → 17; (3,3) → 5
C SUM:   (6) → 24; (4) → 17; (3) → 5
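A minimal sketch of the multi-table variant, assuming all three hash tables fit into memory. The groupings (ABC, AC, C) mirror the slide example; the variable names are illustrative.

```python
from collections import defaultdict

# Unsorted input relation R(A, B, C, count)
R = [(1, 5, 6, 8), (2, 1, 4, 9), (1, 8, 6, 10),
     (1, 5, 6, 6), (3, 3, 3, 5), (2, 1, 4, 8)]

abc = defaultdict(int)   # hash table for GROUP BY A, B, C
ac  = defaultdict(int)   # hash table for GROUP BY A, C
c   = defaultdict(int)   # hash table for GROUP BY C

for a_val, b_val, c_val, count in R:        # a single pass over the data
    abc[(a_val, b_val, c_val)] += count     # updates all three
    ac[(a_val, c_val)] += count             # aggregations at once;
    c[(c_val,)] += count                    # no sorting needed
```

Unlike Pipe-Sort, this needs no sort at all, but memory must hold every aggregate table updated during the scan.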
Conclusion
ETL is often the most time-consuming part of a data warehousing project: reported to take up to 80% of the time
Different ways to capture changes: application-assisted; trigger-based; log-based; timestamp-based; file comparison
Conclusion
Transform/Cleanse involves normalization, standardization, de-duplication, and entity resolution
Load: surrogate keys (lookup tables in the staging area); first update the dimension tables, only then the fact table