DW Performance Optimization (II)

Marjorie McLaughlin
5 years ago

1 DW Performance Optimization (II)

2 Overview
- Data Cube in ROLAP and MOLAP
- ROLAP Technique(s)
  - Efficient Data Cube Computation
- MOLAP Technique(s)
  - Prefix Sum Array
  - Multiway Augmented Tree
Aalborg University 8 - DWML course

3 Data Cube
- Data cube queries compute aggregates over fact tables at different granularities
- CUBE BY: Product, Time, Location
- Aggregation function: SUM(Sales)
- Cube table (i.e., results) with columns Prod., Time, Loc., Sales
[Figure: the cube table and a 3-D data cube with axes Product (TV, PC, VCR, sum), Time (quarters, sum), and Location (U.S.A, Canada, Mexico, sum); cell values not recoverable]
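The CUBE BY described above can be sketched in a few lines of Python. This is only an illustration: the function name `cube_by`, the dictionary layout, and the three fact rows are assumptions made for the example, not values from the slide. It computes one SUM per subset of the CUBE BY attributes, which gives 2^3 = 8 group-bys for three dimensions.

```python
from collections import defaultdict
from itertools import combinations

def cube_by(rows, dims, measure):
    """Return {group-by attributes: {key values: SUM(measure)}} for every subset of dims."""
    result = {}
    for k in range(len(dims), -1, -1):
        for group in combinations(dims, k):
            sums = defaultdict(int)
            for row in rows:
                sums[tuple(row[d] for d in group)] += row[measure]
            result[group] = dict(sums)
    return result

# Hypothetical fact rows (not the values from the slide).
facts = [
    {"product": "TV", "time": "Q1", "location": "Canada", "sales": 10},
    {"product": "TV", "time": "Q2", "location": "Mexico", "sales": 5},
    {"product": "PC", "time": "Q1", "location": "Canada", "sales": 7},
]
cube = cube_by(facts, ("product", "time", "location"), "sales")
print(len(cube))                     # 8 group-bys
print(cube[()][()])                  # grand total: 22
print(cube[("product",)][("TV",)])   # 15
```

This naive version scans the fact rows once per group-by, which is exactly the cost the later slides try to avoid.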

4 ROLAP vs MOLAP
- ROLAP data cube
  - Stored in a relational table
  - Good for sparse data cubes?
  - Scalable; storage use; response time
- MOLAP data cube
  - Stored in special multidimensional data structures
  - Good for dense data cubes?
  - Storage use (foreign keys not needed); response time; scalable
[Figure: the same small cube shown as a ROLAP table (two dimension columns and a SUM column) and as a MOLAP array indexed by the two dimensions; values not recoverable]

5 ROLAP Technique(s)

6 ROLAP data cube computation
- Problem: how do we efficiently compute a data cube from a fact table?
- Constraints/background
  - The fact table is huge, e.g., a sales fact table on the order of terabytes
  - Main memory is relatively small, e.g., on the order of gigabytes
  - The whole fact table CANNOT fit in memory
  - We need methods such as external-memory sorting and data partitioning
- We focus on I/O cost, since I/O time >> CPU time
- External mergesort

7 Computing Data Cube
- Dimensions: Product (P), Time (T), Location (L)
- Data cube and the lattice model
  - This cube is a lattice with 8 nodes: PTL; PL, PT, TL; P, T, L; none
  - The nodes correspond to the 8 differently colored parts of the cube
- How do we compute this cube from the fact table?
  - Each node is a GROUP BY query
  - White: GROUP BY product, time, location
  - Light green: GROUP BY product, location
  - Gray: GROUP BY location
- Computation cost: 8 external sorts over the fact table
[Figure: the lattice (PTL at the top, none at the bottom) and the colored 3-D cube with axes Product, Time, Location]

8 Computing Data Cube
- Computation sharing along a lattice path
  - E.g., path: PTL -> PT -> P -> none
  - Along a path, the sort for the GROUP BY is performed only for the first node
  - Results for the other nodes on the path are obtained in the same pass
- Remaining paths to consider: PL -> L and TL -> T
- Cost: 3 external sorts over the fact table (one per path), i.e., the number of nodes in the middle layer of the lattice
[Figure: the lattice and the base table sorted by (P, T, L), with brackets marking runs of rows that share the same PTL, PT, and P values; table values not recoverable]
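The sharing along the path PTL -> PT -> P -> none can be sketched as a single pass over sorted rows that keeps one running total per prefix length. This is a minimal in-memory sketch: Python's `sorted` stands in for the external sort, and the function name and the four rows are assumptions made for the example.

```python
def aggregate_along_path(rows, n_dims=3):
    """rows: tuples (p, t, l, sales). Returns results[k] = list of (prefix of length k, SUM)."""
    rows = sorted(rows)                         # stands in for one external sort on (P, T, L)
    results = [[] for _ in range(n_dims + 1)]   # results[3]=PTL, [2]=PT, [1]=P, [0]=none
    current = [None] * (n_dims + 1)
    totals = [0] * (n_dims + 1)
    started = False
    for row in rows:
        key, value = row[:n_dims], row[n_dims]
        for k in range(n_dims + 1):
            prefix = key[:k]
            if started and prefix != current[k]:
                # the run of rows sharing this prefix has ended: emit its total
                results[k].append((current[k], totals[k]))
                totals[k] = 0
            current[k] = prefix
            totals[k] += value
        started = True
    if started:
        for k in range(n_dims + 1):
            results[k].append((current[k], totals[k]))
    return results

# Hypothetical rows (not the values from the slide).
rows = [("PC", "Q1", "Canada", 3), ("PC", "Q1", "Mexico", 4),
        ("PC", "Q2", "Canada", 5), ("TV", "Q1", "Canada", 6)]
results = aggregate_along_path(rows)
print(results[1])   # [(('PC',), 12), (('TV',), 6)]
print(results[0])   # [((), 18)]
```

Because the rows are sorted on (P, T, L), every group of every prefix is contiguous, so four group-bys cost one sort and one scan.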

9 Sparseness
- Sparse relation / base table
  - Large number of CUBE BY attributes (i.e., a large lattice)
  - Large domains for the CUBE BY attributes
  - The base table size is a small fraction of the cross-product size of the attribute domains
- Existing methods are not efficient
  - E.g., even a single external sort is expensive, requiring multiple passes over the fact table

10 Partitioned-Cube
- Computes the data cube efficiently for sparse data
  - "Fast Computation of Sparse Datacubes", in VLDB 1997
- Main memory has a fixed size, and we cannot read the whole fact table into it
  - Partitioning is faster than external sorting
- Partitioned-Cube
  - Partitions the large relation into fragments that fit in memory
  - Follows the recursive structure of data cubes
  - A sub-datacube is obtained by fixing each possible value of a CUBE BY attribute

11 Partitioned-Cube (cont.)
Algorithm Partition-Cube(R, {B_1, ..., B_m}, A, G)
  Inputs
    R: a set of tuples
    {B_1, ..., B_m}: CUBE BY attributes
    A: measure value
    G: aggregate function
  Outputs
    F: finest-granularity datacube tuples
    D: remaining tuples (those with an ALL value in at least one attribute)
  0: if (R fits in memory) then return Memory-Cube(R, {B_1, ..., B_m}, A, G)
  1: choose an attribute B_j among {B_1, ..., B_m}, then scan R and partition on B_j into {R_1, ..., R_n}
  2: for (i = 1 to n): (F_i, D_i) = Partition-Cube(R_i, {B_1, ..., B_m}, A, G)
  3: let F = union of the F_i's
  4: let (F', D') = Partition-Cube(F, {B_1, ..., B_{j-1}, B_{j+1}, ..., B_m}, A, G)
  5: let D = union of F', D', and the D_i's
  6: return (F, D)
Note: n ≤ min{ m, # slots in memory }
[Figure: example relation R with B_1 = Country, B_2 = Year, A = Sales, G = SUM; values not recoverable]
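The recursion above can be sketched in Python as follows. This is a simplified in-memory illustration, not the paper's implementation: "fits in memory" is modeled by a row-count limit, the partitioning attribute is simply the first free one, each partition holds a single value of that attribute, and the names `partitioned_cube`, `memory_cube`, `fixed`, and `free` are assumptions. The `fixed` set records attributes already partitioned on, so each partition only produces group-bys that contain them; group-bys without the partitioning attribute come from the recursive call on the projected F.

```python
from collections import defaultdict
from itertools import combinations

ALL = "*"   # marks an attribute that is not part of the GROUP BY

def group_by(rows, keep, width):
    sums = defaultdict(int)
    for key, value in rows:
        sums[tuple(key[i] if i in keep else ALL for i in range(width))] += value
    return list(sums.items())

def memory_cube(rows, free, fixed, width):
    """In-memory cube over fixed + every subset of free; returns (finest, rest)."""
    finest = group_by(rows, fixed | set(free), width)
    rest = []
    for k in range(len(free)):                    # proper subsets of the free attributes
        for subset in combinations(free, k):
            rest.extend(group_by(rows, fixed | set(subset), width))
    return finest, rest

def partitioned_cube(rows, free, fixed, width, memory_limit):
    if len(rows) <= memory_limit or not free:     # relation fits: compute in memory
        return memory_cube(rows, free, fixed, width)
    j, others = free[0], free[1:]                 # assumption: pick the first free attribute
    partitions = defaultdict(list)
    for key, value in rows:
        partitions[key[j]].append((key, value))
    finest, rest = [], []
    for part in partitions.values():              # cube of each partition, always including B_j
        f_i, d_i = partitioned_cube(part, others, fixed | {j}, width, memory_limit)
        finest.extend(f_i)
        rest.extend(d_i)
    # project B_j out of F and recurse on the (smaller) result
    projected = [(key[:j] + (ALL,) + key[j + 1:], v) for key, v in finest]
    f2, d2 = partitioned_cube(projected, others, fixed, width, memory_limit)
    return finest, rest + f2 + d2

# Hypothetical (Country, Year) -> Sales rows; memory holds 4 tuples.
R = [(("DK", 2007), 3), (("DK", 2008), 2), (("DK", 2007), 4),
     (("SE", 2007), 5), (("SE", 2008), 1), (("SE", 2008), 8)]
F, D = partitioned_cube(R, (0, 1), set(), 2, 4)
print(sorted(F))
print(dict(D)[(ALL, ALL)])   # grand total: 23
```

Reusing F for the second recursive call works here because SUM is distributive: the finest-granularity totals can be aggregated further without going back to the base table.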

12 Outline of Steps for the Example
- Attributes: Country (B_1), Year (B_2)
- Choose attribute B_1, partition the base table on B_1
  - Compute the cube B_1 B_2
  - Compute other cubes B_1 * (excluding B_1 B_2), i.e., cube B_1
- Consider the cube B_1 B_2, project out the attribute B_1
- Remaining attribute(s): B_2
  - Compute the cube B_2
  - Compute the cube * (excluding B_2), i.e., cube None
[Figure: the lattice B_1 B_2; B_1, B_2; none, and relation R (Country, Year, Sales); values not recoverable]

13 Partitioned-Cube (cont.)
- STEP #0. Memory-Cube
  - Applicable when the relation fits in main memory
  - Read the input relation into memory
  - Compute the datacube (in memory only) using the computation-sharing method on slide #8

14 Partitioned-Cube (cont.)
- STEP #1
  - Select an attribute (say, Country)
  - Partition the large relation into fragments that fit in memory (assuming memory can hold 4 tuples in this example)
[Figure: relation R (Country, Year, Sales) split into fragments R_1 and R_2; values not recoverable]

15 Partitioned-Cube (cont.)
- STEP #2.1: Process the tuples in R_1
  - Now R_1 fits in main memory
  - We execute step #0 to compute sub-datacubes
  - Compute F_1: GROUP BY Country, Year
  - Compute D_1: GROUP BY any other combination with Country
    - E.g., GROUP BY Country
[Figure: R_1 with its results F_1 (GROUP BY Country, Year) and D_1 (GROUP BY Country); values not recoverable]

16 Partitioned-Cube (cont.)
- STEP #2.2: Process the tuples in R_2
  - In the same way, we compute F_2 and D_2
[Figure: R_2 with its results F_2 (GROUP BY Country, Year) and D_2 (GROUP BY Country); values not recoverable]

17 Partitioned-Cube (cont.)
- Step #3: F = F_1 ∪ F_2
- Step #4: set Country to ALL in F (i.e., Country not in GROUP BY)
  - Call Partition-Cube on F to obtain F' and D'
[Figure: F_1 and F_2 merged into F; F' (GROUP BY Year) and D' (GROUP BY none) computed from F; values not recoverable]

18 Partitioned-Cube (cont.)
- Step #5: D = (D' ∪ F') ∪ (∪_i D_i)
- Step #6: return F, D
[Figure: the final F and the tables F', D', D_1, D_2 that make up D; values not recoverable]

19 Partitioned-Cube (cont.)
- Recursively execute the partitioning step if there are more than 2 attributes
  - Not in this example, but in the next exercise

20 Partitioned-Cube Exercise
- Run the Partitioned-Cube algorithm on this example
  - Attributes: B_1, B_2, B_3
  - Verify your final result using the fact table
- The following outline is given to you
  - Choose attribute B_1, partition the base table on B_1
    - Compute the cube B_1 B_2 B_3
    - Compute other cubes B_1 * (excluding B_1 B_2 B_3), i.e., cubes B_1 B_2, B_1 B_3, B_1
  - Consider the cube B_1 B_2 B_3, project out the attribute B_1
  - Remaining attribute(s): B_2 B_3
  - Choose attribute B_2, partition the cube B_2 B_3 on B_2
    - Compute the cube B_2 B_3
    - Compute other cubes B_2 * (excluding B_2 B_3), i.e., cube B_2
  - Consider the cube B_2 B_3, project out the attribute B_2
  - Remaining attribute(s): B_3
    - Compute the cube B_3
    - Compute the cube * (excluding B_3), i.e., cube None
[Figure: relation R with B_1 = Prod., B_2 = Loc., B_3 = Year, A = Sales, G = SUM, containing rows for products PC and TV; values not recoverable]

21 MOLAP Technique(s)
- Prefix Sum Array
- Multiway Augmented Tree

22 Range Sum Query
- Range sum query
  - Given a MOLAP data cube
  - Specify a range for each (numeric) dimension
  - Compute the SUM of the values inside those ranges
- Example
  - Measure: salary
  - Numeric attributes: age, time
  - Find the revenue from customers within a given age range, in a given range of years
- Brute-force approach
  - Accumulate the SUM value while visiting the relevant cells
  - What happens if the query covers many cells in the cube?
- Better solution
  - "Range Queries in OLAP Data Cubes", in ACM SIGMOD
[Figure: MOLAP data cube with axes Age and Year]

23 Prefix Sum Array
- Consider a 2-D array A as a data cube for the moment
  - Age as attribute v_1, Year as attribute v_2 (values starting from 0)
- Construct a prefix-sum array P
  - P[i,j] = Σ_{v_1=0..i} Σ_{v_2=0..j} A[v_1,v_2]
  - P is computed quickly by visiting the cells in lexicographic order and reusing previous values
- A range sum query is of the form
  - RangeSum([l_1,h_1], [l_2,h_2]) = Σ_{v_1=l_1..h_1} Σ_{v_2=l_2..h_2} A[v_1,v_2]
- Using the prefix-sum array to answer queries fast
  - A range whose lower bounds are both 0 is easy: the answer is the single entry P[h_1,h_2]
  - A range with nonzero lower bounds needs more work
[Figure: cube A and its prefix-sum array P, both indexed by v_1 and v_2; values not recoverable]

24 Query Processing
- RangeSum([l_1,h_1], [l_2,h_2]) can be rewritten as a signed sum of four prefix range sums
  - RangeSum([0,h_1], [0,h_2]) - RangeSum([0,l_1-1], [0,h_2]) - RangeSum([0,h_1], [0,l_2-1]) + RangeSum([0,l_1-1], [0,l_2-1])
- Using the prefix-sum array P, this is
  - P[h_1,h_2] - P[l_1-1,h_2] - P[h_1,l_2-1] + P[l_1-1,l_2-1], where any term with index -1 is 0
- Advantage
  - We only need to fetch 4 values, regardless of the range [l_1,h_1], [l_2,h_2]
- Can we discard the array A? Why?
  - Any entry A[i,j] is equivalent to RangeSum([i,i], [j,j])
[Figure: cube A and its prefix-sum array P with the four regions of a worked example; values not recoverable]
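As a small illustration of the construction and the four-lookup query, here is a Python sketch with 0-based indices. The array values and the function names are assumptions chosen for the example, not the numbers from the slide.

```python
def build_prefix_sum(A):
    """P[i][j] = sum of A[0..i][0..j], filled in lexicographic order."""
    rows, cols = len(A), len(A[0])
    P = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            P[i][j] = (A[i][j]
                       + (P[i - 1][j] if i > 0 else 0)
                       + (P[i][j - 1] if j > 0 else 0)
                       - (P[i - 1][j - 1] if i > 0 and j > 0 else 0))
    return P

def range_sum(P, l1, h1, l2, h2):
    """Sum of A over [l1..h1] x [l2..h2] using at most four entries of P."""
    def p(i, j):
        return P[i][j] if i >= 0 and j >= 0 else 0
    return p(h1, h2) - p(l1 - 1, h2) - p(h1, l2 - 1) + p(l1 - 1, l2 - 1)

# Hypothetical cube values.
A = [[3, 5, 1],
     [7, 3, 2],
     [2, 4, 2]]
P = build_prefix_sum(A)
print(P)                         # [[3, 8, 9], [10, 18, 21], [12, 24, 29]]
print(range_sum(P, 1, 2, 1, 2))  # 11 = 3 + 2 + 4 + 2
print(range_sum(P, 1, 1, 1, 1))  # 3, i.e., A[1][1] recovered from P alone
```

The last call shows why A can be discarded: a single cell is just a range sum over a 1 x 1 range.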

25 Update / Maintenance
- Insert a tuple (x, y) with measure value δ
  - Increment A[x,y] by δ
  - Increment P[i,j] by δ for every i ≥ x and j ≥ y
  - In the slide's example, four entries of P are affected
- If we insert a tuple at (0,0), we need to increment the whole prefix-sum array!
- Updates in a data warehouse are often done in batches
  - P can then be updated at low amortized cost
[Figure: cube A and its prefix-sum array P with the affected entries; values not recoverable]
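The update rule can be made concrete with a short Python sketch. The arrays and the helper name `insert_tuple` are assumptions for illustration; the function returns how many prefix-sum entries it had to touch.

```python
def insert_tuple(A, P, x, y, delta):
    """Add delta to cell (x, y) and to every prefix sum that covers it."""
    A[x][y] += delta
    touched = 0
    for i in range(x, len(P)):
        for j in range(y, len(P[0])):
            P[i][j] += delta
            touched += 1
    return touched

# Hypothetical cube and its prefix-sum array.
A = [[3, 5, 1],
     [7, 3, 2],
     [2, 4, 2]]
P = [[3, 8, 9],
     [10, 18, 21],
     [12, 24, 29]]
touched_inner = insert_tuple(A, P, 1, 1, 4)   # touches 4 entries
touched_origin = insert_tuple(A, P, 0, 0, 1)  # touches all 9 entries
print(touched_inner, touched_origin)
print(P)   # [[4, 9, 10], [11, 23, 26], [13, 29, 34]]
```

An insert near the origin rewrites almost the whole array, which is why batching updates matters for this structure.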

26 Extension to Multi-dimensional Case
- Suppose there are d dimensions and the cube A has N entries
- RangeSum([l_1,h_1], [l_2,h_2], ..., [l_d,h_d])
  - Can be computed by the straightforward method, using Π_{j=1..d} (h_j - l_j + 1) cell values
- The prefix-sum array P (of A) also has N entries
  - Pre-computation time of P: O(dN)
  - No need to keep A afterwards
- Using the prefix-sum array P, we only need to access 2^d elements of P to compute RangeSum
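A d-dimensional version can be sketched with dictionaries keyed by index tuples. The representation and the names are assumptions made to keep the example dependency-free. P is built with one sweep per dimension, which matches the O(dN) pre-computation, and a query combines the 2^d corner entries with alternating signs (inclusion-exclusion).

```python
from itertools import product

def build_prefix_sum(A, shape):
    """A: dict from index tuples to values. One 1-D prefix sweep per axis: O(dN)."""
    cells = list(product(*(range(n) for n in shape)))   # lexicographic order
    P = {idx: A.get(idx, 0) for idx in cells}
    for axis in range(len(shape)):
        for idx in cells:
            if idx[axis] > 0:
                prev = idx[:axis] + (idx[axis] - 1,) + idx[axis + 1:]
                P[idx] += P[prev]
    return P

def range_sum(P, lows, highs):
    """Inclusion-exclusion over the 2^d corners of the query box."""
    total = 0
    for choice in product((0, 1), repeat=len(lows)):
        corner = tuple(l - 1 if c else h for c, l, h in zip(choice, lows, highs))
        if min(corner) < 0:
            continue                      # prefix sum over an empty range is 0
        sign = -1 if sum(choice) % 2 else 1
        total += sign * P[corner]
    return total

# Hypothetical 2 x 3 x 2 cube.
shape = (2, 3, 2)
A = {idx: idx[0] + 2 * idx[1] + 3 * idx[2] + 1
     for idx in product(*(range(n) for n in shape))}
P = build_prefix_sum(A, shape)
lows, highs = (0, 1, 0), (1, 2, 1)
brute = sum(v for idx, v in A.items()
            if all(l <= x <= h for x, l, h in zip(idx, lows, highs)))
print(range_sum(P, lows, highs), brute)   # both 48
```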

27 Blocked Prefix Sum Array
- Tradeoff between array space and query processing time
  - But we now need to keep the original array A
- Blocked prefix-sum array
  - Define b as the length of a group (block) of cells
  - Only keep each entry P[i,j] where
    - (i+1) mod b = 0 or i is the last index, and
    - (j+1) mod b = 0 or j is the last index
- Processing a range sum query
  - Decompose the query range into an internal region and a boundary region
  - The internal region can be processed with the techniques above
  - The border region is discussed on the next slide
[Figure: cube A, its blocked prefix-sum array P, and two example RangeSum queries; block length, indices, and values not recoverable]

28 Blocked Prefix Sum Array
- Decompose the query range into regions
  - Internal region (dark gray)
    - Compute the sum using the blocked prefix-sum array P
  - Border region (light gray)
    - Compute the sum by accessing cells in the original array A
    - For each border block, choose the cheaper way to compute its sum
      - Visit the cells (of A) within the range, or
      - Visit the complement cells (of A) in the block and subtract them from the block's total
- How does the update cost of the blocked prefix-sum array compare with that of the prefix-sum array?
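The decomposition into an internal and a border region can be sketched as follows for the 2-D case. This is a simplified illustration under stated assumptions: the names are hypothetical, the kept entries are derived from a full prefix-sum array for brevity, and the border is always summed cell by cell from A (the "complement" optimization is left out).

```python
def build_blocked(A, b):
    """Keep P[i][j] only on block boundaries (or the last index) in both dimensions."""
    n, m = len(A), len(A[0])
    full = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            full[i][j] = (A[i][j] + (full[i - 1][j] if i else 0)
                          + (full[i][j - 1] if j else 0)
                          - (full[i - 1][j - 1] if i and j else 0))
    def kept(i, size):
        return (i + 1) % b == 0 or i == size - 1
    return {(i, j): full[i][j]
            for i in range(n) for j in range(m) if kept(i, n) and kept(j, m)}

def range_sum_blocked(A, kept, b, l1, h1, l2, h2):
    n, m = len(A), len(A[0])
    def inner(l, h, size):
        start = -(-l // b) * b                    # first block-aligned index >= l
        end = h if h == size - 1 else ((h + 1) // b) * b - 1
        return start, end
    a1, e1 = inner(l1, h1, n)
    a2, e2 = inner(l2, h2, m)
    def p(i, j):
        return kept[(i, j)] if i >= 0 and j >= 0 else 0
    has_inner = a1 <= e1 and a2 <= e2
    total = 0
    if has_inner:                                 # internal region: four lookups in kept P
        total += p(e1, e2) - p(a1 - 1, e2) - p(e1, a2 - 1) + p(a1 - 1, a2 - 1)
    for i in range(l1, h1 + 1):                   # border region: read cells of A
        for j in range(l2, h2 + 1):
            if not (has_inner and a1 <= i <= e1 and a2 <= j <= e2):
                total += A[i][j]
    return total

# Hypothetical 5 x 5 cube with block length 2.
A = [[i * 5 + j for j in range(5)] for i in range(5)]
kept = build_blocked(A, 2)
print(len(kept))                                  # 9 stored entries instead of 25
print(range_sum_blocked(A, kept, 2, 1, 4, 0, 2))  # 162
```

The space saving is roughly a factor of b per dimension, paid for by reading border cells of A at query time.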

29 Range Max Query
- A range max query is of the form
  - RangeMax([l_1,h_1], [l_2,h_2]) = max_{v_1=l_1..h_1} max_{v_2=l_2..h_2} A[v_1,v_2]
- Can we build a prefix-max array?
  - Consider two RangeMax queries, one anchored at the origin and one that is not
- The prefix property does not hold for range max queries
  - If the global maximum is at (0,0), it dominates every prefix-max entry and hides the maximum of any other local region
[Figure: cube A and an attempted prefix-max array; values not recoverable]
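A one-dimensional toy example (an assumption made here to keep it short; the slide uses a 2-D cube) shows why a prefix-max array loses information: two different arrays can have the same prefix-max array while their range-max answers differ, so no formula over prefix-max entries can answer the query.

```python
from itertools import accumulate

# Two hypothetical arrays with the global maximum at index 0.
A = [9, 1, 5]
B = [9, 5, 1]

prefix_max_A = list(accumulate(A, max))
prefix_max_B = list(accumulate(B, max))
print(prefix_max_A, prefix_max_B)   # [9, 9, 9] [9, 9, 9]

# The same prefix-max array, yet RangeMax([1, 1]) differs.
print(max(A[1:2]), max(B[1:2]))     # 1 5
```

SUM avoids this problem because it has an inverse (subtraction); MAX does not.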

30 Multiway Augmented Tree
- Multiway tree structure
  - Balanced tree
  - Each node stores its associated region and the maximum value in that region
- Branch-and-bound search for a RangeMax query
  - Visit the root node
  - Consider its subtrees that overlap the query range
  - Visit one of these subtrees first
    - Check its leaf nodes
    - Obtain the maximum value 5
  - Track back to the other branch
    - No need to visit its subtree, since its stored maximum cannot improve on 5
  - Return 5 as the result
[Figure: cube A and the tree, with root ** and child nodes labeled by index patterns and their maxima (e.g., 9 and 5); remaining labels and values not recoverable]
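The branch-and-bound search can be sketched with a small tree in Python. The tree below is an assumption for illustration: it splits each region into up to four quadrants, whereas the slide's tree divides branches by index patterns; the class and function names and the cube values are hypothetical.

```python
class Node:
    """Tree node covering rows r1..r2 and columns c1..c2, storing the region's maximum."""
    def __init__(self, A, r1, r2, c1, c2):
        self.region = (r1, r2, c1, c2)
        self.children = []
        if r1 == r2 and c1 == c2:
            self.max = A[r1][c1]
            return
        rm, cm = (r1 + r2) // 2, (c1 + c2) // 2
        for a, b in ((r1, rm), (rm + 1, r2)):
            for c, d in ((c1, cm), (cm + 1, c2)):
                if a <= b and c <= d:
                    self.children.append(Node(A, a, b, c, d))
        self.max = max(child.max for child in self.children)

def range_max(node, query, best=float("-inf")):
    r1, r2, c1, c2 = node.region
    l1, h1, l2, h2 = query
    if r2 < l1 or r1 > h1 or c2 < l2 or c1 > h2:
        return best                 # region does not overlap the query
    if node.max <= best:
        return best                 # bound: this subtree cannot improve the answer
    if l1 <= r1 and r2 <= h1 and l2 <= c1 and c2 <= h2:
        return node.max             # region fully inside the query
    # branch: try the most promising children first
    for child in sorted(node.children, key=lambda ch: -ch.max):
        best = range_max(child, query, best)
    return best

# Hypothetical cube values.
A = [[3, 1, 4, 1],
     [5, 9, 2, 6],
     [5, 3, 5, 8],
     [9, 7, 9, 3]]
root = Node(A, 0, 3, 0, 3)
print(range_max(root, (0, 0, 2, 3)))   # 4
print(range_max(root, (2, 3, 2, 3)))   # 9
```

Visiting children in decreasing order of their stored maximum tends to raise the bound early, so more subtrees are pruned.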

31 Update Multiway Augmented Tree
- Insertion
  - Insert a tuple (x, y) with measure value δ
  - A[x,y] = max(δ, A[x,y])
  - Compare the new value with the old value, and update the tree as described under "Tree update"
- Deletion
  - Delete a tuple (x, y) with measure value δ
  - If δ equals A[x,y], we need to search the tuples in that cell and recompute A[x,y]
  - Compare the new value with the old value, and update the tree as described under "Tree update"
- Tree update
  - If the new value differs from the old value, locate the leaf node and propagate the change up the tree
- Efficient update compared with the prefix-sum cube
[Figure: cube A and the tree from the previous slide; values not recoverable]
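The insertion and tree-update steps can be sketched as follows. The dictionary-based quadrant tree and the names `build`, `update`, and `insert` are assumptions for illustration. Deletion is omitted because it needs the individual tuples of a cell, which this sketch does not store.

```python
def build(A, r1, r2, c1, c2):
    node = {"region": (r1, r2, c1, c2), "children": []}
    if r1 == r2 and c1 == c2:
        node["max"] = A[r1][c1]
        return node
    rm, cm = (r1 + r2) // 2, (c1 + c2) // 2
    for a, b in ((r1, rm), (rm + 1, r2)):
        for c, d in ((c1, cm), (cm + 1, c2)):
            if a <= b and c <= d:
                node["children"].append(build(A, a, b, c, d))
    node["max"] = max(ch["max"] for ch in node["children"])
    return node

def update(node, i, j, new_value):
    """Set leaf (i, j) to new_value; return True if this node's maximum changed."""
    r1, r2, c1, c2 = node["region"]
    if not (r1 <= i <= r2 and c1 <= j <= c2):
        return False
    old = node["max"]
    if not node["children"]:
        node["max"] = new_value
    elif any(update(ch, i, j, new_value) for ch in node["children"]):
        # a child's maximum changed, so recompute ours; otherwise propagation stops here
        node["max"] = max(ch["max"] for ch in node["children"])
    return node["max"] != old

def insert(root, A, i, j, delta):
    """Insert a tuple with measure delta into cell (i, j)."""
    new_value = max(delta, A[i][j])
    if new_value == A[i][j]:
        return False                 # cell maximum unchanged: the tree is untouched
    A[i][j] = new_value
    return update(root, i, j, new_value)

# Hypothetical 2 x 2 cube.
A = [[1, 9],
     [5, 3]]
root = build(A, 0, 1, 0, 1)
first = insert(root, A, 1, 1, 7)     # leaf changes, root maximum stays 9
second = insert(root, A, 1, 0, 12)   # change propagates to the root
print(first, second, root["max"])    # False True 12
```

Only the nodes on one root-to-leaf path are touched, in contrast to the prefix-sum array, where a single insert may rewrite most entries.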

32 Variant of the Tree
- Like the previous multiway tree
  - Still a balanced tree
  - Each node is associated with a region and the maximum value in that region
- The only difference: the way the branches are divided
- The previous solution for RangeMax queries is still applicable to this tree
[Figure: cube A and a tree whose nodes are labeled with explicit regions [l_1,h_1][l_2,h_2] and their maxima (e.g., 9 and 8); region bounds not recoverable]

33 Using the Tree for Range Sum Query
- Can we apply the multiway augmented tree to answer range sum queries?
  - YES, if each node stores the sum of the values in its region
  - But it is not as efficient as the prefix-sum array
- Processing a RangeSum query
  - Visit only the relevant tree nodes and accumulate the sum
  - Question: do we need to visit the subtree of a given node [l_1,h_1][l_2,h_2]?
[Figure: cube A and the tree with region labels and stored sums (e.g., 4, 5, 9); region bounds not recoverable]
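A sum-augmented tree can be sketched in the same way. As before, the quadrant split and the names are assumptions for illustration. The query descends only into nodes that partially overlap the range; a node whose region lies entirely inside the range contributes its stored sum without visiting its subtree, and a disjoint node contributes nothing.

```python
def build_sum_tree(A, r1, r2, c1, c2):
    node = {"region": (r1, r2, c1, c2), "children": []}
    if r1 == r2 and c1 == c2:
        node["sum"] = A[r1][c1]
        return node
    rm, cm = (r1 + r2) // 2, (c1 + c2) // 2
    for a, b in ((r1, rm), (rm + 1, r2)):
        for c, d in ((c1, cm), (cm + 1, c2)):
            if a <= b and c <= d:
                node["children"].append(build_sum_tree(A, a, b, c, d))
    node["sum"] = sum(ch["sum"] for ch in node["children"])
    return node

def range_sum(node, query):
    r1, r2, c1, c2 = node["region"]
    l1, h1, l2, h2 = query
    if r2 < l1 or r1 > h1 or c2 < l2 or c1 > h2:
        return 0                     # disjoint: skip the subtree
    if l1 <= r1 and r2 <= h1 and l2 <= c1 and c2 <= h2:
        return node["sum"]           # fully covered: use the stored sum
    return sum(range_sum(ch, query) for ch in node["children"])

# Hypothetical cube values.
A = [[3, 1, 4, 1],
     [5, 9, 2, 6],
     [5, 3, 5, 8],
     [9, 7, 9, 3]]
root = build_sum_tree(A, 0, 3, 0, 3)
print(range_sum(root, (0, 1, 0, 1)))   # 18, answered from one stored sum
print(range_sum(root, (1, 2, 1, 2)))   # 19 = 9 + 2 + 3 + 5
```

Unlike the prefix-sum array, the number of nodes visited grows with the size of the query's boundary, which is why the slide calls the tree less efficient for sums.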

34 Summary
- Data Cube in ROLAP and MOLAP
- ROLAP Technique(s)
  - Efficient Data Cube Computation
- MOLAP Technique(s)
  - Prefix Sum Array
  - Multiway Augmented Tree
