CSI 4352, Introduction to Data Mining
Chapter 5, Data Cube Computation
Young-Rae Cho, Associate Professor
Department of Computer Science, Baylor University

A Roadmap for Data Cube Computation
  Full Cube
    Full materialization
    Materializes all the cells of all the cuboids of a given data cube
    Raises issues in computation time and storage space
  Iceberg Cube
    Partial materialization
    Materializes the cells of only the interesting cuboids
    Materializes only the cells in a cuboid whose measure value satisfies the minimum threshold (min_sup)
  Closed Cube
    Materializes only closed cells

Typical Computation Process
  Example: lattice of cuboids for the dimensions time, item, location, supplier
    0-D (apex) cuboid: all
    1-D cuboids: time; item; location; supplier
    2-D cuboids: (time, item); (time, location); (time, supplier); (item, location); (item, supplier); (location, supplier)
    3-D cuboids: (time, item, location); (time, item, supplier); (time, location, supplier); (item, location, supplier)
    4-D (base) cuboid: (time, item, location, supplier)

Computation Techniques
  Aggregating
    Aggregates from the smallest child cuboid
  Caching
    Caches the result of a cuboid for the computation of other cuboids to reduce disk I/O
  Sorting, Hashing and Grouping
    Sorting, hashing, and grouping operations are applied to a dimension in order to reorder and cluster related tuples
  Pruning
    Apriori pruning of the cells with lower support than the minimum threshold
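
The lattice in the example can be enumerated mechanically: an n-dimensional cube has one cuboid per subset of its dimensions, so 2^n cuboids in total. The following is a minimal sketch; the function name `cuboids` is an illustrative choice, not something from the slides.

```python
from itertools import combinations

def cuboids(dimensions):
    """Return every cuboid (subset of dimensions), from the 0-D apex to the n-D base."""
    return [combo
            for k in range(len(dimensions) + 1)
            for combo in combinations(dimensions, k)]

lattice = cuboids(["time", "item", "location", "supplier"])
print(len(lattice))   # 16 cuboids = 2**4
print(lattice[0])     # () is the apex cuboid "all"
print(lattice[-1])    # the base cuboid with all four dimensions
```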

Chapter 5, Data Cube Computation
  Multi-way Array Aggregation
  BUC: Iceberg Cube Computation
  Star-Cubing
  Shell Fragment Cube Computation

Multi-Way Array Aggregation
  Features
    Array-based bottom-up approach
    Uses multi-dimensional chunks
    No direct tuple comparisons
    Simultaneous aggregation on multiple dimensions
    Intermediate aggregate values are re-used for computing ancestor cuboids
    Full materialization (no iceberg optimization, no Apriori pruning)
  [Figure: cuboid lattice for dimensions A, B, C with levels all; A, B, C; AB, AC, BC; ABC]

Aggregation Strategy
  Partitions the array into chunks
    Chunk: a small sub-cube that fits in memory
  Data addressing
    Uses chunk id and offset
  Multi-way aggregation
    Computes aggregates on multiple dimensions simultaneously
    Visits chunks in an order that minimizes memory accesses and memory space
  [Figure: 3-D array with dimensions A (a0 to a3), B (b0 to b3) and C (c0 to c3), partitioned into 4 x 4 x 4 = 64 chunks numbered 1 to 64; A varies fastest, then B, then C, so chunks 1 to 16 form the c0 layer]

Aggregation Process
  Example
  [Figure: the same 64-chunk array]
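
As a rough illustration of simultaneous aggregation, the sketch below scans a small dense array chunk by chunk (A fastest, then B, then C) and updates the AB, AC and BC planes in the same pass, then derives the 1-D cuboids and the apex from those planes. It is a simplification under stated assumptions: the measure is a sum, the array is a nested Python list, and all three planes are kept fully in memory, so the memory-saving bookkeeping of the real method is not modeled. The function name `multiway_aggregate` is hypothetical.

```python
from collections import defaultdict
from itertools import product

def multiway_aggregate(array, chunk=2):
    """Aggregate the AB, AC and BC planes in one chunk-wise scan of array[a][b][c]."""
    na, nb, nc = len(array), len(array[0]), len(array[0][0])
    ab, ac, bc = defaultdict(int), defaultdict(int), defaultdict(int)
    # Chunk order: A varies fastest, then B, then C (the 1, 2, ..., 64 order).
    for c0, b0, a0 in product(range(0, nc, chunk), range(0, nb, chunk), range(0, na, chunk)):
        for a in range(a0, min(a0 + chunk, na)):
            for b in range(b0, min(b0 + chunk, nb)):
                for c in range(c0, min(c0 + chunk, nc)):
                    v = array[a][b][c]
                    ab[a, b] += v   # aggregate over C
                    ac[a, c] += v   # aggregate over B
                    bc[b, c] += v   # aggregate over A
    # Re-use the 2-D planes for the 1-D cuboids instead of rescanning the array.
    a_cub, b_cub, c_cub = defaultdict(int), defaultdict(int), defaultdict(int)
    for (a, b), v in ab.items():
        a_cub[a] += v
        b_cub[b] += v
    for (a, c), v in ac.items():
        c_cub[c] += v
    return {"AB": dict(ab), "AC": dict(ac), "BC": dict(bc),
            "A": dict(a_cub), "B": dict(b_cub), "C": dict(c_cub),
            "all": sum(a_cub.values())}

# Toy 2 x 2 x 2 array, indexed as toy[a][b][c].
toy = [[[1, 2], [3, 4]],
       [[5, 6], [7, 8]]]
cube = multiway_aggregate(toy, chunk=1)
print(cube["AB"][(0, 1)], cube["all"])   # 7 36
```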

Aggregation Process
  Example - Continued
  [Figure: the same 64-chunk array with dimensions A, B and C]

Optimization of Memory Usage
  What is the best traversing order?
    Depends on the memory space required
  Example
    Suppose the data size on dimensions A, B and C is 40, 400 and 4000, respectively. With 4 partitions per dimension, each chunk is 10 x 100 x 1000.
    Minimum memory is required when traversing the chunks in the order 1, 2, 3, 4, 5, ..., 64
    Total memory required is
      $100 \times 1000 + 40 \times 1000 + 40 \times 400 = 156{,}000$
      100 x 1000: one chunk of the BC plane
      40 x 1000: one row of the AC plane
      40 x 400: the whole AB plane
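
The arithmetic in the example can be checked, and compared against another traversal order, with a small helper. This is a sketch under assumptions: the function name `multiway_memory` and the generalization to an arbitrary scan order are illustrative, and memory is counted in array cells for the three 2-D planes only.

```python
def multiway_memory(sizes, chunks, order):
    """Cells needed to hold the three 2-D planes while scanning chunks.

    sizes / chunks map a dimension name to its full size / chunk size;
    order lists the dimensions from fastest-varying to slowest-varying.
    """
    x, y, z = order
    return (chunks[y] * chunks[z]    # plane yz: one chunk is enough
            + sizes[x] * chunks[z]   # plane xz: one row of chunks along x
            + sizes[x] * sizes[y])   # plane xy: the whole plane

sizes = {"A": 40, "B": 400, "C": 4000}
chunks = {"A": 10, "B": 100, "C": 1000}
best = multiway_memory(sizes, chunks, ("A", "B", "C"))    # 156000
worst = multiway_memory(sizes, chunks, ("C", "B", "A"))   # 1641000
print(best, worst)
```

Scanning with the smallest dimension varying fastest keeps only the smallest plane (AB) fully in memory, which is why the order 1, 2, ..., 64 wins here.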

Summary of Multi-Way Method
  Cuboids should be sorted and computed according to the data size on each dimension
  Keeps the smallest plane in main memory; fetches and computes only one chunk at a time for the largest plane
  Problems
    Full materialization: extremely large storage is required
    Performs well only for a small number of dimensions (for high-dimensional data, use partial materialization)
  Reference
    Zhao, Y., Deshpande, P.M. and Naughton, J.F., An Array-Based Algorithm for Simultaneous Multidimensional Aggregates, In Proceedings of SIGMOD (1997)

Bottom-Up Computation (BUC)
  Characteristics
    Top-down approach in the lattice as drawn here: starts at the apex cuboid (all) and works toward the base cuboid (the name "bottom-up" comes from the original paper, which draws the lattice in the reverse orientation)
    Partial materialization (iceberg cube computation)
    Divides dimensions into partitions and facilitates iceberg pruning
    No simultaneous aggregation
  Processing order of the cuboids (from the figure)
    1 all, 2 A, 3 AB, 4 ABC, 5 ABCD, 6 ABD, 7 AC, 8 ACD, 9 AD, 10 B, 11 BC, 12 BCD, 13 BD, 14 C, 15 CD, 16 D

Iceberg Pruning Process
  Partitioning
    Sorts data values
    Partitions into blocks that fit in memory
  Apriori Pruning
    For each block
      If it does not satisfy min_sup, its descendants are pruned
      If it satisfies min_sup, the cell is materialized and a recursive call is made that includes the next dimension
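
The partition-and-recurse process can be sketched in a few lines. This is a simplified illustration, not the published algorithm in full: it assumes count is the measure, holds all rows in memory, and uses dictionary grouping in place of sorting. The function name `buc` and the toy rows are assumptions made for the example.

```python
def buc(rows, dims, min_sup, start=0, cell=None, out=None):
    """Recursive BUC sketch; returns {cell: count}, '*' marks an aggregated dimension."""
    if cell is None:
        cell, out = ["*"] * dims, {}
        if len(rows) < min_sup:
            return out
    out[tuple(cell)] = len(rows)              # materialize the current cell
    for d in range(start, dims):
        partitions = {}
        for row in rows:                      # partition the block on dimension d
            partitions.setdefault(row[d], []).append(row)
        for value in sorted(partitions):
            part = partitions[value]
            if len(part) >= min_sup:          # Apriori pruning: skip small partitions
                cell[d] = value
                buc(part, dims, min_sup, d + 1, cell, out)
        cell[d] = "*"                         # restore before the next dimension
    return out

rows = [("a1", "b1", "c1", "d1"), ("a1", "b1", "c4", "d3"), ("a1", "b2", "c2", "d2"),
        ("a2", "b3", "c3", "d4"), ("a2", "b4", "c3", "d4")]
result = buc(rows, dims=4, min_sup=2)
print(result[("a1", "b1", "*", "*")])   # 2
```

Because a partition that fails min_sup is never recursed into, none of its descendant cells such as (a1, b2, c2, *) are ever counted.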

Apriori Pruning Example
  Description in SQL
    Iceberg cube computation
  Example
    compute cube iceberg_cube as
    select A, B, C, D, count(*)
    from R
    cube by A, B, C, D
    having count(*) >= min_sup
  Order of cell expansion (* denotes an aggregated dimension)
    ( *, *, *, * )        0-D cuboid
    ( a1, *, *, * )       1-D cuboid
    ( a1, b1, *, * )      2-D cuboid
    ( a1, b1, c1, * )     3-D cuboid
    ( a1, b1, c1, d1 )
    ( a1, b1, c2, * )     3-D cuboid
    ( a1, b2, *, * )
    The figure marks pruning at the 3-D cells: a cell that fails min_sup is not expanded further

Summary of BUC Method
  Suited to the computation of sparse data cubes
  Problems
    Sensitive to the order of dimensions
      The most discriminating dimension should be used first
      Dimensions should be in the order of decreasing cardinality
      Dimensions should be in the order of increasing maximum number of duplicates
  Reference
    Beyer, K. and Ramakrishnan, R., Bottom-Up Computation of Sparse and Iceberg CUBEs, In Proceedings of SIGMOD (1999)
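
For checking results on small inputs, the semantics of the iceberg query (group by every subset of the dimensions, keep cells with count at least min_sup) can be written as a brute-force reference. This sketch does no pruning at all, so it shows what an iceberg algorithm must output rather than how BUC computes it. The function name `iceberg_cube` and the toy relation are assumptions.

```python
from collections import Counter
from itertools import combinations

def iceberg_cube(rows, min_sup):
    """Brute-force iceberg cube with count as the measure; '*' marks an aggregated dimension."""
    n = len(rows[0])
    result = {}
    for k in range(n + 1):
        for kept in combinations(range(n), k):      # one group-by per cuboid
            counts = Counter(
                tuple(row[i] if i in kept else "*" for i in range(n))
                for row in rows
            )
            result.update({cell: c for cell, c in counts.items() if c >= min_sup})
    return result

R = [("a1", "b1"), ("a1", "b2"), ("a2", "b1")]
cells = iceberg_cube(R, min_sup=2)
print(cells)   # the apex plus the cells (a1, *) and (*, b1)
```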

Characteristics of Star-Cubing
  Characteristics
    Integrated method of top-down and bottom-up cube computation
    Explores both multidimensional aggregation (as in Multi-way) and Apriori pruning (as in BUC)
    Explores shared dimensions
      e.g., dimension A is a shared dimension of ACD and AD
      e.g., ABD/AB means ABD has the shared dimension AB
  Cuboid lattice annotated as cuboid/shared dimensions (from the figure)
    all
    A/A, B/B, C/C, D
    AB/AB, AC/AC, AD/A, BC/BC, BD/B, CD
    ABC/ABC, ABD/AB, ACD/A, BCD
    ABCD

Iceberg Pruning Strategy
  Iceberg Pruning in a Shared Dimension
    If AB is a shared dimension of ABD, then the cuboid AB is computed simultaneously with ABD
    If the aggregate on a shared dimension does not satisfy the iceberg condition (min_sup), then none of the cells extended from this shared dimension can satisfy the condition either
    If we can compute the shared dimensions before the actual cuboid, we can apply Apriori pruning
    Pruning while aggregating simultaneously on multiple dimensions

Cuboid Trees
  Tree structure to represent cuboids
    Base cuboid tree, 3-D cuboid trees, 2-D cuboid trees, ...
    Each level represents a dimension
    Each node represents an attribute value
    Each node includes the attribute value, aggregate value, descendant(s)
    The path from the root to a leaf represents a tuple
  Example (fragment of a base cuboid tree, shown as value: count)
    root: 100
      a1: 30
        b1: 10
          c1: 5
            d1: 2
            d2: 3
          c2: 5
        b2: 10
        b3: 10
      a2: 20
      a3: 20
      a4: 20
    count(a1,b1,*,*) = 10
    count(a1,b1,c1,*) = 5
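
A cuboid tree behaves like a prefix tree with a count on every node, so a prefix aggregate such as count(a1,b1,*,*) is read off the node at the end of the prefix path. The sketch below is one possible in-memory representation; the class and function names are hypothetical, and the three input rows are invented so that the a1/b1 branch reproduces the counts in the example.

```python
class Node:
    """One node of a cuboid tree: an attribute value, its count, its children."""
    def __init__(self, value):
        self.value, self.count, self.children = value, 0, {}

def build_cuboid_tree(rows):
    """Insert each (tuple, count) pair as a root-to-leaf path, summing counts."""
    root = Node("root")
    for values, count in rows:
        node = root
        node.count += count
        for v in values:
            node = node.children.setdefault(v, Node(v))
            node.count += count
    return root

def prefix_count(root, prefix):
    """count(prefix, *, ..., *): follow the prefix down from the root."""
    node = root
    for v in prefix:
        node = node.children.get(v)
        if node is None:
            return 0
    return node.count

# Hypothetical rows chosen to reproduce the a1/b1 branch of the example.
rows = [(("a1", "b1", "c1", "d1"), 2),
        (("a1", "b1", "c1", "d2"), 3),
        (("a1", "b1", "c2", "d1"), 5)]
tree = build_cuboid_tree(rows)
print(prefix_count(tree, ("a1", "b1")))         # 10
print(prefix_count(tree, ("a1", "b1", "c1")))   # 5
```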

Star Nodes
  If the single-dimensional aggregate of an attribute value does not satisfy min_sup, there is no need to consider the node in the iceberg cube computation
  Such nodes are replaced by *
  Example (min_sup = 2)
    Base table
      A    B    C    D    count
      a1   b1   c1   d1   1
      a1   b1   c4   d3   1
      a1   b2   c2   d2   1
      a2   b3   c3   d4   1
      a2   b4   c3   d4   1
    After replacing star nodes
      A    B    C    D    count
      a1   b1   *    *    1
      a1   b1   *    *    1
      a1   *    *    *    1
      a2   *    c3   d4   1
      a2   *    c3   d4   1
    Compressed table
      A    B    C    D    count
      a1   b1   *    *    2
      a1   *    *    *    1
      a2   *    c3   d4   2

Star Tree Structure
  Star Tree
    A cuboid tree that is compressed using star nodes
  Star Tree Construction
    Uses the compressed table
    Keeps the star table for lookup of star nodes
    Lossless compression from the original cuboid tree
  Star tree for the compressed table (shown as value: count)
    root: 5
      a1: 3
        b*: 1
          c*: 1
            d*: 1
        b1: 2
          c*: 2
            d*: 2
      a2: 2
        b*: 2
          c3: 2
            d4: 2
  Star Table
    b2  *
    b3  *
    b4  *
    c1  *
    c2  *
    d1  *
    ...
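
The star reduction in the example can be reproduced as follows: count each value per dimension, replace values whose one-dimensional count is below min_sup with *, and merge identical rows. This is a minimal sketch; the function name `star_reduce` is an assumption, and the star table is returned as a flat list, which relies on value names being unique across dimensions as they are in the example.

```python
from collections import Counter

def star_reduce(rows, min_sup):
    """Replace values whose 1-D count is below min_sup by '*', then merge equal rows."""
    n = len(rows[0])
    one_d = [Counter(row[i] for row in rows) for i in range(n)]
    star_table = sorted(v for c in one_d for v, k in c.items() if k < min_sup)
    reduced = Counter(
        tuple(v if one_d[i][v] >= min_sup else "*" for i, v in enumerate(row))
        for row in rows
    )
    return dict(reduced), star_table

base = [("a1", "b1", "c1", "d1"), ("a1", "b1", "c4", "d3"), ("a1", "b2", "c2", "d2"),
        ("a2", "b3", "c3", "d4"), ("a2", "b4", "c3", "d4")]
compressed, stars = star_reduce(base, min_sup=2)
print(compressed)
# {('a1', 'b1', '*', '*'): 2, ('a1', '*', '*', '*'): 1, ('a2', '*', 'c3', 'd4'): 2}
```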

Multi-Way Star-Tree Aggregation
  Aggregation
    DFS (depth-first search) from the root of a star tree
    Creates star trees for the cuboids on the next level
  [Figure: DFS over the base star tree (root:5; a1:3, a2:2; ...) with numbered steps that create the child trees: step 1 BCD:5; step 2 a1cd/a1:3; step 3 a1b*d/a1b*:1; step 4 a1b*c*/a1b*c*:1; step 5 reaches d*:1. Panels: Base Tree, BCD Tree, ACD/A Tree, ABD/AB Tree, ABC/ABC Tree]
  Pruning
    Prunes if the aggregates do not satisfy min_sup
    Prunes if all the nodes in the generated tree are star nodes

Implementation of Star Cubing Method
  Multi-way aggregation & iceberg pruning
  [Figure: the annotated cuboid lattice (all; A/A, C/C, D/D; AC/AC, AD/A, BC/BC, BD/B, CD; ABC/ABC, ABD/AB, ACD/A, BCD; ABCD) with several cuboids marked "pruning", shown next to the base star tree (root:5) and the BCD tree]

Summary of Star Cubing
  Problems
    Sensitive to the order of dimensions
      Dimensions should be in the order of decreasing cardinality
  Reference
    Xin, D., Han, J., Li, X. and Wah, B.W., Star-Cubing: Computing Iceberg Cubes by Top-Down and Bottom-Up Integration, In Proceedings of VLDB (2003)

Shell Fragment Cube Computation
  Motivations
    The computation and storage of an iceberg cube are still costly for high-dimensional data (the curse of dimensionality)
    It is hard to determine an appropriate iceberg threshold
    An iceberg cube cannot be updated incrementally
  Features
    Reduces a high-dimensional cube into a set of lower-dimensional cubes
    Lossless reduction
    Online re-construction of the high-dimensional data cube

Fragmentation Strategy
  Observation
    OLAP occurs only on a small subset of dimensions at a time
  Fragmentation
    Partitions the set of dimensions into shell fragments
    Computes a data cube for each shell fragment (computing 20 3-D data cubes is much cheaper than computing one 60-D data cube)
    Retains inverted indices or value-list indices
  Semi-Online Computation
    Given the pre-computed fragment cubes, dynamically computes cube cells of the high-dimensional cube online

Inverted Index
  Example
    TID  A   B   C   D   E
    1    a1  b1  c1  d1  e1
    2    a1  b2  c1  d2  e1
    3    a1  b2  c1  d1  e2
    4    a2  b1  c1  d1  e2
    5    a2  b1  c1  d1  e3
  Divides the 5 dimensions into 2 shell fragments, (A, B, C) and (D, E)
  Builds the inverted index (1-D inverted index)
    Value  TID List         Size
    a1     {1, 2, 3}        3
    a2     {4, 5}           2
    b1     {1, 4, 5}        3
    b2     {2, 3}           2
    c1     {1, 2, 3, 4, 5}  5
    d1     {1, 3, 4, 5}     4
    d2     {2}              1
    e1     {1, 2}           2
    e2     {3, 4}           2
    e3     {5}              1

Cube Computation Process
  Generalizes the 1-D inverted indices to multi-dimensional inverted indices
  Computes cuboids using the inverted indices
  Example
    Shell fragment cube ABC contains 7 cuboids: A, B, C, AB, AC, BC, ABC
    The cuboid AB is computed using the 2-D inverted index below
      Cell      Intersection           TID List  Size
      (a1,b1)   {1,2,3} ∩ {1,4,5}      {1}       1
      (a1,b2)   {1,2,3} ∩ {2,3}        {2,3}     2
      (a2,b1)   {4,5} ∩ {1,4,5}        {4,5}     2
      (a2,b2)   {4,5} ∩ {2,3}          {}        0
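
The two tables in the example can be rebuilt with set operations: the 1-D inverted index maps each value to its TID set, and each cell of the AB cuboid is the intersection of two such sets. This is a minimal sketch; the names `inverted_index` and `cuboid` are illustrative, and it assumes value names are unique across dimensions, as in the example table.

```python
from itertools import product

TABLE = {
    1: ("a1", "b1", "c1", "d1", "e1"),
    2: ("a1", "b2", "c1", "d2", "e1"),
    3: ("a1", "b2", "c1", "d1", "e2"),
    4: ("a2", "b1", "c1", "d1", "e2"),
    5: ("a2", "b1", "c1", "d1", "e3"),
}

def inverted_index(table):
    """Map every attribute value to the set of TIDs that contain it."""
    index = {}
    for tid, row in table.items():
        for value in row:
            index.setdefault(value, set()).add(tid)
    return index

def cuboid(index, *value_lists):
    """Intersect TID sets for every combination of values, one value list per dimension."""
    cells = {}
    for cell in product(*value_lists):
        cells[cell] = set.intersection(*(index[v] for v in cell))
    return cells

index = inverted_index(TABLE)
ab = cuboid(index, ["a1", "a2"], ["b1", "b2"])
print(ab[("a1", "b2")])   # {2, 3}
```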

Online Query Computation Process
  Given shell fragment cubes
    Divides the query into shell fragments
    Fetches the corresponding TID list for each fragment from the cubes
    Intersects the TID lists from each fragment to construct the instantiated base table
    Computes the online cube using the base table with any data cube computation algorithm
    Computes the query

Shell Fragment Cube Process
  [Figure: the dimensions A B C D E F ... are split into the ABC Cube and the DEF Cube; the DEF Cube contains the D Cuboid, the EF Cuboid, the DE Cuboid, ...]
  DE Cuboid
    Cell    T-ID List
    d1 e1   {1, 3, 8, 9}
    d1 e2   {2, 4, 6, 7}
    d2 e1   {5, 10}
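
The first three steps of the query process can be sketched for a point query: split the instantiated dimensions by fragment, fetch one TID set per involved fragment, and intersect. The sketch assumes a five-row table with fragments (A, B, C) and (D, E); the names `build_fragment_cube` and `query_tids` are hypothetical, and the final step of building an online cube over the resulting rows is left out.

```python
from itertools import combinations

DIMS = "ABCDE"
ROWS = {
    1: ("a1", "b1", "c1", "d1", "e1"),
    2: ("a1", "b2", "c1", "d2", "e1"),
    3: ("a1", "b2", "c1", "d1", "e2"),
    4: ("a2", "b1", "c1", "d1", "e2"),
    5: ("a2", "b1", "c1", "d1", "e3"),
}
FRAGMENTS = [("A", "B", "C"), ("D", "E")]

def build_fragment_cube(rows, fragment):
    """TID sets for every cell of every cuboid inside one shell fragment."""
    cube = {}
    for k in range(1, len(fragment) + 1):
        for dims in combinations(fragment, k):
            for tid, row in rows.items():
                cell = tuple((d, row[DIMS.index(d)]) for d in dims)
                cube.setdefault(cell, set()).add(tid)
    return cube

def query_tids(cubes, rows, query):
    """TIDs matching a point query such as {"A": "a1", "D": "d1"}."""
    tids = set(rows)
    for fragment, cube in cubes.items():
        cell = tuple((d, query[d]) for d in fragment if d in query)
        if cell:                              # this fragment is involved in the query
            tids &= cube.get(cell, set())
    return tids

cubes = {f: build_fragment_cube(ROWS, f) for f in FRAGMENTS}
hits = query_tids(cubes, ROWS, {"A": "a1", "B": "b2", "D": "d1"})
print(hits)   # {3}
```

The rows selected by the resulting TID set form the instantiated base table on which any cubing algorithm can then run.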

Online Query Process
  [Figure: dimensions A B C D E F G H I J K L M N ...; the fragment cubes for the queried dimensions yield the Instantiated Base Table, from which the Online Cube is computed]

Summary of Shell Fragment Cube
  Advantages
    Significantly faster data cube computation
    Significantly lower storage usage
    Various applications in very high-dimensional data
  Problems
    Tradeoff between the preprocessing time to construct a data cube and the online computation time for queries
    Tradeoff between the storage for a data cube and the memory for online computation
  Reference
    Li, X., Han, J. and Gonzalez, H., High-Dimensional OLAP: A Minimal Cubing Approach, In Proceedings of VLDB (2004)

Questions?
  Lecture slides are available on the course website: www.ecs.baylor.edu/faculty/cho/4352