Efficient Computation of Data Cubes. Network Database Lab


1 Efficient Computation of Data Cubes Network Database Lab

2 Outline
- Introduction
- Some CUBE Algorithms
- ArrayCube
- PartitionedCube and MemoryCube
- Bottom-Up Cube (BUC)
- Conclusions
- References

3 Introduction

4 Introduction
- Precompute aggregates: improve the response time of aggregation queries.
- Datacube operator: GROUP BY, CUBE BY.
- In OLAP:
  - The grouping attributes are called dimensions.
  - The attributes that are aggregated are called measures.
  - One particular GROUP BY in a CUBE computation is called a cuboid or group-by.

5 The basic CUBE problem
- Compute all of the aggregations as efficiently as possible.
- The work is exponential in the number of dimensions: for d dimensions, $2^d$ group-bys are computed.
- The size of each group-by depends on the cardinality of its dimensions.
- A group-by may or may not fit in memory.
- Sparseness has two sources:
  - a large domain size for some group-by attributes;
  - a large number of group-by attributes in the datacube query.
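To make the $2^d$ growth concrete, here is a minimal sketch of the naive approach: one independent pass over the input per group-by, with no sharing at all. The function name `naive_cube`, the toy rows, and the choice of SUM as the aggregate are assumptions made for illustration; they are not taken from the slides.

```python
from collections import defaultdict
from itertools import combinations

def naive_cube(rows, dims, measure):
    """Compute SUM(measure) for every subset of dims: 2^d group-bys."""
    cube = {}
    for k in range(len(dims) + 1):
        for group_by in combinations(dims, k):
            totals = defaultdict(int)
            for row in rows:  # one full scan per group-by
                key = tuple(row[d] for d in group_by)
                totals[key] += row[measure]
            cube[group_by] = dict(totals)
    return cube

# Hypothetical toy relation with d = 3 dimensions and one measure.
rows = [
    {"A": "a1", "B": "b1", "C": "c1", "sales": 10},
    {"A": "a1", "B": "b2", "C": "c1", "sales": 20},
    {"A": "a2", "B": "b1", "C": "c2", "sales": 5},
]
cube = naive_cube(rows, ("A", "B", "C"), "sales")
```

The algorithms in the rest of the deck can be read as ways of avoiding the repeated scans in the inner loop.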

6 How to compute the aggregates simultaneously
- Share partitioning and aggregation costs between the various group-bys.

7 Some CUBE Algorithms
- PipeSort, PipeHash, Overlap
- ArrayCube [ZDN97]
- PartitionCube and MemoryCube [RS97]
- Bottom-up Cube (BUC) [BR99]

8 PipeSort
- Search the space of possible sort orders for the best set of sorts that convert the CUBE lattice into a processing tree.

9 PipeSort
- Minimize the number of sorts.
- Compute a group-by from its smallest parent.
- Take advantage of common prefixes.
- The various paths are evaluated in turn.
- Limitations:
  - Does not scale well with respect to the number of group-by attributes k: it needs at least $\binom{k}{\lceil k/2 \rceil}$ sorts, which is exponential in k.
  - No partitioning.
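The common-prefix idea can be sketched as follows: after one sort on (A, B, C), a single scan produces ABC, AB, A, and the grand total, keeping only one running aggregate per prefix. This is a simplified illustration, not PipeSort itself: the name `pipeline_prefixes` is hypothetical, SUM is an assumed aggregate, and each level is fed from the sorted rows rather than from its parent's output.

```python
def pipeline_prefixes(rows, order, measure):
    """Sort once on `order`, then emit every prefix group-by in one scan."""
    data = sorted(rows, key=lambda r: tuple(r[d] for d in order))
    n = len(order)
    results = {order[:k]: [] for k in range(n + 1)}
    current = [None] * (n + 1)  # current group key for each prefix length
    sums = [0] * (n + 1)        # one running aggregate per prefix length
    started = False
    for r in data:
        full = tuple(r[d] for d in order)
        for k in range(n + 1):
            key = full[:k]
            if started and key != current[k]:
                # The group for this prefix is complete: emit it and reset.
                results[order[:k]].append((current[k], sums[k]))
                sums[k] = 0
            current[k] = key
            sums[k] += r[measure]
        started = True
    if started:  # flush the last open group of every prefix
        for k in range(n + 1):
            results[order[:k]].append((current[k], sums[k]))
    return results

rows = [
    {"A": "a1", "B": "b2", "C": "c1", "sales": 20},
    {"A": "a2", "B": "b1", "C": "c2", "sales": 5},
    {"A": "a1", "B": "b1", "C": "c1", "sales": 10},
]
piped = pipeline_prefixes(rows, ("A", "B", "C"), "sales")
```

Group-bys that are not prefixes of the sort order (BC, AC, and so on) need a different sort, which is why the number of sorts grows with the number of paths through the lattice.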

10 PipeHash
- Compute a group-by from its smallest parent in the lattice.
- Use a hash table for every simultaneously computed group-by.
- If all of the hash tables cannot fit in memory, PipeHash partitions the data on some attribute and processes each partition independently.
- Problems:
  - Does not overlap as much computation as PipeSort.
  - Requires a significant amount of memory to store the hash tables.
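The "smallest parent" rule with a hash table might look like the sketch below. It assumes each already-computed cuboid is held as a dictionary from key tuples to sums; the function name `from_smallest_parent` and the sample cuboids are hypothetical, and the partitioning step PipeHash uses when tables do not fit in memory is left out.

```python
def from_smallest_parent(parents, child_dims):
    """Aggregate a child group-by from whichever parent has the fewest groups.

    parents: {parent_dims: {key_tuple: sum}}, each parent a superset of child_dims.
    """
    parent_dims, parent = min(parents.items(), key=lambda kv: len(kv[1]))
    positions = [parent_dims.index(d) for d in child_dims]
    table = {}  # hash table for the child group-by
    for key, total in parent.items():
        child_key = tuple(key[i] for i in positions)
        table[child_key] = table.get(child_key, 0) + total
    return parent_dims, table

parents = {
    ("A", "B"): {("a1", "b1"): 10, ("a1", "b2"): 20, ("a2", "b1"): 5},
    ("A", "C"): {("a1", "c1"): 30, ("a2", "c2"): 5},
}
chosen, group_a = from_smallest_parent(parents, ("A",))
```

Here AC has two groups against three for AB, so A is computed from AC.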

11 Overlap
- Try to minimize the number of disk accesses.
- Overlap as much sorting as possible by computing a group-by from a parent that shares the longest prefix.
- If a group-by shares a prefix with its parent, then the parent consists of a number of partitions, one for each value of the prefix.
- Try to fit as many partitions in memory as possible to avoid writing intermediate results.

12 Common Problems
- Generate significant I/O by sorting intermediate results.
- Require large amounts of main memory.
- Do not work well on sparse CUBEs.

13 ArrayCube
- Compute the CUBE for Multidimensional OLAP (MOLAP) systems.
- Array storage issues:
  - The array is too large to fit in memory: split it up into chunks.
  - Many of the cells in the array are empty: compress these chunks.
  - The data is not in array format: load the array from tables.

14 Basic Array cubing algorithm
- Construct the minimum size spanning tree for the group-bys of the Cube.
- Compute any group-by $D_{i_1} D_{i_2} \ldots D_{i_k}$ of a Cube from the parent $D_{i_1} D_{i_2} \ldots D_{i_{k+1}}$ that has the minimum size.
- Read in each chunk of $D_{i_1} D_{i_2} \ldots D_{i_{k+1}}$ along the dimension $D_{i_{k+1}}$ and aggregate each chunk to a chunk of $D_{i_1} D_{i_2} \ldots D_{i_k}$.
- Keep only one $D_{i_1} D_{i_2} \ldots D_{i_k}$ chunk in memory at any time.

15 Example
- The array ABC is a 16 × 16 × 16 array with 4 × 4 × 4 array chunks, numbered 1 to 64.
- BC group-by:
  - Read the chunks from 1 to 64.
  - Aggregate every four ABC chunks to a BC chunk.
  - Output the BC chunk to disk.
  - Reuse the memory for the next BC chunk.
- Compute AB, AC, and BC from ABC independently.
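The BC computation in this example can be sketched as below. The sketch assumes four chunks along each dimension (64 in total), 4 × 4 × 4 cells per chunk, and chunk numbers that run with A varying fastest; `read_chunk` and `write_chunk` are hypothetical stand-ins for disk I/O, and the toy array holds a 1 in every cell.

```python
CHUNKS_PER_DIM = 4  # assumed: 4 chunks along each of A, B, C (64 chunks)
CHUNK_SIDE = 4      # assumed: each chunk is 4 x 4 x 4 cells

def chunk_number(a, b, c):
    """Chunk numbers 1..64 in dimension order ABC (A varies fastest)."""
    return 1 + a + CHUNKS_PER_DIM * b + CHUNKS_PER_DIM ** 2 * c

def compute_bc(read_chunk, write_chunk):
    """Aggregate away A while holding a single BC chunk in memory."""
    for c in range(CHUNKS_PER_DIM):
        for b in range(CHUNKS_PER_DIM):
            bc_chunk = [[0] * CHUNK_SIDE for _ in range(CHUNK_SIDE)]
            for a in range(CHUNKS_PER_DIM):  # four ABC chunks per BC chunk
                cells = read_chunk(chunk_number(a, b, c))
                for (i, j, k), value in cells.items():
                    bc_chunk[j][k] += value  # sum over the A coordinate
            write_chunk((b, c), bc_chunk)    # "output to disk", then reuse

# Toy stand-ins for chunk I/O.
reads = []

def read_chunk(number):
    reads.append(number)
    return {(i, j, k): 1
            for i in range(CHUNK_SIDE)
            for j in range(CHUNK_SIDE)
            for k in range(CHUNK_SIDE)}

written = {}
compute_bc(read_chunk, written.__setitem__)
```

With this numbering the chunks are read strictly in the order 1 to 64, and each output cell sums 16 input cells (4 chunks times 4 positions along A). Computing AB and AC the same way would need two more full scans, which is what the multi-way algorithm avoids.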

16 Multi-Way Array Algorithm
- Overlap the computations of the different group-bys, thus avoiding the multiple scans required by the naive algorithm.
- Try to minimize the memory needed for each computation, so that we can achieve maximum overlap.

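17 Multi-Way Array Algorithm
- Dimension order: the chunks are read in row-major order over the n dimensions $D_1, D_2, \ldots, D_n$, taken in some order $O = (D_{j_1}, D_{j_2}, \ldots, D_{j_n})$.
- Memory requirements: a group-by that shares a prefix of length p with the dimension order needs $\prod_{i=1}^{p} |D_i| \times \prod_{i=p+1}^{n-1} |C_i|$ array cells, where $|D_i|$ is the size of dimension $D_i$ and $|C_i|$ is the chunk size along it.
- Minimum Memory Spanning Tree (MMST).
- Optimal dimension order: $O = (D_1, D_2, \ldots, D_n)$, where $|D_1| \le |D_2| \le \cdots \le |D_n|$.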

18 Example
- The array ABC is a 16 × 16 × 16 array with 4 × 4 × 4 array chunks, laid out in the dimension order ABC.
- Memory required:
  - $b_0 c_0$: chunks 1, 2, 3, 4; 4 × 4 cells
  - $a_0 c_0$: chunks 1, 5, 9, 13; 16 × 4 cells
  - $a_0 b_0$: chunks 1, 17, 33, 49; 16 × 16 cells
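The memory rule for the multi-way algorithm can be checked against this example with a small calculation. The sketch assumes dimension sizes of 16 and chunk sides of 4 for A, B, and C; the function name `memory_cells` is hypothetical. A group-by keeps the full dimension size for every dimension in its common prefix with the read order and only the chunk size for the rest.

```python
from math import prod

def memory_cells(order, group_by, dim_size, chunk_size):
    """Cells needed to compute `group_by` while scanning chunks in `order`."""
    p = 0  # length of the prefix shared with the dimension order
    while p < len(group_by) and group_by[p] == order[p]:
        p += 1
    full = prod(dim_size[d] for d in group_by[:p])       # prefix dimensions
    partial = prod(chunk_size[d] for d in group_by[p:])  # remaining dimensions
    return full * partial

# Assumed sizes for the example: 16 cells per dimension, chunks of side 4.
dim_size = {"A": 16, "B": 16, "C": 16}
chunk_size = {"A": 4, "B": 4, "C": 4}
order = ("A", "B", "C")

needs = {gb: memory_cells(order, gb, dim_size, chunk_size)
         for gb in [("B", "C"), ("A", "C"), ("A", "B")]}
```

Under these assumptions BC needs one 4 × 4 chunk, AC needs a 16 × 4 slab, and AB needs the whole 16 × 16 plane, so the group-by that drops the last dimension in the order is the expensive one. That is the intuition behind putting the smallest dimension first.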

19 Example (MMST)

20 Example (Dimension Order)
- The array ABCD is a four-dimensional array stored in array chunks.

21 PartitionCube and MemoryCube
- Partition the large relations into fragments that fit in memory.
- Perform the complex operation over each memory-sized fragment independently.

22 Algorithm Partition-Cube
- Fix each possible value of a group-by attribute $B_j$ in turn and compute the tuples in the corresponding sub-datacube, then compute the datacube tuples with the value ALL for $B_j$.
- Rather than re-reading the input relation R for the ALL datacube, read the finest-granularity cuboid.
- If $B_j$ has n distinct values, the datacube is broken up into n+1 smaller sub-datacube computations.
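The n+1 decomposition can be sketched as follows. This is a simplified, single-level illustration under stated assumptions: rows are `(key_tuple, value)` pairs, the partitioning attribute is always the first dimension, SUM is the aggregate, and `memory_cube` is a naive in-memory stand-in rather than the Memory-Cube algorithm. All names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def memory_cube(rows, dims):
    """Naive in-memory stand-in: every group-by over rows of (key, value)."""
    out = {}
    for k in range(len(dims) + 1):
        for idxs in combinations(range(len(dims)), k):
            table = defaultdict(int)
            for key, value in rows:
                table[tuple(key[i] for i in idxs)] += value
            out[tuple(dims[i] for i in idxs)] = dict(table)
    return out

def partition_cube(rows, dims):
    """Partition on dims[0]: one sub-datacube per value, plus one for ALL."""
    parts = defaultdict(list)
    for key, value in rows:
        parts[key[0]].append((key, value))
    cube = defaultdict(dict)
    for fixed, part in parts.items():  # n sub-datacubes, one per value
        sub = memory_cube([(key[1:], v) for key, v in part], dims[1:])
        for group_by, table in sub.items():
            for key, v in table.items():
                cube[(dims[0],) + group_by][(fixed,) + key] = v
    # Sub-datacube n+1 (ALL for dims[0]) reads the finest cuboid, not R.
    projected = defaultdict(int)
    for key, v in cube[tuple(dims)].items():
        projected[key[1:]] += v
    for group_by, table in memory_cube(list(projected.items()), dims[1:]).items():
        cube[group_by] = table
    return dict(cube)

dims = ("A", "B", "C")
rows = [(("a1", "b1", "c1"), 10), (("a1", "b2", "c1"), 20),
        (("a2", "b1", "c2"), 5), (("a1", "b1", "c1"), 1)]
result = partition_cube(rows, dims)
```

Each per-value fragment is processed on its own, so only one fragment has to be in memory at a time; the result matches a direct in-memory cube over the whole input.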

23 Partition Example

24 Algorithm Memory-Cube
- Paths in the search lattice:
  - At least $\binom{k}{\lceil k/2 \rceil}$ paths are needed (PipeSort).
  - $\binom{k}{\lceil k/2 \rceil}$ is also an upper bound.
- Sharing sort work.
- Computing the cube by traversing paths.
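As a quick check of how the path bound compares with the number of cuboids, the following sketch tabulates both for small k. The function name `min_paths` is hypothetical; `math.comb` requires Python 3.8 or later.

```python
from math import comb

def min_paths(k):
    """Number of paths needed to cover the lattice: C(k, ceil(k/2))."""
    return comb(k, (k + 1) // 2)

# (k, number of paths, number of cuboids) for k = 1..6
table = [(k, min_paths(k), 2 ** k) for k in range(1, 7)]
for k, paths, cuboids in table:
    print(f"k={k}: {paths} paths cover {cuboids} cuboids")
```

For k = 4 this gives 6 paths for 16 cuboids, so each sort is shared by several group-bys on average, although the number of sorts still grows exponentially with k.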

25 Example
G(1) = D ε
G(2) = CD D ε D
G(3) = BCD BD B ε BD D CD D
G(4) = ABCD ABD AB A ε ABD AD D ACD AD D BCD BD B BD CD

26 Bottom-up Cube (BUC)
- Combines the I/O efficiency of PartitionCube/MemoryCube.
- Takes advantage of minimum support pruning like Apriori (the Iceberg-CUBE problem).
- Proceeds from the bottom of the lattice and works its way up toward the larger, less aggregated group-bys.

27 Example

28 Features of BUC
- The elimination of the aggregation and partitioning of single-tuple partitions is a key factor in the success of BUC on sparse CUBEs, because many partitions have a single tuple.
- BUC does not try to share the computation of aggregates between parent and child group-bys, only the partitioning costs.
- Partitioning is the major expense, not aggregation.
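The bottom-up recursion with minimum support pruning can be sketched as below. This is a simplified illustration, not the published algorithm: it uses COUNT as the only aggregate, partitions with a dictionary rather than a sort, and omits the single-tuple shortcut described above. The function name `buc` and its parameters are hypothetical.

```python
from collections import defaultdict

def buc(rows, dims, minsup=1, start=0, prefix=(), out=None):
    """Bottom-up cube with COUNT and iceberg pruning.

    rows: list of tuples holding one value per dimension.
    out:  maps ((dim, value), ...) -> count for every group that survives.
    """
    if out is None:
        out = {}
    out[prefix] = len(rows)  # aggregate the current partition
    for d in range(start, len(dims)):
        parts = defaultdict(list)
        for r in rows:       # partition on dimension d
            parts[r[d]].append(r)
        for value, part in parts.items():
            if len(part) >= minsup:  # prune: no refinement can reach minsup
                buc(part, dims, minsup, d + 1,
                    prefix + ((dims[d], value),), out)
    return out

rows = [("a1", "b1"), ("a1", "b2"), ("a2", "b1")]
iceberg = buc(rows, ("A", "B"), minsup=2)
```

With `minsup=2` the partition A = a2 is dropped immediately, and the partition A = a1 is kept but none of its refinements on B survive. Pruning works because a partition can only shrink as more dimensions are fixed.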

29 Conclusions
- What is important in computing a datacube?
  - Memory requirement: partitioning.
  - Costs: CPU time, I/O time.
  - Sharing costs: partitioning, sorting, aggregation.
- Which algorithm is fastest?
  - Dense CUBE: ArrayCube.
  - Sparse CUBE: BUC.

30 References
[ZDN97] Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous multidimensional aggregates. SIGMOD 97, May 1997.
[RS97] K. Ross and D. Srivastava. Fast computation of sparse datacubes. VLDB 97, Aug. 1997.
[BR99] K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and iceberg cubes. SIGMOD 99, June 1999.
