Chapter 5, Data Cube Computation


CSI 4352, Introduction to Data Mining
Chapter 5, Data Cube Computation
Young-Rae Cho, Associate Professor
Department of Computer Science, Baylor University

A Roadmap for Data Cube Computation
- Full Cube
  - Full materialization: materializes all the cells of all the cuboids of a given data cube
  - Raises issues in both time and space
- Iceberg Cube
  - Partial materialization: materializes the cells of only the interesting cuboids
  - Materializes only the cells of a cuboid whose measure value is above the minimum threshold
- Closed Cube
  - Materializes only closed cells

Typical Computation Process
- Example: the lattice of cuboids over the dimensions time, item, location, and supplier
  - 0-D (apex): all
  - 1-D: time; item; location; supplier
  - 2-D: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
  - 3-D: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
  - 4-D (base): time,item,location,supplier

Computation Techniques
- Aggregating
  - Aggregates from the smallest child cuboid that has already been computed
- Caching
  - Caches the result of a cuboid for the computation of other cuboids, to reduce disk I/O
- Sorting, Hashing, and Grouping
  - Applied to a dimension in order to reorder and cluster related tuples
- Pruning
  - Apriori pruning: prunes the cells whose support is lower than the minimum threshold
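The lattice and the "aggregate from an already computed cuboid" idea can be sketched in a few lines of Python. This is only an illustration: the function names `cuboid_lattice` and `roll_up` and the sample counts are assumptions made for this example, not part of the slides.

```python
from collections import Counter
from itertools import combinations


def cuboid_lattice(dimensions):
    """Group every cuboid (subset of dimensions) by its number of dimensions."""
    return {k: list(combinations(dimensions, k))
            for k in range(len(dimensions) + 1)}


def roll_up(cells, source_dims, target_dims):
    """Aggregate the counts of a computed cuboid onto a subset of its dimensions."""
    keep = [source_dims.index(d) for d in target_dims]
    result = Counter()
    for key, count in cells.items():
        result[tuple(key[i] for i in keep)] += count
    return dict(result)


lattice = cuboid_lattice(("time", "item", "location", "supplier"))
n_cuboids = sum(len(level) for level in lattice.values())  # 2**4 cuboids

# Hypothetical (time, item) cuboid; the (time) cuboid is derived from it
# instead of rescanning the base data.
time_item = {("Q1", "tv"): 3, ("Q1", "phone"): 2, ("Q2", "tv"): 4}
by_time = roll_up(time_item, ("time", "item"), ("time",))
```

Deriving `by_time` from the small `time_item` cuboid rather than from the base cuboid is the point of the aggregating technique: the smaller the source cuboid, the fewer cells have to be scanned.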

Chapter 5, Data Cube Computation
- Multi-way Array Aggregation
- BUC: Iceberg Cube Computation
- Star-Cubing
- Shell Fragment Cube Computation

Multi-Way Array Aggregation
- Features
  - Array-based bottom-up approach
  - Uses multi-dimensional chunks
  - No direct tuple comparisons
  - Simultaneous aggregation on multiple dimensions
  - Intermediate aggregate values are re-used for computing ancestor cuboids
  - Full materialization (no iceberg optimization, no Apriori pruning)
[Figure: lattice of cuboids for three dimensions: all; A, B, C; AB, AC, BC; ABC]

Aggregation Strategy
- Partitions the array into chunks
  - Chunk: a small sub-cube that fits in memory
- Data addressing
  - Uses chunk ID and offset
- Multi-way aggregation
  - Computes aggregates on multiple dimensions at the same time
  - Visits chunks in an order that minimizes memory accesses and memory space
[Figure: 3-D array over A (a0 to a3), B (b0 to b3), and C (c0 to c3), partitioned into 64 chunks numbered 1 to 64; chunks 1 to 4 run along A at b0 and c0, chunks 5 to 8 at b1, and chunk 17 starts the c1 layer]

Aggregation Process Example
[Figure: the same 64-chunk array]
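Addressing by chunk ID and offset can be illustrated as follows. This is a sketch under assumptions: the function name `chunk_address`, the 1-based chunk numbering with the first dimension varying fastest (matching the 1 to 64 numbering in the figure), and the example sizes are all chosen for illustration, and every dimension size is assumed to be a multiple of its chunk size.

```python
def chunk_address(cell, dim_sizes, chunk_sizes):
    """Map a cell coordinate to (chunk_id, offset_within_chunk).

    Chunk IDs are 1-based and the first dimension varies fastest.
    Assumes each dimension size is a multiple of its chunk size.
    """
    chunk_id, offset = 0, 0
    chunk_stride, cell_stride = 1, 1
    for coord, size, c_size in zip(cell, dim_sizes, chunk_sizes):
        chunk_id += (coord // c_size) * chunk_stride  # which chunk along this dimension
        offset += (coord % c_size) * cell_stride      # position inside that chunk
        chunk_stride *= size // c_size
        cell_stride *= c_size
    return chunk_id + 1, offset


# Illustrative sizes: A, B, C of 40, 400, 4000 cells, 4 partitions each.
DIMS = (40, 400, 4000)
CHUNKS = (10, 100, 1000)

first = chunk_address((0, 0, 0), DIMS, CHUNKS)
second_along_a = chunk_address((15, 0, 0), DIMS, CHUNKS)
second_along_b = chunk_address((0, 100, 0), DIMS, CHUNKS)
last = chunk_address((39, 399, 3999), DIMS, CHUNKS)
```

Because a cell is located by arithmetic on its coordinates, aggregation never has to compare tuples directly, which is the property the features list refers to.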

Aggregation Process Example - Continued
[Figure: the 64-chunk array over A (a0 to a3), B (b0 to b3), and C (c0 to c3)]

Optimization of Memory Usage
- What is the best traversal order?
  - It depends on the memory space required
- Example
  - Suppose the data sizes of dimensions A, B, and C are 40, 400, and 4000, respectively, and each dimension is split into 4 partitions, so that each chunk is 10 × 100 × 1000
  - Minimum memory required when traversing the chunks in the order 1, 2, 3, 4, 5, ..., 64:
    100 × 1000 (one chunk of the BC plane) + 40 × 1000 (one row of the AC plane) + 40 × 400 (the whole AB plane) = 156,000
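The memory estimate can be checked for every traversal order with a short script. This is a sketch for the three-dimensional case only; the function name `min_memory` and the dictionaries `SIZES` and `CHUNKS` are assumptions introduced here, with the chunk sizes taken as one quarter of each dimension.

```python
from itertools import permutations

SIZES = {"A": 40, "B": 400, "C": 4000}
CHUNKS = {"A": 10, "B": 100, "C": 1000}  # each dimension split into 4 partitions


def min_memory(order):
    """Memory units needed to hold the three 2-D planes for a scan order.

    order = (fastest, middle, slowest) varying dimension.
    """
    fast, mid, slow = order
    one_chunk = CHUNKS[mid] * CHUNKS[slow]    # plane without the fastest dimension
    one_row = SIZES[fast] * CHUNKS[slow]      # plane without the middle dimension
    whole_plane = SIZES[fast] * SIZES[mid]    # plane without the slowest dimension
    return one_chunk + one_row + whole_plane


costs = {order: min_memory(order) for order in permutations("ABC")}
best_order = min(costs, key=costs.get)
```

Here `costs[("A", "B", "C")]` reproduces the 156,000 units of the example, while scanning with C fastest and A slowest needs 1,641,000 units, which is why the smallest dimension should vary fastest.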

Summary of Multi-Way Method
- Cuboids (planes) should be sorted and computed in ascending order of size, which follows from the data size on each dimension
- Keeps the smallest plane in main memory; fetches and computes only one chunk at a time for the largest plane
- Problems
  - Full materialization: extremely large storage is required
  - Works well only for a small number of dimensions (for high-dimensional data, use partial materialization)
- Reference
  - Zhao, Y., Deshpande, P.M. and Naughton, J.F., An Array-Based Algorithm for Simultaneous Multidimensional Aggregates, In Proceedings of SIGMOD (1997)

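Bottom-Up Computation (BUC)
- Characteristics
  - Top-down approach with respect to the lattice drawn with the apex cuboid (all) at the top: computation starts at the apex and proceeds toward the base cuboid
  - Partial materialization (iceberg cube computation)
  - Divides dimensions into partitions and facilitates iceberg pruning
  - No simultaneous aggregation
- Processing order of the cuboids for dimensions A, B, C, D (numbers from the lattice figure)
  1 all, 2 A, 3 AB, 4 ABC, 5 ABCD, 6 ABD, 7 AC, 8 ACD, 9 AD, 10 B, 11 BC, 12 BCD, 13 BD, 14 C, 15 CD, 16 D

Iceberg Pruning Process
- Partitioning
  - Sorts data values
  - Partitions them into blocks that fit in memory
- Apriori Pruning
  - For each block:
    - If it does not satisfy min_sup, its descendants are pruned
    - If it satisfies min_sup, the cell is materialized and a recursive call is made that includes the next dimension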

Apriori Pruning Example
- Description in SQL
  - Iceberg cube computation
  - Example
      compute cube iceberg_cube as
      select A, B, C, D, count(*)
      from R
      cube by A, B, C, D
      having count(*) >= min_sup
- Cells visited in the figure
  - (*, *, *, *): 0-D cuboid
  - (a1, *, *, *): 1-D cuboid
  - (a1, b1, *, *): 2-D cuboid
  - (a1, b1, c1, *): 3-D cuboid
  - (a1, b1, c1, d1): 4-D cuboid
  - (a1, b1, c2, *): 3-D cuboid
  - (a1, b2, *, *): 2-D cuboid
  [Figure: the slide marks pruning at the 3-D cuboid level]

Summary of BUC Method
- Computation of sparse data cubes
- Problems
  - Sensitive to the order of dimensions
  - The most discriminating dimension should be used first
  - Dimensions should be in order of decreasing cardinality
  - Dimensions should be in order of increasing maximum number of duplicates
- Reference
  - Beyer, K. and Ramakrishnan, R., Bottom-Up Computation of Sparse and Iceberg CUBEs, In Proceedings of SIGMOD (1999)
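The recursive partition-and-prune idea behind BUC can be sketched as follows. This is a simplified in-memory illustration with count as the measure, not the published algorithm in full: the function name `buc`, the use of dictionaries instead of sorting, and the sample rows are assumptions made for this example.

```python
from collections import defaultdict


def buc(rows, n_dims, min_sup):
    """Return {cell: count} for every cell whose count is at least min_sup."""
    result = {}

    def recurse(partition, cell, start_dim):
        # The partition already satisfies min_sup, so its cell is materialized.
        result[tuple(cell)] = len(partition)
        for d in range(start_dim, n_dims):
            groups = defaultdict(list)
            for row in partition:
                groups[row[d]].append(row)
            for value in sorted(groups):
                group = groups[value]
                # Apriori pruning: a partition below min_sup is not expanded,
                # because none of its descendants can satisfy min_sup.
                if len(group) >= min_sup:
                    cell[d] = value
                    recurse(group, cell, d + 1)
                    cell[d] = "*"

    if len(rows) >= min_sup:
        recurse(rows, ["*"] * n_dims, 0)
    return result


# Illustrative base table with dimensions A, B, C, D.
ROWS = [
    ("a1", "b1", "c1", "d1"),
    ("a1", "b1", "c4", "d3"),
    ("a1", "b2", "c2", "d2"),
    ("a2", "b3", "c3", "d4"),
    ("a2", "b4", "c3", "d4"),
]
iceberg = buc(ROWS, n_dims=4, min_sup=2)
```

With `min_sup=2`, the cell `(a1, b1, *, *)` is kept with count 2, while `(a1, b1, c1, *)` is never produced because its partition holds a single row. The recursion visits cuboids in the order all, A, AB, ABC, ABCD, ABD, AC, and so on.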

Characteristics of Star-Cubing
- Characteristics
  - Integrated method of top-down and bottom-up cube computation
  - Explores both multidimensional aggregation (as in Multi-way) and Apriori pruning (as in BUC)
  - Explores shared dimensions
    - e.g., dimension A is a shared dimension of ACD and AD
    - e.g., ABD/AB means that ABD has the shared dimension AB
- Lattice annotated with shared dimensions (cuboid/shared dimension)
  - all
  - A/A, B/B, C/C, D
  - AB/AB, AC/AC, AD/A, BC/BC, BD/B, CD
  - ABC/ABC, ABD/AB, ACD/A, BCD
  - ABCD

Iceberg Pruning Strategy
- Iceberg pruning in a shared dimension
  - If AB is a shared dimension of ABD, then the cuboid AB is computed simultaneously with ABD
  - If the aggregate on a shared dimension does not satisfy the iceberg condition (min_sup), then none of the cells extended from this shared dimension can satisfy the condition either
  - If the shared dimensions can be computed before the actual cuboid, Apriori pruning can be applied
  - Pruning takes place while aggregating simultaneously on multiple dimensions

Cuboid Trees
- Tree structure to represent cuboids
  - Base cuboid tree, 3-D cuboid trees, 2-D cuboid trees, ...
  - Each level represents a dimension
  - Each node represents an attribute value
  - Each node includes the attribute value, the aggregate value, and its descendant(s)
  - The path from the root to a leaf represents a tuple
- Example
  - count(a1, b1, *, *) = 10
  - count(a1, b1, c1, *) = 5
  [Figure: cuboid tree with root: 100; level A: a1: 30, a2: 20, a3: 20, a4: 20; under a1: b1: 10, b2: 10, b3: 10; under b1: c1: 5, c2: 5; under c1: d1: 2, d2: 3]

Star Nodes
- If the single-dimensional aggregate of an attribute value does not satisfy min_sup, there is no need to consider the node in the iceberg cube computation
- Such nodes are replaced by *
- Example (min_sup = 2)

  Base table
  A   B   C   D   count
  a1  b1  c1  d1  1
  a1  b1  c4  d3  1
  a1  b2  c2  d2  1
  a2  b3  c3  d4  1
  a2  b4  c3  d4  1

  After replacing star nodes
  A   B   C   D   count
  a1  b1  *   *   1
  a1  b1  *   *   1
  a1  *   *   *   1
  a2  *   c3  d4  1
  a2  *   c3  d4  1

  Compressed table
  A   B   C   D   count
  a1  b1  *   *   2
  a1  *   *   *   1
  a2  *   c3  d4  2

Star Tree Structure
- Star tree
  - A cuboid tree that is compressed using star nodes
- Star tree construction
  - Uses the compressed table
  - Keeps the star table for lookup of star nodes
  - Lossless compression of the original cuboid tree
- Example star tree (paths from the root)
  - root:5 > a1:3 > b*:1 > c*:1 > d*:1
  - root:5 > a1:3 > b1:2 > c*:2 > d*:2
  - root:5 > a2:2 > b*:2 > c3:2 > d4:2
- Star table
  - b2 *, b3 *, b4 *, c1 *, c2 *, d1 *, ...
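The star reduction of the example table can be reproduced with a short Python sketch. The function name `star_reduce` and the representation of the star table as a set of (dimension index, value) pairs are assumptions chosen for this illustration.

```python
from collections import Counter


def star_reduce(rows, min_sup):
    """Replace infrequent attribute values by '*' and merge identical rows."""
    n_dims = len(rows[0])
    # One-dimensional aggregates: how often each value occurs per dimension.
    counts = [Counter(row[d] for row in rows) for d in range(n_dims)]
    # Star table: values whose one-dimensional count is below min_sup.
    star_table = {(d, value)
                  for d in range(n_dims)
                  for value, count in counts[d].items()
                  if count < min_sup}
    reduced = [tuple("*" if (d, value) in star_table else value
                     for d, value in enumerate(row))
               for row in rows]
    return star_table, dict(Counter(reduced))


BASE_TABLE = [
    ("a1", "b1", "c1", "d1"),
    ("a1", "b1", "c4", "d3"),
    ("a1", "b2", "c2", "d2"),
    ("a2", "b3", "c3", "d4"),
    ("a2", "b4", "c3", "d4"),
]
star_table, compressed = star_reduce(BASE_TABLE, min_sup=2)
```

For the base table of the example, `compressed` holds the three rows of the compressed table, and `star_table` lists the nine starred values (b2, b3, b4, c1, c2, c4, d1, d2, d3).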

Multi-Way Star-Tree Aggregation
- Aggregation
  - DFS (depth-first search) from the root of a star tree
  - Creates star trees for the cuboids on the next level
- Pruning
  - Prunes if the aggregates do not satisfy min_sup
  - Prunes if all the nodes in the generated tree are star nodes
[Figure: DFS over the base tree (root:5), numbered by step, generating the child trees BCD (BCD:5), ACD/A (a1CD/a1:3), ABD/AB (a1b*D/a1b*:1), and ABC/ABC (a1b*c*/a1b*c*:1)]

Implementation of Star Cubing Method
- Multi-way aggregation and iceberg pruning
[Figure: lattice annotated with shared dimensions, with pruning marked on several cuboids, shown next to the BCD tree and the base star tree (root:5; a1:3, a2:2)]

Summary of Star Cubing
- Problems
  - Sensitive to the order of dimensions
  - Dimensions should be processed in order of decreasing cardinality
- Reference
  - Xin, D., Han, J., Li, X. and Wah, B.W., Star-Cubing: Computing Iceberg Cubes by Top-Down and Bottom-Up Integration, In Proceedings of VLDB (2003)

Shell Fragment Cube Computation
- Motivations
  - The computation and storage of an iceberg cube are still costly for high-dimensional data (the curse of dimensionality)
  - It is hard to determine an appropriate iceberg threshold
  - An iceberg cube cannot be updated incrementally
- Features
  - Reduces a high-dimensional cube to a set of lower-dimensional cubes
  - Lossless reduction
  - Online reconstruction of the high-dimensional data cube

Fragmentation Strategy
- Observation
  - OLAP occurs only on a small subset of dimensions at a time
- Fragmentation
  - Partitions the set of dimensions into shell fragments
  - Computes a data cube for each shell fragment (computing twenty 3-D data cubes is much cheaper than computing one 60-D data cube)
  - Retains inverted indices or value-list indices
- Semi-online computation
  - Given the pre-computed fragment cubes, dynamically computes cube cells of the high-dimensional cube online

Inverted Index Example

  TID  A   B   C   D   E
  1    a1  b1  c1  d1  e1
  2    a1  b2  c1  d2  e1
  3    a1  b2  c1  d1  e2
  4    a2  b1  c1  d1  e2
  5    a2  b1  c1  d1  e3

- Divides the 5 dimensions into 2 shell fragments, (A, B, C) and (D, E)
- Builds the inverted index (1-D inverted index)

  Value  TID List         Size
  a1     {1, 2, 3}        3
  a2     {4, 5}           2
  b1     {1, 4, 5}        3
  b2     {2, 3}           2
  c1     {1, 2, 3, 4, 5}  5
  d1     {1, 3, 4, 5}     4
  d2     {2}              1
  e1     {1, 2}           2
  e2     {3, 4}           2
  e3     {5}              1

Cube Computation Process
- Generalizes the 1-D inverted indices to multi-dimensional inverted indices
- Computes cuboids using the inverted indices
- Example
  - Shell fragment cube ABC contains 7 cuboids: A, B, C, AB, AC, BC, ABC
  - The cuboid AB is computed using the 2-D inverted index below

  Cell      Intersection           TID List  Size
  (a1, b1)  {1, 2, 3} ∩ {1, 4, 5}  {1}       1
  (a1, b2)  {1, 2, 3} ∩ {2, 3}     {2, 3}    2
  (a2, b1)  {4, 5} ∩ {1, 4, 5}     {4, 5}    2
  (a2, b2)  {4, 5} ∩ {2, 3}        {}        0
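The two index tables above can be reproduced with Python sets. This is a minimal sketch: the function names `build_inverted_index` and `cell_tids` are assumptions for this illustration, and it relies on the attribute values (a1, b1, ...) being unique across dimensions, as they are in the example.

```python
TABLE = {
    1: ("a1", "b1", "c1", "d1", "e1"),
    2: ("a1", "b2", "c1", "d2", "e1"),
    3: ("a1", "b2", "c1", "d1", "e2"),
    4: ("a2", "b1", "c1", "d1", "e2"),
    5: ("a2", "b1", "c1", "d1", "e3"),
}


def build_inverted_index(table):
    """Map each attribute value to the set of TIDs that contain it."""
    index = {}
    for tid, values in table.items():
        for value in values:
            index.setdefault(value, set()).add(tid)
    return index


def cell_tids(index, cell):
    """TID list of a multi-dimensional cell: intersect its 1-D TID lists."""
    tids = None
    for value in cell:
        value_tids = index.get(value, set())
        tids = set(value_tids) if tids is None else tids & value_tids
    return tids


index = build_inverted_index(TABLE)
ab_cuboid = {cell: cell_tids(index, cell)
             for cell in [("a1", "b1"), ("a1", "b2"), ("a2", "b1"), ("a2", "b2")]}
```

The size of each resulting set is the count measure of the cell, so `ab_cuboid` gives the same sizes 1, 2, 2, and 0 as the 2-D inverted index table.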

Online Query Computation Process
- Given the shell fragment cubes:
  - Divides the query into shell fragments
  - Fetches the corresponding TID list for each fragment from the cubes
  - Intersects the TID lists from each fragment to construct the instantiated base table
  - Computes the online cube from the base table with any data cube computation algorithm
  - Computes the query

Shell Fragment Cube Process
[Figure: dimensions ABCDEF are split into an ABC cube and a DEF cube; the DEF cube contains, among others, the D, EF, and DE cuboids; the DE cuboid is shown as the table below]

  Cell      TID List
  (d1, e1)  {1, 3, 8, 9}
  (d1, e2)  {2, 4, 6, 7}
  (d2, e1)  {5, 10}
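The query steps listed above can be sketched for a simple point query. This is an illustration under assumptions: the five-row table, the fragments (A, B, C) and (D, E), and the names `build_fragment_cube` and `point_query` are chosen for this example, and only the TID intersection is shown, not the computation of a full online cube.

```python
from itertools import combinations

TABLE = {
    1: {"A": "a1", "B": "b1", "C": "c1", "D": "d1", "E": "e1"},
    2: {"A": "a1", "B": "b2", "C": "c1", "D": "d2", "E": "e1"},
    3: {"A": "a1", "B": "b2", "C": "c1", "D": "d1", "E": "e2"},
    4: {"A": "a2", "B": "b1", "C": "c1", "D": "d1", "E": "e2"},
    5: {"A": "a2", "B": "b1", "C": "c1", "D": "d1", "E": "e3"},
}
FRAGMENTS = [("A", "B", "C"), ("D", "E")]


def build_fragment_cube(table, fragment):
    """Precompute the TID list of every cell of every cuboid in one fragment."""
    cube = {}
    for k in range(1, len(fragment) + 1):
        for dims in combinations(fragment, k):
            for tid, row in table.items():
                cell = tuple((d, row[d]) for d in dims)
                cube.setdefault(cell, set()).add(tid)
    return cube


def point_query(table, fragments, cubes, query):
    """Split the query by fragment, fetch each TID list, and intersect them."""
    tids = set(table)
    for fragment, cube in zip(fragments, cubes):
        cell = tuple((d, query[d]) for d in fragment if d in query)
        if cell:  # this fragment is constrained by the query
            tids &= cube.get(cell, set())
    return tids


cubes = [build_fragment_cube(TABLE, fragment) for fragment in FRAGMENTS]
matching = point_query(TABLE, FRAGMENTS, cubes, {"A": "a2", "E": "e2"})
```

The query touches both fragments: the ABC cube returns {4, 5} for A = a2, the DE cube returns {3, 4} for E = e2, and their intersection {4} identifies the rows of the instantiated base table.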

Online Query Process
[Figure: from the dimensions A to N, the queried dimensions are used to build the instantiated base table, from which the online cube is computed]

Summary of Shell Fragment Cube
- Advantages
  - Significant increase in the speed of data cube computation
  - Significant decrease in storage usage
  - Various applications to very high-dimensional data
- Problems
  - Tradeoff between the preprocessing time to construct a data cube and the online computation time for queries
  - Tradeoff between the storage for a data cube and the memory for online computation
- Reference
  - Li, X., Han, J. and Gonzalez, H., High-Dimensional OLAP: A Minimal Cubing Approach, In Proceedings of VLDB (2004)

Questions?
Lecture slides are found on the course website, www.ecs.baylor.edu/faculty/cho/4352