Data Warehousing and Data Mining

Data Warehousing and Data Mining
Lecture 3: Efficient Cube Computation
CITS3401 / CITS5504
Wei Liu
School of Computer Science and Software Engineering, Faculty of Engineering, Computing and Mathematics
Acknowledgement: The lecture slides are adapted from the original slides of Han's textbook.

Lecture Outline
- Types of cells
- Types of cubes
- Efficient computation of data cubes
  - Multiway array aggregation
  - BUC
  - Star-Cubing

Data Cube Terminologies
- A data cube is a lattice of cuboids.
- Each cuboid represents a group-by.
- The base cuboid is the least generalised of all the cuboids.
- The apex cuboid is the most generalised of all the cuboids.
Operations
- Drill down: move from the apex cuboid downward in the lattice.
- Roll up: move from the base cuboid upward in the lattice.
Commonly used measures include count(), sum(), min(), max() and average().
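The lattice can be made concrete by enumerating every group-by over a set of dimensions. The following is a minimal sketch; the function name `cuboid_lattice` and the three dimension names are illustrative assumptions, not part of the lecture.

```python
from itertools import combinations

def cuboid_lattice(dimensions):
    """Return the cuboids (group-by sets) of a data cube, level by level.

    Level 0 holds the apex cuboid (no group-by dimensions); the last
    level holds the base cuboid (all dimensions).
    """
    return [
        [tuple(c) for c in combinations(dimensions, k)]
        for k in range(len(dimensions) + 1)
    ]

# Hypothetical dimensions chosen only for illustration.
lattice = cuboid_lattice(["month", "city", "customer_group"])
for level, cuboids in enumerate(lattice):
    print(f"{level}-D cuboids: {cuboids}")
```

Reading the printed levels from first to last corresponds to drilling down from the apex cuboid to the base cuboid; reading them in reverse corresponds to rolling up.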

Cell Types
- Base cell: a cell in the base cuboid.
- Aggregate cell: a cell from a nonbase cuboid. It aggregates over one or more dimensions.
Notation
- A cell is in general written as $a = (a_1, a_2, \ldots, a_n, \text{measure})$.
- An aggregated dimension is indicated by an asterisk (*).
- $a$ is an $m$-dimensional cell if exactly $m$ values among $a_1, a_2, \ldots, a_n$ are not *, where $m \le n$.
- If $m = n$, then $a$ is a base cell.
- What should $m$ be if $a$ is an aggregate cell of the apex cuboid?

Example
Consider a data cube with the dimensions month, city, and customer group, and the measure sales. What do the following notations mean?
- (Jan, *, *, 2800) and (*, Chicago, *, 1200)
- (Jan, *, Business, 150) and (Jan, Chicago, Business, 45)
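One way to read these cells is to count the values that are not aggregated. The sketch below assumes each cell is a tuple whose last element is the measure and that "*" never occurs as a real dimension value; the helper name `cell_dimensionality` is made up for this example.

```python
STAR = "*"  # marks an aggregated dimension

def cell_dimensionality(cell):
    """Count the non-* dimension values; the last element is the measure."""
    *values, _measure = cell
    return sum(1 for v in values if v != STAR)

cells = [
    ("Jan", STAR, STAR, 2800),
    (STAR, "Chicago", STAR, 1200),
    ("Jan", STAR, "Business", 150),
    ("Jan", "Chicago", "Business", 45),
]

for cell in cells:
    m = cell_dimensionality(cell)
    n = len(cell) - 1  # number of dimensions in the cube
    kind = "base cell" if m == n else "aggregate cell"
    print(cell, f"{m}-D", kind)
```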

Ancestor and Descendant Cells
An ancestor-descendant relationship may exist between cells. In an n-dimensional data cube, given two cells
- an i-D cell $a = (a_1, a_2, \ldots, a_n, \text{measure}_a)$
- a j-D cell $b = (b_1, b_2, \ldots, b_n, \text{measure}_b)$
$a$ is an ancestor of $b$, and $b$ is a descendant of $a$, if and only if
- $i < j$, and
- for $1 \le k \le n$, $a_k = b_k$ wherever $a_k \ne *$.
In particular, cell $a$ is called a parent of cell $b$, and $b$ is a child of $a$, if and only if $j = i + 1$.
Example: the 1-D cell a = (Jan, *, *, 2800) and the 2-D cell b = (Jan, *, Business, 150) are ancestors of the 3-D cell c = (Jan, Chicago, Business, 45); c is a descendant of both a and b; b is a parent of c; and c is a child of b.
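The two conditions translate directly into a small check. This is a sketch under the assumption that cells are given as tuples of dimension values with the measure left out; `is_ancestor` and `is_parent` are hypothetical helper names.

```python
STAR = "*"

def dims(cell):
    """Number of non-* values in a cell (measure omitted)."""
    return sum(1 for v in cell if v != STAR)

def is_ancestor(a, b):
    """True if a has fewer non-* values than b and agrees with b wherever a is not *."""
    return dims(a) < dims(b) and all(x == y for x, y in zip(a, b) if x != STAR)

def is_parent(a, b):
    """A parent is an ancestor with exactly one fewer non-* value."""
    return is_ancestor(a, b) and dims(b) == dims(a) + 1

a = ("Jan", STAR, STAR)
b = ("Jan", STAR, "Business")
c = ("Jan", "Chicago", "Business")

print(is_ancestor(a, c), is_ancestor(b, c))  # both are ancestors of c
print(is_parent(b, c), is_parent(a, c))      # only b is a parent of c
```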

Types of cubes
- Full cube: all cells and cuboids are materialized.
  - All possible combinations of dimensions and values (prohibitively expensive).
  - $2^n$ cuboids, exponential in the number of dimensions $n$.
  - Even more cuboids if we consider concept hierarchies for each dimension.
  - The size of a cuboid depends on the cardinality of its dimensions.
- Iceberg cube (tip of the iceberg): partial materialization, a tradeoff between storage space and response time for OLAP.
  - Materialize only the cells in a cuboid whose measure value is above the minimum threshold.
- Closed cube: no ancestor cell is created if its measure is equal to that of its descendant cell.
- Shell cube: only cuboids with a limited number of dimensions are precomputed.
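The difference between a full cube and an iceberg cube can be illustrated with a deliberately naive computation: materialize every cell, then keep only those that reach the threshold. The sketch below assumes count as the measure and a "count >= min_sup" test; the function names and sample rows are invented for illustration, and real iceberg algorithms avoid building the full cube first.

```python
from collections import Counter
from itertools import product

STAR = "*"

def full_cube_counts(tuples):
    """Count every cell of the full cube by generalising each base tuple."""
    counts = Counter()
    for t in tuples:
        # Each dimension value is either kept or replaced by *.
        for mask in product([False, True], repeat=len(t)):
            cell = tuple(STAR if star else v for v, star in zip(t, mask))
            counts[cell] += 1
    return counts

def iceberg(counts, min_sup):
    """Keep only the cells whose count reaches the minimum support."""
    return {cell: n for cell, n in counts.items() if n >= min_sup}

# Hypothetical (month, city) tuples.
rows = [("Jan", "Chicago"), ("Jan", "Chicago"), ("Jan", "Toronto"), ("Feb", "Chicago")]
cube = full_cube_counts(rows)
ice = iceberg(cube, min_sup=2)
print(len(cube), "cells in the full cube;", len(ice), "cells in the iceberg cube")
```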

Closed Cube in more detail
Suppose that there are 2 base cells for a database of 100 dimensions, denoted as $\{(a_1, a_2, a_3, \ldots, a_{100}) : 10,\ (a_1, a_2, b_3, \ldots, b_{100}) : 10\}$, where each has a cell count of 10.
- Is an iceberg cube good enough in sparse cases like this?
- How many cuboids? Apex cuboid + 1-D cuboids + 2-D cuboids + ... + 99-D cuboids: $\binom{100}{1} + \binom{100}{2} + \cdots + \binom{100}{99} + 1 = 2^{100} - 1$.
- The total number of aggregate cells is $2^{101} - 6$.
- If we ignore all the aggregate cells that can be obtained by replacing constants with *, only 3 cells really offer new information: $\{(a_1, a_2, a_3, \ldots, a_{100}) : 10,\ (a_1, a_2, b_3, \ldots, b_{100}) : 10,\ (a_1, a_2, *, \ldots, *) : 20\}$.
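The count of aggregate cells can be checked by brute force on smaller cubes. For two base cells in n dimensions that share only their first two values, the same reasoning gives 2^(n+1) - 6 distinct aggregate cells (each base cell has 2^n - 1 generalisations, and 4 of them are common to both). The sketch below enumerates them for small n; the helper names are illustrative.

```python
from itertools import product

STAR = "*"

def generalisations(base):
    """All cells obtained from a base cell by replacing values with * (base included)."""
    return {
        tuple(STAR if star else v for v, star in zip(base, mask))
        for mask in product([False, True], repeat=len(base))
    }

def count_aggregate_cells(n):
    """Distinct aggregate cells for two n-D base cells sharing only a1 and a2."""
    cell_a = ("a1", "a2") + tuple(f"a{i}" for i in range(3, n + 1))
    cell_b = ("a1", "a2") + tuple(f"b{i}" for i in range(3, n + 1))
    all_cells = generalisations(cell_a) | generalisations(cell_b)
    return len(all_cells - {cell_a, cell_b})

for n in range(3, 11):
    print(n, count_aggregate_cells(n), 2 ** (n + 1) - 6)
```

Setting n = 100 in the closed form gives the slide's figure of 2^101 - 6; the enumeration itself is only feasible for small n.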

Closed cell and closed cube
- Closed cell: a cell, c, is a closed cell if there exists no cell, d, such that d is a specialization (descendant) of cell c (i.e., d is obtained by replacing a * in c with a non-* value) and d has the same measure value as c.
  - No ancestor cell is created if its measure is equal to that of its descendant cell.
- Closed cube: a data cube consisting of only closed cells.
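The definition can be applied directly to a scaled-down version of the two-base-cell example. The sketch below uses 5 dimensions instead of 100, count as the measure, and a quadratic brute-force check; it is meant to illustrate the definition, not an efficient closed-cube algorithm, and all names are assumptions.

```python
from collections import Counter
from itertools import product

STAR = "*"

def cube_counts(base_cells):
    """base_cells maps each base cell to its count; returns counts for all cells."""
    counts = Counter()
    for cell, n in base_cells.items():
        for mask in product([False, True], repeat=len(cell)):
            g = tuple(STAR if star else v for v, star in zip(cell, mask))
            counts[g] += n
    return counts

def is_descendant(d, c):
    """d specialises c: it agrees with c wherever c is not *, and differs from c."""
    return d != c and all(x == STAR or x == y for x, y in zip(c, d))

def closed_cells(counts):
    """Keep cells that have no descendant with the same measure value."""
    return {
        c: n for c, n in counts.items()
        if not any(is_descendant(d, c) and m == n for d, m in counts.items())
    }

base = {
    ("a1", "a2", "a3", "a4", "a5"): 10,
    ("a1", "a2", "b3", "b4", "b5"): 10,
}
closed = closed_cells(cube_counts(base))
for cell, n in closed.items():
    print(cell, n)
```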

Multi-way Array Aggregation
- Used for MOLAP and full cube computation
- Array-based bottom-up algorithm
- Uses multi-dimensional chunks
- Direct array addressing
- Simultaneous aggregation on multiple dimensions
- Intermediate aggregate values are re-used for computing ancestor cuboids
- Cannot do Apriori pruning: no iceberg optimization

Example: Multi-way Array Aggregation

Example cuboids to be computed
- The base cuboid, denoted by ABC (from which all the other cuboids are directly or indirectly computed). This cuboid is already computed and corresponds to the given 3-D array.
- The 2-D cuboids, AB, AC, and BC, which respectively correspond to the group-bys AB, AC, and BC. These cuboids must be computed.
- The 1-D cuboids, A, B, and C, which respectively correspond to the group-bys A, B, and C. These cuboids must be computed.
- The 0-D (apex) cuboid, denoted by all, which corresponds to the group-by (); that is, there is no group-by here. This cuboid must be computed. It consists of only one value. If, say, the data cube measure is count, then the value to be computed is simply the total count of all the tuples in ABC.
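The idea of simultaneous aggregation and re-use of intermediate results can be sketched on a tiny dense array. This example assumes the base cuboid ABC is a nested list indexed as array[a][b][c] and uses sum as the measure; it leaves out chunking and memory management, and the function name is hypothetical.

```python
from collections import defaultdict

def multiway_aggregate(array):
    """Compute all cuboids of a 3-D array: one scan for AB, AC and BC, then re-use."""
    ab, ac, bc = defaultdict(int), defaultdict(int), defaultdict(int)
    # One scan of the base cuboid updates the three 2-D cuboids simultaneously.
    for a, plane in enumerate(array):
        for b, row in enumerate(plane):
            for c, value in enumerate(row):
                ab[a, b] += value
                ac[a, c] += value
                bc[b, c] += value
    # The 1-D cuboids are derived from already computed 2-D cuboids.
    a_cub, b_cub, c_cub = defaultdict(int), defaultdict(int), defaultdict(int)
    for (a, b), v in ab.items():
        a_cub[a] += v
        b_cub[b] += v
    for (_, c), v in ac.items():
        c_cub[c] += v
    # The apex cuboid is derived from a 1-D cuboid.
    total = sum(a_cub.values())
    return {
        "AB": dict(ab), "AC": dict(ac), "BC": dict(bc),
        "A": dict(a_cub), "B": dict(b_cub), "C": dict(c_cub),
        "all": total,
    }

base_cuboid = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
cube = multiway_aggregate(base_cuboid)
print(cube["AB"], cube["A"], cube["all"])
```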

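Assume the sizes of dimensions A, B, and C are 40, 400, and 4000 respectively, with each dimension split into four equal partitions, so that a chunk measures 10 x 100 x 1000. AB is therefore the smallest 2-D plane and BC the largest.
- If chunks are scanned in the order 1, 2, 3, ..., then 156,000 memory units are needed: 40*400 (whole AB plane) + 40*1000 (one row of the AC plane) + 100*1000 (one chunk of the BC plane).
- If chunks are scanned in the order 1, 17, 33, 49, 5, 21, 37, ..., then 1,641,000 memory units are needed (aggregation ordering AB-AC-BC): 400*4000 (whole BC plane) + 40*1000 (one row of the AC plane) + 10*100 (one chunk of the AB plane).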
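The two memory figures follow from simple arithmetic. The sketch below reproduces them; the split into four partitions per dimension is an assumption inferred from the chunk sides 10, 100 and 1000 used in the formulas, and the variable names are illustrative.

```python
# Dimension sizes from the example.
A, B, C = 40, 400, 4000
PARTS = 4  # assumed number of partitions per dimension
a, b, c = A // PARTS, B // PARTS, C // PARTS  # chunk sides: 10, 100, 1000

# Scan order 1, 2, 3, ...: keep the whole AB plane, one row of AC, one chunk of BC.
best = A * B + A * c + b * c

# Scan order 1, 17, 33, 49, ...: keep the whole BC plane, one row of AC, one chunk of AB.
worst = B * C + A * c + a * b

print(best, worst)
```

Keeping the smallest plane fully in memory and only one chunk of the largest plane is what makes the first order roughly ten times cheaper.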

What is the best traversing order?
[Figure: two copies of the cuboid lattice (all; A, B, C; AB, AC, BC; ABC), one for each traversal order. One order needs 156,000 memory units; the other needs 1,641,000 memory units.]

Multi-way Array Aggregation
- Method: the planes should be sorted and computed according to their size in ascending order.
- Idea: keep the smallest plane in main memory; fetch and compute only one chunk at a time for the largest plane.
- Limitation: the method works well only for a small number of dimensions.
- If there are a large number of dimensions, top-down computation and iceberg cube computation methods can be explored.

Bottom-Up Computation (BUC)
- Bottom-up cube computation (note: top-down in our view!)
- Divides dimensions into partitions and facilitates iceberg pruning
  - If a partition does not satisfy min_sup, its descendants can be pruned
  - If min_sup = 1, BUC computes the full cube
- No simultaneous aggregation
[Figure: lattice of cuboids over dimensions A, B, C and D (all; A, B, C, D; AB, AC, AD, BC, BD, CD; ABC, ABD, ACD, BCD; ABCD), annotated with the BUC processing order: 1 all, 2 A, 3 AB, 4 ABC, 5 ABCD, 6 ABD, 7 AC, 8 ACD, 9 AD, 10 B, 11 BC, 12 BCD, 13 BD, 14 C, 15 CD, 16 D.]
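The recursive partition-and-prune idea can be sketched in a few lines. This is a simplified in-memory illustration, assuming count as the measure and a "count >= min_sup" test; it omits the sorting, external partitioning and other optimizations of the actual algorithm, and the function name `buc` and the sample rows are chosen for this example only.

```python
from collections import defaultdict

STAR = "*"

def buc(rows, num_dims, min_sup):
    """Return {cell: count} for all cells whose count is at least min_sup."""
    result = {}

    def recurse(partition, cell, start_dim):
        # The partition holds exactly the tuples aggregated into `cell`.
        result[tuple(cell)] = len(partition)
        for d in range(start_dim, num_dims):
            # Partition the current tuples on dimension d.
            groups = defaultdict(list)
            for row in partition:
                groups[row[d]].append(row)
            for value, group in groups.items():
                # Iceberg pruning: a failing partition cannot yield
                # any qualifying descendant cell, so skip it entirely.
                if len(group) >= min_sup:
                    cell[d] = value
                    recurse(group, cell, d + 1)
            cell[d] = STAR  # restore before moving to the next dimension

    if len(rows) >= min_sup:
        recurse(rows, [STAR] * num_dims, 0)
    return result

# Hypothetical (month, city) tuples.
rows = [("Jan", "Chicago"), ("Jan", "Chicago"), ("Jan", "Toronto"), ("Feb", "Chicago")]
print(buc(rows, num_dims=2, min_sup=2))
print(len(buc(rows, num_dims=2, min_sup=1)), "cells with min_sup = 1 (full cube)")
```

With four dimensions A, B, C, D, this recursion visits the group-bys in the order all, A, AB, ABC, ABCD, ABD, AC, and so on, which matches the numbering in the lattice figure.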

BUC Partitioning
- Usually, the entire data set can't fit in main memory.
- Sort distinct values and partition them into blocks that fit.
- Continue processing.
Optimizations
- Partitioning: external sorting, hashing, counting sort.
- Ordering dimensions to encourage pruning: cardinality, skew, correlation.
  - The higher the cardinality, the smaller the partitions and the greater the pruning opportunity.
- Collapsing duplicates: can't do holistic aggregates anymore!
- Ideally, the most discriminating dimension, with higher cardinality and less skew, is processed first.
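The cardinality part of the ordering heuristic can be sketched as follows. The example assumes in-memory tuples and ranks dimensions by their number of distinct values only; skew and correlation are not modelled, and the function name and sample data are hypothetical.

```python
def order_by_cardinality(rows, num_dims):
    """Return dimension indexes sorted by decreasing number of distinct values."""
    cardinality = [len({row[d] for row in rows}) for d in range(num_dims)]
    return sorted(range(num_dims), key=lambda d: cardinality[d], reverse=True)

# Hypothetical (month, city, item) tuples.
rows = [
    ("Jan", "Chicago", "I1"),
    ("Jan", "Toronto", "I2"),
    ("Feb", "Chicago", "I3"),
    ("Jan", "Chicago", "I4"),
]
order = order_by_cardinality(rows, 3)

# Rearrange each tuple so the highest-cardinality dimension comes first.
reordered = [tuple(row[d] for d in order) for row in rows]
print(order, reordered[0])
```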

BUC: Example (having count(*) > 5)
[Figure: tuple counts partitioned by place (New York, Toronto), then by item (I1, I2, I3) and by quarter (Q1, Q2, Q3, Q4).]

BUC: Example (having count(*) > 5), continued
[Figure: BUC processing tree over the dimensions I, P and Q, with nodes All, I, P, Q, (P,I), (Q,P), (Q,I) and (I,P,Q), alongside the partition counts for I1 to I3 and Q1 to Q4.]

Till Now
- Aggregates simultaneously on multiple dimensions. Multiple cuboids can be computed simultaneously in one pass.
- Dynamic structure with simultaneous aggregation.
- Facilitates Apriori pruning. During partitioning, each partition's count is compared with min_sup. The recursion stops if the count does not satisfy min_sup.

Star-Cubing (aside)
- Bottom-up computation with top-down expansion of shared dimensions
[Figure: base cuboid tree fragment.]

Star-tree construction and Star-Cubing