Multiway pruning for efficient iceberg cubing


Xiuzhen Zhang & Pauline Lienhua Chou
School of CS & IT, RMIT University, Melbourne, VIC 3001, Australia
{zhang,lchou}@cs.rmit.edu.au

Abstract. Effective pruning is essential for efficient iceberg cube computation. Previous studies have focused on exclusive pruning: regions of a search space that do not satisfy some condition are excluded from computation. In this paper we propose inclusive and anti-pruning. With inclusive pruning, necessary conditions that solutions must satisfy are identified, and regions that cannot be reached by such conditions are pruned from computation. With anti-pruning, regions of solutions are identified and pruning is not applied. We propose the multiway pruning strategy, combining exclusive, inclusive and anti-pruning with bounding aggregate functions in iceberg cube computation. Preliminary experiments demonstrate that the multiway pruning strategy improves the efficiency of iceberg cubing algorithms that use only exclusive pruning.

1 Introduction

Since the introduction of the CUBE operator [6], the computation of data cubes has attracted much research [3, 7, 10]. Data cubes consist of aggregates for partitions produced by group-bys of all subsets of a given set of grouping attributes called dimensions. Given an aggregation constraint, the aggregates in a data cube that satisfy the constraint form an iceberg cube [3]. Pruning is critical to the efficiency of iceberg cube computation, as the cube size grows exponentially with the number of dimensions. With traditional pruning strategies [1, 3, 4, 7, 9, 10], regions of the search space that do not satisfy a constraint are identified and then pruned from computation. In this paper, we propose two new pruning techniques. (1) Inclusive pruning identifies conditions that the solutions must meet; units under search that do not meet any of these conditions cannot be solutions and thus are pruned. (2) Anti-pruning identifies regions where all units in the regions are solutions, so that testing for pruning is saved. We also propose inclusive and anti-pruning strategies with bounding aggregate functions for efficient iceberg cubing.
Our initial experiments confirm that multiway pruning improves the efficiency of cubing. With traditional pruning, conditions are identified that warrant non-solutions; in this sense, it is termed exclusive pruning. With traditional exclusive pruning, the larger the solution set, the more tests for the pruning condition result in fruitless effort and become extra unnecessary cost. In contrast, the cost of inclusive pruning does not increase, and anti-pruning becomes more effective, as the solution set grows. Exclusive, inclusive and anti-pruning are collectively called the multiway pruning strategy. Without inclusive pruning and anti-pruning, traditional exclusive pruning still computes the complete solutions and prunes correctly; however, it incurs extra cost. Inclusive and anti-pruning cannot replace exclusive pruning, but they complement it and in many cases greatly reduce the cost of pruning.

1.1 Related work

Studies in the iceberg cubing literature have all focused on exclusive pruning strategies in iceberg cube computation [3, 7, 10, 12]. Another related area is constraint data mining. Several types of constraints have been proposed and their properties used for pruning [1, 8, 9]. All of these works focus on exclusive pruning, except Reference [8]. The succinct constraints in Reference [8] are conceptually similar to inclusive pruning; however, they are constraints orthogonal to the constraints for exclusive pruning. Our inclusive pruning does not rely on extra constraints, and it is used to enhance exclusive pruning. The term inclusive pruning was coined by Webb [11] in the area of machine learning, where specific pruning rules were developed for classification learning. We introduce inclusive pruning into iceberg cube mining and develop inclusive pruning strategies for aggregation constraints based on bounding. The concept of border was used in data mining to represent large spaces [2, 5]. Rather than computing borders, we focus on using borders for more effective pruning and on computing all solutions in the space represented by borders.

2 Preliminaries on data cubes

Dimensional datasets are defined by dimensions and measures. In Table 1, there are 4 dimensions Month, Product, SalesMan (Man) and City, and Sale (aggregated as Sum(Sale)) is the measure. Throughout this paper we use upper-case letters to denote dimensions and lower-case letters to denote dimension-values. For example, (A,B,C) denotes a group-by, and (a,b,c) denotes a partition of tuples from the (A,B,C) group-by, called a group. We also use an upper-case letter for a measure attribute to represent the set of values for the measure.
The group-bys in a data cube form a lattice structure called the cube lattice. Fig. 1 shows an example 4-dimensional cube lattice, where each node denotes a group-by list. The empty group-by that aggregates all tuples in a dataset is at the bottom of the lattice, and the group-by with all dimensions is at the top. Group-bys form subset/superset relationships. Groups from different group-bys also form super-group/sub-group relationships. A special value * is assumed in the domain of all dimensions, which can match any value of the dimension. With the Sales dataset, the (Jan, *, *, *) group denotes the partition of tuples with Month = January and no restriction on the values of the other dimensions. For simplicity, (Jan, *, *, *) is also written as (Jan).
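The full cube of group-bys over all subsets of the dimensions can be illustrated with a small sketch (not from the paper; the function name cube_groups and the toy rows are ours), aggregating Sum of the measure for every group of every group-by, with * standing for a collapsed dimension:

```python
from itertools import combinations
from collections import defaultdict

def cube_groups(rows, dims, measure):
    """Enumerate every group-by (one per subset of dims) and aggregate
    Sum(measure) for each group; '*' marks a collapsed dimension."""
    agg = {}
    for r in range(len(dims) + 1):
        for gb in combinations(dims, r):      # one group-by per subset
            sums = defaultdict(int)
            for row in rows:
                # a group keeps its values on gb-dimensions, '*' elsewhere
                key = tuple(row[d] if d in gb else '*' for d in dims)
                sums[key] += row[measure]
            agg[gb] = dict(sums)
    return agg

rows = [
    {'Month': 'Jan', 'Prod': 'Toy', 'Sale': 200},
    {'Month': 'Mar', 'Prod': 'TV',  'Sale': 300},
    {'Month': 'Apr', 'Prod': 'TV',  'Sale': 100},
    {'Month': 'Apr', 'Prod': 'Toy', 'Sale': 100},
]
cube = cube_groups(rows, ('Month', 'Prod'), 'Sale')
# The empty group-by aggregates all tuples: cube[()][('*', '*')] == 700
```

The 2^n group-bys here (4 for two dimensions) make concrete why pruning matters as n grows.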

Table 1. A sales dataset, partially aggregated

  Month  Prod  Man    City   Cnt(*)  Sum(Sale)
  Jan    Toy   John   Perth     5      200
  Mar    TV    Peter  Perth    40      100
  Mar    TV    John   Perth    20      100
  Mar    TV    John   Syd      10      100
  Apr    TV    Peter  Perth     8      100
  Apr    Toy   Peter  Syd       5      100

Table 2. The bounds of some aggregate functions ({X_i | i = 1..n} are the MSPs)

  F     upper bound                            lower bound
  Cnt   Sum_i Cnt(X_i)                         Min_i Cnt(X_i)
  Max   Max_i Max(X_i)                         Min_i Max(X_i)
  Min   Max_i Min(X_i)                         Min_i Min(X_i)
  Avg   Max_i Avg(X_i)                         Min_i Avg(X_i)
  Sum   Sum of all Sum(X_i) > 0 if any exist,  Sum of all Sum(X_i) < 0 if any exist,
        otherwise Max_i Sum(X_i)               otherwise Min_i Sum(X_i)

The 3-dimensional group (Jan, TV, Perth) is a sub-group of the 2-dimensional groups (Jan, TV), (Jan, Perth) and (TV, Perth). Given a dataset S of n dimensions A_1, ..., A_n with measure X, the data cube of applying F on S can be expressed in two ways: (1) Cube(A_1,...,A_n) can be viewed as the set of possible group-bys, or the set of possible groups for all group-bys. (2) Cube(X) = {X_g = {t[X] | t in g} | g is a partition of the cube}; that is, a data cube is expressed as the partitions of multi-sets of measure values following the grouping of tuples.

3 Bounding aggregate functions

Bounding was proposed as a technique to prune for complex aggregation constraints [12]. The idea of bounding is to estimate the upper and lower bounds for data cubes from the most specific partitions (MSPs) of the data. The Sales dataset comprises 6 MSPs, as shown in Table 1. Each MSP has been aggregated with the functions Count(*) (denoted Cnt(*)) and Sum(Sale). We generalize the original definition of data cubes [6]. Fig. 1 shows Cube(ABCD), which can be decomposed into two lattice structures: the nodes in the right polygon form Cube(BCD), and for each partition of the data with a in domain(A), the nodes in the left polygon form a data cube on the dimensions {B, C, D} conditioned on a in domain(A). In later discussions we use the term data cube to refer to both unconditional and conditional data cubes. The concept of the data cube core was coined by Gray et al. [6]; here it is generalized to conditional data cubes.
Given Cube(A_1,...,A_n) over measure X, all n-dimensional partitions {(a_1,...,a_n) | a_i in domain(A_i), 1 <= i <= n} comprise the core of the data cube, and each partition g in the core is a Most Specific Partition (MSP). The multi-set of measure values for the tuples in an MSP g, namely {t[X] | t in g}, is an MSP of the measure. All aggregates in a data cube can be computed from its MSPs.

Fig. 1. Cube(ABCD) decomposition.
Fig. 2. An anti-pruning border on the lattice conditioned on (a_1).

In Fig. 1, any group in Cube(BCD)|a_1 can be computed from some (a_1,b_i,c_j,d_k) groups, and any group in Cube(BCD) can be computed from some (b_i,c_j,d_k) groups, where i, j and k iterate over the values in the domains of B, C and D respectively. Given an aggregate function F and a multi-dimensional dataset with measure X, the upper (lower) bound for a data cube is a real number such that for any partition X_g in Cube(X), F(X_g) is no larger (smaller) than the upper (lower) bound. An aggregate function F is boundable for a data cube if the upper and lower bounds for the data cube can be determined with a single scan of the local aggregate values of the MSPs. The bounds for some commonly seen aggregate functions are listed in Table 2. We use Count(*) as an example to explain: the Count(*) of any group in a data cube is no larger than Sum_i Cnt(X_i), the sum of Count(*) over all MSPs, and no smaller than Min_i Cnt(X_i), the smallest Count(*) among all MSPs. Moreover, these are the tightest bounds for Count(*). Indeed, arithmetic expressions of the base functions Sum, Count, Min and Max are boundable. Details of the bounding algorithms are discussed in Reference [12].

4 Inclusive pruning with bounding

Inclusive pruning is applied before computing the bounds of data cubes for exclusive pruning. Dimension values that solutions in a lattice must have are called inclusive dimension values. Groups that are defined by only non-inclusive dimension-values are definitely not solutions and should be pruned.

Definition 1. Given a data cube to be computed with some constraint, the inclusive dimension values are those that define groups that are solutions.

From this definition, dimension values that do not involve any solutions are non-inclusive dimension values.

Theorem 1 (Inclusive pruning). Given a cube lattice L, an aggregation constraint, and a set of inclusive dimension values, groups in L that are defined by only non-inclusive dimension values are not solutions and can be pruned.
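The bounding rules of Table 2 can be sketched in a few lines (not the paper's implementation; the function name bounds and the tuple layout are ours), applied to the (Cnt(*), Sum(Sale)) aggregates of the six MSPs in Table 1:

```python
def bounds(msp_stats, func):
    """Upper and lower bounds for a cube, computed in one pass over the
    per-MSP local aggregates (cnt, sum), following the rules of Table 2."""
    cnts = [c for c, _ in msp_stats]
    sums = [s for _, s in msp_stats]
    if func == 'Cnt':
        return sum(cnts), min(cnts)
    if func == 'Avg':
        avgs = [s / c for c, s in msp_stats]
        return max(avgs), min(avgs)
    if func == 'Sum':
        pos = [s for s in sums if s > 0]
        neg = [s for s in sums if s < 0]
        upper = sum(pos) if pos else max(sums)   # all-negative case: Max_i Sum(X_i)
        lower = sum(neg) if neg else min(sums)   # all-positive case: Min_i Sum(X_i)
        return upper, lower
    raise ValueError(func)

# (Cnt(*), Sum(Sale)) of the six MSPs of the Sales dataset (Table 1)
msps = [(5, 200), (40, 100), (20, 100), (10, 100), (8, 100), (5, 100)]
print(bounds(msps, 'Cnt'))   # (88, 5)
print(bounds(msps, 'Avg'))   # (40.0, 2.5)
print(bounds(msps, 'Sum'))   # (700, 100)
```

The single scan over the MSP aggregates is what makes these functions boundable in the paper's sense.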

Proof. All solution groups in L must be grouped by at least one inclusive dimension value. A group in L whose grouping dimension values are solely non-inclusive cannot be a solution and so can be pruned.

The question to answer now is how to identify the inclusive dimension values of a cube lattice before it is computed. As will be seen in Section 6, the inclusive dimension values for a cube are decided during the preceding computation of its super-cubes: all dimension-values whose bounds have a non-empty intersection with the interval defined by the thresholds of the given constraint are inclusive dimension values for the cube. Importantly, these bounds can be computed by a single traversal of the MSPs. As a result, we gain the benefit of pruning at little extra cost. Inclusive pruning is achieved by removing branches that contain only non-inclusive dimension-values. Note that these branches are pruned from all sub-trees that are to be computed.

5 Anti-pruning with bounding

When the selectivity of an aggregation constraint is high, a large number of groups qualify as iceberg groups. When the data is skewed, the iceberg groups reside in a few regions of the cube. In both cases, examining the pruning conditions on these groups does not result in any pruning. We introduce the notion of the anti-pruning border to mark regions where all groups are solutions.

Definition 2 (Anti-pruning region). In iceberg cubing, given a lattice L, an anti-pruning region is a convex space of groups that are all solutions. An anti-pruning region B can be denoted <B_L, B_U>: B_L is the lower border, consisting of the groups in B none of whose super-groups are in B; B_U is the upper border, consisting of the groups in B none of whose sub-groups are in B.

Example 1. An example anti-pruning border on the lattice of B, C, D and E conditioned on (a_1) is shown in Fig. 2. The upper border consists of the (a_1 BC), (a_1 BDE) and (a_1 CDE) group-bys, and the lower border consists of the (a_1 BC), (a_1 D) and (a_1 E) group-bys. The anti-pruning region contains the groups that are super-groups of some group in the (a_1 BC), (a_1 BDE) and (a_1 CDE) group-bys as well as sub-groups of some group in the (a_1 BC), (a_1 D) and (a_1 E) group-bys.

An anti-pruning region is a maximal region in a group lattice where pruning is unnecessary. The groups within the region covered by the anti-pruning borders are definitely solutions. We aim to compute anti-pruning regions that are lattices at little extra cost. We observe that lattices are convex spaces: given a lattice conditioned on a group g, the lattice is a convex space where all groups are sub-groups of g and super-groups of some MSPs. We make use of bounding to detect anti-pruning regions that are lattices, as stated in the observation below.

Observation 1. Given a lattice conditioned on a group g, if both bounds satisfy a given constraint, then the MSPs are the upper anti-pruning border B_U, g is the lower anti-pruning border B_L, and the lattice of g is an anti-pruning region. All groups in the lattice satisfy the constraint.
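Taken together, exclusive pruning and anti-pruning reduce to interval tests between a lattice's bound interval [b1, b2] and the constraint interval [d1, d2]; the following sketch (our naming, not the paper's code) shows the three possible outcomes:

```python
def pruning_decision(b1, b2, d1, d2):
    """Decide how to treat a (conditional) cube lattice whose aggregate
    values are bounded by [b1, b2], under the constraint F(X) in [d1, d2]."""
    lo, hi = max(b1, d1), min(b2, d2)        # interval intersection
    if lo > hi:
        return 'exclusive-prune'             # empty intersection: no group qualifies
    if d1 <= b1 and b2 <= d2:
        return 'anti-prune'                  # bounds contained: every group qualifies
    return 'compute-and-test'                # mixed region: test each group

# Avg bounds [2.5, 10] for Cube(Man, City)|Mar against various thresholds
assert pruning_decision(2.5, 10, 20, 50) == 'exclusive-prune'
assert pruning_decision(2.5, 10, 1, 100) == 'anti-prune'
assert pruning_decision(2.5, 10, 5, 100) == 'compute-and-test'
```

The containment test is the condition of Observation 1: when [b1, b2] lies inside [d1, d2], no per-group checking is needed in the whole conditioned lattice.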

For iceberg cubes with complex aggregation constraints, the cost of checking the pruning conditions is high. Detecting a group lattice that is an anti-pruning region therefore improves the efficiency of cubing algorithms.

6 Multiway Bound-Prune Cubing on G-trees

The Group tree (G-tree) is our data structure for cubing. The G-tree for an n-dimensional dataset is of depth n, where each level represents a dimension. A common path starting from the root collapses the tuples with common dimension-values. Each tree node keeps the local aggregates necessary to compute the iceberg cube for a given aggregate function. A G-tree is constructed by one scan of the input data. The G-tree in Fig. 3 represents the Sales dataset. To compute the data cube with the aggregate function Average, the local aggregates in each node are Sum(Sale) and Cnt(*). On the leftmost path, the node (March) shows that there are 70 tuples with Sum(Sale) = 300 in the (March, *, *, *) group, whereas the node (Peter) shows that there are 40 tuples with Sum(Sale) = 100 in the (March, TV, Peter, *) group.

Fig. 3. A sample G-tree (nodes annotated with (Cnt, Sum)).
Fig. 4. Top-down cubing of a 4-dimensional data cube with shared dimensions.

6.1 Top-down aggregation on G-trees

Top-down aggregation on G-trees is based on the following observation: the construction of an n-dimensional G-tree has already computed n group-bys, namely the group-bys whose dimensions are prefixes of the given list of dimensions. With the G-tree in Fig. 3, the root node has the aggregate for the group (*, *, *, *) with Cnt(*) = 88 and Sum(Sale) = 700; the nodes at level one compute the aggregates for the groups in (Month, *, *, *), which are (March, *, *, *, 70, 300), (January, *, *, *, 5, 200) and (April, *, *, *, 13, 200). The nodes at the next 3 levels compute the aggregates for (Month, Product), (Month, Product, Man) and (Month, Product, Man, City) respectively. Let A, B, C and D denote the four dimensions of the Sales dataset. The group-bys that are not represented on the G-tree in Fig. 3 are computed by collapsing one dimension at a time to construct sub-G-trees. In Fig. 4, each node represents a G-tree and all the group-bys that are simultaneously computed on that G-tree. The ABCD-tree representing the tree of Fig. 3 is at the top. The group-bys (A,B,C,D), (A,B,C), (A,B), (A) and () are computed on the (A,B,C,D)-tree. The sub-trees of the ABCD-tree, (-A)BCD, A(-B)CD and AB(-C)D, are formed by collapsing the dimensions A, B and C respectively.

6.2 Multiway pruning on G-trees

In Fig. 3, the leaf nodes of the G-tree are the MSPs for Cube(Mon, Prod, Man, City). The dimensions after "/" in each node of Fig. 4 denote the prefix dimensions for the tree at that node and all its sub-trees. A is the prefix dimension for the ACD-tree and its sub-trees. All group-bys computed on the ACD-tree and its sub-trees form data cubes conditioned on some A-value, Cube(CD)|a_i, where i iterates over the values of A. The leaf nodes originating from a_i are the MSPs for Cube(CD)|a_i, and the bounds for Cube(CD)|a_i are obtained by a traversal of those leaf nodes. With the G-tree over the dimensions Month, Product, Man and City in Fig. 3, before collapsing on Product we calculate the bounds for the cube lattices conditioned on each dimension value. Following Table 2, the bounds for Cube(Product, Man, City)|March are computed from the three leaf nodes originating from (March): the upper bound Avg(Cube(Man, City)|Mar) = Max({100/40, 100/20, 100/10}) = 10, and the lower bound Avg(Cube(Man, City)|Mar) = Min({100/40, 100/20, 100/10}) = 2.5.

Inclusive pruning is achieved by identifying I, the set of inclusive dimension-values for a cube lattice, before its aggregates are computed. Given a G-tree G on the dimensions A_1, ..., A_n, the bounds for Cube(A_{k+1},...,A_n)|a_1,...,a_{k-1}, 1 < k < n, are computed from the leaf nodes originating from the node a_k. If the bound interval does not violate the given constraint, a_k is added to I. After I is obtained, the G-tree is traversed again in depth-first order for inclusive pruning. At each node, if its dimension-value is in I, the traversal of its children is terminated because the branch is not to be trimmed. If the traversal reaches a leaf node, the branch is trimmed: from the leaf upwards, the nodes on the branch that have no children are deleted.

Observation 2 (Inclusive pruning). Consider computing an iceberg cube with the constraint F(X) in [δ1, δ2]. Given a G-tree with n dimensions A_1, ..., A_n, suppose [b1, b2] are the bounds for Cube(A_{k+1},...,A_n)|a_1,...,a_{k-1}, the sub-cube obtained by collapsing a_k (1 < k < n). The inclusive dimension-values are those a_k such that [b1, b2] ∩ [δ1, δ2] ≠ ∅. Branches defined by only non-inclusive dimension-values are pruned from computation.

After inclusive pruning, the branches on a G-tree that cannot contain solutions have been pruned from computation. With the bounds computed for each cube lattice, exclusive and anti-pruning are then applied.
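The depth-first branch trimming described above can be sketched on a toy tree of nested dicts (value -> children); this is an illustrative sketch under our own naming, not the paper's G-tree implementation, and it omits the per-node aggregates:

```python
def trim(node, inclusive):
    """Inclusive pruning on a toy tree (dict: dimension-value -> children).
    Descent stops as soon as an inclusive value is met (branch kept);
    branches reaching a leaf through only non-inclusive values are
    deleted bottom-up, removing child-less nodes along the way."""
    for value in list(node):
        if value in inclusive:
            continue                      # branch kept; stop descending
        child = node[value]
        if not child:
            del node[value]               # leaf reached: trim the branch
        else:
            trim(child, inclusive)
            if not child:                 # all children trimmed away
                del node[value]

tree = {'Mar': {'TV': {'Peter': {}, 'John': {}}},
        'Jan': {'Toy': {'John': {}}}}
trim(tree, inclusive={'TV'})
# Only branches passing through the inclusive value 'TV' remain:
# {'Mar': {'TV': {'Peter': {}, 'John': {}}}}
```

As in the paper, a branch survives as soon as any node on it carries an inclusive dimension-value, and everything below that node is kept without further testing.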

Algorithm 1. The MBPC Algorithm
// Assume that an n-dimensional G-tree T with aggregate function F has been constructed.
// The iceberg cube is computed by calling MBPC(T(A_1,...,A_n), ∅).
Input: a) T is a G-tree with conditional base B.
       b) The aggregate constraint F(X) in [δ1, δ2] is assumed global.
Output: The iceberg groups in the iceberg cube.
MBPC(T(B_1,...,B_m), B)
(1)  for each dimension i = 1..m-1 do
(2)    for each node b_i of dimension B_i on T
(3)      output the group g conditioned on B if F(g) in [δ1, δ2];
(4)      compute the bounds [b1, b2] and I for Cube(B_{i+1},...,B_m)|B ∪ {b_i};
(5)      prune branches formed by only dimension-values not in I;  // inclusive pruning
(6)      if ([b1, b2] ∩ [δ1, δ2] = ∅)
(7)        break;  // exclusive pruning
(8)      else if ([b1, b2] ∩ [δ1, δ2] = [b1, b2])
(9)        compute all groups in Cube(B_{i+1},...,B_m)|B ∪ {b_i};
(10)       break;  // anti-pruning
(11)   for j = i+1..m do
(12)     construct the sub-tree T_s by collapsing B_j;
(13)     MBPC(T_s, B ∪ {b_i});

Observation 3 (Exclusive and anti-pruning). Consider computing an iceberg cube with the constraint F(X) in [δ1, δ2]. Given a G-tree for the dimensions A_1, ..., A_n, let [b1, b2] be the bounds for Cube(A_{k+1},...,A_n)|a_1,...,a_{k-1}, the cube obtained by collapsing A_k (1 < k < n). If [b1, b2] ∩ [δ1, δ2] = ∅, the branches originating from the node (a_1,...,a_{k-1}) can be pruned; otherwise, if [b1, b2] ∩ [δ1, δ2] = [b1, b2], all groups in Cube(A_{k+1},...,A_n) are qualified groups.

6.3 The MBPC algorithm

We present the Multiway Bound-Prune Cubing (MBPC) algorithm for computing iceberg cubes with the inclusive, exclusive and anti-pruning strategies. The algorithm is shown as Algorithm 1. A G-tree is built by one scan of the input dataset and is the input to MBPC. MBPC is a recursive procedure where, with each recursion, an additional dimension-value is accumulated in g, the group to be computed. For ease of presentation, one group g is processed at a time in Algorithm 1; in fact, with top-down cubing on the G-tree, multiple groups are processed simultaneously. The steps involved for each group are the same.
At line 3 of the algorithm, the groups computed on a G-tree are conditioned on the conditional base B of the tree. Initially the conditional base is empty; the prefix dimension-value before the collapsing dimension is accumulated into B within each recursion. The bounds are obtained when the branches under the prefix groups are amalgamated, so no extra scan of the tree is required. Inclusive pruning (line 5) is applied before further aggregation of sub-groups. For top-down iceberg cubing on the G-tree, a list I is created when the G-tree is built. The list consists of the dimension-values of the G-tree, each associated with a bit-vector. When the G-tree is traversed in depth-first order to obtain the bounds for the nodes on the G-tree, the bounds for the dimension-values in the list are also obtained; these determine which branches can be trimmed. The test for anti-pruning (line 8) can be combined with that for exclusive pruning (line 6) into a single test, where the intersection of the two ranges is computed only once. Anti-pruning (line 9) is applied by aggregating groups without any bounding or bounding test being performed.

7 Experiments

We conducted preliminary experiments to examine the performance of MBPC. We implemented MBPC with anti-pruning and compared its performance with that of BPC, which uses only exclusive bound pruning. Experiments were conducted on a PC with a 686 processor running GNU/Linux. Two datasets were used. The weather dataset¹ contains the weather reports collected at various weather stations globally in 1985. Nine of the attributes, such as station-id, longitude and latitude, are used as dimensions, and the cardinalities range from 2 to 6505. Numbers between 1 and 100 were randomly generated as the measure. The US census dataset² was collected from surveys about various aspects of individuals in households in the US in 1990. The original dataset has 61 different attributes, such as hrswork1 (hours worked last week) and valueh (value of the house). We selected 12 discrete attributes as dimensions and a numeric attribute as the measure. Census is dense and skewed, and Weather is very sparse.

Fig. 5. Performance comparison of MBPC (anti-pruning) and BPC: run time (sec) against the average range, for (a) Weather (100,000 tuples, 9 dimensions) and (b) Census (88,443 tuples, 12 dimensions).

1 http://cdiac.ornl.gov/ftp/ndp026b/sep85l.dat.z
2 ftp://ftp.ipums.org/ipums/data/ip19001.z

The results of computing iceberg cubes with the constraint Avg(X) in [δ1, δ2] are shown in Fig. 5. The reported time does not include time for output. Generally, anti-pruning further improves the efficiency of the BPC algorithm; moreover, as δ2 gets larger and the iceberg cube to compute grows, a more pronounced improvement can be observed. MBPC shows significant improvement in run time over BPC on Weather, outperforming BPC consistently by about 13% across the different aggregation constraint thresholds. MBPC shows modest improvement over BPC on the Census dataset.

8 Conclusions

We have proposed multiway pruning strategies for iceberg cubing: exclusive, inclusive and anti-pruning. We have also presented an iceberg cubing algorithm combining the multiway pruning strategy with bounding. Our initial experiments with anti-pruning have shown that the multiway pruning strategy can significantly improve the efficiency of cubing algorithms, especially when computing large cubes. More extensive experiments are underway to examine the effectiveness of inclusive pruning and the multiway pruning strategy with respect to general data characteristics.

References

1. R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. of VLDB'94, pages 487-499, Santiago, Chile.
2. R. J. Bayardo. Efficiently mining long patterns from databases. In Proc. of SIGMOD'98, pages 85-93.
3. K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and iceberg cubes. In Proc. of SIGMOD'99.
4. L. Chou and X. Zhang. Computing complex iceberg cubes by multiway aggregation and bounding. In Proc. of DaWaK'04 (LNCS 3181), Zaragoza, Spain.
5. G. Dong and J. Li. Efficient mining of emerging patterns: discovering trends and differences. In Proc. of KDD'99, pages 15-18, San Diego, USA.
6. J. Gray et al. Data cube: a relational aggregation operator generalizing group-by, cross-tab, and sub-totals. Data Mining and Knowledge Discovery, 1(1), 1997.
7. J. Han, J. Pei, G. Dong, and K. Wang. Efficient computation of iceberg cubes with complex measures. In Proc. of SIGMOD'01.
8. R. Ng, L. V. S. Lakshmanan, J. Han, and A. Pang. Exploratory mining and pruning optimizations of constrained associations rules. In Proc. of SIGMOD'98.
9. J. Pei, J. Han, and L. V. S. Lakshmanan. Mining frequent itemsets with convertible constraints. In Proc. of ICDE'01.
10. K. Wang et al. Divide-and-approximate: a novel constraint push strategy for iceberg cube mining. IEEE TKDE, 17(3):354-368, March 2005.
11. G. I. Webb. Inclusive pruning: a new class of pruning rules for unordered search and its application to classification learning. In Proc. of ACSC'96.
12. X. Zhang, L. Chou, and G. Dong. Efficient computation of iceberg cubes by bounding aggregate functions. IEEE TKDE, in submission.