WITH the multidimensional model for data warehouses,

Size: px

Start display at page:

Download "WITH the multidimensional model for data warehouses,"

Jodie O’Neal’
5 years ago
Views:

1 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 19, NO. 7, JULY Effcent Computaton of Iceberg Cubes by Boundng Aggregate Functons Xuzhen Zhang, Paulne Lenhua Chou, and Guozhu Dong, Senor Member, IEEE Abstract The ceberg cubng problem s to compute the multdmensonal group-by parttons that satsfy gven aggregaton constrants. Prunng unproductve computaton for ceberg cubng when nonantmonotone constrants are present s a great challenge because the aggregate functons do not ncrease or decrease monotoncally along the subset relatonshp between parttons. In ths paper, we propose a novel bound prune cubng (BP-Cubng) approach for ceberg cubng wth nonantmonotone aggregaton constrants. Gven a cube over n dmensons, an aggregate for any group-by partton can be computed from aggregates for the most specfc n-dmensonal parttons (MSPs). The largest and smallest aggregate values computed ths way become the bounds for all parttons n the cube. We provde effcent methods to compute tght bounds for base aggregate functons and, more nterestngly, arthmetc expressons thereof, from bounds of aggregates over the MSPs. Our methods produce tghter bounds than those obtaned by prevous approaches. We present ceberg cubng algorthms that combne boundng wth effcent aggregaton strateges. Our experments on real-world and artfcal benchmark data sets demonstrate that BP-Cubng algorthms acheve more effectve prunng and are several tmes faster than state-of-the-art ceberg cubng algorthms and that BP-Cubng acheves the best performance wth the top-down cubng approach. Index Terms Data mnng, data cube, prunng, data warehouses. Ç 1 INTRODUCTION WITH the multdmensonal model for data warehouses, a data set conssts of tuples over dmensons and measures, and queres nvolve aggregatng the measures over parttons of tuples sharng dentcal dmenson values. For example, the Structured Query Language (SQL) n Fg. 1a gves the monthly total amount of sales for each cty on the Sales data set n Table 1. The (Month, Cty) group-by dvdes the data set nto parttons of tuples sharng dentcal (Month, Cty) values, and the measure Sale s aggregated to yeld Sum(Sale) for each partton. The Cube operator was proposed [6] to compute all potental group-bys of a data set, leadng to the noton of a data cube. The ceberg cube was proposed n [4], where only parttons whose aggregate values satsfy an aggregaton constrant are produced. By addng HAVING CountðÞ 10 to the SQL query n Fg. 1a, we get an SQL query (n Fg. 1b) for computng those (Month, Cty) parttons, each havng 10 tuples. The aggregaton constrant CountðÞ 10 s antmonotone: If a partton does not contan 10 tuples and thus fals the constrant, then all ts subparttons wll have fewer tuples and also fal the constrant. The bottom-up cubng (BUC) strategy was proposed [4], where aggregates are computed startng from. X. Zhang and P.L. Chou are wth the School of Computer Scence and IT, RMIT Unversty, GPO Box 2476v, Melbourne 3001, Australa. E-mal: {zhang, lchou}@cs.rmt.edu.au.. G. Dong s wth the Department of Computer Scence and Engneerng, Wrght State Unversty, Dayton, OH E-mal: guozhu.dong@wrght.edu. Manuscrpt receved 18 Sept. 2005; revsed 12 Apr. 2006; accepted 15 Jan. 2007; publshed onlne 25 Apr For nformaton on obtanng reprnts of ths artcle, please send e-mal to: tkde@computer.org, and reference IEEECS Log Number TKDE Dgtal Object Identfer no /TKDE the partton of all tuples and then followed recursvely wth the subparttons, and antmonotone constrants are used for prunng. In On-Lne Analytcal Processng (OLAP) applcatons, aggregaton constrants often nvolve complex aggregates. The constrant SumðxÞ 5; 000 and VarðxÞ 100, whch states that the total proft ðxþ s at least $5,000, and the varance of proft s at most $100, s a typcal example of constrants for supportng enterprse data analyss and decson makng. As x can be postve or negatve, the subset relatonshp between a group 1 and ts subgroups may not mply monotoncally ncreasng (or decreasng) aggregate values. Computng ceberg cubes wth such constrants has been a challengng problem. Exstng proposals n the lterature have all followed the approach of dervng weaker but antmonotone constrants [7], [15]. They are restrcted n two aspects: 1) the weaker constrants derved from nonantmonotone constrants are often very loose and not effectve enough for prunng, especally for complex constrants, and 2) aggregates are computed followng the BUC framework, and only one group s aggregated at a tme n the recursve parttonng process, whch lmts the amount of nformaton readly avalable for prunng. In ths paper, we propose a novel technque, called bound prune cubng (BP-Cubng), for effcently computng ceberg cubes wth nonantmonotone aggregaton constrants. The man deas are to effectvely derve tght bounds for possble aggregates and to use such bounds to prune unproductve computaton. An mportant part s played by the most specfc parttons (MSPs), namely, those nonempty parttons that cannot be further dvded (for the subcube under consderaton). MSPs can be vewed as the 1. We use group and partton as synonyms /07/$25.00 ß 2007 IEEE Publshed by the IEEE Computer Socety Authorzed lcensed use lmted to: IEEE Xplore. Downloaded on November 18, 2008 at 20:54 from IEEE Xplore. Restrctons apply.

2 904 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 19, NO. 7, JULY 2007 Fg. 1. Two SQL group-by queres on the sales data set. (a) An SQL group-by. (b) Addng a constrant to (a). TABLE 1 A Sales Data Set, Partally Aggregated Fg. 2. Data cube lattce and subdata cube. (a) Cube(ABCD) decomposton. (b) CubeðBCDÞja 1. Month, Product, SalesMan, and Cty are dmensons. Sale s the measure. basc unts for computng data cubes, snce other parttons are unons of some MSPs. The aggregate for any partton n the data cube can be computed from aggregatng some MSPs. Decomposng an aggregate functon nto the base aggregate functons Count, Max, Mn, and Sum on MSPs, the aggregate functon can be tghtly bounded. By usng MSPs, nonexstng groupngs of tuples wll not nfluence the estmaton of the bounds of aggregates, 2 whch s another reason why tght bounds can be obtaned. Wth a four-dmensonal data cube on the average Sale (Avg(Sale)) of Sales, sx MSPs are shown n Table 1. For each MSP, AvgðSaleÞ ¼SumðSaleÞ=CountðÞ. 3 We now llustrate cases where SumðSaleÞ > 0 for each MSP (the general case s treated later). Both Sum(Sale) and CountðÞ ncrease monotoncally over unons of postve MSPs. The upper bound of Avg(Sale) can be obtaned from the upper bound for Sum(Sale) and the lower bound for CountðÞ, whch s Sum ðsumðsaleþþ=mn ðcountðþþ, where ranges over the sx MSPs. Ths can, n fact, be mproved: The upper bound for Avg(Sale) s actually Max AvgðSaleÞ, where ranges over all MSPs (see Secton 6). For Table 1, Max AvgðSaleÞ s 200=5 ¼ 40, reached at the group (January, Toy, John, Perth). The MSPs can be obtaned by a sngle scan of the raw data. The bound-prune technque descrbed above wll be appled recursvely for all subcubes. We wll optmze the computaton for subcubes by reusng computed results for supercubes. Our bound-prune technque can be appled n the BUC approach. More mportantly, t also fts ncely nto the topdown cubng approach, where multway shared computaton of aggregates mproves cubng effcency. Bound-prune s a general technque that can prune for constrants of the form F ðxþ, F ðxþ, or FðxÞ n ½ 1 ; 2 Š. The aggregate functon F can be an SQL aggregate functon or a commonly seen complex aggregate functon such as Average (Avg) or Varance (Var). As wll be shown later 2. Ths s an ssue for other approaches. 3. For ease of dscusson, we assume that NULL s not allowed for measures and, therefore, CountðSaleÞ ¼CountðÞ. n our experments on real-world and synthetc benchmark data sets, our BP-Cubng algorthms are several tmes faster than exstng prunng technques, ncludng the most recent Dvde-and-Approxmate (DnA) algorthm [15]. Organzatonally, Secton 2 gves some prelmnares. Secton 3 defnes boundablty of aggregate functons. We dscuss how we can derve the bounds for base aggregates, ther functons, and complex aggregates n Sectons 4, 5, and 6, respectvely. Secton 7 then presents the BP-Cubng algorthms. We report expermental evaluatons n Secton 8. We dscuss related works n Secton 9 and conclude n Secton SOME PRELIMINARIES ON DATA CUBES We wll use uppercase letters to denote dmensons and lowercase letters to denote dmenson values. A group-by s a tuple of dmensons of the form (A, B, C), and a partton of (A, B, C) s a tuple of dmenson values of the form (a, b, c). The group-bys n a data cube form a lattce structure called the cube lattce. Fg. 2 shows an example fourdmensonal cube lattce. The empty group-by (whch aggregates all tuples n a data set) s at the bottom of the lattce, and the group-by wth all dmensons s at the top. The edges n the lattce represent subset-superset relatonshps between group-bys. A specal value s used n specfyng parttons, wth the meanng that t can match any value (of the applcable dmenson). For the data set n Table 1, ðjan; ; ; Þ denotes the partton of tuples havng Month ¼ January and havng no restrcton on the other dmensons. For brevty, s usually omtted n namng parttons; for example, ðjan; ; ; Þ s wrtten as (Jan). The subset relatonshp also exsts between parttons from dfferent group-bys. For example, the 3D partton (Jan, TV, Perth) s a subset of each of the 2D parttons (Jan, TV), (Jan, Perth), and (TV, Perth). Gven a data set S of n dmensons A 1 ;...;A n and a measure X, the correspondng data cube s usually referred to as CubeðA 1 ;...;A n Þ. Here, we thnk of the cube as the set of possble group-bys or the set of possble groups for all group-bys. For example, the four-dmensonal cube n Table 1 can be denoted as Cube(Month, Product, SalesMan, Cty). For ease of dscusson, we also refer to the cube as CubeðXÞ, and thnk of t as the measure parttons (the possble bags of measure values for the possble groups). Authorzed lcensed use lmted to: IEEE Xplore. Downloaded on November 18, 2008 at 20:54 from IEEE Xplore. Restrctons apply.

3 ZHANG ET AL.: EFFICIENT COMPUTATION OF ICEBERG CUBES BY BOUNDING AGGREGATE FUNCTIONS 905 Specfcally, let g 1 ;...;g m be all possble groups of the cube. By an abuse of notaton, we wll use X to denote the bag (multset) ft½xšjt 2 g g, and we wll refer to X 1 ;...;X m as the measure parttons. Now, gven an aggregate functon F, we apply F to the measure parttons n the data cube to get F ðcubeðxþþ ¼ ff ðx ÞjX 2 CubeðXÞg. 4 Wth the aggregate functon Sum(Sale), we can wrte Sum(Cube(Sale)) for the four-dmensonal cube n Table 1. Aggregate functons are categorzed nto dstrbutve, algebrac, and holstc functons, dependng on how an aggregate on a partton can be computed from aggregates on ts subparttons [6]. 1) An aggregate functon F s dstrbutve f there s a functon G such that F ðsþ ¼ GðfF ðs Þj ¼ 1;...;ngÞ for a data set S and parttonng fs 1 ;...;S n g of S. 5 For example, Sum and Max are dstrbutve, wth G ¼ F, and Count s dstrbutve, wth G ¼ Sum. 2) An aggregate functon F s algebrac f t can be computed by a functon H wth several arguments, each of whch s obtaned by applyng a dstrbutve aggregate functon. Avg, Var, Standard-Devaton, and MaxN and MnN 6 are algebrac functons; for example, Avg s algebrac, snce AvgðSÞ ¼SumðSÞ=CountðSÞ, and both Sum and Count are dstrbutve. All dstrbutve aggregatons are algebrac. 3) An aggregate functon F s holstc f t s not dstrbutve or algebrac. Rank, Mode, and Medan are examples of holstc functons. We wll call the dstrbutve aggregatons used for computng algebrac functons F (tem 2 above) the auxlary aggregates. Exstng cubng algorthms have made use of the propertes of dstrbutve and algebrac functons to compute supergroups from subgroups [1], [10], [11], [17]. No effcent cubng algorthms for holstc functons have been reported. It s worth notng that n most applcatons, aggregate functons are algebrac functons. In the dscussons below, the term aggregate functons wll refer to dstrbutve and algebrac functons unless specfed otherwse. 3 BOUNDING DATA CUBES The basc dea of boundng s to estmate the upper and lower bounds for a cube from the smallest or the MSPs of the base table. In ths secton, we defne the boundablty concept for aggregate functons. In Sectons 4 and 5, we wll dscuss how we can bound the base aggregate functons Count, Max, Mn, and Sum and how we can bound algebrac functons gven as arthmetc expressons of the base aggregate functons. Defnton 1 (most specfc partton and data cube core). Gven a data set on dmensons A 1 ;...;A n and measure X, all n-dmensonal parttons form the core of the data cube: fða 1 ;...;a n Þja 2 domanða Þ;a 6¼ ; 1 ng: Each partton n the core s an MSP. The multset of measure values for tuples n an MSP g, namely, ft½xšjt 2 gg, san MSP of the measure. All aggregates n a data cube can be computed from ts MSPs. For the Sales data set n Table 1 and Cube(Month, 4. Strctly speakng, FðX Þ s the F aggregate of X for group g. 5. S ¼[ n ¼1 S and S \ S j ¼ for 6¼ j. 6. MaxN and MnN return, respectvely, the N largest and smallest values. Product, SalesMan, Cty), the core conssts of sx (Month, Product, SalesMan, Cty) parttons, namely, (Jan, Toy, John, Perth), (Mar, TV, Peter, Perth), (Mar, TV, John, Perth), (Mar, TV, John, Sydney), (Apr, TV, Peter, Perth), and (Apr, Toy, Peter, Sydney). Any aggregate n ths cube can be computed from a subset of the sx MSPs. We thus can use MSPs to bound the aggregates n a data cube. Defnton 2 (Aggregate functon bound). An upper bound of an aggregate functon F for the data cube on X s a real number U such that for any partton X of CubeðXÞ, t s the case that FðX ÞU. Smlarly, we can defne a lower bound for F ðcubeðxþþ to be a real number L such that for any partton X of CubeðXÞ, F ðx ÞL. Observaton 1. Gven a data cube on measure X and an aggregate functon F, the tghtest upper bound and lower bound are, respectvely, reached by the largest and smallest aggregate values that can be produced by any set of MSPs of the data cube. Example 1. In Table 1, Sum(Sale) for any group n the data cube s not larger than the sum of Sum(Sale) for all sx MSPs, whch s 700. On the other hand, Sum(Sale) for any group n the data cube s not smaller than the mnmal Sum(Sale) among the sx MSPs, whch s 100. Therefore, for Sum(Sale), Cube(Month, Product, SalesMan, Cty) has the tghtest upper bound of 700 and the tghtest lower bound of 100. The noton of MSP and data cube core s also appled to subdata cubes. Fg. 2a shows Cube (ABCD), and ts MSPs are the ABCD parttons. The polygon on the left of Fg. 2a denotes a set of lattce structures, as shown n Fg. 2b. The polygon on the rght of Fg. 2a represents Cube(BCD). The MSPs for Cube(BCD) are BCD parttons, whch are unons of ABCD parttons, wth equal values for B, C, and D. The lattce structure, as shown n Fg. 2b, s called a subdata cube. For the lattce n Fg. 2b, (a 1, B, C, D) parttons, whch are a subset of the ABCD parttons, form ts core. The relatonshp between cores for a cube and ts subcubes, and that for a cube and ts lower-dmensonal counterparts, are mportant for computng bounds from MSPs, wthout ncurrng extra cost and for effectve prunng wth bounds, as shown n our BP-Cubng algorthms, dscussed n Secton 7. In later dscussons, we use the term data cube to refer to a standard or subdata cube. Defnton 3 (Subdata cube). Consder dmensons A 1 ;...;A n and B 1 ;...;B k and values b 1 2 domanðb 1 Þ;...;b k 2 domanðb k Þ: The groups aggregatng tuples that satsfy B ¼ b for 1 k form a subdata cube, or smply a subcube, denoted as CubeðA 1 ;...;A n Þjb 1 ;...;b k. The core of the subdata cube comprses MSPs wth B ¼ b for 1 k. B 1 ;...;B k are condtonal dmensons for the subcube. Example 2. Contnung wth Example 1, consder CubeðProduct; SalesMan; CtyÞjMar: There are three MSPs for the data cube: (Mar, TV, Peter, Perth), (Mar, TV, John, Perth), and Authorzed lcensed use lmted to: IEEE Xplore. Downloaded on November 18, 2008 at 20:54 from IEEE Xplore. Restrctons apply.

4 906 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 19, NO. 7, JULY 2007 (Mar, TV, John, Sydney). Tghter bounds can be obtaned on subcubes. Indeed, the tghtest upper bound for Sum(Sale) for the data cube s 100 þ 100 þ 100 ¼ 300, and the tghtest lower bound s Mnðf100; 100; 100gÞ ¼ 100. Based on Observaton 1, we can see that a data cube can be bounded from ts MSPs. However, exhaustvely checkng the power set of MSPs s equvalent to computng the complete data cube and s not computatonally feasble. We defne boundable aggregate functons as follows. Defnton 4 (Boundable aggregate functon). An aggregate functon F s boundable for a data cube f some upper and lower bounds of F for the data cube can be determned by some algorthm wth a sngle scan of some auxlary aggregate values of the MSPs of the data cube. We wll use F ðcubeðxþþ and F ðcubeðxþþ to denote, respectvely, the upper and lower bounds, computed by a gven sngle-scan algorthm. 7 It should be ponted out that the bounds computed may not be the tghtest. To ensure the effectveness of prunng, we am to derve bounds that are as tght as possble. Let us use some example aggregate functons to explan ths defnton. Example 3. Consder the aggregate functon Count and measure X wth n MSPs X 1 ;...;X n, where CountðXÞ ¼SumðfCountðX Þj ¼ 1...ngÞ: The auxlary aggregate functon s Count. For any partton g of CubeðXÞ, the number of tuples n g s not larger than the total number of tuples of all MSPs; n other words, CountðgÞ SumðfCountðX Þj ¼ 1...ngÞ. On the other hand, we also have CountðgÞ MnðfCountðX Þj ¼ 1...ngÞ: As a result, SumðfCountðX Þj ¼ 1...ngÞ s an upper bound, and MnðfCountðX Þj ¼ 1...ngÞ s a lower bound. They can be obtaned by a sngle scan of the auxlary aggregates of MSPs. Therefore, CountðCubeðXÞÞ s boundable. We now gve an example of aggregate functons whose bounds can be nfnte. Example 4. Gven measure X and aggregate functon 1/Sum, consder CubeðXÞ wth MSPs X 1 ;...;X n. Then, 1=SumðXÞ ¼1=SumðfSumðX Þj ¼ 1...ngÞ. As wll be explaned later n Tables 3 and 4, the sgns of SumðX Þ are mportant for computng the bounds: 1. If SumðX Þ > 0 for all ¼ 1...n, then the upper and lower bounds are computed from, respectvely, the postve lower and upper bounds of SumðCubeðXÞÞ: 1=SumðCubeðXÞÞ ¼ 1=MnðfSumðX Þj ¼ 1...ngÞ 7. We wll wrte FðCubeðXÞÞ and FðCubeðXÞÞ to mean, respectvely, the upper and lower bounds computed by our algorthm. That algorthm s the sum of the boundng formulas presented n ths paper. TABLE 2 The Bounds of Base Aggregate Functons Gven a data set, X s the measure, and X 1 ;...;X n are the MSPs. (a) There s such that SumðX Þ > 0. (b) There s such that SumðX Þ < 0. and 1=SumðCubeðXÞÞ ¼ 1=SumðfSumðX Þj ¼ 1...ngÞ: 2. If SumðX Þ < 0 for all ¼ 1...n, then the upper and lower bounds are computed from, respectvely, the negatve lower and upper bounds of SumðCubeðXÞÞ as 1=SumðCubeðXÞÞ ¼ 1=SumðfSumðX Þj ¼ 1...ngÞ and 1=SumðCubeðXÞÞ ¼ 1=MaxðfSumðX Þj ¼ 1...ngÞ: 3. For general cases, where SumðX Þ can be postve or negatve, the upper bound 1=SumðCubeðXÞÞ s computed from the smallest SumðxÞ 0 for any partton n the cube. For any group g of CubeðXÞ, SumðgÞ can be the sum of any postve or negatve MSP aggregates and can have a mnmum of 0. Thus, the upper bound for 1=SumðCubeðXÞÞ s 1. Smlarly, the lower bound 1=SumðCubeðXÞÞ s computed from the largest SumðxÞ < 0 for any partton n the cube, whch can be a negatve approachng 0. Therefore, the lower bound for 1=SumðCubeðXÞÞ s 1. Computng the upper and lower bounds for arthmetc expressons of base aggregate functons such as 1/Sum s summarzed later n Table 4. 4 BOUNDING BASE AGGREGATE FUNCTIONS We now consder how we can bound the base aggregate functons of SQL. Theorem 1. The base aggregate functons Count, Mn, Max, and Sum are boundable, and ther bounds are lsted n Table 2. We use the followng example for Sum to explan the reasonng for the bounds of Table 2 and how they are obtaned by a sngle scan of MSPs. Example 5. Suppose we are to compute a data cube on measure X, whch may take postve or negatve values. Let the MSPs of X be X 1 ;...;X n. Wth a sngle scan of X 1 ;...;X n, we have SumðX Þð1 nþ and the postve/negatve and maxmal/mnmal values among them: Authorzed lcensed use lmted to: IEEE Xplore. Downloaded on November 18, 2008 at 20:54 from IEEE Xplore. Restrctons apply.

5 ZHANG ET AL.: EFFICIENT COMPUTATION OF ICEBERG CUBES BY BOUNDING AGGREGATE FUNCTIONS 907 TABLE 3 The Sgn Bounds of Base Aggregate Functons TABLE 4 Boundng Arthmetc Expressons X s the measure of a data set, and the MSPs for the data cube are X 1 ;...;X n. 1. If there exsts such that SumðX Þ > 0, then the sum for any group n CubeðXÞ s not greater than the sum of all such postve SumðX Þ. Otherwse, SumðX Þ0 for ¼ 1...n; the sum for any group s not greater than the maxmal SumðX Þ. Thus, SumðCubeðXÞÞ ¼ Sum SumðX Þ>0 SumðX Þ f there exsts SumðX Þ > 0; SumðCubeðXÞÞ ¼ MaxSumðX Þ; otherwse: 2. If there exsts such that SumðX Þ < 0, then the sum of any group n CubeðXÞÞ s not less than the sum of all such SumðX Þ. Otherwse, SumðX Þ0 for all ¼ 1...n; the sum of any group s not less than the mnmal SumðX Þ. Therefore, SumðCubeðXÞÞ ¼ Sum SumðX Þ<0 SumðX Þ; f there exsts SumðX Þ < 0; SumðCubeðXÞÞ ¼ MnSumðX Þ; otherwse: To bound functons that are arthmetc expressons of base aggregate functons, sometmes, the sgn bounds of aggregatons need to be computed (see Secton 5). These nclude postve upper bound ðf þ Þ, postve lower bound ðf þ Þ, negatve upper bound ðf Þ, and negatve lower bound ðf Þ, where F s ether a base aggregate functon Mn, Max, Sum, or Count or an arthmetc expresson of base aggregate functons. F þ CubeðXÞ s defned to be a real l 0 obtaned by a sngle-scan algorthm such that l F ðgþ 0 for any partton g n CubeðXÞ. The other sgn bounds can be defned smlarly. The postve (negatve) bounds are applcable only f there are nonnegatve (negatve) aggregates among the MSPs. The sgn bounds of base aggregate functons are lsted n Table 3. We explan how the sgn bounds for Sum are computed. Computaton for other functons s straghtforward. We dscuss how we can compute the postve bounds of Sum, wth dscussons on the negatve bounds mrror the same The short notatons E 1, E 2, and ðe 1 op E 2 Þ denote E 1 ðcubeðxþþ, E 2 ðcubeðxþþ, and ðe 1 op E 2 ÞðCubeðXÞÞ, respectvely, where X s the measure of a data set. When the operator s or /, to bound ðe 1 op E 2 Þ, not all four knds of computaton are applcable. As ndcated, the sgns of E 1 ðx Þ and E 2 ðx Þð ¼ 1...nÞ decde the applcable computaton. case as n postve bounds. The postve upper bound s the sum of all postve sums of the MSPs. To compute the postve lower bound, two cases exst: 1) f SumðX Þ0 for all X ð1 nþ, then the postve lower bound s the mnmal SumðX Þ0 and 2) f the sums of MSPs nclude both negatve and postve ones, then the sum of any set of MSPs can be as low as 0, and so, 0 s the postve lower bound. 5 BOUNDING ARITHMETIC EXPRESSIONS OF BASE AGGREGATE FUNCTIONS Consder a data cube CubeðXÞ and an arthmetc expresson of auxlary aggregates E ¼ E 1 op E 2, where E 1 and E 2 are base aggregate functons or arthmetc expressons of base aggregates. If the operator s þ or, then the upper (lower) bounds of EðCubeðXÞÞ can be computed from the upper (lower) bounds of E 1 ðcubeðxþþ and E 2 ðcubeðxþþ. If the operator s or /, then we need to use the sgn bounds of E 1 ðcubeðxþþ and E 2 ðcubeðxþþ to bound EðCubeðXÞÞ. Proposton 1. Gven CubeðXÞ on measure X and an arthmetc expresson E ¼ E 1 op E 2, where the operator s þ,,, or /, and E 1 and E 2 are base aggregate functons or arthmetc expressons of base aggregates, EðCubeðXÞÞ and EðCubeðXÞÞ can be computed from the bounds or sgn bounds of E 1 ðcubeðxþþ and E 2 ðcubeðxþþ, as shown n Table 4. Proof. Let X 1 ;...;X n be the MSPs of CubeðXÞ. We consder each of the four possble operators: 1. E 1 þ E 2. By the defnton of upper bound, for any group g 2 CubeðXÞ, E 1 ðgþ E 1 ðcubeðxþþ, and E 2 ðgþ E 2 ðcubeðxþþ. We have E 1 ðgþþe 2 ðgþ E 1 ðcubeðxþþ þ E 2 ðcubeðxþþ: Thus, ðe 1 þ E 2 ÞðCubeðXÞÞ ¼ E 1 ðcubeðxþþ þ E 2 ðcubeðxþþ: The proof s smlar for ðe 1 þ E 2 ÞðCubeðXÞÞ ¼ E 1 ðcubeðxþþ þ E 2 ðcubeðxþþ: Authorzed lcensed use lmted to: IEEE Xplore. Downloaded on November 18, 2008 at 20:54 from IEEE Xplore. Restrctons apply.

6 908 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 19, NO. 7, JULY E 1 E 2. Gven g 2 CubeðXÞ, E 1 ðgþ E 1 ðcubeðxþþ and E 2 ðgþ E 2 ðcubeðxþþ. Thus, E 1 ðgþ E 2 ðgþ E 1 ðcubeðxþþ E 2 ðcubeðxþþ and ðe 1 E 2 ÞðCubeðXÞÞ ¼ E 1 ðcubeðxþþ E 2 ðcubeðxþþ: The proof s smlar for ðe 1 E 2 ÞðCubeðXÞÞ ¼ E 1 ðcubeðxþþ E 2 ðcubeðxþþ: 3. E 1 E 2. Let g be a group of CubeðXÞ. Let Y 1 ;Y 2 ;...;Y k be the MSPs that are contaned n g. Based on ther sgns under E 1, we form two subsets from these MSPs: PE 1 ¼fY j je 1 ðy j Þ0; 1 j kg; NE 1 ¼fY j je 1 ðy j Þ < 0; 1 j kg: Smlarly, based on ther sgns under E 2, we form the followng subsets from these MSPs: PE 2 ¼fY j je 2 ðy j Þ0; 1 j kg; and NE 2 ¼fY j je 2 ðy j Þ < 0; 1 j kg: For the upper bound, we frst consder the case where PE 1 6¼; and PE 2 6¼; or NE 1 6¼; and NE 2 6¼;. E 1 ðgþe 2 ðgþ can have postve values. Thus, E 1 ðgþe 2 ðgþ Max E1 þðcubeðxþþ Eþ 2 ðcubeðxþþ; E 1 ðcubeðxþþ E 2 ðcubeðxþþ : Otherwse, E 1 ðgþe 2 ðgþ can only be negatve. We have E 1 ðgþe 2 ðgþ Max E1 þ ðcubeðxþþ E 2 ðcubeðxþþ; E 1 ðcubeðxþþ Eþ 2 ðcubeðxþþ : For all cases, by combnng the two nequaltes wth E 1 ðgþe 2 ðgþ on the left-hand sde, we get ðe 1 E 2 ÞðCubeðXÞÞ: Max E1 þðcubeðxþþ Eþ 2 ðcubeðxþþ; E 1 ðcubeðxþþ E 2 ðcubeðxþþ; E þ 1 ðcubeðxþþ E 2 ðcubeðxþþ; E 1 ðcubeðxþþ Eþ 2 ðcubeðxþþ : The proof for the lower bound s qute smlar to that for the upper bound. The possble postve values of g are Mn E1 þ ðcubeðxþþ Eþ 2 ðcubeðxþþ; E1 ðcubeðxþþ E 2 ðcubeðxþþ and the possble negatve values of g are Mn E1 þðcubeðxþþ E 2 ðcubeðxþþ; E 1 ðcubeðxþþ Eþ 2 ðcubeðxþþ : By combnng the last two nequaltes, we get ðe 1 E 2 ÞðCubeðXÞÞ: Mn E1 þ ðcubeðxþþ Eþ 2 ðcubeðxþþ; E 1 ðcubeðxþþ E 2 ðcubeðxþþ; E þ 1 ðcubeðxþþ E 2 ðcubeðxþþ; E 1 ðcubeðxþþ Eþ 2 ðcubeðxþþ : 4. E 1 =E 2. The proof s smlar to that for the operator and s omtted. tu Example 6. Consder a data cube CubeðXÞ, where the measure X can be postve or negatve, and the aggregate functon Avg. Suppose the MSPs are X 1 ;...;X n. For any MSP X, we note that AvgðX Þ¼SumðX Þ=CountðX Þ, and CountðX Þ s always postve. Followng Table 4, AvgðCubeðXÞÞ s Max Sum þ ðcubeðxþþ=count þ ðcubeðxþþ; Sum ðcubeðxþþ=count þ ðcubeðxþþ and AvgðCubeðXÞÞ s Mn Sum þ ðcubeðxþþ=count þ ðcubeðxþþ; Sum ðcubeðxþþ=count þ ðcubeðxþþ : If there exsts some such that SumðX Þ > 0, then AvgðCubeðXÞÞ ¼ Sum þ ðcubeðxþþ=countðcubeðxþþ; and AvgðCubeðXÞÞ ¼ Sum þ ðcubeðxþþ=countðcubeðxþþ: Otherwse, SumðX Þ0 for all ¼ 1...n; then, we have AvgðCubeðXÞÞ ¼ Sum ðcubeðxþþ=countðcubeðxþþ; and AvgðCubeðXÞÞ ¼ Sum ðcubeðxþþ=countðcubeðxþþ: We then follow Table 3 to compute the sgn bounds n the above expressons. The sgn bounds for base aggregate functons are lsted n Table 3. We now dscuss how we can derve the sgn bounds for arthmetc expressons. Note that n Table 5, postve (negatve) bounds are applcable only f there exsts an MSP X n the cube such that ðe 1 ðx Þ op E 2 ðx ÞÞ 0 ððe 1 ðx Þ op E 2 ðx ÞÞ < 0Þ: Authorzed lcensed use lmted to: IEEE Xplore. Downloaded on November 18, 2008 at 20:54 from IEEE Xplore. Restrctons apply.

7 ZHANG ET AL.: EFFICIENT COMPUTATION OF ICEBERG CUBES BY BOUNDING AGGREGATE FUNCTIONS 909 TABLE 5 The Sgn Bounds of Arthmetc Expressons Short notatons E 1, E 2, and ðe 1 op E 2 Þ denote E 1 ðcubeðxþþ, E 2 ðcubeðxþþ, and ðe 1 op E 2 ÞðCubeðXÞÞ, respectvely, where X s the measure of a data set. For example, the postve upper bound ðe 1 þ E 2 ÞðCubeðXÞÞ þ s applcable only f there exsts X such that E1ðX ÞþE 2 ðx Þ0: For the operators þ and, when E 1 þ E 2 or E 1 E 2 can be postve or negatve for MSPs, the postve lower and negatve upper bounds for both expressons are 0: ths s because the combnaton of postve and negatve MSPs can produce the aggregate value of 0. For the operators and /, the sgns of E 1 and E 2 for MSPs decde the applcable computaton. We use an example to explan Table 5. Example 7. Gven data cube CubeðXÞ wth MSPs X 1 ;...;X n, consder the aggregate functon EðXÞ ¼1=ðMaxðXÞþMnðXÞÞ: To compute EðCubeðXÞÞ, three cases can arse: Case 1. MnðX Þ0 for all ¼ 1...n. It follows that MaxðX Þ0 for all ¼ 1...n. Then, EðCubeðXÞÞ ¼ 1= Max þ ðcubeðxþþ þ Mn þ ðcubeðxþþ : Case 2. MaxðX Þ0and MnðX Þ0for all ¼ 1...n. Then, EðCubeðXÞÞ ¼ 1= Max ðcubeðxþþ þ Mn ðcubeðxþþ : Case 3. MaxðX Þ and MnðX Þ can be postve or negatve ð1 nþ. Based on Table 4, we have EðCubeðXÞÞ ¼ 1=ðMaxðÞ þ MnðÞÞ þ ðcubeðxþþ. Based on Table 5, ðmaxðþ þ MnðÞÞ þ ðcubeðxþþ s 0, whch means that the upper bound for 1=ðMaxðÞ þ MnðÞÞðCubeðXÞÞ s 1. Proposton 1 leads to the followng theorem about boundng arthmetc expressons of base aggregate functons. Theorem 2. Gven a data cube on measure X, an arthmetc expresson E ¼ E 1 op E 2 of base aggregate functons, where the operator s þ,,, or/.e s boundable for CubeðXÞ, and ts bounds can be computed followng Tables 2, 3, 4, and 5. Proof. By Proposton 1, we can compute the bounds for an arthmetc expresson from the bounds or sgn bounds of the operand expressons, as lsted n Table 4. Table 5 shows how the sgn bounds of any expresson can be computed from the bounds for base aggregate functons. Tables 2 and 3 demonstrate that a sngle scan of the MSPs of a data cube can compute the bounds of all base aggregate functons. As a result, all arthmetc expressons of base aggregate functons are boundable, and the bounds can be computed followng Tables 2, 3, 4, and 5. tu We now show how we can compute the upper bound for the complex aggregate functon Var based on the theorem. Example 8. The Var for X ¼fx j ¼ 1...Ng s defned as 8 P ðx XÞ 2 VarðXÞ ¼ N, where X denotes the Avg of X. It can be rewrtten as follows: VarðXÞ ¼ ¼ ¼ ¼ P ðx2 2 X x þ X 2 Þ ; N P P x2 x N P x2 2 X N N 2 X2 þ X 2 ¼ Sum x 2 CountðXÞ SumðXÞ 2 : CountðXÞ þ N X2 N ; P x2 N X2 ; Let QSumðXÞ denote Sum x 2. It follows that VarðXÞ s an arthmetc expresson of QSumðXÞ, SumðXÞ, and CountðXÞ. Consder a data cube CubeðXÞ wth MSPs X 1 ;...;X n. Based on Table 4, we have QSumðÞ VarðCubeðXÞÞ ¼ ðcubeðxþþ CountðÞ SumðÞ 2ðCubeðXÞÞ: CountðÞ We next consder each of the two subexpressons separately. We frst consder ð QSumðÞ CountðÞ ÞðCubeðXÞÞ: For any MSP X, QSumðX Þ and CountðX Þ are always postve. By followng Table 4, we have the followng fractonal expresson, where the numerator and denomnator expressons can be computed followng Table 2: QSumðÞ CountðÞ ðcubeðxþþ ¼ QSumðCubeðXÞÞ=CountðCubeðXÞÞ. We then consder ð SumðÞ CountðÞ Þ2 ðcubeðxþþ. Based on Table 4, we have SumðÞ 2 Mn ðsum=countþ þ ðsum=countþ þ ; CountðÞ ðsum=countþ ðsum=countþ : Count s always postve. Byfollowng Table 5 for 2ðCubeðXÞÞ: computng sgn bounds, we get SumðÞ CountðÞ 8. There s another defnton of Varance: VarðXÞ ¼ dscussons here also apply to ths defnton. P ðx XÞ2 N 1. The Authorzed lcensed use lmted to: IEEE Xplore. Downloaded on November 18, 2008 at 20:54 from IEEE Xplore. Restrctons apply.

8 910 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 19, NO. 7, JULY ; Mn Sum þ ðcubeðxþþ=countðcubeðxþþ 2 Sum ðcubeðxþþ=countðcubeðxþþ : In the above equaton, CountðCubeðXÞÞ ¼ Sum CountðX Þ: Moreover, based on Table 3, Sum þ ðcubeðxþþ ¼ Mn SumðX Þ f SumðX Þ0 for all ¼ 1...n; otherwse, t s 0. Smlarly, Sum ðcubeðxþþ ¼ Max SumðX Þ f SumðX Þ < 0 for all ¼ 1...n; otherwse, t s 0. In summary, we compute VarðCubeðXÞÞ as follows: f SumðX Þ0 for ¼ 1::n; Sum SumðX 2Þ=Mn CountðX Þ 2 ; Mn SumðX Þ= Sum CountðX Þ otherwse; f SumðX Þ < 0 for ¼ 1::n; Sum SumðX 2Þ=Mn CountðX Þ 2 ; Max otherwse; Sum SumðX Þ= Sum CountðX Þ SumðX 2Þ=Mn CountðX Þ: Wth these equatons, obvously tghter VarðCubeðXÞÞ s obtaned f the sums of MSPs are of the same sgn. 6 OPTIMIZATION FOR BOUNDING COMPLEX FUNCTIONS When rewrtng several commonly used algebrac aggregaton functons, namely, Avg, Var, and Standard-Devaton, nto arthmetc expressons of dstrbutve aggregate functons, we notce that very often, they use the followng subexpresson: F ðxþ ¼Sum G 1 ðx Þ= Sum G 2 ðx Þ; where G 1 and G 2 are some dstrbutve aggregate functons, and X 1 ;...;X n are MSPs. (We can rewrte AvgðXÞ as SumðXÞ=CountðXÞ and VarðXÞ as QSumðXÞ=CountðXÞ ðsumðxþ=countðxþþ 2 : Moreover, we have SumðXÞ=CountðXÞ ¼Sum SumðX Þ=Sum CountðX Þ; whch s F, wth G 1 ¼ Sum and G 2 ¼ Count.) Therefore, t s desrable to gve tghter bounds for F than those obtaned by Table 4. The next result provdes not only tght bounds but also the optmal. Theorem 3. Let CubeðXÞ be a data cube over measure X wth MSPs X 1 ;...;X n, let G 1 and G 2 be aggregate functons, and let F be as defned above. If G 2 ðx Þ > 0 for all ¼ 1...n, then we can bound F usng FðCubeðXÞÞ ¼ Max FðX Þ; F ðcubeðxþþ ¼ Mn F ðx Þ: Moreover, these bounds are the optmal bounds for F ðcubeðxþþ. Proof. Wthout loss of generalty, we can assume that Mn FðX Þ¼FðX 1 Þ, and Max F ðx Þ¼FðX n Þ. Gven a group g of CubeðXÞ, let X g1 ;...;X gk be the MSPs contaned n g. Thus, g ¼ X g1 [...[ X gk. Then, F ðgþ ¼ Sum j G 1 ðx gj Þ=Sum j G 2 ðx gj Þ; ¼ G 1ðX g1 Þþ...þ G 1 ðx gk Þ ; G 2 ðx g1 ÞþþG 2 ðx gk Þ ¼ F ðx g 1 ÞG 2 ðx g1 Þþ...þ F ðx gk ÞG 2 ðx gk Þ : G 2 ðx g1 Þþ...þ G 2 ðx gk Þ The dvson n the last step s permtted, snce G 2 ðx g Þ > 0 for all. Therefore, F ðgþ FðX n Þ ¼ F ðx g 1 ÞG 2 ðx g1 Þþ...þ F ðx gk ÞG 2 ðx gk Þ G 2 ðx g1 Þþ...þ G 2 ðx gk Þ F ðx n Þ G 2ðX g1 Þþ...þ G 2 ðx gk Þ G 2 ðx g1 Þþ...þ G 2 ðx gk Þ ¼ ðfðxg 1 Þ FðXnÞÞG 2 ðxg 1 Þþ...þðF ðxg k Þ F ðxnþþg 2 ðxg k Þ G 2 ðxg 1 Þþ...þG 2 ðxg k Þ : Snce F ðx gj ÞF ðx n Þ and G 2 ðx gj Þ > 0 for all j ¼ 1...k, we have G 2 ðx g1 Þþ...þ G 2 ðx gk Þ > 0 and ðf ðx gj Þ F ðx n ÞÞ G 2 ðx gj Þ0 for all j. Therefore, FðgÞ F ðx n Þ0 or, equvalently, F ðgþ F ðx n Þ. Smlarly, t can be shown that F ðgþ F ðx 1 Þ. Thus, we have proven that F ðx n Þ and F ðx 1 Þ are the upper and lower bounds for F ðcubeðxþþ, respectvely. As Mn F ðx Þ s the smallest aggregate for an MSP and thus s a real aggregate n F ðcubeðxþþ, all other lower bounds for F ðcubeðxþþ must be greater than Mn F ðx Þ. Mn F ðx Þ s thus the tghtest lower bound for F ðcubeðxþþ. Smlarly, we can prove that Max FðX Þ s the tghtest upper bound for F ðcubeðxþþ. Next, we show that the bounds obtaned by followng Tables 2 and 4 are weaker. Consderng Max FðX Þ (and stll assumng that Max F ðx Þ¼FðX n Þ), the case for Mn F ðx Þ s smlar. By followng Tables 2 and 4, F ðcubeðxþþ s computed: Case 1. There exsts j such that G 1 ðx j Þ > 0. F ðcubeðxþþ s Sum G 1 ðx Þ=Sum G 2 ðx Þ¼ Sum G 1 ðx Þ= Mn G 2 ðx Þ: G 1 ðx Þ0 Snce G 2 ðx Þ > 0 for all ¼ 1...n, and G 1 ðx j Þ > 0, F ðx j Þ > 0. Snce Max FðX Þ¼FðX n Þ, t follows that G 1 ðx n Þ > 0. Snce Sum G 1 ðx ÞG 1 ðx n Þ and G 1 ðx Þ0 Mn G 2 ðx ÞG 2 ðx n Þ, we have Sum G 1ðX Þ0 G 1 ðx Þ= Mn G 2 ðx Þ G 1ðX n Þ G 2 ðx n Þ ¼ F ðx nþ: Authorzed lcensed use lmted to: IEEE Xplore. Downloaded on November 18, 2008 at 20:54 from IEEE Xplore. Restrctons apply.

9 ZHANG ET AL.: EFFICIENT COMPUTATION OF ICEBERG CUBES BY BOUNDING AGGREGATE FUNCTIONS 911 TABLE 6 Boundng Complex Aggregate Functons Fg. 3. A G-tree for the data set n Table 1. CubeðXÞ s a data cube on measure X wth MSPs fx 1 ;...;X n g. Hence, Max F ðx Þ¼F ðx n Þ s a tghter bound. Case 2. For all ð1 nþ, G 1 ðx Þ0. FðCubeðXÞÞ s Sum As G 1 ðx Þ=Sum G 2 ðx Þ¼Max G 1 ðx Þ= Max G 2 ðx Þ: 0 Max G 1 ðx ÞG 1 ðx n Þ and 0 <G 2 ðx n ÞMax G 2 ðx Þ, 0 Max G 1 ðx Þ= Max G 2 ðx ÞG 1 ðx n Þ=G 2 ðx n Þ¼FðX n Þ: Hence, Max F ðx Þ¼FðX n Þ s a tghter bound. tu By followng Theorem 3, tghter bounds for Avg and Var can be computed, as shown n Table 6. In Secton 7, we present an effcent mplementaton of the BP-Cubng algorthm, utlzng the boundng results dscussed so far. 7 BOUND-PRUNE CUBING ALGORITHMS We frst present the group tree (G-tree) data structure that we wll use for ceberg cubng. We then explan how bound prune cubng (BP-Cubng) s mplemented on the G-tree. Fnally, we brefly dscuss how our algorthms utlze antmonotone constrants. The ceberg cube wth the constrant Avg(Sale) n [15, 20] on the Sales data set n Table 1 s used as a runnng example throughout ths secton. 7.1 The G-Tree The underlyng data structure for BP-Cubng s the G-tree. A G-tree s the compresson of a gven nput data set and s constructed by one scan of the data set. It s used for both the top-down and bottom-up bound prune cubng algorthms (BP-Cubng(TD) and BP-Cubng(BU), respectvely), although there are some mnor dfferences dependng on the traversal strategy. The G-tree for the data set n Table 1 s shown n Fg. 3. A G-tree for an n-dmensonal data set s of depth n, where each level represents a dmenson. A path startng from the root collapses the tuples wth common dmenson values along the path. Each tree node keeps the auxlary aggregates necessary to compute the ceberg cube. In our example, the auxlary aggregates n a node are Sum(Sale) and CountðÞ. For the leftmost path from the root of the G-tree n Fg. 3, the node (March) shows that there are 70 tuples wth SumðSaleÞ ¼300 n the ðmarch; ; ; Þ partton, whereas the node (Peter) shows that there are 40 tuples wth SumðSaleÞ ¼100 n the ðmarch; TV; Peter; Þ partton. For a gven data set, dfferent dmenson orders result n dfferent G-trees. Dependng on whether the G-tree s traversed top-down or bottom-up n computng cubes, the tree should be constructed n dfferent orders of dmensons. Antmonotone prunng s effectve when the most dscrmnatng dmenson s examned frst. Ths suggests the cardnalty-descendng order of dmensons durng cube computaton. As a result, for bottom-up traversal, the cardnalty-ascendng order should be used n constructng the G-tree. In contrast, for top-down traversal, the cardnalty-descendng order should be used n constructng the G-tree. Our experments have confrmed that such heurstcs ndeed have a postve mpact on the effectveness of prunng and the effcency of cubng algorthms. 7.2 Top-Down Bound-Prune Cubng on G-Trees An observaton of Fg. 3 reveals some useful facts: The path from the root to a node of the G-tree aggregates a group, and each level of the G-tree computes a group-by of the cube lattce. For the G-tree n Fg. 3, the root node has the aggregate for the group ð; ; ; Þ, wth CountðÞ ¼ 88, and SumðSaleÞ ¼700. The nodes at level 1 compute the aggregates for groups n ðmonth; ; ; Þ, whch are ðmarch; ; ; ; 70; 300Þ, ðjanuary; ; ; ; 5; 200Þ, and ðaprl; ; ; ; 13; 200Þ. The nodes at the next three levels compute the aggregates for the group-bys (Month, Product), (Month, Product, SalesMan), and (Month, Product, SalesMan, Cty), respectvely. The leaf nodes gve the MSPs for Cube(Month, Product, SalesMan, Cty). We summarze these n the followng observatons. Observaton 2. In a G-tree, the aggregates n each node are the aggregates for the group wth dmenson values on the path from the root to the node. Observaton 3. In a G-tree, for n dmensons A 1 ;...;A n, the leaf nodes are the MSPs for CubeðA 1 ;...;A n Þ. Authorzed lcensed use lmted to: IEEE Xplore. Downloaded on November 18, 2008 at 20:54 from IEEE Xplore. Restrctons apply.

10 912 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 19, NO. 7, JULY 2007 Fg. 4. Top-down bound prune cubng of CubeðABCDÞ: For each node, we show a G-tree (before : ), the group-bys (after : ) computed n the G-tree, and the shared dmensons (after the / ). The numbers show the order n whch trees are bult. G-trees are constructed to compute all group-bys n a data cube. When constructng a G-tree for an n-dmensonal data set, we smultaneously compute n group-bys, namely, the group-bys whose dmensons are prefxes of the lst of dmensons ordered by the levels of the tree. To compute the other group-bys n the cube, we collapse one dmenson from a gven G-tree at a tme to construct a sub-g-tree 9 and compute the correspondng group-bys of the sub-g-tree. For example, by collapsng dmenson Product n the orgnal G-tree, gven n Fg. 3, we get the subtree fmonth; ð ProductÞ; SalesMan; Cty; Customerg, shown n Fg. 5. Fg. 4 shows the (sub)g-trees constructed for computng the data cube on dmensons A, B, C, and D. Each node n Fg. 4 represents a G-tree and the set of all correspondng group-bys. The ABCD tree at the top s constructed by one scan of the data set, and the correspondng group-bys (A, B, C, D), whch are the (A, B, C), (A, B), (A), and () trees are computed durng ths scan. The sub-g-trees of the ABCD tree, whch are the ð AÞBCD, Að BÞCD, and ABð CÞD trees, are formed by collapsng dmensons A, B, and C, respectvely. The CD tree s a subtree of the BCD tree and recursvely s also a subtree of the ABCD tree. The dmensons after / n the nodes denote common prefx dmensons for the tree at the node and all of ts subtrees. A s the common prefx dmenson for the ACD tree and ts subtrees. All group-bys that are computed on the ACD tree and ts subtrees form subcubes for A values CubeðCDÞja. The leaf nodes orgnatng from a are the MSPs for CubeðCDÞja. In the sub-g-tree constructon process, we also compute the bounds and use them for prunng, as descrbed n the observaton below. Observaton 4. Wth top-down aggregaton, gven a G-tree G wth n dmensons A 1 ;...;A n and a subtree G k, by collapsng a dmenson A k, 1 < k < n, A 1 ;...;A k 1 are the common prefx dmensons for G k and all ts subtrees. Each node ða 1 ;...;a k 1 Þ of G gves the prefx dmenson values for CubeðA kþ1 ;...;A n Þja 1 ;...;a k 1. The core of the cube conssts of the MSPs correspondng to the leaf nodes of the branches orgnatng from the node ða 1 ;...;a k 1 Þ. If the bounds of CubeðA kþ1 ;...;A n Þja 1 ;...;a k 1 fal the gven constrant, then those branches can be pruned. Example 9. Consder the G-tree G n Fg. 3 as the orgnal tree and the G-tree G p n Fg. 5 as the subtree. Month s the prefx dmenson for the group-bys on G p and the subtree of G p. The subcubes are 9. It should be ponted out that a sub-g-tree s not part of an orgnal G-tree but s obtaned by collapsng a gven dmenson of the orgnal G-tree. Fg. 5. The subtree on fmonth; ð ProductÞ; SalesMan; Ctyg obtaned by collapsng dmenson Product n the orgnal G-tree, gven n Fg CubeðSalesMan; CtyÞjMarch, 2. CubeðSalesMan; CtyÞjJanuary, and 3. CubeðSalesMan; CtyÞjAprl. In G, the leaf nodes of the March subtree are the MSPs of CubeðSalesMan; CtyÞjMarch. Wth the constrant Avg(Sale) n [15, 20], G s pruned n constructng the subtree G p. By followng Table 6, the bounds for CubeðSalesMan; CtyÞjMarch are computed from the three leaf nodes orgnatng from node (March) of G: AvgðCubeðSaleÞjMarÞ ¼Maxðf100=40; 100=20; 100=10gÞ ¼ 10: AvgðCubeðSaleÞjMarÞ ¼Mnðf100=40; 100=20; 100=10gÞ ¼ 2:5: As [2.5, 10] volates Avg(Sale) n [15, 20], all three branches orgnatng from (March) are removed from further computaton. Smlarly, Avg(Sale) of CubeðSalesman; CtyÞjJanuary s bounded as [40, 40], whch volates the constrant, and s pruned. Avg(Sale) of CubeðSalesman; CtyÞjAprl s bounded as [12.5, 25], whch does not volate the constrant, and the branches from (Aprl) are kept n G p. The nodes n dashed lnes n Fg. 5 hghlght the fact that the branches from (March) and (January) are pruned before the G p tree s generated, and mportantly, they are permanently pruned from all future recursve computaton. 7.3 The Top-Down Bound-Prune Cubng Algorthm In ths secton, we present the Top-Down Bound-Prune Cubng (BP-Cubng(TD)) algorthm, shown n Algorthm 1, for computng ceberg data cubes. It performs boundprunng n the top-down multway aggregaton manner, and t uses the G-trees. Algorthm 1. The Top-Down Bound-Prune Cubng Algorthm Input: A data set D over dmensons A 1 ;...;A n and aggregaton constrant C, assumed global. Output: An n-dmensonal ceberg cube on D satsfyng C. (1) Buld the G-tree TðA 1 ;...;A n Þ from D; (2) Output aggregates satsfyng C, computed when T was bult; (3) for ¼ 1...ðn 1Þ, do (4) BP-CubngðTðA 1 ;...;A n Þ;A Þ; Authorzed lcensed use lmted to: IEEE Xplore. Downloaded on November 18, 2008 at 20:54 from IEEE Xplore. Restrctons apply.

11 ZHANG ET AL.: EFFICIENT COMPUTATION OF ICEBERG CUBES BY BOUNDING AGGREGATE FUNCTIONS 913 Fg. 6. The G-tree for the data set n Table 1, wth header table and sde lnks. // B, the th dmenson of T s the dmenson to be collapsed Procedure BP-Cubng ðtðb 1 ;...;B k Þ;B Þ (5) T s nl; //construct subtree T s by collapsng B. (6) for each node n 1 of dmenson B 1 n T, do (7) V 1 dmenson values on the path from root to n 1 ; // the MSPs for CubeðB þ1 ;...;B k ÞjV 1 (8) Let M be the set of leaves orgnatng from n 1 ; (9) Compute the bounds for CubeðB þ1 ;...;B k ÞjV 1 from M; // prune the paths f the bounds volate C (10) f the bounds do not volate C then (11) Collapse B and add the paths passng n 1 of T to T s ; (12) Output aggregates on T s that satsfy C; (13) for j ¼ðþ1Þ...ðk 1Þ do (14) BP-CubngðT s ðb 1 ;...;B 1 ;B þ1 ;...;B k Þ;B j Þ Let D be a data set wth n dmensons A 1 ;...;A n and C be the ceberg constrant. To compute the ceberg cube, frst, a G-tree T s bult by one scan of D (lne 1). As mentoned n Observaton 2, we smultaneously aggregate n group-bys, namely, ða 1 Þ, ða 1 ;A 2 Þ;...; ða 1 ;A 2 ;...;A n Þ, when buldng the G-tree. Those groups that satsfy C are output (lne 2). The Bound-Prune Cubng (BP-Cubng) procedure computes the ceberg cube by recursvely collapsng dmensons and buldng sub-g-trees. Gven a G-tree, before a subtree s bult, bounds for branches are computed, and those branches whose bounds fal the constrant are pruned from further computaton (lnes 9-11). The leaf nodes of a G-tree are the MSPs for the G-tree, and these MSPs are used to compute the correspondng aggregate bounds. T s s a subtree of T, obtaned by collapsng dmenson B ; the collapsng computaton s descrbed n lnes Aggregates are computed durng the constructon of T s. Then, BP- Cubng s recursvely called (lnes 13-14), where group-bys of dmensons after B are computed. 7.4 Bottom-Up Bound-Prune Cubng on G-Trees The BP-Cubng(BU) algorthm s smlar to Algorthm 1. Wth the bottom-up cubng strategy, a group s computed before ts subgroups. There are several dfferences concernng 1) the G-tree used, 2) how sub-g-trees are constructed, and 3) how boundng wth MSPs are done. We descrbe these below. Fg. 7. The subtree of the G-tree n Fg. 6, wth Cty ¼ Sydney. We frst consder the G-tree used. In BUC, a G-tree wll also contan a header table and sde lnks. A header table entry represents a one-dmensonal group together wth the correspondng aggregates, and t s lnked to nodes wth the correspondng dmenson value. Fg. 6 shows such a G-tree for the sample Sales data set. The sde-lnks wll be used to effcently derve aggregates of other groups n subsequent computaton. We now consder how sub-g-trees are constructed. In the frst G-tree for a data set, the header table contans onedmensonal groups. In the sub-g-trees constructed from a gven G-tree, the header table contans groups whose dmensonalty s one level hgher than that of groups for the gven G-tree. For the G-tree n Fg. 6, we buld sub-gtrees for the 2D subgroups of the (Cty) groups, namely, the groups of (Salesman, Cty), (Product, Cty), and (Month, Cty). Smlarly, we buld sub-g-trees for the 3D subgroups of the (Salesman, Cty) groups, and so on. The sub-g-trees are obtaned by mergng paths n the orgnal G-tree. For example, to compute the 2D subgroups of Sydney, the subtree n Fg. 7 s constructed by mergng the paths from the root to the nodes on the Sydney sde lnk. In ths process, the aggregates for the nodes above the (Sydney) nodes need to be modfed to remove the contrbuton of the non-sydney ctes. For example, the node (John) has the aggregates of (30, 200) for both Perth and Sydney but only has the aggregates of (10, 100) for Sydney. We now turn to boundng wth MSPs. Recall that, wth the top-down aggregaton strategy, all MSPs for a (sub) data cube are convenently avalable for boundng (by convenently avalable, we mean that they are avalable wthout extra overhead). In the bottom-up aggregaton, the convenently avalable MSPs are those ponted to by sde lnks of certan header table entres. The observaton below descrbes how bounds are computed, and prunng s acheved. Observaton 5. For a G-tree of dmensons A 1 ;...;A n, the leaf nodes are the MSPs for CubeðA 1 ;...;A n Þ. Wth the BUC strategy, for a k 2 domanða k Þð1 k nþ, the subgroups of ða k Þ to be computed from the branches on the sde lnk of a k comprse CubeðA 1 ;...;A k 1 Þja k, and MSPs of the cube are the leaf nodes on the sde lnk for a k (avalable for boundng). The subtrees from the sde lnk of a k can be pruned f the bounds of CubeðA 1 ;...;A k 1 Þja k fal the gven constrant. Example 10. Wth CubeðMonth; Product; SalesmanÞjSydney, shown n Fg. 6, ts MSPs are the leaf nodes (Sydney, 10, 100) and (Sydney, 5, 100) on the sde lnk for Sydney. Thus, Authorzed lcensed use lmted to: IEEE Xplore. Downloaded on November 18, 2008 at 20:54 from IEEE Xplore. Restrctons apply.

12 914 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 19, NO. 7, JULY 2007 TABLE 7 Census, TPC-R, and Weather Multdmensonal Data Sets Fg. 8. Bottom-up bound prune cubng (BP-Cubng) on four dmensons A, B, C, and D CubeðMonth; Product; SalesmanÞjSydney ¼ 100=5 ¼ 20; CubeðMonth; Product; SalesmanÞjSydney ¼ 100=10 ¼ 10: As [10, 20] does not volate the constrant Avg(Sale) n [15, 20], the subtree for Sydney (shown n Fg. 7) s thus created. In Fg. 8, we use an example to gve some hgh-level vew of the BUC process for a four-dmensonal data cube. Each node represents the group that s computed on some G-tree. From top to bottom, nodes are lnked by the subtree relatonshp. The superscrpts over the nodes ndcate the order n whch the G-trees are created. Wth a tree, the dmensons after / are the shared dmensons for the tree and all ts subtrees. They are also the condtonal dmensons where the tree s created. Generally, the BP-Cubng(BU) algorthm s gven n Algorthm 2. Algorthm 2. The Bottom-Up Bound-Prune Cubng Algorthm Input: A data set D over dmensons A 1 ;...;A n and aggregaton constrant C. Output: The n-dmensonal ceberg cube on D, satsfyng C. (1) Buld the G-tree TðA 1 ;...;A n Þ from D; (2) Output all aggregates n TðA 1 ;...;A n Þ:Header; (3) foreach a 2 T:Header, do (4) BP-CubngðTðA 1 ;...;A n Þ; fagþ; Procedure BP-Cubng ðtðb 1 ;...;B k Þ;SÞ // S s a set of dmenson values as the condton for subcubes. (5) T s ¼ nl; (6) Suppose a s on the th dmenson of T; (7) Let M be the nodes followng the sde lnk of a; (8) Compute the bounds for CubeðB 1 ;...;B 1 ÞjS from the leaves orgnatng from M; (9) f the bounds do not volate C, then // prunng // Secton 7.4 (10) Construct the subtree T s ðb 1 ;...;B 1 Þ; (11) Output aggregates on T s :Header that satsfy C; (12) foreach a s 2 T s :Header do (13) BP-CubngðT s ðb 1 ;...;B 1 Þ;S[fa s gþ; 7.5 Interacton wth Antmonotone Constrants Our dscussons have focused on complex nonantmonotone aggregaton constrants. Antmonotone constrants can be easly ncorporated n BP-Cubng as follows: Suppose an antmonotone constrant C 0 s present, n addton to the nonantmonotone constrant C. Frst, n lnes 2 and 12 of Algorthm 1, a group g s checked aganst both C and C 0 before beng output. More mportantly, before the bounds are computed at lne 9, the branches of T orgnatng from n are pruned from further computaton f n fals C 0, as all subgroups of n wll also fal C 0. Wth top-down cubng, the prefx groups before the collapsng dmenson that fal C 0 are pruned. Wth bottom-up cubng, header-table entres that fal C 0 are pruned from further computaton. 8 EXPERIMENTS In ths secton, we evaluate the performance of both the BP- Cubng(TD) and BP-Cubng(BU) algorthms wth experments. We compare the performance of these algorthms wth that of the DnA algorthm [14], [15] on non-antmonotone aggregaton constrants and also wth that of the BUC algorthm [4] on antmonotone constrants. DnA s recent work on prunng for nonantmonotone aggregaton constrants. As DnA s not desgned for prunng for antmonotone constrants, extra processng s needed to prune for such constrants. On the other hand, BUC uses the same partton-based bottom-up aggregaton strategy as DnA and s desgned for antmonotone constrants. Thus, BP-Cubng s compared wth BUC on prunng wth antmonotone constrants. To do far comparson, all algorthms were mplemented wth all possble optmzaton technques. BUC was mplemented wth the collapsng duplcates optmzaton [4]. We added collapsng duplcates and ndexng tuples to DnA, even though such optmzatons were not reported n the orgnal papers [14], [15]. Block memory allocaton s used n the BP-cubng algorthms to reduce the number of calls of the dynamc memory allocaton functons. It s assumed that, for all algorthms, data structures used can ft nto memory. All experments were performed on a PC wth an 686 processor runnng GNU/Lnux. As outputtng the groups can take a sgnfcant amount of tme for bg data cubes, we choose to exclude the tme for output n the tmng for the algorthms. The thresholds for constrants are selected n a way such that there are at least 10 groups n the output. We expermented wth constrants nvolvng the aggregate functons Count, Avg, Var, and Sum. 8.1 Data Sets We used real-world, as well as artfcal, data sets n our experments. When selectng the data sets, we consdered the followng data characterstcs: dense versus sparse and random versus skewed. Table 7 summarzes the data sets used n our experments. The US census data set 10 was collected n a 1990 US households survey. The orgnal data set had 61 attrbutes such as hrswork1 (hours worked last week), nchld (number of own chldren on the household), and valueh (value of house). We selected 12 dscrete attrbutes as 10. ftp://ftp.pums.org/pums/data/p19001.z. Authorzed lcensed use lmted to: IEEE Xplore. Downloaded on November 18, 2008 at 20:54 from IEEE Xplore. Restrctons apply.

13 ZHANG ET AL.: EFFICIENT COMPUTATION OF ICEBERG CUBES BY BOUNDING AGGREGATE FUNCTIONS 915 Fg. 9. BP-Cubng versus BUC on the CountðÞ constrant. (a) Census. (b) TPC-R. (c) Weather. (d) Weather-5. dmensons and a numercal attrbute as the measure. The data set s dense and skewed. The Weather data set 11 contans real weather reports from varous weather statons n We used these nne attrbutes as dmensons: staton-d, longtude, solar-alttude, lattude, present-weather, weather-change-code, day, hour, and brghtness. Ther cardnaltes, respectvely, are 6,505, 351, 179, 152, 99, 10, 8, 3, and 2. We randomly generated values between 1 and 100 as the measure. Ths Weather data set s the same as that used n the experments n [4], where BUC was shown to be effcent for sparse data. The TPC-R data set 12 s an artfcal data set provded by the Transacton Processng Councl and s desgned for testng the performance of representatve complex queres n hgh-level busness decson-makng envronment. Its dmensons nclude customer, suppler, order, and shpment. The orgnal TPC-R data set conssts of several relatonal tables. We constructed a joned relaton as our multdmensonal data set. A numercal attrbute s selected as the measure. The TPC-R data set s relatvely dense and random. 8.2 BP-Cubng versus BUC on CountðÞ Ths set of experments s desgned to evaluate two aspects of BP-Cubng: the tree-based aggregaton strategy and prunng wth smple antmonotone constrants. Ths nvolves comparng the performance of the BP-Cubng algorthms wth BUC on computng ceberg cubes for the constrant CountðÞ. For the specal case of ¼ 1, the full data cubes are computed. Snce there s no prunng, the dfference n effcency should be solely attrbuted to the aggregaton strateges. Fg. 9 shows the runtme of the algorthms. Fg. 9a shows that BUC uses seconds to compute the full data cube over the dense and skewed Census data set, and BP-Cubng(BU) and BP-Cubng(TD) take and seconds, respectvely, whch are 1.9 and 3.76 tmes faster than BUC. Fg. 9b shows that BUC uses seconds to compute the full cube for the TPC-R data set. In contrast, BP-Cubng(BU) and BP-Cubng(TD) use only 17.9 and 5.89 seconds, respectvely, whch are 4.9 and 14.9 tmes faster. On the other hand, BUC s more effcent on the sparse Weather data than the BP-Cubng algorthms. Fg. 9c shows that BUC uses 6.16 seconds on computng the full cube, whereas BP-Cubng(BU) and BP-Cubng(TD) use and seconds, respectvely. When executed on the data set obtaned by removng the four dmensons wth the largest cardnaltes of 6,505, 351, 179, and 152 from the Weather data set, the BP-Cubng algorthms and BUC are all able to compute the full cube quckly: BUC uses 0.3 seconds, whereas both BP-Cubng algorthms use 0.02 seconds, as shown n Fg. 9d. Ths set of experments has shown that top-down multway aggregaton usng G-trees s a superb aggregaton strategy, and the expermental results suggest that treebased aggregaton s generally more effcent than recursve partton-based aggregaton. Our experments also confrmed the observaton n [4] that partton-based bottomup aggregaton s sutable for computng sparse data cubes. Even though BP-Cubng(BU) and BUC both use the same bottom-up strategy, BP-Cubng(BU) has better performance than BUC when dealng wth antmonotone constrants. Ths mples that aggregaton on G-trees s a superor strategy. In Fg. 9c, we see that ntally, BP-Cubng(TD) s more effcent than BP-Cubng(BU) for computng the full cube on Weather. However, for constrants wth the threshold 5, BP-Cubng(BU) outperforms BP-Cubng(TD). Ths may have happened because the sze of the G-tree s large for sparse data, and the top-down aggregaton strategy may have aggregated many groups, whch turns out to fal the constrant. 8.3 BP-Cubng versus DnA on AvgðXÞ n ½ 1 ; 2 Š In ths set of experments, we compare the performance of the BP-Cubng algorthms aganst DnA on the nonantmonotone constrant AvgðXÞ n ½ 1 ; 2 Š. Fg. 10 shows that the BP-Cubng algorthms scale very well when the ½ 1 ; 2 Š range threshold becomes looser. In all data sets, the BP-Cubng algorthms show modest lnear ncrease n computaton tme. In contrast, the performance of DnA degrades sgnfcantly (whch may be due to the fact that ts prunng s at the tuple record level: When many groups need to be processed, the search cost for the mnmal partton to approxmate a group becomes hgh (see Secton 9 for more dscussons)). Fgs. 10a and 10b show that n the dense data sets Census and TPC-R, the BP-Cubng algorthms sgnfcantly outperform DnA, and the BP-Cubng algorthm shows the best performance. For the constrant Avg (X) n [500, 50,000] over Census, DnA fnshes n seconds, whereas BP-Cubng(TD) fnshes n only 3.53 seconds, whch s tmes faster. At the lower end, for the constrant Avg(X) n [500, 10,000], BP-Cubng(TD) s 7.09 tmes faster. In TPC-R, BP-Cubng(TD) outperforms DnA by tmes. BP-Cubng(BU) also outperforms DnA overall, especally for larger constrant ranges. For the constrant Avg(X) n [500, 50,000] over Census, BP-Cubng(BU) s 2.33 tmes faster than DnA. Fg. 10c shows that n the sparse Weather data set, BP-Cubng(BU) acheves more sgnfcant effcency mprovement over DnA than BP-Cubng(TD). In ths data set, Authorzed lcensed use lmted to: IEEE Xplore. Downloaded on November 18, 2008 at 20:54 from IEEE Xplore. Restrctons apply.

14 916 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 19, NO. 7, JULY 2007 Fg. 10. BP-Cubng versus DnA on the AvgðXÞ n ½ 1 ; 2 Š constrant. (a) Census. (b) TPC-R. (c) Weather. the fgure may suggest that the mprovement of the BP- Cubng algorthms over DnA s not as pronounced as n Census and TPC-R. However, Fg. 9 has shown that n the sparse Weather data, the partton-based aggregaton strategy of DnA s more effcent than the tree-based aggregaton strategy of BP-Cubng. Wth an aggregaton strategy that works not that well, the BP-Cubng algorthms stll acheve better performance than DnA. Such dramatc result can only be attrbuted to the effectveness of bound prunng. As experments have shown that BP-Cubng(TD) has better performance than BP-Cubng(BU) n general, BP-Cubng(TD) s used n later experments for comparng wth DnA. 8.4 BP-Cubng versus DnA on VarðXÞ In ths secton, we compare our boundng technques aganst DnA on the constrant VarðXÞ, whch nvolves the complex nonmonotone functon Var. For BP- Cubng, we obtan the upper bound for Var by followng Table 6. For DnA, we derve the upper bound for Var as follows ([14, Example 4.2]): QSUMðXÞ Count 1ðcÞ psum 1ðcÞ 2 nsum 1ðcÞ 2 Count 2ðXÞ Count 3ðXÞ psum 2ðXÞnsum 2ðXÞ þ 2 ðcount 4ðcÞÞ 2 ; whose notatons are explaned later n Secton 9. The runtme of both BP-Cubng(TD) and DnA s shown n Fgs. 11a, 11b, and 11c. When the Var threshold gets larger, the ceberg cubes get smaller, and both algorthms fnsh faster. BP-Cubng s always faster than DnA at all Var thresholds. The speedup s usually around several tmes. Ths can be attrbuted to the tghter bounds obtaned usng MSPs. Fg. 11. BP-Cubng versus DnA on VarðXÞ and SumðXÞ. (a) VarðXÞ : Census. (b) VarðXÞ : TPC-R. (c) VarðXÞ : Weather. (d) SumðXÞ : Census. (e) SumðXÞ : TPC-R. (f) SumðXÞ : Weather. 8.5 BP-Cubng versus DnA on SumðXÞ We now compare BP-Cubng wth DnA on computng ceberg cubes defned by SumðXÞ. As shown n Table 2, SumðXÞ s a representatve for our general boundng theory, whch does not nvolve any optmzaton. The upper bound of SumðXÞ needs to be computed for prunng. In BP-Cubng, ths s computed followng Table 2. In DnA, ths s computed followng [15] (see Secton 9). The artfcal data sets TPC-R and Weather contan only postve measure values, whch renders SumðXÞ an antmonotone constrant. We randomly negated 10 percent of the measure values to make the constrant nonantmonotone so that prunng technques can be sensbly compared. Fgs. 11d, 11e, and 11f demonstrate the runtme of both algorthms on the three data sets under dfferent values of. BP-Cubng(TD) sgnfcantly outperforms DnA at all thresholds for SumðXÞ. To ascertan the contrbuton of prunng toward the effcency gan of BP-Cubng over DnA, for each algorthm, we collected the number of groups that are examned for prunng, namely, groups whose aggregates are bounded (approxmated) and tested aganst the gven constrant. There are two types of examned groups: a true postve (or nonsoluton group) s a group whose approxmated aggregate and ts real aggregate pass the constrant, and a Authorzed lcensed use lmted to: IEEE Xplore. Downloaded on November 18, 2008 at 20:54 from IEEE Xplore. Restrctons apply.

15 ZHANG ET AL.: EFFICIENT COMPUTATION OF ICEBERG CUBES BY BOUNDING AGGREGATE FUNCTIONS 917 TABLE 8 Number of Examned Groups versus Number of Soluton Groups on SumðXÞ Bound-prune cubng denotes (BP-Cubng(TD)). false postve (or soluton group) s a group whose approxmated aggregate passes the constrant, but whose real aggregate fals the constrant. Table 8 shows the number of examned groups versus the number of soluton groups for the three data sets. The three columns lst, respectvely, the number of soluton groups n the output and the number of groups examned n BP-Cubng(TD) and DnA. Substantally fewer groups are examned n BP-Cubng(TD) at all thresholds. Ths shows that BP-Cubng wth MSPs has a clear advantage. Observe that, when the number of soluton groups decreases, the numbers of groups examned by both algorthms decrease as well, but that number n DnA decreases at a much slower rate than that n BP-Cubng(TD). 9 RELATED WORK DnA [14], [15] s a recent approach for prunng wth nonantmonotone aggregaton constrants n ceberg cubng. The man dea of DnA s to dvde a partton of tuples nto two subspaces of postve and negatve measure values, respectvely, so that a gven constrant can be rewrtten usng antmonotone or monotone constrants n subspaces. For example, consder a measure X, whch can be postve or negatve. Gven the constrant AvgðXÞ, the authors would frst rewrte the orgnal aggregate nto AvgðXÞ ¼psumðXÞ=Count 1 ðxþ nsumðxþ=count 2 ðxþ; where psum and nsum are the (absolute) sum of postve/ negatve X values, and Count 1 and Count 2 are rewrtng Count n the postve and negatve spaces. Then, they would use psumðxþ=count 1 ðcþ nsumðcþ=count 2 ðxþ as a weaker antmonotone approxmator for Avg(X), where c s some smallest subpartton of the gven partton. If the approxmate fals the threshold, then all groups that are subgroups of X and supergroups of c are pruned. BP-Cubng dffers from DnA n several aspects: 1. Rather than the separately monotone rewrtng strategy of DnA, n BP-Cubng, rewrtng follows the prncples that an algebrac aggregate functon can be expressed as an algebrac expresson of dstrbutve aggregate functons and that the aggregate value for a group can be computed from the aggregates of MSPs. For example, gven measure X wth MSPs X 1 ;...;X n, AvgðXÞ s rewrtten nto SumðXÞ=CountðXÞ ¼ Sum ðsumðx ÞÞ=Sum ðcountðx ÞÞ: 2. In BP-Cubng, the optmzaton for complex aggregate functons such as Avg and Var can produce very tght bounds. Contnung wth the prevous example, by followng Table 6, the optmzed upper bound for AvgðXÞ s MaxðfAvgðc ÞgÞ, where c terates over the smallest parttons. As Avgðc Þ s the real aggregate of a group n the search space, MaxðfAvgðc ÞgÞ s the optmal approxmator and s tghter (smaller) than psumðxþ=count 1 ðcþ SumðcÞ=Count 2 ðxþ; whch s the bound derved by DnA. 3. MSPs are nonempty groups of tuples and BP-Cubng prunes groups. Moreover, the G-tree structure of BP- Cubng greatly facltates top-down multway aggregaton and saves computaton, especally for dense data. DnA calculates aggregate bounds from tuples and prunes tuples the bottom-up recursve parttonng aggregaton strategy can ncur extra cost searchng for and prunng tuples that do not occur n any groups that satsfy a gven constrant. 4. Extra processng s needed n DnA to ncorporate antmonotone constrants such as CountðÞ n. Contnung wth the prevous example, consder the constrant AvgðXÞ and CountðÞ n and the (ab) partton wth c, d, and e as further parttonng dmenson values. In DnA, wth recursve parttonng, even f (e) s nfrequent, subparttons of (ab) and (e), namely, (abe), (abce), (abde), and (abcde), are not pruned (the Rollback tree was proposed to prune these groups). Note that these are only some of the subparttons of (e) that should be pruned. In BP-Cubng, as dscussed n Secton 7.5, all subparttons of a partton falng an antmonotone constrant are pruned. The top-k average technque [7] was desgned specfcally for the constrant AvgðxÞ and CountðÞ k and s not a general prunng technque for ceberg cubng. The tree-based bottom-up aggregaton strategy was frst proposed n [7]. Top-down aggregaton was proposed ndependently n [16] for performng multway aggregaton n cubng; however, the work was focused on how to ncorporate bottom-up prunng for antmonotone aggregaton constrants nto top-down aggregaton. The BUC algorthm was proposed n [4], where prunng wth antmonotone constrants was performed n the bottom-up recursve parttonng aggregaton framework. Our study s also related to constrant data mnng [2], [3], [8], [9], [12], [13]. Agrawal and Srkant [2] wrote the frst paper on prunng wth the antmonotone constrant CountðÞ for assocaton mnng. In other works on Authorzed lcensed use lmted to: IEEE Xplore. Downloaded on November 18, 2008 at 20:54 from IEEE Xplore. Restrctons apply.

918 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 19, NO.

support constrants [13]. These constrants are orthogonal to the aggregaton constrants that we consder n ceberg cubng.

In ths paper, we have proposed prunng wth nonantmonotone constrants by estmatng ther upper and lower bounds on a data cube from aggregates of the most specfc (or mnmal) parttons.

16 918 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 19, NO. 7, JULY 2007 assocaton mnng, prunng strateges wth specfc constrants were proposed, namely, mnmum mprovement constrants [3], succnct constrants [8], convertble constrants [9], tem constrants [12], and support constrants [13]. These constrants are orthogonal to the aggregaton constrants that we consder n ceberg cubng. 10 CONCLUSIONS In computng ceberg cubes, prunng wth nonantmonotone aggregaton constrants has been a challengng problem. In ths paper, we have proposed prunng wth nonantmonotone constrants by estmatng ther upper and lower bounds on a data cube from aggregates of the most specfc (or mnmal) parttons. We have proposed ceberg cubng algorthms, called BP-Cubng, whch ncorporate boundng and prunng, and ncorporated top-down and bottom-up aggregaton strateges by usng the G-tree structure. Our extensve experments on real and artfcal data sets have shown that BP-Cubng s effectve for prunng under varous data characterstcs, and our ceberg cubng algorthms sgnfcantly outperform state-of-the-art ceberg cubng algorthms. ACKNOWLEDGMENTS The authors thank the revewers for ther helpful comments. A sgnfcantly abrdged verson of ths paper appeared n [5]. [13] K. Wang, Y. He, and J. Han, Pushng Support Constrants nto Frequent Itemset Mnng, Proc. Int l Conf. Very Large Data Bases (VLDB 00), [14] K. Wang, Y. Jang, J.X. Yu, G. Dong, and J. Han, Pushng Aggregate Constrants by Dvde-and-Approxmate, Proc. 19th IEEE Int l Conf. Data Eng. (ICDE 03), [15] K. Wang, Y. Jang, J.X. Yu, G. Dong, and J. Han, Dvde-and- Approxmate: A Novel Constrant Push Strategy for Iceberg Cube Mnng, IEEE Trans. Knowledge and Data Eng., vol. 17, no. 3, pp , [16] D. Xn, J. Han, X. L, and B.W. Wah, Star-Cubng: Computng Iceberg Cubes by Top-Down and Bottom-Up Integraton, Proc. Int l Conf. Very Large Data Bases (VLDB 03), [17] Y. Zhao, P. Deshpande, and J.F. Naughton, An Array-Based Algorthm for Smultaneous Multdmensonal Aggregates, Proc. ACM Int l Conf. Management of Data (SIGMOD 97), Xuzhen Zhang receved the PhD degree from the Unversty of Melbourne. She s currently a senor lecturer n the School of Computer Scence and Informaton Technology, RMIT Unversty. Her research nterests nclude database technology, data mnng, machne learnng, and ther applcatons n new areas. She s a member of the ACM. Paulne Lenhua Chou receved the PhD degree from RMIT Unversty. She currently works for the Australan government as a data analyst. Her research nterests nclude database technology and data mnng. REFERENCES [1] S. Agarwal, R. Agrawal, P. Deshpande, A. Gupta, J. Naughton, R. Ramakrshnan, and S. Sarawag, On the Computaton of Multdmensonal Aggregates, Proc. Int l Conf. Very Large Data Bases (VLDB 96), pp , [2] R. Agrawal and R. Srkant, Fast Algorthms for Mnng Assocaton Rules, Proc. Int l Conf. Very Large Data Bases (VLDB 94), pp , [3] R.J. Bayardo, R. Agrawal, and D. Gunopulos, Constrant-Based Rule Mnng n Large Dense Databases, Proc. 15th IEEE Int l Conf. Data Eng. (ICDE 99), [4] K. Beyer and R. Ramakrshnan, Bottom-Up Computaton of Sparse and Iceberg Cubes, Proc. ACM Int l Conf. Management of Data (SIGMOD 99), [5] L. Chou and X. Zhang, Computng Complex Iceberg Cubes by Multway Aggregaton and Boundng, Proc. Sxth Int l Conf. Data Warehousng and Knowledge Dscovery (DaWak 04), [6] J. Gray, S. Chaudhur, A. Bosworth, A. Layman, D. Rechart, and M. Venkatrao, Data Cube: A Relatonal Aggregaton Operator Generalzng Group-By, Cross-Tab and Subtotals, Data Mnng and Knowledge Dscovery, vol. 1, no. 1, pp , [7] J. Han, J. Pe, G. Dong, and K. Wang, Effcent Computaton of Iceberg Cubes wth Complex Measures, Proc. ACM Int l Conf. Management of Data (SIGMOD 01), [8] R. Ng, L.V.S. Lakshmanan, J. Han, and A. Pang, Exploratory Mnng and Prunng Optmzatons of Constraned Assocatons Rules, Proc. ACM Int l Conf. Management of Data (SIGMOD 98), pp , [9] J. Pe, J. Han, and L.V.S. Lakshmanan, Mnng Frequent Itemsets wth Convertble Constrants, Proc. 17th IEEE Int l Conf. Data Eng. (ICDE 01), pp , [10] K.A. Ross and D. Srvastava, Fast Computaton of Sparse Data Cubes, Proc. Int l Conf. Very Large Data Bases (VLDB 97), pp , [11] S. Sarawag, R. Agrawal, and R. Gupta, On Computng the Data Cube, Techncal Report RJ10026, IBM Almaden Research Center, [12] R. Srkant, Q. Vu, and R. Agrawal, Mnng Assocaton Rules wth Item Constrants, Proc. Thrd Int l Conf. Knowledge Dscovery and Data Mnng (KDD 97), pp , Guozhu Dong receved the PhD degree from the Unversty of Southern Calforna n He s a professor at Wrght State Unversty. He prevously taught at the Unversty of Melbourne and Flnders Unversty. He has served on numerous program commttees ncludng the IEEE Internatonal Conference on Data Engneerng (ICDE), the IEEE Internatonal Conference on Data Mnng (ICDM), the Internatonal Conference on Database Theory (ICDT), the ACM Symposum on Prncples of Database Systems (PODS), the ACM SIGKDD Internatonal Conference on Knowledge Dscovery and Data Mnng, and the Internatonal Conference on Very Large Data Bases (VLDB). He was a program commttee cochar of the 2003 Internatonal Conference on Web-Age Informaton Management (WAIM), s on the steerng commttee of the same conference seres, s on the edtoral board of the Internatonal Journal of Informaton Technology, and wll serve as a program commttee cochar of the 2007 Jont Internatonal Conference of APWeb and WAIM. Hs man research nterests are databases, data mnng, and bonformatcs. Hs research has been funded by government agences such as the US Natonal Scence Foundaton (NSF), the Australan Research Councl (ARC), and the US Ar Force Research Laboratory (AFRL) and prvate corporatons. He has publshed more than 100 artcles n journals and conferences, holds three US patents, and was a recpent of the IEEE ICDM 2005 Best Paper Award. He s a senor member of the IEEE and member of the ACM.. For more nformaton on ths or any other computng topc, please vst our Dgtal Lbrary at Authorzed lcensed use lmted to: IEEE Xplore. Downloaded on November 18, 2008 at 20:54 from IEEE Xplore. Restrctons apply.

Multiway pruning for efficient iceberg cubing

Multiway pruning for efficient iceberg cubing Multway prunng for effcent ceberg cubng Xuzhen Zhang & Paulne Lenhua Chou School of CS & IT, RMIT Unversty, Melbourne, VIC 3001, Australa {zhang,lchou}@cs.rmt.edu.au Abstract. Effectve prunng s essental