On-line Evaluation of a Data Cube over a Data Stream

Proceedings of the 8th WSEAS International Conference on APPLIED COMPUTER SCIENCE (ACS'08)

Woo Sock Yang and Won Suk Lee
Department of Computer Science, Yonsei University
134 Shinchon-dong Seodaemun-gu Seoul, 120-749, Korea
+82-2-223-276

Abstract: This paper proposes a dynamic data cube for applying a data cube to a data stream environment. The dynamic data cube specifies user-interesting areas by the support ratio of attribute values, and manages the attribute groups dynamically by grouping and dividing them. These methods reduce memory usage and processing time. The cube also emphasizes user-interesting areas by increasing the granularity of attributes that have higher support. We also propose an exception detecting method that quickly identifies exceptions by reversing the multi-stage cluster sampling method. We perform experiments to verify how efficiently the dynamic data cube works in limited memory space.

Key-Words: Data stream, OLAP, Data cube, User-interesting area, Detecting exception, Cube tree

1 Introduction
OLAP [1] has evolved greatly in the areas of database and data warehouse systems for multi-dimensional data analysis. The data cube, the multi-dimensional data model of OLAP, shows many aspects of the data through two factors: dimension and measurement [2,3]. OLAP is now regarded as an essential tool for business decision makers and data analysts, and the data cube has been successfully adopted for multi-dimensional data analyses [4].

Currently, the amount of data and the speed at which it is generated are increasing rapidly due to the fast growth of information technology and the emerging ubiquitous era. A data stream differs from conventional data because it is generated continuously and in enormous volume in real time, and its distribution changes frequently. Storing all of the data in a limited memory space is therefore impossible, and it is hard to use a conventional data cube for processing a data stream without modification.
We propose the dynamic data cube, which uses a dynamic grouping mechanism for the dimension attributes. The dynamic data cube can analyze the entire data cube because it stores all cuboids placed in a single path. It also saves memory and reduces processing time by expanding and shrinking attribute groups as tuples arrive and by managing the attributes dynamically as groups. It loses some detailed information by grouping many attributes when they leave the interesting areas; on the other hand, it gives meaningful information to analysts because it divides the groups and recovers detailed information when attributes enter the interesting regions.

2 Related Work
The data cube models data in a multi-dimensional fashion and is defined by dimension and fact tables [5]. A dimension is an aspect that shows why analysts want to manage the data record, and a fact is a numerical value. The Stream Cube [6] was proposed to adopt a data cube over a data stream. It has the following three characteristics.

Firstly, it uses a tilted time frame to summarize the time dimension. Analysts are mostly interested in recent data rather than old data, so the latest data are stored at a fine granularity and older data at a coarser one. Suppose that a tuple arrives every minute. When the time reaches fifteen minutes, these data are summarized into a quarter span. This method uses less memory than storing all data at one-minute granularity. However, it becomes harder to discern detailed information about past data as time passes.

Secondly, the stream cube constructs the data cube only with cuboids inside user-interesting areas [7, 8]. It defines a minimal interesting layer and an observation layer by applying the concept of the iceberg query [9, 10], and saves memory by using only the cuboids that reside between the two layers. Therefore, once the stream cube is constructed, queries about cuboids above the observation layer or below the minimal interesting layer cannot be answered.
Thirdly, it uses the H-tree method [11] to store one popular path in which cuboids lie between the minimal interesting layer and the observation layer. A query about other paths not stored in the H-tree is answered by using the given information. The popular path is fixed and is decided by statistical results or the experience of analysts. As the fixed path cannot be changed during execution, the stream cube cannot react accurately to radical changes in the distribution of the data stream.

ISSN: 1790-5109 ISBN: 978-960-474-028-4

3 Dynamic Data Cube
This paper proposes a dynamic data cube to make up for the weak points of the stream cube described above. The dynamic data cube maintains analysis granularity by grouping dimension attributes rather than storing the whole data cube. An important region, called the user-interesting area, lies between the minimum support ratio $S_{min}$ and the maximum support ratio $S_{max}$, as shown in Figure 1. The support ratio of a group is defined as the ratio of the number of occurrences of the group to the number of occurrences of all groups.

Fig. 1 User-interesting areas (the region of user interest lies between $S_{min}$ and $S_{max}$)

In order to store analytic information, the dynamic cube uses statistical information as described in Definition 1.

Definition 1. Statistical Information
The dynamic data cube keeps the following data in its nodes to store information on the dimension attribute and statistical values of the measure attribute.
1. Interval (I): I is the domain interval of the dimension attribute group.
2. Count (C): C is the sum of the occurrences of the dimension attribute group.
3. Range of dimension attribute (R): R is the set of values of the dimension attribute group.
4. Average measure attribute value (M): M is the average value of the measure attribute for the dimension attributes in a group.

Definition 2. Sibling List
1. A sibling list is a single-linked list that connects nodes containing statistical information as defined in Definition 1.
2. A sibling list maintains a pointer for the connection between nodes.

Definition 3. One-dimensional Tree
The one-dimensional tree holds the values of the dimension attributes and the statistical information of the measure attribute related to one-dimensional cuboids.
1. It is composed of the sibling lists given in Definition 2.
2. Each tree level stores the information of one dimension.
3. The upper- and lower-level dimensions are linked by a single-linked list.

Definition 4. Cube Tree
The cube tree stores dimension information and statistical values of the measure attribute in a single path from the top cuboids to the bottom cuboids.
1. The cube tree is composed of the sibling lists given in Definition 2.
2. Each level is composed of more than one sibling list, and all of the sibling lists in the same level are linked to each other by a double-linked list.
3. In order to connect sibling lists in the same level, it maintains the previous and next header pointers.
4. The upper and lower levels are connected by a single-linked list.
5. In order to connect different levels, it maintains the next-level cuboids pointer.

3.1 The update of a dynamic data cube
When a new tuple $T_t$ is generated at time $t$ of a data stream $D_t$, the count value $C_{t-1}$ of the dimension attribute and the average value $M_{t-1}$ of the measure attribute at time $t-1$ must be updated. If the attribute value of the newly produced tuple lies in group range R and its measure attribute value is $m_t$, Equation 1 computes the count value $C_t$ and the average measure value $M_t$ at the current time $t$.

$$C_t = C_{t-1} + 1, \qquad M_t = \frac{M_{t-1} \times C_{t-1} + m_t}{C_t} \tag{1}$$

So that dimension attribute groups lying outside the user-interesting areas are moved inside as far as possible, the dynamic data cube adjusts the range of the groups in two phases: the expanding phase and the shrinking phase. The expanding phase guides groups into the user-interesting areas by dividing each dimension attribute group whose support is larger than the maximum support $S_{max}$. This phase proceeds as follows.
1. Divide a dimension attribute group in the one-dimensional tree by the user-defined value λ if its support is larger than the user-defined maximum support $S_{max}$.
2. Divide the groups in the cube tree for the same level attribute group as in the one-dimensional tree.
3. If a dimension attribute group to be divided has an interval smaller than the user-defined value λ, the group is not divided.
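The incremental update in Equation 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the `GroupNode` class, its field names, the `support` helper and the half-open interval convention are assumptions introduced here; only the count and running-average arithmetic comes from Equation 1.

```python
from dataclasses import dataclass


@dataclass
class GroupNode:
    """Statistics of one dimension attribute group (cf. Definition 1).

    Field names are hypothetical; the range is assumed half-open [low, high).
    """
    low: float          # lower bound of the group's range R
    high: float         # upper bound of the group's range R (exclusive)
    count: int = 0      # C: number of tuples that fell into the group
    mean: float = 0.0   # M: average measure value of those tuples

    def covers(self, value: float) -> bool:
        """True if an attribute value lies in this group's range."""
        return self.low <= value < self.high

    def update(self, measure: float) -> None:
        """Equation 1: C_t = C_{t-1} + 1, M_t = (M_{t-1} * C_{t-1} + m_t) / C_t."""
        new_count = self.count + 1
        self.mean = (self.mean * self.count + measure) / new_count
        self.count = new_count


def support(node: GroupNode, total: int) -> float:
    """Support ratio: occurrences of this group over occurrences of all groups."""
    return node.count / total if total else 0.0


node = GroupNode(low=0, high=10)
for m in (4.0, 8.0, 6.0):
    node.update(m)
print(node.count, node.mean)  # 3 6.0
```

Because only the count and the running average are kept, each arriving tuple is absorbed in constant time and the tuple itself never needs to be stored.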
If a dimension attribute group has larger support than the user-defined support $S_{max}$, the group leaves the user-interesting areas. Dividing the group by λ guides it back into the user-interesting areas. Equation 2 is used to calculate the count and average measure value of each newly created group after the expanding phase.

$$I_{new} = \frac{I}{\lambda}, \qquad C_{new} = \frac{C}{\lambda}, \qquad M_{new} = M \tag{2}$$

In Equation 2, the new group's interval $I_{new}$ is derived by dividing $I$ by λ, and its count $C_{new}$ by dividing $C$ by λ. The average measure value $M$ of the expanded dimension attribute group can be used as the average measure value $M_{new}$ of each newly created group without further calculation because it is already an average value. Figure 2 shows an example of the expanding phase: the groups surrounded by the dotted line, whose support exceeds the user-defined maximum support $S_{max}$ = 0.25, are each divided into λ = 2 groups. In this figure, the left side is a one-dimensional tree and the right side is a cube tree, as given in Definitions 3 and 4, respectively.

Fig. 2 An example of the expanding phase ($S_{max}$ = 0.25, λ = 2): (a) before expanding, (b) after expanding. The dimensions are Company, Region and Color, and the measure is Sales.

The shrinking phase guides groups into the user-interesting areas by combining dimension attribute groups whose support is smaller than the minimum support $S_{min}$. This phase proceeds as follows.
1. Combine dimension attribute groups in the one-dimensional tree if their supports are smaller than the user-defined minimum support $S_{min}$.
2. Combine the groups in the cube tree for the same level attribute group as in the one-dimensional tree.
3. If the dimension attribute groups to be combined are not contiguous in their attribute values, they are not combined.
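The two regrouping operations can be sketched in Python as below. This is an illustrative sketch under stated assumptions, not the paper's implementation: groups are represented as hypothetical `(low, high, count, mean)` tuples over a numeric range, `split_group` follows Equation 2 and step 3 of the expanding phase, and `merge_groups` assumes the natural rule for combining contiguous groups, namely summing the counts and taking the count-weighted average of the means.

```python
from typing import List, Tuple

# (low, high, count, mean); this tuple layout is an assumption for illustration.
Group = Tuple[float, float, float, float]


def split_group(group: Group, lam: int) -> List[Group]:
    """Expanding phase: divide one group into lam equal sub-groups (Equation 2)."""
    low, high, count, mean = group
    if high - low < lam:
        # Step 3 of the expanding phase: intervals smaller than lambda stay whole.
        return [group]
    width = (high - low) / lam                     # I_new = I / lambda
    return [
        (low + i * width, low + (i + 1) * width,
         count / lam,                              # C_new = C / lambda
         mean)                                     # M_new = M
        for i in range(lam)
    ]


def merge_groups(groups: List[Group]) -> Group:
    """Shrinking phase: combine contiguous groups into a single group."""
    groups = sorted(groups)
    for left, right in zip(groups, groups[1:]):
        if left[1] != right[0]:
            raise ValueError("only contiguous groups may be combined")
    count = sum(g[2] for g in groups)
    # Count-weighted average of the group means.
    mean = sum(g[3] * g[2] for g in groups) / count if count else 0.0
    return (groups[0][0], groups[-1][1], count, mean)


print(split_group((0, 10, 14, 5.0), lam=2))
# [(0.0, 5.0, 7.0, 5.0), (5.0, 10.0, 7.0, 5.0)]
print(merge_groups([(0, 5, 2, 3.0), (5, 10, 6, 7.0)]))
# (0, 10, 8, 6.0)
```

Splitting distributes the count evenly because the distribution inside a group is unknown, which is one source of the accuracy loss while groups are still being reshaped.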

The more contiguous dimension attribute groups have support smaller than the user-defined value $S_{min}$, the more groups leave the user-interesting areas. Combining such groups guides the combined group into the user-interesting areas. Equation 3 is used to calculate the combined group's count and average measure value, where the sums run over the $n$ groups being combined.

$$C_{new} = \sum_{i=1}^{n} C_i, \qquad M_{new} = \frac{\sum_{i=1}^{n} (M_i \times C_i)}{C_{new}} \tag{3}$$

The new count is the sum of the counts of the combined groups, and the average measure value $M_{new}$ of the combined group is the count-weighted average of their average measure values.

4 Exception Detecting Method in the Dynamic Data Cube
Data is stored in a data cube in summarized form and is explored by OLAP operations. Although OLAP operations make it possible to explore the entire data cube, they do not help analysts reach a meaningful portion of it. Analysts usually depend on hypothesis-driven exploration of the data cube through these operations. Discovery-driven exploration was proposed to avoid this tedious exploration [12]. Although discovery-driven exploration supports detecting exceptions with pre-computed exception values at different levels, it is not applicable to a data stream because it must compute all coefficients between every dimension and hierarchy. Therefore, we propose an exception detecting method that quickly identifies exceptions by reversing multi-stage cluster sampling, a sampling method from statistics [13].

4.1 Exception detecting method with reversed way of multi-stage cluster sampling
The one-dimensional tree can be regarded as a population in statistics because it is made up of clusters of one-dimensional attributes and has stored average measure values since the dynamic data cube was started. The cube tree is always regarded as a set of samples.
Thus, we can compare each sample with its own population, and samples that do not follow the characteristics of the population are selected as exceptions. To tell whether a sample is an exception, a precise measure of the degree of exception is needed. The z-score makes it possible to compare two data sets regardless of the difference in their amounts of data, because it is a standardized value [14]. Let $Z_T$ be the z-score of the population in the one-dimensional tree and $Z_P$ be the z-score of a multi-stage clustered sample in the cube tree. Equation 4 computes the exception $Z_{exp}$ in the dynamic data cube.

$$Z_{exp} = \max(|Z_T - Z_P| - \tau,\ 0) \tag{4}$$

If the distance between $Z_T$ and $Z_P$ is larger than the user-defined value τ, the sample is detected as an exception.

4.2 Comparison with discovery-driven method
To find exceptions through discovery-driven exploration, it is necessary to compute all coefficients between every dimension and hierarchy [12]. For example, SelfExp is computed as in Equation 5.

$$\mathrm{SelfExp}(y_{i_1 i_2 \ldots i_n}) = \max\left(\frac{|y_{i_1 i_2 \ldots i_n} - \hat{y}_{i_1 i_2 \ldots i_n}|}{\sigma_{i_1 i_2 \ldots i_n}} - \tau,\ 0\right) \tag{5}$$

where $y$ denotes the value of a cell in the data cube, and $\hat{y}$ is a value computed from the correlation between every dimension and hierarchy. To compute $\hat{y}$, all aggregated values through every dimension are needed. The space complexity for computing those aggregated values is given by Equation 6, assuming that $n$ is the number of dimensions and $m$ is the number of values of each dimension.

$$S_{SE} = \sum_{r=1}^{n} {}_{n}C_{r}\, m^{r} B = \sum_{r=1}^{n} \frac{n!}{r!\,(n-r)!}\, m^{r} B \tag{6}$$

In this equation, $B$ denotes the memory size of one cell, ${}_{n}C_{r}$ is the number of possible subsets of $r$ dimensions, and $m^{r}$ is the number of value combinations over $r$ dimensions. Thus, summing the product of these two values over $r$ gives the total number of cells in the data cube.

The time complexity $T_{SE}$ covers only updating, without reading tables, so it is the same as the time for sorting data in the order of specific attributes. Conventional database systems usually use the two-phase multiway sorting algorithm.
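As a worked illustration of Equation 6, the following Python sketch counts the memory a full data cube would need. The function name and the sample values of n, m and B are hypothetical and chosen only for illustration; the closed form (m + 1)^n - 1 used as a cross-check follows from the binomial theorem and is not stated in the paper.

```python
from math import comb


def full_cube_space(n: int, m: int, cell_bytes: int) -> int:
    """S_SE = sum_{r=1..n} C(n, r) * m**r * B  (Equation 6)."""
    return sum(comb(n, r) * m ** r for r in range(1, n + 1)) * cell_bytes


# Hypothetical setting: 4 dimensions, 10 values per dimension, 8-byte cells.
n, m, B = 4, 10, 8
space = full_cube_space(n, m, B)
# Binomial theorem: sum_{r=0..n} C(n, r) m^r = (m + 1)^n, so drop the r = 0 term.
assert space == ((m + 1) ** n - 1) * B
print(space)  # 117120
```

The closed form makes the growth explicit: the cell count is exponential in the number of dimensions, which is why materializing every cuboid is impractical for a stream.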
Most of the time for this method is spent on disk reads and writes, which use 64 Kbytes for each block. Disk read and write time $T_{IO}$ differs among systems, but is usually below milliseconds. Thus, the time complexity $T_{SE}$, which depends on the number of cells, is computed as in Equation 7.

$$T_{SE} = \left(\frac{S_{SE}}{64}\right) T_{IO} \tag{7}$$

On the other hand, the exception detecting method in the dynamic data cube needs to search the cube tree and compare the values of samples in the cube tree with the values of the population in the one-dimensional tree. Thus, the space complexity $S_{DC}$ is obtained as follows.

$$S_{DC} = \sum_{r=1}^{n} \left(\frac{m}{k}\right)^{r} B \tag{8}$$

The space complexity of this exception detecting method is the same as that of the cube tree, because it uses the cube tree itself for exception detecting. If the grouping phase reduces memory to 1/k of its original size, the number of attribute values in a dimension becomes $m/k$, not $m$. Most of the time in the exception detecting method is spent getting samples for comparison; that is, the search time for the cube tree determines the time complexity, which can be estimated as in Equation 9. Memory read and write time $T_{MIO}$ is also below milliseconds.

$$T_{DC} = \log_{m/k} S_{DC} \times T_{MIO} \tag{9}$$

As shown in the above equations, the space complexity and time complexity of discovery-driven exploration are far larger than those of the dynamic data cube. In other words, the dynamic data cube is better than discovery-driven exploration in terms of space and time consumption.

5. Experiments
Fig. 3 Memory usage (Memory (MB); Full Data Cube vs. Dynamic Data Cube)
Fig. 4 Memory usage (Memory (MB) vs. Tuple; Full Data Cube vs. Dynamic Data Cube)
Fig. 5 Accuracy (Accuracy vs. Tuple; Full Data Cube vs. Dynamic Data Cube)
Fig. 6 Memory usage (Memory (MB); 4, 5, 6 and 7 Dimensions)
Fig. 7 Processing time (Time (msec); 4, 5, 6 and 7 Dimensions)
Fig. 8 Accuracy (4, 5, 6 and 7 Dimensions)

The data sets used in our experiments follow zipfian distributions [15] with values between .8 and 3.4. The number of dimensions varies from 5 to 8, and each data set consists of , tuples. The cardinality of the attribute value is . Each tuple is processed one by one to simulate the data stream environment.
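A stream of this kind can be simulated with a few lines of Python. The sketch below is an illustration under assumptions rather than the authors' generator: the function name, the exponent, the cardinality, the number of tuples and the uniform measure values are all hypothetical placeholders, and the Zipf weights 1/rank^s are drawn over a finite domain using only the standard library.

```python
import random
from typing import Iterator, Tuple


def zipf_stream(num_tuples: int, num_dims: int, cardinality: int,
                exponent: float, seed: int = 0
                ) -> Iterator[Tuple[Tuple[int, ...], float]]:
    """Yield (dimension values, measure) tuples one at a time.

    Each dimension value is a rank in 1..cardinality drawn with probability
    proportional to 1 / rank**exponent (a finite Zipf distribution).
    """
    rng = random.Random(seed)
    ranks = list(range(1, cardinality + 1))
    weights = [1.0 / r ** exponent for r in ranks]
    for _ in range(num_tuples):
        dims = tuple(rng.choices(ranks, weights=weights, k=num_dims))
        # The measure distribution is an arbitrary placeholder.
        yield dims, rng.uniform(0.0, 100.0)


# Process tuples one by one, as a stream would deliver them.
counts = {}
for dims, measure in zipf_stream(num_tuples=1000, num_dims=5,
                                 cardinality=20, exponent=1.8):
    counts[dims[0]] = counts.get(dims[0], 0) + 1
print(max(counts, key=counts.get))  # most frequent value of the first dimension
```

A larger exponent concentrates the tuples on a few attribute values, so more of the remaining values fall below the minimum support and become candidates for grouping.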
The accuracy A in our experiments is obtained from the relative error between the measure value R.C of a conventional database and the measure value G.C of the dynamic data cube as follows:

$$A = 1 - \frac{\sum_{i=1}^{n} \sum_{k=1}^{m} |R.C_i^k - G.C_i^k|}{R.C} \tag{10}$$

Figure 3 shows the memory usage of the data cube and the dynamic data cube for varying zipfian distributions. Figure 4 shows the memory usage against the number of data tuples for zipfian distribution .8. The dynamic data cube uses less memory than the data cube because its shrinking phase combines the groups whose support is lower than the user-defined value. The memory usage of the data cube increases continuously because it stores all newly produced attribute values. Figure 5 shows the accuracy of the experiment in Figure 4. The dynamic data cube shows lower accuracy under 2, tuples because its expanding and shrinking phases run often while the dimension attribute support changes frequently. However, once the distribution of the data stream becomes stable, the accuracy increases since there are no more expanding and shrinking phases. Figure 6 shows the memory usage for different numbers of dimensions for varying zipfian distributions. Figure 7 shows the processing time taken for updating in the experiment of Figure 6. As the zipfian distribution has a higher value, most support values of the dimension attribute tend to be positioned in the same place.

Consequently, the grouping phase happens frequently, so memory usage and processing time are reduced. As the number of dimensions increases, memory usage and processing time increase because the tree becomes higher. Figure 8 shows the accuracy when zipfian distributions are varied. The difference in accuracy among them is not large. As the zipfian distribution has a higher value, the number of groups to be combined grows. This makes the accuracy slightly higher by combining many groups that have unit ranges into larger single groups.

6. Conclusion
This paper proposed a dynamic data cube for efficiently adopting the data cube to a data stream environment. It is impossible to store all produced data in a limited memory space due to the characteristics of the data stream. Consequently, it is better to provide meaningful information by making good use of the given memory space. The dynamic data cube reduces memory usage and processing time by employing user-interesting areas and by managing attribute values as groups rather than as fine-grained units. These methods can lose accuracy outside the interesting areas, but they maintain precise accuracy inside them; that is, the cube gives more memory space to the interesting areas than to the other areas. At the same time, it provides recent, useful information through its dynamic grouping mechanism. We also proposed an exception detecting method to help analysts find exceptions faster and more easily. The proposed method reverses multi-stage cluster sampling. The one-dimensional tree can be regarded as a population in statistics and the cube tree as a set of samples. Thus, we can compare each sample with its own population and detect as exceptions the samples that do not follow the characteristics of the population.
We showed that the proposed method needs less time and memory space than discovery-driven exploration methods.

ACKNOWLEDGEMENT
"This work was supported by the Korea Science and Engineering Foundation (KOSEF) NRL Program grant funded by the Korea government (MEST)"

References:
[1] Inmon, W.H., Building the Data Warehouse, John Wiley, 1992.
[2] The OLAP Council, MD-API the OLAP Application Program Interface Version 0.5 Specification, 1996.
[3] Rakesh Agrawal, Ashish Gupta, Sunita Sarawagi, Modeling multidimensional databases, In Proc. The 13th Intl Conference on Data Engineering, Birmingham, U.K., 1997, pp.232-243.
[4] S. Chaudhuri and U. Dayal, An overview of data warehousing and OLAP technology, SIGMOD Record, Vol. 26, 1997, pp.65-74.
[5] George Colliat, OLAP, relational and multidimensional database systems, ACM SIGMOD Record, Vol. 25, No. 3, 1995, pp.64-69.
[6] Jiawei Han, Yixin Chen, Guozhu Dong, Jian Pei, Benjamin W. Wah, Jianyong Wang, Y. Dora Cai, Stream Cube: An Architecture for Multi-Dimensional Analysis of Data Streams, Distributed and Parallel Databases, Vol. 18, No. 2, 2005, pp.173-197.
[7] V. Harinarayan, A. Rajaraman, J.D. Ullman, Implementing data cubes efficiently, ACM SIGMOD Record, Vol. 25, No. 2, 1996, pp.205-216.
[8] K. Beyer and R. Ramakrishnan, Bottom-up computation of sparse and iceberg cubes, ACM SIGMOD Record, Vol. 28, No. 2, 1999, pp.359-370.
[9] Z. Shao, J. Han, D. Xin, MM-Cubing: Computing iceberg cubes by factorizing the lattice space, Proceedings of the 16th International Conference on Scientific and Statistical Database Management, 2004, pp.213-222.
[10] D. Xin, J. Han, X. Li, B.W. Wah, Star-cubing: Computing iceberg cubes by top-down and bottom-up integration, Proceedings of the 29th International Conference on Very Large Data Bases, Vol. 29, 2003, pp.476-487.
[11] Jiawei Han, Jian Pei, Guozhu Dong, Ke Wang, Efficient Computation of Iceberg Cubes with Complex Measures, SIGMOD Conference, Vol. 30, No. 2, 2001, pp.1-12.
[12] Sunita Sarawagi, Rakesh Agrawal, Nimrod Megiddo, Discovery-driven Exploration of OLAP Data Cubes, Proceedings of the 6th International Conference on Extending Database Technology: Advances in Database Technology, Vol. 1377, 1998, pp.168-182.
[13] http://en.wikipedia.org/wiki/multistage_sampling
[14] http://en.wikipedia.org/wiki/standard_score
[15] M.E.J. Newman, Power laws, Pareto distributions and Zipf's law, Contemporary Physics, Vol. 46, No. 5, 2005, pp.323-351.