Accelerating Multi Dimensional Queries in Data Warehouses

Size: px

Start display at page:

Download "Accelerating Multi Dimensional Queries in Data Warehouses"

Lydia Booth
5 years ago
Views:

1 Acceleratig Multi Dimesioal Queries i Data Warehouses Russel Pears ad Brya Houlisto School of Computig ad Mathematical Scieces, Aucklad Uiversity of Techology, New Zealad rpears@aut.ac.z Data Warehouses are widely used for supportig decisio makig. O Lie Aalytical Processig or OLAP is the mai vehicle for queryig data warehouses. OLAP operatios commoly ivolve the computatio of multidimesioal aggregates. The major bottleeck i computig these aggregates is the large volume of data that eeds to be processed which i tur leads to prohibitively expesive query executio times. O the other had, Data Aalysts are primarily cocered with discerig treds i the data ad thus a system that provides approximate aswers i a timely fashio would suit their requiremets better. I this chapter we preset the Prime Factor scheme, a ovel method for compressig data i a warehouse. Our data compressio method is based o aggregatig data o each dimesio of the data warehouse. Extesive experimetatio o both real-world ad sythetic data have show that it outperforms the Haar Wavelet scheme with respect to both decodig time ad error rate, while maitaiig comparable compressio ratios (Pears ad Houlisto, 007). Oe ecouragig feature is the stability of the error rate whe compared to the Haar Wavelet. Although Wavelets have bee show to be effective at compressig data, the approximate aswers they provide varies widely, eve for idetical types of queries o early idetical values i distict parts of the data. This problem has bee attributed to the thresholdig techique used to reduce the size of the ecoded data ad is a itegral part of the Wavelet compressio scheme. I cotrast the Prime Factor scheme does ot rely o thresholdig but keeps a smaller versio of every data elemet from the origial data ad is thus able to achieve a much higher degree of error stability which is importat from a Data Aalysts poit of view. Keywords: Data Warehousig, OLAP, Multi Dimesioal Aggregates, Approximate Query Processig, Wavelets, Prime Numbers, Data Compressio. INTRODUCTION Data Warehouses are icreasigly beig used by decisio makers to aalyze treds i data (Cuigham, Sog ad Che, 006, Elmasri ad Navathe, 003). Thus a marketig aalyst is able to track variatio i sales icome across dimesios such as time period, locatio, ad product o their ow or i combiatio with each other. This aalysis requires the processig of multi-dimesioal aggregates ad group by operatios agaist the uderlyig data warehouse. Due to the large volumes of data that eed to be scaed from secodary storage, such queries, referred to as O Lie Aalytical Processig (OLAP) queries, ca take from miutes to hours i large scale data warehouses (Elmasri, 003, Oracle 9i). The stadard techique for improvig query performace is to build aggregate tables that are targeted at kow queries (Triatafillakis, Kaellis, ad Martakos 004; Elmasri, 003). For example the idetificatio of the top te sellig products ca be speeded up by buildig a

2 summary table that cotais the total sales value (i dollar terms) for each of the products sorted i decreasig order of sales value. It would the be a simple matter of queryig the summary table ad retrievig the first te rows. The mai problem with this approach is the lack of flexibility. If the aalyst ow chooses to idetify the bottom te products a expesive re-sort would have to be performed to aswer this ew query. Worst still, if the iformatio is to be tracked by sales locatio the the summary table would be of o value at all. This problem is symptomatic of a more geeral oe where Database Systems which have bee tued for a particular access patter perform poorly as chages to such patters occur over a period of time. I their study (Zhe ad Darmot, 005) showed that database systems which have bee optimized through clusterig to suit particular query patters rapidly degrade i performace whe such query patters chage i ature. The limitatios i the above approach ca be addressed by a data compressio scheme that preserves the origial structure of the data. The chapter is orgaized as follows. I the ext sectio we review related work. The ext sectio itroduces the Prime Factor Compressio (PFC) approach. We the preset the algorithms required for ecodig ad decodig with the PFC approach. The O Lie recostructio of Queries is discussed thereafter. Implemetatio related issues are the discussed, followed by a performace evaluatio of PFC ad a compariso with the Haar Wavelet algorithm. We the discuss future treds i optimizig multi-dimesioal queries i the light of the results of this research. We coclude with a summary of the mai achievemets of the research. BACKGROUND Previous research has teded to cocetrate o computig exact aswers to OLAP queries (Ho, ad Agrawal, 997, Wag 00). Ho describes a method that pre-processes a data cube to give a prefix sum cube. The prefix sum cube is computed by applyig the trasformatio: P[A i ]=C[A i ]+P[A i- ] alog each dimesio of the data cube, where P deotes the prefix sum cube, C the origial data cube, A i deotes a elemet i the cube, ad i is a idex i a rage..d i (D i is the size of the dimesio D i ). This meas that the prefix cube requires the same storage space as the origial data cube. The above approach is efficiet for low dimesioal data cubes. For high dimesioal eviromets, two major problems exist. Firstly, the umber of accesses required is d (Ho et al, 997), which ca be prohibitive for large values of d (where d deotes the umber of dimesios). Secodly, the storage required to store the prefix sum cube ca be excessive. I a typical OLAP eviromet the data teds to be massive ad yet sparse at the same time. The degree of sparsity icreases with the umber of dimesios (OLAP) ad thus the umber of o zero cells may be a very small fractio of the prefix sum cube, which by its ature has to be dese for its query processig algorithms to work correctly. Aother exact techique is the Dwarf cube method (Sismais ad Deligiaakis, 00) which seeks to elimiate structural redudacies ad factor them out by coalescig their store. Pre-fix redudacy arises whe the fact table cotais a group of tuples havig a prefix of redudat values for its dimesio attributes. O the other had, suffix redudacy occurs whe groups of tuples cotai a suffix of redudat values for its dimesio attributes. Elimiatio of these redudacies has bee show to be effective for dese cubes. Ufortuately, i the case of sparse cubes with a large umber of dimesios the size of the fact table ca actually icrease i size (Sismais et al, 00), thus providig o gais i storage efficiecy.

3 I cotrast to exact methods, approximate methods attempt to strike a good balace betwee the degree of compressio ad accuracy. Query performace is ehaced by storig a compact versio of the data cube. A histogram based approach was used by (Matias ad Vitter, 998), (Poosala ad Gati, 999), (Vitter et al, 998), to summarize the data cube. However, histograms too suffer from the curse of high dimesioality, with both space ad time complexity icreasig expoetially with the umber of dimesios (Matias et al, 998). The Progressive Approximate Aggregate approach (Lazaridis ad Mehrotra, 00) uses a tree structure to speed up the computatio of aggregates. Aggregates are computed by idetifyig tree odes that itersect the query regio ad the accumulatig their values icremetally. All tree odes that are fully cotaied i the query regio provide a exact cotributio to the query result, whereas odes that have a partial itersectio provide a estimate. This estimatio is based o a assumptio of spatial uiformity of attribute values across the uderlyig data cube. I practice this assumptio may be ivalid as with the case of the real-world data (US Cesus) that we experimeted with i our study. I cotrast, our method makes o such assumptios o the shape of the source data distributio. Aother issue with the above scheme is that it has a worst case ru time performace that is liear i the umber of data elemets covered by the query, as observed by (Che ad Che, 003). This has egative implicatios for queries that spa a large fractio of the uderlyig data cube, particularly sice compressio is ot utilized to store source data. Samplig is aother approach that has bee used to speed up aggregate processig. Essetially, a small radom sample of the database is selected ad the aggregate is computed o this sample. The samplig operatio ca be doe off-lie, whereby the sample is extracted from the database ad all queries are ru o this sigle extracted sample. O the other had, i o-lie samplig data is read radomly from the database each time a query is executed ad the aggregate computed o the dyamically geerated sample (Hellerstei et al, 996). The very ature of samplig makes it very efficiet i terms of ru time, but its accuracy has bee show to be a limitig factor i its widespread adoptio (Vitter ad Wag, 999). Vitter et al, use the wavelet techique to trasform the data cube ito a compact form. It is essetially a data compressio techique that trasforms the origial data cube ito a Wavelet Trasform Array (WTA) which is a fractio of the size of the origial data cube. I their research Matias et al show that wavelets are superior to the histogram based methods, both i terms of accuracy ad storage efficiecy. Wavelets have also bee show to provide good compressio with sparse data cubes, ulike the Dwarf compressio method. Give the wavelet s superior performace over its rivals ad the fact that it is a approximate techique, it was a ideal choice for compariso agaist our Prime Factor scheme which is also approximate i ature. The ext sectio provides a brief overview of the ecodig ad decodig procedures used i wavelet data compressio. I priciple, ay data compressio scheme ca be applied o a data warehouse For example, a 3- dimesioal warehouse that tracks sales by time period, locatio ad products ca be compressed alog all three dimesios ad the stored i the form of chuks (Sarawagi ad Stoebraker, 994). Chukig is a techique that is used to partitio a d-dimesioal array ito smaller d- dimesioal uits. However a high compressio ratio is eeded to offset the potetially huge secodary storage access times. This effectively ruled out stadard compressio techiques such as Huffma Codig (Cormack, 985), LZW ad its variats (Lempel ad Ziv, 977; Hut 998) as well as

4 Arithmetic Codig (Lagdo, 984). These schemes eable decodig to the origial data with 00% accuracy, but suffer from modest compressio ratios (Ng ad Ravishakar, 997). O the other had the treds aalysis ature of decisio makig meas that query results do ot eed to reflect 00% accuracy. For example, durig a drill-dow query sequece i ad-hoc data miig, iitial queries i the sequece usually determie the truly iterestig queries ad regios of the database. Providig approximate, yet reasoably accurate aswers to these iitial queries gives users the ability to focus their exploratios quickly ad effectively, without cosumig iordiate amouts of valuable system resources (Hellerstei, Haas ad Wag, 997). This meas that lossy schemes which exhibit relatively high compressio ad ear 00% accuracy would be the ideal solutio to achievig acceptable performace for OLAP queries. This chapter ivestigates ad presets the performace of a ovel scheme, called Prime Factor Compressio (PFC) ad compares it agaist the well kow Wavelet approach (Vitter ad Wag, 998; Vitter ad Wag 999, Chakrabarti ad Garofalakis, 000). Recet results have idicated that the Prime Factor Compressio scheme outperforms the Wavelet scheme i terms of error stability, maitaiig a very low ad virtually costat level of accuracy irrespective of the size of the query (Pears ad Houlisto, 007). This is i marked cotrast to the Wavelet scheme which exhibits large swigs i accuracy with varyig query size. This problem has bee attributed to the thresholdig techique used to reduce the size of the ecoded data (Garofalakis ad Gibbos, 004) ad is a itegral part of the Wavelet compressio scheme. The Prime Factor scheme, o the other had does ot rely o thresholdig but keeps a smaller versio of every data elemet from the origial data ad is thus able to achieve a much higher degree of error stability which is importat from a Data Aalysts poit of view. I this chapter we provide a fuller exploratio of the PFC scheme, icludig detailed results o various differet types of datasets ad a formal evaluatio of its performace o sparse data. Wavelet Data Compressio Scheme The Wavelet scheme compresses by trasformig the source data ito a represetatio (the Wavelet Trasform Array or WTA) that cotais a large umber of zero or ear-zero values. The trasformatio uses a series of averagig operatios that operate o each pair of eighborig source data elemets. I this maer the origial data is trasformed ito a data set (the Level trasform set) that cotais the averages of pairs of elemets from the origial data set. The pairwise averagig process is the applied recursively o the Level trasform set. The process cotiues i this maer util the size of the trasformed set is equal to. I order to be able to recostruct the origial data the pair-wise differeces betwee eighbors is kept i additio to the average. The WTA array the cosists of all pair-wise averages ad differeces accumulated across all levels. A thresholdig fuctio is the applied o the WTA to remove a large fractio of array elemets which are small i value. The thresholdig fuctio applies a weightig scheme o the WTA elemets as those elemets at the higher levels play a more sigificat role i recostructio tha their couterparts at the lower levels. For details of the wavelet ecodig ad decodig techiques the reader is referred to (Vitter et al, 999). Wavelet Decodig The decodig process recostructs the origial data by usig the trucated versio of the WTA. The decodig process is best illustrated with the followig example. Figure shows how the

5 coefficiets of the origial array are recostructed from the WTA (the C coefficiets hold the origial array while the S coefficiets hold the WTA). S0 S S S3 S4 S5 S6 S7 C0 C C C3 C4 C5 C6 C7 Figure : The Wavelet Recostructio Process Ay coefficiet C(i) is recostructed by usig its acestor S coefficiets i the path from the root ode to itself. Thus for example, C(0) = S(0) +S()+S() +S(4) ad C() = S(0)+S()+S()-S(4). Cosider a sceario where S(4), although relatively large i compariso to S(0), S() ad S(), is thresholded out due to its lower weightig, thus leadig to iaccuracies i the estimatio of C(0) ad C(). This is symptomatic of the geeral case where the lower level coefficiets are sigificat cotributors to the recostructio process but are thresholded out to make way for their more highly weighted acestor coefficiets. The problem is especially acute i the case of data that is both skewed ad have a high degree of variability. THE PRIME FACTOR SCHEME I respose to the problems associated with wavelets, we preset the Prime factor Scheme. The scheme works broadly o the same lies as the Wavelet techique i the sese that data compressio is used to reduce the size of the data cube prior to processig of OLAP queries. OLAP queries are ru o the compressed data, which is decoded to yield a approximate aswer to the query. Our ecodig scheme uses pre-processig to reduce the degree of variatio i the source data which results i better compressio. A overview of the ecodig process is give i the followig sectio. Overview of Prime Factor Ecodig Scheme The data is first scaled usig the stadard mi-max scalig techique. With this techique, ay value V i the origial cube is trasformed ito V, where V =(V-mi)*(max-mi)/(maxmi)+mi, where mi, max represet the miimum ad maximum values respectively i the origial data cube; mi ad max are the correspodig miimum ad maximum values of the

6 scaled set of values. Each scaled value V is the approximated by its earest prime umber i the scale rage [mi, max]. The choice of mi ad max iflueces both the degree of compressio ad the error rate as we shall show later i the Experimetal Setup sectio. The ratioale behid the scalig is to iduce a higher degree of homogeeity betwee values by compressig the origial scale [mi, max] ito a smaller scale [mi, max], with mi mi ad max < max. I doig so, this preprocessig step improves the degree of compressio. The scaled data cube is the partitioed ito equal sized chuks. The size of the chuk c represets the umber of cells that are eclosed by the chuk. The size of the chuk affects the decodig (query processig) time. Higher values mea fewer chuks eed to be decoded, thus reducig the decodig time (see Theorem 3 i the sectio o O lie Recostructio of Queries ). Each chuk is ecoded by the prime factor ecodig algorithm which yields a array cotaiig c cells. Although the ecoded versio has twice the umber of cells it is much smaller i size sice each cell is very small i umerical value. I fact, our experimetal results reveal that the vast majority of values are very small itegers (see Figures 3a ad 3b i the Experimetal Setup sectio). Figure below summarizes the Prime Factor ecodig process. I additio to trasformig values to very small itegers, the other major beefit of the algorithm is that the itegers are highly skewed i value towards zero. For example, o the Cesus data set (US Cesus) with a chuk size of 64, over 75% of the values tured out to be zero. These results were also bore out with the sythetic data sets that we tested our scheme o. The source (origial) data i some of these data sets were ot skewed i ature, showig that the skew was iduced, rather tha beig a iheret feature of the origial data. Mi Max Scalig Scaled values i tighter rage Chukig of scaled values Chuked iput stream Prime Factor Ecodig Prime Factor array Elias Ecodig of Prime Factor Array Ecoded Stream Figure : Overview of the Prime Factor Scheme We exploited the skewed ature of the ecoded data by usig the Elias variable legth codig scheme (Elias, 975). Elias ecodig works by ecodig smaller itegers i a smaller umber of bits, thus further improvig the degree of compressio. The ext sectio will describe the details ivolved i step 3 of the above process, the PFC ecodig algorithm. The Prime Factor Ecodig Algorithm The algorithm takes as its iput the scaled set of values produced by the mi-max scalig techique. For each chuk, every scaled value is coverted ito the prime umber that is closest to it. We refer to each such prime umber as a prime factor. The algorithm makes use of two

7 operators which we defie below. Both operators α ad β take as their iput a pair of prime factors V k ad V k+. Defiitio : The operator α is defied by α(v k,v k+ ) = earest prime (V k +V k+ +I(V k+ )- I(V k )), where I(V k+ ), I(V k ) deote the ordial (idex) positios of V k+ ad V k o the prime umber scale. The operator takes two primes (V k,v k+ ) adds them together with the differece i idex positios betwee the d prime ad the st prime ad coverts the sum obtaied to the earest prime umber. Defiitio : The operator β is defied by β(v k,v k+ ) = earest prime(v k +V k+ ) The α operator is applied pair-wise across all values (a total of c prime factors) i the chuk. This yields a stream of c/ prime factors. The α operator is the applied recursively o the processed stream util a sigle prime factor is obtaied. This recursive procedure gives rise to a tree of height log c where c is the chuk size. We refer to this tree as the Prime Idex Tree. I parallel with the costructio of the prime idex tree we costruct aother tree, called the Prime Tree. The prime tree is costructed i the same maer as its prime idex couterpart except that we apply the β operator, istead of the α operator. We illustrate the costructio of the trees with the help of the followig example. For the sake of simplicity we take the cube to be of size 4, the chuk size c to be 4 (i.e. we have oly oe chuk) ad the scale rage [mi, max] to be [0, 0] Figu 3 6 Figure 3a: The costructio of the Prime Idex Tree Figure 3b: The costructio of the Prime Tree For the prime idex tree (Figure 3a), the prime values 37 ad at the leaf level are summed as 37++I()-I(37), which is trasformed to its earest prime umber, 9. Similarly 7 ad 97 whe processed yield the prime umber, 73. The ode values 9 ad 73 i tur yield a root value of 33. As show i Figure 3b, the leaf values 37 ad whe summed ad coverted ito its earest prime yields a value of 37. Followig the same process, we obtai values of 67 ad 99 for the remaiig odes. We aotate each iteral ode by its correspodig idex value. The ecoded array E ow cosists of the differeces i idex positios across correspodig positios i the two trees. These differeces are referred to as differetials. For the example above the array turs out to be E = {5, -, }. We also store the idex value of each prime tree root separately i aother array R. Each chuk gives rise to its ow prime tree root, ad for the simple example above, sice we have oly oe chuk, this results i a sigle value {47} for R.

8 Decodig requires the use of the two arrays E ad R. We start with the first value i array R, which is 47, ad the add it to the first value i array E which is 5, yieldig the value 5. We use 5 as a idex ito a table of primes (this table has the pseudo prime 0 added to it with idex value 0) ad extract 33 as the correspodig prime value. We ow search for pairs of primes (P, P) which satisfy the coditio: α(p, P) = 33, β(p3, P4) = 99, I(P3) = I(P+) ad I(P4) = I(P-) (), where P3, P4 are the correspodig odes to P, P i the prime tree This search i geeral would yield a set S of cadidate pairs for (P, P) rather tha a sigle pair. I order to extract the correct pair, we associate a iteger with each iteral ode of the Prime Idex tree which records the ordial positio withi the set S which would eable us to desced to the ext level of the tree. I this case this iteger turs out to be 0 sice there is oly oe pair that satisfies coditio (). These itegers are collectively referred to as offsets. The complete set of offsets for the tree above is {0, 0, 0}. The complete versio of E cotais the sequece of differetials followed by the sequece of offsets. Thus for our example we have E = {5,,-,0,0,0}. We are ow able to decode by descedig both trees i parallel ad recover the origial set of leaf values 37,, 7 ad 97. For ease of uderstadig, the ecodig process above has bee described for a -d dimesioal case. The extesio to d dimesios follows aturally by ecodig alog each dimesio i sequece. For example, if we have a dimesioal cube <D, D > we would first costruct - dimesioal chuks. With a chuk size of 6 ad the use of equal width across each dimesio, each chuk would cosist of a 4 by 4 -d array. We first ru the ecodig process across dimesio D. This would yield a -dimesioal array cosistig of 4 prime root values for each chuk. The differetials ad offsets that result from this ecodig are stored i a temporary cache. The 4 prime root values from ecodig o D are the subjected to the ecodig process across dimesio D to yield the fial ecodig. The differetials ad offsets that result from ecodig alog D are merged with those from ecodig alog dimesio D to yield the fial set of ecoded values. Ecodig Performace The ecodig takes place i four steps as give i Figure. Steps ad, ivolvig scalig ad chukig ca be doe together. As data is read it ca be scaled o the fly with the chukig process. If the origial data is stored i dimesio order (D, D, D d ) with the rightmost idices chagig more rapidly, the it ca be show the I/O complexity ivolved i chukig is N N O log M, where M is the available memory size ad B is the block size of the uderlyig B B B database system. The reader is referred to [Vitter 999] for a proof. The I/O complexity of steps 3 N ad 4 is O. Thus the I/O complexity is bouded by the N N O log M term required for the B B B B chukig step. However it should be oted that this oly reflects a oe time cost i reorgaizig the data from dimesio order to chuked format. Oce this is doe, o further chukig is required as updates to values do ot affect the chuk structure. Thus the time complexity o a regular basis would be N bouded above by O. The oly exceptio occurs whe the dimesios are reorgaized ad B either grow or shrik as a result. This would require repetitio of the chukig step.

9 The Ratioale behid the Prime Factor Scheme Prime umbers provide us with a atural way of costraiig the search space sice they are much less dese tha ordiary itegers. The first 00 positive itegers are distributed across oly 5 primes. At the same time the primes themselves become less dese as we move up the iteger scale (Adrews, 994). The ext 900 positive itegers after 00 oly cotai 43 primes. This meas that the prime umber ecodig techique has good scalability with respect to data value size. From the error poit of view the use of primes itroduces oly small errors as it is possible to fid a prime i close eighborhood to ay give iteger (Adrews, 994). Theorem below gives the distributio of primes i the geeral case. Theorem The umber of prime umbers less tha or equal to a give umber N approaches N for large N. log ( N) e Proof: The reader is referred to (Adrews, 994) for a proof. Theorem reiforces the observatios made above. Firstly, the divisio by the logarithm term esures that the primes are less dese that ordiary itegers. Secodly, the average gap betwee a N give prime N ad its successor is approximately N / = loge N. loge N This meas that ecodig a umber N usig its earest prime will result i a absolute error of log e N o the average. These properties make prime umbers a attractive buildig block for ecodig umerical data. The basic idea that we utilize is to covert a stream of primes ito a sigle prime, the Prime Tree root value by a series of pair wise add operatios. We the augmet the root value with a set of coefficiets to eable us to decode. The use of prime umbers eables us to drastically reduce the search space ivolved i decodig ad as a cosequece reduce the space required to store the ecoded data. As a example cosider the prime root value of 9. I order to decode (assumig that we simply use the Prime Tree) to the ext level we have to cosider just eight combiatios (5,3), (3,5), (7,3), (3,7), (,9), (9,), (3, 7) ad (7,3). O the other had if prime umbers were ot used as the basis, the we would have a total of 30 combiatios to cosider i geeral, if N were the prime root, the N+ combiatios would have to be tested. The use of the Prime Idex tree i cojuctio with the Prime Tree eables us to costrai the search space eve further. With the itroductio of the former we are able to distiguish betwee pairs such as (5,3) ad (3,5). The pair (5,3) ecodes as 3 i the Prime Idex tree ad 9 i the Prime tree, whereas (3,5) ecodes as 3 i the Prime Idex tree ad 9 i the Prime tree. Apart from this, the other major beefit of growig two trees istead of oe is that we ca ecode takig the differetials betwee correspodig odes across the two trees rather tha ode values themselves. Sice the two trees evolve from a commo set of leaf values, the α ad β operators evaluate to approximately the same values which i tur causes the differetials to be much smaller tha the ode values themselves (see Figure 3a i the Experimetal Setup sectio).

10 Compariso of the Wavelet ad Prime Factor Schemes The wavelet ad prime factor schemes both use the cocept of reducig the origial data to differetials betwee progressively icreasig sub sets. However oe major differece is that that the Wavelet scheme uses thresholdig to drop some of the differetials. As oted before, this makes it ustable with respect to the error rate (Garofalakis ad Gibbos, 004) ad this is cofirmed by our results which we preset later. I cotrast, the Prime Factor scheme does ot use thresholdig but exploits the skew iduced by the prime factor trasformatio to ecode the resultig coefficiets usig a Elias code. This meas that every value i the origial data set is represeted i the ecoded versio which results i much greater stability over the Wavelet scheme. Aother major differece is that the ecoded data i the Wavelet scheme are represeted as relatioal tuples i the form <i, i, i d, c> where each i k (i the rage to d) idetifies the idex positio withi dimesio k i which the survivig coefficiet with value c is located. For large values of d, the degree of compressio obtaied ca degrade quite severely as each idex takes up additioal storage. I cotrast, the Prime Factor scheme does ot require ay such idexes as thresholdig is ot used ad thus we would expect it to have relatively better compressio for high dimesioality data. The Prime Factor scheme also copes well i the case of sparse data. All zero values i the origial data will scale to 0 with the scalig scheme that we use. Let us cosider a example where we have a ru of six zero s followed by the primes 0 ad 6 i the scaled data set ad suppose that we use a chuk size of 8. Sice we have exteded the prime umber set to iclude the pseudo prime 0, the 0 values ecode to 0 ad this results i a left prefix tree where we have o-zero values preceded by a ru of zero values. Figures 4a ad 4b below show the resultig trees N N Figure 4a: Left Prefix Prime Idex Tree Figure 4b: Left Prefix Prime Tree Similarly, whe the origial data stream cosists of a sequece of o zero values followed by a ru of zero values, we obtai a right prefix tree after ecodig, where a o zero iteral ode has a zero valued ode as its siblig i each of the Prime Idex ad Prime trees. The left ad right prefix trees ca be collapsed ad stored compactly as idicated by the followig Lemma ad Theorem. Lemma If β(p, P ) = 0, the we must have P +P =0.

11 Proof: From the defiitio of β, we have β(p, P ) = P +P +s = (), where s is a iteger that rouds the sum of P +P to the earest prime factor. Suppose that s 0. We ow cosider two cases, s>0 ad s<0. Case s<0. Sice s<0, it follows that P +P <------(), otherwise, it will be impossible to roud the sum of P +P to 0. We also have P +P (3) sice P 0, P 0. From () ad (3) it follows that P +P =0. Case s>0. The proof is similar to Case above. Sice s >0, the oly way that we ca satisfy () is for P +P 0 ---(4), i order to have ay chace of roudig to 0. From (3) ad (4) it follows that P +P =0. Theorem If i a left prefix or right prefix tree a give ode ecodes to zero, the it follows that the etire sub tree uder that ode ad its associated leaf values must be zero valued. Proof: We use the proof by cotradictio method as a proof sketch. We start with a pair of leaf odes havig values P ad P. Suppose that P 0, P 0. Foe the prime idex tree, we have α(p, P ) = P +P +I(P )-I(P )+r = (5), where r is a iteger that rouds the value of the sum of the precedig terms to the earest prime factor. From Lemma we have P +P =0. Substitutig this i (5), we have α(p, P ) = I(P )-I(P )+r =0. Thus I(P )=I(P )+r. Thus, if r>0, we have I(P )>I(P ), which i tur meas that P >P. With P >P ad P, P 0, we have P +P >0, which leads to a cotradictio. Similarly, with r<0, we have P >P, which agai leads to P +P >0. Thus, i either case, our origial assumptio of P 0, P 0 must be i error ad we have P =0 ad P =0. A similar proof holds for the prime tree. This meas that ay ode with a value of 0 must give rise to a pair of zero valued child odes (this is true of both the Prime Idex ad Prime trees). Each of these zero valued child odes i tur will yield childre who have zero valued. This completes the proof sketch. We ow retur to the example i Figures 4a ad 4b. I Figure 4a we ca collapse the tree by pruig all braches for the sub trees rooted at odes N ad N. This meas that we oly eed to store a total of six coefficiets, made up of 3 differetials ad 3 offsets (as opposed to a total of 4 for the o-sparse case). Thus it ca be see that the Prime Factor scheme adapts to a sparse data set without the eed to keep explicit idexes to keep track of the zero values i the data set. Ecodig Error Rate for the Prime Factor Algorithm I this sectio we preset a formal aalysis for the average relative error ivolved i decodig with the Prime Factor scheme. Theorem below quatifies this error rate. log ( ) Theorem 3 :The average relative error is approximately e CV, where C is the chuk size, V adv is the average value of a elemet withi a chuk

12 Proof Sketch: Cosider the Prime tree give i Figure 5 below with chuk size 8. P5 P3 P4 P9 P0 P P P P P P3 P4 P5 P6 P7 P8 Figure 5: Error Tree for Chuk Size 8 The value of a iteral ode, say P9 isgive by P9 = P + P + P, P where is a error term that deotes the approximatio to the earest prime umber Similarly,P = P + P + ad P = P + P +. Substitutig P for P 4 P4, P3 3 9 Thus it ca be see that a paret ode s value is give by the sum of the values of the leaf ode values that ca be reached by the paret ode plus the error terms at the itermediate ode levels. I geeral, we have P ad P 0 P0, P we obtai = P + P + P + P + 3, the root ode value give by P P + C C C i P i, i i= i= C The sum of the values T i the chuk is give by T 4 i the expressio for P P, P + P 4, P 3 + = P, P 0 i= 9 P i = C Thus we have d = P From Theorem, we C T = have the C i= P i, i average () gap betwee a prime umber P ad its ext as log e (P).This result ca be used to estimate loge ( P LC + P as RC ), where P LC,P the average value of the error coefficiet represet the left ad right give paret ode, the diviso by is ecessary because the Prime Factor RC childre of a algorithm ecodes each value with its earest prime umber.

13 Put P i + P i = KiT, where Ki for i =,,., C - Each error term ca ow be writte as P log ( ) i K, i = e i T C C Thus from () above we have d log ( ) log ( ) + ( )log ( ) e KiT e Ki C e T usig the Taylor series expasio for the log fuctio ad takig a first order approximatio for 0 K. i i= C i= C i= From the defiitio of K, we have K log ( C) i i i= ( K i ) + ( C )loge( T ) Sice log( C) << ( C ) << ( C )loge( T ), we have d ( C ) loge( T ) d ( C ) log ( CV ) The relative error, e. Note : the scalig terms to scale the differece d ad CV C V the sum CV to the origial data scale do ot eed to be cosidered as they cacel each other off i the umerator ad deomiator. Thus d loge( CV ) for large V C The above result captures the relatioship betwee the error rate, chuk size ad scale rage. As the chuk size icreases, the error rate grows as a log fuctio. This is show visually i Figure b, whereby the average error rate icreases sub liearly with respect to icreasig chuk size. The above expressio also shows that for a give chuk size C, the error rate decreases with a icrease i scale rage size. As the scale rage size icreases, the V term icreases at a faster rate tha the log (CV) term, thus causig a drop i the error rate. e this tred. ON LINE RECONSTRUCTION OF QUERIES Figure0b displays I this sectio we examie how multi-dimesioal rage sum queries are aswered with the Prime Factor scheme. These queries are of the form: {l i D i h i l i, h i {0,,, D i -}, h i = l i + i }, where i is a positive iteger. We will first derive a expressio for the time complexity ivolved i aswerig a -dimesioal query as some of the cocepts ivolved are shared with the geeral the -dimesioal case.

14 Aswerig Oe Dimesioal Rage Sum Queries I this case we eed to cosider three regios R, R ad R 3 as show i Figure 6 below. The query requires the sum of all elemets i the -dimesioal cube which are i the rage [l, h ], with l = ad h =690. The bulk of the query resides i regio R (the shaded regio) which is aliged with the chuk structure. The sum across this regio ca be aswered very efficietly by takig the sum of the root values for the chuks that spa this regio, thus avoidig the eed for decodig across a large portio of the query. Sice each of these values correspods to the root associated with the Prime tree they are a close approximatio to the sum cotaied withi a chuk. R R R 3 Figure 6 Decodig ad aswerig a dimesioal query Furthermore, with a large eough chuk size we would expect that the root values comprise a very small fractio of the size of the origial cube ad thus be either held i memory or stored o a small umber of blocks o secodary storage (with a chuk size of 56 for example, these root values comprise just /56 of the umber of elemets withi the origial cube). Note that this is true irrespective of the dimesioality of the query or that of the cube. The oly decodig that is ecessary is for the two chuks at the eds of the query, i.e. regios R ad R 3. Thus it ca be see that the chukig has helped to miimize the effort ivolved i decodig. It is also importat to ote that the decodig effort is depedet o the dimesioality of the query ivolved ad ot o the dimesioality of the cube. For example, for a query such as [l 3, h 3 ] (say with l 3 = ad h 3 = 690) o dimesio 3 of a 5-dimesioal cube, the umber of chuks to be decoded is still sice we oly eed to cosider the oe dimesioal slice alog dimesio 3 which is aggregated across dimesios,, 4 ad 5. Agai, this observatio holds for the geeral case as well. However, the amout of effort (i terms of the umber of chuks to be decoded) ivolved will deped o the dimesioality of the query as we shall see i the ext sectio. I the ext sectio we will preset a expressio for the geeral case, ivolvig queries o dimesios. The -Dimesioal Case l = h = 690 Suppose that the query spas dimesios. O each of these dimesios we ca defie three regios, the two regios at the eds of the query ad the middle regio which is aliged with the chuk structure. This leads to a total of 3 regios. Out of these the majority of the query is aliged with the chuk structure i a sigle cotiguous regio i -dimesioal space, thus requirig decodig across 3 regios. Each of these regios is o a boudary of the query i - dimesioal space ad thus teds to be small i size. Theorem 4 below quatifies the amout of decodig effort ivolved i the geeral case. Theorem 4 The total umber of chuks to be decoded for a -dimesioal query is 3 + Δ (d i ) where d i = i i = c

15 Proof Sketch: We will first cosider the regios o the corers of the cube defied by the query. There are such corers. Each of these corer regios are cotaied i exactly oe chuk. This leaves a total of (3 ) regios, occurrig alog dimesios. Each of these regios alog dimesio i will spa (d i ) chuks, sice of the total umber comprises the corer chuks. Thus the total umber of these chuks is 3 (d i ). To this umber we i = must add the corer regios. This completes the proof. d R R 5 R d R 6 R 7 R 9 R 3 R R 4 8 Figure 7a: The Regio aliged with the chuk structure Figure 7b: Regios bouded by a -dimesioal query Figures 7a ad 7b provide a geometrical explaatio for the D case. Figure 7a illustrates the regio that requires o decodig, ad it ca be see that it covers the bulk of the query space. Figure 7b gives a complete breakdow of regios bouded by the query. The regio R 7 correspods to the shaded regio i Figure 7a; R, R, R 3, R 4 represet the four corer regios that each require decodig a sigle chuk; the regios R 5, R 6, R 8 ad R 9 i geeral spa larger regios cotaiig multiple chuks that require decodig. Theorem 5 For a give query dimesioality, the chukig scheme improves decodig d efficiecy by a miimum factor of i i = 3 d i i = Proof Sketch: Follows from Theorem 4 above ad the fact that a -dimesioal query requires decodig d i chuks if aggregate sums (prime tree root values) are ot stored at chuk level. i = IMPLEMENTATION CONSIDERATIONS I this sectio we will provide a brief overview of the implemetatio of some of the major compoets of PFC. The system was implemeted etirely i Java. We made use of a table of prime umbers (Alfeld) readily available from the Web. All other fuctioality was custom built. Ecodig The three mai fuctios that were used extesively with respect to ecodig were Fid_Nearest_Prime(N), Evaluate_Alpha(N, M) ad Evaluate_Beta(N, M). The Fid_Nearest_Prime(N) fuctio was fairly straightforward to implemet as the use of the prime table meat that the complexity of testig for primes was avoided. The other two fuctios,

16 Evaluate_Alpha(N, M) ad Evaluate_Beta(N, M), formed the core of the PFC scheme ad were used to build the Prime Idex ad Prime Trees respectively. These fuctios essetially took two prime umbers as their argumets ad coverted their sum to the earest prime umber (the Evaluate_Alpha fuctio also added a compoet that ivolved the differece i idex positios betwee the two give primes). Thus these two fuctios were quite straightforward to implemet as well, as they oly ivolved simple computatios. Decodig With respect to decodig there were two cases to cosider. The first case ivolved chuks that required o recovery of the idividual leaf odes withi the chuk. This case covers a large fractio of the chuks ivolved i a query, as quatified by Theorem 4. For these chuks all that was required was extractio of the prime tree root value ad subsequet re-scalig to trasform the value back to the origial source data scale rage. The secod case ivolved chuks that required actual decodig ad this was accomplished through a fuctio, Desced_Prime_Tree(N), that was used to desced oe level dow the Prime Tree. Basically, this fuctio took a prime umber deotig the acestor ode i the correspodig tree as its argumet, ad the retured a set cotaiig pairs of primes that were cadidates for the child odes. We used a heuristic to speed up the geeratio of these cadidate sets. For a give acestor ode, with prime value N, we extracted the prime umber M that is earest to N/. The pair (M, M) is guarateed to be a cadidate for the Prime Tree child odes. This pair was used as a startig poit for the geeratio of the rest of the pairs. Figure 8 below illustrates the algorithm used for geeratig the Prime Tree cadidate set. Desced_Prime_Tree (N) { C={}; M = earest_prime ( N/ ) ; C = C {(M, M)}; Idx = I(M); // the idex value of prime M U = M; L = earest_prime(p[idx-]); // the largest prime less tha M while (L mi ad U max) // [mi..max] represets the scale rage { V = β(u, L); if (V = = N) the { C = C {(L, U), (U, L)}; U = earest_prime(p[i[u]+]); // icrease upper value i prime pair } else if (V <N) the U = earest_prime(p[i[u]+]); // icrease upper value i prime pair else L = earest_prime(p[i[l]-]); // decrease lower value i prime pair } } Figure 8: Algorithm for geeratig Cadidate Prime Pairs I order to further speed up the process of decodig we cached the prime pairs for a give iteger N. The decodig process beefits from such a cache as prime tree odes with a give value N are likely to recur may times across differet chuks. I such cases, the process of geeratig prime pairs is reduced to a fast i-memory table lookup. Our experimets show that this cache occupied less tha % of the storage of the origial source data.

17 With the compilatio of the cadidate prime pairs, the decodig process reduces to applyig the differetial ad offset coefficiets to desced both trees i parallel to extract the leaf level values, as described earlier i the sectio o The Prime Factor Ecodig Algorithm. EXPERIMENTAL SETUP I this sectio we describe the experimets we carried out with the Prime Factor ad Wavelet schemes. The experimetatio could be broadly divided ito two mai categories: A performace compariso betwee the two schemes, A ivestigatio ito the performace of the PFC scheme with respect to key parameters We use two metrics, Compressio Ratio, ad Relative Error to quatify the degree of compressio ad error respectively. The compressio ratio (cr) is defied by: cr = ecoded data size i bytes/size of origial data set i bytes, while the relative error (re) is give by: re = S- R /S, where S represets sum of the query o the origial (u-ecoded) data set, ad R is the recostructed sum after decodig has bee carried out with the Prime Factor or Wavelet schemes. Compariso of the Prime Factor ad Wavelet Schemes I this sectio we focus o a performace compariso betwee the Prime Factor ad Wavelet schemes o a rage of differet data sets. We used both real-world data from the (US Cesus) that was used i (Vitter et al, 998; 999, Chakrabarti et al, 000) as well as sythetic data sets for this purpose. The sythetic data sets were modeled o the criteria used i (Ng ad Ravishakar, 997). Two parameters, degree of skew ad degree of variatio were used i their geeratio. The degree of skew was take to be high whe 70 percet of the data elemets were draw from 30% of the domai value rage, with the other 30% of the data draw from a uiform data distributio i the domai value rage. O the other had, the degree of variatio was take to be high whe the differece i data elemet values is more tha 00% of the average data elemet value. Whe the differeces i data elemet values was o more tha 0% of the average data elemet value, the the data variatio was take to be low. Variatio i the value of these two parameters gave rise to four differet data sets: (No Skew, Low Variace), (High Skew Low Variace), (No Skew, High Variace) ad (High Skew, High Variace). Decodig Time The first experimet was ru o the US Cesus data ad was desiged to gai a isight ito how the decodig times varied with the size (i.e. the portio of the data that the query retrieved, expressed as a percetage) of the query posed. We used a chuk size of 64 ad a scale rage of [0..307] for the PFC ad a threshold of 7 % for the Wavelet. These parameter settigs esured that the two schemes produced roughly the same degree of compressio i order to isolate the effect of ecoded file size o the timigs. At each value of query size we ra 50 tests, cosistig of a batch of 0 for each value of the query dimesioality parameter which raged from to 5, ad the average time was recorded across the 50 rus. Figure 9 shows that the Prime Factor outperformed the Wavelet scheme i the etire rage tested. The two curves start at roughly the same poit at the lower ed of the size scale, but diverge quite

18 rapidly as the query size icreases. The good performace of the Prime Factor scheme ca be attributed to two factors. Firstly, its lower average (take across the etire rage of query size) decodig time per data elemet was 0.00 ms versus ms for the Wavelet. I the case of the Prime Factor scheme the iformatio (i.e. the decodig coefficiets) ecessary for decodig a data elemet is localized withi the chuk that ecapsulates it. There is o cocept of a sigle global tree, sice each chuk is ecoded separately, thus givig rise to a collectio of idepedet trees. This meas that chuks ca be decoded idepedetly from each other, ad oly those coefficiets belogig to chuks that require decodig eed to be examied. O the other had, the Wavelet scheme ecodes usig a sigle tree ad thus query processig ivolves searchig through the etire set of wavelet coefficiets to determie the cotributios made by each idividual coefficiet (Vitter, 999). Secodly, PFC decodes a much smaller umber of data elemets, as oly chuks alog the boudaries of a query eed to be decoded (as proved by Theorem 5). Prime Factor vs Wavelet Decodig Time Decode Time (ms) Prime Factor Wavelet Timig Query Size (%) Figure 9: Effect of Query Size o Decodig Time Error Rates I the ext set of experimets we compared the two schemes o error rate. As before we varied the query size i steps of 5 i the rage 5% to 90% ad measured the relative error at each step. For each size value we ra 0 tests, with each test retrievig data from differet regios i the data set. For each batch of 0 rus we measured the miimum, maximum ad average relative error values (all expressed as percetages) for the PFC (with scale rage of [0..307]) ad the Wavelet (with thresholdig set at 7%). As ca be see from Tables ad for roughly the same compressio ratio the two schemes have sigificatly differet error throughout the rage tested. The PFC scheme exhibits a small average error rate of aroud 0.3% right up to the 90% mark (this was true for higher percetages of the query size parameter - up to 99% which we have omitted for reasos of brevity). As Theorem demostrates the average error rate for the PFC is essetially depedet o the scale rage ad the chuk size used ad is ot a fuctio of query size. The differetial betwee the average error rates for the two schemes is much higher at the lower ed of the query size rage (e.g. from 5% to 55%) tha at the upper ed. At the upper ed of the query size rage the Wavelet s performace improves as the top level wavelet coefficiet represetig the overall mea of the data set assumes more importace i recostructio.

19 The relative error for the PFC scheme is remarkably stable throughout the query size rage. The miimum, average ad maximum values are much closer together tha its Wavelet couterparts. With the Wavelet scheme we have sigificat deviatios betwee the miimum ad maximum error values i the etire size rage. This is i lie with previous research (Vitter et al, 999), (Garofalakis et al, 004). I practice error stability is importat as this would ispire more cofidece i users of the accuracy of results to ad-hoc OLAP queries. We cross checked these results with the sythetic data sets that we geerated ad observed the same treds. Compressio Ratio PF Wavelet (7%) 7.37 Table : Comparative Compressio Rates Query Size (%) Relative Error (%) Mi Avg PFC Mi PFC Avg Max PF Max 307 Wavelet 307 Wavelet 307 Wavelet 5% % % % % % % % % % % % % % % % % % A Table : Comparative error rates Sesitivity of Prime Factor Performace o Key Parameters We ivestigated the effects of scale rage size ad chuk size o performace. We also looked at the distributios of the ecoded coefficiets i order to gai a isight ito whether the Elias code would be a effective represetatio mechaism. The experimets were coducted o the real world US Cesus data set. Effect of Scale Rage Size I this set of experimets we tested the effect of scale rage o both compressio ratio ad accuracy. We kept the lower boud of the scale at 0, ad varied the upper boud from 0 to 009 i approximate steps of 00. Figures 0a ad 0b track the effects of this variatio o compressio ratio ad relative error. We measured the relative error o queries that radomly picked a 0% sample of the data (i actual fact the relative error was obtaied by averagig across 0 rus for each value of the scale rage parameter).

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe CHAPTER 18 Strategies for Query Processig Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe Itroductio DBMS techiques to process a query Scaer idetifies