COMPRESSING DATA CUBE IN PARALLEL OLAP SYSTEMS

Frank Dehne 1, Todd Eavis 2, and Boyong Liang 3*

1 School of Computer Science, Carleton University, 1125 Colonel By Drive, Ottawa, Canada K1S 5B6, frank@dehne.net
2 Computer Science and Software Engineering, Concordia University, 1455 De Maisonneuve Blvd. West, Montreal, Canada, H3G 1M8, eavis@cs.concordia.ca
*3 School of Computer Science, Carleton University, 1125 Colonel By Drive, Ottawa, Canada K1S 5B6, byliang@connect.carleton.ca

ABSTRACT

This paper proposes an efficient algorithm to compress data cubes during parallel data cube generation. This low-overhead compression mechanism provides block-by-block and record-by-record compression by using tuple difference coding techniques, thereby maximizing the compression ratio and minimizing the decompression penalty at run-time. The experimental results demonstrate that the typical compression ratio is about 30:1 without sacrificing running time. This paper also demonstrates that the compression method is suitable for the Hilbert Space Filling Curve, a mechanism widely used in multi-dimensional indexing.

Keywords: OLAP, Data cube, Compressing, Parallel processing

1 INTRODUCTION

Data warehouses provide the primary support for Decision Support Systems (DSS) and Business Intelligence (BI) systems. One of the most interesting recent themes in this area has been the computation and manipulation of the data cube, a relational model that can be used to support On-Line Analytical Processing (OLAP). Data cube-based OLAP systems pre-compute multiple views of selected data by aggregating values across all possible attribute combinations (a group-by in database terminology). For a d-dimensional input set, there are $2^d$ possible group-bys. The resulting data structures can then be used to dramatically accelerate visualization and query tasks associated with large information sets [30]. Within the context of massive data volumes, data cube computation has to be very efficient with respect to speed and space.
Many research studies have shown that parallel computation effectively speeds up data cube construction. Data cube compression, on the other hand, is not only crucial for computing and storing data cubes in limited space but also reduces I/O access time. Though compression algorithms are quite common in the literature, most are poorly suited to database/cube environments as they (i) offer relatively poor compression ratios or (ii) result in significant run-time penalties [20, 37, 39, 40].

This paper investigates more efficient data cube compression techniques in the context of parallel OLAP computation systems, in particular the Parallel Algorithms for New Data Warehousing Architectures (PANDA) framework [12, 13, 14, 15, 16]. We propose an efficient data cube compression algorithm, XTDC (Extended Tuple Differential Data Cube Coding), as well as a group of corresponding data structures that can be employed in the context of high performance parallel OLAP computation. This paper also demonstrates that the XTDC method can be applied to the Hilbert Space Filling Curve, an appealing mechanism for multi-dimensional indexing frameworks [29, 35, 48]. We also propose two data cube computation algorithms, random query and sub cube construction, based on the compressed form of the XTDC data structure.

The experimental results show that the typical compression ratio for a full data cube in the PANDA system, in which the fact table has 10 dimensions and $10^6$ tuples, is 29.4 to 1 (without sacrificing running time). The dimensional data reduction is from 9778MB to 332MB (96.6%). The single view compression ratios are between 26 and 51 to 1.

The paper is organized as follows. Section 2 reviews the most interesting database compression techniques. Section 3 proposes the efficient data cube compression algorithm, XTDC, and corresponding data structures as well as two data cube computation algorithms based on the compressed data structure. This section also applies the Hilbert Space Filling Curve technique to XTDC. Section 4 presents the performance analysis, and Section 5 concludes the paper and discusses possible extensions of our methods.

2 RELATED WORK

In the context of data warehouse applications, we are only interested in lossless data compression techniques, which allow the original data to be fully recovered from its compressed form. There are two very important properties of conventional lossless data compression techniques: 1) data is processed serially (FIFO), and 2) the encoder and decoder share the same data model [17, 24, 25, 44, 45, 49]. We note that the random access of databases, such as tuple query, insertion, deletion, and update, conflicts with the serial processing and consistent data model properties of traditional data compression.

Database compression techniques have been researched since the 1990s [4, 5, 9, 10, 21, 34, 37, 39, 40, 41, 47]. Some of them focus on page (block) level compression. For example, Oracle applies a block-based dictionary compression technique, which reaches about a 3.1 compression ratio for a database of 55GB of data without a performance penalty in data warehouse applications [39].

Another actively researched class of database compression solutions tries to find more efficient data distributions from the characteristics and knowledge of the relation, thus achieving high compression ratios. The tuple structure of a relation is also preserved in its compressed form in order to support high performance database operations through the avoidance of unnecessary compression and decompression. Ray, Haritsa, and Seshadri proposed the Column Based Attribute Compression algorithm (COLA) in 1995 [40].
COLA uses a separate frequency distribution table for each attribute in a relation. The experimental compression ratio is 21.27% for a synthetic numeric relation.

Bit compression (BIT) is a well-known technique that represents each numerical attribute in bits instead of bytes. Goldstein, Ramakrishnan, and Shaft proposed a derivative algorithm of BIT in 1998 by compressing relations in blocks [20] (we refer to it as Block-BIT in this paper). The typical compression ratios on real data sets are between 3 and 4 to 1. The CPU cost of decompressing a relation is approximately 1/10 the CPU cost of GZIP.

Ng and Ravishankar proposed a block-oriented database compression technique, the Tuple Differential Coding (TDC) method [37]. In TDC, all attributes in a relation are mapped into numeral domains. Tuples are converted into ordinal numbers in ascending mixed-radix order. A compressed block only stores the value of the first tuple as a reference. Each succeeding tuple is replaced by its difference with respect to its preceding tuple. The i-th tuple in the relation can be reconstructed from the first tuple and the first i - 1 difference values. TDC takes the difference at the tuple level, which also keeps the tuple information of each attribute. The typical compression ratios of TDC are between 4 and 6.6 to 1 for tables with $10^6$ tuples and 8 dimensions [37].

3 DATA CUBE COMPRESSION

The multi-dimensional model is the most popular model used in data warehousing environments to support OLAP operations [30]. Data cubes, generated from fact tables, consist of the surrogate keys of the dimensional tables plus the measure fields. These surrogate keys are usually consecutive integers, which are automatically generated during the data warehouse Extract, Transform, and Load (ETL) stage. Considering the properties of the data warehouse and data cube computation, we propose the extended Tuple Differential Data Cube Coding (XTDC) strategies that build upon a number of ideas of existing techniques, specifically TDC [37], BIT, and Block-BIT [20]:

1). Treat dimensional data and measure data separately. The main I/O access task is related to the large views, which have a large number of records and high dimensions. The major objective in compressing data cubes is therefore to reduce the storage of dimensional data. Given the characteristics of data warehouse applications, the dimensional data are usually represented as integers. The compression ratio mentioned later in this paper is the ratio of the size of the original dimensional data divided by the size of the compressed dimensional data.
2). Compute tuple differences to compress the dimensional data at the block level. XTDC uses the fundamental idea of Tuple Differential Coding [37] to code the tuples in block-wise fashion. Each tuple is represented using the difference between it and its preceding tuple. In order to avoid the risk of data overflow in the case of large views with high dimensions, two tuple operations, tuple-plus and tuple-minus, are proposed to support a wider range and faster data computation.
3). Compact the differences into bits.
4). Compact all the differences (the compressed dimensional data) together to remove gaps caused by byte alignment. All the measure data are stored in the second part of the block.
5). Dynamically determine the number of tuples to be compressed into one block according to the value of the maximum difference in one block.
6). Use a counter mechanism to represent consecutive 1-differences. For views that have low dimensions but a large number of tuples, there is a very high probability that the difference values of consecutive tuples are 1s, because the attribute values of each dimension are usually consecutive integers.
7). Keep the compression information in each block. The information, such as the number of tuples in the current block and the number of bits of each difference value, is stored in the block header. It is dynamically calculated during the compression process and is used during decompression.

XTDC is a block-level lossless data cube compression technique. It uses the knowledge of the characteristics of the multi-dimensional data model to guide the compression and decompression processes. XTDC preserves tuple structure in compressed views in order to get the benefits of database compression. The following section describes the details of the XTDC compacted data structure.
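As a rough illustration of steps 2, 3 and 6 above, the following Python sketch summarizes one block. It is a minimal sketch under stated assumptions, not the authors' implementation: the tuples are assumed to be already mapped to sorted mixed-radix ordinals, the counter is assumed to cover only the run of 1-differences at the start of the block, and the function names are hypothetical.

```python
def leading_ones(diffs):
    """Length of the run of 1-differences at the start of the block (the 'counter')."""
    count = 0
    for d in diffs:
        if d != 1:
            break
        count += 1
    return count


def bits_needed(value):
    """Bits required to store a positive integer (at least one bit)."""
    return max(1, value.bit_length())


def summarize_block(ordinals):
    """Summarize one block given the sorted ordinals of its tuples.

    Assumption: differences covered by the counter are not stored, and
    every remaining difference uses the same bit width.
    """
    diffs = [b - a for a, b in zip(ordinals, ordinals[1:])]
    counter = leading_ones(diffs)
    stored = diffs[counter:]
    width = bits_needed(max(stored)) if stored else 0
    return {
        "first": ordinals[0],      # kept uncompressed in the block header
        "counter": counter,        # number of leading 1-differences
        "bit_width": width,        # bits per stored difference
        "stored_diffs": stored,
    }


summary = summarize_block([100, 101, 102, 103, 110, 117, 120])
print(summary)
```

In this example the six differences are 1, 1, 1, 7, 7, 3, so three are absorbed by the counter and the remaining three fit in 3 bits each, for 9 bits of dimensional data beyond the header.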
3.1 The XTDC Compacted Data Structure

XTDC keeps the compression information in each block, namely in the block header, in order to preserve the tuple structure in compressed data for high performance data access and to compress data cubes at the block level. In this way, each compressed data block contains all the information necessary to decompress this block or to locate the required data (tuples) in the compressed data directly. As Table 3.1 shows, a typical compressed block consists of three parts:

1). The block header contains the compression information for this block. The length and the content of the block header may vary according to the different compression algorithms. Storing the uncompressed first tuple in the block header also supports efficient indexing.
2). The dimensional data area contains, in bit form, all the compressed dimensional data (the difference values of the tuples in this block), in order to avoid the spare bits between tuples that may be caused by byte alignment.
3). The measure data area contains all the measure data in original form (uncompressed format). The offset of this segment in the current block is given by: measure offset = length of block header + (number of tuples - counter) × (number of bits for dimensional data) / 8. The offset of the measure data of the i-th tuple is (measure offset) + (i - 1) × (number of bytes for measure data).

Table 3.1. The XTDC data structure of a compressed data view block

Block header        Length of block header
                    Number of tuples for this block
                    Number of bits for dimensional data of each tuple
                    Number of bytes for measure data of each tuple
                    Counter
                    First tuple in uncompressed format
Dimensional data    Compacted tuple differences (in bits)
Measure data        Measure data of 2nd tuple (uncompressed)
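The two offset formulas above can be sketched in Python as follows. This is only an illustration: the `BlockHeader` class and its field names are hypothetical stand-ins for the header fields of Table 3.1, the formula is read as (number of tuples - counter) × bits / 8, and rounding the packed bits up to whole bytes is an assumption the text does not state.

```python
from dataclasses import dataclass


@dataclass
class BlockHeader:
    """Hypothetical in-memory form of the block header fields in Table 3.1."""
    header_length: int   # bytes occupied by the header
    num_tuples: int      # tuples encoded in this block
    bits_per_diff: int   # bits used for each stored difference
    measure_bytes: int   # bytes of measure data per tuple
    counter: int         # consecutive 1-differences represented by the counter
    first_tuple: tuple   # first tuple, uncompressed


def measure_offset(h):
    """Byte offset of the measure data area inside the block."""
    packed_bits = (h.num_tuples - h.counter) * h.bits_per_diff
    # Assumption: the bit-packed area is padded up to a whole number of bytes.
    return h.header_length + (packed_bits + 7) // 8


def measure_offset_of(h, i):
    """Byte offset of the measure data of the i-th tuple (1-based)."""
    return measure_offset(h) + (i - 1) * h.measure_bytes


header = BlockHeader(header_length=32, num_tuples=1000, bits_per_diff=5,
                     measure_bytes=4, counter=200, first_tuple=(0, 0, 0))
print(measure_offset(header), measure_offset_of(header, 3))
```

Because every measure has a fixed width and the header records the bit width of the differences, the measure of any tuple can be addressed without decoding the dimensional area.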

Because most data cube operations are read-only in data warehouse applications, XTDC focuses on storing as many encoded tuples in a block as possible, rather than designing a more flexible data structure for update operations. The following section discusses the details of using this data structure in the XTDC algorithms.

3.2 The Extended Tuple Differential Data Cube Coding Algorithm: XTDC

As we discussed previously, the principal idea of the tuple differential coding algorithms is to store the difference values of consecutive tuples. The mixed-radix values of tuples can be calculated according to Eq. (3.1) of Definition 1 [37].

Definition 1. A relational scheme $R = \langle A_1, A_2, \ldots, A_n \rangle$ is a sequence of attribute domains, where $A_i = \{0, 1, \ldots, |A_i| - 1\}$ for $1 \le i \le n$. The value of one tuple $\langle a_1, a_2, \ldots, a_n \rangle$ is defined as:

\[ \varphi\langle a_1, a_2, \ldots, a_n \rangle = \sum_{i=1}^{n} \Bigl( a_i \prod_{j=i+1}^{n} |A_j| \Bigr) \qquad (3.1) \]

In enterprise level data warehousing environments, a view to be compressed may have high dimensions with high cardinalities. Consequently, there is a very high risk of data overflow when mapping each tuple to a mixed-radix value in such environments. However, when views are fully sorted, the differences between two consecutive tuples are usually very small. We propose two tuple operators, tuple_plus and tuple_minus, in order to encode views safely and efficiently. Theorem 1 gives the principle of the operators.

Theorem 1. Given two consecutive tuples $\langle a_1, a_2, \ldots, a_n \rangle$ and $\langle a'_1, a'_2, \ldots, a'_n \rangle$, the difference value of these two tuples is:

\[ \varphi\langle a'_1, a'_2, \ldots, a'_n \rangle - \varphi\langle a_1, a_2, \ldots, a_n \rangle = \sum_{i=1}^{n} \Bigl( (a'_i - a_i) \prod_{j=i+1}^{n} |A_j| \Bigr) \qquad (3.2) \]

Algorithm 1 calculates the difference by directly manipulating the attribute values of the tuples to avoid data overflow. It is also efficient because it reduces the multiplicative operations to addition operations.

Algorithm 1. Tuple Minus
Input: Two tuples T1, T2 with d dimensions; cardinalities (C[i]) for each attribute domain (A[i])
Output: The difference between the mixed-radix values of T2 and T1
1: difference = 0;
2: for i = 0 to d - 1 do
3:   difference = difference * C[i] + (T2[i] - T1[i]);
4: end for
5: return difference;

During the decompression process of XTDC, we can exploit the fact that the preceding tuple has already been decoded into its uncompressed format when decoding the current tuple. We propose the tuple_plus operator to operate directly on attributes of tuples in order to avoid computing the mixed-radix values.

XTDC is a block-level compression technique using the XTDC compacted data structure. The encoded dimensional data, the differences, are compacted by bits and grouped together to save maximal space. The block information is stored in the block header. The number of bits for each tuple difference value is dynamically determined by the maximum value of the differences. Algorithm 2 presents the details of the XTDC data cube compression algorithm.

Algorithm 2. XTDC Data Cube Compression Algorithm
Input: A view (in_buf) to be compressed and its metadata
Output: The compressed view (out_buf)
 1: Create a block header that contains the first tuple;
 2: index = 0; processed_tuples = 0; counter = 0;
 3: for all tuple[i] of in_buf do
 4:   difference = tuple_minus(tuple[i], tuple[i - 1]);
 5:   Compute the number of tuples in this block;
 6:   if tuple[i] is not the last tuple and can fit in the block then
 7:     if consecutively difference == 1 then
 8:       counter++;
 9:     else
10:       difference_buf[index++] = difference;
11:     end if
12:     max_difference = max(difference, max_difference);
13:     tuples_per_block++;
14:   else
15:     Compute offset of measure data in this block;
16:     for j = 1 to tuples_per_block do
17:       if (j > counter) then
18:         compact difference_buf[j] into log2(max_difference) bits in in_buf;
19:       end if
20:       out_buf[offset++] = in_buf[processed_tuples + j, dimension - 1];
21:     end for
22:     complete current block header;
23:     processed_tuples += tuples_per_block;
24:     if tuple[i] is not the last tuple then
25:       copy tuple[i] to new block header;
26:       index = 0; counter = 0; tuples_per_block = 1;
27:     end if
28:   end if
29: end for

For each block, XTDC employs two phases:

1). The computation phase: XTDC calculates the differences of consecutive tuples by using tuple_minus (line 4) and dynamically computes the number of tuples in one compressed block (line 5). XTDC checks every tuple to determine if it can fit in the current block according to the changing value of the maximum difference (max_difference) and the number of consecutive 1s (counter). The differences are stored in a buffer (difference_buf) in integer form in this phase.
2). The compact phase: After collecting enough differences for one block (or once all tuples have been encoded), XTDC computes the offset (offset) of the measure segment of the current block (line 15). The measure data is copied to the measure area of the block in uncompressed format (line 20). All of the differences calculated in the first phase are compacted in bit format into the dimensional area of the block (line 18). Each of these differences occupies log2(max_difference) bits. Finally, XTDC completes the block header (line 22) and starts to compute the next block.
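The two tuple operators can be sketched in Python as follows. `tuple_minus` follows Algorithm 1 directly. The paper does not list the steps of tuple_plus, so the carry-propagating version here is an assumption about how it might work, chosen so that it inverts `tuple_minus`. Python integers do not overflow, so the sketch shows only the arithmetic, not the fixed-width behaviour the operators are designed to protect.

```python
def tuple_minus(t2, t1, cards):
    """Difference of the mixed-radix values of t2 and t1 (Algorithm 1).

    The running value stays close to the final difference, so the full
    mixed-radix value of either tuple is never materialized.
    """
    difference = 0
    for a2, a1, c in zip(t2, t1, cards):
        difference = difference * c + (a2 - a1)
    return difference


def tuple_plus(t, difference, cards):
    """Add a non-negative difference to tuple t (assumed inverse of tuple_minus).

    Works from the last attribute to the first, carrying into the next
    more significant attribute whenever a cardinality is exceeded.
    """
    out = list(t)
    carry = difference
    for i in range(len(out) - 1, -1, -1):
        carry, out[i] = divmod(out[i] + carry, cards[i])
    if carry:
        raise OverflowError("difference leaves the tuple domain")
    return tuple(out)


cards = (3, 4, 5)                 # hypothetical cardinalities |A_1|, |A_2|, |A_3|
prev, curr = (0, 3, 4), (1, 0, 1)
d = tuple_minus(curr, prev, cards)
print(d, tuple_plus(prev, d, cards))
```

With these cardinalities the two tuples have mixed-radix values 19 and 21, so the encoder stores the difference 2 and the decoder recovers (1, 0, 1) from the previous tuple.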
Note that the XTDC technique supports access to each tuple at the block level without loading the whole view and decompressing it. However, in this particular project, we use the XTDC interface to decompress the whole view immediately after loading it into main memory. So, in our project, the XTDC decompression algorithm loads the whole compressed view and retrieves the compression information from the block header. It then computes the tuples one by one using tuple_plus, and simply copies the measure data to the output buffer. The next block is processed (if necessary) when all tuples of the current block are decoded.

3.3 Applying the Hilbert Space Filling Curve Technique to XTDC

The XTDC algorithm uses the tuple differential coding method to compress data cubes. Tuples are sorted in a specific order and then converted into an integer representation. The difference between consecutive integers is calculated and used to represent the original data. The method that performs the integer mapping must be one-to-one in order to convert (decompress) the compressed integer back to a unique tuple representation. The Hilbert Space Filling Curve technique traces a unique pathway through the points of a multidimensional space. In this sense, it may be used to provide a clean one-to-one mapping of a tuple to its ordinal or indexed position in the hypercubic space. We apply the Hilbert ordering to the XTDC compression process in two phases:

1). Hilbert ordering maps the tuples (dimensional data) of the views to the sequence of the Hilbert Space Filling Curve and sorts the views by these sequential values. We note that a Hilbert re-sorting is required here because the core cube aggregation algorithms use a lowx ordering.
2). XTDC encoding uses steps similar to the standard XTDC approach, except that tuple_minus is the simple integer minus, and the first tuple is stored as its Hilbert sequence value in the block header.

The decompression process is composed of two phases: XTDC decoding and Hilbert de-ordering. It is very important to note that the PANDA system utilizes the Hilbert Space Filling Curve to compute the sorted views for multidimensional indexing. As a result, there is significant potential to improve the data cube compression performance because we can effectively get the Hilbert sorted views for free.

3.4 Compressed Data Cube Computation

Unlike conventional data compression techniques, XTDC preserves the tuple structure in compressed data, thereby allowing the OLAP computation system to manipulate the data cube in compressed format. By avoiding unnecessary decompression and compression computations, XTDC not only reduces the storage requirement and I/O bandwidth but also reduces main memory requirements. The XTDC algorithms are able to retrieve one single tuple from a compressed view without decoding the whole block. They also improve the quality of index structures such as B-trees and R-trees by reducing the number of leaf blocks. Algorithm 3 presents the steps of locating a specific tuple in a compressed view.

Algorithm 3. Locating One Specific Tuple in a Compressed View
Input: A compressed view in XTDC data format. Dimensional data of the specific tuple, t
Output: Measure data of t (NULL for a non-existing tuple)
1: Locate the block that may contain t by checking the first tuple in block headers
2: Load the entire block into main memory
3: Compute the difference (v) between the required tuple (t) and the first tuple
4: Accumulate the first i difference values until we reach one that is equal to OR greater than v
5: Return NULL if the difference value is greater than v
6: Compute the offset of the measure data segment in the current block according to the header information
7: Return the i-th measure data

Generating sub views from a given view (parent view) is one of the primary operations of data cube computation [16, 32]. The XTDC technique allows OLAP computation systems, such as PANDA, to compute a compressed sub view from a compressed parent view directly.

Definition 2. Given a parent view $R = \langle A_1, A_2, \ldots, A_n \rangle$, $\varphi\langle a_1, a_2, \ldots, a_n \rangle$ is the value of tuple $\langle a_1, a_2, \ldots, a_n \rangle$. Its k-subview is $R' = \langle A_1, A_2, \ldots, A_{k-1}, A_{k+1}, \ldots, A_n \rangle$, and $\varphi'\langle a_1, a_2, \ldots, a_{k-1}, a_{k+1}, \ldots, a_n \rangle$ is the value of tuple $\langle a_1, a_2, \ldots, a_{k-1}, a_{k+1}, \ldots, a_n \rangle$.

Theorem 2. Given a parent view $R$, the tuple value $\varphi'$ of its k-subview $R'$ is:

\[ \varphi' = \Bigl( \varphi \;\mathrm{div}\; \prod_{l=k}^{n} |A_l| \Bigr) \prod_{l=k+1}^{n} |A_l| \;+\; \varphi \bmod \prod_{j=k+1}^{n} |A_j| \qquad (3.4) \]

Algorithm 4 computes a k-subview from a parent view using the XTDC data structure.
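Theorem 2 can be checked numerically with a short Python sketch. The helper names and the cardinalities are hypothetical; `phi` implements the mixed-radix value of Definition 1, and `subview_value` reads Eq. (3.4) as an integer quotient scaled by the product of the cardinalities after position k, plus a remainder.

```python
from math import prod


def phi(t, cards):
    """Mixed-radix value of tuple t under cardinalities cards (Definition 1)."""
    value = 0
    for a, c in zip(t, cards):
        value = value * c + a
    return value


def subview_value(value, k, cards):
    """Value of the tuple in the k-subview, with k counted from 1 (Eq. 3.4)."""
    low = prod(cards[k:])          # product of |A_l| for l = k+1 .. n
    high = low * cards[k - 1]      # product of |A_l| for l = k .. n
    return (value // high) * low + value % low


cards = (3, 4, 5)                  # hypothetical cardinalities
t = (2, 1, 3)
v = phi(t, cards)
# Dropping the second attribute should give the value of (2, 3) under (3, 5).
print(v, subview_value(v, 2, cards), phi((2, 3), (3, 5)))
```

The quotient keeps the attributes before position k, the remainder keeps those after it, and the attribute at position k contributes to neither term.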

Algorithm 4. Construct a Compressed k-subview From a Compressed Parent View
Input: V_p, a compressed parent view in XTDC data structure
Output: V_s, the compressed k-subview in XTDC data structure
 1: Initialize view buffer: view_buf.
 2: repeat
 3:   Load one block of V_p.
 4:   for all tuples t_i of V_p do
 5:     Compute the value v_pi of tuple t_i.
 6:     Get measure data m_pi of t_i.
 7:     Compute the corresponding tuple value, v_si, in V_s.
 8:     Accumulate the measure data of V_s: view_buf[v_si] += m_pi.
 9:   end for
10: until all blocks of V_p are processed.
11: Construct the k-subview: compact view_buf in XTDC format.

3.5 Compressing Data Cube in the PANDA System

PANDA supports high performance parallel data cube computations. Its I/O Manager is a significant feature that handles efficient I/O access during the manipulation of massive data sets. We implement the XTDC algorithm as a Compression Interface and plug the interface into the I/O Manager in order to compress (and decompress) data cubes. In the data-writing phase, the tuples in the view buffer are compressed by block before they are physically written to disk. In the data-loading phase, the entire compressed view is loaded from disk and decompressed in main memory (Input Buffer). The details of the system structure and the XTDC Compression Interface implementation are discussed in [16, 33].

In this section, we discussed the efficient data cube compression algorithm, XTDC, and its corresponding compact data structure, as well as two OLAP operations based on this data structure: random point query and sub view generation. We also demonstrated that the XTDC algorithms can utilize the Hilbert Space Filling Curve technique. XTDC therefore has potential for use in OLAP systems that use the Hilbert space technique for multidimensional indexing. Finally, we introduced the strategy of applying the XTDC algorithms to a parallel OLAP computing system, PANDA. In the next section, we present the experimental results and evaluate the performance of the XTDC algorithms.
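The random point query of Algorithm 3 can be sketched in Python as below. This is a simplified illustration under several assumptions: tuples are represented by their ordinal values, so the tuple difference is a plain integer subtraction; each block is a hypothetical dictionary holding its first ordinal, its already unpacked differences (with counter runs expanded to 1s), and one measure per tuple; and the block is found by binary search over the first ordinals.

```python
from bisect import bisect_right


def lookup(blocks, target):
    """Return the measure of the tuple with ordinal `target`, or None."""
    firsts = [b["first"] for b in blocks]
    idx = bisect_right(firsts, target) - 1      # step 1: candidate block
    if idx < 0:
        return None
    block = blocks[idx]                         # step 2: load the block
    v = target - block["first"]                 # step 3: difference to first tuple
    accumulated = 0
    position = 0
    for d in block["diffs"]:                    # step 4: accumulate differences
        if accumulated >= v:
            break
        accumulated += d
        position += 1
    if accumulated != v:                        # step 5: tuple does not exist
        return None
    return block["measures"][position]          # steps 6-7: fetch the measure


blocks = [
    {"first": 100, "diffs": [1, 1, 7, 3], "measures": [10, 11, 12, 13, 14]},
    {"first": 200, "diffs": [5], "measures": [20, 21]},
]
print(lookup(blocks, 109), lookup(blocks, 105), lookup(blocks, 205))
```

Only one block is touched per query, and the scan stops as soon as the running sum reaches or passes the requested difference.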
4 EVALUATION

This section evaluates the performance of data cube compression techniques implemented in the PANDA System. The main goal of data cube compression is to reduce the space requirements of data cube computation while maintaining reasonable response time. Our tests therefore focus on two main issues: compression ratio (CR) and compression/decompression speed.

In the context of data cube compression, our implementations compress the dimensional data and leave the measure data in uncompressed form. Our evaluations use the dimensional data compression ratio, CR = (dimension size without compression) / (dimension size with compression). Both compression and decompression processes are involved in data cube computation. We use wall-clock running time to evaluate the speed performance for both single view computation and full data cube computation.

In the multi-dimensional model, a data cube is organized in exactly the same format as that of a conventional relational table. In order to compare the compression efficiency between XTDC and the existing database compression techniques, we implemented the TDC [37] and the BIT database compression algorithms. It is worth noting that when we apply the TDC database compression algorithm to achieve data cube compression in PANDA, we follow the fundamental ideas of [37] except that we use our tuple computation algorithms, tuple_minus and tuple_plus, to avoid data overflow during the computation of tuple differences. As proposed by [37], TDC uses Run Length Coding (RLC) to encode the number of leading zero components in each difference. In our particular implementation, we store the differences in integer form (4-byte), which costs less than using RLC encoding. We also plug in an open source conventional compression library, BZIP [6, 43].

All of our tests were conducted on a Linux cluster, whose primary characteristics are listed below [26]:

Linux Kernel xsmp (Redhat 7.3)
64-processor (dual processor nodes) 32-node Beowulf configuration
Gigabit Ethernet (1000Mbps) Switch with a 32Gbps
Each node has a 2 GHz Intel Xeon processor, 1.5GB RAM, and 60 GB IDE disks.

We will look at a sequence of data cube compression tests, each designed to highlight one important characteristic. We evaluate fact tables with 6 to 10 dimensions. The number of tuples in these fact tables ranges from 100K to 2M. The fact tables themselves are created with PANDA's Data Generator [16] by specifying parameters such as the number of tuples in the data set, the number of dimensions, and the cardinality in each dimension. In effect, we utilize a set of base parameters and then vary exactly one of these parameters in each of the tests. These base parameters are (with defaults listed in parentheses): a) Fact Table Size (1M) and b) Dimension Count (10). Table 4.1 lists the cardinalities used in all of our test cases.

Table 4.1. The metadata of the test data cubes

Name of Dimension    A  B  C  D  E  F  G  H  I  J
Cardinality

4.1 Single View Compression

The efficiency of compression techniques can be clearly evaluated on single view tests. We use the Data Generator [16] to generate a group of fact tables with 500K, 1M, 2M, 5M, and 10M tuples respectively. All of these fact tables have 10 dimension attributes and 1 measure field. We arbitrarily create a group of single views corresponding to each fact table by using the Partial Data Cube generation module in PANDA. Each of these views has 7 dimensions and 1 measure field. The number of tuples and the original size of these views are listed in Table 4.2. For each single view, we apply different compression techniques, including BIT, TDC, XTDC, BZIP, and Linux GZIP. Because GZIP and BZIP are global range data compression techniques, their compression ratios are computed as total compression ratio.
The BZIP libraries [43] are plugged into the same test harness as BIT, TDC, and XTDC. Both the best compression option (gzip -9) and the fast compression option (gzip -1) of GZIP are used to evaluate compression ratio and running speed.

Table 4.2. The data volume of views ABCDFJG

Tuples in the Fact Table    10M  5M  2M  1M  500K
Tuples in the View
Uncompressed Size (MB)

Figure 4.1 shows that the compression ratios of BIT, TDC, BZIP, and GZIP are between 5 and 12 to 1. With the increasing size of views, the ratios of conventional compression techniques (BZIP, GZIP) slightly increase because there are better data distributions in a larger range. The number of tuples in a view does not affect the compression ratio of BIT because its compression ratio is only determined by the number of bits for every tuple, which corresponds to the cardinalities of the dimensions. The compression ratio of TDC remains stable because it always stores differences in integer form, which costs 4 bytes in our system, no matter how small the differences are. The experiments show that XTDC reaches compression ratios between 26 and 51 to 1, which are much higher than those of the other techniques. In XTDC, the number of bits required to store the differences in a block is determined by the maximal difference of consecutive tuples in that block. Both the bit compaction technique and the counter mechanism help XTDC to reach higher compression ratios with an increasing number of tuples in a view.
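The reason the BIT and TDC ratios do not depend on the number of tuples can be made concrete with a small Python sketch. It assumes that each uncompressed dimension value occupies 4 bytes (which matches the 40MB reported later for 10 dimensions and $10^6$ tuples), ignores block headers, and uses hypothetical cardinalities, since the values of Table 4.1 are not reproduced here.

```python
def bit_ratio(cards, bytes_per_value=4):
    """Compression ratio of BIT: each attribute packed into the bits it needs."""
    packed_bits = sum(max(1, (c - 1).bit_length()) for c in cards)
    original_bits = len(cards) * bytes_per_value * 8
    return original_bits / packed_bits


def tdc_ratio(num_dims, bytes_per_value=4, bytes_per_diff=4):
    """Compression ratio of TDC when every difference is a fixed-size integer."""
    return (num_dims * bytes_per_value) / bytes_per_diff


# Hypothetical cardinalities for a 7-dimensional view.
cards = [256, 128, 64, 32, 16, 8, 4]
print(bit_ratio(cards), tdc_ratio(7))
```

Under these assumptions a 7-dimensional view gives a TDC ratio of 7 to 1 regardless of view size, which falls inside the 5 to 12 range reported for the non-XTDC techniques, while the BIT ratio is fixed by the cardinalities alone.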

Figure 4.1. Compression ratio comparisons for single view compression

Figure 4.2 presents the total running time (compression time plus decompression time) of these compression algorithms. Data cube compression techniques (BIT, TDC, and XTDC) have the same range of running times, which are much faster than the conventional ones.

Figure 4.2. Total runtime (compression and decompression) comparison for single view compression

4.2 Full Cube Compression

In full cube tests, $2^d$ single views are created. These views cover all the possible combinations from 1 dimension to d dimensions. The efficiency of single view compression will definitely affect the full cube compression. Figure 4.3 presents the average compression ratios for full cube computation. The fact tables of these cubes have 10 dimensions and the same cardinality distribution as listed in Table 4.1. Consistent with the result in Figure 4.1, XTDC reaches a much higher compression ratio than the others (BIT, TDC) do.

Figure 4.3. The comparison of full cube compression ratios

It is worth noting that the fully materialized data cube is much bigger than the fact table. In one of our test cases, using a fact table with 10 dimensions and $10^6$ tuples, the dimensional data of the fact table is 40MB, while the total dimensional data in the full data cube generated by this fact table is 9778 MB. XTDC reaches a 29.4 to 1 compression ratio, which reduces the dimensional data from 9778 MB to 333MB. Figure 4.3 also shows that XTDC is more efficient in terms of compressing the full cube that has been generated by the fact table with the same dimensions but a larger number of tuples. In this experiment, XTDC reaches a 31.8:1 compression ratio when the fact table of the cube contains tuples.

Note that the compression ratios are lower on the full cube than on the single views we tested in Section 4.1. As we discussed previously, XTDC uses one difference value (several bits in many cases) to represent the dimensional data of one tuple. As the number of dimensions decreases (and most views have fewer than the 7 dimensions used in the single view test), the ability to compress diminishes as well. Conversely, the greater the number of dimensions, the greater the benefit for compression. We also note that, as in single view compression, the compression ratios of BIT and TDC are not significantly affected as the number of tuples increases.
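The figures quoted for the $10^6$-tuple test case can be cross-checked with a few lines of arithmetic. The only assumption is that each uncompressed dimension value occupies 4 bytes, which is what the reported 40MB implies.

```latex
\begin{align*}
\text{fact table dimensional data} &= 10 \times 10^{6} \times 4\ \text{bytes} = 40\ \text{MB} \\
\text{number of views in the full cube} &= 2^{10} = 1024 \\
\text{compressed dimensional data} &= \frac{9778\ \text{MB}}{29.4} \approx 332.6\ \text{MB} \\
\text{relative reduction} &= 1 - \frac{1}{29.4} \approx 0.966 = 96.6\%
\end{align*}
```

The value 332.6MB explains why the compressed size appears as both 332MB and 333MB in the text, depending on whether it is truncated or rounded.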

Figure 4.4. Runtime for parallel full cube generation with compression

Figure 4.4 presents the speedup of PANDA with data cube compression on multiple processors. The result shows that XTDC, as well as BIT and TDC, works very well with PANDA's parallel data cube computation. The running times are very close to the original ones. We do not include conventional compression techniques in the full cube tests because of the results of the single view experiments. In fact, the BZIP compression libraries significantly slow down full cube computation. In one of our experiments, PANDA with BZIP compression takes 872 seconds to generate a full cube using a fact table that has 10 dimensions and tuples. The total compression ratio is 7.7 to 1. As we can see from Figure 4.3 and Figure 4.4, XTDC reaches a compression ratio of 22.5 to 1 for a larger data cube, which is generated by using a fact table with 10 dimensions and $10^6$ tuples, and does so in less than 900 seconds.

5 CONCLUSION AND FUTURE WORK

This paper proposes an efficient data cube compression algorithm, XTDC, and its corresponding compact data structure. Building upon a number of existing compression algorithms, XTDC is effectively a combination of the following techniques:

1). Tuple differential coding: Tuples in the views are mapped to integer values and the differences of consecutive tuples are used to represent the views.
2). Bit compaction: The tuple differences are stored in bit form and are compacted.
3). Block-wise compression: The tuples are compressed in blocks (pages). This not only increases the compression ratio by reducing the value range of the differences but also makes data access more efficient, since all the compression information is localized in individual blocks.
4). Handling dimensional data and measure data separately to remove the gaps caused by byte alignment.
5). Using metadata information: Knowledge of the data cube is used when compressing and decompressing.

The XTDC technique preserves the tuple structure in compressed data cubes.
The XTDC data structure makes the compressed blocks accessible to common indexing methods such as B-trees or the packed R-trees that are actually used by PANDA. Because all information about tuples is encoded in individual blocks, data cube operations can be performed while the data cubes are still compressed. We propose two algorithms for random access and sub-cube generation based on compressed data, which show that the compressed data cube can be manipulated without decoding whole views, thereby improving OLAP computing performance. We also discuss why the Hilbert Space Filling Curve technique is well suited to the XTDC algorithms. Therefore, the XTDC technique has great potential for use in practical cube systems that use space filling curves for multidimensional indexing.

The experimental results show that the XTDC technique is well suited for parallel OLAP computing systems. By integrating the XTDC algorithms into the PANDA system, the storage space requirements for OLAP computation are greatly reduced with very little performance penalty. The typical compression ratio is 29.4 to 1 for a full cube generation in which the fact table has 10 dimensions and 10^6 tuples. The dimensional data is reduced from 9778 MB to 332 MB (96.6%). The experiments also demonstrate that the XTDC algorithms can achieve higher compression ratios for larger data cubes that have more dimensions and more tuples.

Because the XTDC technique preserves the structural information of the data cube in compressed form, it would be possible to extend the data cube operations on compressed data using the current design. First, it is very convenient to index compressed data blocks for fast random tuple location, as the first tuple of a block is stored in the block header in uncompressed format. Because the number of blocks is reduced for the compressed view, the size of the index file is significantly decreased as well. Second, implementing compressed data cube computing inside PANDA will not only save main memory space during data cube computation but also avoid most of the data compression and decompression processes. Third, as we noted, the XTDC algorithms are well suited to the Hilbert Space Filling Curve, and the PANDA system uses Hilbert ordering to compute the sorted views for multidimensional indexing. As a result, there is significant potential to improve data cube compression performance because we effectively get the Hilbert sorted views for free.
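The first point above, locating a tuple through the uncompressed first tuple in each block header, can be illustrated with a short sketch. It is not one of the two algorithms proposed in the paper: the block layout (a first key plus a list of differences, without bit packing), the function names, and the use of a sorted in-memory list in place of a B-tree or packed R-tree are all assumptions made to keep the example small. Keys are the integers that tuples are mapped to before differencing.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Optional, Sequence, Tuple

Block = Tuple[int, List[int]]  # (first key, differences to later keys)


def build_blocks(sorted_keys: Sequence[int], block_size: int) -> List[Block]:
    """Split sorted, distinct keys into blocks of difference-coded values."""
    blocks = []
    for i in range(0, len(sorted_keys), block_size):
        chunk = sorted_keys[i:i + block_size]
        diffs = [b - a for a, b in zip(chunk, chunk[1:])]
        blocks.append((chunk[0], diffs))
    return blocks


def find(blocks: Sequence[Block], firsts: Sequence[int],
         key: int) -> Optional[Tuple[int, int]]:
    """Return (block number, offset in block) of key, or None if absent.

    Only the one block whose header key is the largest not exceeding
    `key` is decoded; every other block stays compressed.
    """
    pos = bisect_right(firsts, key) - 1
    if pos < 0:
        return None  # key is smaller than every stored key
    first, diffs = blocks[pos]
    # Running sum of the differences rebuilds the keys of this block.
    for offset, value in enumerate(accumulate(diffs, initial=first)):
        if value == key:
            return pos, offset
        if value > key:
            break
    return None


if __name__ == "__main__":
    keys = [1, 3, 6, 17, 30, 31, 40, 55]
    blocks = build_blocks(keys, block_size=3)
    firsts = [first for first, _ in blocks]  # the "index" over block headers
    print(find(blocks, firsts, 31))
    print(find(blocks, firsts, 18))
```

The index holds one entry per block rather than one per tuple, so fewer blocks after compression also means a smaller index, as the text notes.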
Both data cube compression and multidimensional indexing can share the full benefit of Hilbert ordering.

6 REFERENCES

[1] Adamson, C. & Venerable, M. (1998) The Data Warehouse Design Solutions. John Wiley & Sons, Inc.
[2] Bernstein, P. A., Hadzilacos, V., & Goodman, N. (1987) Concurrency control and recovery in database systems.
[3] Brackett, M. (1996) The Data Warehouse Challenge. John Wiley & Sons, New York, NY, USA.
[4] Brisaboa, N., Iglesias, E., Navarro, G., & Parama, J. (2003) An efficient compression code for text databases. In ECIR, volume 2633 of Lecture Notes in Computer Science. Springer-Verlag.
[5] Buchsbaum, A., Caldwell, D., Church, K., Fowler, G., & Muthukrishnan, S. (2000) Engineering the compression of massive tables: an experimental approach. In Symposium on Discrete Algorithms.
[6] Burrows, M. & Wheeler, D. J. (1994) A block-sorting lossless data compression algorithm. Digital System Research Center Research Report, May 1994.
[7] Chen, Y., Dehne, F., Eavis, T., & Rau-Chaplin, A. (2004) Parallel ROLAP data cube construction on shared-nothing multiprocessors. In Distributed and Parallel Databases, volume 15, number 3, May 2004.
[8] Chen, Y., Dehne, F., Eavis, T., & Rau-Chaplin, A. (2004) Building large ROLAP data cubes in parallel. In IDEAS.
[9] Chen, Z., Gehrke, J., & Korn, F. (2001) Query optimization in compressed database systems. In SIGMOD Conference.
[10] Chen, Z. & Seshadri, P. (2000) An algebraic compression framework for query results. In ICDE.

[11] Debevoise, N. T. The Data Warehouse Method. Prentice Hall, Inc., Upper Saddle River, New Jersey, USA.
[12] Dehne, F., Eavis, T., Hambrusch, S., & Rau-Chaplin, A. (2002) Parallelizing the data cube. International Conference on Database Theory.
[13] Dehne, F., Eavis, T., & Rau-Chaplin, A. (2001) Coarse grained parallel online analytical processing (OLAP) for data mining. Lecture Notes in Computer Science.
[14] Dehne, F., Eavis, T., & Rau-Chaplin, A. (2001) Computing partial data cubes for parallel data warehousing applications. Lecture Notes in Computer Science.
[15] Dehne, F., Eavis, T., & Rau-Chaplin, A. (2002) Parallel multi-dimensional ROLAP indexing. In Proceedings of the 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid2003), Tokyo, Japan, October.
[16] Eavis, T. (2004) Parallel OLAP Computing. PhD thesis, Dalhousie University.
[17] Gagie, T. (2003) Dynamic length-restricted coding. Master's thesis, University of Toronto.
[18] Giovinazzo, W. (2000) Object-Oriented Data Warehouse Design. Prentice Hall, Inc., Upper Saddle River, New Jersey, USA.
[19] Goil, S. & Choudhary, A. (1999) A parallel scalable infrastructure for OLAP and data mining. In International Database Engineering and Application Symposium.
[20] Goldstein, J., Ramakrishnan, R., & Shaft, U. (1998) Compressing relations and indexes. In ICDE.
[21] Graefe, G. & Shapiro, L. D. (1991) Data compression and database performance. In Proc. ACM/IEEE-CS Symp. on Applied Computing, Kansas City, MO.
[22] Gray, J. & Reuter, A. (1992) Transaction processing: Concepts and techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
[23] Harinarayan, V., Rajaraman, A., & Ullman, J. (1996) Implementing data cubes efficiently. In Proceedings of the ACM SIGMOD Conference.
[24] Held, G. & Marshall, T. (1991) Data Compression. John Wiley & Sons, New York, NY, USA, third edition.
[25] Hoffman, R. (1997) Data Compression in Digital Systems. Chapman & Hall.
[26] HPCVL. Retrieved March 8, 2007 from the World Wide Web.
[27] Huang, X., Lu, H., & Li, Z. (1997) Computing data cubes using massively parallel processors. In Proceedings of the 7th Parallel Computing Workshop (PCW 97), Canberra, Australia.
[28] Inmon, W. H. (1992) Building the Data Warehouse. John Wiley & Sons.
[29] Jurgens, M. (2002) Index Structures for Data Warehouses. Lecture Notes in Computer Science, Springer-Verlag.
[30] Kimball, R., Reeves, L., Ross, M., & Thornthwaite, W. (1998) The Data Warehouse Lifecycle Toolkit. John Wiley & Sons, Inc.
[31] Kimball, R. & Ross, M. (2002) The Data Warehouse Toolkit. John Wiley & Sons, Inc., second edition.

[32] Li, J., Rotem, D., & Srivastava, J. (1999) Aggregation algorithms for very large compressed data warehouses. In Proceedings of the 25th VLDB Conference.
[33] Liang, B. (2005) Compressing Data Cube in Parallel OLAP Systems. Master's thesis, Carleton University.
[34] Sayed, A., Hoque, L., McGregor, D., & Wilson, J. (2002) Databases compression using an offline dictionary method. In ADVIS, volume 2457 of Lecture Notes in Computer Science. Springer-Verlag.
[35] Moon, B., Jagadish, H., Faloutsos, C., & Saltz, J. (2001) Analysis of the clustering properties of the Hilbert space-filling curve. Knowledge and Data Engineering, 13(1).
[36] Ng, R., Wagner, A., & Yin, Y. (2001) Iceberg-cube computation with PC cluster. In Proceedings of the 2001 ACM SIGMOD Conference on Management of Data.
[37] Ng, W. K. & Ravishankar, C. V. (1997) Block-oriented compression techniques for large statistical databases. Knowledge and Data Engineering, 9(2).
[38] Panda. Project website.
[39] Poess, M. & Potapov, D. (2003) Data compression in Oracle. In Frederick H. Lochovsky, editor, Proceedings of the 29th VLDB Conference, Berlin, Germany.
[40] Ray, G., Haritsa, J., & Seshadri, S. (1995) Database compression: A performance enhancement tool. In International Conference on Management of Data.
[41] Roth, M. & Van Horn, S. (1993) Database compression. SIGMOD Record, 22(3), September 1993.
[42] Sarawagi, S., Agarwal, R., & Gupta, A. (1996) On computing the data cube. IBM Research Report.
[43] Seward, J. bzip2 and libbzip2. Retrieved from the World Wide Web, March 8, 2007.
[44] Storer, J. (1988) Data Compression Methods and Theory. Computer Science Press Inc., Rockville, Md.
[45] Wayner, P. (2000) Compression Algorithms for Real Programmers. Morgan Kaufmann, San Diego.
[46] GNU Website. Retrieved from the World Wide Web, March 8, 2007.
[47] Westmann, T., Kossmann, D., Helmer, S., & Moerkotte, G. (2000) The implementation and performance of compressed databases. SIGMOD Record, 29(3).
[48] Yu, C. (2002) High-Dimensional Indexing. Lecture Notes in Computer Science, Springer-Verlag.
[49] Ziv, J. & Lempel, A. (1977) A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23(3).
