Efficient Bulk Loading of Large High-Dimensional Indexes


Int. Conf. on Data Warehousing and Knowledge Discovery DaWaK 99

Christian Böhm and Hans-Peter Kriegel
University of Munich, Oettingenstr. 67, D Munich, Germany
{boehm,kriegel}@informatik.uni-muenchen.de

Abstract. Efficient index construction in multidimensional data spaces is important for many knowledge discovery algorithms, because construction times typically must be amortized by performance gains in query processing. In this paper, we propose a generic bulk loading method which allows the application of user-defined split strategies in the index construction. This approach allows the adaptation of the index properties to the requirements of a specific knowledge discovery algorithm. Because large data sets do not fit in main memory, our algorithm is based on external sorting. Decisions of the split strategy can be made according to a sample of the data set which is selected automatically. The sort algorithm is a variant of the well-known Quicksort algorithm, enhanced to work on secondary storage. The index construction has a runtime complexity of O(n log n). We show both analytically and experimentally that the algorithm outperforms traditional index construction methods by large factors.

1. Introduction

Efficient index construction in multidimensional data spaces is important for many knowledge discovery tasks. Many algorithms for knowledge discovery [JD 88, KR 90, NH 94, EKSX 96, BBBK 99], especially clustering algorithms, rely on efficient processing of similarity queries. In such a setting, multidimensional indexes are often created in a preprocessing step to knowledge discovery. If the index is not needed for general purpose query processing, it is not permanently maintained, but discarded after the KDD algorithm is completed. Therefore, the time spent in the index construction must be amortized by runtime improvements during knowledge discovery. Usually, indexes are constructed using repeated insert operations. This dynamic index construction, however, causes a serious performance degeneration.
We show later in this paper that in a typical setting, every insert operation leads to at least one access to a data page of the index. Therefore, there is an increasing interest in fast bulk-loading operations for multidimensional index structures which cause substantially fewer page accesses for the index construction.

A second problem is that indexes must be carefully optimized in order to achieve a satisfactory performance (cf. [Böh 98, BK 99, BBJ+ 99]). The optimization objectives [BBKK 97] depend on the properties of the data set (dimension, distribution, number of objects, etc.) and on the types of queries which are performed by the KDD algorithm (range queries [EKSX 96], nearest neighbor queries [KR 90, NH 94], similarity joins [BBBK 99], etc.). On the other hand, we may draw some advantage from the fact that we do not only know a single data item at each point of time (as in the dynamic index construction) but a large amount of data items. It is common knowledge that a higher fanout and storage utilization of the index pages can be achieved by applying bulk-load operations. A higher fanout yields a better search performance. Knowing all data a priori allows us to choose an alternative data space partitioning. As we have shown in [BBK 98a], a strategy of splitting the data space into two equally-sized portions causes, under certain circumstances, a poor search performance in contrast to an unbalanced split. Therefore, it is an important property of a bulk-loading algorithm that it allows the splitting strategy to be exchanged according to the requirements specific to the application.

The currently proposed bulk-loading methods either suffer from poor performance in the index construction or in the query evaluation, or are not suitable for indexes which do not fit into main memory. In contrast to previous bulk-loading methods, we present in this paper an algorithm for fast index construction on secondary storage which provides efficient query processing and is generic in the sense that the split strategy can be easily exchanged. It is based on an extension of the Quicksort algorithm which facilitates sorting on secondary storage (cf. sections 3.3 and 3.4). The split strategy (section 3.2) is a user-defined function. For the split decisions, a sample of the data set is exploited which is automatically generated by the bulk-loading algorithm.

2. Related Work

Several methods for bulk-loading multidimensional index structures have been proposed. Space-filling curves provide a means to order the points such that spatial neighborhoods are maintained. In the Hilbert R-tree construction method [KF 94], the points are sorted according to their Hilbert value. The obtained sequence of points is decomposed into contiguous subsequences which are stored in the data pages. The page region, however, is not described by the interval of Hilbert values but by the minimum bounding rectangle of the points. The directory is built bottom up. The disadvantage of Hilbert R-trees is the high overlap among page regions.

VAM-Split trees [JW 96], in contrast, use a concept of hierarchical space partitioning for bulk-loading R-trees or KDB-trees. Sort algorithms are used for this purpose. This approach does not exploit a priori knowledge of the data set and is not adaptable.

Buffer trees [BSW 97] are a generalized technique to improve the construction performance for dynamic insert algorithms. The general idea is to collect insert operations to certain branches of the tree in buffers.
These operations are propagated to the next deeper level whenever such a buffer overflows. This technique preserves the properties of the underlying index structure.

3. Our New Technique

During the bulk-load operation, the complete data set is held on secondary storage. Although only a small cache in the main memory is required, cost intensive disk operations such as random seeks are minimized. In our algorithms, we strictly separate the split strategy from the core of the construction algorithm. Therefore, we can easily replace the split strategy and thus create an arbitrary overlap-free partition for the given storage utilization. Various criteria for the choice of direction and position of split hyperplanes can be applied. The index construction is a recursive algorithm consisting of the following subtasks:

- determining the tree topology (height, fanout of the directory nodes, etc.)
- choice of the split strategy
- external bisection of the data set according to tree topology and split strategy
- construction of the index directory.

3.1 Determination of the Tree Topology

The first step of our algorithm is to determine the topology of the tree resulting from our bulk-load operation. The height of the tree can be determined as follows [Böh 98]:

$$h = \left\lceil \log_{C_{\mathrm{eff,dir}}} \left( \frac{n}{C_{\mathrm{eff,data}}} \right) \right\rceil + 1$$
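The height formula can be illustrated with a minimal Python sketch. The function name `tree_height`, the parameter names, and the capacity values in the usage line are assumptions made for illustration; the paper's own implementation is in C++ and is not shown. The sketch evaluates the ceiling of the logarithm by repeated multiplication, which avoids floating-point rounding at exact powers of the directory capacity, and it returns 1 when all objects fit into a single data page (also an assumption).

```python
def tree_height(n: int, c_eff_data: float, c_eff_dir: float) -> int:
    """Return h = ceil(log_{c_eff_dir}(n / c_eff_data)) + 1.

    Assumed meaning of the parameters (not spelled out in the text):
    c_eff_data and c_eff_dir are the effective capacities of data and
    directory pages.
    """
    if n <= 0 or c_eff_data <= 0 or c_eff_dir <= 1:
        raise ValueError("need n > 0, c_eff_data > 0 and c_eff_dir > 1")
    data_pages = n / c_eff_data   # number of data pages to be addressed
    levels = 0                    # smallest k with c_eff_dir**k >= data_pages
    reachable = 1.0               # data pages reachable through `levels` levels
    while reachable < data_pages:
        reachable *= c_eff_dir
        levels += 1
    return levels + 1             # the "+ 1" term of the formula


if __name__ == "__main__":
    # Hypothetical capacities, chosen only for illustration.
    print(tree_height(1_000_000, c_eff_data=30, c_eff_dir=20))
```

The loop form was chosen over `math.log` because a quotient of two floating-point logarithms can land slightly above an integer, which would make the ceiling one too large.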

[Figure 1: The Split Tree.]

The fanout is given by the following formula:

$$\mathrm{fanout}(h, n) = \min\left( \left\lceil \frac{n}{C_{\mathrm{eff,data}} \cdot C_{\mathrm{eff,dir}}^{\,h-2}} \right\rceil,\; C_{\mathrm{max,dir}} \right)$$

3.2 The Split Strategy

In order to determine the split dimension, we have to consider two cases: If the data subset fits into main memory, the split dimension and the subset size can be obtained by computing selectivities or variances from the complete data subset. Otherwise, decisions are based on a sample of the subset which fits into main memory and can be loaded without causing too many random seek operations. We use a simple heuristic to sample the data subset which loads subsequent blocks from three different places in the data set.

3.3 Recursive Top-Down Partitioning

Now, we are able to define a recursive algorithm for partitioning the data set. The algorithm consists of two procedures which are nested recursively (both procedures call each other). The first procedure, partition(), which is called once for each directory page, has the following duties:

- call the topology module to determine the fanout of the current directory page
- call the split-strategy module to determine a split tree for the current directory page
- call the second procedure, partition_acc_to_splittree()

The second procedure partitions the data set according to the split dimensions and the proportions given in the split tree. However, the proportions are not regarded as fixed values. Instead, we will determine lower and upper bounds for the number of objects on each side of the split hyperplane. This will help us to improve the performance of the next step, the external bipartitioning. Let us assume that the ratio of the number of leaf nodes on each side of the current node in the split tree is l : r, and that we are currently dealing with N data objects. An exact split hyperplane would exploit the proportions:

$$N_{\mathrm{left}} = N \cdot \frac{l}{l + r} \quad \text{and} \quad N_{\mathrm{right}} = N \cdot \frac{r}{l + r} = N - N_{\mathrm{left}}$$

Instead of using the exact values, we compute an upper bound for N_left such that N_left is not too large to be placed in l subtrees with height h - 1, and a lower bound for N_left such that N_right is not too large for r subtrees:

$$N_{\mathrm{max,left}} = l \cdot C_{\mathrm{max,tree}}(h - 1), \qquad N_{\mathrm{min,left}} = N - r \cdot C_{\mathrm{max,tree}}(h - 1)$$

An overview of the algorithm is depicted in C-like pseudocode in figure 2. For the presentation of the algorithm, we assume that the data vectors are stored in an array on secondary storage and that the current data subset is referred to by the parameters start and n, where n is the number of data objects and start represents the address of the first object.

    index_construction (int n)
    {
        int h = (int)(log (n / Ceffdata) / log (Ceffdir) + 1) ;
        partition (0, n, h) ;
    }

    partition (int start, int n, int height)
    {
        if (height == 0) {
            ... // write data page, propagate info to parent
            return ;
        }
        int f = fanout (height, n) ;
        SplitTree st = split_strategy (start, n, f) ;
        partition_acc_to_splittree (start, n, height, st) ;
        ... // write directory page, propagate info to parent
    }

    partition_acc_to_splittree (int start, int n, int height, SplitTree st)
    {
        if (is_leaf (st)) {
            // one child page of the current directory page: next deeper level
            partition (start, n, height - 1) ;
            return ;
        }
        int mtc = max_tree_capacity (height - 1) ;
        // bounds for the number of objects left of the split hyperplane
        int n_maxleft = st->l_leaves * mtc ;
        int n_minleft = n - st->r_leaves * mtc ;
        int n_real = external_bipartition (start, n, st->splitdim, n_minleft, n_maxleft) ;
        // arguments follow the signature (start, n, height, st)
        partition_acc_to_splittree (start, n_real, height, st->leftchild) ;
        partition_acc_to_splittree (start + n_real, n - n_real, height, st->rightchild) ;
    }

Figure 2: Recursive Top-Down Data Set Partitioning.

The procedure index_construction() determines the height of the tree and calls partition(), which is responsible for the generation of a complete data or directory page. The function partition() first determines the fanout of the current page and calls split_strategy() to construct an adequate split tree. Then partition_acc_to_splittree() is called to partition the data set according to the split tree. After partitioning the data, partition_acc_to_splittree() calls partition() in order to create the next deeper index level. The height of the current subtree is decremented in this indirect recursive call. Therefore, the data set is partitioned in a top-down manner, i.e. the data set is first partitioned with respect to the highest directory level below the root node.

3.4 External Bipartitioning of the Data Set

Our bipartitioning algorithm is comparable to the well-known Quicksort algorithm [Hoa 62, Sed 78].
Bipartitioning means to split the data set or a subset into two portions according to the value of one specific dimension, the split dimension. After the bipartitioning step, the lower part of the data set contains values in the split dimension which are lower than a threshold value, the split value. The values in the higher part will be higher than the split value. The split value is initially unknown and is determined during the run of the bipartitioning algorithm.

Bipartitioning is closely related to sorting the data set according to the split dimension. In fact, if the data is sorted, bipartitioning of any proportion can easily be achieved by cutting the sorted data set into two subsets. However, sorting has a complexity of O(n log n), and a complete sort order is not required for our purpose. Instead, we will present a bipartitioning algorithm with an average-case complexity of O(n). The basic idea of our algorithm is to adapt Quicksort as follows: Quicksort makes a bisection of the data according to a heuristically chosen pivot value and then recursively calls Quicksort for both subsets. Our first modification is to make only one recursive call for the subset which contains the split interval. We are able to do that because the objects in the other subset are on the correct side of the split interval anyway and need no further sorting. The second modification is to stop the recursion if the position of the pivot value is inside the split interval. The third modification is to choose the pivot values according to the proportion rather than to reach the middle.

Our bipartitioning algorithm works on secondary storage. It is well-known that the Mergesort algorithm is better suited for external sorting than Quicksort. However, Mergesort does not facilitate our modifications leading to an O(n) complexity and was not further investigated for this reason. In our implementation, we use a sophisticated scheme reducing disk I/O and especially reducing random seek operations much more than a normal caching algorithm would be able to.

The algorithm can run in two modes, internal or external, depending on whether the processed data set fits into main memory or not. The internal mode is quite similar to Quicksort: The middle of three split attribute values in the database is taken as pivot value.
The first object on the left side having a split attribute value larger than the pivot value is exchanged with the last element on the right side smaller than the pivot value until left and right object pointers meet at the bisection point. The algorithm stops if the bisection point is inside the goal interval. Otherwise, the algorithm continues recursively with the data subset containing the goal interval.

The external mode is more sophisticated: First, the pivot value is determined from the sample which is taken in the same way as described in section 3.2 and can often be reused. A complete internal bipartition runs on the sample data set to determine a suitable pivot value. In the following external bisection (cf. figure 3), transfers from and to the cache are always processed with a blocksize of half the cache size. Figure 3a shows the initialization of the cache from the first and last block in the disk file. Then, the data in the cache is processed by internal bisection with respect to the pivot value. If the bisection point is in the lower part of the cache (figure 3c), the right side contains more objects than fit into one block. One block, starting from the bisection point, is written back to the file and the next block is read and internally bisected again. Usually, objects remain in the lower and higher ends of the cache. These objects are used later to fill up transfer blocks completely. All remaining data is written back in the very last step into the middle of the file, where additionally a fraction of a block has to be processed. Finally, we test if the bisection point of the external bisection is in the split interval. If the point is outside, another recursion is required.
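The internal mode described above can be sketched in Python as follows. This is a minimal in-memory illustration under stated assumptions, not the authors' C++ implementation: the function name `internal_bipartition`, the representation of objects as tuples, and the choice of first, middle and last element for the median of three are all hypothetical. The sketch keeps the two modifications that matter for the O(n) average case: only the part containing the goal interval is processed further, and the loop stops as soon as a valid bisection point lies inside the goal interval.

```python
def internal_bipartition(points, dim, min_left, max_left):
    """Reorder `points` in place and return n_left, min_left <= n_left <= max_left,
    such that every value of points[:n_left] in dimension `dim` is <= every
    value of points[n_left:] in that dimension."""
    lo, hi = 0, len(points)                 # still unordered window points[lo:hi]
    if min_left > max_left or max_left < 0 or min_left > hi:
        raise ValueError("goal interval must intersect [0, len(points)]")
    while hi - lo > 1:
        mid = (lo + hi) // 2
        # Median of three split attribute values as pivot value.
        pivot = sorted((points[lo][dim], points[mid][dim], points[hi - 1][dim]))[1]
        i, j = lo, hi - 1
        while i <= j:                       # Hoare-style exchange of misplaced objects
            while points[i][dim] < pivot:
                i += 1
            while points[j][dim] > pivot:
                j -= 1
            if i <= j:
                points[i], points[j] = points[j], points[i]
                i += 1
                j -= 1
        # Every position in [j + 1, i] is now a valid bisection point.
        first, last = max(j + 1, min_left), min(i, max_left)
        if first <= last:
            return first                    # bisection point inside the goal interval
        if max_left < j + 1:
            hi = j + 1                      # goal interval lies in the lower part
        else:
            lo = i                          # goal interval lies in the higher part
    return max(lo, min_left)
```

A call such as `internal_bipartition(pts, 1, 300, 350)` would split a list of two-dimensional tuples along dimension 1 so that between 300 and 350 objects end up in the lower part. The external mode adds block-wise transfers between file and cache around this step and is not covered by the sketch.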

[Figure 3: External Bisection. Panels: (a) Initializing the cache from file; (b) Internal bisection of the cache; (c) Writing the larger half partially back to disk; (d) Loading one further block to cache; (e) Writing the larger half partially back to disk.]

3.5 Constructing the Index Directory

As data partitioning is done by a recursive algorithm, the structure of the index is represented by the recursion tree. Therefore, we are able to create a directory node after the completion of the recursive calls for the child nodes. These recursive calls return the bounding boxes and the corresponding secondary storage addresses to the caller, where the information is collected. There, the directory node is written, the bounding boxes are combined to a single bounding box comprising all boxes of the child nodes, and the result is again propagated to the next higher level. A depth-first post-order sequentialization of the index is written to the disk.

3.6 Analytical Evaluation of the Construction Algorithm

In this section, we will show that our bottom-up construction algorithm has an average case time complexity of O(n log n). Moreover, we will consider disk accesses in a more exact way, and thus provide an analytically derived improvement factor over the dynamic index construction. For the file I/O, we determine two parameters: the number of random seek operations and the amount of data read or written from or to the disk. Provided that no further caching is performed (which is true for our application, but cannot be guaranteed for the operating system) and that seeks are uniformly distributed variables, the I/O processing time can be determined as

$$t_{\mathrm{i/o}} = t_{\mathrm{seek}} \cdot \mathit{seek\_ops} + t_{\mathrm{transfer}} \cdot \mathit{amount}$$

In the following, the cache capacity denotes the number of objects fitting into the cache:

$$\text{cache capacity} = \frac{\mathit{cachesize}}{\mathrm{sizeof}(\mathit{object})}$$
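The two quantities defined above translate directly into code. The following Python sketch is only an illustration: the function names, the unit choices (seconds, bytes) and the numbers in the usage lines are assumptions, not measurements from the paper.

```python
def cache_capacity(cachesize_bytes: int, object_bytes: int) -> int:
    """Number of objects fitting into the cache: cachesize / sizeof(object)."""
    return cachesize_bytes // object_bytes


def io_time(seek_ops: int, amount: float, t_seek: float, t_transfer: float) -> float:
    """I/O processing time: t_seek * seek_ops + t_transfer * amount.

    t_seek is the time per random seek, t_transfer the time per transferred
    unit of data (here assumed to be seconds per byte).
    """
    return t_seek * seek_ops + t_transfer * amount


if __name__ == "__main__":
    # Hypothetical setting: a 32 KByte cache and 64-byte objects
    # (for example 16 dimensions stored as 4-byte floats).
    print(cache_capacity(32 * 1024, 64))
    # Hypothetical device: 10 ms per seek, 0.1 microseconds per byte.
    print(io_time(seek_ops=1000, amount=4_000_000, t_seek=0.01, t_transfer=1e-7))
```

Keeping seeks and transferred volume as separate inputs mirrors the analysis in this section, where the two are bounded by separate lemmata.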

Lemma 1. Complexity of bisection
The bisection algorithm has the complexity O(n).

Proof (Lemma 1)
We assume that the pivot element is randomly chosen from the data set. After the first run of the algorithm, the pivot element is located with uniform probability at one of the n positions in the file. Therefore, the next run of the algorithm will have the length k with a probability 1/n for each 1 ≤ k ≤ n. Thus, the cost function C(n) encompasses the cost for the algorithm, n + 1 comparison operations, plus a probability weighted sum of the cost for processing the algorithm with length k - 1, C(k - 1). We obtain the following recursive equation:

$$C(n) = n + 1 + \frac{1}{n} \sum_{k=1}^{n} C(k - 1)$$

which can be solved by multiplying with n and subtracting the same equation for n - 1. This can be simplified to C(n) = 2 + C(n - 1), and C(n) = 2n = O(n).

Lemma 2. Cost Bounds of Recursion
(1) The amount of data read or written during one recursion of our technique does not exceed four times the file size.
(2) The number of seek operations required is bounded by seek_ops(n) ≤ [...].

Proof (Lemma 2)
(1) follows directly from Lemma 1 because every compared element has to be transferred at most once from disk to main memory and at most once back to disk.
(2) In each run of the external bisection algorithm, file I/O is processed with a blocksize of cachesize/2. The number of blocks read in each run is therefore blocks_read_bisection(n) = [...], because one extra read is required in the final step. The number of write operations is the same, and thus

$$\mathit{seek\_ops}(n) = 2 \cdot \sum_{i=0}^{r_{\mathrm{interval}}} \mathit{blocks\_read}_{\mathrm{run}}(i) \le [...]$$

Lemma 3. Average Case Complexity of Our Technique
Our technique has an average case complexity of O(n log n) unless the split strategy has a complexity worse than O(n).

Proof (Lemma 3)
For each level of the tree, the complete data set has to be bisectioned as often as the height of the split tree indicates. As the height of the split tree is determined by the directory page capacity, there are at most h(n) · C_max,dir = O(log n) bisection runs necessary. Therefore, our technique has the complexity O(n log n).
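The recurrence used in the proof of Lemma 1 can be checked numerically. The short Python sketch below evaluates C(n) = n + 1 + (1/n) · Σ C(k - 1) with exact rational arithmetic and compares it with the closed form 2n. The function name `bisection_cost` and the base case C(0) = 0 are assumptions of this sketch; the proof does not state the base case explicitly.

```python
from fractions import Fraction


def bisection_cost(n_max: int) -> list:
    """Return [C(0), C(1), ..., C(n_max)] for the recurrence
    C(n) = n + 1 + (1/n) * sum_{k=1..n} C(k - 1), assuming C(0) = 0."""
    costs = [Fraction(0)]
    prefix_sum = Fraction(0)            # running sum C(0) + ... + C(n - 1)
    for n in range(1, n_max + 1):
        prefix_sum += costs[n - 1]
        costs.append(n + 1 + prefix_sum / n)
    return costs


if __name__ == "__main__":
    costs = bisection_cost(1000)
    # Closed form from the proof: C(n) = 2n, i.e. linear cost.
    print(all(c == 2 * n for n, c in enumerate(costs)))
```

Exact fractions are used instead of floats so that the comparison with 2n is an equality test rather than a tolerance check.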

[Figure 4: Improvement Factor for the Index Construction According to Lemmata 1-5. Improvement factor plotted against cache capacity for n = 1,000,000, n = 10,000,000 and n = 100,000,000.]

Lemma 4. Cost of Symmetric Partitioning
For symmetric splitting, the procedure partition() handles an amount of file I/O data of

$$\left( \log_2([...]) + \log_{C_{\mathrm{max,dir}}}([...]) \right) \cdot 4 \cdot \mathit{filesize}$$

and requires [...] random seek operations.

Proof (Lemma 4)
Left out due to space limitations, cf. [Böh 98].

Lemma 5. Cost of Dynamic Index Construction
Dynamic X-tree construction requires 2n seek operations. The transferred amount of data is 2n · pagesize.

Proof (Lemma 5)
For the X-tree, it is generally assumed that the directory is completely held in main memory. Data pages are not cached at all. For each insert, the corresponding data page has to be loaded and written back after completing the operation. Moreover, no better caching strategy for data pages can be applied, since without preprocessing of the input data set, no locality can be exploited to establish a working set of pages.

From the results of lemmata 4 and 5 we can derive an estimate for the improvement factor of the bottom-up construction over dynamic index construction. The improvement factor for the number of seek operations is approximately:

$$\mathit{Improvement} \approx \frac{[...]}{\log_2([...]) + \log_{C_{\mathrm{max,dir}}}([...])}$$

It is almost (up to the logarithmic factor in the denominator) linear in the cache capacity. Figure 4 depicts the improvement factor (number of random seek operations) for varying cache sizes and varying database sizes.
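To give a feeling for the magnitude of Lemma 5, the following Python sketch computes the seek count and transfer volume of dynamic construction and combines them with the I/O time formula from section 3.6. The function names, the page size and the device timings are hypothetical values chosen for illustration only.

```python
def dynamic_construction_cost(n: int, pagesize: int):
    """Lemma 5: every insert loads one data page and writes it back,
    giving 2n seek operations and 2n * pagesize transferred data."""
    seek_ops = 2 * n
    amount = 2 * n * pagesize
    return seek_ops, amount


def dynamic_construction_io_time(n, pagesize, t_seek, t_transfer):
    """I/O time t_seek * seek_ops + t_transfer * amount for dynamic construction."""
    seek_ops, amount = dynamic_construction_cost(n, pagesize)
    return t_seek * seek_ops + t_transfer * amount


if __name__ == "__main__":
    # Hypothetical setting: 1,000,000 objects, 4 KByte pages,
    # 10 ms per seek and 0.1 microseconds per transferred byte.
    print(dynamic_construction_cost(1_000_000, 4096))
    print(dynamic_construction_io_time(1_000_000, 4096, t_seek=0.01, t_transfer=1e-7))
```

With timings of this kind the seek term dominates, which is consistent with the paper's focus on reducing random seek operations.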

4. Experimental Evaluation

To show the practical relevance of our bottom-up construction algorithm, we have performed an extensive experimental evaluation by comparing the following index construction techniques: dynamic index construction (repeated insert operations), Hilbert R-tree construction and our new method. All experiments have been computed on HP9000/780 workstations with several GBytes of secondary storage. Although our technique is applicable to most R-tree-like index structures, we decided to use the X-tree as an underlying index structure because according to [BKK 96], the X-tree outperforms other high-dimensional index structures. All programs have been implemented in C++.

In our experiments, we compare the construction times for various indexes. The external sorting procedure of our construction method was allowed to use only a relatively small cache (32 kbytes). Note that, although our implementation does not provide any further disk I/O caching, this cannot be guaranteed for the operating system. In contrast, the Hilbert construction method was implemented with internal sorting for simplicity. The construction time of the Hilbert method is therefore underestimated by far and would worsen in combination with external sorting when the cache size is strictly limited. All Hilbert-constructed indexes have a storage utilization near 100%.

Figure 5 shows the construction time of dynamic index construction and of the bottom-up methods. In the left diagram, we fix the dimension to 16 and vary the database size from 100,000 to 2,000,000 objects of synthetic data. The resulting speed-up of the bulk-loading techniques over the dynamic construction was so enormous that a logarithmic scale must be used in figure 5. In contrast, the bottom-up methods differ only slightly in their performance. The Hilbert technique was the best method, having a construction time between 17 and 429 sec. The construction time of symmetric splitting ranges from 26 to 668 sec., whereas unbalanced splitting required between 21 and 744 sec. in the moderate case and between 23 and 858 sec. for the 9:1 split.
In contrast, the dynamic construction time ranged from 965 to 393,310 sec. (4 days, 13 hours). The improvement factor of our methods constantly increases with growing index size, starting from 37 to 45 for 100,000 objects and reaching 458 to 588 for 2,000,000 objects. The Hilbert construction is up to 915 times faster than the dynamic index construction. This enormous factor is not only due to internal sorting but also due to reduced overhead in changing the ordering attribute. In contrast to Hilbert construction, our technique changes the sorting criterion during the sort process according to the split tree. The more unbalanced the split becomes, the more often the sorting criterion is changed, because the height of the split tree increases. Therefore, the 9:1-split has the worst improvement factor.

[Figure 5: Performance of Index Construction Against Database Size and Dimension.]

The right diagram in figure 5 shows the construction time for varying index dimensions. Here, the database size was fixed to 1,000,000 objects. It can be seen that the improvement factors of the construction methods (between 240 and 320) are rather independent from the dimension of the data space.

Our further experiments, which are not presented due to space limitations [Böh 98], show that the Hilbert construction method yields a bad performance in query processing. The reason is the high overlap among the page regions. Due to improved space partitioning resulting from knowing the data set a priori, the indexes constructed by our new method outperform even the dynamically constructed indexes by factors up to [...].

5. Conclusion

In this paper, we have proposed a fast algorithm for constructing indexes for high-dimensional data spaces on secondary storage. A user-defined split strategy allows the adaptation of the index properties to the requirements of a specific knowledge discovery algorithm. We have shown both analytically and experimentally that our construction method outperforms the dynamic index construction by large factors. Our experiments further show that these indexes are also superior with respect to the search performance. Future work includes the investigation of various split strategies and their impact on different query types and access patterns.

6. References

[BBBK 99] Böhm, Braunmüller, Breunig, Kriegel: Fast Clustering Using High-Dimensional Similarity Joins, submitted for publication.
[BBJ+ 99] Berchtold, Böhm, Jagadish, Kriegel, Sander: Independent Quantization: An Index Compression Technique for High-Dimensional Data Spaces, submitted for publication.
[BBK 98a] Berchtold, Böhm, Kriegel: Improving the Query Performance of High-Dimensional Index Structures Using Bulk-Load Operations, Int. Conf. on Extending Database Tech., EDBT.
[BBKK 97] Berchtold, Böhm, Keim, Kriegel: A Cost Model For Nearest Neighbor Search in High-Dimensional Data Space, ACM PODS Symp. Principles of Database Systems.
[BK 99] Böhm, Kriegel: Dynamically Optimizing High-Dimensional Index Structures, subm.
[BKK 96] Berchtold, Keim, Kriegel: The X-Tree: An Index Structure for High-Dimensional Data, Int. Conf. on Very Large Data Bases, VLDB.
[BSW 97] van den Bercken, Seeger, Widmayer: A General Approach to Bulk Loading Multidimensional Index Structures, Int. Conf. on Very Large Databases, VLDB.
[Böh 98] Böhm: Efficiently Indexing High-Dimensional Data Spaces, PhD Thesis, University of Munich, Herbert Utz Verlag.
[EKSX 96] Ester, Kriegel, Sander, Xu: A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, Int. Conf. Knowl. Disc. and Data Mining, KDD.
[Hoa 62] Hoare: Quicksort, Computer Journal, Vol. 5, No. 1.
[JD 88] Jain, Dubes: Algorithms for Clustering Data, Prentice-Hall, Inc.
[JW 96] Jain, White: Similarity Indexing: Algorithms and Performance, SPIE Storage and Retrieval for Image and Video Databases IV, Vol. 2670.
[KF 94] Kamel, Faloutsos: Hilbert R-tree: An Improved R-tree using Fractals, Int. Conf. on Very Large Data Bases, VLDB.
[KR 90] Kaufman, Rousseeuw: Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons.
[NH 94] Ng, Han: Efficient and Effective Clustering Methods for Spatial Data Mining, Int. Conf. on Very Large Data Bases, VLDB.
[Sed 78] Sedgewick: Quicksort, Garland, New York.
[WSB 98] Weber, Schek, Blott: A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces, Int. Conf. on Very Large Databases, VLDB, 1998.
