* (4.1) A more exact setting will be specified later. The side lengthsj are determined such that

Size: px

Start display at page:

Download "* (4.1) A more exact setting will be specified later. The side lengthsj are determined such that"

Theodore Jones
5 years ago
Views:

1 D D Chapter 4 xtensions of the CUB MTOD e present several generalizations of the CUB MTOD In section 41 we analyze the query algorithm GOINGCUB The difference to the CUB MTOD occurs in the case when the cube is empty : instead of calling the bruteforce method the GOINGCUB enlarges the cube with center until it contains data points In Section 42 we extend the algorithm GOING CUB to compute the nearest neighbors to some query point from the data set Next we generalize in Section 43 the expected runtime analysis of our algorithms to other wellbehaved probability distributions In Section 44 we develop extensions of our algorithms which work efficiently in the externalmemory model of computation introduced in [2] Section 45 shows an experimental comparison of implementations of the presented variants of the CUB MTOD e emphasize the factors by which the algorithms are faster than the bruteforce method A part of the results presented in this chapter has been published in [39] 41 The GOINGCUB variant If the cube considered by the CUB MTOD is empty the alternative to determining the nearest neighbor by bruteforce is to consider a larger cube with center and determine the points contained in it The size of the current cube should be increased as long as it contains no points of This searching algorithm is called the GOINGCUB method The first cube around contains an expected number of points is the expected number of points which are contained in the cube considered in the th iteration ie The side length of this cube is determined by the procedure SID_LNGT as described in Section 212 SID_LNGT gets a sequence of parameters # $%' ' + ' ' such that 687:9 A@CBD+ 5?> 41 A more exact setting will be specified later The side lengthsj are determined such that KLM+ > 42 The sequences ' and are strictly monotone increasing The termination of the algorithm is ensured and we define OPQK 6S7T9 BJD UV The following gives a schematic description of the algorithm 53

2 GOING_CUB repeat t:t+1 if : SID_LNGT : SAC_CUB else : BUT_FOC until return SAC_CUB stands for one of the orthogonal range searching procedures SCAN JCT and SCAN_JCT ocedure SID_LNGT was introduced in Section 212 It computes the side lengths of the cubes #%$ The order of dimensions corresponds to the increasing order of the side lengths of a box '+ This order depends only on the query point and is also computed by the SID_LNGT procedure but just only once in the first iteration in time e measure the running time 57689: of the GOINGCUB method by the number of performed comparisons and arithmetic operations For the expected runtime analysis we introduce the discrete random variable that represents the number of points of contained in the cube To enable compactness of formulas we define > Let? be the A random variable representing the number of tests required by an independent call of SAC_CUB to determine the points of contained in the cube The expected runtime of the GOINGCUB method is given by: where $ 57D GCJ 8KFM $ 5B689C: $ 57DF G I $ 57D GCJ 8LKNM I $ 5PO 8LQ G is the expected runtime of the SAC_CUB procedure: $ 5PD GJ 8KFM S T@ 5 VUX'Y[Z\ $? ' 5 B 43 Y'Z where $ is the corresponding constant of the SAC_CUB procedure 5PO 8Q G is the expected runtime of the bruteforce part of the GOINGCUB method: $ 5PO 8LQ G S T 5 ] U ^Y[_% $ ' 5 B 44 Y'_ where is the corresponding $ constant of the bruteforce method The expected runtime 5`DF G required to compute the side lengths of the cubes of dimensions is given by: $ 57DF G L +S T where Y[a is the constant of the SID_LNGT procedure 5 U 'Yab 1032# 54 and the order C 45

3 # # # # Next we present the expected runtime analysis of the SAC_CUB part for each of the 3 searching strategies : SCAN JCT and In the following we first give bounds on $ 5 DN G SCAN_JCT The analysis uses a bound on the probability that the cube and $ 5PO 8LQ G is empty: T U is empty e used for some random point $ ' ] The expected runtime to determine the side lengths of the cubes can be bounded by using 46 for OPQ for and 45: $ L 57DN G Ya 1032# L Yab ] +S +S Ya 1032# Ya Let ' with be discrete random variables such that point e use 1S and T U part as follows: $ 5PO 8LQ G Y[_% e used T S Y _ Y _ Y _ Y[_% 5 T S S S S U 5 ] S S S T T T e bound the expected runtime of the bruteforce U S 5 $ 5 U$ 5 5 ' U ' S# xpected runtime of GOING_CUB when searching with SCAN e choose the sequence such that ] U%$ 5 $+$ e specify the value later By 47 and by $ 57DF G OPQ we obtain: 1032B

4 Since for and we obtain by 48 and 42 the following: $ BO 8LQ G Y _ ] 5 Y _ +S Y[_% ] 'Y[_% 411 To estimate the scanning part we recall the analysis of the SCAN procedure which we presented in Section 212 e focus on the number 5 D K J of performed tests of coordinates In Section 212 we denoted by the box obtained by relaxing the side lengths of the cube to unit intervals with respect to those dimensions different from the first ones in the list For OPQ we consider: $ 5 D K J C C 5 4 C S where 5 D K J is the number of comparisons required by an independent call of SCAN event 5 and the event ] ' are independent for all we obtain C 5 5 C ] Because of which implies: By 46 and 414 we obtain: where $ 5 D K J by If 49 we obtain: C then $ 5 D K J $ 5 D K J the following holds: $ $ 5 D K J +S C C L +S S 5 T $ $ 5 U 5 $ 5 D K J $ 5 D K J $ 5 D K J C C B $ 5 which we obtain by 27 Note that $ 5 D K J Let be the smallest such that S C 5 +S Since the is also bounded Then by 42 and

5 since Now by 42 and 49 we get implies for Thus the inequality $ 5 D K J 5 5 For such that holds and by 417 we get: 7 for we have: OPQ which 418 By and 418 we obtain an upper bound on the expected runtime of the GOING_CUB method: The function obtain $ 5B68L9: $ 5B689C: 7 is monotone increasing in $ on Thus we choose xpected runtime of GOING_CUB when searching with JCT $ Obviously 5 68L9: $ 5 8 G G 7 K where 5 8 G G K is the number of comparisons required by JCT for a cube expected to contain $ points By 35 we have 5 8 G G K 7 $ The aim is to prove that 57689: P X The function P takes its minimum at ' e set Because of the small probability of an empty first cube we choose here the sequence to grow more slowly than in the case of SCAN: By 47 we obtain: ' 4 $ 57DN G 1032# $ In order to bound 57O 8LQ G 5 5 $ Now by 48 we obtain the following bound on 5 O 8Q G : $ 5PO 8LQ G ] we use the following inequalities which we obtain by 42: where Y_ Y[_% 5 +S Y_ ] Y[_% +S Y _ ^Y _ +S OPQ 5 and

6 T By for we have : S This together with 420 implies $ 5PO 8LQ G S Y _ S Y _ For the runtime analysis of the procedure JCT we focus on the number of tests of coordinates and tests of bit array entries In the following we bound $ 5 8 G G K 5 for OPQ Let ' be discrete random variables such that $ e obtain: $ 5 8 G G K [ 5 S S points outside ' 5 ] U [ S Since 5 and ] are independent events for all we have 5 [ 5 ' To enable compactness of formulas we set for all and ] Note that 5 for OPQ Furthermore the side length of the first cube is smaller than the side length of the th cube Thus by 424: points outside ' By 31 and 32 we have S Thus S 5 L points outside ' Thus by 423 and 428 we obtain for $ 5 8 G G K 5 [ OPQ : 58 S [

7 This together with 43 and 421 implies: Summarizing $ 5PD GJ 8KFM $ X 5P68L9: Y[Z 7 Y[Z 7 'Y[Z 7 B where $ thus 5P689C: xpected runtime of GOING_CUB when searching with JCT_SCAN e recall the analysis of the JCT_SCAN procedure presented in Section 312 If the expected number of points in 7 fulfills A then JCT_SCAN consists essentially of the SCANpart since rejecting is done only with respect to one of the dimensions In this case the 7 expected running time of JCT_SCAN is which is also the running time of SCAN 7 if Thus for reasons of simplicity if then we consider the GOINGCUB method to call the searching algorithm SCAN In this case the expected runtime of the method is P Assume $ e prove that 5 689C: when searching is done by JCT_SCAN Observe that the function $ is monotone increasing in e choose the sequence such that $ The expected running times 5`DN G $ and 5PO 8LQ G can be bounded as shown in 419 and 422 respectively To estimate the expected running time required to search the cubes for points we use 43 and prove a bound on 5 8 G _D K J 5 where OPQ 5 8 _D K J is the number of comparisons required by an independent call of JCT_SCAN and is composed of the number of performed comparisons to partition the dimensions and the number of tests of coordinates and tests of bit array entries required in the rejectpart and scanningpart Let be the partition of variables done in the th iteration by JCT_SCAN It is computed to fulfill 327 and 330 and it can be determined with at most comparisons Let be the points which were not rejected in the th iteration Thus contains all points such that $ for all dimensions Only these points are scanned with respect to the dimensions of e have: 5B8 G _D K J C 5 ] 5B8 G G K 5 5 D K J where 5B8 G G K and 5 D K call of and by an independent call of JCT The reject part is given by 5#8 G G K 5 ] are the numbers of comparisons required by an independent respectively points outside

8 T U e recall the arguments used in the analysis of the expected runtime of GOING_CUB with the searching algorithm JCT By 423 and 430 we obtain: By and 433 we obtain: 5 8 G G K points outside 5 5 ] Let? be the number of comparisons required to scan some data point with respect to its dimen sions in Then the scan part of the expected runtime is given by 5 D K J 5 5? 5 which holds since events with respect to different dimensions of the points are independent By arguments similar to those in 412 and 413 we get? 5? Based on the analysis of SCAN property 327 of partition and arguments of 25 we obtain for OPQ :? 5? 435 $ ' where is the geometric mean of the side lengths of the box are the points contained in the box which is obtained by relaxing the side lengths of the cube to unit intervals except those with respect to the dimensions in In analogy to 413 the following inequalities are easily seen to hold: where is some data point Following the arguments of 329 and 334 in Section 312 we obtain for OPQ :? This combined with and 43 implies by using T T U $ 57D GCJ 8LKNM U Y[Z Y[Z 7 7 e summarize the results in the following theorem S Since we chose 60 $ we obtain 5768L9:

9 Theorem 41 The GOING_CUB method finds the nearest neighbor from the set some query point and has the expected asymptotic runtime of the CUB MTOD of data points to In practice GOING_CUB may perform better than the CUB MTOD This consideration is confirmed by the experiments in Section 45 and it is based on the fact that the GOINGCUB starts with a smaller cube Furthermore the CUB MTOD calls the bruteforce method if the considered cube does not contain any points of 61

10 42 NearestNeighbor Search 421 The algorithm The GOINGCUB method can be extended to compute the nearest neighbors of the query point from the point set The extended GOINGCUB method works as follows : the points in the current cube are determined if still less than points have been found so far then a larger cube is considered e use the notations introduced in Section 41 In the th iteration the procedure SID_LNGT is called with the parameter $ # and computes side lengths such that For the first iteration we choose GOING_CUB repeat 4 The following gives a schematic description of the algorithm if then : SID_LNGT C : SAC_CUB 5 else until if then 5 SLCT 5 else return 5 5 ST_UPDAT SLCT Set contains the nearest neighbors to the query point is the set of points found by the end of the which is together with 5 th iteration represents the set of points outside the cube the result of a slightly modified SAC_CUB The iteration terminates if If X the additional 5 nearest neighbors A which we still need are extracted from the set by the SLCTprocedure The procedure SLCT computes the distances of all points of to the query selects the smallest distances and returns as result A the corresponding points from Using a linear time selection algorithm the running time of SLCT Y'Z Y Z for some appropriate constant e provide the set of points outside since we do not want to test in the next iteration the points of which have been found so far e need some small modifications of SAC_CUB which is one of the searching procedures SCAN JCT and JCT_SCAN: 62

11 Since SCAN checks each point of its input set it can simply maintain two index lists for the points inside and for those outside of the current cube JCT gets the set 5 as a bit array where an entry is set to 1 if and only if this represents the index of a point outside After the rejecting phase the bit array contains 1entries exactly for the points in and represents the result e built the bit array by the bitwiseand between 5 and the bitwisecomplement of JCT_SCAN has the bit array 5 as part of the input At the end of the rejecting phase we have a bit array with 1entries exactly for the points to be scanned in this th iteration The scanning phase has as result and bit array is built as described above 7$ The order of dimensions corresponds to the increasing order of the side lengths of a box '+ This order depends only on the query point and is computed by the SID_LNGT procedure only once in the first iteration in time 422 The expected runtime analysis e introduce the discrete random variable? to represent the total of tests of coordinates and tests of bit array entries required by an independent call of SAC_CUB 5 Let be here the discrete $ random variable for the number of points found during the first iterations e have for % and we define % In practice the numbers could be chosen to depend on the number of nearest neighbors which are still to be found thus if 5 then 5 5 The analysis of the algorithm with random variables is involved e consider a fixed sequence that is known at the beginning of the method and which satisfies 41 The termination of the algorithm is ensured and we can define OPQ $ The expected runtime $ 5P68L9: of the GOINGCUB is given by: 5B689: $ 5PDN G I $ 57D GJ 8KFM $ 57D G $ where 57D GCJ 8KFM is the expected runtime of the SAC_CUB procedure: $ PD GJ 8KFM 5 ^Y[Z $? ' $ +S and 57D G is the expected runtime of step SLCT: $ D G 5 +S 'Y[_ Y'Z Y[_ The values and are the corresponding constants $ of the SAC_CUB procedure and of the SLCTprocedure respectively The expected runtime 5 DN G for the computation of the side lengths of the cubes is given by: $ 5PDN G L +S T 5 U Ya 1032 where Y[a is the constant of the SID_LNGT procedure introduced in Section 212 In the following we first bound the probability time of SLCT SID_LNGT and SAC_CUB and then determine the expected running

12 The probability of cube enlargement To bound the probability we use the following theorem see [22] based on the Chernoff Bound Theorem 42 [22] Let be the discrete variable which counts the total number of successes by performing independent Bernoulli trials with the success probability of a trial Then 8 for each OPQ L where is the variance of Taking 8 in Theorem 42 we get the bound: Corollary 41 Let be the discrete variable which counts the total number of successes by performing independent Bernoulli trials with the success probability of a trial Then where OPQ Corollary 42 is the discrete random variable for the number of nearest neighbors found by GO ING_CUB during the first iterations and a constant such that $ where OPQ Then the following holds: 440 oof $ ' Let be the random variable for the number of points of within Thus and for some point $ '+ the probability equals Let Because of we get by Corollary 41: By using L L L ' we obtain the stated inequality e choose to fulfill OPQ with OPQ $ e have By 440 and we obtain: ] where e specify the exact value of later ]

13 # # The total expected runtime of the GOINGCUB method e introduce some preliminary notations and observations The event 5 following 2 disjoint events for any : is the union of the $ S# 5 S# 5 $ 443 where point for any 5 e denote by 5 and by the left event and the right event of the union in 443 respectively e will use the following inequality: S# oposition 41 The expected runtime of step SLCT is S# S# oof e have: $ D G Y[_% 5 $ 5 5 +S Y[_% S S Since variables are independent we obtain by and 445 the following for OPQ : S# 5 S# 5 e used the fact 5 5 If then obviously Altogether we have: $ 5PD G Y[_% Y[_ Y[_% Y[_ In a similar way we bound the expected running time to compute the side lengths of the cubes L and 442 and we obtain: and the order e proceed in analogy to 47 use $ 57DF G 5 ^Ya # 446

14 N In the following we bound the expected runtime $ 5 D GCJ 8KFM for each of the considered searching procedures SCAN JCT and JCT_SCAN OPQ we obtain A If SCAN is used then for iteration $ 5? ' S where $ '+ and is as introduced in Section 212 the box obtained by relaxing the side lengths of the cube to unit intervals with respect to those dimensions different from the first ones in The equation 447 is easily verified by adapting the analysis of SCAN to the fact that only points outside are scanned By using and the fact that the points are drawn independently at random we obtain: S# This implies together with 22 and $ 5PD GJ 8KFM Y[Z +S 5 $ 5 D K J B C where 5 D K J is the total number of tests required by an independent call of SCAN Clearly 5 D K J C $ and by 27 5 D K J C $ for OPQ Note that the upper bound on 5`D GCJ 8KFM procedure is used e choose the smallest possible value of which fulfills 441 that is If then: 7 C 7 $ Thus by using 437 5`D GJ 8KFM $ 57D GJ 8KFM is monotone increasing in for this case when the SCAN can be bounded as follows: 7 7 If we have for big enough and In this case if the smallest such that Since 66 It can be shown that then we have by 442: L L Let be

15 T By 442 we get: $ 5PD GJ 8KFM procedure SCAN is 5 for If L ith this we obtain then the expected runtime of the searching Thus by 446 and oposition 41 the expected runtime of the GOINGCUB method with the searching procedure SCAN is for Notice that our method is only of interest if is significantly smaller then since tests are needed in any case and the bruteforce method cannot be beaten asymptotically by the GOING CUB method if B B e consider the procedure JCT to determine the point sets By 437 we obtain: $ L 5PD GJ 8KFM 5 'Y[Z $ 5B8 G G 5 P K L ' +S where 5 8 G G K is the number of comparisons required by JCT for a cube expected to contain points For the runtime analysis of the procedure JCT we focus on the number of tests of coordinates and tests of bit array entries e will count also the additional bit operations to build bit array The discrete random variables ' are defined as in Section 41 such that $ e obtain: $ 5B8 G G K C 5 ' points outside 5 ' U S Thus $ 57D GCJ 8LKNM Y[Z Y Z $ The essential step is to bound +S S S S S $ S 5 5 for is the event stating that dimension of point lies outside the interval Observing that 5 we obtain: 5 5 $ 5 ] ] L [ OPQ where $ S# 5 451

16 where is the side length of the first cube e also used 444 the fact that the points are drawn independently at random and that in each iteration the side length increases Obviously since 5 5 the inequalities and 427 imply: $ 57D GCJ 8LKNM Y[Z 7 Y[Z 7 7 Thus the expected 7 $ runtime 57689: X JCT is where ] The function P with takes its minimum at $ and is monotone increasing on e choose % If % $ then 5B689: P $ otherwise 5P689C: X T U of the GOINGCUB method with the searching procedure C e consider the procedure JCT_SCAN to determine the point sets based on the results obtained in Section 312 and Section 41 If The analysis is 7 then the GOINGCUB method calls the searching algorithm SCAN as an extremal 4 case of the procedure JCT_SCAN In this case the expected runtime of the method is 7 Now assume To compute partition such that 327 and 330 are fulfilled we need at most comparisons To bound the rejecting part in iteration OPQ we investigate 5#8 G G K 5 5B8 G G K is the number of comparisons required by an independent call of JCT_SCAN In analogy to 432 and 449 and by using 451 we have 5B8 G G K 5 T points outside ' 5 U By using the properties of the partition 5B8 G G K 5 The scanning part in iteration S S 5 S and 336 we get: 5 OPQ is given by: L P where is the box obtained by relaxing the side lengths of the cube to unit intervals except those with respect to the dimensions in and the first dimensions in the list By using the 68

17 same arguments as in 448 we obtain the following bound for the scanning part: S 5 S In analogy to the analysis presented in Section 41 we obtain by the properties of the partition the following: S ` ` which correspond to the dimen ` where is the geometric mean of those side lengths of sions in Since S 5 L for $ Thus 5B68L9: $ we obtain 5 D GCJ 8LKNM where Since in both cases the upper bound on the expected runtime of the GOINGCUB method is monotone increasing in we choose Thus the expected runtime of the GOINGCUB method with the searching procedure JCT_ SCAN is e summarize the results obtained for the GOINGCUB method with the searching procedure JCT_SCAN: Theorem 43 The GOINGCUB method with the searching procedure JCT_SCAN finds the nearest neighbors from the set of points to some query point with an expected asymptotic runtime of ] if and if the points of are drawn independently at random under uniform distribution The method requires storage and 43 Other distributions preprocessing time In the following we present a generalization of the runtime analysis of the discussed methods to other wellbehaved distributions For reasons of clarity we focus here only on the analysis of the CUB MTOD with the orthogonal range searching procedures SCAN JCT and SCAN_JCT Let be the random vector that represents the coordinates of a point and let $ ' be the joint distribution function of the random vector e consider the points of to be drawn independently at random under the following distributions: are independent e have a Independent distribution: The random variables is the distribution function of the random variable b Positivelybounded distributionsee [15]: e say that exists constants such that measure and for? $? S where is positivelybounded over the bounded convex open region if there for all where has Lebesgue The distribution function has the following property:? where? is a region of volume measure 69

18 T U xpected runtime analysis under independent distribution e assume that the points of are drawn independently at random under independent distribution For a given query point we $ e introduce the functions $ S with is con such Since the are continuous and monotone increasing the function tinuous and monotone increasing as well Thus we can determine the side length of the cube S by binary search where $ of the cube L The permutation of dimensions is chosen 452 is the parameter of the algorithm which determines the side length e measure the running time of the CUB MTOD by the number of comparisons of coordinates and bit array entries e omit here the computation of the side length and of the order of dimensions which depends on the complexity of computing the distribution functions equals Since the $ points are drawn independently at random the expected number of points in the $ Let be defined such First we consider the SCAN procedure The expected number $ 5 D K J SCAN procedure is given by: where $ 5 D K J S of tests of coordinates of the are obtained by relaxing the side lengths of the cube to unit intervals with respect to those equals By 25 and by Lemma 22 we obtain the same results as in Section 212: dimensions that are different from the first ones in the list The S $ 5 D K J 70 7

19 U e proceed in analogy to the analysis in Section 214 and we obtain that the CUB MTOD with the searching procedure SCAN performs an expected number The expected number $ 5 8 G G K JCT is given by Since S we obtain by Lemma 23 $ 578 G G K of comparisons of coordinates of comparisons of coordinates and bit array entries of the procedure P The rest of the analysis of the CUB MTOD with the searching procedure JCT is done as discussed in Section 31 e skip the details of the running time analysis of the procedure JCT_SCAN Its expected number of comparisons of coordinates and bit array entries is given in 325 where $ the functions are defined in 452 instead of representing the side lengths of the box '+ The crucial obtained in Section 312 can be expressed as function $ 5B8 G _D K $ observation is that the upper bound on 5 8 G _D K J of see 334 and 335 thus it holds also in the case when we assume independent distribution of the points Thus the CUB MTOD with the searching procedure JCT_SCAN P ] performs an expected number of comparisons of coordinates and bit array entries T xpected runtime analysis under positivelybounded distribution and at random under positivelybounded distribution In the following we use the notations and 5B8 G _D K J as introduced above e consider the procedure JCT to determine the point set e assume wlog $ ' where and is the box where the ' $ '+ S e consider the points of to be drawn independently 5 D K J 5B8 G G K $ represents the side length of $ ' For each point we obtain 453 and is given by 23 The side length is determined by the procedure SID_LNGT as described in Section 213 such 454 S where is a real parameter in $ in dimension of SID_LNGT By 453 and 454 we have for 71 :

20 e consider the order of dimensions to be defined as mentioned in Section 212 to correspond to the increasing order of the side lengths of the box By 22 and Lemma 22 we obtain: where L 0 $ 5 D K J S For the probability of an empty cube we have Note that by 453 we have To obtain a total asymptotic runtime of we guarantee 455 which is satisfied for 7 e choose 7 and we obtain that the expected asymptotic runtime of the CUB $ MTOD with SCAN is if the points are ' drawn independently at random from under positivelybounded distribution e consider the procedure JCT to determine the point set $ The expected number 578 G G K of comparisons of coordinates and bit array entries of the procedure JCT is given by $ 5B8 G G K S S $ $ # e used 454 and Lemma 22 As mentioned above we have the upper bound B of the CUB MTOD with JCT is T X U of is 7 X $ ' if the points are drawn independently at random from distribution Thus the expected runtime A suitable value The expected asymptotic runtime of the CUB MTOD with JCT is under positivelybounded The expected runtimes of the procedures SCAN and JCT can be bounded in terms of the volume of the box and the constants and This can be achieved also in the analysis of the procedure JCT_SCAN by combining the above results e obtain in a straightforward manner that the procedure JCT_SCAN has an expected runtime $ P of T U assuming that the points are ' drawn independently at random from under positivelybounded distribution Theorem 44 The CUB MTOD with one of the orthogonal range searching procedures SCAN JCT and SCAN_JCT has the expected asymptotic runtime given in Theorem 22 Theorem 31 and Theorem 32 respectively if the points of are drawn independently at random under independent distribution or positivelybounded distribution 72

21 5 44 xternalmemory NearestNeighbor Search Many database applications involve massive datasets which do not fit in the main memory of modern machines In such cases the data structures need to be stored on external storage devices such as disks A major performance bottleneck is the InputOutput IO communication between fast internalmemory and slow external storage Minimizing the IO communication is the essential problem considered in the design of externalmemory data structures and algorithms First we define the IO model of computation and then present an externalmemory variant of the CUB MTOD and of the CLL MTOD 441 The xternalmemory Model of Computation Modeling accurately memory and disks systems is a complex task which has been broadly surveyed e shall work in the externalmemory model of computation introduced by Aggarwal and Vitter [2] This standard twolevel disk model focuses on the fact that typical disks read or write large blocks of contiguous data at once in order to amortize the access time over a large amount of data The model has the following parameters [ ]: number of objects in the problem instance number of objects in the problem solution number of objects that can fit into internal memory number of objects per disk block where An IO operation is the reading or writing a block from or into disk respectively see Figure 41 Computation is only performed on objects in internal memory The measures of performance in this model are the number of IOs used to solve a problem as well as the amount of space disk blocks used and the internal memory computation time P M B D Figure 41: An IO moves contiguous elements between disk and main memory More complex multilevel memory models have been considered in the literature where several disks are used in parallel to increase the performance of IO systems e refer to a recent survey by Vitter [64] e primarily want to model the extremely long disk access time relative to the internal running time Therefore we ignore the internal computation time in our algorithmic performance analysis 442 The xternalmemory Data Structure e investigate the nearestneighbor problem in the externalmodel of computation ere an object or item is a coordinate of some data point The number of objects of the problem instance is of some data point 73 or an index

22 e consider the GOINGCUB method introduced in Section 41 and we compute the expected number of IO operations performed by the external variant of this method to determine the nearest neighbor to a query point from the data set e assume that the data points are drawn independently at random under uniform distribution e compare the external GOINGCUB with the external bruteforce method which stores the points with their coordinates successively in contiguous locations of the external memory The external bruteforce method loads the data from the disk and determines the nearest neighbor to the query with IO operations The GOINGCUB method considers cubes of increasing side lengths around the query point until a cube containing points of is found In order to determine the points of a cube expected to contain points the GOINGCUB method calls the orthogonal searching algorithm JCT This algorithm is the only one among the presented searching procedures having an efficient extension in the externalmemory model of computation In the case of the JCT procedure the objects to be tested in the main memory are stored in contiguous locations in disk blocks More precisely the external data structure consists mainly of the arrays 5 introduced in Section 311 For a dimension the entries 5 $ are stored successively in contiguous locations on the disk The th entry is 5 $ where is the th largest coordinate in the dimension and is the index of the corresponding point Additionally the external data structure contains all points of such that all coordinates for some point are stored successively in contiguous locations of the external memory The external data structure is stored in blocks on the disk First we analyze the expected IO complexity of the JCT procedure which is required to determine the set of the points contained in a cube itself expected to contain data points The expected IO complexity of testing coordinates The external JCT procedure consists essentially of reading blocks containing objects 5 $ for each dimension and testing in the main memory whether the corresponding coordinates are still out of the interval $ The objects are read exactly in the same order as the order considered by the internal JCT procedure described in Section 311 : 5 $ F 5 $ for some index and 5 $ F 5 $ for some index see Figure 42 Thus for a given dimension almost all data blocks which are read from the disk contain exclusively relevant data which needs to be tested in the main memory In fact for each dimension there might be at most two read blocks that contain also coordinates which are not of interest The number of read operations required by JCT to test the coordinates of points is bounded by $ points outside points outside S S The additional multiplicative factor 2 comes from the fact that an array entry 5 $ contains 2 objects Let be the side lengths of the intervals By 31 and 32 we get $ points outside P By using the linearity of the expectation we obtain that the expected number of IO transfers performed by JCT in order to make the necessary test of coordinates is bounded by 7 where is the expected number of points of contained in the cube

23 _ T T _ dim j B items B items 0 T j [1] T j [2] T j [l] 1 Figure 42: Almost all blocks contain exclusively relevant data The expected IO complexity of marking rejected points The internal searching algorithm JCT uses a bit array to mark the rejected points in order to determine the points inside the cube A realistic assumption is that this bit array fits in the main memory Under this hypothesis the total IO complexity of the JCT procedure to determine the points of contained in the cube is In case the bit array has to be stored in external memory we proceed as follows to mark the rejected points hile testing coordinates the algorithm fills up a part of the main memory with indices of points which have to be marked as rejected This part of the memory consists of contiguous locations and has size After the fillup phase we determine the blocks of the bit array we need for the marking process and load them into a part of the main memory of size consisting of free contiguous locations The bit entries of interest are set to 1 in the blocks loaded in the main memory If necessary we repeatedly replace those blocks that are not relevant in the current phase anymore These fillup and marking phases are repeated until all coordinates were tested Blocks of the bit array that are already in the memory due to a previous phase can be reused The worst case situation is that the expected number of IO operations required to mark the rejected points is linear in the number of rejecting steps that is This is based on the fact that loaded blocks of the bit array may contain only one entry of interest e propose an alternative procedure to determine the unrejected points Instead of using a bit array we gather the indices of the rejected points during the process of testing coordinates This is done by repeatedly filling up the internal memory sorting the indices such that each index appears only once and writing those indices in contiguous locations of the external memory e end up having blocks each of them containing sorted indices By using the external merge sort algorithm introduced by Aggarwal and Vitter [2] we obtain a list with the indices of the rejected points in increasing order in 1032 _ IOs This list is stored in _ U blocks in the external memory The list is scanned and we determine in T _ U The missing indices are stored in _ U in the cube from the list blocks in the external memory They represent the points contained and are the result of the external JCT procedure IOs the missing indices 75 7

24 Y # # $ _ Y e take a closer look at the above bound 1032 _ 032 of and the base of the logarithm is large in the order of Thus _ For typical values is less 1032 than 4 for all realistic values of and Thus in the above bound the important term is not the term but the term in the denominator of This observation represents an important feature of the external sorting The total IO complexity e consider in the following the realistic assumption that the bit array fits in the main memory In this case the JCT procedure performs expected at most 4 IOs to determine the points of contained in some cube where $ The GOINGCUB method works as described in Section 41 and we choose the sequence such that e specify the value of the parameter later The expected number of points in the cube fulfills where OPQ and OPQ ' Let $ By using 428 and 421 we obtain the following upper bound on the expected number of IOs performed by GOINGCUB during the JCT phase: where S T@ S S S VUX P P points outside ' S S $ P Next we look at the expected number of IO operations performed in the bruteforce part of the external GOINGCUB method If the current cube is the first cube not to be empty then the bruteforce method is called which loads the coordinates of the data points and determines the minimum distance of these points to the query Since the data points contained in are not necessarily stored successively in the external memory the bruteforce method loads each point separately and requires IO operations per loaded point in the worst case e proceed in analogy to 48 and 422 and obtain that the expected number of IOs performed in the bruteforce part of the GOINGCUB is bounded by for an appropriate constant Y Y A realistic assumption is that a data point or a query point fits in the main memory In this case the computation of the side lengths can be done entirely in the main memory since they depend only on the Y 76 Y

25 _ If then by 419 the computation of the side lengths requires 7 IO operations This is based on the fact that as seen in Section 213 the computation of takes steps ach step involves an evaluation of the actual volume which takes IO operations Altogether we obtain the following upper bound on the number of IOs required by the external GOINGCUB to compute the nearest neighbor to the query point : 7 query point and the values % 7 expected _ Since the function takes its minimum at _ we set the parameter to equal _ ' Summarizing we obtain that the external GOINGCUB method finds the nearest neighbor from the _ IO operations data set to the query point in The external GOINGCUB 7 method compares favorably with the external bruteforce method if and which is a realistic assumption in the externalmemory model where the data points do not fit in the main memory 77

26 45 xperiments e present an experimental comparison of implementations of the presented variants of the CUB MTOD and GOINGCUB method e emphasize the factors by which the algorithms are faster than the bruteforce method $ 'N e test our algorithms on random test data e generate data sets in each of size according $ 'C to the uniform distribution For each data set a query point is chosen randomly in For each of the considered settings of and we generate 20 pairs each consisting of a random data set and a random query point Based on those 20 pairs we compute for each considered setting of and the average speedup of our algorithms with respect to the bruteforce method In our first experiments we consider query algorithms that do not need preprocessing Figure 43 shows 3 an experimental comparison for dimension and number of points between 1000 and The value on the vertical scale is the factor by which each algorithm is faster then the bruteforce method The speedup factors of the CUB MTOD and the GOINGCUB method using the SCAN searching procedure are represented by the curves cube and grow respectively The curves adaptcube and adaptgrow illustrate the speedup factors of the methods when the range searching procedure ADAPTIV_SCAN introduced in Section 22 is used The speedup factors of the algorithms measured in practice show a similar behavior to the speedup factor presented in Chapter 2 As expected the ADAPTIV_SCAN improves the query algorithms The observed improvement of the GOINGCUB method over the CUB MTOD is based on the fact that the GOINGCUB method starts with a smaller cube speedup Speedup with respect to bruteforce with dim100 adaptgrow adaptcube grow cube number of points speedup Speedup with respect to bruteforce number of nearest neighbors number of points Figure 43: Comparison of the CUB method and GOINGCUB method with the bruteforce method Figure 44: Comparison for nearestneighbor search of the GOINGCUB method using SCAN with the bruteforce method Figure 44 shows a comparison of the GOINGCUB method for nearestneighbor search with the bruteforce method which computes all distances of the points of to the query point and selects the smallest distances among them The GOINGCUB method performs the orthogonal range searching of the considered cubes by the searching algorithm SCAN The experiments have been carried out for dimension 3 number of points between 1000 and and number of nearest neighbors between and The experimental comparison reflects the speedup factor 7 proven in Section 422 Next we consider the GOINGCUB method with the searching procedures JCT and JCT_SCAN and compare it with the bruteforce method and the GOINGCUB method when using the procedure ADAPTIV_SCAN 78

27 60 50 Speedup with respect to bruteforce with d200 reject_scan reject adaptive_scan Speedup with respect to bruteforce with n10000 reject_scan reject adaptive_scan speedup speedup number of points Figure 45: Speedup factors for fixed dimension dimension Figure 46: Speedup factors for fixed number of points B Figure 45 shows an experimental comparison of the searching algorithms for dimension 3 and sizes of the data set between and Figure 46 shows the experimental comparison 333 of the searching algorithms for and dimension varying between 100 and 500 The value on the vertical scale is the factor by which each algorithm is faster then the bruteforce method Both figures illustrate that for high dimensions the methods using preprocessing dominate clearly the GOING CUB method with the ADAPTIV_SCAN searching procedure Furthermore the JCT_SCAN procedure is better than the procedure JCT as expected from the runtime analysis presented in Section 31 The speedup factor of the GOINGCUB method with the JCT_SCAN procedure is with respect to the bruteforce method Figure 46 points out the dependency of the speedup on the dimension for the methods using preprocessing A speedup factor above 30 is a considerable improvement in practice e conclude with the experimental comparison for nearestneighbor search In Figure 47 the GOINGCUB method using JCT_SCAN is compared with the bruteforce method The experiments 333 have been carried out for number of points dimension between 100 and 500 and number of nearest neighbors between and A growth of the speedup factor with the dimension can be seen ven for big values of the number of nearest neighbors the speedup factor of the GOINGCUB with respect to the bruteforce method is between 10 and 30 79

28 Speedup with respect to bruteforce for n10000 speedup number of nearest neighbors dimension Figure 47: Comparison for nearestneighbor search of the GOINGCUB method using JCT_SCAN with the bruteforce method 80

The divide-and-conquer paradigm involves three steps at each level of the recursion: Divide the problem into a number of subproblems.

The divide-and-conquer paradigm involves three steps at each level of the recursion: Divide the problem into a number of subproblems. 2.3 Designing algorithms There are many ways to design algorithms. Insertion sort uses an incremental approach: having sorted the subarray A[1 j - 1], we insert the single element A[j] into its proper