D-Index: Distance Searching Index for Metric Data Sets


Multimedia Tools and Applications, 21, 9–33, 2003. © 2003 Kluwer Academic Publishers. Manufactured in The Netherlands.

D-Index: Distance Searching Index for Metric Data Sets

VLASTISLAV DOHNAL, Masaryk University Brno, Czech Republic, xdohnal@fi.muni.cz
CLAUDIO GENNARO, ISI-CNR, Via Moruzzi 1, 56124 Pisa, Italy, c.gennaro@isti.cnr.it
PASQUALE SAVINO, ISI-CNR, Via Moruzzi 1, 56124 Pisa, Italy, p.savino@isti.cnr.it
PAVEL ZEZULA, Masaryk University Brno, Czech Republic, zezula@fi.muni.cz

Abstract. In order to speed up retrieval in large collections of data, index structures partition the data into subsets so that query requests can be evaluated without examining the entire collection. As the complexity of modern data types grows, metric spaces have become a popular paradigm for similarity retrieval. We propose a new index structure, called D-Index, that combines a novel clustering technique and the pivot-based distance searching strategy to speed up execution of similarity range and nearest neighbor queries for large files with objects stored in disk memories. We have qualitatively analyzed D-Index and verified its properties on an actual implementation. We have also compared D-Index with other index structures and demonstrated its superiority on several real-life data sets. Contrary to tree organizations, the D-Index structure is suitable for dynamic environments with a high rate of delete/insert operations.

Keywords: metric spaces, similarity search, index structures, performance evaluation

1. Introduction

The concept of similarity searching based on relative distances between a query and database objects has become essential for a number of application areas, e.g. data mining, signal processing, geographic databases, information retrieval, or computational biology. These applications usually exploit data types such as sets, strings, vectors, or complex structures that can be exemplified by XML documents. Intuitively, the problem is to find objects similar to a query object according to a domain specific distance measure.
The problem can be formalized by the mathematical notion of the metric space, so the data elements are assumed to be objects from a metric space where only pairwise distances between the objects can be determined. The growing need to deal with large, possibly distributed, archives requires indexing support to speed up retrieval. An interesting survey of indexes built on the metric distance paradigm can be found in [4]. The common assumption is that the costs to build and to maintain an index structure are much less important than the costs needed to execute a query. Though this is certainly true for prevalently static collections of data,

such as those occurring in traditional information retrieval, some other applications may need to deal with dynamic objects that are permanently subject to change as a function of time. In such environments, updates are inevitable and their costs become a matter of concern.

One important characteristic of distance-based index structures is that, depending on the specific data and distance function, the performance can be either I/O or CPU bound. For example, let us consider two different applications: in the first one, the similarity of text strings of one hundred characters is computed by using the edit distance, which has a quadratic computational complexity; in the second application, one-hundred-dimensional vectors are compared by the inner product, which has a linear computational complexity. Since the text objects are four times shorter than the vectors and the distance computation on the vectors is one hundred times faster than the edit distance, the minimization of I/O accesses is much more important for the vectors than for the text strings. On the contrary, the CPU costs are more significant for computing the distance of text strings than for the distance of vectors. In general, a distance-based index should be able to minimize both of the cost components.

An index structure supporting similarity retrieval should be able to execute similarity range queries with any search radius. However, a significant group of applications needs retrieval of objects in a very near vicinity of the query object, e.g., copy (replica) detection, cluster analysis, DNA mutations, corrupted signals after their transmission over noisy channels, or typing (spelling) errors. Thus, in some cases, special attention to small radii is warranted. In this article, we propose a similarity search structure, called D-Index, that is able to reduce both the I/O and CPU costs.
The organization stores data objects in buckets with direct access to avoid hierarchical bucket dependencies. Such an organization also results in very efficient insertion and deletion of objects. Though similarity constraints of queries can be defined arbitrarily, the structure is extremely efficient for queries searching for very close objects. A short preliminary version of this article is available in [8]. The major extensions concern generic search algorithms, an extensive experimental evaluation, and a comparison with other index structures.

The rest of the article is organized as follows. In Section 2, we define the general principles of the D-Index. The system structure, algorithms, and design issues are specified in Section 3, while experimental evaluation results can be found in Section 4. The article concludes in Section 5.

2. Search strategies

A metric space M = (D, d) is defined by a domain of objects (elements, points) D and a total (distance) function d, a non-negative and symmetric function which satisfies the triangle inequality d(x, y) ≤ d(x, z) + d(z, y) for all x, y, z ∈ D. Without any loss of generality, we assume that the distance never exceeds a maximum distance d^+. In general, we study the following problem: given a set X ⊆ D in the metric space M, pre-process or structure the elements of X so that similarity queries can be answered efficiently.
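As a concrete baseline for the problem just stated, the following sketch (illustrative code, not from the paper; the word list and function names are mine) answers a similarity query by a linear scan over X, i.e. |X| distance computations; the index structures discussed below aim to do better. The edit distance is one of the metrics mentioned in the introduction.

```python
# Illustrative baseline (not from the paper): a similarity range query
# answered by a linear scan over X, i.e. |X| distance computations.
# The edit distance is one of the metrics discussed in the introduction.

def edit_distance(a: str, b: str) -> int:
    """Dynamic-programming edit distance; quadratic in the string length."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def range_query(X, q, r, d):
    """All x in X with d(q, x) <= r."""
    return [x for x in X if d(q, x) <= r]

words = ["ball", "bell", "tall", "metric", "matrix"]
print(range_query(words, "tell", 1, edit_distance))  # -> ['bell', 'tall']
```

Every query pays the full cost of |X| distance computations, which is exactly what the clustering and pivoting techniques of the following sections avoid.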

For a query object q ∈ D, two fundamental similarity queries can be defined. A range query retrieves all elements within distance r of q, that is, the set {x ∈ X | d(q, x) ≤ r}. A nearest neighbor query retrieves the k closest elements to q, that is, a set A ⊆ X such that |A| = k and ∀x ∈ A, y ∈ X − A: d(q, x) ≤ d(q, y).

According to the survey [4], existing approaches to searching general metric spaces can be classified into two basic categories:

Clustering algorithms. The idea is to divide, often recursively, the search space into partitions (groups, sets, clusters) characterized by some representative information.

Pivot-based algorithms. The second trend is based on choosing t distinguished elements from D, storing for each database element x the distances to these t pivots, and using such information to speed up retrieval.

Though the number of existing search algorithms is impressive, see [4], the search costs are still high, and there is no algorithm that is able to outperform the others in all situations. Furthermore, most of the algorithms are implemented as main memory structures, and thus do not scale up well to large data files. The access structure we propose in this paper, called D-Index, synergistically combines several principles into a single system in order to minimize the amount of accessed data as well as the number of distance computations. In particular, we define a new technique to recursively cluster data in separable partitions of data blocks and we combine it with pivot-based strategies to decrease the I/O costs. In the following, we first formally specify the underlying principles of our clustering and pivoting techniques and then present examples to illustrate the ideas behind the definitions.

2.1. Clustering through separable partitioning

To achieve our objectives, we base our partitioning principles on a mapping function, which we call the ρ-split function, where ρ is a real number constrained as ρ < d^+.
In order to gradually explain the concept of ρ-split functions, we first define a first order ρ-split function and its properties.

Definition 1. Given a metric space (D, d), a first order ρ-split function s^{1,ρ} is the mapping s^{1,ρ}: D → {0, 1, −}, such that for arbitrary different objects x, y ∈ D:

s^{1,ρ}(x) = 0 ∧ s^{1,ρ}(y) = 1 ⇒ d(x, y) > 2ρ (separable property),
ρ_2 ≥ ρ_1 ∧ s^{1,ρ_2}(x) ≠ − ∧ s^{1,ρ_1}(y) = − ⇒ d(x, y) > ρ_2 − ρ_1 (symmetric property).

In other words, the ρ-split function assigns to each object of the space D one of the symbols 0, 1, or −. Moreover, the function must satisfy the separable and symmetric properties. The meaning and the importance of these properties will be clarified later.

We can generalize the ρ-split function by concatenating n first order ρ-split functions with the purpose of obtaining a split function of order n.

Definition 2. Given n first order ρ-split functions s_1^{1,ρ}, ..., s_n^{1,ρ} in the metric space (D, d), a ρ-split function of order n, s^{n,ρ} = (s_1^{1,ρ}, s_2^{1,ρ}, ..., s_n^{1,ρ}): D → {0, 1, −}^n, is the mapping such that for arbitrary different objects x, y ∈ D:

∀i s_i^{1,ρ}(x) ≠ − ∧ ∀j s_j^{1,ρ}(y) ≠ − ∧ s^{n,ρ}(x) ≠ s^{n,ρ}(y) ⇒ d(x, y) > 2ρ (separable property),
ρ_2 ≥ ρ_1 ∧ ∀i s_i^{1,ρ_2}(x) ≠ − ∧ ∃j s_j^{1,ρ_1}(y) = − ⇒ d(x, y) > ρ_2 − ρ_1 (symmetric property).

An obvious consequence of the ρ-split function definitions, useful for our purposes, is that by combining n ρ-split functions of order 1, s_1^{1,ρ}, ..., s_n^{1,ρ}, which have the separable and symmetric properties, we obtain a ρ-split function of order n, s^{n,ρ}, which also satisfies the separable and symmetric properties. We often refer to the number of symbols generated by s^{n,ρ}, that is, the parameter n, as the order of the ρ-split function. In order to obtain an addressing scheme, we need another function that transforms the ρ-split strings into integers, which we define as follows.

Definition 3. Given a string b = (b_1, ..., b_n) of n elements 0, 1, or −, the function ⟨·⟩: {0, 1, −}^n → [0..2^n] is specified as:

⟨b⟩ = [b_1, b_2, ..., b_n]_2 = Σ_{j=1}^{n} 2^{n−j} b_j, if ∀j b_j ≠ −;
⟨b⟩ = 2^n, otherwise.

When all the elements are different from −, the function ⟨b⟩ simply translates the string b into an integer by interpreting it as a binary number (which is always < 2^n); otherwise, the function returns 2^n.

By means of the ρ-split function and the ⟨·⟩ operator we can assign an integer number i (0 ≤ i ≤ 2^n) to each object x ∈ D, i.e., the function can group objects from X ⊆ D in 2^n + 1 disjoint sub-sets. Assuming a ρ-split function s^{n,ρ}, we use the capital letter S^{n,ρ} to designate partitions (sub-sets) of objects from X, which the function can produce. More precisely, given a ρ-split function s^{n,ρ} and a set of objects X, we define S^{n,ρ}_{[i]}(X) = {x ∈ X | ⟨s^{n,ρ}(x)⟩ = i}. We call the first 2^n sets the separable sets. The last set, that is, the set of objects for which the function ⟨s^{n,ρ}(x)⟩ evaluates to 2^n, is called the exclusion set.
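A minimal sketch of the ⟨·⟩ operator of Definition 3 follows (the function name is mine; the most-significant-bit-first weighting is an assumption consistent with the binary interpretation [b_1, ..., b_n]_2 and with the bucket indices used in the later figure-4 example):

```python
# Sketch of the <.> operator of Definition 3: a split string over
# {0, 1, '-'} is mapped to an integer in [0 .. 2**n]; any '-' sends the
# object to the exclusion set with index 2**n. The function name is mine.

def bucket_index(b):
    """Map a split string b (sequence of 0, 1, or '-') to [0 .. 2**n]."""
    n = len(b)
    if '-' in b:
        return 2 ** n                # exclusion set
    # interpret (b_1, ..., b_n) as a binary number, b_1 most significant
    return sum(bit * 2 ** (n - j) for j, bit in enumerate(b, 1))

print(bucket_index((0, 1)))      # -> 1
print(bucket_index((1, 1)))      # -> 3
print(bucket_index(('-', 1)))    # -> 4, the exclusion set for n = 2
```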
For illustration, given two split functions of order 1, s_1^{1,ρ} and s_2^{1,ρ}, a ρ-split function of order 2 produces strings as a concatenation of the strings returned by s_1^{1,ρ} and s_2^{1,ρ}. The combination of the ρ-split functions can also be seen from the perspective of the object data set. Considering again the illustrative example above, the resulting function of order 2 generates an exclusion set as the union of the exclusion sets of the two first order split functions, i.e. S^{1,ρ}_{1[2]}(D) ∪ S^{1,ρ}_{2[2]}(D). Moreover, the four separable sets, which are given by the combination of the separable sets of the original split functions, are determined as follows: {S^{1,ρ}_{1[0]}(D) ∩ S^{1,ρ}_{2[0]}(D), S^{1,ρ}_{1[0]}(D) ∩ S^{1,ρ}_{2[1]}(D), S^{1,ρ}_{1[1]}(D) ∩ S^{1,ρ}_{2[0]}(D), S^{1,ρ}_{1[1]}(D) ∩ S^{1,ρ}_{2[1]}(D)}.

The separable property of the ρ-split function allows us to partition a set of objects X into sub-sets S^{n,ρ}_{[i]}(X), so that the distance from an object in one sub-set to an object in a different sub-set (for all i < 2^n) is more than 2ρ. We say that such a disjoint separation of sub-sets, or partitioning, is separable up to 2ρ. This property will be used during retrieval, since a range query with radius r ≤ ρ requires access to only one of the separable sets and, possibly, the exclusion set. For convenience, we denote a set of separable

partitions {S^{n,ρ}_{[0]}(D), ..., S^{n,ρ}_{[2^n−1]}(D)} as {S^{n,ρ}_{[·]}(D)}. The last partition, S^{n,ρ}_{[2^n]}(D), is the exclusion set. When the ρ parameter changes, the separable partitions shrink or widen correspondingly. The symmetric property guarantees a uniform reaction in all the partitions. This property is essential in the similarity search phase, as detailed in Sections 3.2 and 3.3.

2.2. Pivot-based filtering

In general, the pivot-based algorithms can be viewed as a mapping F from the original metric space (D, d) to a t-dimensional vector space with the L∞ distance. The mapping assumes a set T = {p_1, p_2, ..., p_t} of objects from D, called pivots, and for each database object o, the mapping determines its characteristic (feature) vector as F(o) = (d(o, p_1), d(o, p_2), ..., d(o, p_t)). We designate the new metric space as M_T = (R^t, L∞). At search time, we compute for a query object q the query feature vector F(q) = (d(q, p_1), d(q, p_2), ..., d(q, p_t)) and discard, for the search radius r, an object o if

L∞(F(o), F(q)) > r.    (1)

In other words, the object o can be discarded if for some pivot p_i,

|d(q, p_i) − d(o, p_i)| > r.    (2)

Due to the triangle inequality, the mapping F is contractive, that is, all discarded objects do not belong to the result set. However, some not-discarded objects may not be relevant and must be verified through the original distance function d(·). For more details, see for example [3].

2.3. Illustration of the idea

Before we specify the structure and the insertion/search algorithms of D-Index, we illustrate the ρ-split clustering and pivoting techniques by simple examples. Though several different types of first order ρ-split functions are proposed, analyzed, and evaluated in [7], the ball partitioning split (bps), originally proposed in [12] under the name excluded middle partitioning, provided the smallest exclusion set. For this reason, we also apply this approach, which can be characterized as follows.
The ball partitioning ρ-split function bps^{1,ρ}(x, x_v) uses one object x_v ∈ D and the median distance d_m to partition the data file into three subsets BPS^{1,ρ}_{[0]}, BPS^{1,ρ}_{[1]}, and BPS^{1,ρ}_{[2]}; the result of the following function gives the index of the set to which the object x belongs (see figure 1):

bps^{1,ρ}(x) = 0, if d(x, x_v) ≤ d_m − ρ
bps^{1,ρ}(x) = 1, if d(x, x_v) > d_m + ρ
bps^{1,ρ}(x) = −, otherwise    (3)

Figure 1. The excluded middle partitioning.

Note that the median distance d_m is relative to x_v and it is defined so that the number of objects with distances smaller than d_m is the same as the number of objects with distances larger than d_m. For more details see [11]. In order to see that the bps ρ-split function behaves in the desired way, it is sufficient to prove that the separable and symmetric properties hold.

Separable property. We have to prove that ∀x_0 ∈ BPS^{1,ρ}_{[0]}, x_1 ∈ BPS^{1,ρ}_{[1]}: d(x_0, x_1) > 2ρ. If we consider Definition 1 of the split function, we can write d(x_0, x_v) ≤ d_m − ρ and d(x_1, x_v) > d_m + ρ. Since the triangle inequality among x_0, x_1, x_v holds, we get d(x_0, x_1) + d(x_0, x_v) ≥ d(x_1, x_v). If we combine these inequalities and simplify the expression, we get d(x_0, x_1) > 2ρ.

Symmetric property. We have to prove that ∀x_0 ∈ BPS^{1,ρ_2}_{[0]}, x_1 ∈ BPS^{1,ρ_2}_{[1]}, y ∈ BPS^{1,ρ_1}_{[2]}: d(x_0, y) > ρ_2 − ρ_1 and d(x_1, y) > ρ_2 − ρ_1, with ρ_2 ≥ ρ_1. We prove it only for the object x_0 because the proof for x_1 is analogous. Since y ∈ BPS^{1,ρ_1}_{[2]}, we have d(y, x_v) > d_m − ρ_1. Moreover, since x_0 ∈ BPS^{1,ρ_2}_{[0]}, we have d_m − ρ_2 ≥ d(x_0, x_v). By summing both sides of the above inequalities we obtain d(y, x_v) − d(x_0, x_v) > ρ_2 − ρ_1. Finally, from the triangle inequality we have d(x_0, y) ≥ d(y, x_v) − d(x_0, x_v), and therefore d(x_0, y) > ρ_2 − ρ_1.

As explained in Section 2.1, once we have defined a set of first order ρ-split functions, it is possible to combine them in order to obtain a function which generates more partitions. This idea is depicted in figure 2, where two bps-split functions on the two-dimensional space are used. The domain D, represented by the grey square, is divided into four regions {S^{2,ρ}_{[·]}(D)} = {S^{2,ρ}_{[0]}(D), S^{2,ρ}_{[1]}(D), S^{2,ρ}_{[2]}(D), S^{2,ρ}_{[3]}(D)}, corresponding to the separable partitions.
The partition S^{2,ρ}_{[4]}(D), the exclusion set, is represented by the brighter region and is formed by the union of the exclusion sets resulting from the two splits.
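The bps split of Eq. (3) and its order-2 combination can be sketched as follows (a toy one-dimensional metric; the pivots, d_m, and ρ values are illustrative, not from the paper):

```python
# Sketch of the ball-partitioning split (3) and the order-2 combination
# shown in figure 2. The 1-D metric and all numeric values are toys.

def bps(x, x_v, d_m, rho, d):
    """First order rho-split: 0 (inner ball), 1 (outside), '-' (excluded middle)."""
    dist = d(x, x_v)
    if dist <= d_m - rho:
        return 0
    if dist > d_m + rho:
        return 1
    return '-'

def split2(x, pivots, medians, rho, d):
    """Order-2 split string; any '-' puts x into the exclusion set."""
    return tuple(bps(x, p, m, rho, d) for p, m in zip(pivots, medians))

d = lambda a, b: abs(a - b)                      # toy metric
for x in [1, 3, 5, 9]:
    print(x, split2(x, pivots=[4, 8], medians=[3.0, 3.0], rho=0.5, d=d))
```

Objects whose string contains a − (here, those at distance within ρ of a median sphere) fall into the exclusion set, exactly as the union of the two excluded middles in figure 2.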

Figure 2. Clustering through partitioning in the two-dimensional space.

Figure 3. Example of pivots behavior.

In figure 3, the basic principle of the pivoting technique is illustrated, where q is the query object, r the query radius, and p_i a pivot. Provided the distance between any object and p_i is known, the gray area represents the region of objects x that do not belong to the query result. This can easily be decided, without actually computing the distance between q and x, by using the triangle inequalities d(x, p_i) + d(x, q) ≥ d(p_i, q) and d(p_i, q) + d(x, q) ≥ d(x, p_i), and respecting the query radius r. Naturally, by using more pivots, we can improve the probability of excluding an object without actually computing its distance from q.

3. System structure and algorithms

In this section we specify the structure of the Distance Searching Index, D-Index, and discuss the problems related to the definition of ρ-split functions and the choice of reference objects.

Then we specify the insertion algorithm and outline the principle of searching. Finally, we define the generic similarity range and nearest neighbor algorithms.

3.1. Storage architecture

The basic idea of the D-Index is to create a multilevel storage and retrieval structure that uses several ρ-split functions, one for each level, to create an array of buckets for storing objects. On the first level, we use a ρ-split function for separating objects of the whole data set. For any other level, objects mapped to the exclusion bucket of the previous level are the candidates for storage in the separable buckets of this level. Finally, the exclusion bucket of the last level forms the exclusion bucket of the whole D-Index structure. It is worth noting that the ρ-split functions of individual levels use the same ρ. Moreover, the split functions can have different orders, typically decreasing with the level, allowing the D-Index structure to have levels with different numbers of buckets. More precisely, the D-Index structure can be defined as follows.

Definition 4. Given a set of objects X, the h-level distance searching index DI^ρ(X; m_1, m_2, ..., m_h), with the buckets at each level separable up to 2ρ, is determined by h independent ρ-split functions s_i^{m_i,ρ} (i = 1, 2, ..., h) which generate:

Exclusion bucket: E_i = S^{m_1,ρ}_{1[2^{m_1}]}(X) if i = 1; E_i = S^{m_i,ρ}_{i[2^{m_i}]}(E_{i−1}) if i > 1.

Separable buckets: {B_{i,0}, B_{i,1}, ..., B_{i,2^{m_i}−1}} = {S^{m_1,ρ}_{1[·]}(X)} if i = 1; {S^{m_i,ρ}_{i[·]}(E_{i−1})} if i > 1.

From the structure point of view, the buckets can be seen as the following two-dimensional array consisting of 1 + Σ_{i=1}^{h} 2^{m_i} elements:

B_{1,0}, B_{1,1}, ..., B_{1,2^{m_1}−1}
B_{2,0}, B_{2,1}, ..., B_{2,2^{m_2}−1}
...
B_{h,0}, B_{h,1}, ..., B_{h,2^{m_h}−1}, E_h

All separable buckets are included, but only the exclusion bucket E_h is present; the exclusion buckets E_i, i < h, are recursively re-partitioned on level i + 1. Then, for each row (i.e.
the D-Index level) i, the 2^{m_i} buckets are separable up to 2ρ, thus we are sure that no two buckets at the same level i can both contain objects relevant to a similarity range query with radius r ≤ ρ. Naturally, there is a trade-off between the number of separable buckets and the amount of objects in the exclusion bucket: the more separable buckets there are, the greater the number of objects in the exclusion bucket. However, the set of objects in the exclusion bucket can recursively be re-partitioned, with a possibly different number of bits (m_i) and certainly different ρ-split functions.
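A toy rendering of this multilevel organization may help (the class and helper names are mine; each split function is assumed to return the integer ⟨s_i(x)⟩ directly):

```python
# Sketch of the storage of Definition 4: h levels of separable buckets
# plus one global exclusion bucket. split_fns[i](x) is assumed to return
# the integer <s_i(x)> in [0 .. 2**m_i]. All names are mine.

class DIndexSketch:
    def __init__(self, split_fns, orders):
        self.split_fns = split_fns        # one rho-split per level
        self.orders = orders              # m_1, ..., m_h
        self.buckets = [[[] for _ in range(2 ** m)] for m in orders]
        self.exclusion = []               # E_h

    def insert(self, x):
        # the first level whose split does not exclude x stores it
        for level, (s, m) in enumerate(zip(self.split_fns, self.orders)):
            i = s(x)
            if i < 2 ** m:
                self.buckets[level][i].append(x)
                return
        self.exclusion.append(x)          # excluded on every level

def s1(x):
    # toy order-1 ball split around pivot 4 with d_m = 3, rho = 0.5
    dist = abs(x - 4)
    if dist <= 2.5:
        return 0
    if dist > 3.5:
        return 1
    return 2                              # exclusion index 2**1

idx = DIndexSketch([s1], [1])
for x in [1, 3, 5, 9]:
    idx.insert(x)
print(idx.buckets, idx.exclusion)
```

Every inserted object lands in exactly one bucket, which is the property the insertion algorithm of the next subsection relies on.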

3.2. Insertion and search strategies

In order to complete our description of the D-Index, we present an insertion algorithm and a sketch of a simplified search algorithm. Advanced search techniques are described in the next section. Insertion of an object x ∈ X in DI^ρ(X; m_1, m_2, ..., m_h) proceeds according to the following algorithm.

Algorithm 1. Insertion
for i = 1 to h
  if ⟨s_i^{m_i,ρ}(x)⟩ < 2^{m_i} then
    x → B_{i,⟨s_i^{m_i,ρ}(x)⟩}; exit;
  end if
end for
x → E_h;

Starting from the first level, Algorithm 1 tries to accommodate x in a separable bucket. If a suitable bucket exists, the object is stored there. If this fails for all levels, the object x is placed in the exclusion bucket E_h. In any case, the insertion algorithm determines exactly one bucket to store the object.

Given a query region Q = R(q, r_q) with q ∈ D and r_q ≤ ρ, a simple algorithm can execute the query as follows.

Algorithm 2. Search
for i = 1 to h
  return all objects x such that x ∈ Q ∩ B_{i,⟨s_i^{m_i,0}(q)⟩};
end for
return all objects x such that x ∈ Q ∩ E_h;

The function ⟨s_i^{m_i,0}(q)⟩ always gives a value smaller than 2^{m_i}, because ρ = 0. Consequently, one separable bucket on each level i is determined. Objects from the query response set cannot be in any other separable bucket on level i, because r_q is not greater than ρ and the buckets are separable up to 2ρ. However, some of them can be in the exclusion zone, so Algorithm 2 assumes that exclusion buckets are always accessed. That is why all levels are considered and also the exclusion bucket E_h is accessed. The execution of Algorithm 2 requires h + 1 bucket accesses, which forms the upper bound for the more sophisticated algorithm described in the next section.

3.3. Generic search algorithms

Algorithm 2 requires accessing one bucket at each level of the D-Index, plus the exclusion bucket. In the following two situations, however, the number of accesses can

even be reduced:

– if the query region is contained in the exclusion partition of level i, then the query cannot have objects in the separable buckets of this level, and only the next level, if it exists, must be considered;
– if the query region is contained in a separable partition of level i, the following levels, as well as the exclusion bucket for i = h, need not be accessed, and thus the search terminates on this level.

Another drawback of the simple algorithm is that it works only for search radii up to ρ. However, with additional computational effort, queries with r_q > ρ can also be executed. Indeed, queries with r_q > ρ can be executed by evaluating the split function s^{r_q−ρ}. In case s^{r_q−ρ} returns a string without any −, the result is contained in a single bucket (namely B_{⟨s^{r_q−ρ}⟩}) plus, possibly, the exclusion bucket. Let us now consider that the returned string contains at least one −. We indicate this string as (b_1, ..., b_n) with b_i ∈ {0, 1, −}. In case there is only one b_i = −, we must access all buckets whose index is obtained by substituting the − in s^{r_q−ρ} with 0 and 1. In the most general case, we must substitute in (b_1, ..., b_n) all the − symbols with zeros and ones and generate all possible combinations. A simple example of this concept is illustrated in figure 4. In order to define an algorithm for this process, we need some additional terms and notation.

Definition 5. We define an extended exclusive OR bit operator, ⊗, which is based on the following truth table:

⊗   0   1
0   0   1
1   1   0
−   0   0

Figure 4. Example of use of the function G.

Notice that the ⊗ operator (the extended exclusive OR of Definition 5) can be used bit-wise on two strings of the same length and that it always returns a standard binary number (i.e., it does not contain any −). Consequently, s_1 ⊗ s_2 < 2^n is always true for strings s_1 and s_2 of length n (see Definition 3).

Definition 6. Given a string s of length n, G(s) denotes the sub-set of {0, 1, ..., 2^n − 1} obtained by eliminating all elements e_i for which s ⊗ e_i ≠ 0 (interpreting e_i as a binary string of length n).

Observe that G(−···−) = {0, 1, ..., 2^n − 1}, and that the cardinality of G(s) is two raised to the number of − elements in s, that is, G generates all partition identifications in which the symbol − is alternatively substituted by zeros and ones. In fact, we can use G(·) to generate, from a string returned by a split function, the set of partitions that need to be accessed to execute the query. As an example, let us consider n = 2, as in figure 4. If s^{2,r_q−ρ} = (−1), then we must access buckets B_1 and B_3; in such a case G(−1) = {1, 3}. If s^{2,r_q−ρ} = (−−), then we must access all buckets, as G(−−) = {0, 1, 2, 3} indicates.

We only apply the function G when the search radius r_q is greater than the parameter ρ. We first evaluate the split function using r_q − ρ (which is greater than 0). Next, the function G is applied to the resulting string. In this way, G generates the partitions that intersect the query region. In the following, we specify how such ideas are applied in the D-Index to implement generic range and nearest neighbor algorithms.

3.3.1. Range queries. Given a query region Q = R(q, r_q) with q ∈ D and r_q ≤ d^+, an advanced algorithm can execute the similarity range query as follows.

Algorithm 3. Range search
1. for i = 1 to h
2.   if ⟨s_i^{m_i,ρ+r_q}(q)⟩ < 2^{m_i} then (exclusive containment in a separable bucket)
3.     return all objects x such that x ∈ Q ∩ B_{i,⟨s_i^{m_i,ρ+r_q}(q)⟩}; exit;
4.   end if
5.   if r_q ≤ ρ then (search radius up to ρ)
6.     if ⟨s_i^{m_i,ρ−r_q}(q)⟩ < 2^{m_i} then (not exclusive containment in an exclusion bucket)
7.       return all objects x such that x ∈ Q ∩ B_{i,⟨s_i^{m_i,ρ−r_q}(q)⟩};
8.     end if
9.   else (search radius greater than ρ)
10.    let {l_1, l_2, ..., l_k} = G(s_i^{m_i,r_q−ρ}(q))
11.    return all objects x such that x ∈ Q ∩ B_{i,l_1} or x ∈ Q ∩ B_{i,l_2} or ... or x ∈ Q ∩ B_{i,l_k};
12.  end if
13. end for
14. return all objects x such that x ∈ Q ∩ E_h;
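The wildcard expansion performed by G can be sketched as follows (the names are mine; split strings are taken most-significant-bit first, matching the figure-4 example):

```python
# Sketch of the function G of Definition 6: every '-' in the split string
# is replaced by both 0 and 1, yielding the identifiers of all separable
# buckets a large-radius query may intersect. Names are mine.

from itertools import product

def G(s):
    """Bucket ids matching split string s over {0, 1, '-'} (MSB first)."""
    choices = [(0, 1) if b == '-' else (b,) for b in s]
    return sorted(sum(bit << (len(s) - 1 - j) for j, bit in enumerate(bits))
                  for bits in product(*choices))

print(G(('-', 1)))       # -> [1, 3], buckets B1 and B3 as in figure 4
print(G(('-', '-')))     # -> [0, 1, 2, 3], all four buckets
```

The result set size is 2 raised to the number of − symbols, as noted after Definition 6.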

In general, Algorithm 3 considers all D-Index levels and eventually also accesses the global exclusion bucket. However, due to the symmetric property, the test on line 2 can discover the exclusive containment of the query region in a separable bucket and terminate the search earlier. Otherwise, the algorithm proceeds according to the size of the query radius. If it is not greater than ρ (line 5), there are two possibilities. If the test on line 6 is satisfied, one separable bucket is accessed. Otherwise, no separable bucket is accessed on this level, because the query region is, from this level's point of view, exclusively in the exclusion zone. Provided the search radius is greater than ρ, more separable buckets are accessed on a specific level, as defined by lines 10 and 11. Unless terminated earlier, the algorithm accesses the exclusion bucket at line 14.

3.3.2. Nearest neighbor queries. The task of the nearest neighbor search is to retrieve the k closest elements from X to q ∈ D, respecting the metric M. Specifically, it retrieves a set A ⊆ X such that |A| = k and ∀x ∈ A, y ∈ X − A: d(q, x) ≤ d(q, y). In case of ties, we are satisfied with any set of k elements complying with the condition. For convenience, we designate the distance to the k-th nearest neighbor as d_k (d_k ≤ d^+).

The general strategy of Algorithm 4 works as follows. At the beginning (line 1), we assume that the response set A is empty and d_k = d^+. The algorithm then proceeds with an optimistic strategy (lines 2 through 13), assuming that the k-th nearest neighbor is at distance at most ρ. If this fails (see the test on line 14), additional search steps are performed to find the correct result. Specifically, the first phase of the algorithm determines the buckets (maximally one separable bucket at each level) by using the range query strategy with radius r = min{d_k, ρ}. In this way, the most promising buckets are accessed and their identifications are remembered to avoid multiple bucket accesses.
If d_k ≤ ρ, the search terminates successfully and A contains the result. Otherwise, additional bucket accesses are performed (lines 15 through 22) using the strategy for radii greater than ρ of Algorithm 3, ignoring the buckets already accessed in the first (optimistic) phase.

Algorithm 4. Nearest neighbor search
1. A = ∅, d_k = d^+; (initialization)
2. for i = 1 to h (first - optimistic - phase)
3.   r = min{d_k, ρ};
4.   if ⟨s_i^{m_i,ρ+r}(q)⟩ < 2^{m_i} then (exclusive containment in a separable bucket)
5.     access bucket B_{i,⟨s_i^{m_i,ρ+r}(q)⟩}; update A and d_k;
6.     if d_k ≤ ρ then exit; (the response set is determined)
7.   else
8.     if ⟨s_i^{m_i,ρ−r}(q)⟩ < 2^{m_i} then (not exclusive containment in an exclusion bucket)
9.       access bucket B_{i,⟨s_i^{m_i,ρ−r}(q)⟩}; update A and d_k;
10.    end if
11.  end if
12. end for
13. access bucket E_h; update A and d_k;
14. if d_k > ρ then (second phase - if needed)

15.  for i = 1 to h
16.    if ⟨s_i^{m_i,ρ+d_k}(q)⟩ < 2^{m_i} then (exclusive containment in a separable bucket)
17.      access bucket B_{i,⟨s_i^{m_i,ρ+d_k}(q)⟩} if not already accessed; update A and d_k; exit;
18.    else
19.      let {b_1, b_2, ..., b_k} = G(s_i^{m_i,d_k−ρ}(q))
20.      access buckets B_{i,b_1}, B_{i,b_2}, ..., B_{i,b_k} if not already accessed; update A and d_k;
21.    end if
22.  end for
23. end if

The output of the algorithm is the result set A and the distance to the last (k-th) nearest neighbor, d_k. Whenever a bucket is accessed, the response set A and the distance to the k-th nearest neighbor must be updated. In particular, the new response set is determined as the result of the nearest neighbor query over the union of the objects from the accessed bucket and the previous response set A. The value of d_k, which is the current search radius for the given bucket, is adjusted according to the new response set. Notice that the pivots are used during the bucket accesses, as explained in the next subsection.

3.4. Design issues

In order to apply the D-Index, the ρ-split functions must be designed, which obviously requires specification of reference (pivot) objects. An important feature of the D-Index is that the same reference objects used in the ρ-split functions for the partitioning into buckets are also used as pivots for the internal management of buckets.

3.4.1. Choosing references. The problem of choosing reference objects (pivots) is important for any search technique in general metric spaces, because all such algorithms need, directly or indirectly, some anchors for partitioning and search pruning. It is well known that the way in which pivots are selected can affect the performance of the algorithms. This has been recognized and demonstrated by several researchers, e.g. [11] or [1], but specific suggestions for choosing good reference objects are rather vague.
In this article, we use the same sets of reference objects both for the separable partitioning and for the pivot-based filtering. Recently, the problem was systematically studied in [2], and several strategies for selecting pivots have been proposed and tested. The generic conclusion is that the mean µ_{M_T} of the distance distribution in the feature space M_T should be as high as possible, which is formalized by the hypothesis that the set of pivots T_1 = {p_1, p_2, ..., p_t} is better than the set T_2 = {p'_1, p'_2, ..., p'_t} if

µ_{M_{T_1}} > µ_{M_{T_2}}.    (4)

However, the problem is how to find the mean for a given pivot set T. An estimate of this quantity is proposed in [2] and computed as:

– at random, choose l pairs of elements {(o_1, o'_1), (o_2, o'_2), ..., (o_l, o'_l)} from D;
– for all pairs of elements, compute their distances in the feature space determined by the pivot set T;
– compute µ_{M_T} as the mean of these distances.

We apply the incremental selection strategy, which was found in [2] to be the most suitable for real-world metric spaces. It works as follows. At the beginning, a set T = {p_1} containing one element from a sample of m database elements is chosen, such that the pivot p_1 has the maximum µ_{M_T} value. Then, a second pivot p_2 is chosen from another sample of m elements of the database, creating a new set T = {p_1, p_2} for fixed p_1, maximizing µ_{M_T}. The third pivot p_3 is chosen from another sample of m elements of the database, creating another set T = {p_1, p_2, p_3} for fixed p_1, p_2, maximizing µ_{M_T}. The process is repeated until t pivots are determined.

In order to verify the capability of the incremental (INC) strategy, we have also tried to select pivots at random (RAN) and with a modified incremental strategy, called the middle search strategy (MID). The MID strategy combines the original INC approach with the following idea: the set of pivots T_1 = {p_1, p_2, ..., p_t} is better than the set T_2 = {p'_1, p'_2, ..., p'_t} if |µ_{T_1} − d_m| < |µ_{T_2} − d_m|, where d_m represents the average global distance in a given data set and µ_T is the average distance in the set T. In principle, this strategy supports choices of pivots with high µ_{M_T} but also tries to keep the distances between pairs of pivots close to d_m. The hypothesis is that too close or too distant pairs of pivots are not suitable for good partitioning of metric spaces.

3.4.2. Elastic bucket implementation. Since a uniform bucket occupation is difficult to guarantee, a bucket consists of a header plus a dynamic list of fixed-size blocks, each of a capacity sufficient to accommodate any object from the processed file.
The header is characterized by a set of pivots, and it contains the distances from all objects stored in the bucket to these pivots. Furthermore, objects are sorted according to their distances to the first pivot p_1 to form a non-decreasing sequence. In order to cope with dynamic data, a block overflow is solved by splitting. Given a query object q and a search radius r, a bucket evaluation proceeds in the following steps:

- ignore all blocks containing only objects whose distances to p_1 fall outside the interval [d(q, p_1) - r, d(q, p_1) + r];
- for each remaining block, if L∞(F(o_i), F(q)) > r for every object o_i in the block, ignore the block;
- access the remaining blocks and, for each o_i with L∞(F(o_i), F(q)) <= r, compute d(q, o_i); if d(q, o_i) <= r, then o_i qualifies.

In this way, the number of accessed blocks and the number of necessary distance computations are minimized.

Properties of D-Index

On a qualitative level, the most important properties of the D-Index can be summarized as follows:

- An object is typically inserted at the cost of one block access, because objects in buckets are sorted. Since blocks are of fixed length, a block can overflow, and a split of an overloaded block costs another block access. The number of distance computations between the reference objects (pivots) and an inserted object varies from m_1 to Σ_{i=1..h} m_i; for an object inserted on level j, Σ_{i=1..j} m_i distance computations are needed.
- For all response sets such that the distance to the most dissimilar object does not exceed ρ, the number of bucket accesses is at most h + 1. The smaller the search radius, the more efficient the search can be. For r = 0, i.e. the exact match, a successful search is typically solved with one block access, and an unsuccessful search usually does not require any access; very skewed distance distributions may result in slightly more accesses. For search radii greater than ρ, the number of bucket accesses is higher than h + 1, and for very large search radii, the query evaluation can eventually access all buckets.
- The number of accessed blocks in a bucket depends on the search radius and the distance distribution with respect to the first pivot.
- The number of distance computations between the query object and the pivots is determined by the highest level of an accessed bucket. Provided the level is j, the number of distance computations is Σ_{i=1..j} m_i, with the upper bound Σ_{i=1..h} m_i for any kind of query.
- The number of distance computations between the query object and the data objects stored in accessed buckets depends on the search radius and the number of pre-computed distances to pivots.

From the quantitative point of view, we investigate the performance of the D-Index in the next section.
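The bucket evaluation procedure of the elastic bucket implementation can be sketched as follows (an illustrative sketch with our own names; each block holds (object, pivot-distance vector) pairs, and the L∞ distance between pivot-distance vectors lower-bounds the real metric distance):

```python
def linf(a, b):
    """L-infinity distance between two pivot-distance vectors; by the
    triangle inequality it lower-bounds the real distance d(q, o)."""
    return max(abs(x - y) for x, y in zip(a, b))

def search_block(block, query_dists, dist, q, r):
    """Evaluate one block of a bucket for a range query (q, r).
    block is a list of (object, dists) pairs, where dists[i] = d(object, p_i)
    was precomputed at insertion time; query_dists[i] = d(q, p_i)."""
    # Discard the whole block if no object can pass the pivot filter.
    if all(linf(dists, query_dists) > r for _, dists in block):
        return []
    results = []
    for o, dists in block:
        # Compute the expensive distance only for objects that survive
        # the cheap lower-bound test.
        if linf(dists, query_dists) <= r and dist(q, o) <= r:
            results.append(o)
    return results
```

With r = 0, only objects whose pivot distances exactly match those of q ever reach the final distance computation, which is one reason exact-match searches are so cheap.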
4. Performance evaluation and comparison

We have implemented the D-Index and conducted numerous experiments to verify its properties on the following three metric data sets:

VEC: 45-dimensional vectors of image color features compared by the quadratic distance measure, which respects correlations between individual colors.

URL: sets of URL addresses accessed by users during work sessions with the Masaryk University information system. The distance measure used was based on set similarity, defined as the fraction of the cardinalities of intersection and union (the Jaccard coefficient).

STR: sentences of a Czech national corpus compared by the edit distance measure, which counts the minimum number of insertions, deletions, or substitutions needed to transform one string into the other.

We have chosen these data sets not only to demonstrate the wide range of applicability of the D-Index, but also to show its performance over data with significantly different and realistic distance distributions. For illustration, see figure 5 for the distance densities of all our data sets. Notice the practically normal distribution of VEC, the very discrete distribution of URL, and the skewed distribution of STR.
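The URL and STR measures are easy to sketch in code (illustrative implementations; for URL we assume the usual distance form 1 minus the Jaccard coefficient, and the quadratic form distance for VEC is omitted because it also needs the color correlation matrix):

```python
def jaccard_distance(a, b):
    """URL sets: 1 - |A ∩ B| / |A ∪ B| (complement of the Jaccard
    coefficient); two empty sets are treated as identical."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def edit_distance(s, t):
    """STR sentences: minimum number of insertions, deletions, or
    substitutions transforming s into t (row-by-row dynamic programming)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution
        prev = curr
    return prev[-1]
```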

Figure 5. Distance densities for VEC, URL, and STR.

In all our experiments, the query objects are not chosen from the indexed data sets, but they follow the same distance distribution. The search costs are measured in terms of distance computations and block reads. We deliberately ignore the L∞ distances used by the D-Index for the pivot-based filtering because, according to our tests, the costs to compute such distances are several orders of magnitude smaller than the costs needed to compute any of the distance functions of our experiments. All presented cost values are averages obtained by executing queries for 5 different query objects and constant search selectivity, that is, queries using the same search radius or the same number of nearest neighbors.

D-Index performance evaluation

In order to test the basic properties of the D-Index, we have considered about 11 objects for each of our data sets VEC, URL, and STR. Notice that the number of objects used does not affect the significance of the results, since the cost measures (i.e., the number of block reads and the number of distance computations) are not influenced by the system cache. Moreover, as shown in the following, performance scalability is linear with the data set size.

First, we have tested the efficiency of the incremental (INC), the random (RAN), and the middle (MID) strategies to select good reference objects. For each set of selected reference objects and each specific data set, we have built D-Index organizations of pre-selected structures. In Table 1, we summarize the characteristics of the resulting D-Index organizations, separately for the individual data sets.

Table 1. D-Index organization parameters.

Set   h   No. of buckets   No. of blocks   Block size   ρ
VEC                                        KB           12
URL                                        KB           .225
STR                                        KB           14

We measured the search efficiency for range queries by changing the search radius to retrieve maximally 2% of the data, that is, about 2 objects. The results, presented in graphs

for all three data sets, are available in figure 6 for the distance computations and in figure 7 for the block reads.

Figure 6. Search efficiency in the number of distance computations for VEC, URL, and STR.

Figure 7. Search efficiency in the number of block reads for VEC, URL, and STR.

The general conclusion is that the incremental method proved to be the most suitable for choosing reference objects for the D-Index. Except for very few search cases, the method was systematically better and can certainly be recommended. For the sentences, the middle technique was nearly as good as the incremental one, and for the vectors, the differences in performance of all three tested methods were the least evident. An interesting observation is that for vectors, whose distribution is close to uniform, even randomly selected reference points worked quite well. However, when the distance distribution was significantly different from uniform, the performance with randomly selected reference points was poor.

Notice also some differences between the execution costs quantified in block reads and those quantified in distance computations. The general trends are similar, and the incremental method is the best. However, the different cost quantities are not equally important for all the data sets. For example, the average time to compute a quadratic form distance between two vectors was much less than 9 µsec, while the average time spent computing a distance between two sentences was about 4.5 msec.
This implies that the reduction of block reads for vectors was quite important in the total search time, while for the sentences, it was not important at all.
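This point can be made concrete with a back-of-the-envelope time model (entirely our simplification; the per-block I/O time and the operation counts below are assumed, and only the two per-distance times come from the text):

```python
def total_time(dist_comps, block_reads, t_dist, t_io):
    """Crude total search time: distance computations plus block I/O."""
    return dist_comps * t_dist + block_reads * t_io

T_VEC = 9e-6    # quadratic form distance on vectors (upper bound from the text)
T_STR = 4.5e-3  # edit distance on sentences (from the text)
T_IO = 0.01     # assumed 10 ms per random block read

# Hypothetical query costs: 10,000 distance computations, 1,000 block reads.
vec_time = total_time(10_000, 1_000, T_VEC, T_IO)  # I/O term dominates
str_time = total_time(10_000, 1_000, T_STR, T_IO)  # distance term dominates
```

Under these assumptions, the distance term for vectors is negligible next to the I/O term, while for sentences it dwarfs it, which matches the observation above.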

Figure 8. Nearest neighbor search on sentences using different D-Index structures.

Figure 9. Range search on sentences using different D-Index structures.

The objective of the second group of tests was to investigate the relationship between the D-Index structure and the search efficiency. We have tested this hypothesis on the data set STR, using ρ = 1, 14, 25, 6. The results of the experiments for the nearest neighbor search are reported in figure 8, while the performance for range queries is summarized in figure 9. It is quite obvious that a very small ρ can only be suitable when range search with a small radius is used. However, this was not typical for our data, where the distance to a nearest neighbor was rather high; see in figure 8 the relatively high costs to get the nearest neighbor. This implies that higher values of ρ, such as 14, are preferable. However, there are limits to increasing this value, since the structure with ρ = 25 was only exceptionally better than the one with ρ = 14, and the performance with ρ = 6 was rather poor. The experiments demonstrate that a proper choice of the structure can significantly influence the performance. The selection of optimized parameters is still an open research issue that we plan to investigate in the near future.

In the last group of tests, we analyzed the influence of the block size on system performance. We have experimentally studied the problem on the VEC data set, both for the nearest neighbor and the range search. The results are summarized in figure 10. In order to obtain an objective comparison, we express the cost as relative block reads, i.e. the percentage of blocks that had to be read to execute a query.

Figure 10. Search efficiency versus the block size.

We can conclude from the experiments that the search with smaller blocks requires reading a smaller fraction of the entire data structure. Notice that the number of distance computations for different block sizes is practically constant and depends only on the query. However, a small block size implies a large number of blocks; this means that if the access cost to a block is significant (e.g., if blocks are allocated randomly), then large block sizes may become preferable. However, when the blocks are stored in a continuous storage space, small blocks can be preferable, because an optimized strategy for reading a set of disk pages, such as that proposed in [10], can be applied. For a static file, we can easily achieve such a storage allocation of the D-Index through a simple reorganization.

D-Index performance comparison

We have also compared the performance of the D-Index with other index structures under the same workload. In particular, we considered the M-tree¹ [5] and the sequential organization, SEQ, because, according to [4], these are the only types of index structures for metric data that use disk memories to store objects. In order to maximize the objectivity of the comparison, all three types of index structures were implemented using the GiST package [9], and the actual performance was measured in terms of the number of distance comparisons and I/O operations; the basic access unit of disk memory for each data set was the same for all index structures. The main objective of these experiments was to compare the similarity search efficiency of the D-Index with the other organizations.
We have also considered the space efficiency, the costs to build an index, and the scalability to process growing data collections.

Search efficiency for different data sets

We have built the D-Index, the M-tree, and SEQ for all the data sets VEC, URL, and STR. The M-tree was built by using the bulk-load package and the Min-Max splitting strategy; this approach was found in [5] to be the most efficient way to construct the tree. We have measured the average performance over 5 different query objects, considering numerous similarity range and nearest neighbor queries. The

results for the range search are shown in figures 11 and 12, while the performance for the nearest neighbor search is presented in figures 13 and 14.

Figure 11. Comparison of the range search efficiency in the number of distance computations for VEC, URL, and STR.

Figure 12. Comparison of the range search efficiency in the number of block reads for VEC, URL, and STR.

Figure 13. Comparison of the nearest neighbor search efficiency in the number of distance computations for VEC, URL, and STR.

For all tested queries, i.e. retrieving subsets of up to 2% of the database, the M-tree and the D-Index always needed fewer distance computations than the sequential scan. However, this was not true for the block accesses, where even the sequential scan was typically

more efficient than the M-tree. In general, the number of block reads of the M-tree was significantly higher than that of the D-Index. Only for the URL sets, when retrieving large subsets, did the number of block reads of the D-Index exceed that of SEQ. In this situation, the M-tree accessed nearly three times more blocks than the D-Index or SEQ. On the other hand, for this search case the M-tree and the D-Index required much fewer distance computations than SEQ.

Figure 14. Comparison of the nearest neighbor search efficiency in the number of block reads for VEC, URL, and STR.

Figure 12 demonstrates another interesting observation: to run the exact match query, i.e. a range search with r_q = 0, the D-Index only needs to access one block. As a comparison, see (in the same figure) the number of block reads for the M-tree: they are one half of SEQ for the vectors, equal to SEQ for the URL sets, and even three times more than SEQ for the sentences. Notice that the exact match search is important when a specific object is to be eliminated, since the location of the deleted object forms the main cost. In this respect, the D-Index is able to manage deletions more efficiently than the M-tree. We did not verify this fact by explicit experiments, because the available version of the M-tree does not support deletions.

Concerning the block reads, the D-Index appears to be significantly better than the M-tree, while the comparison at the level of distance computations is not so clear. The D-Index certainly performed much better for the sentences and also for the range search over vectors. However, the nearest neighbor search on vectors needed practically the same number of distance computations.
For the URL sets, the M-tree was on average only slightly better than the D-Index.

Search, space, and insertion scalability

To measure the scalability of the D-Index, the M-tree, and SEQ, we have used the 45-dimensional color feature histograms. In particular, we have considered collections of vectors ranging from 1, to 6, elements. For these experiments, the structure of the D-Index was defined by 37 reference objects and 74 buckets, where only the number of blocks in the buckets was increasing to deal with the growing files. The results for several typical nearest neighbor queries are reported in figure 15, and the execution costs for range queries are reported in figure 16. Except for the SEQ organization, the individual curves are labelled with a number, representing either the number of nearest neighbors or the search radius, and a letter, where D stands for the D-Index and M for the M-tree.

Figure 15. Nearest neighbor search scalability.

Figure 16. Range search scalability.

We can observe that at the level of distance computations the D-Index is usually slightly better than the M-tree, but the differences are not significant; both the D-Index and the M-tree save a considerable number of distance computations compared to SEQ. To solve a query, the M-tree needs significantly more block reads than the D-Index, and for some queries (see 2M in figure 16) this number is even higher than for SEQ. We have also investigated the performance of range queries with zero radius, i.e. the exact match. Independently of the data set size, the D-Index required one block access and 18 distance comparisons. This was in sharp contrast with the M-tree, which needed about 6, block reads and 2, distance computations to find the exact match in the set of 6, vectors.

The search scale-up of the D-Index was strictly linear. In this respect, the M-tree was even slightly better, because the execution costs for processing a two times larger file were not two times higher. This sublinear behavior should be attributed to the fact that the M-tree incrementally reorganizes its structure by splitting blocks and, in this way, improves the clustering of data. On the other hand, the D-Index was using a constant bucket structure, where only the number of blocks was changing. This observation is posing another

research challenge, specifically, to build a D-Index structure with a dynamic number of buckets.

The disk space needed to store the data was also increasing linearly in all our organizations. The D-Index needed about 2% more space than SEQ. In contrast, the M-tree occupied two times more space than SEQ. In order to insert one object, the D-Index evaluated the distance function on average 18 times; that means that distances to (practically) one half of the 37 reference objects were computed. For this operation, the D-Index performed one block read and, due to the dynamic bucket implementation using splits, slightly more than one block write. The insertion into the M-tree was more expensive and required on average 6 distance computations, 12 block reads, and 2 block writes.

5. Conclusions

Recently, metric spaces have become an important paradigm for similarity search, and many index structures supporting the execution of similarity queries have been proposed. However, most of the existing structures are limited to operating in main memory only, so they do not scale up to high volumes of data. We have concentrated on the case where indexed data are stored on disk, and we have proposed a new index structure for similarity range and nearest neighbor queries. Contrary to other index structures, such as the M-tree, the D-Index stores and deletes any object at the cost of one block access, so it is particularly suitable for dynamic data environments. Compared to the M-tree, it typically needs fewer distance computations and much fewer disk reads to execute a query. The D-Index is also economical in its space requirements. It needs slightly more space than the sequential organization, but at least two times less disk space than the M-tree. As the experiments confirm, it scales up well to large data volumes.
Our future research will concentrate on developing methodologies that support the optimal design of D-Index structures for specific applications. The dynamic bucket management is another research challenge. We will also study parallel implementations, which the multilevel hashing structure of the D-Index inherently offers. Finally, we will carefully study two very recent proposals of index structures based on the selection of reference objects [6, 13] and compare the performance of the D-Index with these approaches.

Acknowledgments

We would like to thank the anonymous referees for their comments on the earlier version of the paper.

Note

1. The software is available at

References

1. T. Bozkaya and M. Ozsoyoglu, Indexing large metric spaces for similarity search queries, ACM TODS, Vol. 24, No. 3, 1999.
2. B. Bustos, G. Navarro, and E. Chavez, Pivot selection techniques for proximity searching in metric spaces, in Proceedings of the XXI Conference of the Chilean Computer Science Society (SCCC'01), IEEE CS Press, 2001.
3. E. Chavez, J. Marroquin, and G. Navarro, Fixed queries array: A fast and economical data structure for proximity searching, Multimedia Tools and Applications, Vol. 14, No. 2, 2001.
4. E. Chavez, G. Navarro, R. Baeza-Yates, and J. Marroquin, Proximity searching in metric spaces, ACM Computing Surveys, Vol. 33, No. 3, 2001.
5. P. Ciaccia, M. Patella, and P. Zezula, M-tree: An efficient access method for similarity search in metric spaces, in Proceedings of the 23rd VLDB Conference, Athens, Greece, 1997.
6. R.F.S. Filho, A. Traina, C. Traina Jr., and C. Faloutsos, Similarity search without tears: The OMNI-family of all-purpose access methods, in Proceedings of the 17th ICDE Conference, Heidelberg, Germany, 2001.
7. V. Dohnal, C. Gennaro, P. Savino, and P. Zezula, Separable splits in metric data sets, in Proceedings of the 9th Italian Symposium on Advanced Database Systems, Venice, Italy, June 2001, LCM Selecta Group, Milano.
8. C. Gennaro, P. Savino, and P. Zezula, Similarity search in metric databases through hashing, in Proceedings of ACM Multimedia 2001 Workshops, Ottawa, Canada, Oct. 2001.
9. J.M. Hellerstein, J.F. Naughton, and A. Pfeffer, Generalized search trees for database systems, in Proceedings of the 21st VLDB Conference, 1995.
10. B. Seeger, P. Larson, and R. McFayden, Reading a set of disk pages, in Proceedings of the 19th VLDB Conference, 1993.
11. P.N. Yianilos, Data structures and algorithms for nearest neighbor search in general metric spaces, in ACM-SIAM Symposium on Discrete Algorithms (SODA), 1993.
12. P.N. Yianilos, Excluded middle vantage point forests for nearest neighbor search, Tech. rep., NEC Research Institute, 1999. Presented at the Sixth DIMACS Implementation Challenge: Nearest Neighbor Searches workshop, Jan. 15, 1999.
13. C. Yu, B.C. Ooi, K.L. Tan, and H.V. Jagadish, Indexing the distance: An efficient method to KNN processing, in Proceedings of the 27th VLDB Conference, Roma, Italy, 2001.

Vlastislav Dohnal received the B.Sc. and M.Sc. degrees in Computer Science from Masaryk University, Czech Republic, in 1999 and 2, respectively. Currently, he is a Ph.D. student at the Faculty of Informatics, Masaryk University. His interests include multimedia data indexing, information retrieval, metric spaces, and related areas.

Claudio Gennaro received the Laurea degree in Electronic Engineering from the University of Pisa in 1994 and the Ph.D. degree in Computer and Automation Engineering from Politecnico di Milano. He is now a researcher at IEI, an institute of the National Research Council (CNR) situated in Pisa. His Ph.D. studies were in the field of performance evaluation of computer systems and parallel applications. His current main research interests are performance evaluation, similarity retrieval, storage structures for multimedia information retrieval, and multimedia document modelling.

Pasquale Savino graduated in Physics at the University of Pisa, Italy, in 198. From 1983 to 1995 he worked at the Olivetti Research Labs in Pisa; since 1996 he has been a member of the research staff at CNR-IEI in Pisa, working in the area of multimedia information systems. He has participated in and coordinated several EU-funded research projects in the multimedia area, among which MULTOS (Multimedia Office Systems), OSMOSE (Open Standard for Multimedia Optical Storage Environments), HYTEA (HYperText Authoring), M-CUBE (Multiple Media Multiple Communication Workstation), MIMICS (Multiparty Interactive Multimedia Conferencing Services), and HERMES (Foundations of High Performance Multimedia Information Management Systems). Currently, he is coordinating the project ECHO (European Chronicles on Line). He has published scientific papers in many international journals and conferences in the areas of multimedia document retrieval and information retrieval. His current research interests are multimedia information retrieval, multimedia content addressability, and indexing.

Pavel Zezula is a Full Professor at the Faculty of Informatics, Masaryk University, Brno. For many years, he has been cooperating with IEI-CNR Pisa on numerous research projects in the areas of multimedia systems and digital libraries.
His research interests concern storage and index structures for non-traditional data types, similarity search, and performance evaluation. Recently, he has focused on index structures and query evaluation of large collections of XML documents.


More information

Unsupervised Learning

Unsupervised Learning Outline Unsupervised Learning Basic concepts K-means algorithm Representation of clusters Hierarchical clustering Distance functions Which clustering algorithm to use? NN Supervised learning vs. unsupervised

More information

Tree Models of Similarity and Association. Clustering and Classification Lecture 5

Tree Models of Similarity and Association. Clustering and Classification Lecture 5 Tree Models of Similarity and Association Clustering and Lecture 5 Today s Class Tree models. Hierarchical clustering methods. Fun with ultrametrics. 2 Preliminaries Today s lecture is based on the monograph

More information

Localized and Incremental Monitoring of Reverse Nearest Neighbor Queries in Wireless Sensor Networks 1

Localized and Incremental Monitoring of Reverse Nearest Neighbor Queries in Wireless Sensor Networks 1 Localized and Incremental Monitoring of Reverse Nearest Neighbor Queries in Wireless Sensor Networks 1 HAI THANH MAI AND MYOUNG HO KIM Department of Computer Science Korea Advanced Institute of Science

More information

E-Companion: On Styles in Product Design: An Analysis of US. Design Patents

E-Companion: On Styles in Product Design: An Analysis of US. Design Patents E-Companion: On Styles in Product Design: An Analysis of US Design Patents 1 PART A: FORMALIZING THE DEFINITION OF STYLES A.1 Styles as categories of designs of similar form Our task involves categorizing

More information

WELCOME! Lecture 3 Thommy Perlinger

WELCOME! Lecture 3 Thommy Perlinger Quantitative Methods II WELCOME! Lecture 3 Thommy Perlinger Program Lecture 3 Cleaning and transforming data Graphical examination of the data Missing Values Graphical examination of the data It is important

More information

A Real Time GIS Approximation Approach for Multiphase Spatial Query Processing Using Hierarchical-Partitioned-Indexing Technique

A Real Time GIS Approximation Approach for Multiphase Spatial Query Processing Using Hierarchical-Partitioned-Indexing Technique International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2017 IJSRCSEIT Volume 2 Issue 6 ISSN : 2456-3307 A Real Time GIS Approximation Approach for Multiphase

More information

The Encoding Complexity of Network Coding

The Encoding Complexity of Network Coding The Encoding Complexity of Network Coding Michael Langberg Alexander Sprintson Jehoshua Bruck California Institute of Technology Email: mikel,spalex,bruck @caltech.edu Abstract In the multicast network

More information

D-Optimal Designs. Chapter 888. Introduction. D-Optimal Design Overview

D-Optimal Designs. Chapter 888. Introduction. D-Optimal Design Overview Chapter 888 Introduction This procedure generates D-optimal designs for multi-factor experiments with both quantitative and qualitative factors. The factors can have a mixed number of levels. For example,

More information

6. Relational Algebra (Part II)

6. Relational Algebra (Part II) 6. Relational Algebra (Part II) 6.1. Introduction In the previous chapter, we introduced relational algebra as a fundamental model of relational database manipulation. In particular, we defined and discussed

More information

Chapter 12: Indexing and Hashing. Basic Concepts

Chapter 12: Indexing and Hashing. Basic Concepts Chapter 12: Indexing and Hashing! Basic Concepts! Ordered Indices! B+-Tree Index Files! B-Tree Index Files! Static Hashing! Dynamic Hashing! Comparison of Ordered Indexing and Hashing! Index Definition

More information

Character Recognition

Character Recognition Character Recognition 5.1 INTRODUCTION Recognition is one of the important steps in image processing. There are different methods such as Histogram method, Hough transformation, Neural computing approaches

More information

Local Dimensionality Reduction: A New Approach to Indexing High Dimensional Spaces

Local Dimensionality Reduction: A New Approach to Indexing High Dimensional Spaces Local Dimensionality Reduction: A New Approach to Indexing High Dimensional Spaces Kaushik Chakrabarti Department of Computer Science University of Illinois Urbana, IL 61801 kaushikc@cs.uiuc.edu Sharad

More information

General properties of staircase and convex dual feasible functions

General properties of staircase and convex dual feasible functions General properties of staircase and convex dual feasible functions JÜRGEN RIETZ, CLÁUDIO ALVES, J. M. VALÉRIO de CARVALHO Centro de Investigação Algoritmi da Universidade do Minho, Escola de Engenharia

More information

Pebble Sets in Convex Polygons

Pebble Sets in Convex Polygons 2 1 Pebble Sets in Convex Polygons Kevin Iga, Randall Maddox June 15, 2005 Abstract Lukács and András posed the problem of showing the existence of a set of n 2 points in the interior of a convex n-gon

More information

Error-Correcting Codes

Error-Correcting Codes Error-Correcting Codes Michael Mo 10770518 6 February 2016 Abstract An introduction to error-correcting codes will be given by discussing a class of error-correcting codes, called linear block codes. The

More information

Mining Data Streams. Outline [Garofalakis, Gehrke & Rastogi 2002] Introduction. Summarization Methods. Clustering Data Streams

Mining Data Streams. Outline [Garofalakis, Gehrke & Rastogi 2002] Introduction. Summarization Methods. Clustering Data Streams Mining Data Streams Outline [Garofalakis, Gehrke & Rastogi 2002] Introduction Summarization Methods Clustering Data Streams Data Stream Classification Temporal Models CMPT 843, SFU, Martin Ester, 1-06

More information

Chapter 12: Indexing and Hashing

Chapter 12: Indexing and Hashing Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B+-Tree Index Files B-Tree Index Files Static Hashing Dynamic Hashing Comparison of Ordered Indexing and Hashing Index Definition in SQL

More information

THREE LECTURES ON BASIC TOPOLOGY. 1. Basic notions.

THREE LECTURES ON BASIC TOPOLOGY. 1. Basic notions. THREE LECTURES ON BASIC TOPOLOGY PHILIP FOTH 1. Basic notions. Let X be a set. To make a topological space out of X, one must specify a collection T of subsets of X, which are said to be open subsets of

More information

Clustering Part 4 DBSCAN

Clustering Part 4 DBSCAN Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of

More information

Speeding up Queries in a Leaf Image Database

Speeding up Queries in a Leaf Image Database 1 Speeding up Queries in a Leaf Image Database Daozheng Chen May 10, 2007 Abstract We have an Electronic Field Guide which contains an image database with thousands of leaf images. We have a system which

More information

CCSSM Curriculum Analysis Project Tool 1 Interpreting Functions in Grades 9-12

CCSSM Curriculum Analysis Project Tool 1 Interpreting Functions in Grades 9-12 Tool 1: Standards for Mathematical ent: Interpreting Functions CCSSM Curriculum Analysis Project Tool 1 Interpreting Functions in Grades 9-12 Name of Reviewer School/District Date Name of Curriculum Materials:

More information

DOWNLOAD PDF BIG IDEAS MATH VERTICAL SHRINK OF A PARABOLA

DOWNLOAD PDF BIG IDEAS MATH VERTICAL SHRINK OF A PARABOLA Chapter 1 : BioMath: Transformation of Graphs Use the results in part (a) to identify the vertex of the parabola. c. Find a vertical line on your graph paper so that when you fold the paper, the left portion

More information

Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Descriptive model A descriptive model presents the main features of the data

More information

Morphological Image Processing

Morphological Image Processing Morphological Image Processing Binary image processing In binary images, we conventionally take background as black (0) and foreground objects as white (1 or 255) Morphology Figure 4.1 objects on a conveyor

More information

Simplicial Global Optimization

Simplicial Global Optimization Simplicial Global Optimization Julius Žilinskas Vilnius University, Lithuania September, 7 http://web.vu.lt/mii/j.zilinskas Global optimization Find f = min x A f (x) and x A, f (x ) = f, where A R n.

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

Clustering Billions of Images with Large Scale Nearest Neighbor Search

Clustering Billions of Images with Large Scale Nearest Neighbor Search Clustering Billions of Images with Large Scale Nearest Neighbor Search Ting Liu, Charles Rosenberg, Henry A. Rowley IEEE Workshop on Applications of Computer Vision February 2007 Presented by Dafna Bitton

More information

Chapter 2. Related Work

Chapter 2. Related Work Chapter 2 Related Work There are three areas of research highly related to our exploration in this dissertation, namely sequential pattern mining, multiple alignment, and approximate frequent pattern mining.

More information

Cluster Analysis. Ying Shen, SSE, Tongji University

Cluster Analysis. Ying Shen, SSE, Tongji University Cluster Analysis Ying Shen, SSE, Tongji University Cluster analysis Cluster analysis groups data objects based only on the attributes in the data. The main objective is that The objects within a group

More information

REDUCING GRAPH COLORING TO CLIQUE SEARCH

REDUCING GRAPH COLORING TO CLIQUE SEARCH Asia Pacific Journal of Mathematics, Vol. 3, No. 1 (2016), 64-85 ISSN 2357-2205 REDUCING GRAPH COLORING TO CLIQUE SEARCH SÁNDOR SZABÓ AND BOGDÁN ZAVÁLNIJ Institute of Mathematics and Informatics, University

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

University of Florida CISE department Gator Engineering. Clustering Part 4

University of Florida CISE department Gator Engineering. Clustering Part 4 Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of

More information

Data can be in the form of numbers, words, measurements, observations or even just descriptions of things.

Data can be in the form of numbers, words, measurements, observations or even just descriptions of things. + What is Data? Data is a collection of facts. Data can be in the form of numbers, words, measurements, observations or even just descriptions of things. In most cases, data needs to be interpreted and

More information

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION 6.1 INTRODUCTION Fuzzy logic based computational techniques are becoming increasingly important in the medical image analysis arena. The significant

More information

Chapter 11: Indexing and Hashing

Chapter 11: Indexing and Hashing Chapter 11: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files B-Tree Index Files Static Hashing Dynamic Hashing Comparison of Ordered Indexing and Hashing Index Definition in SQL

More information

6.2 DATA DISTRIBUTION AND EXPERIMENT DETAILS

6.2 DATA DISTRIBUTION AND EXPERIMENT DETAILS Chapter 6 Indexing Results 6. INTRODUCTION The generation of inverted indexes for text databases is a computationally intensive process that requires the exclusive use of processing resources for long

More information

Chapter 3. Set Theory. 3.1 What is a Set?

Chapter 3. Set Theory. 3.1 What is a Set? Chapter 3 Set Theory 3.1 What is a Set? A set is a well-defined collection of objects called elements or members of the set. Here, well-defined means accurately and unambiguously stated or described. Any

More information

Lecture Notes on Contracts

Lecture Notes on Contracts Lecture Notes on Contracts 15-122: Principles of Imperative Computation Frank Pfenning Lecture 2 August 30, 2012 1 Introduction For an overview the course goals and the mechanics and schedule of the course,

More information

Preprocessing Short Lecture Notes cse352. Professor Anita Wasilewska

Preprocessing Short Lecture Notes cse352. Professor Anita Wasilewska Preprocessing Short Lecture Notes cse352 Professor Anita Wasilewska Data Preprocessing Why preprocess the data? Data cleaning Data integration and transformation Data reduction Discretization and concept

More information

Lecture 2 - Graph Theory Fundamentals - Reachability and Exploration 1

Lecture 2 - Graph Theory Fundamentals - Reachability and Exploration 1 CME 305: Discrete Mathematics and Algorithms Instructor: Professor Aaron Sidford (sidford@stanford.edu) January 11, 2018 Lecture 2 - Graph Theory Fundamentals - Reachability and Exploration 1 In this lecture

More information

Fundamental mathematical techniques reviewed: Mathematical induction Recursion. Typically taught in courses such as Calculus and Discrete Mathematics.

Fundamental mathematical techniques reviewed: Mathematical induction Recursion. Typically taught in courses such as Calculus and Discrete Mathematics. Fundamental mathematical techniques reviewed: Mathematical induction Recursion Typically taught in courses such as Calculus and Discrete Mathematics. Techniques introduced: Divide-and-Conquer Algorithms

More information

3 Fractional Ramsey Numbers

3 Fractional Ramsey Numbers 27 3 Fractional Ramsey Numbers Since the definition of Ramsey numbers makes use of the clique number of graphs, we may define fractional Ramsey numbers simply by substituting fractional clique number into

More information

Design of Parallel Algorithms. Course Introduction

Design of Parallel Algorithms. Course Introduction + Design of Parallel Algorithms Course Introduction + CSE 4163/6163 Parallel Algorithm Analysis & Design! Course Web Site: http://www.cse.msstate.edu/~luke/courses/fl17/cse4163! Instructor: Ed Luke! Office:

More information

Training Digital Circuits with Hamming Clustering

Training Digital Circuits with Hamming Clustering IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: FUNDAMENTAL THEORY AND APPLICATIONS, VOL. 47, NO. 4, APRIL 2000 513 Training Digital Circuits with Hamming Clustering Marco Muselli, Member, IEEE, and Diego

More information

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15 Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2014/15 Lecture II: Indexing Part I of this course Indexing 3 Database File Organization and Indexing Remember: Database tables

More information

MUFIN Basics. MUFIN team Faculty of Informatics, Masaryk University Brno, Czech Republic SEMWEB 1

MUFIN Basics. MUFIN team Faculty of Informatics, Masaryk University Brno, Czech Republic SEMWEB 1 MUFIN Basics MUFIN team Faculty of Informatics, Masaryk University Brno, Czech Republic mufin@fi.muni.cz SEMWEB 1 Search problem SEARCH index structure data & queries infrastructure SEMWEB 2 The thesis

More information

An algorithm for Performance Analysis of Single-Source Acyclic graphs

An algorithm for Performance Analysis of Single-Source Acyclic graphs An algorithm for Performance Analysis of Single-Source Acyclic graphs Gabriele Mencagli September 26, 2011 In this document we face with the problem of exploiting the performance analysis of acyclic graphs

More information

Notes for Lecture 24

Notes for Lecture 24 U.C. Berkeley CS170: Intro to CS Theory Handout N24 Professor Luca Trevisan December 4, 2001 Notes for Lecture 24 1 Some NP-complete Numerical Problems 1.1 Subset Sum The Subset Sum problem is defined

More information

Value Added Association Rules

Value Added Association Rules Value Added Association Rules T.Y. Lin San Jose State University drlin@sjsu.edu Glossary Association Rule Mining A Association Rule Mining is an exploratory learning task to discover some hidden, dependency

More information

Math 734 Aug 22, Differential Geometry Fall 2002, USC

Math 734 Aug 22, Differential Geometry Fall 2002, USC Math 734 Aug 22, 2002 1 Differential Geometry Fall 2002, USC Lecture Notes 1 1 Topological Manifolds The basic objects of study in this class are manifolds. Roughly speaking, these are objects which locally

More information

Integrated Mathematics I Performance Level Descriptors

Integrated Mathematics I Performance Level Descriptors Limited A student performing at the Limited Level demonstrates a minimal command of Ohio s Learning Standards for Integrated Mathematics I. A student at this level has an emerging ability to demonstrate

More information

High-dimensional knn Joins with Incremental Updates

High-dimensional knn Joins with Incremental Updates GeoInformatica manuscript No. (will be inserted by the editor) High-dimensional knn Joins with Incremental Updates Cui Yu Rui Zhang Yaochun Huang Hui Xiong Received: date / Accepted: date Abstract The

More information

Finding the k-closest pairs in metric spaces

Finding the k-closest pairs in metric spaces Finding the k-closest pairs in metric spaces Hisashi Kurasawa The University of Tokyo 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo, JAPAN kurasawa@nii.ac.jp Atsuhiro Takasu National Institute of Informatics 2-1-2

More information

A Scalable Index Mechanism for High-Dimensional Data in Cluster File Systems

A Scalable Index Mechanism for High-Dimensional Data in Cluster File Systems A Scalable Index Mechanism for High-Dimensional Data in Cluster File Systems Kyu-Woong Lee Hun-Soon Lee, Mi-Young Lee, Myung-Joon Kim Abstract We address the problem of designing index structures that

More information

Discrete mathematics , Fall Instructor: prof. János Pach

Discrete mathematics , Fall Instructor: prof. János Pach Discrete mathematics 2016-2017, Fall Instructor: prof. János Pach - covered material - Lecture 1. Counting problems To read: [Lov]: 1.2. Sets, 1.3. Number of subsets, 1.5. Sequences, 1.6. Permutations,

More information

SHAPE, SPACE & MEASURE

SHAPE, SPACE & MEASURE STAGE 1 Know the place value headings up to millions Recall primes to 19 Know the first 12 square numbers Know the Roman numerals I, V, X, L, C, D, M Know the % symbol Know percentage and decimal equivalents

More information

Online Facility Location

Online Facility Location Online Facility Location Adam Meyerson Abstract We consider the online variant of facility location, in which demand points arrive one at a time and we must maintain a set of facilities to service these

More information

Introduction to Indexing R-trees. Hong Kong University of Science and Technology

Introduction to Indexing R-trees. Hong Kong University of Science and Technology Introduction to Indexing R-trees Dimitris Papadias Hong Kong University of Science and Technology 1 Introduction to Indexing 1. Assume that you work in a government office, and you maintain the records

More information

Introduction to Sets and Logic (MATH 1190)

Introduction to Sets and Logic (MATH 1190) Introduction to Sets and Logic () Instructor: Email: shenlili@yorku.ca Department of Mathematics and Statistics York University Dec 4, 2014 Outline 1 2 3 4 Definition A relation R from a set A to a set

More information

Integrated Math I High School Math Solution West Virginia Correlation

Integrated Math I High School Math Solution West Virginia Correlation M.1.HS.1 M.1.HS.2 M.1.HS.3 Use units as a way to understand problems and to guide the solution of multi-step problems; choose and interpret units consistently in formulas; choose and interpret the scale

More information

MergeSort, Recurrences, Asymptotic Analysis Scribe: Michael P. Kim Date: April 1, 2015

MergeSort, Recurrences, Asymptotic Analysis Scribe: Michael P. Kim Date: April 1, 2015 CS161, Lecture 2 MergeSort, Recurrences, Asymptotic Analysis Scribe: Michael P. Kim Date: April 1, 2015 1 Introduction Today, we will introduce a fundamental algorithm design paradigm, Divide-And-Conquer,

More information

2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006

2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006 2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006 The Encoding Complexity of Network Coding Michael Langberg, Member, IEEE, Alexander Sprintson, Member, IEEE, and Jehoshua Bruck,

More information

Topology and Topological Spaces

Topology and Topological Spaces Topology and Topological Spaces Mathematical spaces such as vector spaces, normed vector spaces (Banach spaces), and metric spaces are generalizations of ideas that are familiar in R or in R n. For example,

More information

CHAPTER 5 PROPAGATION DELAY

CHAPTER 5 PROPAGATION DELAY 98 CHAPTER 5 PROPAGATION DELAY Underwater wireless sensor networks deployed of sensor nodes with sensing, forwarding and processing abilities that operate in underwater. In this environment brought challenges,

More information

Algorithm Efficiency & Sorting. Algorithm efficiency Big-O notation Searching algorithms Sorting algorithms

Algorithm Efficiency & Sorting. Algorithm efficiency Big-O notation Searching algorithms Sorting algorithms Algorithm Efficiency & Sorting Algorithm efficiency Big-O notation Searching algorithms Sorting algorithms Overview Writing programs to solve problem consists of a large number of decisions how to represent

More information

COSC160: Data Structures Hashing Structures. Jeremy Bolton, PhD Assistant Teaching Professor

COSC160: Data Structures Hashing Structures. Jeremy Bolton, PhD Assistant Teaching Professor COSC160: Data Structures Hashing Structures Jeremy Bolton, PhD Assistant Teaching Professor Outline I. Hashing Structures I. Motivation and Review II. Hash Functions III. HashTables I. Implementations

More information

9.1. K-means Clustering

9.1. K-means Clustering 424 9. MIXTURE MODELS AND EM Section 9.2 Section 9.3 Section 9.4 view of mixture distributions in which the discrete latent variables can be interpreted as defining assignments of data points to specific

More information

Branch and Bound. Algorithms for Nearest Neighbor Search: Lecture 1. Yury Lifshits

Branch and Bound. Algorithms for Nearest Neighbor Search: Lecture 1. Yury Lifshits Branch and Bound Algorithms for Nearest Neighbor Search: Lecture 1 Yury Lifshits http://yury.name Steklov Institute of Mathematics at St.Petersburg California Institute of Technology 1 / 36 Outline 1 Welcome

More information

Solutions to Problem Set 1

Solutions to Problem Set 1 CSCI-GA.3520-001 Honors Analysis of Algorithms Solutions to Problem Set 1 Problem 1 An O(n) algorithm that finds the kth integer in an array a = (a 1,..., a n ) of n distinct integers. Basic Idea Using

More information

QUANTIZER DESIGN FOR EXPLOITING COMMON INFORMATION IN LAYERED CODING. Mehdi Salehifar, Tejaswi Nanjundaswamy, and Kenneth Rose

QUANTIZER DESIGN FOR EXPLOITING COMMON INFORMATION IN LAYERED CODING. Mehdi Salehifar, Tejaswi Nanjundaswamy, and Kenneth Rose QUANTIZER DESIGN FOR EXPLOITING COMMON INFORMATION IN LAYERED CODING Mehdi Salehifar, Tejaswi Nanjundaswamy, and Kenneth Rose Department of Electrical and Computer Engineering University of California,

More information

CHAPTER 4 VORONOI DIAGRAM BASED CLUSTERING ALGORITHMS

CHAPTER 4 VORONOI DIAGRAM BASED CLUSTERING ALGORITHMS CHAPTER 4 VORONOI DIAGRAM BASED CLUSTERING ALGORITHMS 4.1 Introduction Although MST-based clustering methods are effective for complex data, they require quadratic computational time which is high for

More information

A step towards the Bermond-Thomassen conjecture about disjoint cycles in digraphs

A step towards the Bermond-Thomassen conjecture about disjoint cycles in digraphs A step towards the Bermond-Thomassen conjecture about disjoint cycles in digraphs Nicolas Lichiardopol Attila Pór Jean-Sébastien Sereni Abstract In 1981, Bermond and Thomassen conjectured that every digraph

More information

Subset sum problem and dynamic programming

Subset sum problem and dynamic programming Lecture Notes: Dynamic programming We will discuss the subset sum problem (introduced last time), and introduce the main idea of dynamic programming. We illustrate it further using a variant of the so-called

More information

CHAPTER 4 BLOOM FILTER

CHAPTER 4 BLOOM FILTER 54 CHAPTER 4 BLOOM FILTER 4.1 INTRODUCTION Bloom filter was formulated by Bloom (1970) and is used widely today for different purposes including web caching, intrusion detection, content based routing,

More information