D-Index: Distance Searching Index for Metric Data Sets


Multimedia Tools and Applications, 21, 9–33, 2003. © 2003 Kluwer Academic Publishers. Manufactured in The Netherlands.

D-Index: Distance Searching Index for Metric Data Sets

VLASTISLAV DOHNAL, Masaryk University Brno, Czech Republic, xdohnal@fi.muni.cz
CLAUDIO GENNARO, ISI-CNR, Via Moruzzi 1, 56124 Pisa, Italy, c.gennaro@isti.cnr.it
PASQUALE SAVINO, ISI-CNR, Via Moruzzi 1, 56124 Pisa, Italy, p.savino@isti.cnr.it
PAVEL ZEZULA, Masaryk University Brno, Czech Republic, zezula@fi.muni.cz

Abstract. In order to speed up retrieval in large collections of data, index structures partition the data into subsets so that query requests can be evaluated without examining the entire collection. As the complexity of modern data types grows, metric spaces have become a popular paradigm for similarity retrieval. We propose a new index structure, called D-Index, that combines a novel clustering technique and the pivot-based distance searching strategy to speed up execution of similarity range and nearest neighbor queries for large files with objects stored in disk memories. We have qualitatively analyzed D-Index and verified its properties on an actual implementation. We have also compared D-Index with other index structures and demonstrated its superiority on several real-life data sets. Contrary to tree organizations, the D-Index structure is suitable for dynamic environments with a high rate of delete/insert operations.

Keywords: metric spaces, similarity search, index structures, performance evaluation

1. Introduction

The concept of similarity searching based on relative distances between a query and database objects has become essential for a number of application areas, e.g. data mining, signal processing, geographic databases, information retrieval, or computational biology. These applications usually exploit data types such as sets, strings, vectors, or complex structures that can be exemplified by XML documents. Intuitively, the problem is to find objects similar to a query object according to a domain specific distance measure.
The problem can be formalized by the mathematical notion of the metric space, so the data elements are assumed to be objects from a metric space where only pairwise distances between the objects can be determined. The growing need to deal with large, possibly distributed, archives requires indexing support to speed up retrieval. An interesting survey of indexes built on the metric distance paradigm can be found in [4]. The common assumption is that the costs to build and to maintain an index structure are much less important than the costs needed to execute a query. Though this is certainly true for prevalently static collections of data,

such as those occurring in traditional information retrieval, some other applications may need to deal with dynamic objects that are permanently subject to change as a function of time. In such environments, updates are inevitable and their costs become a matter of concern.

One important characteristic of distance-based index structures is that, depending on the specific data and distance function, the performance can be either I/O or CPU bound. For example, let us consider two different applications: in the first one, the similarity of text strings of one hundred characters is computed by using the edit distance, which has a quadratic computational complexity; in the second application, one-hundred-dimensional vectors are compared by the inner product, which has a linear computational complexity. Since the text objects are four times shorter than the vectors and the distance computation on the vectors is one hundred times faster than the edit distance, the minimization of I/O accesses is much more important for the vectors than for the text strings. On the contrary, the CPU costs are more significant for computing the distance of text strings than for the distance of vectors. In general, a distance-based index should be able to minimize both of the cost components.

An index structure supporting similarity retrieval should be able to execute similarity range queries with any search radius. However, a significant group of applications needs retrieval of objects in a very near vicinity of the query object, e.g., copy (replica) detection, cluster analysis, DNA mutations, corrupted signals after their transmission over noisy channels, or typing (spelling) errors. Thus, in some cases, special attention to small radii is warranted. In this article, we propose a similarity search structure, called D-Index, that is able to reduce both the I/O and CPU costs.
The organization stores data objects in buckets with direct access to avoid hierarchical bucket dependencies. Such an organization also results in very efficient insertion and deletion of objects. Though similarity constraints of queries can be defined arbitrarily, the structure is extremely efficient for queries searching for very close objects. A short preliminary version of this article is available in [8]. The major extensions concern generic search algorithms, an extensive experimental evaluation, and a comparison with other index structures.

The rest of the article is organized as follows. In Section 2, we define the general principles of the D-Index. The system structure, algorithms, and design issues are specified in Section 3, while experimental evaluation results can be found in Section 4. The article concludes in Section 5.

2. Search strategies

A metric space M = (D, d) is defined by a domain of objects (elements, points) D and a total (distance) function d, a non-negative and symmetric function which satisfies the triangle inequality d(x, y) ≤ d(x, z) + d(z, y) for all x, y, z ∈ D. Without any loss of generality, we assume that the distance never exceeds a maximum distance d^+. In general, we study the following problem: given a set X ⊆ D in the metric space M, pre-process or structure the elements of X so that similarity queries can be answered efficiently.
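As a concrete baseline for the problem just stated, the following sketch (illustrative code, not from the paper; the word list and function names are mine) answers a similarity query by a linear scan over X, i.e. |X| distance computations; the index structures discussed below aim to do better. The edit distance is one of the metrics mentioned in the introduction.

```python
# Illustrative baseline (not from the paper): a similarity range query
# answered by a linear scan over X, i.e. |X| distance computations.
# The edit distance is one of the metrics discussed in the introduction.

def edit_distance(a: str, b: str) -> int:
    """Dynamic-programming edit distance; quadratic in the string length."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def range_query(X, q, r, d):
    """All x in X with d(q, x) <= r."""
    return [x for x in X if d(q, x) <= r]

words = ["ball", "bell", "tall", "metric", "matrix"]
print(range_query(words, "tell", 1, edit_distance))  # -> ['bell', 'tall']
```

Every query pays the full cost of |X| distance computations, which is exactly what the clustering and pivoting techniques of the following sections avoid.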

For a query object q ∈ D, two fundamental similarity queries can be defined. A range query retrieves all elements within distance r of q, that is, the set {x ∈ X | d(q, x) ≤ r}. A nearest neighbor query retrieves the k closest elements to q, that is, a set A ⊆ X such that |A| = k and ∀x ∈ A, y ∈ X − A: d(q, x) ≤ d(q, y).

According to the survey [4], existing approaches to searching general metric spaces can be classified into two basic categories:

Clustering algorithms. The idea is to divide, often recursively, the search space into partitions (groups, sets, clusters) characterized by some representative information.

Pivot-based algorithms. The second trend is based on choosing t distinguished elements from D, storing for each database element x the distances to these t pivots, and using such information to speed up retrieval.

Though the number of existing search algorithms is impressive, see [4], the search costs are still high, and there is no algorithm that is able to outperform the others in all situations. Furthermore, most of the algorithms are implemented as main memory structures, and thus do not scale up well to large data files. The access structure we propose in this paper, called D-Index, synergistically combines several principles into a single system in order to minimize the amount of accessed data as well as the number of distance computations. In particular, we define a new technique to recursively cluster data in separable partitions of data blocks and we combine it with pivot-based strategies to decrease the I/O costs. In the following, we first formally specify the underlying principles of our clustering and pivoting techniques and then present examples to illustrate the ideas behind the definitions.

2.1. Clustering through separable partitioning

To achieve our objectives, we base our partitioning principles on a mapping function, which we call the ρ-split function, where ρ is a real number constrained as ρ < d^+.
In order to gradually explain the concept of ρ-split functions, we first define a first order ρ-split function and its properties.

Definition 1. Given a metric space (D, d), a first order ρ-split function s^{1,ρ} is the mapping s^{1,ρ}: D → {0, 1, −}, such that for arbitrary different objects x, y ∈ D:

s^{1,ρ}(x) = 0 ∧ s^{1,ρ}(y) = 1 ⇒ d(x, y) > 2ρ (separable property),
ρ_2 ≥ ρ_1 ∧ s^{1,ρ_2}(x) ≠ − ∧ s^{1,ρ_1}(y) = − ⇒ d(x, y) > ρ_2 − ρ_1 (symmetric property).

In other words, the ρ-split function assigns to each object of the space D one of the symbols 0, 1, or −. Moreover, the function must satisfy the separable and symmetric properties. The meaning and the importance of these properties will be clarified later.

We can generalize the ρ-split function by concatenating n first order ρ-split functions with the purpose of obtaining a split function of order n.

Definition 2. Given n first order ρ-split functions s_1^{1,ρ}, ..., s_n^{1,ρ} in the metric space (D, d), a ρ-split function of order n, s^{n,ρ} = (s_1^{1,ρ}, s_2^{1,ρ}, ..., s_n^{1,ρ}): D → {0, 1, −}^n, is the mapping such that for arbitrary different objects x, y ∈ D:

∀i s_i^{1,ρ}(x) ≠ − ∧ ∀j s_j^{1,ρ}(y) ≠ − ∧ s^{n,ρ}(x) ≠ s^{n,ρ}(y) ⇒ d(x, y) > 2ρ (separable property),
ρ_2 ≥ ρ_1 ∧ ∀i s_i^{1,ρ_2}(x) ≠ − ∧ ∃j s_j^{1,ρ_1}(y) = − ⇒ d(x, y) > ρ_2 − ρ_1 (symmetric property).

An obvious consequence of the ρ-split function definitions, useful for our purposes, is that by combining n ρ-split functions of order 1, s_1^{1,ρ}, ..., s_n^{1,ρ}, which have the separable and symmetric properties, we obtain a ρ-split function of order n, s^{n,ρ}, which also satisfies the separable and symmetric properties. We often refer to the number of symbols generated by s^{n,ρ}, that is, the parameter n, as the order of the ρ-split function. In order to obtain an addressing scheme, we need another function that transforms the ρ-split strings into integers, which we define as follows.

Definition 3. Given a string b = (b_1, ..., b_n) of n elements 0, 1, or −, the function ⟨·⟩: {0, 1, −}^n → [0..2^n] is specified as:

⟨b⟩ = [b_1, b_2, ..., b_n]_2 = Σ_{j=1}^{n} 2^{n−j} b_j, if ∀j b_j ≠ −;
⟨b⟩ = 2^n, otherwise.

When all the elements are different from −, the function ⟨b⟩ simply translates the string b into an integer by interpreting it as a binary number (which is always < 2^n); otherwise, the function returns 2^n.

By means of the ρ-split function and the ⟨·⟩ operator we can assign an integer number i (0 ≤ i ≤ 2^n) to each object x ∈ D, i.e., the function can group objects from X ⊆ D in 2^n + 1 disjoint sub-sets. Assuming a ρ-split function s^{n,ρ}, we use the capital letter S^{n,ρ} to designate partitions (sub-sets) of objects from X, which the function can produce. More precisely, given a ρ-split function s^{n,ρ} and a set of objects X, we define S^{n,ρ}_{[i]}(X) = {x ∈ X | ⟨s^{n,ρ}(x)⟩ = i}. We call the first 2^n sets the separable sets. The last set, that is, the set of objects for which the function ⟨s^{n,ρ}(x)⟩ evaluates to 2^n, is called the exclusion set.
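A minimal sketch of the ⟨·⟩ operator of Definition 3 follows (the function name is mine; the most-significant-bit-first weighting is an assumption consistent with the binary interpretation [b_1, ..., b_n]_2 and with the bucket indices used in the later figure-4 example):

```python
# Sketch of the <.> operator of Definition 3: a split string over
# {0, 1, '-'} is mapped to an integer in [0 .. 2**n]; any '-' sends the
# object to the exclusion set with index 2**n. The function name is mine.

def bucket_index(b):
    """Map a split string b (sequence of 0, 1, or '-') to [0 .. 2**n]."""
    n = len(b)
    if '-' in b:
        return 2 ** n                # exclusion set
    # interpret (b_1, ..., b_n) as a binary number, b_1 most significant
    return sum(bit * 2 ** (n - j) for j, bit in enumerate(b, 1))

print(bucket_index((0, 1)))      # -> 1
print(bucket_index((1, 1)))      # -> 3
print(bucket_index(('-', 1)))    # -> 4, the exclusion set for n = 2
```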
For illustration, given two split functions of order 1, s_1^{1,ρ} and s_2^{1,ρ}, a ρ-split function of order 2 produces strings as a concatenation of the strings returned by s_1^{1,ρ} and s_2^{1,ρ}. The combination of the ρ-split functions can also be seen from the perspective of the object data set. Considering again the illustrative example above, the resulting function of order 2 generates an exclusion set as the union of the exclusion sets of the two first order split functions, i.e. S^{1,ρ}_{1[2]}(D) ∪ S^{1,ρ}_{2[2]}(D). Moreover, the four separable sets, which are given by the combination of the separable sets of the original split functions, are determined as follows: {S^{1,ρ}_{1[0]}(D) ∩ S^{1,ρ}_{2[0]}(D), S^{1,ρ}_{1[0]}(D) ∩ S^{1,ρ}_{2[1]}(D), S^{1,ρ}_{1[1]}(D) ∩ S^{1,ρ}_{2[0]}(D), S^{1,ρ}_{1[1]}(D) ∩ S^{1,ρ}_{2[1]}(D)}.

The separable property of the ρ-split function allows us to partition a set of objects X into sub-sets S^{n,ρ}_{[i]}(X), so that the distance from an object in one sub-set to an object in a different sub-set (for all i < 2^n) is more than 2ρ. We say that such a disjoint separation of sub-sets, or partitioning, is separable up to 2ρ. This property will be used during retrieval, since a range query with radius r ≤ ρ requires access to only one of the separable sets and, possibly, the exclusion set. For convenience, we denote a set of separable

partitions {S^{n,ρ}_{[0]}(D), ..., S^{n,ρ}_{[2^n−1]}(D)} as {S^{n,ρ}_{[·]}(D)}. The last partition, S^{n,ρ}_{[2^n]}(D), is the exclusion set. When the ρ parameter changes, the separable partitions shrink or widen correspondingly. The symmetric property guarantees a uniform reaction in all the partitions. This property is essential in the similarity search phase, as detailed in Sections 3.2 and 3.3.

2.2. Pivot-based filtering

In general, the pivot-based algorithms can be viewed as a mapping F from the original metric space (D, d) to a t-dimensional vector space with the L∞ distance. The mapping assumes a set T = {p_1, p_2, ..., p_t} of objects from D, called pivots, and for each database object o, the mapping determines its characteristic (feature) vector as F(o) = (d(o, p_1), d(o, p_2), ..., d(o, p_t)). We designate the new metric space as M_T = (R^t, L∞). At search time, we compute for a query object q the query feature vector F(q) = (d(q, p_1), d(q, p_2), ..., d(q, p_t)) and discard, for the search radius r, an object o if

L∞(F(o), F(q)) > r.    (1)

In other words, the object o can be discarded if for some pivot p_i,

|d(q, p_i) − d(o, p_i)| > r.    (2)

Due to the triangle inequality, the mapping F is contractive, that is, all discarded objects do not belong to the result set. However, some not-discarded objects may not be relevant and must be verified through the original distance function d(·). For more details, see for example [3].

2.3. Illustration of the idea

Before we specify the structure and the insertion/search algorithms of D-Index, we illustrate the ρ-split clustering and pivoting techniques by simple examples. Though several different types of first order ρ-split functions are proposed, analyzed, and evaluated in [7], the ball partitioning split (bps), originally proposed in [12] under the name excluded middle partitioning, provided the smallest exclusion set. For this reason, we also apply this approach, which can be characterized as follows.
The ball partitioning ρ-split function bps^{1,ρ}(x, x_v) uses one object x_v ∈ D and the median distance d_m to partition the data file into three subsets BPS^{1,ρ}_{[0]}, BPS^{1,ρ}_{[1]}, and BPS^{1,ρ}_{[2]}; the result of the following function gives the index of the set to which the object x belongs (see figure 1):

bps^{1,ρ}(x) = 0, if d(x, x_v) ≤ d_m − ρ
bps^{1,ρ}(x) = 1, if d(x, x_v) > d_m + ρ
bps^{1,ρ}(x) = −, otherwise    (3)

Figure 1. The excluded middle partitioning.

Note that the median distance d_m is relative to x_v and it is defined so that the number of objects with distances smaller than d_m is the same as the number of objects with distances larger than d_m. For more details see [11]. In order to see that the bps ρ-split function behaves in the desired way, it is sufficient to prove that the separable and symmetric properties hold.

Separable property. We have to prove that ∀x_0 ∈ BPS^{1,ρ}_{[0]}, x_1 ∈ BPS^{1,ρ}_{[1]}: d(x_0, x_1) > 2ρ. If we consider Definition 1 of the split function, we can write d(x_0, x_v) ≤ d_m − ρ and d(x_1, x_v) > d_m + ρ. Since the triangle inequality among x_0, x_1, x_v holds, we get d(x_0, x_1) + d(x_0, x_v) ≥ d(x_1, x_v). If we combine these inequalities and simplify the expression, we get d(x_0, x_1) > 2ρ.

Symmetric property. We have to prove that ∀x_0 ∈ BPS^{1,ρ_2}_{[0]}, x_1 ∈ BPS^{1,ρ_2}_{[1]}, y ∈ BPS^{1,ρ_1}_{[2]}: d(x_0, y) > ρ_2 − ρ_1 and d(x_1, y) > ρ_2 − ρ_1, with ρ_2 ≥ ρ_1. We prove it only for the object x_0 because the proof for x_1 is analogous. Since y ∈ BPS^{1,ρ_1}_{[2]}, we have d(y, x_v) > d_m − ρ_1. Moreover, since x_0 ∈ BPS^{1,ρ_2}_{[0]}, we have d_m − ρ_2 ≥ d(x_0, x_v). By summing both sides of the above inequalities we obtain d(y, x_v) − d(x_0, x_v) > ρ_2 − ρ_1. Finally, from the triangle inequality we have d(x_0, y) ≥ d(y, x_v) − d(x_0, x_v), and therefore d(x_0, y) > ρ_2 − ρ_1.

As explained in Section 2.1, once we have defined a set of first order ρ-split functions, it is possible to combine them in order to obtain a function which generates more partitions. This idea is depicted in figure 2, where two bps-split functions on the two-dimensional space are used. The domain D, represented by the grey square, is divided into four regions {S^{2,ρ}_{[·]}(D)} = {S^{2,ρ}_{[0]}(D), S^{2,ρ}_{[1]}(D), S^{2,ρ}_{[2]}(D), S^{2,ρ}_{[3]}(D)}, corresponding to the separable partitions.
The partition S^{2,ρ}_{[4]}(D), the exclusion set, is represented by the brighter region and is formed by the union of the exclusion sets resulting from the two splits.
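The bps split of Eq. (3) and its order-2 combination can be sketched as follows (a toy one-dimensional metric; the pivots, d_m, and ρ values are illustrative, not from the paper):

```python
# Sketch of the ball-partitioning split (3) and the order-2 combination
# shown in figure 2. The 1-D metric and all numeric values are toys.

def bps(x, x_v, d_m, rho, d):
    """First order rho-split: 0 (inner ball), 1 (outside), '-' (excluded middle)."""
    dist = d(x, x_v)
    if dist <= d_m - rho:
        return 0
    if dist > d_m + rho:
        return 1
    return '-'

def split2(x, pivots, medians, rho, d):
    """Order-2 split string; any '-' puts x into the exclusion set."""
    return tuple(bps(x, p, m, rho, d) for p, m in zip(pivots, medians))

d = lambda a, b: abs(a - b)                      # toy metric
for x in [1, 3, 5, 9]:
    print(x, split2(x, pivots=[4, 8], medians=[3.0, 3.0], rho=0.5, d=d))
```

Objects whose string contains a − (here, those at distance within ρ of a median sphere) fall into the exclusion set, exactly as the union of the two excluded middles in figure 2.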

Figure 2. Clustering through partitioning in the two-dimensional space.

Figure 3. Example of pivots behavior.

In figure 3, the basic principle of the pivoting technique is illustrated, where q is the query object, r the query radius, and p_i a pivot. Provided the distance between any object and p_i is known, the gray area represents the region of objects x that do not belong to the query result. This can easily be decided, without actually computing the distance between q and x, by using the triangle inequalities d(x, p_i) + d(x, q) ≥ d(p_i, q) and d(p_i, q) + d(x, q) ≥ d(x, p_i), and respecting the query radius r. Naturally, by using more pivots, we can improve the probability of excluding an object without actually computing its distance from q.

3. System structure and algorithms

In this section we specify the structure of the Distance Searching Index, D-Index, and discuss the problems related to the definition of ρ-split functions and the choice of reference objects.

Then we specify the insertion algorithm and outline the principle of searching. Finally, we define the generic similarity range and nearest neighbor algorithms.

3.1. Storage architecture

The basic idea of the D-Index is to create a multilevel storage and retrieval structure that uses several ρ-split functions, one for each level, to create an array of buckets for storing objects. On the first level, we use a ρ-split function for separating objects of the whole data set. For any other level, objects mapped to the exclusion bucket of the previous level are the candidates for storage in the separable buckets of this level. Finally, the exclusion bucket of the last level forms the exclusion bucket of the whole D-Index structure. It is worth noting that the ρ-split functions of individual levels use the same ρ. Moreover, the split functions can have different orders, typically decreasing with the level, allowing the D-Index structure to have levels with different numbers of buckets. More precisely, the D-Index structure can be defined as follows.

Definition 4. Given a set of objects X, the h-level distance searching index DI^ρ(X; m_1, m_2, ..., m_h), with the buckets at each level separable up to 2ρ, is determined by h independent ρ-split functions s_i^{m_i,ρ} (i = 1, 2, ..., h) which generate:

Exclusion bucket: E_i = S^{m_1,ρ}_{1[2^{m_1}]}(X) if i = 1; E_i = S^{m_i,ρ}_{i[2^{m_i}]}(E_{i−1}) if i > 1.

Separable buckets: {B_{i,0}, B_{i,1}, ..., B_{i,2^{m_i}−1}} = {S^{m_1,ρ}_{1[·]}(X)} if i = 1; {S^{m_i,ρ}_{i[·]}(E_{i−1})} if i > 1.

From the structure point of view, the buckets can be seen as the following two-dimensional array consisting of 1 + Σ_{i=1}^{h} 2^{m_i} elements:

B_{1,0}, B_{1,1}, ..., B_{1,2^{m_1}−1}
B_{2,0}, B_{2,1}, ..., B_{2,2^{m_2}−1}
...
B_{h,0}, B_{h,1}, ..., B_{h,2^{m_h}−1}, E_h

All separable buckets are included, but only the exclusion bucket E_h is present; the exclusion buckets E_i, i < h, are recursively re-partitioned on level i + 1. Then, for each row (i.e.
the D-Index level) i, the 2^{m_i} buckets are separable up to 2ρ, thus we are sure that no two buckets at the same level i can both contain objects relevant to a similarity range query with radius r ≤ ρ. Naturally, there is a trade-off between the number of separable buckets and the amount of objects in the exclusion bucket: the more separable buckets there are, the greater the number of objects in the exclusion bucket. However, the set of objects in the exclusion bucket can recursively be re-partitioned, with a possibly different number of bits (m_i) and certainly different ρ-split functions.
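A toy rendering of this multilevel organization may help (the class and helper names are mine; each split function is assumed to return the integer ⟨s_i(x)⟩ directly):

```python
# Sketch of the storage of Definition 4: h levels of separable buckets
# plus one global exclusion bucket. split_fns[i](x) is assumed to return
# the integer <s_i(x)> in [0 .. 2**m_i]. All names are mine.

class DIndexSketch:
    def __init__(self, split_fns, orders):
        self.split_fns = split_fns        # one rho-split per level
        self.orders = orders              # m_1, ..., m_h
        self.buckets = [[[] for _ in range(2 ** m)] for m in orders]
        self.exclusion = []               # E_h

    def insert(self, x):
        # the first level whose split does not exclude x stores it
        for level, (s, m) in enumerate(zip(self.split_fns, self.orders)):
            i = s(x)
            if i < 2 ** m:
                self.buckets[level][i].append(x)
                return
        self.exclusion.append(x)          # excluded on every level

def s1(x):
    # toy order-1 ball split around pivot 4 with d_m = 3, rho = 0.5
    dist = abs(x - 4)
    if dist <= 2.5:
        return 0
    if dist > 3.5:
        return 1
    return 2                              # exclusion index 2**1

idx = DIndexSketch([s1], [1])
for x in [1, 3, 5, 9]:
    idx.insert(x)
print(idx.buckets, idx.exclusion)
```

Every inserted object lands in exactly one bucket, which is the property the insertion algorithm of the next subsection relies on.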

3.2. Insertion and search strategies

In order to complete our description of the D-Index, we present an insertion algorithm and a sketch of a simplified search algorithm. Advanced search techniques are described in the next section. Insertion of an object x ∈ X in DI^ρ(X; m_1, m_2, ..., m_h) proceeds according to the following algorithm.

Algorithm 1. Insertion
for i = 1 to h
  if ⟨s_i^{m_i,ρ}(x)⟩ < 2^{m_i} then
    x → B_{i,⟨s_i^{m_i,ρ}(x)⟩}; exit;
  end if
end for
x → E_h;

Starting from the first level, Algorithm 1 tries to accommodate x in a separable bucket. If a suitable bucket exists, the object is stored there. If this fails for all levels, the object x is placed in the exclusion bucket E_h. In any case, the insertion algorithm determines exactly one bucket to store the object.

Given a query region Q = R(q, r_q) with q ∈ D and r_q ≤ ρ, a simple algorithm can execute the query as follows.

Algorithm 2. Search
for i = 1 to h
  return all objects x such that x ∈ Q ∩ B_{i,⟨s_i^{m_i,0}(q)⟩};
end for
return all objects x such that x ∈ Q ∩ E_h;

The function ⟨s_i^{m_i,0}(q)⟩ always gives a value smaller than 2^{m_i}, because ρ = 0. Consequently, one separable bucket on each level i is determined. Objects from the query response set cannot be in any other separable bucket on level i, because r_q is not greater than ρ and the buckets are separable up to 2ρ. However, some of them can be in the exclusion zone, so Algorithm 2 assumes that exclusion buckets are always accessed. That is why all levels are considered and also the exclusion bucket E_h is accessed. The execution of Algorithm 2 requires h + 1 bucket accesses, which forms the upper bound for the more sophisticated algorithm described in the next section.

3.3. Generic search algorithms

Algorithm 2 requires accessing one bucket at each level of the D-Index, plus the exclusion bucket. In the following two situations, however, the number of accesses can

even be reduced:

– if the query region is contained in the exclusion partition of level i, then the query cannot have objects in the separable buckets of this level, and only the next level, if it exists, must be considered;
– if the query region is contained in a separable partition of level i, the following levels, as well as the exclusion bucket for i = h, need not be accessed, and thus the search terminates on this level.

Another drawback of the simple algorithm is that it works only for search radii up to ρ. However, with additional computational effort, queries with r_q > ρ can also be executed. Indeed, queries with r_q > ρ can be executed by evaluating the split function s^{r_q−ρ}. In case s^{r_q−ρ} returns a string without any −, the result is contained in a single bucket (namely B_{⟨s^{r_q−ρ}⟩}) plus, possibly, the exclusion bucket. Let us now consider that the returned string contains at least one −. We indicate this string as (b_1, ..., b_n) with b_i ∈ {0, 1, −}. In case there is only one b_i = −, we must access all buckets whose index is obtained by substituting the − in s^{r_q−ρ} with 0 and 1. In the most general case, we must substitute in (b_1, ..., b_n) all the − symbols with zeros and ones and generate all possible combinations. A simple example of this concept is illustrated in figure 4. In order to define an algorithm for this process, we need some additional terms and notation.

Definition 5. We define an extended exclusive OR bit operator, ⊗, which is based on the following truth table:

⊗   0   1
0   0   1
1   1   0
−   0   0

Figure 4. Example of use of the function G.

Notice that the ⊗ operator (the extended exclusive OR of Definition 5) can be used bit-wise on two strings of the same length and that it always returns a standard binary number (i.e., it does not contain any −). Consequently, s_1 ⊗ s_2 < 2^n is always true for strings s_1 and s_2 of length n (see Definition 3).

Definition 6. Given a string s of length n, G(s) denotes the sub-set of {0, 1, ..., 2^n − 1} obtained by eliminating all elements e_i for which s ⊗ e_i ≠ 0 (interpreting e_i as a binary string of length n).

Observe that G(−···−) = {0, 1, ..., 2^n − 1}, and that the cardinality of G(s) is two raised to the number of − elements in s, that is, G generates all partition identifications in which the symbol − is alternatively substituted by zeros and ones. In fact, we can use G(·) to generate, from a string returned by a split function, the set of partitions that need to be accessed to execute the query. As an example, let us consider n = 2, as in figure 4. If s^{2,r_q−ρ} = (−1), then we must access buckets B_1 and B_3; in such a case G(−1) = {1, 3}. If s^{2,r_q−ρ} = (−−), then we must access all buckets, as G(−−) = {0, 1, 2, 3} indicates.

We only apply the function G when the search radius r_q is greater than the parameter ρ. We first evaluate the split function using r_q − ρ (which is greater than 0). Next, the function G is applied to the resulting string. In this way, G generates the partitions that intersect the query region. In the following, we specify how such ideas are applied in the D-Index to implement generic range and nearest neighbor algorithms.

3.3.1. Range queries. Given a query region Q = R(q, r_q) with q ∈ D and r_q ≤ d^+, an advanced algorithm can execute the similarity range query as follows.

Algorithm 3. Range search
1. for i = 1 to h
2.   if ⟨s_i^{m_i,ρ+r_q}(q)⟩ < 2^{m_i} then (exclusive containment in a separable bucket)
3.     return all objects x such that x ∈ Q ∩ B_{i,⟨s_i^{m_i,ρ+r_q}(q)⟩}; exit;
4.   end if
5.   if r_q ≤ ρ then (search radius up to ρ)
6.     if ⟨s_i^{m_i,ρ−r_q}(q)⟩ < 2^{m_i} then (not exclusive containment in an exclusion bucket)
7.       return all objects x such that x ∈ Q ∩ B_{i,⟨s_i^{m_i,ρ−r_q}(q)⟩};
8.     end if
9.   else (search radius greater than ρ)
10.    let {l_1, l_2, ..., l_k} = G(s_i^{m_i,r_q−ρ}(q))
11.    return all objects x such that x ∈ Q ∩ B_{i,l_1} or x ∈ Q ∩ B_{i,l_2} or ... or x ∈ Q ∩ B_{i,l_k};
12.  end if
13. end for
14. return all objects x such that x ∈ Q ∩ E_h;
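The wildcard expansion performed by G can be sketched as follows (the names are mine; split strings are taken most-significant-bit first, matching the figure-4 example):

```python
# Sketch of the function G of Definition 6: every '-' in the split string
# is replaced by both 0 and 1, yielding the identifiers of all separable
# buckets a large-radius query may intersect. Names are mine.

from itertools import product

def G(s):
    """Bucket ids matching split string s over {0, 1, '-'} (MSB first)."""
    choices = [(0, 1) if b == '-' else (b,) for b in s]
    return sorted(sum(bit << (len(s) - 1 - j) for j, bit in enumerate(bits))
                  for bits in product(*choices))

print(G(('-', 1)))       # -> [1, 3], buckets B1 and B3 as in figure 4
print(G(('-', '-')))     # -> [0, 1, 2, 3], all four buckets
```

The result set size is 2 raised to the number of − symbols, as noted after Definition 6.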

In general, Algorithm 3 considers all D-Index levels and eventually also accesses the global exclusion bucket. However, due to the symmetric property, the test on line 2 can discover the exclusive containment of the query region in a separable bucket and terminate the search earlier. Otherwise, the algorithm proceeds according to the size of the query radius. If it is not greater than ρ (line 5), there are two possibilities. If the test on line 6 is satisfied, one separable bucket is accessed. Otherwise, no separable bucket is accessed on this level, because the query region is, from this level's point of view, exclusively in the exclusion zone. Provided the search radius is greater than ρ, more separable buckets are accessed on a specific level, as defined by lines 10 and 11. Unless terminated earlier, the algorithm accesses the exclusion bucket at line 14.

3.3.2. Nearest neighbor queries. The task of the nearest neighbor search is to retrieve the k closest elements from X to q ∈ D, respecting the metric M. Specifically, it retrieves a set A ⊆ X such that |A| = k and ∀x ∈ A, y ∈ X − A: d(q, x) ≤ d(q, y). In case of ties, we are satisfied with any set of k elements complying with the condition. For convenience, we designate the distance to the k-th nearest neighbor as d_k (d_k ≤ d^+).

The general strategy of Algorithm 4 works as follows. At the beginning (line 1), we assume that the response set A is empty and d_k = d^+. The algorithm then proceeds with an optimistic strategy (lines 2 through 13), assuming that the k-th nearest neighbor is at distance at most ρ. If this fails (see the test on line 14), additional search steps are performed to find the correct result. Specifically, the first phase of the algorithm determines the buckets (maximally one separable bucket at each level) by using the range query strategy with radius r = min{d_k, ρ}. In this way, the most promising buckets are accessed and their identifications are remembered to avoid multiple bucket accesses.
If d_k ≤ ρ, the search terminates successfully and A contains the result. Otherwise, additional bucket accesses are performed (lines 15 through 22) using the strategy for radii greater than ρ of Algorithm 3, ignoring the buckets already accessed in the first (optimistic) phase.

Algorithm 4. Nearest neighbor search
1. A = ∅, d_k = d^+; (initialization)
2. for i = 1 to h (first - optimistic - phase)
3.   r = min{d_k, ρ};
4.   if ⟨s_i^{m_i,ρ+r}(q)⟩ < 2^{m_i} then (exclusive containment in a separable bucket)
5.     access bucket B_{i,⟨s_i^{m_i,ρ+r}(q)⟩}; update A and d_k;
6.     if d_k ≤ ρ then exit; (the response set is determined)
7.   else
8.     if ⟨s_i^{m_i,ρ−r}(q)⟩ < 2^{m_i} then (not exclusive containment in an exclusion bucket)
9.       access bucket B_{i,⟨s_i^{m_i,ρ−r}(q)⟩}; update A and d_k;
10.    end if
11.  end if
12. end for
13. access bucket E_h; update A and d_k;
14. if d_k > ρ then (second phase - if needed)

15.  for i = 1 to h
16.    if ⟨s_i^{m_i,ρ+d_k}(q)⟩ < 2^{m_i} then (exclusive containment in a separable bucket)
17.      access bucket B_{i,⟨s_i^{m_i,ρ+d_k}(q)⟩} if not already accessed; update A and d_k; exit;
18.    else
19.      let {b_1, b_2, ..., b_k} = G(s_i^{m_i,d_k−ρ}(q))
20.      access buckets B_{i,b_1}, B_{i,b_2}, ..., B_{i,b_k} if not already accessed; update A and d_k;
21.    end if
22.  end for
23. end if

The output of the algorithm is the result set A and the distance to the last (k-th) nearest neighbor, d_k. Whenever a bucket is accessed, the response set A and the distance to the k-th nearest neighbor must be updated. In particular, the new response set is determined as the result of the nearest neighbor query over the union of the objects from the accessed bucket and the previous response set A. The value of d_k, which is the current search radius for the given bucket, is adjusted according to the new response set. Notice that the pivots are used during the bucket accesses, as explained in the next subsection.

3.4. Design issues

In order to apply the D-Index, the ρ-split functions must be designed, which obviously requires specification of reference (pivot) objects. An important feature of the D-Index is that the same reference objects used in the ρ-split functions for the partitioning into buckets are also used as pivots for the internal management of buckets.

3.4.1. Choosing references. The problem of choosing reference objects (pivots) is important for any search technique in general metric spaces, because all such algorithms need, directly or indirectly, some anchors for partitioning and search pruning. It is well known that the way in which pivots are selected can affect the performance of the algorithms. This has been recognized and demonstrated by several researchers, e.g. [11] or [1], but specific suggestions for choosing good reference objects are rather vague.
In this article, we use the same sets of reference objects both for the separable partitioning and for the pivot-based filtering. Recently, the problem was systematically studied in [2], and several strategies for selecting pivots have been proposed and tested. The generic conclusion is that the mean µ_{M_T} of the distance distribution in the feature space M_T should be as high as possible, which is formalized by the hypothesis that the set of pivots T_1 = {p_1, p_2, ..., p_t} is better than the set T_2 = {p'_1, p'_2, ..., p'_t} if

µ_{M_{T_1}} > µ_{M_{T_2}}.    (4)

However, the problem is how to find the mean for a given pivot set T. An estimate of this quantity is proposed in [2] and computed as:

– at random, choose l pairs of elements {(o_1, o'_1), (o_2, o'_2), ..., (o_l, o'_l)} from D;
– for all pairs of elements, compute their distances in the feature space determined by the pivot set T;
– compute µ_{M_T} as the mean of these distances.

We apply the incremental selection strategy, which was found in [2] to be the most suitable for real-world metric spaces. It works as follows. At the beginning, a set T = {p_1} containing one element from a sample of m database elements is chosen, such that the pivot p_1 has the maximum µ_{M_T} value. Then, a second pivot p_2 is chosen from another sample of m elements of the database, creating a new set T = {p_1, p_2} for fixed p_1, maximizing µ_{M_T}. The third pivot p_3 is chosen from another sample of m elements of the database, creating another set T = {p_1, p_2, p_3} for fixed p_1, p_2, maximizing µ_{M_T}. The process is repeated until t pivots are determined.

In order to verify the capability of the incremental (INC) strategy, we have also tried to select pivots at random (RAN) and with a modified incremental strategy, called the middle search strategy (MID). The MID strategy combines the original INC approach with the following idea: the set of pivots T_1 = {p_1, p_2, ..., p_t} is better than the set T_2 = {p'_1, p'_2, ..., p'_t} if |µ_{T_1} − d_m| < |µ_{T_2} − d_m|, where d_m represents the average global distance in a given data set and µ_T is the average distance in the set T. In principle, this strategy supports choices of pivots with high µ_{M_T} but also tries to keep the distances between pairs of pivots close to d_m. The hypothesis is that too close or too distant pairs of pivots are not suitable for good partitioning of metric spaces.

3.4.2. Elastic bucket implementation. Since a uniform bucket occupation is difficult to guarantee, a bucket consists of a header plus a dynamic list of fixed-size blocks, each of a capacity sufficient to accommodate any object from the processed file.
The header is characterized by a set of pivots, and it contains the distances from all objects stored in the bucket to these pivots. Furthermore, objects are sorted according to their distances to the first pivot p_1 to form a non-decreasing sequence. In order to cope with dynamic data, a block overflow is solved by splitting. Given a query object q and a search radius r, a bucket evaluation proceeds in the following steps:

- ignore all blocks containing only objects whose distances to p_1 fall outside the interval [d(q, p_1) - r, d(q, p_1) + r];
- for each remaining block, if L∞(F(o_i), F(q)) > r for every object o_i in the block, ignore the block;
- access the remaining blocks and, for each o_i with L∞(F(o_i), F(q)) <= r, compute d(q, o_i); if d(q, o_i) <= r, then o_i qualifies.

In this way, the number of accessed blocks and the number of necessary distance computations are minimized.

Properties of D-Index

On a qualitative level, the most important properties of the D-Index can be summarized as follows:

- An object is typically inserted at the cost of one block access, because objects in buckets are sorted. Since blocks are of fixed length, a block can overflow, and a split of an overloaded block costs another block access. The number of distance computations between the reference objects (pivots) and an inserted object varies from m_1 to Σ_{i=1..h} m_i; for an object inserted on level j, Σ_{i=1..j} m_i distance computations are needed.
- For all response sets such that the distance to the most dissimilar object does not exceed ρ, the number of bucket accesses is at most h + 1. The smaller the search radius, the more efficient the search can be. For r = 0, i.e. the exact match, a successful search is typically solved with one block access, and an unsuccessful search usually does not require any access; very skewed distance distributions may result in slightly more accesses. For search radii greater than ρ, the number of bucket accesses is higher than h + 1, and for very large search radii, the query evaluation can eventually access all buckets.
- The number of accessed blocks in a bucket depends on the search radius and the distance distribution with respect to the first pivot.
- The number of distance computations between the query object and the pivots is determined by the highest level of an accessed bucket. Provided the level is j, the number of distance computations is Σ_{i=1..j} m_i, with the upper bound Σ_{i=1..h} m_i for any kind of query.
- The number of distance computations between the query object and the data objects stored in accessed buckets depends on the search radius and the number of pre-computed distances to pivots.

From the quantitative point of view, we investigate the performance of the D-Index in the next section.
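The bucket evaluation procedure of the elastic bucket implementation can be sketched as follows (an illustrative sketch with our own names; each block holds (object, pivot-distance vector) pairs, and the L∞ distance between pivot-distance vectors lower-bounds the real metric distance):

```python
def linf(a, b):
    """L-infinity distance between two pivot-distance vectors; by the
    triangle inequality it lower-bounds the real distance d(q, o)."""
    return max(abs(x - y) for x, y in zip(a, b))

def search_block(block, query_dists, dist, q, r):
    """Evaluate one block of a bucket for a range query (q, r).
    block is a list of (object, dists) pairs, where dists[i] = d(object, p_i)
    was precomputed at insertion time; query_dists[i] = d(q, p_i)."""
    # Discard the whole block if no object can pass the pivot filter.
    if all(linf(dists, query_dists) > r for _, dists in block):
        return []
    results = []
    for o, dists in block:
        # Compute the expensive distance only for objects that survive
        # the cheap lower-bound test.
        if linf(dists, query_dists) <= r and dist(q, o) <= r:
            results.append(o)
    return results
```

With r = 0, only objects whose pivot distances exactly match those of q ever reach the final distance computation, which is one reason exact-match searches are so cheap.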
4. Performance evaluation and comparison

We have implemented the D-Index and conducted numerous experiments to verify its properties on the following three metric data sets:

VEC: 45-dimensional vectors of image color features compared by the quadratic distance measure, which respects correlations between individual colors.

URL: sets of URL addresses accessed by users during work sessions with the Masaryk University information system. The distance measure used was based on set similarity, defined as the fraction of the cardinalities of intersection and union (the Jaccard coefficient).

STR: sentences of a Czech national corpus compared by the edit distance measure, which counts the minimum number of insertions, deletions, or substitutions needed to transform one string into the other.

We have chosen these data sets not only to demonstrate the wide range of applicability of the D-Index, but also to show its performance over data with significantly different and realistic distance distributions. For illustration, see figure 5 for the distance densities of all our data sets. Notice the practically normal distribution of VEC, the very discrete distribution of URL, and the skewed distribution of STR.
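The URL and STR measures are easy to sketch in code (illustrative implementations; for URL we assume the usual distance form 1 minus the Jaccard coefficient, and the quadratic form distance for VEC is omitted because it also needs the color correlation matrix):

```python
def jaccard_distance(a, b):
    """URL sets: 1 - |A ∩ B| / |A ∪ B| (complement of the Jaccard
    coefficient); two empty sets are treated as identical."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def edit_distance(s, t):
    """STR sentences: minimum number of insertions, deletions, or
    substitutions transforming s into t (row-by-row dynamic programming)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution
        prev = curr
    return prev[-1]
```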

Figure 5. Distance densities for VEC, URL, and STR.

In all our experiments, the query objects are not chosen from the indexed data sets, but they follow the same distance distribution. The search costs are measured in terms of distance computations and block reads. We deliberately ignore the L∞ distances used by the D-Index for the pivot-based filtering because, according to our tests, the costs to compute such distances are several orders of magnitude smaller than the costs needed to compute any of the distance functions of our experiments. All presented cost values are averages obtained by executing queries for 5 different query objects and constant search selectivity, that is, queries using the same search radius or the same number of nearest neighbors.

D-Index performance evaluation

In order to test the basic properties of the D-Index, we have considered about 11 objects for each of our data sets VEC, URL, and STR. Notice that the number of objects used does not affect the significance of the results, since the cost measures (i.e., the number of block reads and the number of distance computations) are not influenced by the system cache. Moreover, as shown in the following, performance scalability is linear with the data set size.

First, we have tested the efficiency of the incremental (INC), the random (RAN), and the middle (MID) strategies to select good reference objects. For each set of selected reference objects and each specific data set, we have built D-Index organizations of pre-selected structures. In Table 1, we summarize the characteristics of the resulting D-Index organizations, separately for the individual data sets.

Table 1. D-Index organization parameters.

Set   h   No. of buckets   No. of blocks   Block size   ρ
VEC                                        KB           12
URL                                        KB           .225
STR                                        KB           14

We measured the search efficiency for range queries by changing the search radius to retrieve maximally 2% of the data, that is, about 2 objects. The results, presented in graphs

for all three data sets, are available in figure 6 for the distance computations and in figure 7 for the block reads.

Figure 6. Search efficiency in the number of distance computations for VEC, URL, and STR.

Figure 7. Search efficiency in the number of block reads for VEC, URL, and STR.

The general conclusion is that the incremental method proved to be the most suitable for choosing reference objects for the D-Index. Except for very few search cases, the method was systematically better and can certainly be recommended. For the sentences, the middle technique was nearly as good as the incremental one, and for the vectors, the differences in performance of all three tested methods were the least evident. An interesting observation is that for vectors, whose distribution is close to uniform, even randomly selected reference points worked quite well. However, when the distance distribution was significantly different from uniform, the performance with randomly selected reference points was poor.

Notice also some differences between the execution costs quantified in block reads and those quantified in distance computations. The general trends are similar, and the incremental method is the best. However, the different cost quantities are not equally important for all the data sets. For example, the average time to compute a quadratic form distance between two vectors was much less than 9 µsec, while the average time spent computing a distance between two sentences was about 4.5 msec.
This implies that the reduction of block reads for vectors was quite important in the total search time, while for the sentences, it was not important at all.
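This point can be made concrete with a back-of-the-envelope time model (entirely our simplification; the per-block I/O time and the operation counts below are assumed, and only the two per-distance times come from the text):

```python
def total_time(dist_comps, block_reads, t_dist, t_io):
    """Crude total search time: distance computations plus block I/O."""
    return dist_comps * t_dist + block_reads * t_io

T_VEC = 9e-6    # quadratic form distance on vectors (upper bound from the text)
T_STR = 4.5e-3  # edit distance on sentences (from the text)
T_IO = 0.01     # assumed 10 ms per random block read

# Hypothetical query costs: 10,000 distance computations, 1,000 block reads.
vec_time = total_time(10_000, 1_000, T_VEC, T_IO)  # I/O term dominates
str_time = total_time(10_000, 1_000, T_STR, T_IO)  # distance term dominates
```

Under these assumptions, the distance term for vectors is negligible next to the I/O term, while for sentences it dwarfs it, which matches the observation above.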

Figure 8. Nearest neighbor search on sentences using different D-Index structures.

Figure 9. Range search on sentences using different D-Index structures.

The objective of the second group of tests was to investigate the relationship between the D-Index structure and the search efficiency. We have tested this hypothesis on the data set STR, using ρ = 1, 14, 25, 6. The results of the experiments for the nearest neighbor search are reported in figure 8, while the performance for range queries is summarized in figure 9. It is quite obvious that a very small ρ can only be suitable when range search with a small radius is used. However, this was not typical for our data, where the distance to a nearest neighbor was rather high; see in figure 8 the relatively high costs to get the nearest neighbor. This implies that higher values of ρ, such as 14, are preferable. However, there are limits to increasing this value, since the structure with ρ = 25 was only exceptionally better than the one with ρ = 14, and the performance with ρ = 6 was rather poor. The experiments demonstrate that a proper choice of the structure can significantly influence the performance. The selection of optimized parameters is still an open research issue that we plan to investigate in the near future.

In the last group of tests, we analyzed the influence of the block size on system performance. We have experimentally studied the problem on the VEC data set, both for the nearest neighbor and the range search. The results are summarized in figure 10. In order to obtain an objective comparison, we express the cost as relative block reads, i.e. the percentage of blocks that had to be read to execute a query.

Figure 10. Search efficiency versus the block size.

We can conclude from the experiments that the search with smaller blocks requires reading a smaller fraction of the entire data structure. Notice that the number of distance computations for different block sizes is practically constant and depends only on the query. However, a small block size implies a large number of blocks; this means that if the access cost to a block is significant (e.g., if blocks are allocated randomly), then large block sizes may become preferable. However, when the blocks are stored in a continuous storage space, small blocks can be preferable, because an optimized strategy for reading a set of disk pages, such as that proposed in [10], can be applied. For a static file, we can easily achieve such a storage allocation of the D-Index through a simple reorganization.

D-Index performance comparison

We have also compared the performance of the D-Index with other index structures under the same workload. In particular, we considered the M-tree¹ [5] and the sequential organization, SEQ, because, according to [4], these are the only types of index structures for metric data that use disk memories to store objects. In order to maximize the objectivity of the comparison, all three types of index structures were implemented using the GiST package [9], and the actual performance was measured in terms of the number of distance comparisons and I/O operations; the basic access unit of disk memory for each data set was the same for all index structures. The main objective of these experiments was to compare the similarity search efficiency of the D-Index with the other organizations.
We have also considered the space efficiency, the costs to build an index, and the scalability to process growing data collections.

Search efficiency for different data sets

We have built the D-Index, the M-tree, and SEQ for all the data sets VEC, URL, and STR. The M-tree was built by using the bulk-load package and the Min-Max splitting strategy; this approach was found in [5] to be the most efficient way to construct the tree. We have measured the average performance over 5 different query objects, considering numerous similarity range and nearest neighbor queries. The

results for the range search are shown in figures 11 and 12, while the performance for the nearest neighbor search is presented in figures 13 and 14.

Figure 11. Comparison of the range search efficiency in the number of distance computations for VEC, URL, and STR.

Figure 12. Comparison of the range search efficiency in the number of block reads for VEC, URL, and STR.

Figure 13. Comparison of the nearest neighbor search efficiency in the number of distance computations for VEC, URL, and STR.

For all tested queries, i.e. retrieving subsets of up to 2% of the database, the M-tree and the D-Index always needed fewer distance computations than the sequential scan. However, this was not true for the block accesses, where even the sequential scan was typically

more efficient than the M-tree. In general, the number of block reads of the M-tree was significantly higher than that of the D-Index. Only for the URL sets, when retrieving large subsets, did the number of block reads of the D-Index exceed that of SEQ. In this situation, the M-tree accessed nearly three times more blocks than the D-Index or SEQ. On the other hand, for this search case the M-tree and the D-Index required much fewer distance computations than SEQ.

Figure 14. Comparison of the nearest neighbor search efficiency in the number of block reads for VEC, URL, and STR.

Figure 12 demonstrates another interesting observation: to run the exact match query, i.e. a range search with r_q = 0, the D-Index only needs to access one block. As a comparison, see (in the same figure) the number of block reads for the M-tree: they are one half of SEQ for the vectors, equal to SEQ for the URL sets, and even three times more than SEQ for the sentences. Notice that the exact match search is important when a specific object is to be eliminated, since the location of the deleted object forms the main cost. In this respect, the D-Index is able to manage deletions more efficiently than the M-tree. We did not verify this fact by explicit experiments, because the available version of the M-tree does not support deletions.

Concerning the block reads, the D-Index appears to be significantly better than the M-tree, while the comparison at the level of distance computations is not so clear. The D-Index certainly performed much better for the sentences and also for the range search over vectors. However, the nearest neighbor search on vectors needed practically the same number of distance computations.
For the URL sets, the M-tree was on average only slightly better than the D-Index.

Search, space, and insertion scalability

To measure the scalability of the D-Index, the M-tree, and SEQ, we have used the 45-dimensional color feature histograms. In particular, we have considered collections of vectors ranging from 1, to 6, elements. For these experiments, the structure of the D-Index was defined by 37 reference objects and 74 buckets, where only the number of blocks in the buckets was increasing to deal with the growing files. The results for several typical nearest neighbor queries are reported in figure 15, and the execution costs for range queries are reported in figure 16. Except for the SEQ organization, the individual curves are labelled with a number, representing either the number of nearest neighbors or the search radius, and a letter, where D stands for the D-Index and M for the M-tree.

Figure 15. Nearest neighbor search scalability.

Figure 16. Range search scalability.

We can observe that at the level of distance computations the D-Index is usually slightly better than the M-tree, but the differences are not significant; both the D-Index and the M-tree save a considerable number of distance computations compared to SEQ. To solve a query, the M-tree needs significantly more block reads than the D-Index, and for some queries (see 2M in figure 16) this number is even higher than for SEQ. We have also investigated the performance of range queries with zero radius, i.e. the exact match. Independently of the data set size, the D-Index required one block access and 18 distance comparisons. This was in sharp contrast with the M-tree, which needed about 6, block reads and 2, distance computations to find the exact match in the set of 6, vectors.

The search scale-up of the D-Index was strictly linear. In this respect, the M-tree was even slightly better, because the execution costs for processing a two times larger file were not two times higher. This sublinear behavior should be attributed to the fact that the M-tree incrementally reorganizes its structure by splitting blocks and, in this way, improves the clustering of data. On the other hand, the D-Index was using a constant bucket structure, where only the number of blocks was changing. This observation is posing another

research challenge, specifically, to build a D-Index structure with a dynamic number of buckets.

The disk space needed to store the data was also increasing linearly in all our organizations. The D-Index needed about 2% more space than SEQ. In contrast, the M-tree occupied two times more space than SEQ. In order to insert one object, the D-Index evaluated the distance function on average 18 times; that means that distances to (practically) one half of the 37 reference objects were computed. For this operation, the D-Index performed one block read and, due to the dynamic bucket implementation using splits, slightly more than one block write. The insertion into the M-tree was more expensive and required on average 6 distance computations, 12 block reads, and 2 block writes.

5. Conclusions

Recently, metric spaces have become an important paradigm for similarity search, and many index structures supporting the execution of similarity queries have been proposed. However, most of the existing structures are limited to operating in main memory only, so they do not scale up to high volumes of data. We have concentrated on the case where indexed data are stored on disk, and we have proposed a new index structure for similarity range and nearest neighbor queries. Contrary to other index structures, such as the M-tree, the D-Index stores and deletes any object at the cost of one block access, so it is particularly suitable for dynamic data environments. Compared to the M-tree, it typically needs fewer distance computations and much fewer disk reads to execute a query. The D-Index is also economical in its space requirements. It needs slightly more space than the sequential organization, but at least two times less disk space than the M-tree. As the experiments confirm, it scales up well to large data volumes.
Our future research will concentrate on developing methodologies that support the optimal design of D-Index structures for specific applications. The dynamic bucket management is another research challenge. We will also study parallel implementations, which the multilevel hashing structure of the D-Index inherently offers. Finally, we will carefully study two very recent proposals of index structures based on the selection of reference objects [6, 13] and compare the performance of the D-Index with these approaches.

Acknowledgments

We would like to thank the anonymous referees for their comments on the earlier version of the paper.

Note

1. The software is available at

References

1. T. Bozkaya and M. Ozsoyoglu, Indexing large metric spaces for similarity search queries, ACM TODS, Vol. 24, No. 3, 1999.
2. B. Bustos, G. Navarro, and E. Chavez, Pivot selection techniques for proximity searching in metric spaces, in Proceedings of the XXI Conference of the Chilean Computer Science Society (SCCC'01), IEEE CS Press, 2001.
3. E. Chavez, J. Marroquin, and G. Navarro, Fixed queries array: A fast and economical data structure for proximity searching, Multimedia Tools and Applications, Vol. 14, No. 2, 2001.
4. E. Chavez, G. Navarro, R. Baeza-Yates, and J. Marroquin, Proximity searching in metric spaces, ACM Computing Surveys, Vol. 33, No. 3, 2001.
5. P. Ciaccia, M. Patella, and P. Zezula, M-tree: An efficient access method for similarity search in metric spaces, in Proceedings of the 23rd VLDB Conference, Athens, Greece, 1997.
6. R.F.S. Filho, A. Traina, C. Traina Jr., and C. Faloutsos, Similarity search without tears: The OMNI-family of all-purpose access methods, in Proceedings of the 17th ICDE Conference, Heidelberg, Germany, 2001.
7. V. Dohnal, C. Gennaro, P. Savino, and P. Zezula, Separable splits in metric data sets, in Proceedings of the 9th Italian Symposium on Advanced Database Systems, Venice, Italy, June 2001, LCM Selecta Group, Milano.
8. C. Gennaro, P. Savino, and P. Zezula, Similarity search in metric databases through hashing, in Proceedings of ACM Multimedia 2001 Workshops, Ottawa, Canada, Oct. 2001.
9. J.M. Hellerstein, J.F. Naughton, and A. Pfeffer, Generalized search trees for database systems, in Proceedings of the 21st VLDB Conference, 1995.
10. B. Seeger, P. Larson, and R. McFayden, Reading a set of disk pages, in Proceedings of the 19th VLDB Conference, 1993.
11. P.N. Yianilos, Data structures and algorithms for nearest neighbor search in general metric spaces, in ACM-SIAM Symposium on Discrete Algorithms (SODA), 1993.
12. P.N. Yianilos, Excluded middle vantage point forests for nearest neighbor search, Tech. rep., NEC Research Institute, 1999. Presented at the Sixth DIMACS Implementation Challenge: Nearest Neighbor Searches workshop, Jan. 15, 1999.
13. C. Yu, B.C. Ooi, K.L. Tan, and H.V. Jagadish, Indexing the distance: An efficient method to KNN processing, in Proceedings of the 27th VLDB Conference, Roma, Italy, 2001.

Vlastislav Dohnal received the B.Sc. and M.Sc. degrees in Computer Science from Masaryk University, Czech Republic, in 1999 and 2, respectively. Currently, he is a Ph.D. student at the Faculty of Informatics, Masaryk University. His interests include multimedia data indexing, information retrieval, metric spaces, and related areas.

Claudio Gennaro received the Laurea degree in Electronic Engineering from the University of Pisa in 1994 and the Ph.D. degree in Computer and Automation Engineering from Politecnico di Milano. He is now a researcher at IEI, an institute of the National Research Council (CNR) situated in Pisa. His Ph.D. studies were in the field of performance evaluation of computer systems and parallel applications. His current main research interests are performance evaluation, similarity retrieval, storage structures for multimedia information retrieval, and multimedia document modelling.

Pasquale Savino graduated in Physics at the University of Pisa, Italy, in 198. From 1983 to 1995 he worked at the Olivetti Research Labs in Pisa; since 1996 he has been a member of the research staff at CNR-IEI in Pisa, working in the area of multimedia information systems. He has participated in and coordinated several EU-funded research projects in the multimedia area, among which MULTOS (Multimedia Office Systems), OSMOSE (Open Standard for Multimedia Optical Storage Environments), HYTEA (HYperText Authoring), M-CUBE (Multiple Media Multiple Communication Workstation), MIMICS (Multiparty Interactive Multimedia Conferencing Services), and HERMES (Foundations of High Performance Multimedia Information Management Systems). Currently, he is coordinating the project ECHO (European Chronicles on Line). He has published scientific papers in many international journals and conferences in the areas of multimedia document retrieval and information retrieval. His current research interests are multimedia information retrieval, multimedia content addressability, and indexing.

Pavel Zezula is a Full Professor at the Faculty of Informatics, Masaryk University, Brno. For many years, he has been cooperating with IEI-CNR Pisa on numerous research projects in the areas of multimedia systems and digital libraries.
His research interests concern storage and index structures for non-traditional data types, similarity search, and performance evaluation. Recently, he has focused on index structures and query evaluation of large collections of XML documents.


More information

Unsupervised Learning

Unsupervised Learning Outline Unsupervised Learning Basic concepts K-means algorithm Representation of clusters Hierarchical clustering Distance functions Which clustering algorithm to use? NN Supervised learning vs. unsupervised

More information

Tree Models of Similarity and Association. Clustering and Classification Lecture 5

Tree Models of Similarity and Association. Clustering and Classification Lecture 5 Tree Models of Similarity and Association Clustering and Lecture 5 Today s Class Tree models. Hierarchical clustering methods. Fun with ultrametrics. 2 Preliminaries Today s lecture is based on the monograph

More information

Localized and Incremental Monitoring of Reverse Nearest Neighbor Queries in Wireless Sensor Networks 1

Localized and Incremental Monitoring of Reverse Nearest Neighbor Queries in Wireless Sensor Networks 1 Localized and Incremental Monitoring of Reverse Nearest Neighbor Queries in Wireless Sensor Networks 1 HAI THANH MAI AND MYOUNG HO KIM Department of Computer Science Korea Advanced Institute of Science

More information

E-Companion: On Styles in Product Design: An Analysis of US. Design Patents

E-Companion: On Styles in Product Design: An Analysis of US. Design Patents E-Companion: On Styles in Product Design: An Analysis of US Design Patents 1 PART A: FORMALIZING THE DEFINITION OF STYLES A.1 Styles as categories of designs of similar form Our task involves categorizing

More information

WELCOME! Lecture 3 Thommy Perlinger

WELCOME! Lecture 3 Thommy Perlinger Quantitative Methods II WELCOME! Lecture 3 Thommy Perlinger Program Lecture 3 Cleaning and transforming data Graphical examination of the data Missing Values Graphical examination of the data It is important

More information

A Real Time GIS Approximation Approach for Multiphase Spatial Query Processing Using Hierarchical-Partitioned-Indexing Technique

A Real Time GIS Approximation Approach for Multiphase Spatial Query Processing Using Hierarchical-Partitioned-Indexing Technique International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2017 IJSRCSEIT Volume 2 Issue 6 ISSN : 2456-3307 A Real Time GIS Approximation Approach for Multiphase

More information

The Encoding Complexity of Network Coding

The Encoding Complexity of Network Coding The Encoding Complexity of Network Coding Michael Langberg Alexander Sprintson Jehoshua Bruck California Institute of Technology Email: mikel,spalex,bruck @caltech.edu Abstract In the multicast network

More information

D-Optimal Designs. Chapter 888. Introduction. D-Optimal Design Overview

D-Optimal Designs. Chapter 888. Introduction. D-Optimal Design Overview Chapter 888 Introduction This procedure generates D-optimal designs for multi-factor experiments with both quantitative and qualitative factors. The factors can have a mixed number of levels. For example,

More information

6. Relational Algebra (Part II)

6. Relational Algebra (Part II) 6. Relational Algebra (Part II) 6.1. Introduction In the previous chapter, we introduced relational algebra as a fundamental model of relational database manipulation. In particular, we defined and discussed

More information

Chapter 12: Indexing and Hashing. Basic Concepts

Chapter 12: Indexing and Hashing. Basic Concepts Chapter 12: Indexing and Hashing! Basic Concepts! Ordered Indices! B+-Tree Index Files! B-Tree Index Files! Static Hashing! Dynamic Hashing! Comparison of Ordered Indexing and Hashing! Index Definition

More information

Character Recognition

Character Recognition Character Recognition 5.1 INTRODUCTION Recognition is one of the important steps in image processing. There are different methods such as Histogram method, Hough transformation, Neural computing approaches

More information

Local Dimensionality Reduction: A New Approach to Indexing High Dimensional Spaces

Local Dimensionality Reduction: A New Approach to Indexing High Dimensional Spaces Local Dimensionality Reduction: A New Approach to Indexing High Dimensional Spaces Kaushik Chakrabarti Department of Computer Science University of Illinois Urbana, IL 61801 kaushikc@cs.uiuc.edu Sharad

More information

General properties of staircase and convex dual feasible functions

General properties of staircase and convex dual feasible functions General properties of staircase and convex dual feasible functions JÜRGEN RIETZ, CLÁUDIO ALVES, J. M. VALÉRIO de CARVALHO Centro de Investigação Algoritmi da Universidade do Minho, Escola de Engenharia

More information

Pebble Sets in Convex Polygons

Pebble Sets in Convex Polygons 2 1 Pebble Sets in Convex Polygons Kevin Iga, Randall Maddox June 15, 2005 Abstract Lukács and András posed the problem of showing the existence of a set of n 2 points in the interior of a convex n-gon

More information

Error-Correcting Codes

Error-Correcting Codes Error-Correcting Codes Michael Mo 10770518 6 February 2016 Abstract An introduction to error-correcting codes will be given by discussing a class of error-correcting codes, called linear block codes. The

More information

Mining Data Streams. Outline [Garofalakis, Gehrke & Rastogi 2002] Introduction. Summarization Methods. Clustering Data Streams

Mining Data Streams. Outline [Garofalakis, Gehrke & Rastogi 2002] Introduction. Summarization Methods. Clustering Data Streams Mining Data Streams Outline [Garofalakis, Gehrke & Rastogi 2002] Introduction Summarization Methods Clustering Data Streams Data Stream Classification Temporal Models CMPT 843, SFU, Martin Ester, 1-06

More information

Chapter 12: Indexing and Hashing

Chapter 12: Indexing and Hashing Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B+-Tree Index Files B-Tree Index Files Static Hashing Dynamic Hashing Comparison of Ordered Indexing and Hashing Index Definition in SQL

More information

THREE LECTURES ON BASIC TOPOLOGY. 1. Basic notions.

THREE LECTURES ON BASIC TOPOLOGY. 1. Basic notions. THREE LECTURES ON BASIC TOPOLOGY PHILIP FOTH 1. Basic notions. Let X be a set. To make a topological space out of X, one must specify a collection T of subsets of X, which are said to be open subsets of

More information

Clustering Part 4 DBSCAN

Clustering Part 4 DBSCAN Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of

More information

Speeding up Queries in a Leaf Image Database

Speeding up Queries in a Leaf Image Database 1 Speeding up Queries in a Leaf Image Database Daozheng Chen May 10, 2007 Abstract We have an Electronic Field Guide which contains an image database with thousands of leaf images. We have a system which

More information

CCSSM Curriculum Analysis Project Tool 1 Interpreting Functions in Grades 9-12

CCSSM Curriculum Analysis Project Tool 1 Interpreting Functions in Grades 9-12 Tool 1: Standards for Mathematical ent: Interpreting Functions CCSSM Curriculum Analysis Project Tool 1 Interpreting Functions in Grades 9-12 Name of Reviewer School/District Date Name of Curriculum Materials:

More information

DOWNLOAD PDF BIG IDEAS MATH VERTICAL SHRINK OF A PARABOLA

DOWNLOAD PDF BIG IDEAS MATH VERTICAL SHRINK OF A PARABOLA Chapter 1 : BioMath: Transformation of Graphs Use the results in part (a) to identify the vertex of the parabola. c. Find a vertical line on your graph paper so that when you fold the paper, the left portion

More information

Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Descriptive model A descriptive model presents the main features of the data

More information

Morphological Image Processing

Morphological Image Processing Morphological Image Processing Binary image processing In binary images, we conventionally take background as black (0) and foreground objects as white (1 or 255) Morphology Figure 4.1 objects on a conveyor

More information

Simplicial Global Optimization

Simplicial Global Optimization Simplicial Global Optimization Julius Žilinskas Vilnius University, Lithuania September, 7 http://web.vu.lt/mii/j.zilinskas Global optimization Find f = min x A f (x) and x A, f (x ) = f, where A R n.

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

Clustering Billions of Images with Large Scale Nearest Neighbor Search

Clustering Billions of Images with Large Scale Nearest Neighbor Search Clustering Billions of Images with Large Scale Nearest Neighbor Search Ting Liu, Charles Rosenberg, Henry A. Rowley IEEE Workshop on Applications of Computer Vision February 2007 Presented by Dafna Bitton

More information

Chapter 2. Related Work

Chapter 2. Related Work Chapter 2 Related Work There are three areas of research highly related to our exploration in this dissertation, namely sequential pattern mining, multiple alignment, and approximate frequent pattern mining.

More information

Cluster Analysis. Ying Shen, SSE, Tongji University

Cluster Analysis. Ying Shen, SSE, Tongji University Cluster Analysis Ying Shen, SSE, Tongji University Cluster analysis Cluster analysis groups data objects based only on the attributes in the data. The main objective is that The objects within a group

More information

REDUCING GRAPH COLORING TO CLIQUE SEARCH

REDUCING GRAPH COLORING TO CLIQUE SEARCH Asia Pacific Journal of Mathematics, Vol. 3, No. 1 (2016), 64-85 ISSN 2357-2205 REDUCING GRAPH COLORING TO CLIQUE SEARCH SÁNDOR SZABÓ AND BOGDÁN ZAVÁLNIJ Institute of Mathematics and Informatics, University

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

University of Florida CISE department Gator Engineering. Clustering Part 4

University of Florida CISE department Gator Engineering. Clustering Part 4 Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of

More information

Data can be in the form of numbers, words, measurements, observations or even just descriptions of things.

Data can be in the form of numbers, words, measurements, observations or even just descriptions of things. + What is Data? Data is a collection of facts. Data can be in the form of numbers, words, measurements, observations or even just descriptions of things. In most cases, data needs to be interpreted and

More information

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION 6.1 INTRODUCTION Fuzzy logic based computational techniques are becoming increasingly important in the medical image analysis arena. The significant

More information

Chapter 11: Indexing and Hashing

Chapter 11: Indexing and Hashing Chapter 11: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files B-Tree Index Files Static Hashing Dynamic Hashing Comparison of Ordered Indexing and Hashing Index Definition in SQL

More information

6.2 DATA DISTRIBUTION AND EXPERIMENT DETAILS

6.2 DATA DISTRIBUTION AND EXPERIMENT DETAILS Chapter 6 Indexing Results 6. INTRODUCTION The generation of inverted indexes for text databases is a computationally intensive process that requires the exclusive use of processing resources for long

More information

Chapter 3. Set Theory. 3.1 What is a Set?

Chapter 3. Set Theory. 3.1 What is a Set? Chapter 3 Set Theory 3.1 What is a Set? A set is a well-defined collection of objects called elements or members of the set. Here, well-defined means accurately and unambiguously stated or described. Any

More information

Lecture Notes on Contracts

Lecture Notes on Contracts Lecture Notes on Contracts 15-122: Principles of Imperative Computation Frank Pfenning Lecture 2 August 30, 2012 1 Introduction For an overview the course goals and the mechanics and schedule of the course,

More information

Preprocessing Short Lecture Notes cse352. Professor Anita Wasilewska

Preprocessing Short Lecture Notes cse352. Professor Anita Wasilewska Preprocessing Short Lecture Notes cse352 Professor Anita Wasilewska Data Preprocessing Why preprocess the data? Data cleaning Data integration and transformation Data reduction Discretization and concept

More information

Lecture 2 - Graph Theory Fundamentals - Reachability and Exploration 1

Lecture 2 - Graph Theory Fundamentals - Reachability and Exploration 1 CME 305: Discrete Mathematics and Algorithms Instructor: Professor Aaron Sidford (sidford@stanford.edu) January 11, 2018 Lecture 2 - Graph Theory Fundamentals - Reachability and Exploration 1 In this lecture

More information

Fundamental mathematical techniques reviewed: Mathematical induction Recursion. Typically taught in courses such as Calculus and Discrete Mathematics.

Fundamental mathematical techniques reviewed: Mathematical induction Recursion. Typically taught in courses such as Calculus and Discrete Mathematics. Fundamental mathematical techniques reviewed: Mathematical induction Recursion Typically taught in courses such as Calculus and Discrete Mathematics. Techniques introduced: Divide-and-Conquer Algorithms

More information

3 Fractional Ramsey Numbers

3 Fractional Ramsey Numbers 27 3 Fractional Ramsey Numbers Since the definition of Ramsey numbers makes use of the clique number of graphs, we may define fractional Ramsey numbers simply by substituting fractional clique number into

More information

Design of Parallel Algorithms. Course Introduction

Design of Parallel Algorithms. Course Introduction + Design of Parallel Algorithms Course Introduction + CSE 4163/6163 Parallel Algorithm Analysis & Design! Course Web Site: http://www.cse.msstate.edu/~luke/courses/fl17/cse4163! Instructor: Ed Luke! Office:

More information

Training Digital Circuits with Hamming Clustering

Training Digital Circuits with Hamming Clustering IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: FUNDAMENTAL THEORY AND APPLICATIONS, VOL. 47, NO. 4, APRIL 2000 513 Training Digital Circuits with Hamming Clustering Marco Muselli, Member, IEEE, and Diego

More information

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15 Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2014/15 Lecture II: Indexing Part I of this course Indexing 3 Database File Organization and Indexing Remember: Database tables

More information

MUFIN Basics. MUFIN team Faculty of Informatics, Masaryk University Brno, Czech Republic SEMWEB 1

MUFIN Basics. MUFIN team Faculty of Informatics, Masaryk University Brno, Czech Republic SEMWEB 1 MUFIN Basics MUFIN team Faculty of Informatics, Masaryk University Brno, Czech Republic mufin@fi.muni.cz SEMWEB 1 Search problem SEARCH index structure data & queries infrastructure SEMWEB 2 The thesis

More information

An algorithm for Performance Analysis of Single-Source Acyclic graphs

An algorithm for Performance Analysis of Single-Source Acyclic graphs An algorithm for Performance Analysis of Single-Source Acyclic graphs Gabriele Mencagli September 26, 2011 In this document we face with the problem of exploiting the performance analysis of acyclic graphs

More information

Notes for Lecture 24

Notes for Lecture 24 U.C. Berkeley CS170: Intro to CS Theory Handout N24 Professor Luca Trevisan December 4, 2001 Notes for Lecture 24 1 Some NP-complete Numerical Problems 1.1 Subset Sum The Subset Sum problem is defined

More information

Value Added Association Rules

Value Added Association Rules Value Added Association Rules T.Y. Lin San Jose State University drlin@sjsu.edu Glossary Association Rule Mining A Association Rule Mining is an exploratory learning task to discover some hidden, dependency

More information

Math 734 Aug 22, Differential Geometry Fall 2002, USC

Math 734 Aug 22, Differential Geometry Fall 2002, USC Math 734 Aug 22, 2002 1 Differential Geometry Fall 2002, USC Lecture Notes 1 1 Topological Manifolds The basic objects of study in this class are manifolds. Roughly speaking, these are objects which locally

More information

Integrated Mathematics I Performance Level Descriptors

Integrated Mathematics I Performance Level Descriptors Limited A student performing at the Limited Level demonstrates a minimal command of Ohio s Learning Standards for Integrated Mathematics I. A student at this level has an emerging ability to demonstrate

More information

High-dimensional knn Joins with Incremental Updates

High-dimensional knn Joins with Incremental Updates GeoInformatica manuscript No. (will be inserted by the editor) High-dimensional knn Joins with Incremental Updates Cui Yu Rui Zhang Yaochun Huang Hui Xiong Received: date / Accepted: date Abstract The

More information

Finding the k-closest pairs in metric spaces

Finding the k-closest pairs in metric spaces Finding the k-closest pairs in metric spaces Hisashi Kurasawa The University of Tokyo 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo, JAPAN kurasawa@nii.ac.jp Atsuhiro Takasu National Institute of Informatics 2-1-2

More information

A Scalable Index Mechanism for High-Dimensional Data in Cluster File Systems

A Scalable Index Mechanism for High-Dimensional Data in Cluster File Systems A Scalable Index Mechanism for High-Dimensional Data in Cluster File Systems Kyu-Woong Lee Hun-Soon Lee, Mi-Young Lee, Myung-Joon Kim Abstract We address the problem of designing index structures that

More information

Discrete mathematics , Fall Instructor: prof. János Pach

Discrete mathematics , Fall Instructor: prof. János Pach Discrete mathematics 2016-2017, Fall Instructor: prof. János Pach - covered material - Lecture 1. Counting problems To read: [Lov]: 1.2. Sets, 1.3. Number of subsets, 1.5. Sequences, 1.6. Permutations,

More information

SHAPE, SPACE & MEASURE

SHAPE, SPACE & MEASURE STAGE 1 Know the place value headings up to millions Recall primes to 19 Know the first 12 square numbers Know the Roman numerals I, V, X, L, C, D, M Know the % symbol Know percentage and decimal equivalents

More information

Online Facility Location

Online Facility Location Online Facility Location Adam Meyerson Abstract We consider the online variant of facility location, in which demand points arrive one at a time and we must maintain a set of facilities to service these

More information

Introduction to Indexing R-trees. Hong Kong University of Science and Technology

Introduction to Indexing R-trees. Hong Kong University of Science and Technology Introduction to Indexing R-trees Dimitris Papadias Hong Kong University of Science and Technology 1 Introduction to Indexing 1. Assume that you work in a government office, and you maintain the records

More information

Introduction to Sets and Logic (MATH 1190)

Introduction to Sets and Logic (MATH 1190) Introduction to Sets and Logic () Instructor: Email: shenlili@yorku.ca Department of Mathematics and Statistics York University Dec 4, 2014 Outline 1 2 3 4 Definition A relation R from a set A to a set

More information

Integrated Math I High School Math Solution West Virginia Correlation

Integrated Math I High School Math Solution West Virginia Correlation M.1.HS.1 M.1.HS.2 M.1.HS.3 Use units as a way to understand problems and to guide the solution of multi-step problems; choose and interpret units consistently in formulas; choose and interpret the scale

More information

MergeSort, Recurrences, Asymptotic Analysis Scribe: Michael P. Kim Date: April 1, 2015

MergeSort, Recurrences, Asymptotic Analysis Scribe: Michael P. Kim Date: April 1, 2015 CS161, Lecture 2 MergeSort, Recurrences, Asymptotic Analysis Scribe: Michael P. Kim Date: April 1, 2015 1 Introduction Today, we will introduce a fundamental algorithm design paradigm, Divide-And-Conquer,

More information

2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006

2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006 2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006 The Encoding Complexity of Network Coding Michael Langberg, Member, IEEE, Alexander Sprintson, Member, IEEE, and Jehoshua Bruck,

More information

Topology and Topological Spaces

Topology and Topological Spaces Topology and Topological Spaces Mathematical spaces such as vector spaces, normed vector spaces (Banach spaces), and metric spaces are generalizations of ideas that are familiar in R or in R n. For example,

More information

CHAPTER 5 PROPAGATION DELAY

CHAPTER 5 PROPAGATION DELAY 98 CHAPTER 5 PROPAGATION DELAY Underwater wireless sensor networks deployed of sensor nodes with sensing, forwarding and processing abilities that operate in underwater. In this environment brought challenges,

More information

Algorithm Efficiency & Sorting. Algorithm efficiency Big-O notation Searching algorithms Sorting algorithms

Algorithm Efficiency & Sorting. Algorithm efficiency Big-O notation Searching algorithms Sorting algorithms Algorithm Efficiency & Sorting Algorithm efficiency Big-O notation Searching algorithms Sorting algorithms Overview Writing programs to solve problem consists of a large number of decisions how to represent

More information

COSC160: Data Structures Hashing Structures. Jeremy Bolton, PhD Assistant Teaching Professor

COSC160: Data Structures Hashing Structures. Jeremy Bolton, PhD Assistant Teaching Professor COSC160: Data Structures Hashing Structures Jeremy Bolton, PhD Assistant Teaching Professor Outline I. Hashing Structures I. Motivation and Review II. Hash Functions III. HashTables I. Implementations

More information

9.1. K-means Clustering

9.1. K-means Clustering 424 9. MIXTURE MODELS AND EM Section 9.2 Section 9.3 Section 9.4 view of mixture distributions in which the discrete latent variables can be interpreted as defining assignments of data points to specific

More information

Branch and Bound. Algorithms for Nearest Neighbor Search: Lecture 1. Yury Lifshits

Branch and Bound. Algorithms for Nearest Neighbor Search: Lecture 1. Yury Lifshits Branch and Bound Algorithms for Nearest Neighbor Search: Lecture 1 Yury Lifshits http://yury.name Steklov Institute of Mathematics at St.Petersburg California Institute of Technology 1 / 36 Outline 1 Welcome

More information

Solutions to Problem Set 1

Solutions to Problem Set 1 CSCI-GA.3520-001 Honors Analysis of Algorithms Solutions to Problem Set 1 Problem 1 An O(n) algorithm that finds the kth integer in an array a = (a 1,..., a n ) of n distinct integers. Basic Idea Using

More information

QUANTIZER DESIGN FOR EXPLOITING COMMON INFORMATION IN LAYERED CODING. Mehdi Salehifar, Tejaswi Nanjundaswamy, and Kenneth Rose

QUANTIZER DESIGN FOR EXPLOITING COMMON INFORMATION IN LAYERED CODING. Mehdi Salehifar, Tejaswi Nanjundaswamy, and Kenneth Rose QUANTIZER DESIGN FOR EXPLOITING COMMON INFORMATION IN LAYERED CODING Mehdi Salehifar, Tejaswi Nanjundaswamy, and Kenneth Rose Department of Electrical and Computer Engineering University of California,

More information

CHAPTER 4 VORONOI DIAGRAM BASED CLUSTERING ALGORITHMS

CHAPTER 4 VORONOI DIAGRAM BASED CLUSTERING ALGORITHMS CHAPTER 4 VORONOI DIAGRAM BASED CLUSTERING ALGORITHMS 4.1 Introduction Although MST-based clustering methods are effective for complex data, they require quadratic computational time which is high for

More information

A step towards the Bermond-Thomassen conjecture about disjoint cycles in digraphs

A step towards the Bermond-Thomassen conjecture about disjoint cycles in digraphs A step towards the Bermond-Thomassen conjecture about disjoint cycles in digraphs Nicolas Lichiardopol Attila Pór Jean-Sébastien Sereni Abstract In 1981, Bermond and Thomassen conjectured that every digraph

More information

Subset sum problem and dynamic programming

Subset sum problem and dynamic programming Lecture Notes: Dynamic programming We will discuss the subset sum problem (introduced last time), and introduce the main idea of dynamic programming. We illustrate it further using a variant of the so-called

More information

CHAPTER 4 BLOOM FILTER

CHAPTER 4 BLOOM FILTER 54 CHAPTER 4 BLOOM FILTER 4.1 INTRODUCTION Bloom filter was formulated by Bloom (1970) and is used widely today for different purposes including web caching, intrusion detection, content based routing,

More information