A Content-Addressable Network for Similarity Search in Metric Spaces


University of Pisa, Department of Information Engineering, Doctoral Course in Information Engineering
Masaryk University, Faculty of Informatics, Department of Information Technologies, Doctoral Course in Informatics

Ph.D. Thesis

A Content-Addressable Network for Similarity Search in Metric Spaces

Fabrizio Falchi

Supervisors: Lanfranco Lopriore, Fausto Rabitti, Pavel Zezula

2007


Abstract

Because of the ongoing digital data explosion, more advanced search paradigms than the traditional exact match are needed for content-based retrieval in huge and ever growing collections of data produced in application areas such as multimedia, molecular biology, marketing, computer-aided design and purchasing assistance. As the variety of data types quickly moves towards databases used directly by people, computer systems must be able to model fundamental human reasoning paradigms, which are naturally based on similarity. The ability to perceive similarities is crucial for recognition, classification, and learning, and it plays an important role in scientific discovery and creativity. Recently, the mathematical notion of metric space has become a useful abstraction of similarity and many similarity search indexes have been developed. In this thesis, we accept the metric space similarity paradigm and concentrate on scalability issues. By exploiting computer networks and applying Peer-to-Peer communication paradigms, we build a structured network of computers able to process similarity queries in parallel. Since no centralized entities are used, such architectures are fully scalable. Specifically, we propose a Peer-to-Peer system for similarity search in metric spaces called Metric Content-Addressable Network (MCAN), which is an extension of the well-known Content-Addressable Network (CAN) used for hash lookup. A prototype implementation of MCAN was tested on real-life datasets of image features, protein symbols, and text, and the observed results are reported. We also compared the performance of MCAN with three other recently proposed distributed data structures for similarity search in metric spaces.


Acknowledgments

It is a great honor and privilege for me to have an international joint supervision of my Ph.D. dissertation. I want to thank the University of Pisa and the Masaryk University, Brno for their agreement regarding this joint supervision. I am grateful to my supervisors Lanfranco Lopriore, Fausto Rabitti and Pavel Zezula for their guidance and encouragement. Since I was not just a Ph.D. student at the University of Pisa but also at the Masaryk University, I had the opportunity to work with several researchers in Brno, especially Michal Batko, David Novak and Vlastislav Dohnal. The discussions we had and the work we did together were fundamental for the completion of this thesis. This thesis would not have been possible without the substantial help from Claudio Gennaro and without the discussions I had with Giuseppe Amato and Pasquale Savino. I also thank all the other people at the Networked Multimedia Information Systems laboratory of the Institute of Information Science and Technologies, especially Fausto Rabitti who is the head of the laboratory and his predecessor Costantino Thanos. I would also like to thank Cosimo Antonio Prete and Sandro Bartolini for teaching me the basics of research during my master thesis. Finally, I thank my mother and father for supporting me during my studies.


To Moly


Contents

Abstract  iii
Acknowledgments  v
Table of Contents  vii

Introduction  xiii
  I.1 Similarity Search  xiv
  I.2 Statement of the Problem  xv
    I.2.1 Purpose of the Study  xvi
    I.2.2 Significance of the Study  xvii
    I.2.3 Limitations of the Study  xviii
    I.2.4 Summary  xix

1 The Similarity Search Problem
  1.1 Similarity Search
  1.2 Similarity Queries
    Range Query
    Nearest Neighbor Query
    Combinations of Queries
    Complex Similarity Queries
  1.3 Metric Spaces
    Metric Distance Measures
  1.4 Access Methods for Metric Spaces
    Ball Partitioning
    Generalized Hyperplane Partitioning
    Exploiting Pre-Computed Distances
    Hybrid Indexing Approaches

2 Distributed Indexes
  2.1 Scalable and Distr. Data Structures
  2.2 Peer-to-Peer Systems
    Characterization
    The lookup problem
  2.3 Unstructured Peer-to-Peer Systems
  2.4 Structured Peer-to-Peer Systems
    Introduction to DHTs
    Chord, Pastry, Tapestry, Chimera, Z-Ring, Content Addressable Network (CAN), Kademlia, Symphony, Viceroy
    DHTs Comparison
    DHTs Related Works: P-Grid, Small-World and Scale-Free, Other Works
  2.5 Metric Peer-to-Peer Structures
    GHT* and VPT*
    M-Chord
  Peer-to-Peer and Grid Computing

3 Content-Addressable Network (CAN)
  Node arrivals
    Finding a Zone
    Joining the Routing
  Routing
  Node departures
  Load Balancing
  3.5 M-CAN: CAN-based Multicast
  CAN Related Works

4 MCAN
  Mapping
  Pivot Selection
  Filtering
  Regions
  Construction
  Insert
    Split Execution
    End Detection
  Range Query
  Nearest Neighbor Query

5 MCAN Evaluation
  Dimensionality of the Mapped Space
  Range Query
  Nearest Neighbor Query
    Number of peers involved in query execution
    Total number of distance computations
    Parallel cost of kNN
    Candidate results

6 MCAN Comparison with other structures
  Experiments Settings
  Measurements
  Scalability with Respect to the Size of the Query
  Scalability with Respect to the Size of Datasets
  Number of Simultaneous Queries
  Comparison Summary

Conclusions
  Research Directions

Bibliography  145
Index  174

Introduction

Traditionally, search has been applied to structured (attribute-type) data, yielding records that exactly match the query. However, to find images similar to a given one, time series with similar development, or groups of customers with common buying patterns in respective data collections, the traditional search technologies simply fail. In these cases, the required comparison is gradual, rather than binary, because once a reference (query) pattern is given, each instance in a search file and the pattern are in a certain relation which is measured by a user-defined dissimilarity function. These types of search are more and more important for a variety of current complex digital data collections and are generally designated as similarity search. Thus, similarity search has become a fundamental computational task in many applications. Unfortunately, its cost is still high and grows linearly with respect to the dataset size, which prevents the realization of efficient applications for the ever-increasing amount of digital information being created today. In the last few years, Peer-to-Peer systems have been widely used, particularly for file sharing, distributed computing, Internet-based telephony and video multicast. The growth in the usage of these applications is enormous and even more rapid than that of the World Wide Web [183]. In a Peer-to-Peer system, autonomous entities (peers) aim for the shared usage of distributed autonomous resources in a networked environment. Taking advantage of these distributed resources, scalability issues of centralized solutions for specific applications have been overcome. In this thesis we present a scalable distributed similarity search

structure for metric spaces, which is called Metric Content-Addressable Network (MCAN). MCAN extends the well-known structured Peer-to-Peer system CAN, described in Chapter 3 [p.63], to generic metric space objects. MCAN uses the Peer-to-Peer paradigm to overcome scalability issues of centralized structures for similarity search, taking advantage of distributed resources over the Peer-to-Peer network.

I.1 Similarity Search

Similarity search is based on gradual rather than exact relevance and is used in content-based retrieval for queries involving complex data types such as images, videos, time series, text and DNA sequences. For instance, for the experimental results reported in this thesis, we use three datasets which cover very different topics: vectors of color image features, protein symbol sequences, and titles and subtitles of books and periodicals from several academic libraries. In similarity search, discussed in Chapter 1 [p.1], a distance between objects is used to quantify the proximity, similarity or dissimilarity of a query object versus the objects stored in a database to be searched. Although many similarity search approaches have been proposed, the most generic one considers the mathematical notion of metric space (see Section 1.3 [p.7]) as a suitable abstraction of similarity. The simple but powerful concept of the metric space consists of a domain of objects and a distance function which satisfies a set of constraints. It can be applied not only to multidimensional vector spaces, but also to different forms of string objects, as well as to sets or groups of various natures, etc. However, similarity search is inherently expensive and for centralized indexes the search costs increase linearly as the stored dataset grows, thus preventing them from being applied to huge files that have become common and are continuously growing. Applications of similarity search for multimedia objects have been studied extensively. In the last few years, standardization processes have taken place to make multimedia features interchangeable,

thus moving the problem of comparing two multimedia objects to the problem of comparing their standardized features, which can be automatically extracted by different software. In particular, in the Multimedia Content Description Interface standard MPEG-7 [87], a great number of low-level features for multimedia objects have been defined. Moreover, distance functions to evaluate the dissimilarity of two multimedia objects have been suggested by the Moving Picture Experts Group (MPEG) [154]. The low-level features of MPEG-7 have been largely used in advanced multimedia information systems (e.g. MILOS Photo Book [9, 10]).

I.2 Statement of the Problem

Current centralized similarity search structures for metric spaces exhibit linear scalability with respect to the dataset size. Obviously, this is not sufficient for the expected data volumes, and this is why there is no web search engine able to search the web for images similar to a given one. In fact, the image search applications available (e.g. Google Image Search, Yahoo! Image Search) are simply based on the accompanying or manually annotated text (e.g. Google Image Labeler). Peer-to-Peer systems are today widely used to perform resource-intensive tasks in networks of volunteer nodes (e.g. file sharing, search for extraterrestrial intelligence, video broadcasting over the Internet, etc.). However, centralized indexes for exact match queries are still largely used in file sharing communities. In this context, distributed indexes are mostly used because it is more difficult for the authorities to shut them down. Distributed Peer-to-Peer web search engines have also been proposed to avoid the existence of a centralized control (e.g. Google). In recent years various structured Peer-to-Peer systems have been proposed for exact match indexing (see Section 2.4 [p.29]).

However, their use is still limited because of the good performance achievable by the centralized structures. We believe that structured Peer-to-Peer systems for similarity search are more attractive because their centralized counterparts do not scale well and thus cannot be used for huge amounts of data. Very recently, four scalable distributed similarity search structures for metric data have been proposed. One is MCAN, which is the subject of this thesis. The other three are GHT*, VPT* and M-Chord, which we describe in Section 2.5 [p.52]. Each of these structures is able to execute similarity queries for any metric dataset, and they all exploit parallelism for query execution. All these distributed similarity search structures adopt the Peer-to-Peer paradigm, trying to overcome the scalability issues of centralized similarity search structures by using distributed resources over a Peer-to-Peer network.

I.2.1 Purpose of the Study

In this thesis we consider the Peer-to-Peer paradigm to develop a scalable and distributed data structure for similarity search in metric spaces. Structured Peer-to-Peer systems have been proposed to efficiently find data in a scalable manner in large Peer-to-Peer systems using exact match. Almost all of them belong to the class of Distributed Hash Tables (DHTs) discussed in Section 2.4 [p.29]. Our objective is to study the possibility of combining the capabilities of metric space similarity search with the power and high scalability of Peer-to-Peer systems, especially DHTs, employing their virtually unlimited storage and computational resources. For this purpose we extend the Content-Addressable Network (CAN), which is a well-known DHT, to support storage and retrieval of generic metric space objects. More precisely, we want to address the problem of executing Range and Nearest Neighbor queries (see Section 1.2 [p.3]). Particular attention is given to the scalability of the proposed solution with respect to the dataset size. Because of the great variety of forms of data that do satisfy the metric assumptions, we evaluate the performance of the proposed solution using different datasets (i.e. of image features, protein

symbols and text) and different metric distance functions. Considering that, almost at the same time as we were developing our MCAN, three other similar structures (i.e. GHT*, VPT* and M-Chord) were proposed, we want to compare their performance considering various performance measures and various datasets.

I.2.2 Significance of the Study

The proposed MCAN has been proved to be scalable with respect to the dataset size considering the response time of Range queries. Three different strategies for performing Nearest Neighbor queries were studied. The study demonstrated that the mixed mode execution is the best choice in general. It responds well to the demand for scalability in terms of response time and consumes fewer resources than the completely parallel approaches. Implementing MCAN over the same framework in which, almost at the same time, three similar structures (i.e. GHT*, VPT* and M-Chord) were built, it was possible to explore the performance characteristics of the proposed algorithms by showing their advantages and disadvantages. The distributed similarity search structure approach is the core of the European project SAPIR (Search on Audio-visual content using Peer-to-peer Information Retrieval), which aims at finding new ways to analyze, index, and retrieve the tremendous amounts of speech, image, video, and music that are filling our digital universe, going beyond what the most popular engines are still doing, that is, searching using text tags that have been associated with multimedia files. SAPIR is a three-year research project that aims at breaking this technological barrier by developing a large-scale, distributed Peer-to-Peer architecture that will make it possible to search for audio-visual content by querying the specific characteristics (i.e. features) of the content. SAPIR's goal is to establish a giant Peer-to-Peer network, where users are peers that produce audiovisual

content using multiple devices, and service providers are super-peers that maintain indexes and provide search capabilities. MCAN will also be used in the context of Networked Peers for Business (NeP4B), an Italian FIRB project which aims to contribute innovative Information and Communication Technologies solutions for small and medium enterprises by developing an advanced technological infrastructure to enable companies of any nature, size and geographic location to search for partners, exchange data, negotiate and collaborate without limitations and constraints.

I.2.3 Limitations of the Study

MCAN efficiently supports similarity search over a single type of feature. However, even for the same type of data, different features can be used for searching. For instance, in content-based image retrieval, parts of the entire images, obtained using image segmentation software, can be searched considering different aspects, e.g. the colors, the shape, the edges and the texture. Moreover, in most cases a combination of features is required. Whenever the combination is fixed and the combined distance function is metric, our proposed solution could be used, considering the combined features as a whole. However, in some cases, the users would like to dynamically specify the features to combine, the importance to give to each feature and the combination function. For this purpose, the state-of-the-art complex query algorithms that should be used (see Subsection [p.5]) typically require an Incremental Nearest Neighbor algorithm to be implemented for each feature. Starting from the Nearest Neighbor algorithm presented in this thesis, it should be possible to define an incremental version. However, this is still an open question which will be investigated in the near future. Using CAN as the basis, MCAN inherited many functionalities that were proposed in the literature for CAN (e.g. load balancing, multicast, recovering from node failures, etc.). However, the performance of these functionalities should be extensively studied.

With distributed data structures for similarity search such as MCAN, we achieve scalability of similarity queries with respect to the dataset size by adding resources as the dataset grows. Basically, we assume that resources (i.e. nodes) grow at least proportionally to the dataset size, which is quite common in Peer-to-Peer systems used by communities. However, similarity queries do remain expensive tasks which require a lot of resources.

I.2.4 Summary

The thesis is organized as follows. In Chapter 1, background and definitions of the similarity search problem are given, with particular attention to the metric space abstraction. Scalable indexing mechanisms in distributed systems are the topic of Chapter 2, where the Scalable and Distributed Data Structures (SDDSs) and Peer-to-Peer approaches are discussed. The Content-Addressable Network, which is the base of our structure, is then discussed in Chapter 3. Our Metric Content-Addressable Network, which is our Peer-to-Peer similarity search structure, is defined in Chapter 4. Exhaustive experimental evaluation is provided in Chapter 5. Finally, the results of a comprehensive experimental comparison between ours and three other scalable distributed similarity search structures are reported in Chapter 6.


Chapter 1

The Similarity Search Problem

Similarity search, particularly in metric spaces, has been receiving increasing attention in the last decade, due to its many applications in widely different areas. Two recent books [191, 157], two surveys in ACM Computing Surveys (CSUR) [46, 29] and ACM Transactions on Database Systems (TODS) [84, 30], and the sharp increase in publications in recent years witness the momentum gained by this important problem. The chapter is organized as follows. In Section 1.1 we discuss the notion of similarity and we define the search problem. In Section 1.2 we report the most commonly used similarity queries. The metric space abstraction is presented in Section 1.3 together with some metric distance measures. Finally, we illustrate the most important centralized access methods for metric spaces that have been presented in the literature.

1.1 Similarity Search

The notion of similarity has been studied extensively in the field of psychology, where it has been found to play an important role in cognitive sciences. In [75], R.L. Goldstone and S.J. Yun say:

An ability to assess similarity lies close to the core of cognition. In the time-honored tradition of legitimizing fields of psychology by citing William James, "This sense of Sameness is the very keel and backbone of our thinking" [89, p.459]. An understanding of problem solving, categorization, memory retrieval, inductive reasoning, and other cognitive processes requires that we understand how humans assess similarity. Four major psychological models of similarity are geometric, featural, alignment-based, and transformational.

From a database perspective, similarity search is based on gradual rather than exact relevance. A distance between objects is used to quantify the proximity, similarity or dissimilarity of a query object versus the objects stored in a database to be searched. A similarity search can be seen as a process of obtaining data objects in order of their distance or dissimilarity from a given query object. It is a kind of sorting, ordering, or ranking of objects with respect to the query object, where the ranking criterion is the distance measure. Though this principle works for any distance measure, we restrict the possible set of measures to metric distances (see Section 1.3 [p.7]). Because of the mathematical foundations of the metric space notion, partitioning and pruning rules can be constructed for developing efficient index structures. Therefore, in the past years research has focused on metric spaces. On the other hand, since many data domains in use are represented by vectors, we could restrict ourselves to coordinate spaces, a special case of metric spaces in which objects can be seen as vectors. Unfortunately, existing solutions for searching in coordinate spaces suffer from the so-called curse of dimensionality, i.e. they either become slower than naive algorithms with linear search times or they use too much space. Moreover, some spaces have coordinates restricted to small sets of values, so that the use of such coordinates may not be helpful. Therefore, even if the use of coordinates can be advantageous in special cases, as Clarkson said in [53]:

to strip the problem down to its essentials by only considering distances, it is reasonable to find the minimal properties needed for fast algorithms.

As in [191], we consider the problem of organizing and searching large datasets from the perspective of a generic or arbitrary metric space, sometimes conveniently labelled distance space. In general, the search problem can be described as follows:

Problem 1. Let D be the domain, d a distance measure on D, and M = (D, d) a metric space. Given a set X ⊆ D of elements, preprocess or structure the data so that proximity queries are answered efficiently.

From a practical point of view, X can be seen as a file (a dataset or a collection) of objects that take values from the domain D, with d as the proximity measure, i.e. the distance function defined for an arbitrary pair of objects from D. In Section 1.2 [p.3] the most common similarity queries are presented. In this thesis we will only consider metric distance functions (see Section 1.3 [p.7]). The most important ones are presented in Subsection [p.9].

1.2 Similarity Queries

A similarity query is defined by a query object and a constraint on the proximity required for an object to be in the result set. The response to a query returns all the objects that satisfy the selection conditions. In this section we present the two basic types of similarity queries, i.e. Range and Nearest Neighbor. We also consider the Incremental Nearest Neighbor search and the problematics of combining similarity queries and of complex similarity queries. For a more complete description of similarity queries see [191, ch. 4].

Range Query

The similarity range query R(q, r) is probably the most common type of similarity search. The query is specified by a query object q ∈ D, with some query radius r ∈ ℝ as the distance constraint. The query retrieves all the objects within distance r of q, formally:

R(q, r) = \{\, x \in X \subseteq D \mid d(x, q) \le r \,\}.

If needed, individual objects in the response set can be ranked according to their distance with respect to q. Observe that the query object q need not exist in the collection X ⊆ D to be searched. The only restriction on q is that it belongs to the domain D. When the range radius is zero, the range query R(q, r) is called a point query or exact match. In this case, we are looking for an identical copy (or copies) of the query object q. The most usual use of this type of query is in delete algorithms, when we want to locate an object in order to remove it from the database.
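To make the definition concrete, the following is a minimal sketch of a range query answered by a naive sequential scan. It is illustrative only: the dataset, the Euclidean metric and all names are assumptions, not code from this thesis.

```python
import math

def euclidean(x, y):
    """L2 distance between two equal-length vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def range_query(X, d, q, r):
    """R(q, r): every object of the dataset X within distance r of q,
    answered by a naive sequential scan (one distance computation per object)."""
    return [x for x in X if d(x, q) <= r]

# Hypothetical dataset of 2-D feature vectors.
X = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.5), (5.0, 5.0)]
print(range_query(X, euclidean, q=(1.0, 0.0), r=1.5))
```

The cost of this baseline is linear in the dataset size, which is exactly the scalability problem the access methods and the distributed structures discussed later try to avoid.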

Nearest Neighbor Query

Whenever we want to search for similar objects using a range search, we must specify a maximal distance for objects to qualify. But it can be difficult to specify the radius without some knowledge of the data and the distance function. For example, the range r = 3 of the edit distance metric represents less than four edit operations between the compared strings. This has a clear semantic meaning. However, a distance between two color-histogram vectors of images is a real number whose quantification cannot be so easily interpreted. If too small a query radius is specified, the empty set may be returned and a new search with a larger radius will be needed to get any result. On the other hand, if query radii are too large, the query may be computationally expensive and the response sets contain many non-significant objects. An alternative way to search for similar objects is to use a Nearest Neighbor query. The elementary version of this query finds the closest object to the given query object, that is the nearest neighbor of q. The concept can be generalized to the case where we look for the k nearest neighbors. Specifically, the kNN(q) query retrieves the k nearest neighbors of the object q. If the collection to be searched consists of fewer than k objects, the query returns the whole database. Formally, the response set can be defined as follows:

kNN(q) = \{\, R \subseteq X,\ |R| = k,\ \forall x \in R,\ y \in X - R : d(q, x) \le d(q, y) \,\}.

When several objects lie at the same distance as the k-th nearest neighbor, the ties are resolved arbitrarily.

Combinations of Queries

As an extension of the query types defined above, we can define additional types of queries as combinations of the previous ones. For example, we might combine a range query with a nearest neighbor query to get kNN(q, r) with the response set:

kNN(q, r) = \{\, R \subseteq X,\ |R| \le k,\ \forall x \in R,\ y \in X - R : d(q, x) \le d(q, y) \wedge d(q, x) \le r \,\}.

In fact, we have constrained the result from two sides. First, all objects in the result set should lie at a distance not greater than r, and if there are more than k of them, just the first (i.e. the nearest) k are returned.
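As a baseline illustration (an assumption, not the distributed algorithm proposed later in this thesis), kNN(q) can be computed by a sequential scan that keeps the k best candidates seen so far in a bounded max-heap; ties at the k-th distance are broken arbitrarily, as in the definition above.

```python
import heapq
import math

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def knn_query(X, d, q, k):
    """kNN(q): the k objects of X closest to q (ties broken arbitrarily).
    A bounded max-heap keeps the k best candidates seen so far."""
    heap = []  # entries: (-distance, counter, object); the counter avoids comparing objects
    for i, x in enumerate(X):
        dist = d(q, x)
        if len(heap) < k:
            heapq.heappush(heap, (-dist, i, x))
        elif dist < -heap[0][0]:
            heapq.heapreplace(heap, (-dist, i, x))
    return [x for _, _, x in sorted(heap, reverse=True)]  # nearest object first

X = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.5), (5.0, 5.0)]
print(knn_query(X, euclidean, q=(1.0, 0.0), k=2))
# A kNN(q, r) variant would additionally discard candidates whose distance exceeds r.
```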

Complex Similarity Queries

Complex similarity queries are queries consisting of more than one similarity predicate. Efficient processing of such kinds of queries differs substantially from traditional (Boolean) query processing. The problem was first studied by Fagin in [60] (related works are [61, 62, 63, 64]). The similarity score (or grade) a retrieved object receives as a whole depends not only on the scores it gets for the individual predicates, but also on how such scores are combined. To this aim, Fagin has proposed the A0 algorithm [60]. This algorithm assumes that for each query predicate we have an index structure able to return objects in order of decreasing similarity. Other related papers are [78, 134]. Ciaccia, Patella, and Zezula [51] have concentrated on complex similarity queries expressed through a generic language. On the other hand, they assume that query predicates are from a single feature domain, i.e. from the same metric space. The proposed evaluation process is based on distances between feature values, because metric indexes can use just distances to evaluate predicates. In [49] an extension of M-tree [50] is given. It outperforms the A0 algorithm, but its main drawback is that, even though it is able to employ more features during the search, these features are compared using a single distance function. A generalization of relational algebra allowing the formulation of complex similarity queries over multimedia databases, called similarity algebra with weights, has been introduced in [48].

Incremental Nearest Neighbor Search

When finding the k nearest neighbors to the query object using a kNN algorithm (see Section [p.4]), k is known prior to the invocation of the algorithm. Thus, if the (k+1)-th neighbor is needed, the kNN algorithm needs to be re-invoked for (k+1) neighbors from scratch. To resolve this problem, the authors of the Incremental Nearest Neighbor algorithm [83] proposed the concept of distance browsing, which is to obtain the neighbors incrementally (i.e. one by one) as they are needed. This operation means browsing through the database on the basis of distance. An incremental similarity search can provide objects in order of decreasing similarity without explicitly specifying the number of nearest neighbors in advance. This is especially important in interactive database applications, as it makes it possible to display

partial query results early. The incremental aspect also provides significant benefits in situations where the number of desired neighbors is unknown beforehand, for example when complex similarity queries are processed (see [62]). For more detail see [32] and [83].

1.3 Metric Spaces

Although many similarity search approaches have been proposed, the most generic one considers the mathematical metric space as a suitable abstraction of similarity. Let M = (D, d) be a metric space defined for a domain of objects (or the objects' keys or indexed features) D and a total (distance) function d. In this metric space, the properties of the function d : D × D → ℝ, sometimes called the metric space postulates, are typically characterized as:

\forall x, y \in D,\ d(x, y) \ge 0                         (non-negativity)
\forall x, y \in D,\ d(x, y) = d(y, x)                     (symmetry)
\forall x, y \in D,\ x = y \Leftrightarrow d(x, y) = 0     (identity)
\forall x, y, z \in D,\ d(x, z) \le d(x, y) + d(y, z)      (triangle inequality)

d(x, y) is called the distance from x to y. A function d satisfying the metric postulates is called a metric function [96] or simply a metric; accordingly, the value d(x, y) can be called a metric distance. There are also several variations of metric spaces. In order to specify them more easily, we first transform the metric space postulates given above into an equivalent form in which the identity postulate is decomposed into reflexivity (1.3) and positiveness (1.4):

\forall x, y \in D,\ d(x, y) \ge 0                         non-negativity, (1.1)
\forall x, y \in D,\ d(x, y) = d(y, x)                     symmetry, (1.2)
\forall x \in D,\ d(x, x) = 0                              reflexivity, (1.3)
\forall x, y \in D,\ x \ne y \Rightarrow d(x, y) > 0       positiveness, (1.4)
\forall x, y, z \in D,\ d(x, z) \le d(x, y) + d(y, z)      triangle inequality. (1.5)

If the total (distance) function d does not satisfy the positiveness property (1.4), it is called a pseudo-metric [96]. Although for simplicity we do not consider pseudo-metric spaces in this work, such functions can be transformed to a standard metric by regarding any pair of objects with zero distance as a single object. In fact, if d is a pseudo-metric function, using the triangle inequality (1.5):

d(x, y) = 0 \Rightarrow \forall z \in D,\ d(x, z) = d(y, z).

If the symmetry property (1.2) does not hold, we talk about a quasi-metric. For example, let the objects be different locations within a city, and the distance function the physical distance a car must travel between them. The existence of one-way streets implies that the function must be asymmetric. There are techniques to transform asymmetric distances into symmetric form, for example:

d_{sym}(x, y) = d_{asym}(x, y) + d_{asym}(y, x).

If the function d satisfies a stronger constraint on the triangle inequality:

\forall x, y, z \in D,\ d(x, z) \le \max\{d(x, y), d(y, z)\},

then d is called a super-metric or ultra-metric. The geometric characterization of the super-metric requires every triangle to have at least two sides of equal length, i.e. to be isosceles, which implies that the third side must be shorter than the others.
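The symmetrization technique above can be illustrated with a small, purely hypothetical example: an asymmetric travel-like distance between three locations is made symmetric by summing the two directions, and the metric postulates are then spot-checked on the sample (the numbers are invented for illustration).

```python
from itertools import permutations

# Hypothetical asymmetric "driving distance" between three locations a, b, c.
d_asym = {
    ("a", "b"): 2.0, ("b", "a"): 3.0,
    ("a", "c"): 4.0, ("c", "a"): 4.5,
    ("b", "c"): 1.0, ("c", "b"): 2.5,
}

def d_sym(x, y):
    """Symmetric distance d_sym(x, y) = d_asym(x, y) + d_asym(y, x)."""
    return 0.0 if x == y else d_asym[(x, y)] + d_asym[(y, x)]

objects = ["a", "b", "c"]

# Spot-check the metric postulates (1.1)-(1.5) on the sample objects.
assert all(d_sym(x, y) >= 0 for x in objects for y in objects)              # non-negativity
assert all(d_sym(x, y) == d_sym(y, x) for x in objects for y in objects)    # symmetry
assert all(d_sym(x, x) == 0 for x in objects)                               # reflexivity
assert all(d_sym(x, y) > 0 for x in objects for y in objects if x != y)     # positiveness
assert all(d_sym(x, z) <= d_sym(x, y) + d_sym(y, z)
           for x, y, z in permutations(objects, 3))                         # triangle inequality
print("all metric postulates hold on the sample")
```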

Ultra-metrics are widely used in the field of biology, particularly in evolutionary biology. By comparing the DNA sequences of pairs of species, evolutionary biologists obtain an estimate of the time which has elapsed since the species separated. From these distances, an evolutionary tree (sometimes called a phylogenetic tree) can be reconstructed, where the weights of the tree edges are determined by the time elapsed between two speciation events [142, 143]. Having a set of extant species, the evolutionary tree forms an ultra-metric tree with all the species stored in leaves and an identical distance from root to leaves. The ultra-metric tree is a model of the underlying ultra-metric distance function.

Metric Distance Measures

In the following, we present examples of distance functions used in practice on various types of data. Typically, distance functions are specified by domain experts. Depending on the character of the values returned, distance measures can be divided into two groups: discrete (i.e. distance functions that return only a predefined small set of values) and continuous (i.e. distance functions in which the cardinality of the set of values returned is very large or infinite). A survey of the most important metric distances can be found in [191, ch. 3].

Minkowski Distances

The Minkowski distance functions form a whole family of metric functions. They are usually designated as the L_p metrics, where p is a parameter. These functions are defined on n-dimensional vectors of real numbers as:

L_p[(x_1, \dots, x_n), (y_1, \dots, y_n)] = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}.

The L_1 distance is known as the Manhattan distance (also called the City-Block distance). The L_2 distance denotes the well-known Euclidean distance. The maximum distance, infinite distance or chessboard distance is defined as:

L_\infty = \max_{i=1}^{n} |x_i - y_i|.

The L_p metrics are used in measurements of scientific experiments, environmental observations, or the study of different aspects of the business process.

Quadratic Form Distance

The quadratic form distance function has been successfully applied to vectors whose individual components are correlated, i.e. a kind of cross-talk exists between individual dimensions (e.g. color histograms of images [67, 80, 158]). A generalized quadratic distance measure is defined as

d_M(\vec{x}, \vec{y}) = \sqrt{ (\vec{x} - \vec{y})^T \cdot M \cdot (\vec{x} - \vec{y}) },

where M is a semi-definite matrix whose weights m_{i,j} denote how strong the connection between the two components i and j of the vectors \vec{x} and \vec{y} is. Observe that this definition also subsumes the Euclidean distance when M is the identity matrix. When M = diag(w_1, \dots, w_n), the quadratic form distance collapses to the so-called weighted Euclidean distance:

d_M(\vec{x}, \vec{y}) = \sqrt{ \sum_{i=1}^{n} w_i (x_i - y_i)^2 }.

Edit Distance

The closeness of sequences of symbols (strings) can be effectively measured by the edit distance, also called the Levenshtein distance, presented in [101]. The distance between two strings x and y is defined as the minimum number of atomic edit operations (insert, delete and replace) needed to transform string x into string y.
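For illustration, here is a sketch of the distance functions introduced so far: the Minkowski L_p family, the weighted Euclidean special case of the quadratic form distance, and the unweighted (Levenshtein) edit distance. The data values in the comments are worked examples; everything in the sketch is an assumption rather than thesis code.

```python
def minkowski(x, y, p):
    """L_p distance between two equal-length real vectors."""
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1.0 / p)

def chebyshev(x, y):
    """L_infinity (maximum / chessboard) distance."""
    return max(abs(xi - yi) for xi, yi in zip(x, y))

def weighted_euclidean(x, y, w):
    """Quadratic form distance with M = diag(w_1, ..., w_n)."""
    return sum(wi * (xi - yi) ** 2 for wi, xi, yi in zip(w, x, y)) ** 0.5

def edit_distance(a, b):
    """Levenshtein distance: minimum number of insert/delete/replace
    operations needed to transform string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # replace ca by cb (free on match)
        prev = curr
    return prev[-1]

print(minkowski((1, 2), (4, 6), p=1))   # 7.0  (Manhattan)
print(minkowski((1, 2), (4, 6), p=2))   # 5.0  (Euclidean)
print(chebyshev((1, 2), (4, 6)))        # 4    (maximum distance)
print(weighted_euclidean((1, 2), (4, 6), (1.0, 0.25)))   # sqrt(9 + 4) ~ 3.61
print(edit_distance("survey", "surgery"))                # 2 (one replace, one insert)
```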

The generalized edit distance function assigns weights (positive real numbers) to the individual atomic operations. Hence, the generalized edit distance is the minimum value of the sum of the weighted atomic operations needed to transform x into y. In case the weights of the insert and delete operations differ, the generalized edit distance is not a metric. Using weighting functions, we can define an even more generic edit distance which assigns different costs even to operations on individual characters. In this case, to retain the metric postulates some additional limits must be placed on the weight functions (e.g. symmetry of substitutions). A survey on string matching can be found in [131].

Tree Edit Distance

A proximity measure for trees is the tree edit distance (see [77, 117]), which defines a distance between two tree structures as the minimum cost needed to convert the source tree into the target tree using a predefined set of tree edit operations, such as the insertion or deletion of a node (i.e. a generalization of the edit distance to labeled trees). Since XML documents are typically modeled as rooted labeled trees, the tree edit distance can also be used to measure the structural dissimilarity of XML documents.

Jaccard's Coefficient

Assuming two sets A and B, Jaccard's coefficient is defined as follows:

d(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}.

An example of an application is measuring the similarity of users' search interests considering the URLs they accessed while browsing the Internet. An application of this metric to vector data is called the Tanimoto similarity measure. An even more complicated distance measure defined on sets is the Hausdorff distance [86].
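A minimal illustration of Jaccard's coefficient as a set distance, with assumed data standing in for the URL sets mentioned above:

```python
def jaccard_distance(A, B):
    """d(A, B) = 1 - |A intersection B| / |A union B| for two non-empty sets."""
    A, B = set(A), set(B)
    return 1.0 - len(A & B) / len(A | B)

# Hypothetical sets of URLs visited by two users.
user1 = {"example.org/a", "example.org/b", "example.org/c"}
user2 = {"example.org/b", "example.org/c", "example.org/d"}
print(jaccard_distance(user1, user2))   # 1 - 2/4 = 0.5
```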

1.4 Access Methods for Metric Spaces

Relevant surveys on indexing techniques in metric spaces have been published by Chávez, Navarro, Baeza-Yates, and Marroquín in 2001 [46], by Hjaltason and Samet in 2003 [84] and by Zezula, Amato, Dohnal, and Batko in 2006 [191, ch. 2]. As in [191], we divide the individual techniques into four groups: Ball Partitioning, Generalized Hyperplane Partitioning, Exploiting Pre-Computed Distances, and Hybrid Indexing Approaches. We will not take into account techniques for approximate similarity search. For a survey of approximate similarity search see [191, ch. 2.5].

Ball Partitioning

Ball partitioning requires only one pivot and, provided the median distance is known, the resulting subsets contain the same amount of data. Numerous indexing approaches based on this simple concept have been defined. The most important techniques using ball partitioning are: Burkhard-Keller Tree (BKT) [34], Fixed Queries Tree (FQT) [14], Fixed Queries Array (FQA) [45], Vantage Point Tree (VPT) [188], and Excluded Middle Vantage Point Forest [189].

Generalized Hyperplane Partitioning

An approach orthogonal to ball partitioning is Generalized Hyperplane Partitioning. Two reference objects (pivots) are arbitrarily chosen and assigned to two distinct object subsets. All objects are assigned to the subset containing the pivot which is nearest to the object itself. In contrast to ball partitioning, generalized hyperplane partitioning does not guarantee a balanced split, and a suitable choice of reference points to achieve this objective is an interesting challenge. The most important techniques of this type are: Bisector Tree (BST) [92] and Generalized Hyperplane Tree (GHT) [176].
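To illustrate the two partitioning principles, the following sketch (assumed data and names, not an actual index structure) splits a small vector dataset once by the median distance from a single pivot (ball partitioning) and once by assigning each object to the nearer of two pivots (generalized hyperplane partitioning).

```python
import math
from statistics import median

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def ball_partition(X, d, pivot):
    """Split X by the median of the distances to a single pivot:
    both subsets contain (roughly) the same amount of data."""
    dists = [d(pivot, x) for x in X]
    m = median(dists)
    inside = [x for x, dx in zip(X, dists) if dx <= m]
    outside = [x for x, dx in zip(X, dists) if dx > m]
    return inside, outside, m

def hyperplane_partition(X, d, p1, p2):
    """Assign every object to the subset of its nearest pivot
    (no balance guarantee, unlike ball partitioning)."""
    s1 = [x for x in X if d(p1, x) <= d(p2, x)]
    s2 = [x for x in X if d(p1, x) > d(p2, x)]
    return s1, s2

X = [(0, 0), (1, 0), (0, 2), (3, 3), (4, 1), (5, 5)]
print(ball_partition(X, euclidean, pivot=(0, 0)))
print(hyperplane_partition(X, euclidean, p1=(0, 0), p2=(5, 5)))
```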

Exploiting Pre-Computed Distances

In [160] Shasha and Wang suggested using pre-computed distances between data objects. Pairwise distances which are not stored (typically those from a query object to database objects) are estimated as intervals using the pre-computed distances. This technique of storing and using pre-computed distances may be effective for datasets of small cardinality, but space requirements and search complexity become overwhelming for larger files. The most important techniques of this type are: Approximating and Eliminating Search Algorithm (AESA) [177, 178], Linear AESA (LAESA) [122], and Spaghettis [44]. LAESA and Spaghettis, in particular, store distances from objects to only a fixed number of pivots chosen with specific algorithms.

Hybrid Indexing Approaches

Indexing methods that employ pre-computed distances provide high performance boosts in terms of computational costs. However, the price to pay for their performance is their enormous space requirements. A straightforward remedy is to combine both the partitioning principle and the pre-computed distances technique into a single index structure. The most important hybrid indexing approaches are: Multi Vantage Point Tree (MVPT) [30], Geometric Near-neighbor Access Tree (GNAT) [31], Spatial Approximation Tree (SAT) [130, 132], Metric Tree (M-tree) [50] (see [191, ch. 3] for a survey of M-tree variants), Similarity Hashing (SH) [73], and D-Index [57].
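The pruning power shared by the pre-computed-distance and hybrid methods comes directly from the triangle inequality: for any pivot p, |d(q, p) - d(o, p)| ≤ d(q, o), so an object o can be discarded from a range query R(q, r) without computing d(q, o) whenever this lower bound exceeds r. The following sketch of such filtering, in the spirit of LAESA, uses assumed data and is not code from any of the structures listed above.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def build_pivot_table(X, d, pivots):
    """Pre-compute and store d(o, p) for every object o and pivot p."""
    return [[d(x, p) for p in pivots] for x in X]

def range_query_with_pivots(X, table, d, pivots, q, r):
    """R(q, r) with pivot filtering: d(q, x) is computed only for objects
    whose triangle-inequality lower bound does not already exceed r."""
    q_to_p = [d(q, p) for p in pivots]
    result = []
    for x, x_to_p in zip(X, table):
        lower_bound = max(abs(qp - xp) for qp, xp in zip(q_to_p, x_to_p))
        if lower_bound <= r and d(q, x) <= r:
            result.append(x)
    return result

# Hypothetical data: the pivots are simply two of the objects.
X = [(0, 0), (1, 1), (2, 0.5), (5, 5), (6, 6)]
pivots = [(0, 0), (6, 6)]
table = build_pivot_table(X, euclidean, pivots)
print(range_query_with_pivots(X, table, euclidean, pivots, q=(1, 0), r=1.5))
```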


Chapter 2

Distributed Indexes

During the last few years, numerous papers have been published about scalable indexes in distributed environments. As Aberer suggested in [3], two classes have been studied in the literature: Scalable and Distributed Data Structures (SDDSs) and Distributed Hash Tables (DHTs). SDDSs have been investigated for indexing scalable and distributed databases on workstation clusters. In most of the cases, they are variants of distributed search trees and hash-based access structures. SDDSs are characterized by a client-server architecture, a medium number of nodes, and typically by some form of global coordination, such as split coordinators or global directories. SDDSs are analyzed in more detail in Section 2.1. DHTs, which represent the most important class of structured Peer-to-Peer systems, have been studied for implementing global-scale resource access in Peer-to-Peer architectures. DHTs differ from SDDSs through their complete (or almost complete) degree of decentralization. They implement routing schemes for quickly locating resources identified by a data key in a network of N peers (typically in O(log(N)) time). The search can be started at any peer, without relying on a centralized directory. In Section 2.2 an introduction to the Peer-to-Peer paradigm is given, while unstructured and structured Peer-to-Peer systems are discussed in Section 2.3 and Section 2.4 respectively. Finally, in Section 2.5 we illustrate

the other three distributed similarity search structures that were developed at the same time we developed our MCAN.

2.1 Scalable and Distr. Data Structures

In 1993, the use of mass-produced PCs and workstations (WSs), interconnected through high-speed networks (10 Mb/s - 1 Gb/s), was becoming prevalent in many organizations. Multicomputers needed new system software, fully taking advantage of the distributed RAM and of parallel processing on multiple CPUs. As Litwin, Neimat, and Schneider said in [113], a frequent problem in developing such kind of software is that:

[...] while the use of too many sites may deteriorate performance, the best number of sites to use is either unknown in advance or could evolve during processing. Given a client/server architecture, we are interested in methods for gracefully adjusting the number of servers, i.e. the number of sites or processors involved.

Basically, the problem was to find data structures that would use the servers efficiently. Litwin, Neimat, and Schneider in [111, 113] defined the Scalable Distributed Data Structure (SDDS) to solve this problem. They defined an SDDS as a structure that meets three constraints:

1. A file expands to new servers gracefully, and only when servers already used are efficiently loaded.
2. There is no master site that object address computations must go through, e.g. to access a centralized directory.
3. The file access and maintenance primitives, e.g. search, insertion, split, etc., never require atomic updates to multiple clients.

As Zezula, Amato, Dohnal, and Batko in [191], we refer to these three constraints respectively as: scalability, no hot-spot and independence. In [112] the same authors gave a slightly different definition of an SDDS:

An SDDS stores data on some sites called servers, and is used from some sites called clients, basically distinct, but not necessarily. [...] Every SDDS must respect three design requirements.

1. First, no central directory is used for data addressing, to avoid a hot-spot.
2. Next, a client image can be outdated, being updated only through messages, called Image Adjustment Messages (IAMs). These are sent only when a client makes an addressing error. Client autonomy is preserved in this way, which is a major requirement for multicomputers [126].
3. Finally, a client with an outdated image can send a key to an incorrect server, in which case the structure should deliver it to the right server, and trigger an IAM.

While the no hot-spot and independence properties, as they are called in [191], are present in both definitions (i.e. [111, 113] and [112]), the scalability property is missing here. Obviously SDDSs are still considered scalable (it's in the name), but the way in which scalability is achieved is no longer specified. Instead, in [112], the authors gave a further specification of the message behavior. Essentially, an SDDS must be tolerant to client addressing errors. This property was already present in the LH* [111], but it was not considered in the SDDS definition. From the very beginning it was clear that there are two important goals in SDDSs: minimizing the messages and maximizing the load factor. As Litwin, Neimat, and Schneider said in [111],

to make an SDDS efficient, one should minimize the messages exchanged through the network, while maximizing the load factor.

The SDDSs studied in the literature can be grouped into four classes:

- SDDSs for hashed files: Distributed Linear Hashing (LH*) [111, 56, 113, 109, 105, 179, 93]. They extend the more traditional dynamic hash data structures, especially the popular linear hashing [106, 107, 159] used, e.g., in the Netscape browser, and dynamic hashing [99], to network multicomputers and switched multicomputers.

- SDDSs for ordered files: RP* (a family of Order Preserving SDDSs) [112], Distributed Random Tree (DRT) [97]. They extend the traditional ordered data structures, B-trees or binary trees, to multicomputers.

- SDDSs for multi-attribute (multi-key) files: k-RP*s [110], [129]. These algorithms extend the traditional k-d data structures [156] to multicomputers. The performance of multi-attribute search may be improved by several orders of magnitude.

- High-availability SDDSs: LH*s [105, 109]. They have no known counterpart among the traditional data structures. They are designed to transparently survive failures of server sites. They apply principles of mirroring, striping, or grouping of records, revised to support file scalability.

A number of systems have been built using SDDSs, by Litwin et al. (see [108]), as well as the P-Grid system at EPFL in Switzerland [4] and the Snowball distributed file system at the University of Kentucky [180].

2.2 Peer-to-Peer Systems

The popularity of Peer-to-Peer has led to a number of (often contradictory) definitions. Tim O'Reilly in [139, p.40] says:

Ultimately, Peer-to-Peer is about overcoming the barriers to the formation of ad hoc communities, whether of people, of programs, of devices, or of distributed resources.

The definition [186] currently adopted by Wikipedia [187] says that:

A Peer-to-Peer (or P2P) computer network is a network that relies primarily on the computing power and bandwidth of the participants in the network rather than concentrating it in a relatively low number of servers.

Shirky in [161, p.22] defines Peer-to-Peer as follows (his definition has also been adopted by Ian Foster in [71]):

Peer-to-Peer is a class of applications that takes advantage of resources (storage, cycles, content, human presence) available at the edges of the Internet.

Shirky's definition, as suggested by Androutsellis-Theotokis and Spinellis in [11], encompasses systems that completely rely upon centralized servers for their operation (such as SETI@home, various instant messaging systems, or even the notorious Napster), as well as various applications from the field of Grid computing. Steinmetz refined Shirky's definition in [163]:

[a Peer-to-Peer system is] a self-organizing system of equal, autonomous entities (peers) [which] aim for the shared usage of distributed resources in a networked environment avoiding central services.

Moreover, Steinmetz and Wehrle in [165] give an equivalent, shorter definition:

[a Peer-to-Peer system is] a system with completely decentralized self-organization and resource usage.

The Steinmetz and Wehrle definition is the one that we prefer and thus it will be used in the rest of the thesis. While the other ones can be more expressive, the Steinmetz and Wehrle definition is the most technical one. In [11], the lack of agreement on a definition (or rather the acceptance of various different definitions) is attributed to the fact that systems or applications are labeled Peer-to-Peer not because of their internal operation or architecture, but because they give the impression of providing direct interaction between computers. As a result, different definitions of Peer-to-Peer are applied to accommodate the various different cases of such systems or applications. However, they too propose a definition:

Peer-to-peer systems are distributed systems consisting of interconnected nodes able to self-organize into network topologies with the purpose of sharing resources such as content, CPU cycles, storage and bandwidth, capable of adapting to failures and accommodating transient populations of nodes while maintaining acceptable connectivity and performance, without requiring the intermediation or support of a global centralized server or authority.

This definition is meant to encompass degrees of centralization ranging from the pure, completely decentralized systems such as Gnutella, to partially centralized systems such as Kazaa.

Peer-to-Peer gained visibility with Napster's support for music sharing on the Web and its lawsuit with the music companies. However, the Peer-to-Peer approach is by no means just a technology for file sharing. Rather, it forms a fundamental design principle for distributed systems. It clearly reflects the paradigm shift from coordination to cooperation, from centralization to decentralization, and from control to incentives. Incentive-based systems raise a large number of important research issues. Finding a fair balance between give and take among peers may be crucial to the success of this technology. [164] There are a number of books published about Peer-to-Peer systems [138, 123, 68, 54, 128, 17, 169, 100], mobile networks [140, 85], and Peer-to-Peer Information Retrieval [see 7, 173]. Good surveys of Peer-to-Peer systems are [124, 11].

Characterization

As suggested by Steinmetz and Wehrle in [165], Peer-to-Peer systems are characterized as follows (though a single system rarely exhibits all of these properties):

Decentralized Resource Usage:

1. Resources of interest (bandwidth, storage, processing power) are used in a manner as equally distributed as possible and are located at the edges of the network, close to the peers. Thus, with regard to network topology, Peer-to-Peer systems follow the end-to-end arguments [155], which are among the main reasons for the success of the Internet.

2. Within a set of peers, each utilizes the resources provided by other peers. The most prominent examples of such resources are storage and processing capacity. Other possible resources are connectivity, human presence, or geographic proximity

(with instant messaging and group communication as application examples).

3. Peers are interconnected through a network and are in most cases distributed globally.

4. A peer's Internet address typically changes, so the peer is not constantly reachable at the same address (transient connectivity). Often, peers may be disconnected or shut down over longer periods of time. Among other reasons, this encourages Peer-to-Peer systems to introduce new address and name spaces above the traditional Internet address level. Hence, content is usually addressed through unstructured identifiers derived from the content with a hash function. Consequently, data is no longer addressed by location (the address of the server) but by the data itself. With multiple copies of a data item, queries may locate any one of those copies. Thus Peer-to-Peer systems locate data based on content, in contrast to location-based routing in the Internet.

Decentralized Self-Organization:

1. In order to utilize shared resources, peers interact directly with each other. In general, this interaction is achieved without any central control or coordination. Peer-to-Peer systems establish a cooperation between equal partners.

2. Peers directly access and exchange the shared resources they utilize without a centralized service. Thus Peer-to-Peer systems represent a fundamental decentralization of control mechanisms. However, performance considerations may lead to centralized elements being part of a complete Peer-to-Peer system, e.g. for efficiently locating resources. Such systems are commonly called server-based or centralized Peer-to-Peer systems (see Figure 2.1 [p.23]).

3. In a Peer-to-Peer system, peers can act both as clients and servers. This is radically different from traditional systems

with asymmetric functionality. It leads to additional flexibility with regard to available functionality and to new requirements for the design of Peer-to-Peer systems.

4. Peers are equal partners with symmetric functionality. Each peer is fully autonomous regarding its respective resources.

5. Ideally, resources can be located without any central entity or service. Similarly, the system is controlled in a self-organizing or ad hoc manner. As mentioned above, this guideline may be violated for reasons of performance. However, the decentralized nature should not be violated. The result of such a mix is a Peer-to-Peer system with a server-based approach.

[Figure 2.1: Summary of the characteristic features of Client-Server and Peer-to-Peer networks]

As shown in Figure 2.1 (taken from [59]), in a Client-Server system the server is the only provider of service or content. The peers (clients) in this context only request content or service from the server. The clients do not provide any service or content to

run this system. Thus, generally the clients are lower-performance systems and the server is a high-performance system. In contrast, in Peer-to-Peer systems all resources, i.e. the shared content and services, are provided by the peers. Some central facility may still exist, e.g. to locate a given content. A peer in this context is simply an application running on a machine [59], which may be a personal computer, a handheld or a mobile phone. In contrast to a Client-Server network, it is generally not possible to distinguish between a content requestor (client) and a content provider, as one application participating in the overlay in general offers content to other peers and requests content from other participants. This is often expressed by the term servent, composed of the first syllable of the term Server and the second syllable of the term Client. Using this basic concept, Figure 2.1 outlines various possibilities currently used. In the first generation of centralized Peer-to-Peer systems, some central server is still available. However, this server only stores the IP addresses of peers where some content is available. The address of that server must be known to the peers in advance. This concept was widely used and became especially well known due to Napster, offering free music downloads by providing the addresses of peers sharing the desired content. This approach subsequently lost much of its importance due to legal issues. As a replacement for that scheme, decentrally organized schemes, generally termed pure Peer-to-Peer, such as Gnutella 0.4 and Freenet [52], became widely used. These schemes do not rely on any central facility (except possibly for some bootstrap server to ease joining such a network), but rely on flooding the desired content identifier over the network. Contacted peers that share that content will then respond to the requesting peer, which will subsequently initiate a separate download session. These schemes generate a potentially huge amount of signaling traffic by flooding the requests. To avoid that, schemes like Gnutella 0.6 or JXTA [76] introduced a hierarchy by defining superpeers, which store the content available at the connected peers together with their IP addresses. The

superpeers are often able to answer incoming requests by immediately providing the respective IP address, so that on average fewer hops are required in the search process, thus reducing the signaling traffic. These second generation Peer-to-Peer systems are generally named hybrid.

The lookup problem

Because of their completely decentralized character, the distributed coordination of resources is a major challenge in Peer-to-Peer systems. One of the research problems is:

How do you find any given data item x in a large Peer-to-Peer system in a scalable manner, without any centralized server or hierarchy? [15]

More generally, x may be some (small) data item, the location of some bigger content, or coordination data, e.g. the current status of a node, or its current IP address, etc. This problem is at the heart of any Peer-to-Peer system and is called the lookup problem. More precisely, the lookup problem can be defined as:

Given a data item x stored at some dynamic set of nodes in the system, find it. [15]

As underlined in [181], interesting questions about Peer-to-Peer systems and the lookup problem are: Where should a node store a given data item x? How do other nodes discover the location of x? How can the distributed system be organized to assure scalability and efficiency? (In the context of Peer-to-Peer systems, the distributed system, i.e. the collection of participating nodes pursuing the same purpose, is often called the overlay network or overlay system.) Four strategies have been proposed in the literature to store and retrieve data in distributed systems: server-based, hybrid Peer-to-Peer, flooding search (also called pure Peer-to-Peer), and distributed indexing (also referred to as structured Peer-to-Peer, the approach adopted by DHTs).
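To contrast distributed indexing with flooding, the sketch below shows the basic idea behind it: a key is derived from the data item itself and mapped onto the peer responsible for it on a circular identifier space. This is a deliberately simplified, assumed illustration (not the scheme of CAN or of any specific DHT); real DHTs additionally maintain routing tables so that the responsible peer is reached in roughly O(log N) hops without global knowledge of the overlay.

```python
import hashlib
from bisect import bisect_right

def content_key(data: bytes) -> int:
    """Derive an identifier for a data item from its content (content addressing)."""
    return int(hashlib.sha1(data).hexdigest(), 16)

class ToyRing:
    """A toy key-to-peer mapping on a circular identifier space."""

    def __init__(self, peer_names):
        # every peer takes a position on the ring derived from its name
        self.ring = sorted((content_key(name.encode()), name) for name in peer_names)
        self.positions = [pos for pos, _ in self.ring]

    def responsible_peer(self, key: int) -> str:
        """The first peer clockwise from the key is responsible for storing it."""
        i = bisect_right(self.positions, key) % len(self.ring)
        return self.ring[i][1]

ring = ToyRing(["peer-A", "peer-B", "peer-C", "peer-D"])
key = content_key(b"some shared file content")
print(ring.responsible_peer(key))   # the peer that should index this data item
```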

Figure 2.2: Comparison of the strategies to store and retrieve data in distributed systems.

In Figure 2.2 [181] the three basic strategies (the hybrid approach is not included) are compared with respect to the communication overhead and the storage cost per node. Bottlenecks and special characteristics of each approach are also named. Often, systems that adopt the centralized and pure approaches are referred to as first-generation Peer-to-Peer systems, while the flooding search is considered typical of second-generation Peer-to-Peer systems. Both first- and second-generation Peer-to-Peer systems are generally termed unstructured, because the content stored on a given node and its IP address are unrelated and do not follow any specific structure. Instead, systems that adopt the distributed indexing approach are referred to as structured Peer-to-Peer systems. In Section 2.3 [p.27] and Section 2.4 [p.29] we analyze unstructured and structured Peer-to-Peer systems, respectively.

2.3 Unstructured Peer-to-Peer Systems

Both first-generation (i.e. server-based and pure) and second-generation (i.e. hybrid) Peer-to-Peer systems (see Figure 2.1 [p.23]) are generally termed unstructured, because the content stored on a given node and its IP address are unrelated and do not follow any specific structure. The server-based approach was adopted by the first Peer-to-Peer-based file sharing applications. The data was transferred directly between peers, only after looking up the location of a data item via the server (Centralized P2P, cf. Figure 2.1 [59]). First-generation Peer-to-Peer systems, such as Napster, maintain the current locations of data items in a central server. After joining the Peer-to-Peer system, a participating node submits to the central server information about the content it stores and/or the services it offers. Thus requests are simply directed to the central server, which responds to the requesting node with the current location of the data. The transmission of the located content is then organized in a Peer-to-Peer fashion between the requesting node and the node storing the requested data. The centralized server approach has the advantage of retrieving the location of the desired information with a search complexity of O(1). Also, fuzzy and complex queries are possible, since the server has a global overview of all available content. However, the central server is a critical element within the whole system concerning scalability and availability. The complexity in terms of memory consumption is O(|X|), with |X| representing the number of items available in the distributed system. The server also represents a single point of failure and attack. The second generation of Peer-to-Peer systems (e.g. Gnutella and Freenet [52]) pursued an opposite approach. They keep no explicit information about the location of data items in other nodes. This means that there is no additional information concerning where to find a specific item in the distributed system. Thus, to retrieve

a data item, the only chance is to ask as many participating nodes as necessary whether or not they presently have the requested data. Thus, in such Peer-to-Peer systems, a request is broadcast among the nodes of the distributed system. If a node receives a query, it floods this message to other nodes until a certain hop count (Time-To-Live, TTL) is exceeded. Often, the general assumption is that content is replicated multiple times in the network, so a query may be answered in a small number of hops. A well-known example of such an application is Gnutella [1]. Gnutella includes several mechanisms to avoid request loops, but it is obvious that such a broadcast mechanism does not scale well. The number of messages and the bandwidth consumed is extremely high and increases more than linearly with increasing numbers of participants. In fact, after the central server of Napster was shut down in July 2001 due to a court decision [185], an enormous number of Napster users migrated to the Gnutella network within a few days, and under this heavy network load the system collapsed [15]. The advantage of flooding-based systems, such as Gnutella, is that there is no need for proactive efforts to maintain the network. Also, imprecise (fuzzy) queries can be placed, and the nodes implicitly exploit proximity due to the expanding search mechanism. Furthermore, no particular effort has to be made when nodes join or leave the network. But still, the complexity of looking up and retrieving a data item is O(|X|^2), or even higher, and search results are not guaranteed, since the lifetime of request messages is restricted to a limited number of hops. On the other hand, the storage cost is in the order of O(1), because data is only stored on the nodes actually providing it (whereby multiple sources are possible) and no information for a faster retrieval of data items is kept in intermediate nodes. Overall, flooding search is an adequate technique for file-sharing-like purposes and complex queries. It is apparent that neither approach scales well. The server-based system suffers from exhibiting a single point of attack as well as being a bottleneck with regard to resources such as memory, processing power, and bandwidth, while the flooding-based approaches show tremendous bandwidth consumption on the network.
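The flooding mechanism described above can be summarized in a few lines of code. The following Python sketch is only an illustration of the principle under simplifying assumptions (a static in-memory overlay, no real networking); the class and method names are hypothetical and are not taken from Gnutella or any other system.

# A minimal sketch of flooding search with a Time-To-Live (TTL) counter,
# as used by Gnutella-like unstructured Peer-to-Peer systems.
# All names are illustrative, not taken from any real implementation.

class Peer:
    def __init__(self, name):
        self.name = name
        self.neighbors = []      # direct overlay links to other Peer objects
        self.objects = {}        # locally shared content: identifier -> data

    def query(self, identifier, ttl, visited=None):
        """Flood a lookup for 'identifier'; return the names of peers holding it."""
        visited = visited if visited is not None else set()
        if self.name in visited:         # loop avoidance: do not process twice
            return []
        visited.add(self.name)
        hits = [self.name] if identifier in self.objects else []
        if ttl > 0:                      # forward only while the TTL is not exhausted
            for neighbor in self.neighbors:
                hits += neighbor.query(identifier, ttl - 1, visited)
        return hits

# Example: a -> b -> c, with c sharing "song.mp3"; a.query("song.mp3", ttl=2)
# returns ["c"], while ttl=1 would not reach c and the result would be empty.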

Generally, these unstructured systems were developed in response to user demands (mainly file sharing and instant messaging) and consequently suffer from ad hoc designs and implementations.

2.4 Structured Peer-to-Peer Systems

In contrast to the unstructured Peer-to-Peer systems, Peer-to-Peer approaches have also been proposed which establish a link between the stored content and the IP address of a node. These systems are generally termed structured Peer-to-Peer systems (see Figure 2.1 [p.23]). The link between a content identifier and the IP address is usually based on Distributed Hash Tables (DHTs).

Introduction to DHTs

As Wehrle, Götz, and Rieche said in [181], a Distributed Hash Table manages data by distributing it across a number of nodes and implementing a routing scheme which allows efficient lookup of the node on which a specific data item is located. In contrast to flooding-based searches in unstructured systems, each node in a DHT becomes responsible for a particular range of data items. Also, each node stores a partial view of the whole distributed system, which effectively distributes the routing information. The DHT functionality is already proving to be a useful substrate for large distributed systems. DHTs have already been used for distributed file systems [58, 151, 98], application-layer multicast [146, 195], and event notification services [36, 42]. The routing procedure, using the partial view stored by each node, typically traverses several nodes, getting closer to the destination until the destination node is reached. Thus DHTs follow a

proactive strategy for data retrieval by structuring the search space and providing a deterministic routing scheme. As reported in [181], overall, DHTs possess the following characteristics:

- In contrast to unstructured Peer-to-Peer systems, each DHT node manages only a small number of references to other nodes. Typically, these are O(log(N)) references, where N denotes the number of nodes in the system.
- By mapping nodes and data items into a common address space, routing to a node leads to the data items for which that node is responsible.
- Queries are routed via a small number of nodes to the target node. Because of the small set of references each node manages, a data item can be located by routing via O(log(N)) hops. The initial node of a lookup request may be any node of the DHT.
- By distributing the identifiers of nodes and data items nearly equally throughout the system, the load for retrieving items should be balanced equally among all nodes.
- Because no node plays a distinct role within the system, the formation of hot-spots or bottlenecks can be avoided. Also, the departure or dedicated elimination of a node should have no considerable effects on the functionality of a DHT. Therefore, DHTs are considered to be very robust against random failures and attacks.
- A distributed index provides a definitive answer about results. If a data item is stored in the system, the DHT guarantees that the data is found.

Address space

DHTs introduce new address spaces into which data is mapped. Address spaces typically consist of large integer values. DHTs

achieve distributed indexing by assigning a contiguous portion of the address space to each participating node (Figure 7.6). Given a value from the address space, the main operation provided by a DHT system is the lookup function (see Subsection [p.25]). The major difference between DHT approaches is how they internally manage and partition their address space. In most cases, these schemes lend themselves to geometric interpretations of address spaces. In a DHT system, each data item is assigned an identifier ID, a unique value from the address space. This value can be chosen freely by the application, but it is often derived from the data itself (e.g. the complete binary file or the file name) via a collision-resistant hash function, such as SHA-1 [137]. Thus the DHT would store the file at the node responsible for the portion of the address space which contains the identifier. Based on the lookup function, most DHTs also implement a storage interface similar to a hash table. Thus the put function accepts an identifier and arbitrary data to store the data. This identifier and the data are often referred to as a (key, value)-tuple. Symmetrically, the get function retrieves the data associated with a specified identifier. DHTs can be used for a wide variety of applications. Applications are free to associate arbitrary semantics with identifiers, e.g. hashes of search keywords, database indexes, geographic coordinates, hierarchical directory-like binary names, etc. Thus such diverse applications as distributed file systems, distributed databases, and routing systems have been developed on top of DHTs, e.g. application-layer multicast [95], systems [125], storage systems [98, 58], indirection infrastructures [166], etc.
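To make the (key, value) interface concrete, the following Python sketch models the put and get operations of a DHT on a single machine: identifiers are derived from the data with SHA-1 and each node is responsible for a contiguous portion of the 160-bit address space. It is a toy model under stated assumptions, not the interface of any particular DHT, and all names are hypothetical.

import hashlib
from bisect import bisect_left

# A toy, non-distributed model of the DHT storage interface: SHA-1 maps
# arbitrary data to an identifier, and each node covers the part of the
# address space up to and including its own ID (with wrap-around).

ADDRESS_BITS = 160                       # SHA-1 produces 160-bit identifiers

def identifier(data: bytes) -> int:
    """Derive a key from the data itself via a collision-resistant hash."""
    return int(hashlib.sha1(data).hexdigest(), 16)

class ToyDHT:
    def __init__(self, node_ids):
        self.node_ids = sorted(node_ids)
        self.storage = {node: {} for node in self.node_ids}

    def _responsible(self, key: int) -> int:
        """First node ID not smaller than the key, wrapping around the ring."""
        idx = bisect_left(self.node_ids, key % (2 ** ADDRESS_BITS))
        return self.node_ids[idx % len(self.node_ids)]

    def put(self, key: int, value: bytes):
        self.storage[self._responsible(key)][key] = value

    def get(self, key: int):
        return self.storage[self._responsible(key)].get(key)

# Usage: store a file under the hash of its name.
# dht = ToyDHT(node_ids=[10, 2**80, 2**150])
# k = identifier(b"thesis.pdf"); dht.put(k, b"...contents..."); dht.get(k)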

Load balancing

Most DHT systems attempt to spread the load of routing messages and of storing data evenly over the participating nodes [150, 135]. However, as suggested in [181], there are at least three reasons why some nodes in the system may experience higher loads than others:

- a node manages a very large portion of the address space,
- a node is responsible for a portion of the address space with a very large number of data items, or
- a node manages data items which are particularly popular.

Under these circumstances, additional load-balancing mechanisms can help to spread the load more evenly over all nodes. For example, a node may transfer responsibility for a part of its address space to other nodes, or several nodes may manage the same portion of the address space. In [149] load-balancing schemes are discussed in detail.

Routing

The fundamental principle behind the large variety of routing approaches implemented by DHT systems is to provide each node with a limited view of the whole system. Moreover, a bounded number of links to other nodes is stored on each node. When a node receives a message for a destination ID for which it is not itself responsible, it forwards the message to one of these other nodes it stores information about. This process is repeated recursively until the destination node is found. The routing algorithm and the routing metric determine the next-hop node. A typical metric is that of numeric closeness, i.e. messages are always forwarded to the node managing the identifiers numerically closest to the destination ID. A challenge when designing routing algorithms and metrics is to ensure that node failures and incorrect routing information have only limited or little impact on routing correctness and system stability. A key difference between the algorithms is the data structure that they use as a routing table to provide O(log(N)) lookups. Chord maintains a data structure that resembles a skiplist. Each node in Kademlia, Pastry, and Tapestry maintains a tree-like data structure. Viceroy maintains a butterfly data structure, which requires information about only a constant number of other nodes, while

still providing O(log(N)) lookups. A recent variant of Chord, Koorde [91], uses de Bruijn graphs and requires each node to know only about two other nodes, while also providing O(log(N)) lookups.

Data Storage

There are two possibilities for storing data in a DHT. In a DHT which uses direct storage, the data is copied upon insertion to the node responsible for it. In this case the node which inserted the data can subsequently leave the DHT without the data becoming unavailable. However, there is an overhead in terms of storage and network bandwidth. Since nodes may fail, the data must be replicated to several nodes to increase its availability. The other possibility is to store references to the data. The inserting node only places a pointer to the data into the DHT. The data itself remains on the inserting node. However, the data is only available as long as that node is available. In both cases, a node using the DHT for lookup purposes does not have to be part of the DHT in order to use its services. Thus it is possible to realize a DHT service as a third-party infrastructure service (see the OpenDHT project [148]).

Node Arrival

Generally speaking, it takes four steps for a node to join a DHT.

1. The new node has to get in contact with the DHT.
2. The new node gets to know some arbitrary node of the DHT using a bootstrap method (this node is used as an entry point to the DHT until the new node is an equivalent member of the DHT).
3. A partition in the logical address space is assigned to the new node (the routing information in the system needs to be updated to reflect the presence of the new node).
4. The new node retrieves all (key, value) pairs under its responsibility from the node that stored them previously.

Depending on the DHT implementation, a node may or may not be able to choose arbitrary or specific partitions on its own.

Node Failure

Because DHTs are often composed of poorly connected desktop machines, failures are usually assumed to occur frequently. Thus all non-local operations in a DHT need to resist failures of other nodes, reflecting the self-organizing design of DHT algorithms. For example, routing and lookup procedures are typically designed to use alternative routes toward the destination when a failed node is encountered on the default route. Many DHTs also employ proactive recovery mechanisms, e.g. to maintain their routing information. Consequently, they periodically probe other nodes to check whether these nodes are still operational. Furthermore, node failures lead to a re-partitioning of the DHT's address space. This may in turn require (key, value)-pairs to be moved between nodes and additional maintenance operations such as adaptation to new load-balancing requirements. When a node fails, the application data that it stored is lost unless the DHT uses replication to keep multiple copies on different nodes. Some DHTs follow the simpler soft-state approach which does not guarantee persistence of data. Data items are pruned from the DHT unless the application refreshes them periodically. Therefore, a node failure leads to a temporary loss of application data until the data is refreshed.

Node Departure

Nodes which voluntarily leave a DHT could be treated the same as failed nodes. However, DHT implementations often require departing nodes to notify the system before leaving. This allows other nodes to copy application data from the leaving node and to immediately update their routing information, leading to improved routing efficiency.

Performance measures

In [91] five performance measures for DHTs are considered:

- Degree: the number of neighbors with which a node must maintain continuous contact;
- Hop Count: the number of hops needed to get a message from any source to any destination;
- The degree of Fault Tolerance: what fraction of the nodes can fail without eliminating data or preventing successful routing;
- The Maintenance Overhead: how often messages are passed between nodes and neighbors to maintain coherence as nodes join and depart;
- The degree of Load Balance: how evenly keys are distributed among the nodes, and how much load each node experiences as an intermediate node for other routes.

It is important to note that optimizing one of these measures tends to put pressure on the others. In other words, we agree with [181] that the design challenges are:

- Routing efficiency: the latency of routing and lookup operations is influenced by the topology of the address space, the routing algorithm, the number of references to other nodes, the awareness of the IP-level topology, etc.
- Management overhead: the costs of maintaining the Distributed Hash Table under no load depend on such factors as the number of entries in routing tables, the number of links to other nodes, and the protocols for detecting failures.
- Dynamics: a large number of nodes joining and leaving a Distributed Hash Table concurrently, often referred to as churn, puts particular stress on the overall stability of the system, reducing routing efficiency, incurring additional management traffic, or even resulting in partitioned or defective systems.

Figure 2.3: Chord identifier circle (ring) consisting of ten nodes storing five keys.

Chord

Chord, which is probably the most famous DHT, has been defined in [167] and [168]. The elegance of the Chord algorithm derives from its simplicity. In Chord, the keys are l-bit identifiers (IDs) that form a one-dimensional circle. Both data items and nodes are associated with an ID. The (key, value) pair (k, v) is hosted by the first node whose ID is greater than or equal to the key k. Such a node is called the successor of key k. In other words, a node in a Chord circle with clockwise increasing IDs is responsible for all keys that precede it counter-clockwise. In Chord each node stores its successor node on the identifier circle. When a key is being looked up, each node forwards the query to its successor in the identifier circle until one of the nodes determines that the key lies between itself and its successor. The successor is communicated as the result of the query back to the originator (the key must be hosted by this successor). To achieve scalable key lookup, each node also maintains information about non-successor nodes in a routing table called a finger table.

Given a circle with l-bit identifiers, a finger table has a maximum of l entries. Thus its size is independent of the number of keys or nodes forming the DHT. Each finger entry consists of a node ID, an IP address and port pair, and possibly some book-keeping information. The routing information from the finger tables provides information about nearby nodes and a coarse-grained view of long-distance links at intervals increasing by powers of two. Using the finger tables, queries are routed over large distances on the identifier circle in a single hop. In fact, a node forwards queries for a given key k to the closest predecessor of k according to its finger table. Furthermore, the closer the query gets to k, the more accurate the routing information of the intermediate nodes on the location of k becomes. Given the power-of-two intervals of finger IDs, each hop covers at least half of the remaining distance on the identifier circle between the current node and the target identifier. This results in an average of O(log(N)) routing hops for a Chord circle with N participating nodes. Stoica et al. show that the average lookup requires 1/2 log(N) steps. In order to join a Chord identifier circle, the new node first determines some identifier n. The original Chord protocol does not impose any restrictions on this choice. There have been several proposals to restrict node IDs according to certain criteria, e.g. to exploit network locality or to avoid identity spoofing. For the new node n, another node n' must be known which already participates in the Chord system. By querying n' for n's own ID, n retrieves its successor. It notifies its successor n_s of its presence, leading to an update of the predecessor pointer of n_s to n. Node n then builds its finger table by iteratively querying n' for the successors of n + 2^1, n + 2^2, n + 2^3, etc. At this stage, n has a valid successor pointer and finger table. However, n does not show up in the routing information of other nodes. In particular, it is not known to its predecessor as its new successor, since the lookup algorithm is not apt to determine a node's predecessor. Chord introduces a stabilization protocol to validate and update successor pointers as nodes join and leave the system. Stabilization requires an additional predecessor pointer and is performed periodically on every node.

With the stabilization protocol, the new node n does not actively determine its predecessor. Instead, the predecessor itself has to detect and fix inconsistencies of successor and predecessor pointers. After the new node has thus learnt of its predecessor, it copies all keys it is responsible for. At this stage, all successor pointers are up to date and queries can be routed correctly, albeit slowly. Since the new node n is not present in the finger tables of other nodes, they forward queries to the predecessor of n even if n would be more suitable. Node n's predecessor then needs to forward the query to n via its successor pointer. Chord updates finger tables lazily. Each node periodically picks a finger randomly from the finger table at index i (1 < i ≤ l) and looks it up to find the true current successor of n + 2^(i-1). Chord addresses node failures on several levels. When a node detects a failure of a finger during a lookup, it chooses the next best preceding node from its finger table. Failed nodes are removed from the finger tables. It is particularly important to maintain the accuracy of the successor information, as the correctness of lookups depends on it. To maintain a valid successor pointer in the presence of multiple simultaneous node failures, each node holds a successor list of a certain length that contains the node's first successors. When a node detects the failure of its successor, it reverts to the next live node in its successor list. The Chord ring is affected only if all nodes from a successor list fail simultaneously. The successor of a failed node becomes responsible for the keys and data of the failed node. Thus an application utilizing Chord can replicate data to successor nodes to reduce the risk of data loss. Chord can use the successor list to communicate this information and possible changes to the application. In Chord a leaving node transfers its keys to its successor and notifies its successor and predecessor. This ensures that data is not lost and that the routing information remains intact. Important variations of Chord are Symphony (see Subsection [p.46]) and Viceroy (see Subsection [p.47]). A recent variant of Chord using de Bruijn graphs is Koorde [91], which requires each node to know only about two other nodes, while also providing O(log(N)) lookups.
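The Chord routing procedure described above can be illustrated with a small, centrally simulated model of an identifier circle. The sketch below builds the finger tables of a few nodes and routes a lookup by repeatedly jumping to the closest preceding finger; the names, the 8-bit identifier length and the overall structure are illustrative assumptions only and do not reproduce the actual Chord implementation.

# A simplified, centrally simulated Chord-style lookup over an l-bit circle.
# Real Chord exchanges messages between nodes; here the whole ring is local.

L = 8
RING = 2 ** L

def in_interval(x, a, b):
    """True if x lies in the circular interval (a, b]."""
    return (a < x <= b) if a < b else (x > a or x <= b)

def successor(ring_ids, key):
    """First node ID that is >= key, wrapping around the circle."""
    for n in sorted(ring_ids):
        if n >= key:
            return n
    return min(ring_ids)

class Node:
    def __init__(self, node_id, ring_ids):
        self.id = node_id
        # finger[i] points to successor(id + 2^i) on the circle, i = 0 .. L-1
        self.fingers = [successor(ring_ids, (node_id + 2 ** i) % RING)
                        for i in range(L)]

def lookup(nodes, start_id, key):
    """Greedy routing: jump to the closest preceding finger of the key."""
    current = nodes[start_id]
    while not in_interval(key, current.id, current.fingers[0]):
        nxt = current.id
        for f in reversed(current.fingers):        # try the largest jump first
            if f != key and in_interval(f, current.id, key):
                nxt = f
                break
        if nxt == current.id:                      # no closer finger is known
            break
        current = nodes[nxt]
    return current.fingers[0]                      # successor(key) hosts the key

# ring = [10, 60, 130, 200]
# nodes = {i: Node(i, ring) for i in ring}
# lookup(nodes, start_id=10, key=135)              # -> 200, the node hosting 135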

Figure 2.4: Routing table of a Pastry node with nodeId 65a1x, b = 4. Digits are in base 16; x represents an arbitrary suffix. The IP address associated with each entry is not shown.

In [72], Chord has been extended to solve multi-dimensional attributed range queries by transforming the original vector space into a single-dimensional domain. The method is called Space-filling Curves with Range Partitioning (SCRAP) and uses space-filling curves to perform the mapping.

Pastry

The Pastry distributed routing system was proposed by Rowstron and Druschel in [152]. In Pastry the routing is based on numeric closeness of identifiers. In their work, the authors focus not only on the number of routing hops, but also on network locality as a factor in routing efficiency. In Pastry, nodes and data items are uniquely associated with l-bit identifiers (l is typically 128). Pastry views identifiers as strings of digits to the base 2^b, where b is typically chosen to be 4. A key is located on the node to whose node ID it

is numerically closest. Pastry's node state is divided into three main elements. The routing table, similar to Chord's finger table, stores links into the identifier space. The leaf set contains nodes which are close in the identifier space (like Chord's successor list). Nodes that are close together in terms of network locality are listed in the neighborhood set. Pastry measures network locality based on a given scalar network proximity metric which is assumed to be already available from the network infrastructure. This metric might range from IP hops to the actual geographical location of nodes. A Pastry node's routing table is made up of l/b rows with 2^b - 1 entries per row. On a node n, the entries in row i hold the identities of Pastry nodes whose node IDs share an i-digit prefix with n but differ in the (i + 1)-th digit. When there is no node with an appropriate prefix, the corresponding table entry is left empty. As in Chord, in Pastry a node has a coarse-grained knowledge of other nodes which are distant in the identifier space. The detail of the routing information increases with the proximity of other nodes in the identifier space. Without a large number of nearby nodes, the last rows of the routing table are only sparsely populated. Intuitively, the identifier space would need to be fully exhausted with node IDs for complete routing tables on all nodes. In a system with N nodes, only log_{2^b}(N) routing table rows are populated on average. In populating the routing table, there is a choice from the set of nodes with the appropriate identifier prefix. During the routing process, network locality can be exploited by selecting nodes which are close in terms of the network proximity metric. To increase lookup efficiency, the leaf set L of a node n holds the |L| nodes numerically closest to n. The routing table and the leaf set are the two sources of information relevant for routing. The leaf set also plays a role similar to Chord's successor lists in recovering from failures of adjacent nodes.
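The row/column addressing of the routing table can be sketched in a few lines. The example below matches the layout of Figure 2.4 (b = 4, hexadecimal digits), but the function names and the fall-back behaviour are simplified assumptions, not the actual Pastry code.

# A schematic model of the Pastry routing table with b = 4 (hexadecimal digits).
# routing_table[i][d] holds the ID of a node that shares the first i digits with
# the local node and has digit d at position i; None marks an empty entry.

B = 4
DIGITS = "0123456789abcdef"              # identifier digits in base 2**B

def shared_prefix_length(a: str, b: str) -> int:
    """Number of leading digits that two identifiers have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def routing_table_hop(local_id: str, key: str, routing_table):
    """Pick the next hop for a key outside the leaf set:
    row = length of the shared prefix, column = the key's next digit."""
    row = shared_prefix_length(local_id, key)
    if row == len(key):
        return local_id                   # the key equals the local identifier
    column = DIGITS.index(key[row])
    entry = routing_table[row][column]
    # If the entry is empty, Pastry falls back to any known node that shares a
    # prefix of the same length but is numerically closer to the key.
    return entry if entry is not None else local_id

# For a node 65a1... routing a key 65b2..., the shared prefix has length 2, so
# the next hop is taken from row 2, column 'b', as in Figure 2.4.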

Instead of numeric closeness, the neighborhood set M is concerned with nodes that are close to the current node with regard to the network proximity metric. Thus it is not involved in routing itself but in maintaining network locality in the routing information. Routing in Pastry is divided into two main steps:

1. A node checks whether the key k is within the range of its leaf set. If this is the case, it implies that k is located on one of the nearby nodes of the leaf set.
2. The node forwards the query to the leaf set node numerically closest to k. In case this is the node itself, the routing process is finished.

If k does not fall into the range of leaf set nodes, the query needs to be forwarded over a longer distance using the routing table. In this case, a node n tries to pass the query on to a node which shares a longer common prefix with k than n itself. If there is no such entry in the routing table, the query is forwarded to a node which shares a prefix with k of the same length as n but which is numerically closer to k than n. Pastry optimizes two aspects of routing and locating the node responsible for a given key: it attempts both to achieve a small number of hops to reach the destination node, and to exploit network locality to reduce the overhead of each individual hop. For a detailed description of Pastry see [152, 37, 39, 41, 38, 114, 79]. An open-source implementation of Pastry is FreePastry [175]. A scalable application-level multicast application called Scribe [42] has been built upon Pastry. In [125] ePost, a Peer-to-Peer system built using Pastry, is presented.

Tapestry

Tapestry, defined by Zhao, Kubiatowicz, and Joseph in [193], is very similar to Pastry (see Subsection [p.39]) but differs in its approach to mapping keys to nodes in the sparsely populated ID space, and in how it manages replication. In Tapestry, there is no leaf set and neighboring nodes in the namespace are not aware of each other. When a node's routing table does not have an entry for

a node that matches a key's n-th digit, the message is forwarded to the node in the routing table with the next higher value in the n-th digit, modulo 2^b. This procedure, called surrogate routing, maps keys to a unique live node if the node routing tables are consistent. For fault tolerance, Tapestry inserts replicas of data items using different keys. The expected number of routing hops is log_16(N). Tapestry has also been described in [194]. OceanStore, a utility infrastructure designed to span the globe and provide continuous access to persistent information, presented in [98], uses Tapestry. In [195], an application-level multicast system for Tapestry called Bayeux is presented.

Chimera

Chimera [13] is a light-weight C implementation of a structured overlay that provides functionality similar to that of the prefix-routing protocols Tapestry (see Subsection [p.41]) and Pastry (see Subsection [p.39]). Chimera gains simplicity and robustness from its use of Pastry's leaf sets, and efficient routing from Tapestry's locality algorithms. In addition to these properties, Chimera also provides efficient detection of node and network failures, and reroutes messages around them to maintain connectivity and throughput.

Z-Ring

Z-Ring [104] is a fast prefix routing protocol for Peer-to-Peer overlay networks. The main idea in Z-Ring is to classify peers into different closed groups at each routing level. Z-Ring uses efficient membership maintenance to support one- or two-hop key-based routing in large dynamic networks. The analysis and simulations presented in [104] show that it provides efficient routing with very low maintenance overhead. Most of the DHTs show a logarithmic number of hops, which means that they scale well. However, with growing network size, the latency they incur can be substantial in practice. Z-Ring tries to achieve both goals of low routing hops and low maintenance costs in an

adaptive system by integrating Peer-to-Peer routing with efficient membership maintenance algorithms. On each peer, Z-Ring uses a membership protocol to maintain a large routing table, which is the set of active peers belonging to the same group. The authors started with Pastry (see Subsection [p.39]) and expanded its prefix routing base b from 16 to 4096. Using b = 4096, they can achieve one-hop routing with 4096 nodes and two-hop routing across 16 million peers. While traditional protocols require a node to periodically probe its links to each of its neighbors, and thus make the maintenance of routing entries with b = 4096 infeasible, Z-Ring leverages cost-efficient membership protocols to maintain routing entries and detect link or node failures.

Content-Addressable Network (CAN)

We analyze the CAN, over which we built our MCAN, in detail in Chapter 3 [p.63].

Kademlia

Presented in [118], Kademlia is probably the most used DHT. The following are known implementations of Kademlia (see [184] for up-to-date information):

- Overnet network, used in MLDonkey and eDonkey2000 (which is no longer available);
- Kad Network, used by eMule (from v0.40), MLDonkey (from v2.5-28) and aMule (from v2.1.0);
- RevConnect (from v0.403);
- KadC, a C library to publish and retrieve information to and from the Overnet network;

Figure 2.5: An example of a Kademlia topology.

- Khashmir, a Python implementation of Kademlia;
- BitTorrent Azureus DHT, a modified Kademlia implementation for decentralized tracking and various other features like the Comments and Ratings plugin, used by Azureus (from v );
- BitTorrent Mainline DHT, used by the BitTorrent client (from v4.1.0), µTorrent (from v1.2), BitSpirit (from v3.0), BitComet (from v0.59) and KTorrent.

They all share a DHT based on an implementation of the Kademlia algorithm for trackerless torrents. Many of Kademlia's benefits result from its use of an XOR metric for distance between points in the key space. XOR is symmetric, allowing Kademlia participants to receive lookup queries from precisely the same distribution of nodes contained in their routing tables. Without this property, systems such as Chord (see Subsection

[p.36]) do not learn useful routing information from queries they receive. Worse yet, asymmetry leads to rigid routing tables. Each entry in a Chord node's finger table must store the precise node preceding some interval in the ID space. Any node actually in the interval would be too far from nodes preceding it in the same interval. Kademlia, in contrast, can send a query to any node within an interval, allowing it to select routes based on latency or even send parallel, asynchronous queries to several equally appropriate nodes. In their work on Kademlia [118], Maymounkov and Mazières observe a mismatch in the design of Pastry: its routing metric (identifier prefix length) does not necessarily correspond to the actual numeric closeness of identifiers. As a result, Pastry requires two routing phases, which impacts routing performance and complicates formal analysis. Thus Kademlia uses an XOR routing metric which improves on these problems and optionally offers additional parallelism for lookup operations. Kademlia most resembles Pastry's (see Subsection [p.39]) first phase, which successively finds nodes roughly half as far from the target ID by Kademlia's XOR metric. In a second phase, however, Pastry switches distance metrics to the numeric difference between IDs. It also uses the second, numeric difference metric in replication. Unfortunately, nodes close by the second metric can be quite far by the first, creating discontinuities at particular node ID values, reducing performance, and complicating attempts at formal analysis of worst-case behavior. Kademlia's XOR metric measures the distance between two IDs by interpreting the result of the bit-wise exclusive OR function on the two IDs as an integer. For example, the distance between the identifiers 3 and 5 is 6. Considering the shortest unique prefix of a node identifier, this metric effectively treats nodes and their identifiers as the leaves of a binary tree. For each node, Kademlia further divides the tree into subtrees not containing the node, as illustrated in Figure 2.5 [182]. Each node knows of at least one node in each of the subtrees. A query for an identifier is forwarded to the subtree with the longest matching prefix until the destination node is reached. Similar to Chord, this halves the remaining identifier space to search in each step and implies a routing latency of O(log(N)) routing hops on average.
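The XOR metric and the subtree (k-bucket) to which a contact belongs can be computed directly on the integer identifiers, as in the following sketch; the function names are illustrative assumptions, not Kademlia's actual API.

# The Kademlia XOR distance and the index of the subtree not containing the
# local node in which another identifier falls. Names are illustrative only.

def xor_distance(a: int, b: int) -> int:
    """Distance between two identifiers: their bitwise XOR read as an integer."""
    return a ^ b

def bucket_index(own_id: int, other_id: int) -> int:
    """Position of the most significant differing bit, i.e. the subtree index."""
    d = xor_distance(own_id, other_id)
    return d.bit_length() - 1 if d else -1   # -1: the two identifiers coincide

# xor_distance(3, 5) == 6, as in the example above; the metric is symmetric
# (xor_distance(5, 3) == 6 as well) and satisfies the triangle inequality.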

In many cases, a node knows of more than a single node per subtree. Similar to Pastry, the Kademlia protocol suggests forwarding queries to α nodes per subtree in parallel. By biasing the choice of nodes towards short round-trip times, the latency of the individual hops can be reduced. With this scheme, a failed node does not delay the lookup operation. However, bandwidth usage is increased compared to linear lookups. When choosing remote nodes in other subtrees, Kademlia favors old links over nodes that only recently joined the network. This design choice is based on the observation that nodes with long uptime have a higher probability of remaining available than fresh nodes. This increases the stability of the routing topology and also prevents good links from being flushed from the routing tables by distributed denial-of-service attacks, as can be the case in other DHT systems. With its XOR metric, Kademlia's routing has been formally proved consistent and achieves a lookup latency of O(log(N)). The required amount of node state grows with the size of a Kademlia network. However, it is configurable and, together with the adjustable parallelism factor, allows for a trade-off between node state, bandwidth consumption, and lookup latency.

Symphony

The Symphony protocol, defined in [116], is a variation of Chord (see Subsection [p.36]) that exploits the small-world phenomenon. In Symphony each node establishes only a constant number of links to other nodes. This basic property of Symphony significantly reduces the amount of per-node state and network traffic when the overlay topology changes. However, with an increasing number of nodes, it does not scale as well as Chord. In Symphony, the Chord finger table is replaced by a constant but configurable number k of long-distance links which are chosen randomly according to harmonic distributions (hence the

name Symphony). Effectively, the harmonic distribution of long-distance links favors large distances in the identifier space for a system with few nodes and decreasingly smaller distances as the system grows. In Symphony a query is forwarded to the node with the shortest distance to the destination key. Symphony additionally employs a 1-lookahead approach. The lookahead table of each node records those nodes which are reachable through the successor, predecessor, and long-distance links, i.e. the neighbors of a node's neighbors. Instead of routing greedily, a node forwards messages to its direct neighbor (not a neighbor's neighbor) which promises the best progression towards the destination. This reduces the average number of routing hops by 40% at the expense of management overhead when nodes join or leave the system. The main contribution of Symphony is its constant-degree topology, resulting in very low costs of per-node state and of node arrivals and departures. It also utilizes bidirectional links between nodes and bidirectional routing. Symphony's routing performance is O((1/k) log^2(N)). However, nodes can vary the number of links they maintain to the rest of the system during run-time based on their capabilities, which is not permitted by the original designs of Chord, Pastry, and CAN.

Viceroy

In 2002, Malkhi, Naor, and Ratajczak proposed Viceroy [115], another variation on Chord (see Subsection [p.36]). Viceroy maintains a butterfly data structure, which requires information about only a constant number of other nodes. Viceroy improves on the original Chord algorithm through a hierarchical structure of the ID space with constant degree which approximates a butterfly topology. This results in less per-node state and less management traffic but slightly lower routing performance than Chord. Viceroy borrows from Chord's fundamental ring topology with successor and predecessor links on each node. It also introduces a new node state called a level. The routing procedure is split into three phases closely related

to the available routing information.

Figure 2.6: An ideal Viceroy network. Up and ring links are omitted for simplicity.

First, a query is forwarded to level one along the uplinks. Second, a query recursively traverses the downlinks towards the destination. On each level, it chooses the downlink which leads to a node closer to the destination, without overshooting it in the clockwise direction. After reaching a node without downlinks, the query is forwarded along ring-level and successor links until it reaches the target identifier. The authors of Viceroy show that this routing algorithm yields an average number of O(log(N)) routing hops. Like Symphony (see Subsection [p.46]), Viceroy features a constant-degree linkage in its node state. However, every node establishes seven links, whereas Symphony keeps this number configurable even at run-time. Furthermore, and similar to Chord, the rigid layout of the identifier space requires more link updates than Symphony when nodes join or leave the system. At the same time, the scalability of

its routing latency of O(log(N)) surpasses that of Symphony, while not approaching that of Chord, Pastry, and CAN.

Figure 2.7: Performance comparison of DHT systems. The columns show the averages for the number of routing hops during a key lookup, the amount of per-node state, and the number of messages when nodes join or leave the system.

DHTs Comparison

In Figure 2.7 we report the performance comparison results of the most important DHTs. These results were presented by Wehrle, Götz, and Rieche in [182]. Another comparison of DHTs can be found in [103].

DHTs Related Works

A great number of papers about particular aspects of DHTs can be found, e.g. topology-aware routing in [40], scalable application-level multicast in [43], complex queries in DHTs in [81], and load balancing in Peer-to-Peer systems in [74] and [171]. An important project using Chord is MINERVA. In MINERVA each peer is considered autonomous and has its own local search engine with a crawler and a corresponding local index. Peers share

their local indexes (or specific fragments of local indexes) by posting meta-information into the Peer-to-Peer network. This meta-information contains compact statistics and quality-of-service information, and effectively forms a global directory. However, this directory is implemented in a completely decentralized and largely self-organizing manner. More specifically, it is maintained as a distributed hash table (DHT) using Chord. The MINERVA per-peer engine uses the global directory to identify candidate peers that are most likely to provide good query results. A query posed by a user is forwarded to other peers for better result quality. The local results obtained from there are merged by the query initiator. The MINERVA project has been presented in [26], while related papers are [27, 24, 23, 25, 28, 47, 141, 120, 119]. In [170] the Threshold Algorithm [60, 63] was applied to the query processing problem in a Peer-to-Peer environment in which inverted lists are mapped to a set of peers, by employing an underlying DHT substrate. In [192] the work was extended by defining five different policies for distributed query evaluation: Simple Algorithm, Distributed Threshold Algorithm, Bloom Circle Threshold Algorithm, Bloom Petal Threshold Algorithm (BPTA) and Simple Bloom Petal Algorithm (SBPA). From the experiments they conducted, the BPTA approach appears to perform best for queries with up to 4 terms, while SBPA is better for longer queries. In [174] an information retrieval system called pSearch has been defined. It uses a technique called latent semantic indexing to reduce the dimensionality of the space and a singular value decomposition to transform and truncate the matrix of term vectors computed in the previous step. The obtained lower-dimensional vector space is then distributed using CAN (see Chapter 3 [p.63]).

P-Grid

While unstructured Peer-to-Peer systems have generated substantial interest because of their self-organization, structured overlay networks like DHTs typically require a higher degree of coordination among the nodes while constructing and maintaining the

overlay network. However, self-organizing processes can also be used in the context of structured overlay networks. With such an approach, structural properties are not guaranteed through localized operations, but emerge as a global property from a self-organization process. P-Grid, first presented by Aberer in [2], adopts this approach. P-Grid uses self-organizing processes for the initial network construction, to achieve load-balancing properties, as well as for maintenance, to keep the structural properties of the overlay network intact during changes in the physical network. In P-Grid the key space is recursively bisected to achieve a balanced workload of the partitions. Bisecting the key space induces a canonical tree structure which is used as the basis for implementing a standard, distributed prefix routing scheme for efficient search. Therefore P-Grid is a tree-based Peer-to-Peer network. The P-Grid authors claim that it differs from other DHT approaches in terms of practical applicability (especially in respect to dynamic network environments), algorithmic foundations (randomized algorithms with probabilistic guarantees), robustness, and flexibility. In [55] the problem of executing range queries over a structured overlay network on top of a tree abstraction is discussed. The approach is evaluated using P-Grid. Other P-Grid related papers are [2, 8, 4, 5, 6].

Small-World and Scale-Free

A small-world network is a generalization of the small-world phenomenon to non-social networks. Formally, it is a class of networks where most nodes are not neighbors of one another, but most nodes can be reached from every other node by a small number of hops or steps. In [94] a family of Small-World Access Methods (SWAM) has been proposed for efficient execution of various similarity-search queries, namely exact match, Range and Nearest Neighbor, on vector spaces with L_p metrics (see Subsection [p.9]). Furthermore, the authors also describe the SWAM-V structure that partitions the data

space in a Voronoi-like manner of neighboring cells.

Other Works

In [121] KLEE, a novel algorithmic framework for distributed top-k queries, has been presented. It addresses the efficient processing of top-k queries in wide-area distributed data repositories where the index lists for the attribute values (or text terms) of a query are distributed across a number of data peers. In KLEE each data item has associated with it a set of descriptors, text terms or attribute values, and there is a precomputed score for each pair of data item and descriptor. The inverted index list for one descriptor is the list of data items in which the descriptor appears, sorted in descending order of scores. These index lists are the distribution granularity of the distributed system. Each index list is assigned to one peer (or, if we wish to replicate it, to multiple peers). P2PR-tree (Peer-to-Peer R-tree), defined in [127], is a spatial index specifically designed for Peer-to-Peer systems. It is hierarchical and performs efficient pruning of the search space by maintaining a minimal amount of information concerning peers that are far away and storing more information concerning nearby peers.

2.5 Metric Peer-to-Peer Structures

Very recently, four scalable distributed similarity search structures for metric data have been proposed. The first two structures adopt the basic ball and generalized hyperplane partitioning principles [176] and they are called the VPT and the GHT, respectively (see Subsection [p.53]). The other two apply transformation strategies: the metric similarity search problem is transformed into a series of range queries executed on existing distributed keyword structures, namely the CAN (described in Chapter 3 [p.63]) and the Chord (described in Subsection [p.36]). By analogy, they have been called the MCAN, which is the object of this thesis, and the M-Chord (see Subsection [p.57]). Each of the structures is able to execute

similarity queries for any metric dataset, and they all exploit parallelism for query execution. Results of numerous experiments, conducted on implementations of the VPT, GHT, MCAN and M-Chord systems over the same infrastructure of peer computers, have been reported in [22]. In this thesis they are reported in Chapter 6 [p.117].

GHT and VPT

In this section, we describe two distributed metric index structures: the GHT [18, 20, 21, 19] and its extension called the VPT [22]. Both of them exploit native metric partitioning principles, using them to build a distributed binary tree [97]. In both the GHT and the VPT, the dataset is distributed among peers participating in the network. Every peer holds sets of objects in its storage areas called buckets. A bucket is a limited space dedicated to storing objects, e.g. a memory segment or a block on a disk. The number of buckets managed by a peer depends on its own potential. Since both structures are dynamic and new objects can be inserted at any time, a bucket on a peer may reach its capacity limit. In this situation, a new bucket is created and some objects from the full bucket are moved to it. This new bucket may be located on a peer different from the original one. Thus the structures grow as new data come in. The core of the algorithm lays down a mechanism for locating the respective peers which hold requested objects. The component of the structure responsible for this navigation is called the Address Search Tree (AST), an instance of which is present at every peer. Whenever a peer wants to access or modify the data in the GHT structure, it must first consult its own AST to get the locations, i.e. the peers, where the data resides. Then, it contacts those peers via network communication to actually process the operation. Since we are in a distributed environment, it is practically impossible to maintain a precise address for every object at every peer. Thus the ASTs at the peers contain only limited navigation information, which may be imprecise.

Figure 2.8: Address Search Tree with the generalized hyperplane partitioning.

The locating step is repeated on the contacted peers whenever their AST is imprecise, until the desired peers are reached. The algorithm guarantees that the destination peers are always found. Both of these structures also provide a mechanism called image adjustment for updating the imprecise parts of the AST automatically. For more technical details see [21].

Address Search Tree

The AST is a binary search tree based on the Generalized Hyperplane Tree (GHT) [176] in the GHT structure, and on the Vantage Point Tree (VPT) [176] in the VPT structure. Its inner nodes hold the routing information according to the partitioning principle and each leaf node represents a pointer to either a local bucket (denoted as BID) or a remote peer (denoted as NNID) where the data of the respective partition is located. An example of an AST using the generalized hyperplane partitioning is depicted in Figure 2.8. In order to divide a set of objects I = {x_1, ..., x_22} into two separated partitions I_1, I_2 using the generalized hyperplane, we must first select a pair of objects from the set. In Figure 2.8, we select the objects x_10, x_12 and promote them to pivots of the first level of the AST. Then, the original set I is split

by measuring the distance between every object x ∈ I and both the pivots. If d(x, x_10) ≤ d(x, x_12), i.e. the object x is closer to the pivot x_10, the object is assigned to the partition I_1, and vice versa. This principle is used recursively until all the partitions are small enough, and a binary tree representing the partitioning is built accordingly. Figure 2.8 shows an example of such a tree. Observe that the leaf nodes are denoted by BID_i and NNID_i symbols. This means that the corresponding partition (which is small enough to stop the recursion) is stored either in a local bucket or on a remote peer, respectively.

Figure 2.9: Address Search Tree with the vantage point partitioning.

The vantage point partitioning, which is used by the VPT structure, can be seen in Figure 2.9. In general, this principle also divides a set I into two partitions I_1 and I_2. However, only one pivot x_11 is selected from the set and the objects are divided by a radius r_1. More specifically, if the distance between the pivot x_11 and an object x ∈ I is smaller than or equal to the specified radius r_1, i.e. if d(x, x_11) ≤ r_1, then the object belongs to the partition I_1. Otherwise, the object is assigned to I_2. Similarly, the algorithm is applied recursively to build a binary tree. The leaf nodes follow the same schema for addressing local buckets and remote peers.
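Both partitioning rules can be expressed as simple predicates over an arbitrary metric d, as in the following sketch; the function names are illustrative assumptions, and ties are broken towards the first pivot.

# The two partitioning rules used to build the Address Search Tree.
# 'd' is any metric distance function; names are illustrative only.

def generalized_hyperplane_split(objects, p1, p2, d):
    """Assign each object to the partition of its closer pivot,
    as in the example with pivots x_10 and x_12 above."""
    left = [x for x in objects if d(x, p1) <= d(x, p2)]
    right = [x for x in objects if d(x, p1) > d(x, p2)]
    return left, right

def vantage_point_split(objects, pivot, radius, d):
    """Assign objects inside the ball of the given radius to the first
    partition and the remaining objects to the second one."""
    inner = [x for x in objects if d(x, pivot) <= radius]
    outer = [x for x in objects if d(x, pivot) > radius]
    return inner, outer

# Applying either split recursively, until the partitions fit into buckets,
# yields the binary tree whose leaves are BID or NNID pointers.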

Range Search

The R(q, r) query search in both the GHT and VPT structures proceeds as follows. The evaluation starts by traversing the local AST of the peer which issued the query. For every inner node in the tree, we evaluate the following conditions. Having the generalized hyperplane partitioning with an inner node of format <p_1, p_2>:

d(p_1, q) - r ≤ d(p_2, q) + r,    (2.1)
d(p_1, q) + r > d(p_2, q) - r.    (2.2)

For the vantage point partitioning with an inner node of format <p, r_p>:

d(p, q) - r ≤ r_p,    (2.3)
d(p, q) + r > r_p.    (2.4)

The right subtree of the inner node is traversed if Condition 2.1 for the GHT or Condition 2.3 for the VPT qualifies. The left subtree is traversed whenever Condition 2.2 or Condition 2.4 holds, respectively. It is clear that both conditions may qualify at the same time for a particular range search. Therefore, multiple paths may be followed and, finally, multiple leaf nodes may be reached. For all qualifying paths having an NNID pointer in their leaves, the query request is forwarded to the identified peers until a BID pointer is found in every leaf. The range search condition is evaluated by the peers in every bucket determined by the BID pointers. All qualifying objects together form the query response set. In order to avoid some distance computations, both structures apply additional filtering using Equation 4.2 [p.80]. For every stored object, the distances to all pivots on the AST path from the root to the leaf with the respective bucket are held together with the data object. For example, object x_1 in Figure 2.8 has four associated numbers, the distances d(x_1, x_10), d(x_1, x_12), d(x_1, x_8), d(x_1, x_19), which were evaluated during the insertion of x_1 into bucket BID_1. In the case of the VPT structure, only half of the distances are stored, because only one pivot is present in every inner node.
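Returning to the traversal decision above, Conditions 2.1-2.4 translate directly into two small predicates; the following sketch mirrors the text (names are illustrative assumptions, and when both flags are set both subtrees are visited).

# Which subtrees of an AST inner node a range query R(q, r) must visit,
# following Conditions 2.1-2.4. Both flags may hold at the same time.

def ght_subtrees_to_visit(q, r, p1, p2, d):
    """Generalized hyperplane node <p1, p2>."""
    visit_right = d(p1, q) - r <= d(p2, q) + r   # Condition 2.1
    visit_left = d(p1, q) + r > d(p2, q) - r     # Condition 2.2
    return visit_left, visit_right

def vpt_subtrees_to_visit(q, r, p, r_p, d):
    """Vantage point node <p, r_p>."""
    visit_right = d(p, q) - r <= r_p             # Condition 2.3
    visit_left = d(p, q) + r > r_p               # Condition 2.4
    return visit_left, visit_right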

As is obvious, the deeper the bucket where the object is stored, the more precomputed distances are stored for that particular object and the better the effect of the filtering.

Figure 2.10: The principles of idistance.

M-Chord

Similarly to the MCAN, the M-Chord [136] approach also transforms the original metric space. The core idea is to map the data space into a one-dimensional domain and navigate in this domain using the Chord routing protocol [167]. Specifically, this approach exploits the idea of a vector index method, idistance [88, 190], which partitions the data space into clusters (C_i), identifies reference points (p_i) within the clusters, and defines a one-dimensional mapping of the data objects according to their distances from the cluster reference point. Having a separation constant c, the idistance key for an object x ∈ C_i is idist(x) = d(p_i, x) + i · c. Figure 2.10a visualizes the mapping schema. When handling an R(q, r) query, the space to be searched is specified by idistance intervals for those clusters that intersect the query sphere; see an example in Figure 2.10b. This method is generalized to metric spaces in the M-Chord. No vector coordinate system can be utilized in order to partition

a general metric space; therefore, a set of M pivots p_0, ..., p_{M-1} is selected from a sample dataset and the space is partitioned according to these pivots. The partitioning is done in a Voronoi-like manner [84] (every object is assigned to its closest pivot). Because the idistance domain is to be used as the key space for the Chord protocol, the domain is transformed by an order-preserving hash function h into the M-Chord domain of size 2^m. The distribution of h is uniform on a given sample dataset. Thus, for an object x ∈ C_i, 0 ≤ i < M, the M-Chord key-assignment formula becomes:

m-chord(x) = h(d(p_i, x) + i · c).    (2.5)

The M-Chord Structure

Having the data space mapped into the one-dimensional M-Chord domain, every active node of the system takes over the responsibility for an interval of keys. The structure of the system is formed by the Chord circle (see Subsection [p.36]). This Peer-to-Peer protocol provides an efficient localization of the node responsible for a given key. When inserting an object x ∈ D into the structure, the initiating node n_ins computes the m-chord(x) key using Formula 2.5 and employs Chord to forward a store request to the node responsible for the key m-chord(x) (see Figure 2.11a). The nodes store data in a B+-tree storage according to their M-Chord keys. When a node reaches its storage capacity limit (or another defined condition), it requests a split. A new node is placed on the M-Chord circle, so that the requester's storage can be split evenly.

Range Search Algorithm

The node n_q that initiates the R(q, r) query uses the idistance pruning idea to choose the M-Chord intervals to be examined. The Chord protocol is then employed to reach the nodes responsible for the middle points of these intervals. The request is then spread to all nodes covering the particular interval (see Figure 2.11b).
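A compact sketch of Formula 2.5 and of the interval selection used by the range search is given below. It is only an illustration under simplifying assumptions: the order-preserving hash h is modelled by the identity function, each cluster is clipped to its segment of width c instead of its actual cluster radius, and all names are hypothetical rather than the actual M-Chord code.

# The M-Chord key assignment (Formula 2.5) and the key intervals that a range
# query R(q, r) must examine, in the spirit of the idistance pruning.

def closest_pivot(x, pivots, d):
    """Voronoi-like assignment: index i of the pivot closest to x."""
    return min(range(len(pivots)), key=lambda i: d(x, pivots[i]))

def m_chord_key(x, pivots, c, d, h=lambda v: v):
    """Formula 2.5: m-chord(x) = h(d(p_i, x) + i * c) for x in cluster C_i."""
    i = closest_pivot(x, pivots, d)
    return h(d(pivots[i], x) + i * c)

def query_intervals(q, r, pivots, c, d, h=lambda v: v):
    """For every cluster that may intersect the query ball, the keys to examine
    lie in [h(i*c + d(q, p_i) - r), h(i*c + d(q, p_i) + r)]. Here the interval
    is clipped to the cluster segment [i*c, (i+1)*c]; a full implementation
    would clip to the actual cluster radius instead."""
    intervals = []
    for i, p in enumerate(pivots):
        low = max(i * c, i * c + d(q, p) - r)
        high = min((i + 1) * c, i * c + d(q, p) + r)
        if low <= high:                  # the query ball may reach cluster C_i
            intervals.append((h(low), h(high)))
    return intervals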

[Figure 2.11: The insert (a) and range search (b) operations. Panel (a) sketches insert(x): k := m-chord(x); forward(k, x); and receive(k, x): store(k, x).]

From the metric point of view, the idistance pruning technique filters out all objects x ∈ C_i that fulfill |d(x, p_i) − d(q, p_i)| > r. But in M-Chord, when inserting an object x, all distances d(x, p_i), 0 ≤ i < M, have to be computed. These values are stored together with the object x, and the general metric filtering criterion improves the pruning of the search space.

2.6 Peer-to-Peer and Grid Computing

Another approach to distributed computing which has become very important in recent years is Grid computing. The term Grid computing originated in the early 1990s as a metaphor for making computer power as easy to access as an electric power grid. Many definitions exist; in [70] Foster reports previous definitions and summarizes their essence in a three-point checklist:

- Computing resources are not administered centrally.
- Open standards are used.
- Non-trivial quality of service is achieved.

Peer-to-Peer and Grid computing have a lot in common. As Foster and Iamnitchi say:

Two supposedly new approaches to distributed computing have emerged in the past few years, both claiming to address the problem of organizing large-scale computational societies: Peer-to-Peer and Grid computing. Both approaches have seen rapid evolution, widespread deployment, successful application, considerable hype, and a certain amount of (sometimes warranted) criticism. The two technologies appear to have the same final objective -- the pooling and coordinated use of large sets of distributed resources -- but are based in different communities and, at least in their current designs, focus on different requirements. [71]

In the same paper they compare and contrast Peer-to-Peer and Grid computing. In brief, they argue that:

- both are concerned with the organization of resource sharing within virtual communities;
- both take the same general approach to solving the creation of overlay structures that coexist with, but need not correspond in structure to, underlying organizational structures;
- each has made genuine technical advances, but each also has, in current instantiations, crucial limitations: "Grid computing addresses infrastructure but not yet failure, whereas Peer-to-Peer addresses failure but not yet infrastructure"; and
- the complementary nature of the strengths and weaknesses of the two approaches suggests that the interests of the two communities are likely to grow closer over time.

We agree with Androutsellis-Theotokis and Spinellis that:

As Grid systems increase in scale, they begin to require solutions to issues of self-configuration, fault tolerance, and scalability, for which Peer-to-Peer research has much to offer. Peer-to-Peer systems, on the other hand, focus on dealing with instability, transient populations, fault tolerance, and self-adaptation. To date, however, Peer-to-Peer developers have worked mainly on vertically integrated applications, rather than seeking to define common protocols and standardized infrastructures for interoperability. [11]

At the end of these considerations they agree with Foster [71] about the fact that Grid computing addresses infrastructure but not yet failure, whereas Peer-to-Peer addresses failure but not yet infrastructure. However, like Foster in [69], they believe that, as Peer-to-Peer technologies move into more sophisticated and complex applications (e.g. structured content distribution, desktop collaboration, and network computation), there will be a strong convergence between Peer-to-Peer and Grid computing. The result will be a new class of technologies combining elements of both Peer-to-Peer and Grid computing, which will address scalability, self-adaptation, and failure recovery, while, at the same time, providing a persistent and standardized infrastructure for interoperability.


Chapter 3

Content-Addressable Network (CAN)

Defined in [145], the Content-Addressable Network (CAN) is a DHT (see Section 2.4 [p.29]). The CAN routing algorithm provides the DHT functionality while meeting the previously enumerated design goals of scalability, efficiency, dynamicity and balanced load. The basic operations performed on a CAN are the insertion, lookup and deletion of (key, value) pairs. Each CAN node stores a chunk (called a zone) of the entire hash table. In addition, a node holds information about a small number of adjacent zones in the table. Requests (insert, lookup, or delete) for a particular key are routed by intermediate CAN nodes toward the CAN node whose zone contains that key. The CAN design is completely distributed (requiring no form of centralized control, coordination or configuration), scalable (nodes maintain only a small amount of control state that is independent of the number of nodes in the system), and fault-tolerant (nodes can route around failures). Finally, the CAN design can be implemented entirely at the application level.

The CAN design centers around a virtual M-dimensional Cartesian coordinate space on an M-torus. This coordinate space has no relation to any physical coordinate system. At any point in time, the entire coordinate space is dynamically partitioned among all the nodes in the system. A node learns and maintains the IP addresses

of those nodes that hold coordinate zones adjoining its own zone. This set of immediate neighbors serves as a coordinate routing table that enables routing between arbitrary points in the coordinate space.

[Figure 3.1: A 5-node CAN and its corresponding partition tree.]

The Cartesian space serves as a level of indirection. The virtual coordinate space is used to store (key, value) pairs by deterministically mapping the key into a point x̂ in the coordinate space using a uniform hash function. The corresponding pair is then stored at the node that owns the zone within which the point x̂ lies. To retrieve an entry corresponding to a given key, any node can apply the same deterministic hash function and then retrieve the corresponding value from the point x̂. If the point x̂ is not owned by the requesting node or its immediate neighbors, the request must be routed through the CAN infrastructure until it reaches the node in whose zone x̂ lies.

In this chapter a short introduction to the CAN is given. A further description can be found in Ratnasamy's PhD Thesis [144].

3.1 Node arrivals

Any node that joins the CAN must:

- be allocated its own portion of the coordinate space;
- discover its neighbors in the space.

In the CAN, the entire space is divided among the nodes currently in the system. The first node to join owns the entire CAN space (i.e. the complete virtual space). Each time a new node joins the CAN, an existing zone is split into two halves, one of which is assigned to the new node. The split is done by following a predefined ordering of the dimensions in deciding along which dimension a zone is to be split, so that zones can be re-merged when nodes leave. For example, for a 2-dimensional space, a zone would first be split along the first dimension, then along the second, then the first again followed by the second and so forth. Thus, at any given step, we can think of each existing zone as a leaf of a binary partition tree. The internal nodes in the tree represent zones that no longer exist, but were split at some previous time. The children of a tree node are the two zones into which it was split (see Figure 3.1 [144]). Labeling the edges with a predefined rule (e.g. 0 if the child zone occupies the lower half, and 1 otherwise), a zone's position (i.e. the zone's coordinate span along each dimension) in the coordinate space is completely defined by the path from the root of the partition tree to the leaf node corresponding to that zone. Every node in the CAN is addressed with a virtual identifier (VID), the binary string representing the path from the root to the leaf node. The partition tree is not maintained as a data structure, and none of the CAN operations requires a node to have knowledge of the entire partition tree. The tree is just a useful conceptual aid to understanding the structure of nodes in a CAN. Considering the tree, the allocation of a portion of the coordinate space to a node can be seen as obtaining a unique VID, and the discovery of neighbors as discovering its neighbors' VIDs and IP addresses.

We can summarize the process of joining the CAN in three steps:

1. the new node must find a node already in the CAN;

2. using the CAN routing mechanisms, it must find a node whose zone will be split;
3. the neighbors of the split zone must be notified so that routing can include the new node.

In this chapter we do not take into consideration the bootstrap mechanism, because the functioning of a CAN does not depend on the details of how this is done. For a description of how the bootstrap was implemented in the first implementation of the CAN see [144].

Finding a Zone

To find a node to ask for a split, the joining node chooses a point x̂ in the space and sends a join request destined for point x̂. This message is sent into the CAN via any existing CAN node. Each CAN node then uses the CAN routing mechanism (described later) to forward the message, until it reaches the node in whose zone x̂ lies. In the CAN definition, two approaches have been taken into consideration for what happens once the owner node has been reached. The simpler one just splits the owner node's zone. The other one, namely the 1-hop volume check, makes use of the fact that the owner node knows not only its own zone coordinates, but also those of its neighbors. Therefore, instead of directly splitting its own zone, the existing occupant node first compares the volume of its zone with those of its immediate neighbors in the coordinate space. The zone with the largest volume is then split. In [144] a comparison of these two approaches is given. Results showed that the 1-hop volume check gives significantly better results. Because the performance of the 1-hop volume check depends on the average number of neighbors, its advantages become more and more significant as the dimensionality of the CAN space increases.
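As a rough illustration of the 1-hop volume check, the following Python sketch picks the zone to split among the owner and its immediate neighbors (the names and the zone representation are assumptions for illustration, not the original CAN code).

```python
# Illustrative sketch of the 1-hop volume check (not the original CAN code).
# A zone is assumed to be a list of (low, high) spans, one per dimension.

def zone_volume(zone):
    vol = 1.0
    for low, high in zone:
        vol *= (high - low)
    return vol

def pick_zone_to_split(owner_zone, neighbor_zones):
    """Return the zone with the largest volume among the owner and its neighbors.
    The owner would then either split its own zone or forward the split request
    to the neighbor owning the selected zone."""
    candidates = [owner_zone] + list(neighbor_zones)
    return max(candidates, key=zone_volume)
```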

Joining the Routing

Once the joining node has obtained its zone, it must learn the IP addresses of its coordinate neighbor set. In an M-dimensional coordinate space, two nodes are neighbors if their coordinate spans overlap along M − 1 dimensions and abut along one. Because a new node's zone is derived by splitting the previous occupant's zone, the new node's neighbor set is a subset of the previous occupant's neighbors, plus that occupant itself. Similarly, the previous occupant updates its neighbor set to eliminate those nodes that are no longer neighbors. Finally, both the new and old nodes' neighbors must be informed of this reallocation of space. Every node in the system sends an immediate update message, followed by periodic refreshes, with its currently assigned zone to all its neighbors. The addition of a new node affects only a small number of existing nodes in a very small locality of the coordinate space. The number of neighbors a node maintains depends only on the dimensionality of the coordinate space and is independent of the total number of nodes in the system. Thus, for an M-dimensional space, node insertion affects only O(M) existing nodes, which is important for CANs with huge numbers of nodes. In Section 3.4 [p.70] we will further discuss load balancing in CANs.

3.2 Routing

Using its neighbor coordinate set, a node routes a message toward its destination by simple greedy forwarding to the neighbor with coordinates closest to the destination coordinates (see Figure 3.2 [144]). For an M-dimensional space partitioned into N equal zones, individual nodes maintain 2M neighbors (one to advance and one to retreat along each dimension) and the average routing path length is (M/4)(N^{1/M}) hops (each dimension has N^{1/M} nodes; on a torus, a destination will on average be (1/4)(N^{1/M}) nodes away along each of the M dimensions). For an M-dimensional space, the number of nodes (and hence zones) can grow without increasing per-node

state while the path length grows as O(N^{1/M}). Several proposed routing algorithms for location services route in O(log(N)) hops with each node maintaining O(log(N)) neighbors. In CAN this is possible if the number of dimensions is chosen as M = (log_2(N))/2. Because CAN is supposed to be applied to very large systems with frequent topology changes, in the considerations above M is kept fixed to keep the number of neighbors independent of the system size.

[Figure 3.2: Example of routing in a 2-d space.]

Note that, if one or more of a node's neighbors were to crash, a node would automatically route along the next best available path. If, however, a node loses all its neighbors in a certain direction, and the repair mechanisms described in Section 3.3 [p.69] have not yet repaired the void in the coordinate space, then greedy forwarding may temporarily fail. In this case, the forwarding node first checks

with its neighbors to see whether any of them can make progress toward the destination. This 1-hop route check is useful in circumventing certain voids, and it is more important at lower dimensions. If even the 1-hop route check fails, greedy routing fails and the message (see Section 3.3 [p.69]) is forwarded using the rules used to route recovery messages until it reaches a node from which greedy forwarding can resume.

3.3 Node departures

When nodes leave a CAN, the zone they occupied must be taken over by the remaining nodes. To do this, a node can explicitly hand over its zone (i.e. its own VID and its list of neighbor VIDs and IP addresses) and the associated (key, value) database to a specific node called the takeover node. If the takeover node's zone can be merged with the departing node's zone to produce a valid single zone, then this is done. If not, then the takeover node can temporarily handle both zones. The CAN is also designed to be robust to node or network failures, where one or more nodes simply become unreachable. A recovery algorithm has been defined that ensures that the takeover node and the failed node's neighbors independently work to reconstruct the routing structure at the failed node's zone. However, in this case the data stored by the departing node is lost and needs to be rebuilt. For example, the holders of the data can refresh the state: to prevent stale entries as well as to refresh lost entries, nodes that insert (key, value) pairs into the CAN might periodically refresh these entries. Alternately, each (key, value) pair might be replicated at multiple points. In the CAN definition this issue is not addressed, since the appropriate solution is largely dependent on application-level issues. We do not describe the detailed recovery process by which routing state is rebuilt when a node fails. The identification of a unique node, called the takeover node, that occupies the departed node's zone, and the process by which the departed node's neighbors dis-

cover the takeover node and vice versa are fully described in [144].

3.4 Load Balancing

Since data is spread across the coordinate space using a uniform hash function, the volume of a node's zone is indicative of the size of the database the node will have to store, and hence indicative of the load placed on the node. A uniform partitioning of the space is thus desirable to achieve load balancing. Note that this is not sufficient for true load balancing, because some (key, value) pairs will be more popular than others, thus putting a higher load on the nodes hosting those pairs. This is basically the same as the hot-spot problem on the Web and can be addressed using caching and replication schemes, as discussed in [145]. Other approaches for load balancing in CAN can be found in [162, 172, 153], while a general discussion about load balancing in DHTs can be found in [149].

3.5 M-CAN: CAN-based Multicast

The naive approach to implementing flooding for a CAN overlay network is for each node that receives a message to forward that message to all of its neighbors. Nodes can filter out duplicate messages by maintaining a cache of previously received message IDs. The problem with the naive strategy is that it can lead to a large number of duplicate messages. To reduce the number of duplicates, Ratnasamy, Handley, Karp, and Shenker presented in [146] (see also [144]) an efficient flooding algorithm that exploits the structure of the CAN coordinate space to limit the directions in which each node will forward messages. Nodes use the following five rules to decide whether to forward a message, and to decide to which neighbors to forward the message.

1. Origin Forwarding Rule: The multicast origin node forwards the message to all neighbors.

[Figure 3.3: Directed Flooding over the CAN [146].]

2. General Forwarding Rule: A node receives a message from a neighboring node adjacent along dimension i. The node forwards that message to all adjacent neighbors along dimensions 1 through i − 1. The node also forwards the message to those adjacent neighbors along dimension i in the opposite direction from where it received the message.

3. Duplicate Filter Rule: A node caches the message IDs of all received messages. When a node receives a duplicate, it does not forward the message.

4. Half-Way Filter Rule: A node does not forward a message along a particular dimension if that message has already traveled at least half-way across the space from the origin coordinate in that dimension.

5. Corner Filter Rule: Along the lowest dimension (dimension 1), a node n only forwards to a neighbor n_A if a specific corner of n_A is in contact with n. This specific corner is

the corner of n_A that is adjacent to n along dimension 1 and has the lowest coordinates along all other dimensions. Note that this rule eliminates certain messages that would otherwise be sent according to the two forwarding rules.

[Figure 3.4: Illustration of the race condition that affects the CAN efficient flooding algorithm presented in [146].]

Jones, Theimer, Wang, and Wolman discovered and fixed two flaws in the above algorithm [90]. The first flaw is an ambiguity in the half-way filter rule specified above. The authors state that the above algorithm ensures there will be no duplicate messages if the CAN coordinate space is evenly partitioned (i.e. all CAN nodes have equal-sized zones). The following change to the half-way filter rule is needed to ensure that this property actually holds: when deciding whether or not to forward to a neighbor n, if n contains the point that is halfway across the space from the source coordinate in that dimension, then we only forward to that neighbor from the positive direction. The second flaw they discovered is a race condition that can lead to certain nodes never receiving the flooded message. This race condition arises because when a node receives a duplicate message, it does not forward that message. Therefore, the order in which a node receives a message from its neighbors may determine the

directions in which that message is forwarded. To demonstrate this problem, Figure 3.4 [90] illustrates a situation where one of the nodes does not receive the multicast message. This figure shows a small portion of a 2-dimensional CAN, where the dashed line in the figure is the location along the y axis that is half-way from the origin. The sequence of message delivery times listed in the time-line portion of Figure 3.4 causes node n_E to never receive the message. Note that a different ordering of message reception either at node n_C or at node n_D would have led to proper message delivery at node n_E. For example, if we switch the order of the messages at times T=1 and T=2, then the message from n_A to n_C is delivered before the message from n_B to n_C, which means that node n_C will forward the message to n_E.

The idea behind the fix of the flooding algorithm in [90] is to make static forwarding decisions based on the relative position of a node to the multicast origin, rather than dynamic forwarding decisions based on the order of incoming messages. The new algorithm presented in [90] breaks up the forwarding process into two stages. In the first stage, a node decides which dimensions and directions to forward the message along. In the second stage, a node applies a second set of rules to filter the subset of neighbors that satisfy the first-stage rules. The stage-one forwarding rules presented in [90] are:

1. If a node's region overlaps the origin along all dimensions less than or equal to i, then this node will forward the message in both the positive and the negative directions along dimension i.

2. If a node's region overlaps the origin along all dimensions less than i, then this node will forward the message only in one direction along dimension i. The direction to forward the message will be away from the origin coordinate, toward the half-way point.

3. For the lowest dimension (dimension 1), always forward only in one direction. As before, the forwarding direction will be away from the origin coordinate, toward the half-way point.

The stage-two filtering rules presented in [90] are:

1. For all dimensions greater than 1, only forward to a neighboring node along dimension i if that neighbor's region overlaps the origin coordinates for all dimensions less than i.

2. The half-way filter rule from the original algorithm, with the modification described above.

3. The corner filter rule from the original algorithm.

Although the rules for this algorithm look somewhat different from the original algorithm, the way that messages flow through the CAN coordinate space is quite similar to the original algorithm. An important side effect of the modified flooding algorithm is a significant reduction in the number of duplicate messages, due to the first rule in the filtering stage of the new algorithm.

In [43] the tree-based application-level multicast for DHTs, first introduced in Pastry (namely SCRIBE [42]) and Tapestry (namely Bayeux [195]), has been adapted to operate on CAN (we described Pastry in Subsection [p.39] and Tapestry in Subsection [p.41]). A delay analysis of the [146] and [43] approaches has been given in [82]. A good survey of application-level multicast approaches, in which the M-CAN is compared with SCRIBE, can be found in [95].

3.6 CAN Related Works

In [12] Range queries over CAN have been investigated. The authors proposed three simple strategies for propagating Range query requests, and strategies to minimize the communication overhead during attribute updates. Their strategies are limited to attributes that have a single value which belongs to R. Nearest Neighbor queries have been studied in [33]. The authors supposed that the hash function is able to map near objects into near keys. Moreover, they only consider the distance between the objects as the distance between the keys.

In [147] various types of attacks on the CAN and possible countermeasures have been proposed. However, the problem of peers inserting unwanted or unpopular data into the CAN is not considered in that paper. Voice over IP (VoIP) over DHTs is considered in [16]. In particular, the authors presented a robust architecture for Session Initiation Protocol (SIP) infrastructures called CAN-based SIP (CASIP), which makes use of the CAN.


Chapter 4

MCAN

In this chapter, we present the Metric Content-Addressable Network (MCAN), which is the main object of this thesis. MCAN, originally presented in [65], is an extension of the CAN (see Chapter 3 [p.63]) to support storage and retrieval of generic metric space objects (see Section 1.3 [p.7]). In order to manage metric data, the MCAN uses a pivot-based technique, presented in Section 4.1 [p.78], that maps metric data objects x ∈ D to an M-dimensional vector space R^M. Then, the CAN protocol is used to partition the space and for navigation. Because of a particular property of the chosen mapping, namely contractiveness, during similarity query execution it is always possible to evaluate a lower bound for the distance between a generic object and all the objects stored in a generic node. Thus it is possible to exclude peers from the query execution. In order to reduce the number of distances evaluated by each involved node, MCAN uses the pivot-based filtering described in Section 4.2 [p.80]. In Section 4.8 [p.86] Range query similarity search over the MCAN, defined in [65], is presented. Moreover, three algorithms for executing Nearest Neighbor queries over the MCAN are presented in Section 4.9 [p.88] (they were defined in [66]).

4.1 Mapping

In MCAN a special mapping function is used to map objects from a generic metric space D to an M-dimensional vector space R^M in which we use L∞ as distance (see Definition [p.10]). We will prove that this mapping is a contraction mapping. In the following we give a definition of the mapping function (Definition 1), followed by the definition of contraction mapping (Definition 2) and by the proof that our mapping function is a contraction mapping (Theorem 1).

Definition 1. Let {p_1, p_2, ..., p_M} be a set of preselected objects in D, called pivots, and x a generic object in D. We define the mapping function F: D → R^M as:

F(x) = (d(x, p_1), d(x, p_2), ..., d(x, p_M)).

Using F (to map objects from D into R^M) and L∞ as the metric distance in R^M, we can map any object from the original metric space M = (D, d) into a new vector space, which is also metric, M_M = (R^M, L∞).

We now define an important subclass of mapping functions: contractions.

Definition 2. A contraction mapping, or contraction, is a function f from a metric space M = (D, d) to another metric space M̂ = (D̂, d̂), with the property that there is some real number 0 ≤ λ ≤ 1 for which:

∀x, y ∈ D,  d̂(f(x), f(y)) ≤ λ · d(x, y).

If λ ∈ [0, 1) then the mapping is said to be a strict contraction, while if λ = 1 then the mapping is said to be nonexpansive or simply a contraction. Sometimes strict contractions are just called contractions, in which case the λ = 1 case is always referred to as nonexpansive.

Now we prove that the proposed mapping is a contraction mapping.
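Before proceeding to the proof, the following small Python sketch illustrates F and the L∞ distance in the projected space (an illustration under the definitions above, not the MCAN implementation; the names pivots and distance are placeholders).

```python
# Illustrative sketch of the pivot mapping F and the L∞ distance (not the MCAN code).
# `distance` is the original metric d; `pivots` is the list p_1..p_M.

def F(x, pivots, distance):
    """Map an object x of the metric space into R^M."""
    return [distance(x, p) for p in pivots]

def l_inf(u, v):
    """L∞ (maximum) distance between two mapped vectors."""
    return max(abs(a - b) for a, b in zip(u, v))

# Contractiveness in practice: for any two objects x, y,
#   l_inf(F(x, pivots, d), F(y, pivots, d)) <= d(x, y)
# by the triangle inequality, which is what allows MCAN to prune peers safely.
```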

Theorem 1. F is a contraction mapping from a generic metric space M = (D, d) to the metric space M_M = (R^M, L∞) (i.e. for any pair of objects in D, the distance between the mapped objects in the derived space M_M obtained using F is never larger than the distance in the original metric space M).

Proof. Let x, y ∈ D be a pair of objects and x̂, ŷ ∈ R^M their mapped values, i.e. x̂ = F(x) and ŷ = F(y). From the L∞ and F definitions:

L∞(x̂, ŷ) = max_i |d(x, p_i) − d(y, p_i)|.

Since the triangle inequality (Equation 1.5) for d in the original metric space M implies that, for all i, |d(x, p_i) − d(y, p_i)| ≤ d(x, y), then

L∞(x̂, ŷ) = max_i |d(x, p_i) − d(y, p_i)| ≤ d(x, y).

Finally, we proved that our mapping function F is a contraction. In other words, the distance L∞ between any pair of mapped objects in the derived vector space M_M is a lower bound of the original metric distance d between the two objects in M. This property, together with the object distribution algorithm inherited from the CAN, permits excluding some peers from the similarity query execution without loss of results.

Pivot Selection

When using pivots for the mapping, as we do, we would like the mapping to be as little contractive as possible. To select convenient pivots, in our experiments we used the incremental selection technique originally proposed in [35] together with other pivot selection techniques; the authors showed that incremental selection is the best method in practice. Basically, the algorithm tries to maximize the average L∞ distance between two objects.
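The following Python sketch illustrates the idea behind incremental pivot selection on a sample of objects (an assumption-laden illustration, not the algorithm of [35] verbatim; sample sizes, candidate counts and names are placeholders): at each step it adds the candidate that maximizes the average L∞ distance between mapped pairs of sample objects.

```python
import random

# Illustrative sketch of incremental pivot selection (not the exact algorithm of [35]).
# `sample` is a list of objects, `distance` the metric d, `m` the number of pivots wanted.

def incremental_pivot_selection(sample, distance, m, num_pairs=100, num_candidates=20):
    pairs = [(random.choice(sample), random.choice(sample)) for _ in range(num_pairs)]
    pivots = []

    def avg_linf(pivot_set):
        # Average L∞ distance between the mapped pairs for the given pivot set.
        total = 0.0
        for x, y in pairs:
            total += max(abs(distance(x, p) - distance(y, p)) for p in pivot_set)
        return total / len(pairs)

    for _ in range(m):
        candidates = random.sample(sample, num_candidates)
        # Greedily add the candidate that maximizes the average L∞ distance.
        best = max(candidates, key=lambda c: avg_linf(pivots + [c]))
        pivots.append(best)
    return pivots
```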

Also, they showed that good pivots have the characteristic of being outliers, that is, good pivots are objects far away from each other and from the rest of the objects of the database; however, an outlier does not always have the property of being a good pivot. It is interesting to note that outlier sets have good performance in uniformly distributed vector spaces, but have bad performance in general metric spaces, even worse than random selection in some cases.

4.2 Filtering

In our experiments we used pivot-based filtering to reduce distance evaluations during the execution of Range and Nearest Neighbor similarity queries by the involved nodes. However, a generic node participating in MCAN could use its own centralized data structures for efficiently searching among its metric objects (see Section 1.4 [p.12] for a small survey of centralized access methods for metric spaces).

Performing a range query R(q, r) (Subsection [p.4]) we search for objects x ∈ X (where X ⊆ D are the stored objects) that are at distance at most r from q, i.e. d(x, q) ≤ r. For the query object q ∈ D we compute its mapped object q̂ = F(q) = (d(q, p_1), ..., d(q, p_M)). Because the mapping is a contraction, L∞(x̂, q̂) ≤ d(x, q) (see Theorem 1). Thus we can avoid the evaluation of d(x, q) whenever

L∞(x̂, q̂) > r.    (4.1)

In other words, the object x can be discarded if there exists a pivot p_i such that

|d(q, p_i) − d(x, p_i)| > r.    (4.2)
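A minimal Python sketch of this pivot-based filtering for a range query follows (assumed names and data layout, not the MCAN implementation; the per-object pivot distances are supposed to have been precomputed at insertion time). Only objects that survive the L∞ test pay the cost of a distance computation in the original metric space.

```python
# Illustrative sketch of pivot-based filtering for a range query R(q, r)
# (assumed data layout, not the MCAN implementation).
# `stored` is a list of (object, pivot_distances) where pivot_distances[i] = d(x, p_i),
# precomputed when the object was inserted; `q_hat[i] = d(q, p_i)`.

def range_query_with_filtering(stored, q, q_hat, r, distance):
    result = []
    for x, x_hat in stored:
        # L∞ lower bound on d(x, q): if it already exceeds r, skip the expensive distance.
        lower_bound = max(abs(qi - xi) for qi, xi in zip(q_hat, x_hat))
        if lower_bound > r:
            continue
        if distance(x, q) <= r:
            result.append(x)
    return result
```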

During the evaluation of a k Nearest Neighbors query kNN(q) (Section [p.4]), if y_1, ..., y_k ∈ D are the temporary results, we can avoid the evaluation of d(q, x) if

L∞(x̂, q̂) > d(y_k, q).    (4.3)

In other words, an object x can be discarded if there exists a pivot p_i such that

|d(q, p_i) − d(x, p_i)| > d(y_k, q).    (4.4)

While Equation 4.1 and Equation 4.3 are specific to the MCAN mapping, Equation 4.2 and Equation 4.4 describe essentially the same conditions in the form used in the pre-computed distances literature (see Subsection [p.13]). Here we will not consider specific data structures to efficiently execute pivot-based filtering. In our experiments we will consider only the distance evaluation cost, thus reducing the importance of specific data structures for optimizing memory or disk usage. However, well-known data structures have been proposed in the literature to efficiently exploit pre-computed distances, e.g. LAESA [122] and Spaghettis [44] (see Subsection [p.13]).

4.3 Regions

We denote a peer of MCAN by the bold symbol n. Each peer n maintains its region information, referred to as n.R. Moreover, since the region n.R is a hyper-rectangle, it can be uniquely identified by its vertex closest to the origin, denoted as n.R.v̂ = (n.R.v_1, n.R.v_2, ..., n.R.v_M), and by the lengths of the corresponding sides, i.e. n.R.l_1, n.R.l_2, ..., n.R.l_M. More precisely, the region n.R is defined as follows:

n.R = { x̂ ∈ R^M | ∀i, n.R.v_i ≤ x_i < n.R.v_i + n.R.l_i }.
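As an illustration of this region representation (a sketch with assumed names, not the thesis code), a region can be stored as the origin-side vertex plus the side lengths, and containment of a mapped point reduces to a per-dimension half-open interval test.

```python
# Illustrative sketch of an MCAN region (assumed names, not the thesis implementation).

class Region:
    def __init__(self, v, l):
        # v: vertex closest to the origin, l: side lengths, one entry per dimension.
        self.v = list(v)
        self.l = list(l)

    def contains(self, x_hat):
        """True if the mapped point x_hat falls inside the half-open hyper-rectangle."""
        return all(vi <= xi < vi + li
                   for vi, li, xi in zip(self.v, self.l, x_hat))

# Example: a 2-dimensional region [0, 2) x [1, 4)
region = Region(v=[0.0, 1.0], l=[2.0, 3.0])
assert region.contains([1.5, 3.9])
assert not region.contains([2.0, 2.0])
```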

The peer n also maintains the set of the neighbor peers' information n.M = {n_1, ..., n_h}. We can now introduce the formal definition of an M-dimensional MCAN structure, referred to as MCAN_M, which is composed of a set of N (N > 0) network peers {n_1, ..., n_N} such that:

∀i, j, i ≠ j : n_i.R ∩ n_j.R = ∅,    (4.5)

⋃_{i=1..N} n_i.R = R^M,    (4.6)

n_i ∈ n_j.M ⟺ ∃k : (n_i.R.v_k + n_i.R.l_k = n_j.R.v_k) ∨ (n_j.R.v_k + n_j.R.l_k = n_i.R.v_k),
and ∀w ≠ k : [n_i.R.v_w, n_i.R.v_w + n_i.R.l_w) ∩ [n_j.R.v_w, n_j.R.v_w + n_j.R.l_w) ≠ ∅.    (4.7)

Equation 4.5 states that the zones covered by the network peers do not overlap. Equation 4.6 states that the union of the zones covers the whole MCAN_M space R^M (there are no holes). Finally, Equation 4.7 declares the condition for a network node n_i to be a neighbor of n_j.

4.4 Construction

An important feature of the CAN structure is its capability to dynamically adapt to data-set size changes. As we will see in the experimental evaluation, we are interested in preserving the scalability of the MCAN, which means that we want to maintain a stable response time of query execution. Since the number of objects a peer can maintain is limited, when a peer exceeds its limit it splits by sending a subset of its objects to a free peer that takes responsibility for a part of the original region. Note that, by limiting the number of objects each peer can maintain, we also limit (reduce)

the number of distance computations a peer has to perform during a query evaluation. It is important to observe that in some cases we might want to use all the peers available in the network. Previous works have studied this possibility in a generic CAN structure by allowing a peer to split even if it does not exceed its storage capacity. Obviously, such a methodology can also be applied in our MCAN. On the other hand, in a Peer-to-Peer environment, we would like to leave the peers the possibility to freely join and leave the network, without affecting its consistency. As shown in Chapter 3 [p.63], this is possible with a CAN, which even provides some fault-tolerance capabilities.

Since pivots need to be determined before the insertion starts, we assume a characteristic subset of the indexed dataset (about 5000 objects) is known at the beginning. For selecting the pivots, we use the Incremental Selection algorithm described in Subsection [p.79]. This algorithm tries to maximize the average L∞ distance between two arbitrary objects in the derived space (i.e. L∞(F(x), F(y))).

4.5 Insert

An insert operation can be initiated in any peer of the MCAN. It starts by mapping the inserted object x to the virtual coordinate space using the function F(), and proceeds by checking if x̂ = F(x) lies in the zone maintained by the peer n, i.e. x̂ ∈ n.R. If this is not the case, the peer forwards the insertion request. From this point, the insertion proceeds with the greedy routing algorithm used in standard CAN structures: the inserting peer forwards the insertion operation to the neighbor peer which is closest to the point x̂ according to the L∞ distance. The objective is to find the peer n for which x̂ ∈ n.R, minimizing the number of messages. We refer to this special peer as µ(x̂) (i.e. x̂ ∈ µ(x̂).R). If x̂ lies in the region maintained by the receiving peer, the object x is stored there, otherwise a neighbor peer is selected with the same technique

and the insert operation is forwarded again until the object x is inserted. The peer µ(x̂), which stores the object x, must reply to the peer that started the insert operation. If the peer µ(x̂) exceeds its capacity, it splits. Eventually, the object x is inserted in µ(x̂) or in the newly allocated peer.

4.6 Split

In MCAN, we apply a balanced split, i.e. the resulting regions contain the same number of objects. During this process, the splitting peer just requests a peer from a free peer list to join the network, and one half of the metric objects is then reallocated there. If we define n_1 as the splitting peer, n_1.R as its old region, n_1.R' as its new one, and n_2 as the new peer, the split regions must satisfy the following equations:

n_1.R' ∪ n_2.R = n_1.R,
n_1.R' ∩ n_2.R = ∅.

Moreover, to respect these constraints, we create the two new regions by dividing the original one along one coordinate of the space. Therefore, the new regions, n_1.R' and n_2.R, must satisfy the following equations:

n_1.R'.v_s = n_1.R.v_s,
n_2.R.v_s = n_1.R.v_s + n_1.R'.l_s,
n_2.R.l_s = n_1.R.l_s − n_1.R'.l_s.

Note that we only have to choose s and n_1.R'.l_s. In order to decide s, for each dimension i we find the value n_1.R'.l_i that divides the objects into two halves. Moreover, we choose to split along the dimension that maximizes the length of the shortest side. After the splitting process, the peer n_1 sends a message to all its neighbors n_1.M informing them about the update of its region. It also sends information about the new peer to the neighbors that are also neighbors of n_2. The new peer is informed by n_1 about

its neighbors n_2.M (note that n_2.M ⊆ n_1.M ∪ {n_1}). At the end, n_1 can discard the information about the peers that are no longer its neighbors.

4.7 Execution End Detection

One of the problems of Peer-to-Peer systems able to perform similarity searches such as Range and Nearest Neighbor queries is that typically a set of nodes is involved in the execution of the query and all of them return their results to the requester. Moreover, we cannot expect the answers to come in the order in which the nodes are involved. However, the requester must be able to know exactly when there are no more nodes to wait for. Obviously, we do not want to force the requester to directly contact each involved node. In fact, because of the partial information about the network the requester has, it is not able to know in advance which nodes will participate in the query process.

When the requester starts an operation, if none of its neighbors must be involved, it forwards its request to a non-involved node. The routing mechanism of the underlying CAN structure will forward the request to an involved node. Thus there is always an involved node which was contacted either directly by the requester or by a forwarding peer (not by another involved node). We call this special node the first-involved node. Without loss of generality, this first-involved node could even be the requester itself. After a first-involved node is reached, using the routing mechanism, all other involved nodes are contacted by an involved node (see the Range query in Section 4.8 [p.86] and Nearest Neighbor in Section 4.9 [p.88]).

Our simple solution for detecting the end of a similarity query execution is based on a list of involved nodes managed by the requester. Any answering node must communicate the nodes that it has asked to participate in the similarity query execution. These nodes are added to the involved nodes list by the requester. A query execution is ended when the first-involved node and all the nodes which are in the involved nodes list have answered. When this condition is satisfied, there cannot exist an involved

node which has not yet answered. In fact, this hypothetical node would have been involved by another node which has not answered yet and which is not yet in the involved nodes list, and so on. Recursively following this chain of involved nodes, we should be able to reach the first-involved node. But this is not possible, because the first-involved node has already answered. To better understand, it is possible to imagine a tree which has the first-involved node as root and the involved nodes as children of the node that involved them. From the root it must be possible to reach any involved node and vice versa.

4.8 Range Query

A range query operation R(q, r) can start from any MCAN node. As shown in Figure 4.1, for a given query object and range radius, there is a certain number of nodes whose regions intersect the query region, which is a hypercube with q̂ as center and 2r as side length. We denote this region as ⟨q̂, r⟩. Obviously, only the intersecting nodes must process the range query operation. The requesting node maps the query object into the virtual coordinate space using the function F(). Then it checks if it is involved in the range query operation (i.e. whether its region intersects ⟨q̂, r⟩). If the node is not involved in the query, it forwards the range query operation to the neighbor node that is closest to the region ⟨q̂, r⟩, using the L∞ distance. This operation is performed in a similar way as described for the insert operation. When a node that is involved in the range query is reached by the query request, it forwards the request to each neighbor that is also involved and then starts processing the range query over its local data-set. From this point the multicast algorithm proposed in [146], with the improvements and corrections described in [90] (see Section 3.5 [p.70]), is used to forward the request to all involved peers. To efficiently parallelize the operation, each peer forwards the message before starting to execute R(q, r) locally. In this thesis, we used a superset of the pivots chosen to define the MCAN space to reduce the number of distance evaluations

performed inside a single node. Using the pivot-based filtering (see Section 4.2 [p.80]), we are able to significantly reduce the number of distance evaluations inside the nodes. In a more sophisticated implementation of MCAN, each node could have its own local data structure to efficiently search inside a single node.

[Figure 4.1: Example of a Range query in a two-dimensional space. The darker square is the query region, while the brighter rectangles correspond to the involved nodes.]

In order to allow the requesting node to know when all the involved nodes have answered, the requester, which is the peer receiving the answers coming from the involved nodes, detects the end of the execution using the mechanism described in Section 4.7 [p.85].
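The following Python sketch shows, on the requester side, the two checks that drive this process (assumed names and data layout, not the MCAN implementation): whether a region intersects the query hypercube ⟨q̂, r⟩, and the involved-nodes bookkeeping of Section 4.7 used to detect the end of the execution.

```python
# Illustrative sketch of the range-query region test and execution-end detection
# (assumed names, not the MCAN implementation).

def intersects_query_region(region_v, region_l, q_hat, r):
    """True if the hyper-rectangle [v_i, v_i + l_i) intersects the hypercube with
    center q_hat and side length 2r (torus wrap-around is ignored for simplicity)."""
    return all(vi < qi + r and qi - r < vi + li
               for vi, li, qi in zip(region_v, region_l, q_hat))

class EndDetector:
    """Requester-side bookkeeping sketched after the mechanism of Section 4.7."""
    def __init__(self, first_involved):
        self.pending = {first_involved}   # nodes we are still waiting for
        self.answered = set()

    def on_answer(self, node, nodes_it_involved):
        # Every answering node reports which nodes it asked to participate.
        self.answered.add(node)
        self.pending.discard(node)
        self.pending.update(n for n in nodes_it_involved if n not in self.answered)

    def finished(self):
        return not self.pending
```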


More information

Proximity Searching in High Dimensional Spaces with a Proximity Preserving Order

Proximity Searching in High Dimensional Spaces with a Proximity Preserving Order Proximity Searching in High Dimensional Spaces with a Proximity Preserving Order E. Chávez 1, K. Figueroa 1,2, and G. Navarro 2 1 Facultad de Ciencias Físico-Matemáticas, Universidad Michoacana, México.

More information

Particle Swarm Optimization applied to Pattern Recognition

Particle Swarm Optimization applied to Pattern Recognition Particle Swarm Optimization applied to Pattern Recognition by Abel Mengistu Advisor: Dr. Raheel Ahmad CS Senior Research 2011 Manchester College May, 2011-1 - Table of Contents Introduction... - 3 - Objectives...

More information

Architectures for Distributed Systems

Architectures for Distributed Systems Distributed Systems and Middleware 2013 2: Architectures Architectures for Distributed Systems Components A distributed system consists of components Each component has well-defined interface, can be replaced

More information

Leveraging Set Relations in Exact Set Similarity Join

Leveraging Set Relations in Exact Set Similarity Join Leveraging Set Relations in Exact Set Similarity Join Xubo Wang, Lu Qin, Xuemin Lin, Ying Zhang, and Lijun Chang University of New South Wales, Australia University of Technology Sydney, Australia {xwang,lxue,ljchang}@cse.unsw.edu.au,

More information

A Real Time GIS Approximation Approach for Multiphase Spatial Query Processing Using Hierarchical-Partitioned-Indexing Technique

A Real Time GIS Approximation Approach for Multiphase Spatial Query Processing Using Hierarchical-Partitioned-Indexing Technique International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2017 IJSRCSEIT Volume 2 Issue 6 ISSN : 2456-3307 A Real Time GIS Approximation Approach for Multiphase

More information

3 Publishing Technique

3 Publishing Technique Publishing Tool 32 3 Publishing Technique As discussed in Chapter 2, annotations can be extracted from audio, text, and visual features. The extraction of text features from the audio layer is the approach

More information

Repeating Segment Detection in Songs using Audio Fingerprint Matching

Repeating Segment Detection in Songs using Audio Fingerprint Matching Repeating Segment Detection in Songs using Audio Fingerprint Matching Regunathan Radhakrishnan and Wenyu Jiang Dolby Laboratories Inc, San Francisco, USA E-mail: regu.r@dolby.com Institute for Infocomm

More information

Similarity Searching in Structured and Unstructured P2P Networks

Similarity Searching in Structured and Unstructured P2P Networks Similarity Searching in Structured and Unstructured P2P Networks Vlastislav Dohnal and Pavel Zezula Faculty of Informatics, Masaryk University Botanicka 68a, 602 00 Brno, Czech Republic {dohnal,zezula}@fi.muni.cz

More information

Introduction to Peer-to-Peer Systems

Introduction to Peer-to-Peer Systems Introduction Introduction to Peer-to-Peer Systems Peer-to-peer (PP) systems have become extremely popular and contribute to vast amounts of Internet traffic PP basic definition: A PP system is a distributed

More information

Module 1 Lecture Notes 2. Optimization Problem and Model Formulation

Module 1 Lecture Notes 2. Optimization Problem and Model Formulation Optimization Methods: Introduction and Basic concepts 1 Module 1 Lecture Notes 2 Optimization Problem and Model Formulation Introduction In the previous lecture we studied the evolution of optimization

More information

Pivot Selection Techniques for Proximity Searching in Metric Spaces. real numbers and the distance function belongs to the

Pivot Selection Techniques for Proximity Searching in Metric Spaces. real numbers and the distance function belongs to the Pivot Selection Techniques for Proximity Searching in Metric Spaces Benjamin Bustos Gonzalo Navarro Depto. de Ciencias de la Computacion Universidad de Chile Blanco Encalada 212, Santiago, Chile fbebustos,gnavarrog@dcc.uchile.cl

More information

Peer-to-Peer Systems. Network Science: Introduction. P2P History: P2P History: 1999 today

Peer-to-Peer Systems. Network Science: Introduction. P2P History: P2P History: 1999 today Network Science: Peer-to-Peer Systems Ozalp Babaoglu Dipartimento di Informatica Scienza e Ingegneria Università di Bologna www.cs.unibo.it/babaoglu/ Introduction Peer-to-peer (PP) systems have become

More information

On the Least Cost For Proximity Searching in Metric Spaces

On the Least Cost For Proximity Searching in Metric Spaces On the Least Cost For Proximity Searching in Metric Spaces Karina Figueroa 1,2, Edgar Chávez 1, Gonzalo Navarro 2, and Rodrigo Paredes 2 1 Universidad Michoacana, México. {karina,elchavez}@fismat.umich.mx

More information

Topological Classification of Data Sets without an Explicit Metric

Topological Classification of Data Sets without an Explicit Metric Topological Classification of Data Sets without an Explicit Metric Tim Harrington, Andrew Tausz and Guillaume Troianowski December 10, 2008 A contemporary problem in data analysis is understanding the

More information

Database and Knowledge-Base Systems: Data Mining. Martin Ester

Database and Knowledge-Base Systems: Data Mining. Martin Ester Database and Knowledge-Base Systems: Data Mining Martin Ester Simon Fraser University School of Computing Science Graduate Course Spring 2006 CMPT 843, SFU, Martin Ester, 1-06 1 Introduction [Fayyad, Piatetsky-Shapiro

More information

Unit 8 Peer-to-Peer Networking

Unit 8 Peer-to-Peer Networking Unit 8 Peer-to-Peer Networking P2P Systems Use the vast resources of machines at the edge of the Internet to build a network that allows resource sharing without any central authority. Client/Server System

More information

Lecture 1: January 23

Lecture 1: January 23 CMPSCI 677 Distributed and Operating Systems Spring 2019 Lecture 1: January 23 Lecturer: Prashant Shenoy Scribe: Jonathan Westin (2019), Bin Wang (2018) 1.1 Introduction to the course The lecture started

More information

A modified and fast Perceptron learning rule and its use for Tag Recommendations in Social Bookmarking Systems

A modified and fast Perceptron learning rule and its use for Tag Recommendations in Social Bookmarking Systems A modified and fast Perceptron learning rule and its use for Tag Recommendations in Social Bookmarking Systems Anestis Gkanogiannis and Theodore Kalamboukis Department of Informatics Athens University

More information

String Vector based KNN for Text Categorization

String Vector based KNN for Text Categorization 458 String Vector based KNN for Text Categorization Taeho Jo Department of Computer and Information Communication Engineering Hongik University Sejong, South Korea tjo018@hongik.ac.kr Abstract This research

More information

Chapter 6 PEER-TO-PEER COMPUTING

Chapter 6 PEER-TO-PEER COMPUTING Chapter 6 PEER-TO-PEER COMPUTING Distributed Computing Group Computer Networks Winter 23 / 24 Overview What is Peer-to-Peer? Dictionary Distributed Hashing Search Join & Leave Other systems Case study:

More information

Telecommunication Services Engineering Lab. Roch H. Glitho

Telecommunication Services Engineering Lab. Roch H. Glitho 1 Support Infrastructure Support infrastructure for application layer Why? Re-usability across application layer protocols Modularity (i.e. separation between application layer protocol specification /

More information

Efficient, Scalable, and Provenance-Aware Management of Linked Data

Efficient, Scalable, and Provenance-Aware Management of Linked Data Efficient, Scalable, and Provenance-Aware Management of Linked Data Marcin Wylot 1 Motivation and objectives of the research The proliferation of heterogeneous Linked Data on the Web requires data management

More information

Lecture 2 September 3

Lecture 2 September 3 EE 381V: Large Scale Optimization Fall 2012 Lecture 2 September 3 Lecturer: Caramanis & Sanghavi Scribe: Hongbo Si, Qiaoyang Ye 2.1 Overview of the last Lecture The focus of the last lecture was to give

More information

Chapter 15 Introduction to Linear Programming

Chapter 15 Introduction to Linear Programming Chapter 15 Introduction to Linear Programming An Introduction to Optimization Spring, 2015 Wei-Ta Chu 1 Brief History of Linear Programming The goal of linear programming is to determine the values of

More information

Modelling Structures in Data Mining Techniques

Modelling Structures in Data Mining Techniques Modelling Structures in Data Mining Techniques Ananth Y N 1, Narahari.N.S 2 Associate Professor, Dept of Computer Science, School of Graduate Studies- JainUniversity- J.C.Road, Bangalore, INDIA 1 Professor

More information

Scalability In Peer-to-Peer Systems. Presented by Stavros Nikolaou

Scalability In Peer-to-Peer Systems. Presented by Stavros Nikolaou Scalability In Peer-to-Peer Systems Presented by Stavros Nikolaou Background on Peer-to-Peer Systems Definition: Distributed systems/applications featuring: No centralized control, no hierarchical organization

More information

Mining di Dati Web. Lezione 3 - Clustering and Classification

Mining di Dati Web. Lezione 3 - Clustering and Classification Mining di Dati Web Lezione 3 - Clustering and Classification Introduction Clustering and classification are both learning techniques They learn functions describing data Clustering is also known as Unsupervised

More information

APPLICATION OF MULTIPLE RANDOM CENTROID (MRC) BASED K-MEANS CLUSTERING ALGORITHM IN INSURANCE A REVIEW ARTICLE

APPLICATION OF MULTIPLE RANDOM CENTROID (MRC) BASED K-MEANS CLUSTERING ALGORITHM IN INSURANCE A REVIEW ARTICLE APPLICATION OF MULTIPLE RANDOM CENTROID (MRC) BASED K-MEANS CLUSTERING ALGORITHM IN INSURANCE A REVIEW ARTICLE Sundari NallamReddy, Samarandra Behera, Sanjeev Karadagi, Dr. Anantha Desik ABSTRACT: Tata

More information

CS 640 Introduction to Computer Networks. Today s lecture. What is P2P? Lecture30. Peer to peer applications

CS 640 Introduction to Computer Networks. Today s lecture. What is P2P? Lecture30. Peer to peer applications Introduction to Computer Networks Lecture30 Today s lecture Peer to peer applications Napster Gnutella KaZaA Chord What is P2P? Significant autonomy from central servers Exploits resources at the edges

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

FastA & the chaining problem

FastA & the chaining problem FastA & the chaining problem We will discuss: Heuristics used by the FastA program for sequence alignment Chaining problem 1 Sources for this lecture: Lectures by Volker Heun, Daniel Huson and Knut Reinert,

More information

ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT

ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT PhD Summary DOCTORATE OF PHILOSOPHY IN COMPUTER SCIENCE & ENGINEERING By Sandip Kumar Goyal (09-PhD-052) Under the Supervision

More information

70 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 6, NO. 1, FEBRUARY ClassView: Hierarchical Video Shot Classification, Indexing, and Accessing

70 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 6, NO. 1, FEBRUARY ClassView: Hierarchical Video Shot Classification, Indexing, and Accessing 70 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 6, NO. 1, FEBRUARY 2004 ClassView: Hierarchical Video Shot Classification, Indexing, and Accessing Jianping Fan, Ahmed K. Elmagarmid, Senior Member, IEEE, Xingquan

More information

CHAPTER 4: CLUSTER ANALYSIS

CHAPTER 4: CLUSTER ANALYSIS CHAPTER 4: CLUSTER ANALYSIS WHAT IS CLUSTER ANALYSIS? A cluster is a collection of data-objects similar to one another within the same group & dissimilar to the objects in other groups. Cluster analysis

More information

Squid: Enabling search in DHT-based systems

Squid: Enabling search in DHT-based systems J. Parallel Distrib. Comput. 68 (2008) 962 975 Contents lists available at ScienceDirect J. Parallel Distrib. Comput. journal homepage: www.elsevier.com/locate/jpdc Squid: Enabling search in DHT-based

More information

Finding the k-closest pairs in metric spaces

Finding the k-closest pairs in metric spaces Finding the k-closest pairs in metric spaces Hisashi Kurasawa The University of Tokyo 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo, JAPAN kurasawa@nii.ac.jp Atsuhiro Takasu National Institute of Informatics 2-1-2

More information

Overview of Clustering

Overview of Clustering based on Loïc Cerfs slides (UFMG) April 2017 UCBL LIRIS DM2L Example of applicative problem Student profiles Given the marks received by students for different courses, how to group the students so that

More information

2. LITERATURE REVIEW

2. LITERATURE REVIEW 2. LITERATURE REVIEW CBIR has come long way before 1990 and very little papers have been published at that time, however the number of papers published since 1997 is increasing. There are many CBIR algorithms

More information

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of

More information

An Introduction to Content Based Image Retrieval

An Introduction to Content Based Image Retrieval CHAPTER -1 An Introduction to Content Based Image Retrieval 1.1 Introduction With the advancement in internet and multimedia technologies, a huge amount of multimedia data in the form of audio, video and

More information

Modern Multidimensional Scaling

Modern Multidimensional Scaling Ingwer Borg Patrick Groenen Modern Multidimensional Scaling Theory and Applications With 116 Figures Springer Contents Preface vii I Fundamentals of MDS 1 1 The Four Purposes of Multidimensional Scaling

More information

"Charting the Course... MOC C: Developing SQL Databases. Course Summary

Charting the Course... MOC C: Developing SQL Databases. Course Summary Course Summary Description This five-day instructor-led course provides students with the knowledge and skills to develop a Microsoft SQL database. The course focuses on teaching individuals how to use

More information

Background Brief. The need to foster the IXPs ecosystem in the Arab region

Background Brief. The need to foster the IXPs ecosystem in the Arab region Background Brief The need to foster the IXPs ecosystem in the Arab region The Internet has become a shared global public medium that is driving social and economic development worldwide. Its distributed

More information

Spatial Data Management

Spatial Data Management Spatial Data Management [R&G] Chapter 28 CS432 1 Types of Spatial Data Point Data Points in a multidimensional space E.g., Raster data such as satellite imagery, where each pixel stores a measured value

More information

Building a low-latency, proximity-aware DHT-based P2P network

Building a low-latency, proximity-aware DHT-based P2P network Building a low-latency, proximity-aware DHT-based P2P network Ngoc Ben DANG, Son Tung VU, Hoai Son NGUYEN Department of Computer network College of Technology, Vietnam National University, Hanoi 144 Xuan

More information