Recherche par Similarité dans les Réseaux Pair-à-Pair Basés sur le Clustering : l'environnement HON (Hybrid Overlay Network)


UNIVERSITÉ DE BOURGOGNE
ECOLE DOCTORALE BUFFON IMAGES ET MODÉLISATION DES OBJETS NATURELS

THESIS submitted for the degree of Doctor of the Université de Bourgogne
Discipline: Computer Science

Recherche par Similarité dans les Réseaux Pair-à-Pair Basés sur le Clustering : l'environnement HON (Hybrid Overlay Network)
(Similarity Search in Clustering-Based Peer-to-Peer Networks: the HON (Hybrid Overlay Network) Environment)

Presented by Mouna KACIMI EL HASSANI
Defended on 10 January 2007

JURY
Jean Marcel PALLO, Professor, Université de Bourgogne (Chair)
Hervé GUYENNET, Professor, Université de Franche-Comté (Reviewer)
Bruno SADEG, Associate Professor (Maître de conférences, habilité), Université du Havre (Reviewer)
Zahir TARI, Professor, RMIT University (Reviewer)
Ernesto DAMIANI, Professor, Université de Milan (Examiner)
Kokou YÉTONGNON, Professor, Université de Bourgogne (Thesis supervisor)

ABSTRACT

P2P systems represent a large portion of Internet traffic, which makes data discovery and retrieval of great importance to users and the broader Internet community. Hence, the power of a P2P system comes from its ability to provide an efficient search service. With the evolution of information and communication technologies, simple exact matching and immediate lookup are no longer sufficient for many applications. These applications often require complex range queries or content-based similarity search on data such as images, text and video. In this thesis, we present a Hybrid Overlay Network (HON) for efficient similarity search in a P2P system. HON organizes both peers and data in an n-dimensional space based on content description. It is based on two key ideas. First, it organizes and clusters peers in the n-dimensional feature space to limit flooding overhead and send queries only to relevant peers. Second, it organizes and places similar data objects in relatively dense regions of the feature space to achieve efficient processing of complex queries such as range and nearest neighbor queries. To reduce the search cost, we propose a caching methodology that reduces flooding among peers of the same region of the feature space. Each cache keeps track of all the queries mapped to the same region. Similar queries are therefore stored in the same cache, which helps increase the success hit rate and avoids redundancy. As the dimensionality of the feature space increases, the data objects become sparsely distributed over the space. This behavior can significantly reduce the benefits of HON. To address this problem, we propose a dimensionality reduction technique (1) to select for each peer the most important features in describing its content and (2) to build multiple overlay networks of peers described by different features. Each overlay groups peers with common features and uses a feature space to organize data and cluster peers.

We have developed a prototype in the Java programming language for the simulation of HON. We investigate different aspects of our solution, including similarity search quality and performance cost, fault tolerance, query cache behavior and scalability. We show through extensive simulations that HON provides an efficient similarity search with a high success rate and recall. Moreover, we study the impact of caching on the search cost and show that it significantly decreases the query scope. Regarding scalability, HON adapts to dynamic membership with low maintenance overhead. In addition, it efficiently routes queries along the best available paths, which makes it resilient to peer failures in spite of its hierarchical architecture.

Contents

Abstract
Table of Contents
List of Figures
List of Tables
1 Introduction: Research Scope and Objectives; Research Overview; Thesis Organization
2 Research background: Introduction; Unstructured Overlay Networks; Blind Search; Gnutella; Optimization Techniques; Expanding Ring; Random Walks; Concluding Remarks; Informed Search; Centralized Index; Distributed Index; Concluding Remarks; Structured Overlay Networks; Loosely-Structured Systems; Highly-Structured Systems; CAN; Chord; Complex Queries in DHT Systems; Concluding Remarks; Conclusion
3 Design of Cluster-based P2P Systems: Introduction; Similarity Search; Textual Applications; Multimedia Applications; Distributed Search; P2P Similarity Search; Clustering Approaches; Semantic-based Approaches; Low level Feature-based Approaches; Discussion; Problem Statement; Data and peers organizing; Feature selection; Hierarchical architecture; Caching mechanism; Hybrid Overlay Network; General Description; Organization of an Overlay; Feature Space of an Overlay; Organizing Layers of an Overlay; Peer Join and Departure; Peer Join; Peer Departure; Clustering Issues in HON; Requirements; Affinity-based Clustering; Cells Affinity Computation; Grouping cells; Density-based Clustering; CLindex Approach; Distributed Density Clustering in HON; Caching Issues in HON; Performance Evaluation; Conclusion; Bibliographic Remarks
4 Feature Space Dimensionality: Introduction; The Curse of Dimensionality; Challenges of High Dimensional Clustering; Dimensionality Reduction Techniques; Feature Construction; Feature Selection; Feature Selection Process; Feature Selection Algorithms; Discussion; Dimensionality Reduction in HON; Weighted Feature Selection; Overlays Creation; Overlays Architecture; Evaluation Results; Conclusion; Bibliographic Remarks
5 Similarity Search and Query Caching: Introduction; Query Model; Range Queries; Nearest Neighbor Queries; Query Processing; Routing Indexes; Query Routing; Similarity Search; Caching in HON; Overview; Web Caching; P2P Caching; Caching Schema; Cache Definition; Cache Management; Cache Admission Policy; Cache Replacement Policy; Conclusion; Bibliographic Remarks
6 Evaluation methodology: Prototype Design; Physical Layer; Elements Distribution; Routing Layer; Descriptors; Protocols; Application Layer; Join Event; Departure Event; Search Event; Simulation Setup; Parameters; Metrics; Search Quality; Search Cost; Caching Impact; Maintenance Cost; Failure Cost; Simulation Results; Control parameters; Search Success Rate; Recall; Query Scope; Clustering Impact of the Search Cost; Caching Impact on Search Performance; Data Distribution Impact on Success Hit; Caching Impact on Query Scope; Caching Impact on Recall; Replacement Policies; Scalability; Tolerance to Failures; Conclusion
7 Conclusions and Perspectives: Discussion; Future Research
Bibliography


List of Figures

Flooding queries; Gnutella protocol; Expanding Ring; Centralized index architecture; BitTorrent Architecture; FastTrack Architecture; Request sequence example in Freenet; Example of 2-dimensional CAN before and after peer E joins system; Example of Chord system; Chord Operations; Examples of queries; Challenges of distributed search; Clustering impact on search efficiency; Semantic Overlay Network; Hybrid Overlays Networks; Partition cells and Clusters; Architecture Layers; Example1: Cells Usage and Affinity Matrix; Queries and peers number; Example2: Cells Usage and Affinity Matrix; Cluster Forming Algorithm; Example1: density-based algorithm; Example2: density-based algorithm; Example3: density-based algorithm; Sparsity of Data; Query Instability; Example of 2-dimensional data; Axes change from (f1, f2) to (f1', f2'); Principal Components; Four Steps of Feature Selection; Feature Density; Proximity and Nearest Neighbor Graph; Shared Nearest Neighbor Graph; Overlays of Features; Architecture of Overlay Networks; Weighted feature selection performance; Precise search using discarded features; Possible relations between cell and range query; Query Forms; Possible relations between cell and nearest neighbors query; Routing Indexes; Example of similarity search; Memory and Disk Cache; Admission Policies; Web Caching; Distributed P2P Caching; Caching in HON; Cache Placement; Cache Content; Physical Layer; Index Structures; Routing Layer; Application Layer; Impact of control parameters on mapping efficiency; Data Distributions; Impact of cell granularity on the success rate; Impact of the Threshold on the recall; Impact of cell granularity on the query scope; Clustering Effect; Caching impact on search efficiency; Impact of replacement policies on caching performance; Average maintenance costs; Impact of replacement policies on the query scope; Impact of number of cells on the maintenance cost; Tolerance to failures


List of Tables

Description of Gnutella Protocol; Description of HON Protocol; Description of Control Parameters; Description of Workload Parameters

Chapter 1 Introduction

Information searching has emerged in recent years as an important characteristic in the design of distributed systems. This importance stems from (1) the fact that distributed computing and network architectures allow access to a growing number of documents of various forms, and (2) the emergence of systems and paradigms that facilitate information lookup. Efficient search algorithms and methods have therefore become a key issue in many applications. For example, there has been an increasing interest in peer-to-peer (P2P) computing as a platform for search-based applications, spurred by the popularity of file and content sharing applications such as Gnutella [40], Napster [83], eDonkey [33] and Kazaa [61]. Other examples include the web, the semantic web, and audio and video based applications. Most Internet users view the web through search engines such as Google and Yahoo, which perform document lookup for keyword queries. Typically, search engines retrieve documents that match the query keywords; however, they may at times return approximate answers based on similarity measures. The semantic web, as defined by Tim Berners-Lee, is viewed as a mesh of data interconnected in such a way as to be easily processable by machines or humans. One of the goals of the semantic web is to define a common data interchange format for data sharing. Underlying this definition is the need for complex search mechanisms to aid the discovery and localization of the relevant data processed by applications.

Information searching deals with the representation, organization and storage of data to facilitate access by users. The goal of information searching is to determine which documents of a collection satisfy user queries. Two main forms of information searching can be distinguished. Exact match search consists in retrieving data that exactly equal a query. In its most common form, an exact search query is a set of keywords, and the search returns data that match those keywords exactly. Similarity search, the second form, consists in retrieving all data that partially or approximately match a given query. Similarity search is commonly used in textual and multimedia applications. For example, it is used in textual applications to address spelling problems by identifying the closest matches for any text string not found in a dictionary. In multimedia applications, many queries are content-based and are used to retrieve objects that are similar to a given object. Examples of these applications include face recognition, fingerprint matching, voice recognition, and multimedia databases. An example query is the identification of objects, such as a tree or a man, contained in database images.

Previous work in P2P systems has focused to a large extent on architectural and information retrieval issues. In essence, a P2P system is viewed as a distributed network system composed of participants (or peers) with similar resource capabilities. From an architectural point of view, P2P systems can be classified into unstructured and structured systems. Unstructured P2P systems organize peers in a random graph with no control over their contents; that is, each peer is in charge of its own contents. They can be further classified into (1) centralized systems, which maintain a central directory of data locations or indexes, (2) decentralized systems, in which peers maintain no network state or context information, and (3) hybrid systems, which combine the characteristics of centralized and decentralized systems by using super peers to manage other peers with fewer resources and capabilities. Structured P2P systems keep tight control over network topology and peer contents by placing data not at random peers but at specific locations defined by an indexing strategy. Many search algorithms have been proposed for P2P systems during the past few years. Most of these algorithms rely on the underlying overlay structure of the P2P system.
Referring to the classification of P2P systems above, we briefly review existing search algorithms, specifying their characteristics and capabilities for similarity search. The details of the search methods and other background discussions are given in chapter 2.

In centralized unstructured P2P systems, queries are submitted to the central index server to retrieve the location of data, which are then downloaded directly from the peer holding them. Similarity search in this case depends solely on the index search capabilities of the server. Centralized search also suffers from the server being a single point of failure.

Information search in decentralized unstructured systems is a blind search based on query flooding, in which a query issued by a peer is propagated to the direct neighbors of the peer, which in turn propagate the query to their neighbors, and so on. Several methods have been used to improve the performance and limit the cost of basic flooding search. The Time-To-Live (TTL) method reduces the query search scope by limiting the propagation time or the number of hops. Another method for reducing the search scope is the random walk [26, 4], in which a query is forwarded to a randomly chosen peer at each step until the required object is found or, for approximate queries, sufficient responses are obtained. In some random walks, the selection of the peer at each step may be based on properties or capabilities of the neighbors. For instance, a neighbor with high computing capacity may be preferred over other peers. One limitation of random walks is that some walks may be useless, resulting in degraded search performance. Moreover, flooding techniques are efficient for locating popular data objects for which several duplicate copies exist in a large number of peers. In this case the network traffic generated by the query can be reduced, since the requested object is likely to be found in a few hops. On the other hand, flooding techniques can exhibit poor search quality and cost performance for remote unpopular objects, which may not be found if the TTL limit is reached or may incur a high search cost if they are found.

Search methods in structured P2P systems are based on informed search which, contrary to blind search, uses peer content location information to direct queries to the appropriate peers. DHT (Distributed Hash Table) techniques have been used in several systems [97, 110, 101, 123, 23, 77, 78, 68]. The basic idea of DHT-based search is to facilitate data access by organizing the data in a key space using hashing keys based on peer properties (e.g., IP address). Each peer and data item in the network is assigned a unique identifier, Id_peer and Id_data, using a hash key based on peer properties and data content respectively. Ratnasamy et al. [97] propose a Content Addressable Network (CAN) that organizes data in an n-dimensional key space by partitioning the key space into zones associated with peers. DHT search is more appropriate for exact queries.
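To make the identifier assignment concrete, here is a minimal Python sketch of hashing peers and data into a shared key space. It is an illustration under stated assumptions, not the scheme of any particular system cited above: the 16-bit identifier space, the `dht_id` and `responsible_peer` helpers, the peer addresses, and the ring-successor placement rule (borrowed from Chord-like designs, whereas CAN uses zones) are all hypothetical choices made for readability.

```python
import hashlib
from bisect import bisect_left

ID_BITS = 16  # assumption: a tiny identifier space, only to keep values readable


def dht_id(value: str) -> int:
    """Hash a string (peer address or data key) into the identifier space."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** ID_BITS)


def responsible_peer(data_key: str, peer_ids: list) -> int:
    """Return the id of the peer that stores data_key.

    Assumption: successor rule on a ring (first peer id >= data id,
    wrapping around to the smallest id), as in Chord-like systems.
    """
    ring = sorted(peer_ids)
    idx = bisect_left(ring, dht_id(data_key))
    return ring[idx % len(ring)]


# Hypothetical peers identified by address:port strings.
peers = [dht_id(f"10.0.0.{i}:6346") for i in range(1, 9)]

# Two nearly identical keys are hashed independently of each other.
for key in ("sunset.jpg", "sunset2.jpg"):
    print(key, dht_id(key), responsible_peer(key, peers))
```

Because the hash ignores how similar two keys are, near-duplicate objects generally land on unrelated identifiers and peers, which is why a plain DHT lookup serves exact queries well but gives no help with similarity or range queries.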
Thus, the main challenge for these systems is to process complex queries such as similarity, approximate and range selections. This challenge was recently addressed in [47, 114, 44, 100] by adding a layer on top of existing DHT systems to process multi-attribute range queries for textual data.

Cluster-based P2P systems represent a third category of architecture. These systems build an overlay structure on top of unstructured P2P systems. Clusters are created by organizing peers into groups based on their common properties or interests. Different properties, such as network-related information [66, 13, 5], application needs [117, 80], peer characteristics [53, 74], and similarities between peer contents [108, 87, 28, 85, 86, 112, 43], have been used to create clusters. For example, Sripanidkulchai et al. [108] take into account query traces over the P2P network. Hang et al. [87, 104] generate signature vectors based on the low-level features of the peer content. Other researchers [28, 85, 86, 112] associate peers with semantic descriptions that can be simple keyword-based annotations, schemas or ontologies. Typically, the goal of clustering methods is to limit flooding by sending queries to the peers that are likely to contain the relevant data. Similarity issues have been addressed by some clustering techniques, such as the Firework query model presented in [87, 104] and the SON system described in [112]. However, most clustering approaches organize peers based on their largest data interests, providing only an approximate description of peers' content. In some cases, the characteristics of some peer content are not taken into account, and that content may then never be retrieved by search queries.

From an information search point of view, the basic idea of cluster-based P2P systems is to limit the scope of flooding to the members of a cluster. Query processing in cluster-based P2P systems is generally done in two steps. When a peer issues a query, it first attempts to match the query to the contents of the peers in its own cluster, initiating an intra-cluster query execution. If there is no match, a selection process is carried out to determine the clusters that can process the query. Note that this query processing method in some ways resembles what is done in hierarchical hybrid P2P systems, where super peers process inter-cluster queries.

1.1 Research Scope and Objectives

This thesis is concerned with similarity search methods in cluster-based P2P systems. The goal is to investigate the problem of creating clusters using the content descriptions of peers and to propose a corresponding query model. As stated above, P2P systems allow data sharing and information search over multiple dynamic peers that can join and leave the network at any time. The distributed nature of P2P networks can be used to increase robustness in case of failures by replicating data over multiple peers and enabling peers to find data without relying on centralized index servers.
These characteristics make information retrieval in P2P networks a very challenging task involving several issues. First, the peers holding relevant answers to a given query cannot be directly identified. Peers maintain a partial view of the network, leading to (1) a partial lookup and (2) high latency for queries on non-replicated data. Second, peers can be heterogeneous, using different schemas to describe their data or different languages to represent their queries. Thus, an efficient search in a P2P network has (1) to provide mapping techniques for communication between heterogeneous peers and (2) to combine the results from different peers to form a coherent whole. The third issue is the dynamicity of peers. The routing protocol has to be tolerant to peer churn, forwarding queries to relevant peers through available paths to avoid search failure.

Similarity search can be inherently expensive, especially when the query processing technique is not designed to manage approximate queries. In P2P systems the search space can grow exponentially with the number of peers and data. Therefore, the retrieval of similar data among all peers in the network requires a specific search scheme. The objectives of this research can be summarized as follows.

1. To combine the characteristics of CAN and clustering techniques to provide an efficient similarity search in P2P networks by (1) organizing data in a feature space describing their content, to provide efficient processing of complex queries such as range and nearest neighbor queries, and (2) organizing peers in the same feature space as their data, to create clusters of similar peers and limit the flooding overhead over the network.
2. To address the issue of high feature space dimensionality by (1) selecting for each peer the significant features in describing its data and (2) dividing the global set of features into several subsets of limited size.
3. To use a hybrid P2P architecture that introduces the super peer concept to reduce the search delay, manage the system structure efficiently when peers leave and join the network, and balance the load of peer management between several super peers.
4. To develop caching mechanisms that reduce the search cost inside clusters and improve routing and lookup performance. The goal is to keep track of queries and of the peers that return good results, so that future queries similar to those stored in the cache can take advantage of the cached information.

1.2 Research Overview

The goal of the research presented in this thesis is to design a similarity search method for the underlying cluster-based P2P system. We address five main problems. The first problem is the design of a Hybrid Overlay Network, called HON, on top of the unstructured peer architecture.
HON organizes both data and peers in an n-dimensional feature space based on content description. We assume that the contents of peers are described by a set of features. Peers may be heterogeneous: each peer has a specific set of features that are important in describing its data objects. For example, a peer holding text files can be represented by a set of keywords where each keyword is a feature. Another peer containing images can be described by low-level features such as color, texture or shape. Therefore, the number of features describing peer content, that is the dimensionality of the feature space, increases exponentially when the number of participating peers increases. To avoid high dimensionality problems, we (1) divide the global set of features into several subsets by selecting the most important features in describing peer content and (2) associate each subset with one Hybrid Overlay Network. Each subset composes a feature space of limited size.

HON is based on two key ideas. First, it organizes and clusters peers sharing similar contents to limit flooding overhead and send queries only to relevant peers. Second, it organizes and places similar data objects in relatively dense and adjacent regions of the feature space to achieve efficient processing of complex queries such as range and nearest neighbor queries. The feature space is partitioned into cells obtained by dividing the range of values of each feature into a number of intervals. Two data objects are similar if they are mapped to the same cell. The distribution of data objects over the cells defines the similarity between peers: two peers are similar if their contents are distributed over the same subregions of the feature space. Similar peers are grouped into the same cluster.

The second problem we examine is the clustering of peers in an overlay network of the feature space. HON clusters are defined by partitioning regions of the feature space. The key idea is that peers mapped to the cells of a partitioning region are grouped into the same cluster. A partitioning region is a set of cells such that each cell is adjacent to one or more other cells of the set. Two cells are adjacent if they share an (n-1)-dimensional hyperplane of the feature space. The architecture of a cluster is a two-level hierarchy consisting of super peers and simple peers.
The super peers, which are responsible for the management of the clusters, have high processing and storage capacities, while the simple peers have limited capabilities. A peer can join one or more clusters in different overlays. We investigate and use a density-based algorithm to create clusters. The key idea of this clustering technique is to start with the cell with the highest density, that is the cell to which the highest number of peers is mapped, and group its adjacent cells until a connected partitioning region is obtained. The process is repeated with the cell having the highest density among the unused cells, that is those not already marked as members of a cluster.

The third problem is related to the dimension of the feature space. As discussed above, a peer's data object in a hybrid overlay network is a point in a feature space whose dimensionality is defined by the number of its features. As the dimensionality of the feature space increases, the data objects become sparsely distributed over the space. This behavior can greatly reduce the benefits of creating overlays and clusters of peers with similar contents. To address the dimensionality issue, we view a P2P system as one or more overlays. Each overlay is described by a specific number of features. We propose a dimensionality reduction technique to guide the selection of features for the overlays. The idea of the reduction technique is to select the features that describe dense regions of the feature space. Using a variant of the weighted feature selection algorithm proposed by Wang et al. [118], each peer selects the most significant features that describe its content. Once the significant features of peers are selected, an algorithm based on feature neighborhood is used to select the features that belong to the same overlay.

The fourth problem we investigate in this thesis is the query model. Similarity search in HON combines the characteristics of the range and nearest neighbor query models. The range query model specifies a range of values for each feature and retrieves its answers from a region of adjacent cells in the feature space. By contrast, the nearest neighbor query model retrieves the closest objects to a specific point in the feature space using a similarity threshold. Query routing and data object localization are implemented in HON by maintaining partial routing tables in each peer. Simple peers forward queries to their super peers, which provide a search service that either executes a query locally on their clusters or submits the query over the P2P network to remote peers.

To reduce the search cost, we use caching to limit flooding among peers at the cell level. A cache is placed in each non-empty cell and keeps track of all the queries sent to that cell. Two queries that are mapped to the same cell are similar. Therefore, they share common peers that are expected to hold the relevant answers. In this way, similar queries are always sent to and stored in the same cache, which helps increase the success hit rate and avoids redundancy.
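The cell-level caching idea can be sketched in a few lines of Python. This is only an illustration of the principle stated above, not the HON cache design detailed in chapter 5: the `CellCaches` class, its method names, the assumption that every feature is normalized to [0, 1], and the absence of any admission or replacement policy are all simplifications introduced here.

```python
from collections import defaultdict


class CellCaches:
    """One cache per non-empty cell: cell -> {query point: peers that answered}."""

    def __init__(self, intervals):
        # Assumption: each feature lies in [0, 1] and is split into
        # `intervals` equal-width intervals.
        self.intervals = intervals
        self._caches = defaultdict(dict)

    def cell_of(self, query):
        """Map a query point to its cell, one interval index per feature."""
        return tuple(min(int(x * self.intervals), self.intervals - 1) for x in query)

    def record(self, query, peers):
        """Remember which peers returned good results for this query."""
        cache = self._caches[self.cell_of(query)]
        cache.setdefault(tuple(query), set()).update(peers)

    def candidate_peers(self, query):
        """Peers recorded for earlier queries that fell in the same cell."""
        cached = self._caches.get(self.cell_of(query), {})
        peers = set()
        for answered in cached.values():
            peers |= answered
        return peers


caches = CellCaches(intervals=4)
caches.record((0.10, 0.80), {"p3", "p7"})

# A later query in the same cell reuses the recorded peers (a cache hit).
print(caches.candidate_peers((0.20, 0.76)))
# A query in another cell finds nothing and would fall back to flooding the cell.
print(caches.candidate_peers((0.90, 0.10)))
```

Keying the cache by cell rather than by exact query is what lets similar but non-identical queries share an entry; the trade-off is that cell granularity directly controls how loosely "similar" is interpreted.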
The fifth problem is the design of a prototype for evaluating the proposed P2P architecture and search methods. We have developed a prototype in the Java programming language for the simulation of the hybrid overlay network. We investigated different aspects of our solution, including similarity search quality and performance cost, fault tolerance, query cache behavior and scalability.

1.3 Thesis Organization

The remainder of this thesis is organized as follows. Chapter 2 presents a literature review on P2P networks and information search in these networks. Chapter 3 presents an overview of the main contributions of our work. First, several characteristics of search techniques are discussed and several traditional applications that use these techniques are presented. Then, P2P similarity search is discussed, focusing on clustering issues. Next, we present the design of a cluster-based P2P hybrid overlay network (HON). In chapter 4, we examine high dimensionality issues and propose a solution for selecting the features of overlays in the HON approach. In chapter 5, we investigate similarity search and caching issues in HON. We focus on the definition of query routing strategies and the design of distributed cache techniques in HON. Chapter 6 presents the design of a simulation prototype to measure the performance of the proposed HON solution. We focus on evaluating different aspects, including similarity search quality, maintenance costs and fault tolerance. Several routing and localization methodologies are implemented and evaluated. Finally, chapter 7 concludes the thesis and presents some perspectives and future work.
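As a recap of the density-based clustering idea from Section 1.2, the following Python sketch seeds a region at the densest unused cell and absorbs adjacent cells until the region can grow no further. It is a simplified, centralized illustration, not the distributed algorithm of chapter 3: the function names, the `min_density` threshold and the tie-breaking rule are assumptions made here, and adjacency is taken to mean that two cells differ by one interval along exactly one feature.

```python
from collections import deque


def neighbors(cell):
    """Yield the cells sharing an (n-1)-dimensional face with `cell`."""
    for axis in range(len(cell)):
        for step in (-1, 1):
            yield cell[:axis] + (cell[axis] + step,) + cell[axis + 1:]


def density_clusters(density, min_density=1):
    """Group cells into connected regions, seeding from the densest unused cell.

    density: dict mapping a cell (tuple of interval indexes) to the number
             of peers mapped to it.
    min_density: hypothetical threshold; sparser cells are never absorbed.
    """
    unused = {c for c, d in density.items() if d >= min_density}
    clusters = []
    while unused:
        # Densest remaining cell; ties are broken by cell index for determinism.
        seed = max(unused, key=lambda c: (density[c], c))
        unused.remove(seed)
        region, frontier = [seed], deque([seed])
        while frontier:
            current = frontier.popleft()
            for nb in neighbors(current):
                if nb in unused:
                    unused.remove(nb)  # mark as member of a cluster
                    region.append(nb)
                    frontier.append(nb)
        clusters.append(region)
    return clusters


density = {(0, 0): 5, (0, 1): 3, (1, 1): 1, (3, 3): 4, (3, 2): 2}
print(density_clusters(density))
# [[(0, 0), (0, 1), (1, 1)], [(3, 3), (3, 2)]]
```

Each returned region is a connected set of cells, so it satisfies the definition of a partitioning region; the peers mapped to those cells would form one cluster.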

Chapter 2 Research background

Contents: Introduction; Unstructured Overlay Networks; Blind Search; Gnutella; Optimization Techniques; Expanding Ring; Random Walks; Concluding Remarks; Informed Search; Centralized Index; Distributed Index; Concluding Remarks; Structured Overlay Networks; Loosely-Structured Systems; Highly-Structured Systems; CAN; Chord; Complex Queries in DHT Systems; Concluding Remarks; Conclusion

This thesis touches on several research areas: P2P overlay networks, similarity search in high-dimensional spaces, and caching methodologies. To ensure a logical flow of this dissertation and a better understanding of these areas, in this chapter we give an overview of P2P networks, including architecture and search techniques. In chapter 3 we give an overview of the existing cluster-based P2P overlay networks that are related to our approach. In chapter 4, we give an overview of the problems of high dimensionality and of dimensionality reduction techniques. Finally, in chapter 5 we review caching mechanisms that have been proposed for traditional databases, the web and P2P networks.

2.1 Introduction

Peer-to-peer (P2P) computing is an alternative to the centralized and client-server models of computing, where nodes can play client and server roles for different purposes, i.e., a node can be a server for one kind of request and a client for others. Pure P2P networks are basically distributed systems without any hierarchical organization or centralized control. All participant nodes, called peers, have similar capabilities, playing both client and server roles. Peers share their resources (information, processing, etc.) through direct interactions. The difference between P2P and other distributed systems is that peers form self-organizing networks offering various services such as efficient information search, load balancing, redundant storage, autonomy, dynamism, anonymity, massive scalability and fault tolerance. P2P search networks are an efficient mechanism for sharing information between large numbers of users. Existing systems such as Kazaa [61] allow several million simultaneous users to search and retrieve data from distributed repositories containing hundreds of millions of objects.
Those systems represent a large portion of the Internet trac which make the data discovery and retrieval of great importance to the user and the broad Internet community. Many issues have been addressed by the existing P2P search techniques to guarantee system eciency. Among these issues we note the bandwidth consumption, network load, adaptability to changing topologies, accuracy and approximate search. The search techniques functionalities rely on the underlying P2P overlay infrastructure. Two classes of P2P overlay networks can be distinguished: Structured and Unstructured. The meaning of Structured is that P2P overlay network topology is tightly controlled and contents are placed not at random peers but at specied locations that will make subsequent queries more ecient. By contrast, Unstructured P2P overlay

networks are ad hoc in nature and cannot readily be unified under a common platform for application development. In this chapter, we describe the current search techniques for both categories of P2P overlay and present their characteristics.

2.2 Unstructured Overlay Networks

Unstructured P2P overlay networks organize peers in a random graph that may be flat, where all participant peers have the same functionalities, or hierarchical, where super peers have additional functionalities for special purposes. P2P systems relying on a flat graph are called pure systems, and those relying on a hierarchical graph are called hybrid systems. Each peer holds a local index of its content and initiates or forwards requests for objects that cannot be found in its local repository. Current search techniques in unstructured P2P overlay networks can be categorized as blind or informed. In a blind search, peers have no information about object locations, whereas an informed search uses a central or distributed index service containing information about object locations to manage the search process.

Blind Search

A blind search propagates the query through the network without any information about where the required objects may be stored. This technique is called flooding. Each query issued by a peer is broadcast (flooded) to its directly connected peers, which in turn flood their own peers, and so on, until the query is answered or the number of flooding steps reaches a maximum threshold that depends on the network topology. As shown in figure 2.1, the peers communicate directly to search and download data. This technique makes the entire network fault tolerant because there are no centralized resources. However, it requires a large network bandwidth, which may penalize system efficiency. Flooding is used by Gnutella [40], an unstructured P2P file sharing network.
It consists of randomly connected hosts, and the placement of files on hosts is unrelated to the overlay topology. A detailed description of the Gnutella network is given below.
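Before turning to Gnutella itself, the flooding procedure described above can be illustrated with a minimal sketch. It assumes a static overlay given as an adjacency dictionary, a step threshold expressed as a TTL, and duplicate suppression by remembering which peers have already seen the query; the function name `flood` and its parameters are ours and do not come from any real protocol implementation.

```python
from collections import deque


def flood(graph, source, holders, ttl):
    """Flood a query from `source`; return (peers that answered, messages sent).

    graph   -- dict mapping each peer to the list of its direct neighbors
    holders -- set of peers that store the requested object
    ttl     -- maximum number of flooding steps (assumed threshold)
    """
    seen = {source}                       # peers that already processed the query
    hits = set()
    messages = 0
    frontier = deque([(source, None, ttl)])
    while frontier:
        peer, sender, remaining = frontier.popleft()
        if peer in holders:
            hits.add(peer)                # this peer can answer the query
        if remaining == 0:
            continue                      # threshold reached: stop forwarding
        for neighbor in graph[peer]:
            if neighbor == sender:
                continue                  # never send the query straight back
            messages += 1                 # every forwarded copy costs one message
            if neighbor in seen:
                continue                  # duplicate copy: received, then dropped
            seen.add(neighbor)
            frontier.append((neighbor, peer, remaining - 1))
    return hits, messages
```

In this sketch the flood keeps spreading even after a holder has been reached, which mirrors the bandwidth cost noted above: the message count grows with the TTL whether or not an answer has already been found.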

Figure 2.1: Flooding queries

Gnutella

The Gnutella system appeared in March 2000. It was developed by Justin Frankel and Tom Pepper of Nullsoft, the company that produced Winamp (a digital jukebox for Windows that supports a wide variety of audio and video formats). Shortly afterwards, AOL, which owned Nullsoft, forbade Justin Frankel and Tom Pepper from continuing their work because of Gnutella's potential use in sharing unauthorized music files. The Gnutella file sharing system was then reverse engineered from the original code (i.e., analyzed in detail with the goal of constructing a new program that does the same thing without copying anything from the original version), and a second generation of Gnutella was born [17]. Since then, it has been impossible to stop file sharing from growing in size and popularity.

Gnutella peers, called servents, perform tasks associated with both clients and servers. They provide a client side through which users can initiate queries, and a server side that accepts queries from other servents. The way in which servents communicate over the network is defined by the Gnutella protocol, a set of rules governing the exchange of packets between servents. The current Gnutella protocol (version 0.4) [24] defines five descriptors used to exchange data between servents: Ping, Pong, Query, QueryHit and Push. The structure of a descriptor header is given in table 2.1. Note that the TTL field is used to avoid loops on the network, which would increase bandwidth consumption and the load on servents. When the TTL reaches 0, the descriptor is no longer forwarded. As a descriptor is passed from servent to servent, the TTL and Hops fields of the header must satisfy the following

condition:

$TTL(0) = TTL(i) + Hops(i)$

where $TTL(i)$ and $Hops(i)$ are the values of the TTL and Hops fields of the header at the $i$-th hop of the descriptor, for $i \geq 0$. The TTL is a mechanism for expiring descriptors on the network. Servents should choose TTL values carefully and lower them as necessary, since high TTL values may generate unnecessary network traffic and degrade network performance.

Table 2.1: Description of Gnutella Protocol
Descriptor ID: A unique identifier of the descriptor on the network. It is used to make sure that the same descriptor does not pass twice through the same servent, which would create a loop.
Payload Descriptor: Indicates the type of the descriptor (Ping, Pong, ...).
TTL: Time-To-Live. The number of times the descriptor will be forwarded by Gnutella servents before it is removed from the network. Each servent decrements the TTL before passing the descriptor on to another servent.
Hops: The number of times the descriptor has been forwarded.
Payload Length: The length of the descriptor.

The Gnutella protocol is composed of three steps:

Connecting: A Gnutella servent connects to the network by sending a Ping descriptor to discover the existing hosts. The servents receiving the Ping descriptor reply with a Pong descriptor and forward the received Ping descriptor to their directly connected servents, and so on, as shown in figure 2.2a. A Pong descriptor contains the address of a connected Gnutella servent and information about the amount of data it makes available to the network. When the requesting servent receives the Pong descriptor, a TCP/IP connection to the sender of the Pong descriptor is created. Once the servent is connected, it can start searching for data through the entire network.

Searching: To search for data in the Gnutella network, the requesting servent sends a Query descriptor to all its directly connected servents.
When a servent receives a Query descriptor, it carries out two tasks. First, it checks its content to look for any data that match the query. Second, it forwards the Query descriptor to all the servents to which it is connected. These servents check their directories and send the Query descriptor to their connected servents. This process continues until the TTL of the Query descriptor reaches 0 or until the requesting servent leaves the network.

Figure 2.2: Gnutella protocol. (a) Ping/Pong routing; (b) Query/QueryHit/Push

Downloading: Each servent holding the required data sends a QueryHit descriptor containing its IP address to the requesting servent. When the requesting servent receives the QueryHit descriptor, it initiates an HTTP connection to the servent that sent it in order to start downloading the data. A servent sends a Push descriptor if it receives a QueryHit descriptor from a servent that does not support incoming connections. This may happen when the servent sending the QueryHit descriptor is located behind a firewall. An example of QueryHit and Push descriptors is given in figure 2.2b.

In November 2000, observations of Gnutella network traffic [99] found that 55% of the traffic consisted of Ping and Pong packets, while only 36% was used for user Query packets. To reduce the overhead of Ping/Pong packets, which were increasing bandwidth consumption, the pong caching technique has been proposed [72]. With this technique, each servent holds a cache in which it stores Pong descriptors. When a servent receives a Ping descriptor, it sends back a selection of Pong descriptors recently stored in its cache, rather than broadcasting the Ping descriptor to all its connected peers. To keep the information stored in its cache up to date, a servent periodically broadcasts a Ping to all its peers. Pong caching results in less traffic and therefore higher performance. A similar optimization that caches QueryHit results has been considered in [107] to address scaling issues in the Gnutella file search method.
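The descriptor header of table 2.1 and the TTL/Hops condition stated earlier can be made concrete with a short sketch. The `Header` class and the `forward` function below are illustrative names of our own, not part of any Gnutella library, and the rule "drop when the decremented TTL would be 0" is our reading of the description above.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Header:
    descriptor_id: str    # unique identifier, used to detect duplicates
    payload_type: str     # "Ping", "Pong", "Query", "QueryHit" or "Push"
    ttl: int              # remaining number of forwards
    hops: int             # number of forwards so far
    payload_length: int


def forward(header: Header, seen_ids: set) -> Optional[Header]:
    """Return the header a servent passes on, or None if it drops the descriptor.

    `seen_ids` is the set of descriptor IDs this servent has already handled.
    """
    if header.descriptor_id in seen_ids:
        return None                       # same descriptor seen before: loop
    seen_ids.add(header.descriptor_id)
    if header.ttl <= 1:
        return None                       # decremented TTL would be 0: expire
    # Decrement TTL and increment Hops together, so that
    # TTL(0) = TTL(i) + Hops(i) holds at every hop i.
    return replace(header, ttl=header.ttl - 1, hops=header.hops + 1)
```

Because the two fields always change together, a receiving servent can recover the initial TTL as `ttl + hops`, which is one way to spot descriptors that were injected with an unreasonably high TTL.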

Optimization Techniques

Gnutella uses a TTL (Time-To-Live) to control the number of packets forwarded over the network. Nevertheless, choosing an appropriate TTL is not easy. If the TTL is too high, unnecessary packets can overload the network; if it is too low, the required object might not be reachable. The factors that affect the appropriate TTL value include the network topology and the popularity of the required objects. The popularity of an object is measured by the number of its replicas in the entire network. The more popular the object, the lower the appropriate TTL. Unfortunately, since the replication ratio of an object is unknown in practice, users have to set high TTL values to ensure query success. Another problem with flooding is the duplication of query messages: a servent may receive multiple copies of the same query from its neighbors. Duplicated queries are pure overhead, and duplicate detection mechanisms are required so that duplicated messages are dropped rather than forwarded. Even so, the number of duplicated messages in flooding algorithms can be excessive, especially as the TTL increases. Since flooding has inherent limitations, several optimization techniques have been proposed to improve its performance [22, 75, 38]; some examples are presented below.

Expanding Ring

This approach sets a low TTL on the initial query messages. If no reply is returned within a timeout period, the query message is resent with a higher TTL. The query is rebroadcast in this way until a result is returned or the TTL reaches a predefined maximum [27]. An example of expanding ring is given in figure 2.3. The experimental results presented in [27] show that, despite the successive retries needed to find an object, expanding ring significantly reduces message overhead compared with traditional flooding, which uses a fixed TTL.
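The retry loop of expanding ring can be sketched as follows. This is a simplified model under our own assumptions: the overlay is a static adjacency dictionary, the timeout is replaced by simply checking whether a flood returned a hit, and the TTL schedule is supplied by the caller. The names `flood_once` and `expanding_ring` are hypothetical.

```python
def flood_once(graph, source, holders, ttl):
    """Depth-limited flood: return (holders reached, messages sent)."""
    seen, frontier, messages = {source}, [source], 0
    hits = set()
    for _ in range(ttl):                      # one iteration per flooding step
        next_frontier = []
        for peer in frontier:
            for neighbor in graph[peer]:
                messages += 1                 # each copy sent counts as a message
                if neighbor not in seen:      # duplicates are counted but dropped
                    seen.add(neighbor)
                    next_frontier.append(neighbor)
                    if neighbor in holders:
                        hits.add(neighbor)
        frontier = next_frontier
    return hits, messages


def expanding_ring(graph, source, holders, ttl_schedule):
    """Retry the flood with growing TTLs until some holder answers.

    Returns (hits, TTL that succeeded or None, total messages over all tries).
    """
    total = 0
    for ttl in ttl_schedule:
        hits, messages = flood_once(graph, source, holders, ttl)
        total += messages                     # earlier failed tries still cost messages
        if hits:
            return hits, ttl, total
    return set(), None, total
```

The total message count includes every failed try, which makes the trade-off visible: the retries are paid for, but a popular object found at a small TTL never triggers the large floods.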
However, expanding ring does not address the message duplication issue inherent to flooding.

Random Walks

Random walk is a technique in which a query message is forwarded to a randomly chosen neighbor at each step until the required object is found [4]. The forwarded message is called a walker. Random walk significantly decreases message overhead compared with expanding ring. However, using a single random walker increases latency, i.e., the time required to retrieve answers. To solve this problem,

Lv et al. [27] propose increasing the number of walkers. Instead of sending one query message, the requesting node sends k query messages, and each query message takes its own random walk. Since multiple random walks require a mechanism to terminate the walkers, two methods are used in [27]: TTL and checking. The TTL method is similar to the one used in flooding: each walker terminates when its TTL reaches 0. With the checking method, a walker periodically checks with the original requesting peer before walking to the next peer. The experiments reported in [27] show that multiple walkers reduce latency but also generate more load than the standard technique with a single walker.

Figure 2.3: Expanding Ring. (a) First try; (b) Second try; (c) Third try

Concluding Remarks

Scalability is an important property of unstructured P2P networks that use blind search. The key to a scalable search is to cover the right number of peers while minimizing latency and network overhead. Three parameters have to be taken into account to improve the performance of blind searches [27]:
1. Adaptive termination: query messages should expire once the requesting peer has obtained the required objects.
2. Message duplication: each query should visit a node only once. Multiple visits by the same query generate unnecessary load.
3. Coverage granularity: each additional step in the search should not significantly increase the number of visited nodes.
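The multiple-walker search and its two termination methods, which these three parameters help to evaluate, can be illustrated with a small simulation. It is a sketch under simplifying assumptions of ours: walkers move in lock-step, the check with the requesting peer happens every `check_every` steps, and the cost of the check messages themselves is not counted. The function name and parameters are hypothetical.

```python
import random


def k_random_walks(graph, source, holders, k, ttl, check_every=4, seed=0):
    """Run k walkers in lock-step; return (holder found or None, messages sent)."""
    rng = random.Random(seed)                     # seeded for reproducible runs
    walkers = [source] * k
    found, messages = None, 0
    for step in range(1, ttl + 1):                # TTL method: at most `ttl` steps
        for i, peer in enumerate(walkers):
            walkers[i] = rng.choice(graph[peer])  # one hop to a random neighbor
            messages += 1
            if found is None and walkers[i] in holders:
                found = walkers[i]
        # Checking method: periodically ask the requesting peer whether the
        # query has already been satisfied, and stop all walkers if it has.
        if step % check_every == 0 and found is not None:
            break
    return found, messages
```

Each step adds exactly k messages, so coverage grows by a constant per step, in contrast with the exponential growth of a flood.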

Adaptive termination cannot be achieved with a fixed TTL, because appropriate TTL values depend on the popularity and locations of objects, which are unknown in practice. The requesting peer may therefore obtain the results while query packets are still being broadcast to peers over the network. Expanding ring provides partial adaptive termination, since the TTL is increased progressively only as long as the required object has not been found. The checking method is a good example of adaptive termination, because the requesting peer is checked before the walk continues. Thus, once the required object is found, all the corresponding query messages are destroyed. Coverage granularity depends on the search technique. In flooding, an additional step can increase the number of visited peers exponentially, whereas in random walk an additional step increases the number of visited peers by a constant k equal to the number of walkers.

Informed Search

Unlike blind search, an informed search technique gives peers information about the peers to which a query should be forwarded. It builds indexes containing information about peers and uses these indexes to forward queries to the appropriate peers. Index-based search can rely on a centralized or a distributed architecture. The P2P systems related to each type of architecture are presented in the following sections.

Centralized Index

Figure 2.4: Centralized index architecture

In a centralized architecture, a central server keeps information about all the peers of the network. All queries pass through the central server, except for the downloading task, which

is carried out directly between two peers. The centralized index approach is simple to implement and locates files quickly and efficiently. However, it has scalability limits, because it requires larger servers and more storage as the number of users and queries increases. Furthermore, centralized systems are inherently vulnerable because the central server is a single point of failure. We present below two popular P2P systems that use a centralized index to locate data.

Napster: The centralized model was made popular by Napster [83], a file sharing system. The peers of the network connect to a central directory server, which maintains a table registering user connection information (e.g., IP address, connection bandwidth) and a table listing the files that each user holds and shares in the network, along with metadata describing the files (e.g., file name, time of creation). When a peer issues a query, the central index matches the query with the best peer in its directory that has the required file. Depending on the user's needs, the best peer could be the cheapest, the fastest, or the most available one. The file is then exchanged directly between the two peers. Figure 2.4 illustrates the architecture of Napster.

BitTorrent: BitTorrent is a centralized P2P system [15] that uses a tit-for-tat protocol to share data in the network. With this protocol, peers upload to the peers from which they have received data, which discourages free-riders. Peers with a high upload speed will probably also be able to download at a high speed, thus achieving high bandwidth utilization, whereas the download speed of a peer is reduced if its upload speed has been limited. The content is thereby shared among peers, which improves scalability. To start a BitTorrent deployment, a static file with the extension .torrent is placed on an ordinary web server.
The .torrent file contains information about the file (its length, name and hashing information) and the URL of a tracker, as illustrated in figure 2.5. Trackers are responsible for helping peers find each other. Each peer sends download information to the tracker, such as the file name and its listening port. The tracker replies with a list of peers that are downloading the same file. BitTorrent cuts files into pieces of fixed size (256 Kbytes). A hash function is applied to every piece, and the resulting hashes are included in the .torrent file. Each peer announces to all its peers the pieces it has. When a peer finishes downloading a piece and has checked that its hash matches, it announces to all its peers that it has that piece.
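The piece-level integrity check can be sketched in a few lines. The text above only speaks of a hashing function; the sketch assumes SHA-1 and uses helper names of our own (`split_pieces`, `piece_hashes`, `verify_piece`). It does not parse or produce real .torrent files.

```python
import hashlib

PIECE_SIZE = 256 * 1024   # fixed piece size quoted in the text (256 Kbytes)


def split_pieces(data: bytes, piece_size: int = PIECE_SIZE):
    """Cut the file content into consecutive pieces of fixed size."""
    return [data[i:i + piece_size] for i in range(0, len(data), piece_size)]


def piece_hashes(data: bytes, piece_size: int = PIECE_SIZE):
    """Hashes the publisher would list in the .torrent file (SHA-1 assumed)."""
    return [hashlib.sha1(p).digest() for p in split_pieces(data, piece_size)]


def verify_piece(index: int, piece: bytes, expected_hashes) -> bool:
    """True if a downloaded piece matches the published hash for its index."""
    return hashlib.sha1(piece).digest() == expected_hashes[index]
```

A peer would call `verify_piece` on each piece it finishes downloading and announce the piece to its peers only when the check succeeds, so a corrupted piece is never propagated further.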

Figure 2.5: BitTorrent Architecture

Distributed Index

The distributed architecture uses several super peers instead of a central server. A super peer is a peer with high bandwidth, disk space and processing capacity. Super peers are dynamically assigned the task of serving a small part of the network's peers by indexing and caching their files. Distributed systems share the load of a central server among several super peers, which increases the discovery time. On the other hand, distributed systems do not have a single point of failure: if a super peer goes down, its connected peers can open new connections to other super peers.

FastTrack/KaZaA: FastTrack is a distributed file sharing system [95] that relies on metadata searching. FastTrack defines two layers, a super peer layer and an ordinary peer layer, as shown in figure 2.6. Ordinary peers transmit the metadata of the files they share to the super peers. All queries are also forwarded to the super peers. A flooding-based search is then performed in the network of super peers to locate the required files. Both KaZaA [61] and Grokster [41] are applications of FastTrack.

Ultrapeers: The super peer concept has been introduced into Gnutella to make the network more efficient. The ultrapeer approach [106] creates a two-level hierarchy of peers in the Gnutella network, in which faster peers take charge of a large part of the network load. Peers in the Gnutella network are thus divided into leaf peers and ultrapeers. Leaf peers maintain only a single connection to an ultrapeer, while ultrapeers maintain many leaf connections and a small number of connections to other ultrapeers. Ultrapeers hold indexes of the files shared by their leaf peers. The building of these indexes can be

carried out in two ways. The first uses the Clip2 Reflector protocol, in which ultrapeers periodically send an indexing query to their leaf peers, which respond with a query reply naming all their shared files. The second uses the Query Routing Protocol (QRP), in which leaf peers periodically send query routing tables (QRP tables) to their ultrapeers. Ultrapeers then forward queries only to leaf peers whose QRP table has a corresponding entry. Note that QRP tables are not propagated among ultrapeers.

Figure 2.6: FastTrack Architecture

Concluding Remarks

Informed search in unstructured networks improves performance by reducing the traffic over the network. Unlike blind search, informed search uses indexes to forward queries to the peers that are expected to hold the required data. Several parameters have to be considered to maintain the efficiency of informed search in unstructured P2P overlays, among them:
1. Failure point: for availability, the system should not have a single point of failure.
2. Load balancing: in a distributed system, the load has to be shared equitably among super peers to increase efficiency.
Unlike centralized systems, distributed systems have no single point of failure, which makes them more robust and reliable. Load balancing is an important issue in distributed systems. On the one hand, the number of super peers can be fixed, which facilitates load balancing. On the other hand, if the number of super peers is

dynamic, additional measures have to be taken to elect new super peers when the number of peers increases, or to update connections when one or more super peers leave the network.

2.3 Structured Overlay Networks

Structured P2P systems [97, 110, 101, 123, 23, 77] emerged in an attempt to address the scalability issue faced by unstructured networks. Files, or pointers to them, are placed at specified locations, and the system provides a mapping between a file identifier and its location. As a result, queries can be routed efficiently to the node holding the requested file. Two types of structured systems can be distinguished: highly structured systems and loosely structured systems. Highly structured systems use a Distributed Hash Table (DHT), which creates a key space to organize data objects and store them at specific locations according to the peers' identifiers. Loosely structured systems are intermediate between highly structured and unstructured systems: file locations are not completely specified, and therefore not all searches succeed. Both categories are presented in detail below.

Loosely-Structured Systems

An example of a loosely structured system is Freenet [23], a file sharing system. Its main goal is to provide (1) an anonymous method for storing and retrieving information and (2) security mechanisms against malicious peers. The Freenet network maintains a set of files in a limited disk space allocated by the system. When the disk space is exhausted, files are replaced using the LRU (Least Recently Used) replacement strategy. Each file in Freenet is identified by a hash key that can be generated using different strategies. One strategy is to let each user provide a short text description of the file. This description is hashed to generate two keys. The first is a public key, which serves as the file identifier. It is made available to all users.
The second key is a private key, which is used to sign the file for security purposes. Other key generation strategies can be used to help users create hierarchical file structures or generate disjoint name spaces. Peer join and search operations are closely related in Freenet. A search operation is performed on the file key. If a file with that key is found, it is returned to the requesting peer as a form of key collision indication. When no existing file is found, this accomplishes both the replication of the file that occurs during a search as well as

preserving the integrity of the routing tables by placing the new file in a cluster of files with similar keys. When a new peer joins the network, it first discovers at least one other peer in the network to which it can connect. The new peer then performs a search operation, which serves as a notification of its presence in the network.

Figure 2.7: Request sequence example in Freenet

Figure 2.7 depicts a typical sequence of request messages in a Freenet network. The user sends a query at peer A, which forwards it to peer B, which in turn forwards it to peer C. Peer C cannot contact any other peer and therefore returns a failed query message to peer B. Peer B tries its second path by forwarding the query to peer E, which forwards it to peer F, which delivers it back to peer B. Peer B detects the loop and returns a failed query message to peer F. Peer F cannot contact any other peer and returns a failed query message to peer E. Peer E tries its second path by forwarding the query to peer D, which holds the required data. Peer D returns the required data to peer A via peers E and B. The data is cached at peers E, B and A, which creates a routing shortcut for subsequent similar queries.

Highly-Structured Systems

Highly structured systems [97, 110, 101, 123, 77, 78, 68] use a Distributed Hash Table (DHT) to store data objects at specific locations according to the peers' identifiers. DHT systems assign, from the same key space, a unique identifier called a NodeID to each peer and a unique identifier called a key to each data object. Each data object is mapped to the unique peer whose identifier is closest to the data object's key. Each peer maintains a routing table containing the NodeIDs and IP addresses of its neighboring peers. A lookup query is identified by a unique key and is forwarded progressively to peers whose NodeIDs are closer to the query key.
Each DHT-based system has its own methodology to define the key space, organize data and route queries. To locate data in DHT-based systems, an average of $O(\log N)$ hops is required, where N

is the number of peers in the system. The lookup latency in DHT-based P2P overlay networks varies with the underlying network, because the path between two peers in the underlying network can differ significantly from the path on the DHT-based overlay. In this section, we give an overview of two structured overlay networks, the Content Addressable Network (CAN) and Chord, and of other approaches that improve search in such systems.

Figure 2.8: Example of 2-dimensional CAN before and after peer E joins the system

CAN

The Content Addressable Network (CAN) [97] consists of a logical multidimensional Cartesian coordinate space. The whole space is dynamically partitioned into zones among all the peers in the system, and each peer individually manages a distinct zone. A peer maintains a routing table containing the IP address and the coordinate zone of each of its neighbors in the coordinate space. Queries and messages contain destination coordinates, and a peer routes a message to the neighbor that is closest to the destination coordinates. The routing performance of CAN is $O(d \cdot N^{1/d})$, where d is the dimension of the coordinate space and N is the number of peers. CAN uses the coordinate space to store (key, value) pairs. A value can be an address, a document or an arbitrary data item, and the key is its identifier. To store a pair (K, V), the key K is mapped to a point P in the coordinate space using a uniform hash function. The pair (K, V) is then stored at the peer that owns the zone containing P. To look up the entry corresponding to K, any peer can use the same uniform hash function to map K to P and then retrieve the

value V from P. If P is not owned by the requesting peer, the request is routed from neighbor to neighbor until it reaches the peer that owns the zone containing P. When a new peer joins the system, it randomly chooses a point P in the coordinate space and sends a JOIN request to the peer that owns the zone containing P. When that peer receives the JOIN request, it splits its zone into two halves and allocates one half to the new peer. For example, in a 2-dimensional space, a zone is first split along the X dimension, then along the Y dimension, and so on. Once the new peer has obtained its zone, it receives from the previous owner all the relevant (K, V) pairs and the IP addresses of its neighbors. For example, in figure 2.8, when peer F joins the network, it shares the zone with peer E and all the routing tables are updated. When a peer leaves the network, its zone is assigned to one of its neighbors using a specific takeover algorithm. The peer that takes over the zone of the departed peer updates its neighbor list to remove the irrelevant peers. Then, each peer in the system sends an update notification to all its neighbors, which update their own neighbor lists. To improve data availability, CAN can maintain multiple independent coordinate spaces, called realities, in which each peer is assigned a different zone. With r realities, each peer holds r independent neighbor sets. For better performance, CAN can also replicate a single (K, V) pair at k distinct peers by using k different hash functions to map a given key onto k points in the coordinate space. A (K, V) pair is then unavailable only when all k replicas are simultaneously unavailable.
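The mapping from keys to zone owners, the zone split on a join, and the use of k hash functions for replication can be sketched as follows. This is an illustration under our own assumptions, not the CAN implementation: the coordinate space is the unit cube, the k hash functions are obtained by salting SHA-256, and the names `Zone`, `key_to_point`, `owner_of`, `replica_owners` and `split` are hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    owner: str
    lo: tuple   # inclusive lower corner of the zone
    hi: tuple   # exclusive upper corner of the zone

    def contains(self, point):
        return all(l <= x < h for l, x, h in zip(self.lo, point, self.hi))


def key_to_point(key: str, dims: int = 2, salt: int = 0):
    """Map a key to a point of [0, 1)^dims; each salt acts as a different hash."""
    digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
    # Four bytes of the digest per coordinate, scaled into [0, 1).
    return tuple(int.from_bytes(digest[4 * i:4 * i + 4], "big") / 2 ** 32
                 for i in range(dims))


def owner_of(point, zones):
    """Peer whose zone contains the point (zones are assumed to tile the space)."""
    return next(z.owner for z in zones if z.contains(point))


def replica_owners(key, zones, k):
    """Owners of the k points produced by k different (salted) hash functions."""
    dims = len(zones[0].lo)
    return [owner_of(key_to_point(key, dims, salt), zones) for salt in range(k)]


def split(zone: Zone, new_owner: str, dim: int):
    """Halve `zone` along dimension `dim`; the joining peer takes the upper half."""
    mid = (zone.lo[dim] + zone.hi[dim]) / 2
    lower = Zone(zone.owner, zone.lo, zone.hi[:dim] + (mid,) + zone.hi[dim + 1:])
    upper = Zone(new_owner, zone.lo[:dim] + (mid,) + zone.lo[dim + 1:], zone.hi)
    return lower, upper
```

Because `replica_owners` returns one owner per hash function, a requester can address all of them at once, and the pair remains reachable as long as at least one of these owners is available.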
Thus, queries for a specific hash table entry can be sent in parallel to all k peers, which reduces the average query latency and improves reliability and fault resilience. For better routing metrics, CAN can take into account the underlying IP topology and connection latency alongside the Cartesian distance between source and destination.

Chord

Chord [110] uses consistent hashing [59] to balance the load on the system, allowing peers (1) to receive almost the same number of keys and (2) to join and leave the network with minimal disruption. If the N-th peer joins (or leaves) the network, only an $O(1/N)$ fraction of the keys is moved to a different location, and only $O(\log^2 N)$ update messages are required. To improve the scalability of consistent hashing, each peer in Chord has only partial knowledge of the other peers: it maintains routing information about $O(\log N)$ other peers, where N is the number of peers in the system. The consistent hash function assigns an m-bit identifier to each peer and data key.

Figure 2.9: Example of Chord system

The m-bit identifier is computed using the SHA-1 hash standard [1]. The identifier of a peer is obtained by hashing its IP address, while the identifier of a data item is obtained by hashing the data key. To reduce the probability that two different keys receive the same identifier, the length m of an identifier should be large. Identifiers are ordered on an identifier circle modulo $2^m$. Key k is assigned to the first peer whose identifier is equal to or follows k in the identifier space. This peer is called the successor peer of key k, denoted successor(k). If identifiers are represented as a circle of numbers from 0 to $2^m - 1$, as shown in figure 2.9, then successor(k) is the first peer clockwise from k. The circle of identifiers is called the Chord ring. When a peer n joins the network, some of the keys previously assigned to the successor of n are reassigned to n. When a peer n leaves the network, all of its keys are reassigned to its successor. Peers therefore join and leave the system at a cost of $O(\log^2 N)$ messages. In figure 2.9, the Chord ring is depicted with m = 4. This ring has three peers and stores three keys. The successor of identifier 1 is peer 1, so key 1 is located at NodeID 1. The successor of identifier 2 is peer 3, so key 2 is located at NodeID 3, and the successor of identifier 6 is peer 0, so key 6 is located at NodeID 0. Similarly, if a peer with identifier 7 joined the network, it would take over the key with identifier 6 from the peer with identifier 0. Each peer n maintains a routing table, called the finger table, with at most m entries, as shown in figure 2.10. The i-th entry in the table at peer n contains the identifier of the first peer S that succeeds n by at least $2^{i-1}$ on the identifier circle, i.e., $S = successor(n + 2^{i-1})$, where $1 \leq i \leq m$ and all arithmetic is modulo $2^m$. Peer S is called the i-th finger of peer n, denoted n.finger[i].
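The successor rule and the finger table definition can be checked against the example of figure 2.9 with a short sketch. It assumes a global, static view of the peer identifiers, which a real Chord peer does not have; the function names are ours.

```python
from bisect import bisect_left


def successor(identifier, peers, m):
    """First peer whose identifier is equal to or follows `identifier` on the ring."""
    ring = sorted(peers)
    i = bisect_left(ring, identifier % (2 ** m))
    return ring[i % len(ring)]        # wrap around past the largest identifier


def finger_table(n, peers, m):
    """Entry i (1 <= i <= m) is successor(n + 2^(i-1)), computed modulo 2^m."""
    return [successor((n + 2 ** (i - 1)) % (2 ** m), peers, m)
            for i in range(1, m + 1)]
```

With m = 4 and peers 0, 1 and 3, this reproduces the assignments described above: keys 1, 2 and 6 map to peers 1, 3 and 0, and key 6 moves to peer 7 once a peer with that identifier joins.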
A finger table entry includes both the Chord identifier and the IP address (and port number) of the relevant peer. Lookup queries require matching a key to a NodeID. A given identifier could

be passed around the circle via these successor pointers until it reaches a pair of peers between which the required identifier falls; the second peer in the pair is the peer to which the query maps.

Figure 2.10: Chord Operations

Complex Queries in DHT Systems

More recently, some approaches [100, 42, 114, 47, 92] have focused on extending DHT systems to support more complex queries such as range queries. Gupta et al. [42] propose the Meghdoot system, which creates a logical Cartesian space with 2n dimensions, where n is the number of attributes that define the system schema $S = \{A_1, A_2, \ldots, A_n\}$. All the peers in the system use the same schema S to describe their data and queries. Each attribute is described by the tuple {Name, Type, Min, Max}. Attributes are identified by their unique names. The data types may be integer, floating point or character string. The values Min and Max describe the range of the attribute's domain. Each attribute $A_i$ corresponds to dimensions $2i$ and $2i - 1$ of the logical space. These two dimensions correspond to the bounds $Min_i$ and $Max_i$ of attribute $A_i$. The logical space is partitioned into zones and each peer is responsible for one zone. The peers maintain a multidimensional distributed hash table as described in CAN [18]. This approach was proposed to provide scalable content search mechanisms that take the semantics of peers' data into account. It thus allows efficient routing of range queries for textual data. Triantafillou et al. [114] propose a closely related approach. They make the same assumptions as above; the main difference is that Triantafillou et al. use Chord instead of CAN.
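To illustrate how an n-attribute schema leads to a 2n-dimensional space, the following sketch maps a multi-attribute range query to a point. It rests on assumptions of ours that go beyond the description above: the lower bound of attribute i is placed on dimension 2i - 1 and the upper bound on dimension 2i, attributes are numeric, and an unconstrained attribute defaults to its full domain. The function name `query_to_point` is hypothetical and is not taken from Meghdoot.

```python
def query_to_point(schema, ranges):
    """Map a multi-attribute range query to a point of the 2n-dimensional space.

    schema -- list of (name, domain_min, domain_max), one entry per attribute
    ranges -- dict mapping an attribute name to its queried (low, high) range
    """
    point = []
    for name, dom_min, dom_max in schema:
        low, high = ranges.get(name, (dom_min, dom_max))  # unconstrained: full domain
        if not (dom_min <= low <= high <= dom_max):
            raise ValueError(f"range for {name} lies outside its domain")
        point.extend([low, high])       # assumed layout: dimensions 2i-1 and 2i
    return tuple(point)
```

Once a query is a single point, it can be routed through the zone partition like any other CAN coordinate, which is the property the approach relies on.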

Huebsch et al. [47] have proposed a semantic DHT system called PIER, based on the CAN approach. The hashing function that calculates the keys is based on a namespace and a resourceid. The namespace identifies the application or group a data object belongs to. For query processing, each namespace corresponds to a relation. Namespaces do not need to be predefined: they are created implicitly when the first corresponding data object joins the network, and they are destroyed when the last of their related data objects leaves the network. The resourceid is a value that carries some semantic meaning about the object. Data objects with the same namespace and resourceid have the same key, so they are assigned to the same peer. Thus, queries having the same resourceid, that is, similar queries, are sent to the same peer, which performs an approximate search. Similarly, Papadimos et al. [92] propose an architecture where peers represent their data objects using a namespace of multiple hierarchical categories. Queries are then routed efficiently without depending on centralized servers.

Concluding Remarks

Highly structured P2P systems are based on DHT techniques to organize data in a key space. Each data object and each peer in the system is associated with a unique key that supports data organization and query lookup. DHT techniques have been developed in the literature to improve data search efficiency, system scalability and fault tolerance. DHT search has three main characteristics:

1. It is adapted to exact match queries and does not support complex queries such as content-based and partial match queries. Some approaches address this issue by adding a layer on top of existing DHT systems to process multi-attribute range queries.
2. Each peer is responsible for a set of objects in the system. Thus, update operations have to be performed to keep information coherent when data change or when peers leave or join the network.
3. DHT stores a copy of, or a pointer to, each data object (or value) at the peer responsible for the data object's key, which limits its use in applications that require privacy.

2.4 Conclusion

In this chapter, we have presented different search models in P2P networks that depend on the underlying architecture. Unstructured networks usually rely on (1) a blind search that floods the network to route queries, or (2) an informed search that uses a centralized server or the super peer concept to keep indexes of peers' content. Structured networks can be highly structured or loosely structured. Highly structured systems use a Distributed Hash Table (DHT), creating a key space to organize data objects and store them in specific locations according to peers' identifiers. Loosely structured systems are intermediate between the highly structured and the unstructured systems: file locations are not completely specified. We have discussed the characteristics and issues of each search model and presented an overview of some approaches proposed to address those issues. Other search models based on clustering techniques are presented in the next chapter.

Chapter 3

Design of Cluster-based P2P Systems: The Hybrid Overlay Network

Contents
3.1 Introduction
    Similarity Search
    Textual Applications
    Multimedia Applications
    Distributed Search
3.2 P2P Similarity Search
    Clustering Approaches
    Semantic-based Approaches
    Low level Feature-based Approaches
    Discussion
3.3 Problem Statement
    Data and peers organizing
    Feature selection
    Hierarchical architecture
    Caching mechanism
3.4 Hybrid Overlay Network
    General Description
    Organization of an Overlay
    Feature Space of an Overlay
    Organizing Layers of an Overlay
    Peer Join and Departure
    Peer Join
    Peer Departure
3.5 Clustering Issues in HON
    Requirements
    Affinity-based Clustering
    Cells Affinity Computation
    Grouping cells
    Density-based Clustering
    CLindex Approach
    Distributed Density Clustering in HON
3.6 Caching Issues in HON
3.7 Performance Evaluation
3.8 Conclusion
3.9 Bibliographic Remarks

In P2P architectures, peers exchange information and interact with other peers to process queries. The methods used to route queries among peers can significantly affect the efficiency of information retrieval by consuming high network bandwidth and incurring delays. There is an increasing need to cluster peers according to common characteristics, to improve the performance of information retrieval by organizing data into segmented subspaces and peers into groups based on their content. Several characteristics can be used to represent peer content, including semantic annotations of documents and physical feature descriptions of the objects composing a document.

In this chapter, we present a cluster-based design of a P2P system for distributed similarity search. First, we describe several characteristics of search techniques and present several traditional applications that use these techniques. In section 2, we present P2P similarity search, focusing on clustering issues. In section 3, we state the problem tackled in this thesis and discuss the features that a good solution should include. The next four sections are devoted to our approach and present an overview of the main contributions of our work. In section 4, we present the design of the Hybrid Overlay Network (HON) for clustering peers sharing similar content. We discuss clustering issues in section 5 and describe the two main P2P clustering algorithms investigated in this thesis. In section 6, we review some caching issues and discuss the HON cache model. In section 7, we review the implementation of a prototype and discuss different aspects of performance evaluation. Finally, section 8 concludes this chapter and section 9 presents some bibliographic remarks.

3.1 Introduction

Information searching is a central issue in many applications and may take two forms. Exact Match Search consists in retrieving data that exactly equal a query.
This type of simple search is commonly used in traditional databases composed of records, where each record has a key. A query submitted to these databases returns the records containing keys that match the search key. Exact match is used in many applications. In string matching applications an exact search consists in detecting copies of a string in another string. In image-based applications, an exact match consists in retrieving images that are identical to a sample image submitted as a query. Figure 3.1a shows an example of an exact search of multimedia objects. The image query contains a face and is used to retrieve the objects that contain the same face. Similarity Search is the second form of information search that is commonly used. It consists in retrieving all data that partially or approximately match a given query. Similarity search can be used

in many applications including content-based information retrieval, image database applications and video search. Figure 3.1b illustrates an example of a similarity search based on the same sample image as in figure 3.1a. In this case, the returned objects contain round faces that have common characteristics with the query image. In the remainder of the chapter, we focus on various aspects of similarity search in distributed environments. We particularly investigate distributed search in P2P architectures using clustering and caching methods.

Figure 3.1: Examples of queries

Similarity Search

Similarity search queries can be divided into two types: nearest neighbor queries and range queries. Each type of query is designed for specific problems. For example, nearest neighbor queries can be used to address spelling problems by identifying the closest matches for any text string not found in a dictionary. Exact matching is not adequate since we live in an error-prone world. Range queries can be used in geographical information system (GIS) applications, where any data object with d numerical fields can be modeled as a point in a d-dimensional space. A range query specifies a region in the d-dimensional space and asks for all points, or the number of points, in the region. For example, if a data object records a person's height, weight and income, a range query can ask for all people with incomes between zero and a thousand dollars, with height

between six and seven feet, and weight between fifty and one hundred pounds.

Textual Applications

Textual documents are generally not structured to easily provide the desired information, particularly since they are in many cases searched for semantic concepts of interest. Several research studies have been devoted to this problem [9, 37, 102] and have addressed it by retrieving documents similar to a given query. The user can present a document as a query, and the system finds and retrieves similar documents. Some similarity approaches map a document to a vector of real values, where each value corresponds to a dimension representing a vocabulary word. A value represents the degree of relevance of the word to the document. Similarity functions are then defined on the dimensions used to describe a document. Another problem related to text retrieval is spelling. Since huge text resources with low quality control are available on the web, and typing or spelling errors are present in both texts and queries, documents that contain a misspelled word cannot be retrieved by a correctly written query. Models of similarity among words are then required to look for close variants of queries.

Multimedia Applications

Multimedia data such as images, audio and video cannot be meaningfully queried using exact matching because they cannot be compared for equality. For example, the probability that two images are pixel-wise equal is negligible unless they are digital copies of the same source. In multimedia applications, many queries are content-based and are used to retrieve objects that are similar to a given object. Example applications include face recognition, fingerprint matching, voice recognition and multimedia databases. An example query is the identification of objects such as a tree or a man contained in some database images.
If the repository is tagged and each tag contains a full description of what the image contains, then the content-based query is resolved using a traditional keyword-based search on the set of image annotations. Unfortunately, images cannot be easily tagged for every possible query. Moreover, object recognition in real world scenes involves complex operations, and the results might not be accurate. A solution to this problem is to provide a query image and let the system search for all the images similar to the query. This approach is based on the definition of similarity functions among objects.
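The two query types introduced in the Similarity Search discussion, range queries and nearest neighbor queries, can be sketched over objects modeled as points in a d-dimensional space. The sketch below makes several assumptions that are not in the text: a linear scan instead of an index, Euclidean distance as the similarity function, and made-up sample records of the form (height in feet, weight in pounds, income in dollars).

```python
import heapq
import math

def range_query(points, lower, upper):
    """Return the points lying inside the axis-aligned region [lower, upper]."""
    return [p for p in points
            if all(lo <= x <= hi for x, lo, hi in zip(p, lower, upper))]

def nearest_neighbors(points, query, k):
    """Return the k points closest to `query` under Euclidean distance."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, query))

# Hypothetical records: (height ft, weight lb, income $).
people = [(6.2, 80, 900), (5.5, 150, 40000), (6.8, 95, 500), (6.5, 60, 1200)]

# Height between 6 and 7 feet, weight between 50 and 100 pounds,
# income between 0 and 1000 dollars.
in_range = range_query(people, lower=(6, 50, 0), upper=(7, 100, 1000))
print(in_range)   # [(6.2, 80, 900), (6.8, 95, 500)]

closest = nearest_neighbors(people, (6.0, 90, 600), k=1)
print(closest)    # [(6.8, 95, 500)]
```

In the nearest neighbor call the income dimension dominates the distance because the features are not normalized. A practical similarity function would usually rescale or weight the dimensions first.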

Distributed Search

The web is a successful example of a medium for publishing and accessing content over the Internet. Search engines for the Web and large corporate networks are based on centralized databases, which store and index documents to facilitate information retrieval. Centralized search models suffer from several problems. First, information that is proprietary, costs money or needs to be controlled by its publisher cannot be stored without restrictions in a centralized architecture. The information privacy issue can be partially resolved by storing indexes instead of real data in a central repository. In both cases, the central repository size increases with the number of elements and may become a crucial performance problem. Second, centralized models require costly high-bandwidth connectivity and powerful servers for an efficient service. Third, the centralized component is a single point of failure that is vulnerable to many types of attacks.

Alternatively, in distributed architectures data can reside in several databases on different sites. A distributed architecture allows sharing of data storage and searching over multiple network nodes. This alternative is generally more scalable than the centralized model and can support a larger number of simultaneous users. Moreover, distributed architectures are more fault tolerant and less vulnerable to total failure. If one node or component is down, only a part of the information is lost. The performance of the system degrades progressively, but the system does not stop running.

Figure 3.2: Challenges of distributed search

The main challenges of information retrieval in distributed architectures are defined by Losee et al. [73] as follows:

1. Database Selection: determining which database(s) to use. The choice of a database might take into account factors such as whether some sites are uncooperative or charge a fee for using a database. Another factor is how much information sites make available about their collections.
2. Query Translation: since each database uses its own schema to describe its data, queries have to be translated into a form that the requested databases understand. The use of databases with different schemas generates several translation problems. Even if two databases support the same schema, their vocabularies and query syntax may differ. This makes the translation task very challenging in autonomous distributed systems.
3. Results Merging: combining the results from different databases into a coherent whole. The main issue here is that each site uses its own similarity function and may return results that are not ranked; even when they are ranked, no numeric values are associated with the ranking. This makes it difficult to combine the different results into a single ranked list.

In our work, we focus on information retrieval in distributed environments. In particular, we address the issue of similarity search in P2P networks, which is very challenging because of the autonomy and dynamism of peers. In the following, we discuss the different search techniques that have been introduced for P2P networks, and we study their ability to provide an efficient similarity search.

3.2 P2P Similarity Search

In chapter 2, we presented several P2P search techniques that depend on the underlying overlay infrastructure. There are two main types of search techniques in unstructured P2P networks: (1) the blind search technique, which does not rely on peer information or relationships and floods queries over the network to retrieve relevant answers, and (2) the informed search technique, which uses information collected on the peers to forward queries to relevant peers only. Both techniques suffer from several limitations.
First, blind search techniques require a large network bandwidth, which reduces search efficiency. Second, blind search techniques provide only partial retrieval of unpopular files that are lightly replicated in the network: the query may time out before reaching unpopular files on peers located in other regions of the flooded network. Third, both techniques are based on peer indexes or file names that do not describe peer content, a limitation that is particularly severe for similarity search.

The most common technique used in structured networks is the Distributed Hash Table (DHT), which organizes data in a key space for an efficient access and a complete

lookup. DHTs have been proposed to address the scalability issue related to the number of nodes in P2P networks. However, DHT search cannot be used for similarity search for three main reasons. First, it is adapted to exact match queries and does not support complex queries such as content-based and partial match queries. This is simply because the hash function used to map key values to the space segment is difficult to adapt to approximate key values. Second, each peer is responsible for a set of objects in the system, which entails a high maintenance cost for updating relevant information when data change or peers leave or join the network. Third, it is necessary to store a copy of, or a pointer to, each data object (or value) at the peer responsible for the data object's key. As a result, DHT might not be an appropriate approach for some applications that require privacy.

Similarity search can be inherently expensive, especially when the query processing technique is not designed to handle similar queries. First, the search space can grow exponentially with the number of peers and data. Second, peers containing similar data objects are usually not organized in a way that facilitates their retrieval. We have discussed above the different search techniques that have been proposed in the literature. With regard to similarity search, clustering-based search techniques are a good alternative to the existing search techniques. The goal is to create links, on top of unstructured P2P overlay networks, between peers with similar characteristics. Queries can then be easily forwarded to groups of peers holding similar information. In the following, we give a brief overview of clustering techniques that have been proposed for P2P networks, focusing on content-based approaches.

Clustering Approaches

Figure 3.3 illustrates the impact of clustering on search efficiency.
It shows the difference between a search on a network without clustering and a search over a cluster-based network. In case (a), the query looks for all the peers that contain the concept A and is flooded with a maximum of two hops. Five messages are sent over the network, and only one of the four existing answers is retrieved. In case (b), the same query is flooded with a maximum of three hops. Ten messages are generated and only two answers are retrieved. In case (c), peers that share the same concepts are organized into clusters. The same query is therefore first routed to the relevant cluster and then flooded inside the cluster to all of its peers. As a result, only five messages are required to retrieve all peers holding the concept A.

Figure 3.3: Clustering impact on search efficiency. (a) Flooding with hop=2. (b) Flooding with hop=3. (c) Search using clustering.

There are two main categories of P2P clustering techniques. Context-based techniques create clusters using peer properties (IP address, network distance, application needs, etc.), while Content-based techniques group peers according to content (semantic, low-level features, etc.). Different context-based information or properties can be used to cluster peers. Some clustering approaches are based on network related information [66] or peer characteristics [53, 74, 5]. Other approaches use functionalities or properties of applications. For example, Wang et al. [117] define an application oriented P2P architecture called Friends Troubleshooting P2P Network (FTN) for resolving machine configuration problems. Similarly, Marti et al. [80] propose a friend network of peers for dealing with security related attacks.

In our work, we focus on Content-based clustering, which exploits similarities among the documents of peers. An appropriate representation of peer content is required to assign peers with similar content to the same clusters. Several approaches have been proposed in this direction [108, 87, 28, 85, 86, 112, 43]. Two main categories can be distinguished: Semantic-based approaches and Low level Feature-based approaches.

Semantic-based Approaches

Semantic-based approaches [28, 85, 86, 112] associate peers with semantic descriptions that can be simple keyword-based annotations, schemas or ontologies. The descriptions

are usually based on common domain concepts. Nejdl et al. address clustering strategies for RDF-based P2P networks in [86]. Their solution is a schema-based approach where the content of peers is annotated using RDF and RDF Schema. The resulting RDF P2P network is a hierarchical architecture consisting of (1) super peers interconnected by a hypercube topology and (2) simple peers connected to the super peers. When a peer joins the network, it publishes a metadata-based description of its content to the super peers. The RDF-based descriptions are then used to carry out semantic comparisons between peers and super peers. The Piazza project [112] is a peer data management system (PDMS) aimed at sharing semantically heterogeneous data and schemas. Clusters are built by creating mappings between semantically similar peers.

Figure 3.4: Semantic Overlay Network

Another semantic-based approach, called Semantic Overlay Networks (SONs), is proposed by Crespo et al. [28]. The goal of the approach is to create multiple clusters, called semantic overlay networks, to improve search performance. The overlay networks are defined so that, for a given query, a small number of overlays containing peers with a high number of relevant answers can be selected. The authors use a concept classification hierarchy to form the overlay networks. They classify each peer into one or more concepts in the hierarchy. A SON is associated with one concept in the classification. Figure 3.4a shows three classification hierarchies for music documents. The documents are classified by style (rock, jazz, etc.) and sub-style (soft, dance, etc.) in the first hierarchy, by decade in the second hierarchy, and by tone (warm, exciting, etc.) in the third hierarchy. Figure 3.4b shows an assignment of peers to three SONs (Rock, Rap and Jazz). Note that a peer can belong to more than one cluster. The peer C belongs to the Rap

and Rock SONs. The authors discuss several strategies for assigning peers to a SON. The conservative strategy puts a peer in a SON if it has at least one document classified in the corresponding concept, while the less conservative strategy places a peer in a SON only if it has a significant number of documents corresponding to the concept. This last strategy has two main benefits. First, the peers to which the request is sent will have many matches, so the request is answered efficiently. Second, the peers that have few results for a query will not receive it, which avoids wasting resources on that request. The Hybrid Overlay Network approach, which is one of the contributions of this thesis, is based on and extends some of the ideas of the SON approach. The HON approach is presented in section 3.4.

Low level Feature-based Approaches

A second approach to content-based clustering uses low level features to select the member peers of a cluster. Both documents and peers are characterized by feature vectors, which are used to define similarity measures. Low level feature-based approaches create clusters according to low level feature descriptions of peer content. For example, Hang et al. [87] propose a CBIR (Content-Based Image Retrieval) system on top of a P2P network by grouping peers that share similar images. Each peer extracts the content description of its shared images to form a collection of feature vectors, from which its signature value is derived. The signature vectors are used to calculate similarity measures between peers. Two types of links can be established between peers. A random link connects a peer p to another peer chosen at random, while an attractive link is an explicit connection made by a peer to another peer with similar images. The DIstributed COntent-based Visual Information Retrieval (DISCOVIR) project [104] also groups together peers that share similar images.
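Returning to the SON assignment strategies described earlier in this section, the difference between the conservative and the less conservative strategy can be sketched as follows. The sketch assumes that each document of a peer has already been classified into exactly one concept, and it interprets "a significant number of documents" as a minimum fraction of the peer's documents. Both the `min_fraction` parameter and the sample data are assumptions made for illustration, not values taken from the SON proposal.

```python
from collections import Counter

def assign_to_sons(doc_concepts, min_fraction=None):
    """Return the set of SONs (concepts) a peer joins.

    doc_concepts: one concept label per document held by the peer.
    min_fraction: None selects the conservative strategy (one document is enough);
                  otherwise the peer joins only the SONs whose concept covers
                  at least this fraction of its documents.
    """
    counts = Counter(doc_concepts)
    if min_fraction is None:
        return set(counts)
    total = len(doc_concepts)
    return {concept for concept, n in counts.items() if n / total >= min_fraction}

# Hypothetical peer holding 6 Rap, 5 Rock and 1 Jazz documents.
peer_c = ["Rap"] * 6 + ["Rock"] * 5 + ["Jazz"]

conservative = assign_to_sons(peer_c)
selective = assign_to_sons(peer_c, min_fraction=0.25)
print(sorted(conservative))   # ['Jazz', 'Rap', 'Rock']
print(sorted(selective))      # ['Rap', 'Rock']
```

Under the selective setting the peer no longer receives Jazz queries, for which it could contribute at most one answer.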
To achieve efficient query routing, DISCOVIR uses a Firework Query Model (FQM) [87], which processes queries as follows. When a peer initiates or receives a query message, the query is routed selectively according to its content. Once it reaches its designated cluster, the query message is broadcast by peers inside the cluster. This selective routing minimizes the number of messages passed and maximizes the ability to retrieve relevant data.

Discussion

In this section, we have reviewed the characteristics and discussed the limitations of search techniques used in two P2P network categories, namely the unstructured and

the structured architectures. When considered for similarity search, the search techniques of unstructured networks exhibit excessive processing delays and require high network bandwidth. The DHT-based search technique used by structured networks is efficient for exact match queries. However, the hash functions used in this search algorithm cannot be easily extended to keys with approximate values. Clustering-based P2P networks were also presented in this section, with a particular emphasis on the advantages of content-based clustering. The goal of the search techniques used in cluster-based networks is to send queries to groups (or clusters) of peers with similar content. The similarity among peers can be determined semantically, as in SON networks, or can be computed from peer signatures represented by feature vectors.

3.3 Problem Statement

The focus of this thesis is on the design of a cluster-based P2P system and on efficient query routing and processing in a distributed environment.

Data and peers organizing

Efficient similarity search in a P2P environment requires careful organization of data and peers. Query efficiency can only be achieved if the data being retrieved are structured appropriately. The DHT and clustering techniques presented previously organize data and peers, respectively, to perform an efficient search. DHT techniques organize data in a key space for efficient indexing but do not take into account the similarity between data objects. Moreover, DHT techniques require close cooperation among all participating peers to manage and distribute peer content in the data space. Peers in security and privacy oriented applications may not accept placing their data on an unknown peer. Clustering approaches, on the other hand, group peers with similar interests to avoid blind search. However, they rely on an approximate description of peers, which affects the accuracy and recall of search results.
The objective of our work is to organize both data and peers to provide an efficient similarity search in P2P networks. To this end, we combine the characteristics of CAN and clustering techniques. The Content Addressable Network (CAN) presented in chapter 2 organizes data in an n-dimensional key space. The space is partitioned into zones, each owned by one peer. An object key is a point in the space, and the object is stored at the node whose zone contains the point. This organization

of data provides efficient query routing based on object keys. However, these keys are generated using hash functions and do not indicate any similarity between objects. Our goal is to use a space of features describing the content of data objects. The feature vectors of objects are used as keys. Similar objects are therefore close to each other in the feature space, which allows efficient processing of complex queries such as range and nearest neighbor queries. Peers can then be organized in the same feature space as their data, which facilitates the creation of clusters of similar peers and limits the flooding overhead by sending queries only to relevant peers.

Feature selection

A fundamental issue is the creation of the feature space by identifying the significant features used to describe peer data. The dimensionality of the feature space depends on the type of data objects held by peers. Non-traditional data such as images and multimedia are represented by a large number of features, which introduces the problem of high dimensional data known as the curse of dimensionality [12]. Problems with high dimensionality result from the fact that a fixed number of data points becomes sparse as the dimensionality increases, resulting in a multidimensional space that is mostly empty. In addition, the cost of creating and maintaining clusters in such spaces can be prohibitive. To avoid high dimensionality problems, we study existing dimensionality reduction solutions and adopt a feature selection technique to build multiple overlay networks of peers described by different features. Each overlay groups peers with common features and uses a feature space to organize data and cluster peers.

Hierarchical architecture

In chapter 2, we discussed three underlying P2P architectures: distributed architectures, centralized architectures and hybrid architectures.
We stated the importance of these underlying architectures in (1) the development of query routing algorithms and (2) the management and maintenance of a coherent system state when peers join and leave the network. Regarding query routing algorithms, centralized architectures rely on an efficient algorithm that uses a centralized index to find answers. Distributed architectures use a flooding-based algorithm, which increases network bandwidth consumption. Hybrid architectures use partial flooding. Regarding the network structure, centralized architectures suffer from a single point of failure and risk becoming a network bottleneck. Distributed systems maintain very little state information and therefore scale efficiently when peers join and leave the system. However, this is not the case for distributed DHT-based architectures, which must maintain complex hash indexes and incur a high cost for managing arrivals and departures of peers. The work presented in this thesis is based on a hybrid architecture, which introduces the concept of super peers and simple peers. Hybrid P2P architectures have several properties:

1. They reduce the search delay and bandwidth consumption. A hybrid architecture provides a fast search by dividing the network into clusters. The search is therefore limited to a smaller set of super peers, each of which has indexed the content of its peers.
2. They efficiently manage the system structure when peers join and leave the network. In addition, super peers can monitor peer activity to prevent malicious behavior.
3. They balance the load of peer management between several super peers. Therefore, the risk of network congestion is limited.

Caching mechanism

Another important issue in the design of a cluster-based architecture is caching. Our objective is to use caching in HON to reduce flooding overhead inside clusters and thus improve routing and lookup performance. The goal of HON caching is to keep track of queries and of the peers that return good results. Future queries similar to those stored in the cache can then benefit from the cached information: only peers that returned good results are queried to retrieve answers. Therefore, the portion of irrelevant peers participating in query processing is reduced. To facilitate sharing and retrieval of cached information, peers share information about their queries in the same way they share their data objects. Since similar queries are served from the same region of the feature space, the cache is placed at a specific point of each region. Thus, similar queries and their information are stored in the cache of their region, which improves cooperation and avoids redundancy. P2P caching differs in some ways from traditional web caching systems.
The main difference is that, in traditional web caching, the stored data is kept on well-identified static web servers. By contrast, in P2P systems query results are combinations of partial results from one or more peers, which can frequently connect to (or disconnect from) the network. Thus, caching in P2P systems is a challenging task that requires careful design and management.
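The region-based caching idea described in this section can be sketched as a small data structure. The sketch is illustrative only and does not reproduce the HON cache model: it assumes that regions are the cells of a uniform grid over the feature space, that two queries are similar when their Euclidean distance is below a fixed radius, and that the cache stores, for each past query, the set of peers that returned good results. The class name, its parameters and the peer names are hypothetical.

```python
import math
from collections import defaultdict

class RegionCache:
    """One logical cache per region; each entry keeps a query and its good peers."""

    def __init__(self, cell_width, radius):
        self.cell_width = cell_width       # side length of a region (assumed uniform grid)
        self.radius = radius               # similarity threshold between queries
        self.entries = defaultdict(list)   # region id -> [(query, good_peers)]

    def region_of(self, query):
        """Identify the region (grid cell) a query vector falls into."""
        return tuple(int(x // self.cell_width) for x in query)

    def record(self, query, good_peers):
        """Remember which peers returned good results for this query."""
        self.entries[self.region_of(query)].append((tuple(query), set(good_peers)))

    def lookup(self, query):
        """Return the peers cached for similar past queries of the same region."""
        peers = set()
        for cached_query, good_peers in self.entries.get(self.region_of(query), []):
            if math.dist(cached_query, query) <= self.radius:
                peers |= good_peers
        return peers

cache = RegionCache(cell_width=2.5, radius=1.0)
cache.record((1.0, 1.0), {"peerA", "peerB"})

print(sorted(cache.lookup((1.5, 1.2))))   # ['peerA', 'peerB']
print(cache.lookup((8.0, 8.0)))           # set()
```

An empty result means the query has to fall back to flooding inside the cluster. One limitation of this simplification is that two similar queries lying on opposite sides of a region boundary do not share cache entries.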

3.4 Hybrid Overlay Network

The goal of the Hybrid Overlay Network (HON) is to allow efficient similarity search in P2P architectures. It organizes both peers and data in an n-dimensional feature space based on content description. First, we present the architecture of HON, which consists of different overlays used to identify the different sets of features required to describe the content of peers. We present the HON components and the different levels of peers used to manage the network. Second, we define the structure of the feature space. Third, we present the different steps required to build the system. Finally, we introduce two alternatives for clustering that allow efficient similarity query processing.

General Description

The main idea of HON is to organize peers into clusters according to their data descriptions [54]. We assume that the content of peers is described by a set of features F called the Feature Space. The whole set of features F may not be necessary to describe the content of a given peer. In other words, a peer can be described by a subset of F depending on its data objects. For example, a peer holding text files can be represented by a set of keywords where each keyword is a feature. Another peer containing images can be described by low level features such as color, texture or shape. Because peers can be very heterogeneous with regard to their data content, the number of features in the feature space, that is, the dimensionality of F, may grow exponentially as the number of participating peers increases. As a result, problems related to high dimensionality appear and affect system performance. High dimensionality problems can be avoided by dividing the global set of features into several subsets of limited size. These subsets contain the most important features for characterizing peer content. Each set of selected features corresponds to an overlay. Thus, HON consists of multiple overlays of different types of features.
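The benefit of splitting the global feature set into smaller subsets can be made concrete with a small calculation. The sketch assumes that each feature range is cut into a fixed number of intervals, so that a feature space becomes a grid of cells, and it compares one 12-feature space with three 4-feature overlays. The object count, the number of intervals and the function names are hypothetical values chosen for illustration.

```python
from math import prod

def cell_count(partitions_per_feature):
    """Number of grid cells when feature i is cut into partitions_per_feature[i] intervals."""
    return prod(partitions_per_feature)

def occupancy_bound(num_objects, partitions_per_feature):
    """Upper bound on the fraction of cells that can hold at least one object."""
    return min(1.0, num_objects / cell_count(partitions_per_feature))

NUM_OBJECTS = 1000                       # hypothetical number of data objects
single_space = [4] * 12                  # one 12-feature space, 4 intervals per feature
overlays = [[4] * 4, [4] * 4, [4] * 4]   # the same 12 features split into three overlays

print(cell_count(single_space))                              # 16777216
print(occupancy_bound(NUM_OBJECTS, single_space) < 0.0001)   # True
print([cell_count(o) for o in overlays])                     # [256, 256, 256]
print([occupancy_bound(NUM_OBJECTS, o) for o in overlays])   # [1.0, 1.0, 1.0]
```

With a single 12-dimensional space, fewer than one cell in ten thousand can contain data, whereas each 4-dimensional overlay has few enough cells for the same objects to populate it densely.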
Figure 3.5a shows the subdivision of a feature space $F$ composed of the features $f_1, f_2, \ldots, f_n$ into overlays. High dimensionality problems and overlay creation are presented in detail in the next chapter.

Organization of an Overlay

An overlay is composed of one or more clusters of similar peers, as illustrated in figure 3.5b. Each overlay is associated with a set of features, which can be semantic features such as keywords or low-level features such as color, texture and shape. Overlay clusters are organized in a two-level hierarchy consisting of super peers and simple peers. The super peers, which manage the peers of the clusters, must have high processing and storage capacities, while the simple peers have limited capabilities. A super peer contains the index tables used to route queries inside its cluster and to neighboring clusters. An index table contains the descriptions of clusters and the IP addresses of their related peers. The index tables are discussed more extensively in a later chapter.

Figure 3.5: Hybrid Overlay Networks

Feature Space of an Overlay

The feature space has $n$ dimensions represented by the features $f_1, f_2, \ldots, f_n$ used to describe the data of the peers mapped to an overlay. It is used to create the clusters and to organize the data contained in peers. The values of each feature $f_i$ range from a minimum value $f_i^{min}$ to a maximum value $f_i^{max}$. Figure 3.6a shows a feature space composed of two features $f_1$ and $f_2$ taking their values from 0 to 10. The main idea is to define a partition of the feature space into cells by dividing the range of values of each feature into several intervals. For example, the range of each feature in figure 3.6a is divided into 4 intervals, resulting in 16 cells. Formally, the range $[f_i^{min}, f_i^{max}]$ of a feature $f_i$ is divided into $m_i$ intervals of equal size $(f_i^{max} - f_i^{min}) / m_i$, for $i = 1, 2, \ldots, n$. We denote the resulting set of cells $\phi = \{\varphi_1, \varphi_2, \ldots, \varphi_m\}$, where $m = \prod_{i=1}^{n} m_i$.

Once the feature space is partitioned, we use the distribution of data over the cells as the basis for defining peer similarity and for creating clusters. Two peers are similar if their data are distributed on the same region-based partitions of the feature space, defined as follows:

Figure 3.6: Partition cells and Clusters

Definition 1 (Region-based partition) Consider the set of partition cells $\phi = \{\varphi_1, \varphi_2, \ldots, \varphi_m\}$ of the feature space of an overlay and the power set $2^\phi$ of $\phi$. Let $\mathcal{R} \subseteq 2^\phi$ be the set of partitioning regions. A region-based partition $R \subseteq \mathcal{R}$ is a set of regions $R = \{r_i\}$ such that $r_i \in \mathcal{R}$, $\bigcup_i r_i = \phi$, and $r_i \cap r_j = \emptyset$ for all $i \neq j$.

Organizing Layers of an Overlay

Several steps are needed to organize the data and create the clusters of an overlay. The steps are shown in figure 3.7 as three layers: the data layer, the peer layer and the clustering layer.

Figure 3.7: Architecture Layers

Data Layer: The data layer represents the first step and is essentially used to organize the content of the peers in the feature space. Consider a peer $P_i$ and its data represented by a set of objects $O = \{O_{ij}\} = \{[f_{ij1}, f_{ij2}, \ldots, f_{ijk}, \ldots, f_{ijn}]\}$, where $O_{ij}$ is the $j$-th data object of peer $P_i$ and $f_{ijk}$ is the $k$-th feature value of $O_{ij}$. Each object $O_{ij}$ corresponds to one point in the feature space and is mapped to one cell. Let $\alpha_{ik}$ denote the number of objects of peer $P_i$ in the partition cell $\varphi_k$, $k = 1, 2, \ldots, m$. The content of $P_i$ is represented by a signature vector $S_i$ defined over the cells of the feature space as $S_i = [\alpha_{i1}, \alpha_{i2}, \ldots, \alpha_{ik}, \ldots, \alpha_{im}]$. The signature vector of a peer $P_i$ records the distribution of the objects of the peer over the cells and defines the density of each cell with respect to the peer. This notion of cell density will be used to map the peers of an overlay to the cells of the feature space and to create clusters. Figure 3.6b shows the distribution of data objects of the peers $P_1$, $P_2$, $P_3$ over the partition cells of figure 3.6a.

Peers Layer: The goal of this layer is to organize and map the peers in the feature space. Each peer is mapped to the set of cells to which its data are mapped in the previous layer. The mapping of peers to cells is done using a threshold value $T$ as follows: a peer is mapped to a cell only if it has at least $T$ objects in the cell. Peers that are mapped to the same cell are related by a cell-similarity defined as follows:

Definition 2 (Cell-Similarity) Two peers $P_i$ and $P_j$ are cell-similar with respect to a cell $\varphi \in \phi$ if they are both mapped to $\varphi$.

This definition can easily be extended to define the cell-similarity of two peers $P_i$, $P_j$ over a set $S \subseteq \phi$ of cells $S = \{\varphi_1, \varphi_2, \ldots, \varphi_r\}$. Two peers are cell-similar over a set $S$ if they are cell-similar with respect to all the cells in $S$. The definition of cell-similarity over a set of cells will be used later to characterize the similarity of peers that belong to a partition region or a cluster. In the example of figure 3.6, the peers $P_2$, $P_3$ are cell-similar with respect to cells $\varphi_{14}$, $\varphi_{15}$.

Clustering Layer: The goal of this layer is to create clusters represented by partitioning regions of the feature space. A cluster is composed of all the peers mapped to the cells of the partitioning region used to define the cluster. In the example of figure 3.6, the three peers $P_1$, $P_2$, $P_3$ are mapped to three clusters $C_1$, $C_2$, $C_3$. Peers $P_1$ and

$P_3$, which are both mapped to cluster $C_3$, are not cell-similar with respect to any cell of $C_3$, since the data of $P_1$ and $P_3$ are never mapped to the same cell in $C_3$. However, both $P_1$ and $P_3$ are cell-similar to $P_2$. $P_1$ and $P_3$ can be used to process range queries submitted to cluster $C_3$. In section 3.5, we discuss clustering issues, requirements and solutions. Then we present a density-based algorithm that we use in HON to create clusters.

Peer Join and Departure

In this section we describe the protocol for peer join and departure.

Peer Join

When a peer connects to the network, several steps are required for it to join one or more clusters in one or more overlays.

1. Feature Selection: the peer defines the set of features that represent its content in order to determine the set of its relevant overlays. Note that a peer can belong to one or more overlays.

2. Overlay Discovery: the peer connects to any online super peer to get information about the overlays. This information includes the identifier of the overlay, the set of features describing the overlay and their ranges of values, the number of partitions and the threshold value $T$ used for peer mapping.

3. Signature Mask Computation: the peer computes its signature mask using the feature spaces of its relevant overlays. For each overlay, the peer maps its data into the corresponding feature space using the threshold value $T$. Let $\phi = \{\varphi_1, \varphi_2, \ldots, \varphi_m\}$ be the set of cells of the feature space of an overlay. Recall that the signature of a peer $P_i$ is given by $S_i = [\alpha_{i1}, \alpha_{i2}, \ldots, \alpha_{ik}, \ldots, \alpha_{im}]$, where $\alpha_{ik}$ denotes the number of objects of peer $P_i$ in the partition cell $\varphi_k$, $k = 1, 2, \ldots, m$. A signature mask $SMP_i$ of a peer $P_i$ is given by $SMP_i = [\beta_{i1}, \beta_{i2}, \ldots, \beta_{im}]$ where:

$\beta_{ik} = \begin{cases} 1 & \text{if } \alpha_{ik} > T \\ 0 & \text{otherwise} \end{cases}$

The signature mask indicates the set of cells to which the peer is mapped.
For example, in a feature space divided into four cells $\phi = \{\varphi_1, \varphi_2, \varphi_3, \varphi_4\}$, the signature mask of a peer belonging to cells $\varphi_1$ and $\varphi_3$ is $[1, 0, 1, 0]$.

4. Send Join Notification: the peer sends a join notification to each of its relevant overlays. A join notification contains its signature mask for the corresponding overlay. In each overlay, the join notification is received by super peers. Each super peer computes the distance between the signature mask of its cluster and the signature mask of the connecting peer. The signature mask of a cluster $C_j$ is given by $SMC_j = [\gamma_{j1}, \gamma_{j2}, \ldots, \gamma_{jm}]$, where:

$\gamma_{jk} = \begin{cases} 1 & \text{if } \varphi_k \in C_j \\ 0 & \text{otherwise} \end{cases}$

The signature mask of a cluster indicates the set of its cells. For example, in a feature space divided into four cells $\phi = \{\varphi_1, \varphi_2, \varphi_3, \varphi_4\}$, the signature mask of a cluster containing cells $\varphi_1$ and $\varphi_4$ is $[1, 0, 0, 1]$. Let $SMP_i = [\beta_{i1}, \beta_{i2}, \ldots, \beta_{im}]$ be the signature mask of a peer $P_i$ and $SMC_j = [\gamma_{j1}, \gamma_{j2}, \ldots, \gamma_{jm}]$ be the signature mask of the cluster $C_j$. The distance between $SMP_i$ and $SMC_j$ is given by:

$PeerClusterDistance(SMP_i, SMC_j) = (\beta_{i1} \wedge \gamma_{j1}) \vee \ldots \vee (\beta_{im} \wedge \gamma_{jm})$

According to this formula, the distance between a peer and a cluster takes one of two values, 0 or 1. It is equal to 1 if the peer belongs to at least one cell managed by the cluster. Otherwise, it is equal to 0.

5. Send Join Accept: super peers that have a distance equal to 1 between their clusters and the connecting peer send back an accept join notification with their IP addresses.

6. Connection: when the connecting peer receives the accept join from its relevant super peers, it sends a connection notification. Then, super peers update their index tables.

Peer Departure

We distinguish two cases in this scenario: the departure of simple peers and the departure of super peers.

1. Simple Peers Disconnection: when a simple peer disconnects from the network, it simply sends a notification to its super peers. These super peers remove the relevant information of the disconnecting peer from their index tables.

2. Super Peers Disconnection: we assume that each super peer duplicates its content in some other peers of its cluster, called mirror peers. Thus, when a super peer disconnects from the network, it sends a notification to one of its mirror peers. The mirror peer takes over the clusters of the disconnected super peer and sends a notification to all the peers they contain so that they update their index tables with the information of the new super peer.

3.5 Clustering Issues in HON

The goal of the clustering design is to place peers with similar data objects in neighboring cells in the feature space of an overlay, that is, the peers of the entire overlay are partitioned into clusters of cell-similar peers. Therefore, peers that are near any query point are directly and efficiently retrieved. In this section we discuss the requirements of clustering in HON and propose two different alternatives.

Requirements

The design of clustering in P2P systems, and particularly in HON, must take into account several parameters. First, query routing and processing must be carried out in a distributed way to avoid single points of failure and overloading. Second, peers holding similar data must be grouped in such a way that the number of results returned for each query is maximized without exhausting a high number of resources. Third, changes in the content and the dynamic behavior of peers must be taken into account. In a P2P network, the data of a peer may change continuously. This fact must be considered and used to update clusters. The following characteristics of the feature space must be considered in the definition of clusters.

1. Cell-Similarity: the data that are mapped to a cell are similar. Likewise, two peers are cell-similar with respect to a given cell if they are both mapped to that cell.

2. Cell-Adjacency: two cells are adjacent if they share a $(d - 1)$-dimensional hyperplane, where $d$ represents the dimension of the feature space.
The data objects most similar to the content of a cell are the data objects of its adjacent cells.

3. Cell-Density: it represents the number of data objects in a cell. It is used with a threshold to map peers to cells. As stated above, a peer is mapped to a cell if it contains a sufficient number of data objects that are mapped to the cell.
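The way cell density drives the mapping of peers to cells can be sketched in a few lines. The following is only an illustration under stated assumptions: cells are numbered from 0 in row-major order (the thesis numbers them from 1 and does not fix an ordering), the function names are hypothetical, and the threshold test uses "at least T objects" as in the description of the peers layer.

```python
from collections import Counter
from math import prod


def cell_index(obj, ranges, parts):
    """Map a feature vector to a cell number in 0..m-1 (row-major order, an assumption)."""
    index = 0
    for value, (lo, hi), m_i in zip(obj, ranges, parts):
        width = (hi - lo) / m_i                      # equal-size intervals per feature
        k = min(int((value - lo) / width), m_i - 1)  # a value equal to hi falls in the last interval
        index = index * m_i + k
    return index


def signature(objects, ranges, parts):
    """Signature vector: number of the peer's objects in each of the m cells."""
    counts = Counter(cell_index(o, ranges, parts) for o in objects)
    return [counts.get(k, 0) for k in range(prod(parts))]


def signature_mask(sig, threshold):
    """1 for every cell holding at least `threshold` objects of the peer, else 0."""
    return [1 if alpha >= threshold else 0 for alpha in sig]


def peer_cluster_distance(peer_mask, cluster_mask):
    """1 if the peer is mapped to at least one cell of the cluster, else 0."""
    return int(any(b and g for b, g in zip(peer_mask, cluster_mask)))


# Two features ranging from 0 to 10, each split into 4 intervals (16 cells).
ranges, parts = [(0, 10), (0, 10)], [4, 4]
objects = [(1, 1), (1, 1.5), (9, 9), (10, 10), (6, 2)]
mask = signature_mask(signature(objects, ranges, parts), threshold=2)
```

With these illustrative objects the peer is mapped to the first and last cells only; the cell holding a single object falls below the threshold and is ignored.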

In the next two sections we present the two types of clustering techniques investigated in this thesis.

Affinity-based Clustering

The objective of affinity-based clustering is to group cells so that queries access only a few groups. In this context, an optimal clustering scheme minimizes the inter-cluster communication required to send queries from one cluster to another. Affinity-based clustering is based on the vertical fragmentation technique used in database systems to minimize page access and execution time of user applications [91]. The underlying idea of affinity-based vertical fragmentation is to place attributes with a high affinity value in the same fragment. The affinity, or bond, between two attributes represents the total number of times the two attributes are accessed together. We extend the idea of attribute affinity to cell affinity and use it to define a clustering technique which consists of two steps:

1. Cell-affinity computation: since affinity-based clustering places in one cluster the cells usually required together, the first step consists in defining a measure of cell affinity that describes more precisely the notion of togetherness of two cells. As in the vertical fragmentation mentioned above, the measure of cell affinity represents the total number of times that two or more cells are accessed together by a query.

2. Grouping cells: the fundamental task of this step is to cluster the cells with larger affinity values together and the ones with smaller values together.

Cells Affinity Computation

The affinity of cells indicates how closely related the cells are. This measure can be related to several parameters. In this section we propose two methods for computing cell affinity. The first method is based on the access frequencies of queries submitted by peers, and the second one additionally introduces other parameters such as the number of peers participating in the communication between cells.
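As a rough illustration of the first step, co-access counts can be accumulated from a log that records, for each query, the set of cells it accessed. The log format and the function name below are assumptions made for this sketch; only pairs of distinct cells are counted.

```python
from collections import defaultdict
from itertools import combinations


def cell_affinity(query_log):
    """Count, for every pair of cells, how many queries accessed both.

    query_log: iterable of sets of cell identifiers, one set per query
    (a hypothetical format chosen for this sketch).
    """
    affinity = defaultdict(int)
    for cells in query_log:
        for a, b in combinations(sorted(cells), 2):
            affinity[(a, b)] += 1
    return dict(affinity)


# Four illustrative queries over cells 3 and 4, plus one touching cells 1 and 3.
log = [{3, 4}, {3}, {4}, {3, 4}, {1, 3}]
affinity = cell_affinity(log)
```

Here cells 3 and 4 are accessed together by two queries, so their affinity is 2, while cells 1 and 3 have an affinity of 1.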
In the following we present the two approaches used to define the affinity between cells in the feature space. The first method uses the access frequencies of queries to define the affinity between cells. Let $Q = \{q_1, q_2, \ldots, q_q\}$ be a set of queries on data from cells $\phi = \{\varphi_1, \varphi_2, \ldots, \varphi_m\}$. Each of those cells can be a target cell for a given query, defined as follows:

Figure 3.8: Example 1: Cells Usage and Affinity Matrix

Definition 3 (Target Cell) Let $\varphi$ be a cell described by its content $C_\varphi = \{O_1, O_2, \ldots, O_k\}$, where $O_i$ is a data object. A cell $\varphi$ is a target cell of a query $q$ (denoted by $q \rightarrow \varphi$) if $\exists\, O_i \in C_\varphi$ such that $O_i$ is an answer to the query $q$.

For each query $q_i \in Q$ and each cell $\varphi_j \in \phi$, we associate a cell usage value defined as follows:

Definition 4 (Cell Usage Value) (a) Let $Q$ be a set of queries, and $q_i$ a query of $Q$. The cell usage value of $q_i$ for a given cell $\varphi_j$ is defined as:

$use(q_i, \varphi_j) = \begin{cases} 1 & \text{if } q_i \rightarrow \varphi_j \wedge \varphi_j \text{ not empty} \\ 0 & \text{otherwise} \end{cases}$

$use(q_i, \varphi_j)$ is equal to 1 if the cell $\varphi_j$ is a target cell of the query $q_i$ and it is not an empty cell.

(b) The set $C(q_i)$ of the cells accessed by the query $q_i$ is defined by $C(q_i) = \{\varphi_j \mid use(q_i, \varphi_j) = 1\}$.

The usage values are represented in matrix form as shown in figure 3.8a, where the entry $(i, j)$ denotes $use(q_i, \varphi_j)$. Cell usage values are not sufficiently general to form the basis of cell clustering, because they do not represent the weight of query frequencies. The frequency measure can be included in the definition of the cell affinity measure $a(\varphi_i, \varphi_j)$, which measures the bond between two cells of a feature space based on how they are accessed together by different queries. More precisely, the affinity measure of two cells records the total number of queries that access them together. Formally, the affinity measure is defined as follows:

Definition 5 (Cell Affinity Value) The cell affinity measure between two cells $\varphi_i$ and $\varphi_j$ belonging to a feature space composed of the set of cells $\phi = \{\varphi_1, \varphi_2, \ldots, \varphi_m\}$ with respect to the set of queries $Q = \{q_1, q_2, \ldots, q_q\}$ is:

$a(\varphi_i, \varphi_j) = \sum_{k=1}^{q} use(q_k, \varphi_i) \cdot use(q_k, \varphi_j)$

The result of this computation is an $m \times m$ matrix, called the Cell Affinity matrix (CA), where each element is defined as above. From the example shown in figure 3.8a, the affinity measure between the cells $\varphi_3$ and $\varphi_4$ is computed as:

$a(\varphi_3, \varphi_4) = \sum_{k=1}^{4} use(q_k, \varphi_3) \cdot use(q_k, \varphi_4)$
$= use(q_1, \varphi_3) \cdot use(q_1, \varphi_4) + use(q_2, \varphi_3) \cdot use(q_2, \varphi_4) + use(q_3, \varphi_3) \cdot use(q_3, \varphi_4) + use(q_4, \varphi_3) \cdot use(q_4, \varphi_4)$
$= (1 \cdot 1) + (1 \cdot 0) + (0 \cdot 1) + (1 \cdot 1) = 1 + 0 + 0 + 1 = 2$

The complete cell affinity matrix of the example in figure 3.8a is shown in figure 3.8b. Note that, for completeness, the diagonal values are also computed even though they are meaningless.

The second method creates groups of queries with similar characteristics and computes the cell affinity with respect to the groups. Since the number of queries in a P2P network can be very high, this method reduces the size of the usage matrix and decreases the execution time of the algorithm. We assume that a query is characterized

by the set of cells to which it is mapped. Let $G = \{g_1, g_2, \ldots, g_g\}$ be the set of query groups. Each group $g_i$ represents the queries $\{q_1, q_2, \ldots, q_l\}$ that have the same target cells. For each group of queries $g_i$ and each cell $\varphi_j$, we define a cell usage value as follows:

Figure 3.9: Queries and peers number

Definition 6 (Cell Usage Value) (a) A cell usage value, denoted $use(g_i, \varphi_j)$, is given by:

$use(g_i, \varphi_j) = \begin{cases} 1 & \text{if } g_i \rightarrow \varphi_j \wedge \varphi_j \text{ not empty} \\ 0 & \text{otherwise} \end{cases}$

$use(g_i, \varphi_j)$ is equal to 1 if $\varphi_j$ is a target cell of all the queries composing the group $g_i$ and it is not an empty cell.

(b) The cells accessed by a query group $g_i = \{q_{i1}, \ldots, q_{il}\}$ are denoted $C(g_i)$ and defined by $C(g_i) = \bigcup_{q_{ik} \in g_i} C(q_{ik})$, where $C(q_{ik})$ represents the cells accessed by $q_{ik}$.

As before, the usage values are represented by a matrix as shown in figure 3.10a, where entry $(i, j)$ denotes $use(g_i, \varphi_j)$. Two parameters are used to define the cell affinity measure: (1) the number of queries constituting each group and (2) the number of peers that send queries of each group.

We denote by $acc(g_i)$ the number of queries in a group $g_i$, which represents the access frequency of the group $g_i$ to its target cells. Figure 3.9a gives an example of two query groups $g_1$ and $g_2$ mapped to the sets $\{\varphi_1, \varphi_2\}$ and $\{\varphi_3, \varphi_4\}$ respectively. Since $acc(g_1) = 4$ and $acc(g_2) = 2$, the affinity between the cells $\varphi_1$ and $\varphi_2$ is stronger than the affinity between $\varphi_3$ and $\varphi_4$.

The number of peers in a group $g_i$ also affects the affinity measure. Let $np(g_i)$ be the number of peers that execute one or more queries from the group $g_i$, and $C(g_i)$ the target cells of $g_i$. The higher $np(g_i)$, the stronger the affinity among the cells of $C(g_i)$. As shown in figure 3.9b, $np(g_1) = 3$ and $np(g_2) = 1$; thus, the affinity between the cells $\varphi_1$ and $\varphi_2$ is stronger than between $\varphi_3$ and $\varphi_4$. The affinity between cells takes into account both the access frequencies of queries and the number of peers that execute the queries.

Figure 3.10: Example 2: Cells Usage and Affinity Matrix

Definition 7 (Cell Affinity Value) The cell affinity measure between two cells $\varphi_i$ and $\varphi_j$ of a feature space composed of a set of cells $\phi = \{\varphi_1, \varphi_2, \ldots, \varphi_m\}$ with respect to the set of query groups $G = \{g_1, g_2, \ldots, g_g\}$ is defined by:

$a(\varphi_i, \varphi_j) = \sum_{k=1}^{g} acc(g_k) \cdot np(g_k) \cdot use(g_k, \varphi_i) \cdot use(g_k, \varphi_j)$

The result of this computation is an $m \times m$ matrix, called the Cell Affinity matrix (CA), where each element is defined as above. Consider the cell usage matrix of figure 3.10a and assume the following values: $acc(g_1) = 4$, $acc(g_2) = 2$, $acc(g_3) = 3$, $acc(g_4) = 1$, $np(g_1) = 3$, $np(g_2) = 1$, $np(g_3) = 5$, $np(g_4) = 3$. The affinity measure between the cells $\varphi_3$ and $\varphi_4$ is computed over the groups that access both $\varphi_3$ and $\varphi_4$, namely $g_1$ and $g_4$. The resulting affinity value $a(\varphi_3, \varphi_4)$ is given by:

$a(\varphi_3, \varphi_4) = acc(g_1) \cdot np(g_1) + acc(g_4) \cdot np(g_4) = 4 \cdot 3 + 1 \cdot 3 = 12 + 3 = 15$

Grouping cells

This step uses the affinity measures defined above to group one or more cells of the feature space. A Bond Energy Algorithm can be used: it takes the cell affinity matrix as input, permutes its rows and columns, and generates a grouped affinity matrix that defines the peer clusters. More details about this algorithm can be found in [91].

Affinity-based clustering suffers from several limitations which restrict its use in distributed P2P environments. First, it requires continuous monitoring of query traces to maintain coherent information about cell affinities. The affinity between cells depends on several parameters, mainly on the type of data and the distribution of queries. Moreover, this algorithm was suggested for centralized databases, where a global view of the details of communication between entities is easily available. Due to the complexity of the computation and the strongly centralized nature of the algorithm, it was not investigated further in this thesis. The implementation of the simulation prototype of the HON architecture (presented in detail in chapter 6) is based on the density-based clustering algorithm described in the next section.

Density-based Clustering

The second clustering technique investigated in this thesis is a density-based clustering method, which extends the CLindex clustering method to P2P environments. Below we first present the main steps of the CLindex algorithm. Next, we describe the extended variant of CLindex used in the HON approach.

CLindex Approach

CLindex is a data clustering algorithm [69] originally designed for centralized databases. CLindex aims (1) to cluster data to minimize disk access time for retrieving similar

objects, and (2) to build an index on the clusters to minimize localization cost. The CLindex approach makes the same assumptions that we have made in HON. It uses an n-dimensional feature space divided into cells where the data are organized according to their description. The idea of CLindex is to group the cells into clusters based on their densities using an algorithm called CF (Cluster-Forming). To illustrate the procedure, figure 3.11 shows a 2-dimensional feature space divided into cells, where the number in each cell represents its density. The CF algorithm works in the following way:

1. Compute the density of the cells.

2. Select the unmarked cells that have the highest density (for example, in figure 3.11, CF selects all cells that have a density equal to 7). A selected cell can be in one of three conditions:
   (a) If it is not adjacent to a cluster, it forms a new cluster.
   (b) If it is adjacent to one cluster, it joins the adjacent cluster.
   (c) If it is adjacent to more than one cluster, the CF algorithm invokes the Cliff-Cutting algorithm (CC) to determine to which cluster the cell belongs, or whether the clusters should be combined.

3. Mark the processed cells.

4. Repeat the process until all the cells with densities greater than a threshold called the horizon are marked. The cells that do not belong to any cluster, i.e., that are below the horizon, are grouped into an outlier cluster.

When more than one cluster is adjacent to a cell, the Cliff-Cutting procedure decides to which cluster the cell belongs and, if necessary, merges adjacent clusters. Many heuristics have been studied in [69]. One heuristic is to place the selected cell in the cluster that has the minimum number of objects, so as to balance the cluster sizes. Some heuristics use the shape of the clusters to decide where to place the selected cells.
Note that a snake-like shape is bad for similarity search, since the distance between two objects within a cluster can be too large. The best shape in this case is a regular round region. The centroid of each cluster can be computed, and the cluster whose centroid is closest to the cell is selected. Another heuristic is to replicate popular boundary cells in more than one cluster. Which heuristics are better may depend on the dataset and the recall requirement of the application.
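The cluster-forming procedure described above can be sketched as follows. This is a simplified illustration and not the CLindex implementation: cells are identified by tuples of integer interval indices, cells of equal density are processed one at a time instead of as a batch, and a cell adjacent to several clusters always joins the one holding the fewest objects (one of the heuristics mentioned above), without merging clusters. All names are hypothetical.

```python
def is_adjacent(a, b):
    """Grid cells share a face if they differ by one interval in exactly one feature."""
    return sum(abs(x - y) for x, y in zip(a, b)) == 1


def cluster_forming(cell_density, horizon):
    """Group cells into clusters by decreasing density; return (clusters, outliers)."""
    clusters = []      # each cluster is a set of cells
    outliers = set()   # cells at or below the horizon
    for cell in sorted(cell_density, key=lambda c: (-cell_density[c], c)):
        if cell_density[cell] <= horizon:
            outliers.add(cell)
            continue
        adjacent = [c for c in clusters if any(is_adjacent(cell, o) for o in c)]
        if not adjacent:
            clusters.append({cell})    # case (a): start a new cluster
        else:
            # cases (b) and (c): join the adjacent cluster with the fewest objects
            target = min(adjacent, key=lambda c: sum(cell_density[x] for x in c))
            target.add(cell)
    return clusters, outliers


densities = {(0, 0): 7, (0, 1): 5, (3, 3): 7, (3, 2): 4, (1, 1): 1, (2, 0): 0}
clusters, outliers = cluster_forming(densities, horizon=2)
```

In this small example the two densest cells seed two clusters, their neighbors join them, and the two sparse cells end up in the outlier set.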

Figure 3.11: Cluster Forming Algorithm

Distributed Density Clustering in HON

The CLindex approach described above relies on a centralized entity, which may restrict its use in a distributed environment. It requires a central component for collecting information and computing the densities of the cells. In our design, several modifications to the original algorithm are required to remove the centralized characteristics without degrading the performance of the resulting architecture. First, an active peer is associated with each non-empty cell to maintain cell state information and to compute the cell density. The active peer of a cell also maintains links to the active peers of adjacent cells. Moreover, super peers are associated with the created clusters, in the same way as active peers at the cell level. The super peers maintain state information on the clusters, carry out various computations such as query processing tasks, and update the routing tables that are required to communicate with other clusters in an overlay. In this way, density information is distributed among active peers and no centralized control is needed to run the clustering algorithm. Before describing the algorithm, we give formal definitions of adjacency and density of cells and clusters.

Definition 8 (Adjacency) (a) Let a cell $\varphi_i$ be represented by $[v_{1i}, \ldots, v_{ni}]$, where $v_{ki}$ is the range value of the feature $k$ in the cell $i$. Two cells $\varphi_i$ and $\varphi_j$ are adjacent if $\exists\, k$ such that the intervals $v_{ki}$ and $v_{kj}$ are contiguous (with $v_{ki} \neq v_{kj}$) and $\forall\, l \neq k$, $v_{li} = v_{lj}$. This means that the two cells are adjacent if they share an $(n - 1)$-dimensional hyperplane, where $n$ is the number of features.
(b) Two clusters $C_1$ and $C_2$ are adjacent if $\exists\, \varphi_i \in C_1$ and $\exists\, \varphi_j \in C_2$ such that $\varphi_i$ is adjacent to $\varphi_j$.
(c) A cell $\varphi_i$ is adjacent to a cluster $C$ if $\exists\, \varphi_j \in C$ such that $\varphi_i$ is adjacent to $\varphi_j$.

Figure 3.12: Example 1: density-based algorithm

Definition 9 (Density) (a) The density of a cell represents the number of its data objects. We denote by $D_{\varphi_i}$ the density of the cell $\varphi_i$.
(b) Similarly, the density of a cluster is the number of the data objects of the peers it contains. We denote by $D_{C_j}$ the density of the cluster $C_j$.
(c) The size of a cluster is equal to its density. We define MaxSize as the maximum size allowed for a cluster. This value is predefined and used in the clustering algorithm.

Initially, when peers progressively join the network, the densities of the cells are low. Afterwards, as the number of peers and data objects increases, the densities of the cells increase. The cells are clustered using a distributed Density-Based (DB) algorithm. The input of the DB algorithm is the set of active peers. The output is a set of clusters. For each active peer $p_i$, the cell description is used to determine when the active peer joins or initiates a cluster, based on whether or not the density of its cell reaches a predefined threshold value. Let $AC_{\varphi_i}$ be the set of clusters adjacent to the cell $\varphi_i$. Different cases can be considered:

Algorithm 1 Density-Based (DB)
Input: AP = {p_1, ..., p_m}   {Active peers of cells}
Output: CS   {A set of clusters}
for each p_i ∈ AP do
    ϕ ← Cell(p_i)   {Cell of the active peer p_i}
    D_ϕ ← |ϕ|   {Compute the density of ϕ}
    if D_ϕ > DensityThreshold then
        AC ← FindAdjacentClusters(ϕ, CS)   {Adjacent clusters to ϕ}
        if AC = ∅ then
            C ← CreateNewCluster(ϕ, p_i)
            CS ← CS ∪ {C}   {Add the cluster C to CS}
        else
            C_l ← ExtractLowestDenseCluster(AC)
            C ← C_l ∪ {ϕ}   {ϕ joins the cluster C_l}
            D_c ← Density(C)
            if D_c ≥ MaxSize then
                C_1, C_2 ← Split(C)   {Split C into C_1 and C_2}
                p_l ← SuperPeer(C_l)   {Extract the super peer of C_l}
                Assign(p_l, C_1)   {Assign the super peer p_l to C_1}
                Assign(p_i, C_2)   {Assign p_i as a super peer to C_2}
                CS ← (CS − {C}) ∪ {C_1, C_2}
            else
                for each C_i ∈ AC − {C_l} do
                    D_i ← Density(C_i)   {Compute the density of the cluster C_i}
                    if (D_i + D_c) < MaxSize then
                        C_m ← Merge(C_i, C)   {Merge C_i and C}
                        p_m ← ChooseSuperPeer(C_i, C)
                        Assign(p_m, C_m)   {Assign p_m to C_m}
                        CS ← (CS − {C_i}) − {C}   {Remove C_i and C from CS}
                        CS ← CS ∪ {C_m}   {Add the cluster C_m to CS}
                    end if
                end for
            end if
        end if
    end if
end for
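One iteration of the DB algorithm can be illustrated with the sketch below. It is written under several assumptions that are not taken from the thesis: clusters are plain sets of grid cells held in a list, super-peer assignment and notifications are left out, the split rule is a simple greedy two-way balance of densities that ignores contiguity, and the merge test uses "lower than MaxSize" as in the textual description of the merge case. All function names are hypothetical.

```python
def adjacent(a, b):
    """Grid cells share a face if they differ by one interval in exactly one feature."""
    return sum(abs(x - y) for x, y in zip(a, b)) == 1


def density(cluster, cell_density):
    return sum(cell_density[c] for c in cluster)


def split_balanced(cluster, cell_density):
    """Assumed split rule: greedy two-way partition by density (contiguity ignored)."""
    halves, loads = (set(), set()), [0, 0]
    for cell in sorted(cluster, key=lambda c: (-cell_density[c], c)):
        k = 0 if loads[0] <= loads[1] else 1
        halves[k].add(cell)
        loads[k] += cell_density[cell]
    return halves


def db_step(cell, clusters, cell_density, density_threshold, max_size):
    """Process the cell of one active peer; `clusters` (list of sets) is updated in place."""
    if cell_density[cell] <= density_threshold:
        return "skipped"
    adj = [c for c in clusters if any(adjacent(cell, o) for o in c)]
    if not adj:                                   # case 1: no adjacent cluster
        clusters.append({cell})
        return "created"
    target = min(adj, key=lambda c: density(c, cell_density))
    target.add(cell)                              # case 2: join the least dense cluster
    if density(target, cell_density) >= max_size:
        clusters.remove(target)                   # case 2.1: split
        clusters.extend(split_balanced(target, cell_density))
        return "split"
    merged = False
    for other in adj:                             # case 2.2: merge while size allows
        if other is target:
            continue
        if density(other, cell_density) + density(target, cell_density) < max_size:
            clusters.remove(other)
            target |= other
            merged = True
    return "merged" if merged else "joined"


cell_density = {(0, 0): 2, (0, 2): 3, (0, 1): 1, (5, 5): 4}
clusters = [{(0, 0)}, {(0, 2)}]
first = db_step((5, 5), clusters, cell_density, density_threshold=0, max_size=10)
second = db_step((0, 1), clusters, cell_density, density_threshold=0, max_size=10)
```

In this run the isolated cell creates its own cluster, while the cell lying between the two existing clusters joins the less dense one and then triggers a merge, because the combined density stays below the maximum size.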

Figure 3.13: Example 2: density-based algorithm

Case 1: If $AC_{\varphi_i} = \emptyset$, that is, the cell $\varphi_i$ is not adjacent to any existing cluster, then the active peer $p_i$ creates a new cluster $C$ containing the cell $\varphi_i$ and becomes the super peer of the new cluster $C$. It sends a cluster notification to the other super peers so that they update their index tables. Figure 3.12 shows an example where the cell A is not adjacent to any existing cluster; therefore, a new cluster $C_3$ is created.

Case 2: If $AC_{\varphi_i} \neq \emptyset$, that is, the cell $\varphi_i$ is adjacent to at least one cluster, then the cluster $C_l$ with the lowest density among all the clusters adjacent to $\varphi_i$ is selected, and the cell $\varphi_i$ joins the cluster $C_l$, resulting in a cluster $C$ given by $C = C_l \cup \varphi_i$. Figure 3.13 illustrates an example where cell A is adjacent to one cluster. The first decision of the algorithm is to join cell A to cluster $C_1$. Then, all peers of the cell connect to the super peer of the cluster $C_1$. The next step is to compute the density $D_c$ of the cluster $C$. Two cases can be distinguished:

Figure 3.14: Example 3: density-based algorithm

1. Case 2.1: If $D_c$ reaches or exceeds MaxSize, the cluster $C$ is divided into two clusters $C_1$ and $C_2$ of balanced densities. Recall that $C = C_l \cup \varphi_i$. Thus, the super peer $p_l$ is assigned to cluster $C_1$ and the active peer $p_i$ of the cell $\varphi_i$ becomes the super peer of cluster $C_2$. In figure 3.13, we can see that the second decision of the algorithm is to split the cluster $C_1$ into two clusters $C_{1.1}$ and $C_{1.2}$.

2. Case 2.2: If $D_c$ is lower than MaxSize, a cluster merging process is started. The cell $\varphi_i$ belongs to cluster $C$ and may be adjacent to other clusters. Thus, as long as the MaxSize value is respected, cluster $C$ can be merged with one or more clusters that are adjacent to $\varphi_i$. We analyze the set $AC_{\varphi_i}$ and, for each cluster $C_i \in AC_{\varphi_i}$, we compute the sum of the densities $S = Density(C_i) + Density(C)$. If $S$ is lower than MaxSize, then cluster $C$ is merged with the cluster $C_i$. Then one of the super peers of the merged clusters $C$ and $C_i$ is selected to be the super peer of the resulting cluster. Figure 3.14 shows an example where cell A is adjacent to two clusters $C_1$ and $C_2$. Since the total of the cluster densities is lower than MaxSize (6 in this example), the two clusters are merged into one cluster $C$.

3.6 Caching Issues in HON

Information search in HON is based on cells, which are used to measure the similarity between queries, peers and data objects. Each query falls into a set of target cells where the relevant answers reside. When the query reaches the target cells, it is flooded to all their peers to retrieve the required data objects. The flooding within target cells might penalize search performance if a cell is large and contains a high number of peers. In that case, a large portion of irrelevant peers is involved in query processing, which increases the search cost. To limit the flooding overhead inside cells, we propose a caching mechanism that stores query results for future use. In chapter 5, we study the caching scheme that we use to improve search performance in HON [58]. We introduce new definitions for cache admission policies that take into account the dynamicity of P2P networks.

The cache in HON stores the descriptions of queries and the peers that answered each query. The main idea is to assign a cache to one peer in each non-empty cell. Placing a cache in each cell helps to collect all similar queries in the same cache. Two queries that are mapped to the same cell are considered similar. Therefore, they share common peers that are supposed to hold the relevant answers. In this way, similar queries are always sent to and stored in the same cache, which helps increase the cache hit rate and avoids redundancy.

In addition, we have been interested in the problem of cache replacement policies in P2P networks. The main issue is that data should be kept in the cache only for a short period of time because of network dynamicity. Therefore, we have investigated two replacement policies. The first replacement technique we use in HON is LRU (Least Recently Used), which removes the data that has not been used for the longest period of time. The second policy is NFU (Not Frequently Used), which removes the least popular data in the cache. To finalize our caching approach, we have carried out experiments, presented in chapter 6, to test its performance.
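A per-cell cache with LRU replacement can be sketched as follows. This is a minimal illustration rather than the cache studied in chapter 5: the class and method names are hypothetical, query descriptions are assumed to be hashable values such as strings or tuples, and admission policies are not modeled.

```python
from collections import OrderedDict


class CellQueryCache:
    """Cache held by one peer of a cell: query description -> peers that answered it."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()   # oldest (least recently used) entry first

    def lookup(self, query):
        """Return the cached peers for a query, or None on a miss."""
        peers = self._entries.get(query)
        if peers is not None:
            self._entries.move_to_end(query)    # mark as most recently used
        return peers

    def store(self, query, peers):
        """Record the peers that answered a query, evicting the LRU entry if full."""
        self._entries[query] = set(peers)
        self._entries.move_to_end(query)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)   # drop the least recently used entry


cache = CellQueryCache(capacity=2)
cache.store("q1", {"peer-a"})
cache.store("q2", {"peer-b"})
cache.lookup("q1")                  # q1 becomes the most recently used entry
cache.store("q3", {"peer-c"})       # evicts q2, the least recently used entry
```

On a hit, the query only needs to be forwarded to the cached peers instead of being flooded to every peer of the cell. An NFU variant would keep a use counter per entry and evict the entry with the smallest count.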
3.7 Performance Evaluation

We have developed a prototype in Java to evaluate the performance of HON. The prototype is composed of three layers: a physical layer, which contains the components and the parameters of the system; a routing layer, made up of a set of communication protocols; and an application layer, which allows the simulation of events in the system such as peer join and departure. In our experiments, we have evaluated four aspects. First, the search performance of HON, using several metrics such as success rate, recall and query scope. Second, scalability, which represents the cost of building and updating HON. Third, caching performance, by studying different replacement policies and analyzing the impact of the cache in reducing the query scope. Fourth, the tolerance of HON to peers' failures.

We have used several parameters to evaluate HON, such as the threshold value used to map peers to cells, cell granularity and data distribution. We have shown, through an extensive performance evaluation of similarity search quality, that HON achieves a high success rate and recall [57, 56]. Moreover, the use of caching helps to reduce the query scope, which avoids consuming a high number of resources when searching for information. We have shown through our simulation results that HON scales to large network sizes and numbers of data objects and adapts to dynamic membership with low maintenance overheads [55]. Furthermore, it efficiently routes queries along the best available paths, which makes it resilient to peers' failures [55]. The results of our experiments are detailed in chapter 6.

3.8 Conclusion

We have presented a Hybrid Overlay Network for organizing data and clustering peers using an n-dimensional feature space to perform similarity search. We have presented the architecture of HON, which consists of multiple overlays separating peers that are described by common features. Each overlay is composed of a set of peer clusters. Each cluster is managed by one super peer. Clusters of peers are based on the distribution of the content of peers over cells. Two alternatives have been studied in this section. The first is an affinity-based clustering that groups peers belonging to cells which are usually required together. The second is a density-based clustering, which clusters peers belonging to adjacent cells to perform similarity search. In addition, the density-based algorithm takes advantage of highly dense cells to improve the recall.

3.9 Bibliographic Remarks

Several research studies discuss clustering in P2P networks. Some clustering approaches are based on network related information. Krishnamurthy et al. [66] define a P2P architecture called CAP that groups peers according to their IP address prefixes. Bestavros et al.
[13] use DNS (Domain Name System) information to group web clients served by the same DNS server. Other approaches use peer characteristics for clustering [53, 74, 5]. The JXTA Search framework [53] allows clustering by both geographic and property similarities. Agrawal et al. [5] use a distance metric based on network latency for clustering. Loser et al. [74] use static properties of peers such as the IP domain, or dynamic properties such as the number of resources of a peer. Cohen et al. [25] exploit associations inherent in human

selections to steer the search process to peers supposed to have an answer to the query. Another method for creating clusters is to consider the functionalities or properties of applications. Wang et al. [117] propose a P2P architecture called Friends Troubleshooting P2P Network (FTN) for resolving machine configuration problems. Marti et al. [80] propose a friend network of peers for dealing with security related attacks.

Several clustering techniques exploit similarities among the documents of peers. Crespo et al. [28] propose Semantic Overlay Networks (SON), which use classification hierarchies to cluster peers. Physical features describing the content of documents have been considered by Hang et al. [87, 104] to create clusters of peers sharing similar images. Semantic proximity between peers has been addressed by several approaches using high dimensional techniques. Tang et al. [97] propose pSearch, which is based on the CAN system and uses Latent Semantic Indexing (LSI) to generate high dimensional semantic vectors for each document and query. Ganesan et al. [39] focus on studying data locality properties and routing costs in a high dimensional data organizing space. Li et al. [70] have proposed a Semantic Small World (SSW) approach to facilitate efficient semantic-based search in P2P networks using a high dimensional space describing peers' content.

The heterogeneity between data and schema has been the focus of several studies. Nejdl et al. [86, 85] have addressed clustering strategies for peers annotated using RDF Schema to provide support for inhomogeneous schema-based networks. The Piazza project [112] proposes a peer data management system (PDMS) that creates mappings between semantically similar peers. Other clustering approaches are based on relationships between peer requests to build semantic links between peers. Sripanidkulchai et al. [108] propose a technique that creates shortcuts between similar peers.
Tempich et al. [113] investigate social metaphors for routing queries to relevant peers in the network. Handurukande et al. [43] investigate real query traces collected in the eDonkey 2000 P2P network using different strategies that exploit semantic proximity between peers. Cohen et al. [25] exploit associations inherent in human selections to steer the search process to peers supposed to have an answer to the query. Agrawal et al. [5] present and formulate the P2P clustering problem, studying and comparing several clustering algorithms.

Chapter 4 Feature Space Dimensionality

Contents
4.1 Introduction
4.2 The Curse of Dimensionality
4.3 Challenges of High Dimensional Clustering
4.4 Dimensionality Reduction Techniques
Feature Construction
Feature Selection
Feature Selection Process
Feature Selection Algorithms
Discussion
Dimensionality Reduction in HON
Weighted Feature Selection
Overlays Creation
Overlays Architecture
Evaluation Results
Conclusion
Bibliographic Remarks

Data objects in HON are represented as points in a feature space. The dimensionality of a feature space is defined by the number of its features. When the dimensionality increases, the data objects become sparse and the distance measures become increasingly meaningless, which leads to serious problems affecting HON. This chapter studies the impact of high dimensionality on clustering performance and on the quality of query results. First, we describe the problem of high dimensionality and its impact on HON. In section 2, we give an overview of the curse of dimensionality. Section 3 presents the challenges of clustering high dimensional data. Section 4 presents two main approaches to dimensionality reduction: Feature Construction and Feature Selection. Section 5 presents the feature selection technique that we have adopted to reduce the dimensionality in HON. Section 6 describes how to build multiple overlays with a limited number of features. Section 7 presents some evaluation results. Finally, section 8 concludes this chapter.

4.1 Introduction

In recent years, content-based retrieval of high dimensional data has been a challenging problem in several fields such as data analysis and data mining. A number of applications, including multimedia and text retrieval, require the use of high dimensional methods to provide capabilities for finding data objects which are similar by content. These methods generally work well for low dimensionality problems; however, they perform poorly with increasing dimensionality. The problems of high dimensional data are described by the curse of dimensionality. This phrase was coined by Richard Bellman [12] and refers to the impossibility of optimizing a function of many variables by a brute force search on a discrete multidimensional grid. Problems with high dimensionality result from the fact that a fixed number of data points become sparse as the dimensionality increases.
Therefore, the multidimensional space becomes mostly empty, which makes it impossible to determine the distribution of data since too few data points are available. This problem is known as the empty space phenomenon and is related to density estimation problems in learning techniques. For clustering purposes, the most relevant aspect of the curse of dimensionality is the impact of increasing dimensionality on distance and similarity. This problem is related to the Query Instability phenomenon and the Nearest Neighbor Problem. As the dimensionality increases, distances between points become relatively uniform, which leads to a poor discrimination between the nearest and farthest points with respect to a query point. Therefore, the notion of nearest neighbor of a data point becomes meaningless.
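The loss of contrast between the nearest and the farthest neighbor can be observed with a small simulation. The sketch below is only an illustration under stated assumptions: points are drawn uniformly in the unit hypercube, the Euclidean distance is used, and the function name `relative_contrast`, the sample size and the seed are arbitrary choices that do not come from the thesis.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (D_max - D_min) / D_min for one random query in [0, 1]^dim."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(point, query))  # Euclidean distance
    d_min, d_max = min(distances), max(distances)
    return (d_max - d_min) / d_min


# The relative contrast shrinks as the number of dimensions grows.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

In low dimensions the farthest point is many times farther from the query than the nearest one, whereas in high dimensions the two distances differ only by a small fraction, which is the behavior described above.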

Another problem of high dimensionality related to HON is that the computation time and the storage space grow with the number of cells, which itself increases exponentially with the dimensionality. Moreover, the number of connections between cells becomes quadratic even if only O(n) non-empty cells are considered. To address the problem of high dimensionality that might penalize HON performance, in this chapter we introduce high dimensionality problems, study the existing dimensionality reduction techniques and adapt a feature selection technique to our distributed and dynamic environment.

4.2 The Curse of Dimensionality

The empty space phenomenon is the starting point of the curse of dimensionality. It results from the fact that the number of data points does not increase exponentially with the number of dimensions. Jimenez et al. [51] and Hinneburg et al. [46] have studied the behavior of data in high dimensions, which leads to two main consequences. First, in high dimensional space, it is generally impossible to determine the distribution of data with sufficient statistical significance. The problem is that for an increasing number of dimensions, the number of data points is not high enough to determine the distribution of data. Second, high dimensional space is mostly empty, which implies that data can be described in a lower dimensionality without losing a significant amount of information.

Let us take an example of a uniform distribution in a high dimensional space. We assume that we have a 50-dimensional space, where each feature is divided into 2 partitions. The number of cells is then equal to $2^{50}$ (approximately $1.1 \times 10^{15}$). If we generate one trillion data points ($10^{12}$) and assume that they are uniformly distributed in the feature space, nearly every data point will be mapped to a different cell. Therefore, at most about 0.1% of the cells are filled (well under one percent), which means that the space is mostly empty.
Here it is impossible to determine the distribution of data based on so small a fraction of the cells. In this case, the only thing that can be verified is the distribution of the data in a projected space having lower dimensionality.

A problem related to the empty space phenomenon is the sparsity of data. Figure 4.1 shows a data set of 10 points randomly placed between 0 and 1 in a one dimensional space. We notice that adding new dimensions spreads these data points along the other axes. This problem has many consequences, especially on the neighborhood measurement between data objects. As the dimensionality increases, the difference between the distances to the nearest and the farthest neighbor of a data point becomes negligible relative to the distances themselves. This problem is known as the Nearest Neighbor Problem (NN) [18, 62, 14, 115]. The nearest

neighbor is defined as:

Definition 10 (Nearest Neighbor) Let $x_{NN}$ be the nearest neighbor of a given query $q \in \mathbb{R}^d$ in the data set $D \subset \mathbb{R}^d$:

$x_{NN} = \{x \in D \mid \forall x' \in D,\ x' \neq x : dist(x, q) \leq dist(x', q)\}$

where $dist(x, q)$ is a function that measures the distance between the data point $x$ and the query $q$.

Figure 4.1: Sparsity of Data

It has been shown in [14] that the nearest neighbor becomes meaningless when the dimensionality increases. This is an even more fundamental problem because it is related to the search quality. There can be several reasons for the meaninglessness of the nearest neighbor search in high dimensional space. One of these reasons is the sparsity of the data objects in the space, where it is very unlikely that data points are nearer to each other than the average distance between data points. Hence, all pairs of points are almost equidistant from one another for a wide range of data distributions and distance functions. As mentioned before, the difference between the distances from a given query point to its nearest and farthest neighbors becomes negligible relative to the distances themselves with increasing dimensionality. Beyer et al. have proved in [14] that this difference does not increase as fast as the distance from the query point to the nearest points when the dimensionality goes to infinity. In such a case, the nearest neighbor query is called unstable. We assume that $D_{min_d}$ is the distance of the query point $q$ to the nearest neighbor and $D_{max_d}$ is the distance of the same query point $q$ to the farthest neighbor in d-dimensional space. The theorem by Beyer et al. [14] states that, under certain rather general preconditions, the ratio of the difference between the distances of the nearest and the farthest points ($D_{max_d} - D_{min_d}$) to $D_{min_d}$ converges to zero with increasing dimensionality.

Figure 4.2: Query Instability

The theorem by Beyer et al. can be formally stated as follows:

Theorem (Beyer et al.) If
$\lim_{d \to \infty} \mathrm{var}\left(\frac{\|X_d\|}{E(\|X_d\|)}\right) = 0$,
then
$\frac{D_{max_d} - D_{min_d}}{D_{min_d}} \to_p 0$
where:
1. $X_d$: a data point from the space $\mathbb{R}^d$
2. $\|\cdot\|$: the distance of a vector to the origin (0,...,0)
3. $E(X)$: the expected value of a random variable $X$
4. $\mathrm{var}(X)$: the variance of a random variable $X$

In high dimensional space, even though there is a well defined nearest neighbor, the difference in distance between the nearest neighbor and any other point in the data set is very small. Thus, the answer might not help in solving some concrete problems such as finding the minimal traveling cost. To describe this phenomenon, Beyer et al. [14] examine the number of points that fall into a query sphere enlarged by some factor ɛ. If few points fall into this enlarged sphere, the data point nearest to the query point is separated from the rest of the data in a meaningful way, as shown in figure 4.2a. On the other hand, if many data points fall into the enlarged sphere, differentiating the nearest neighbor from these data points is meaningless if ɛ is small, as shown in figure 4.2b. This phenomenon is called Query Instability and is defined as follows:

Definition 12 (Query Instability) A nearest neighbor query is unstable for a given ɛ if the distance from the query point

to most data points is less than (1 + ɛ) times the distance from the query point to its nearest neighbor.

Beyer et al. [14] have shown that in many situations, for any fixed number ɛ > 0, the probability that a query is unstable converges to 1 when the dimensionality increases.

4.3 Challenges of High Dimensional Clustering

The clustering problem has been studied in several areas such as data mining and machine learning. Many diverse techniques have appeared in order to discover clusters in large datasets. The basic clustering techniques are classified into two categories: Partitional and Hierarchical. First, Partitional algorithms construct k partitions of the data, where each cluster optimizes a specific criterion, such as the distance or similarity measure from the mean within each cluster. The most commonly used partitional algorithms are k-means [49, 60], PAM (Partitioning Around Medoids) [60], CLARA (Clustering LARge Applications) [60] and CLARANS (Clustering LARge ApplicatioNS) [88]. Second, Hierarchical algorithms create a hierarchical decomposition of data objects. This decomposition can be Agglomerative or Divisive. Agglomerative techniques consider at the beginning that each object belongs to a separate cluster, and successively merge the clusters according to a distance measure. The algorithm may stop when all objects are in a single cluster or when specific criteria are satisfied. On the other hand, Divisive algorithms start with one cluster of all objects and successively split it into smaller clusters, until each object (or a set of objects) falls in a separate cluster.

Both Partitional and Hierarchical algorithms perform well when the dimensionality of the data is relatively small. However, they might provide meaningless clusters when the dimensionality increases. As described before, in high dimensional spaces we cannot distinguish the farthest and the nearest neighbor from a given query point.
Therefore, distance-based clustering algorithms such as OPTICS [7] and DBSCAN [34] are sensitive to the curse of dimensionality. High dimensional data need specific clustering techniques to deal with the problems described above. Several clustering approaches have been proposed in the literature for this purpose, such as CLIQUE [6], MAFIA [82], DENCLUE [45], OptiGrid [46], WaveCluster [103] and STING [119]. These techniques are widely used in high dimensional databases and most of them are classified as Grid-Based approaches. They organize the data set into a number of cells and then work with objects belonging to these cells. They form the clusters by merging the existing grid cells. Consequently,

they do not depend on any specific distance measure. We note that the clustering approach that we use in HON falls into this category. It has been shown in [46, 109] that Grid-based approaches suffer from other problems when the dimensionality increases. First, the number of cells increases exponentially with increasing dimensionality. For example, with 30-dimensional data and only 2 partitions for each feature, we can have more than one billion cells. Moreover, these cells need to be represented and stored using arrays. As the dimensionality increases, these algorithms run into space problems when storing the description of the whole feature space. Furthermore, as described in section 4.2, the high dimensional space is mostly empty. Therefore, a cluster might be divided into a large number of cells, and many or even all of these cells might have a density lower than the required threshold. In addition, the clusters might only exist in subsets of high dimensional spaces. Since the number of possible subspaces is also exponential in the dimensionality of the space, clusters cannot be easily defined.

4.4 Dimensionality Reduction Techniques

The dimensionality of the feature space can be reduced without losing important information. Dimensionality reduction techniques can be classified into two categories: Feature Construction and Feature Selection. Feature construction approaches project points from a higher dimensional space to a subspace having a lower dimensionality. The subspace is composed of a set of new features that are combinations of the original features. Feature selection techniques consider that not all features are important in describing data objects. Therefore, they select a subset of the original features that describes most of the data and discard the rest.
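The difference between the two categories can be made concrete with a small sketch. The function names `select_features` and `construct_features`, the sample point and the weights are hypothetical and serve only as an illustration: selection keeps some of the original coordinates unchanged, while construction replaces them with linear combinations.

```python
def select_features(point, kept_indices):
    """Feature selection: keep a subset of the original coordinates."""
    return [point[i] for i in kept_indices]


def construct_features(point, weight_rows):
    """Feature construction: each new feature is a linear combination
    of all the original features (one row of weights per new feature)."""
    return [sum(w * x for w, x in zip(row, point)) for row in weight_rows]


point = [4.0, 2.0, 6.0]  # a data object described by 3 features

# Selection: features 0 and 2 are kept as they are, feature 1 is discarded.
selected = select_features(point, [0, 2])

# Construction: the first new feature averages features 0 and 1,
# the second new feature copies feature 2.
constructed = construct_features(point, [[0.5, 0.5, 0.0],
                                         [0.0, 0.0, 1.0]])
```

Both results have two dimensions, but only the selected features keep their original meaning, a point that matters when peers described by different feature sets have to communicate.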
In the following, a detailed presentation of both dimensionality reduction categories is given.

Feature Construction

The purpose of dimensionality reduction using feature construction is to transform a high dimensional space into a low dimensional space, while retaining most of the underlying structure of the data. The main goal is to avoid nearest neighbor problems and to reduce the computation time and storage space. Reducing data dimensionality might lead to a loss of information. To illustrate this, let us consider the example of figure 4.3. In the two dimensional feature space $<f_1, f_2>$, the two closest points to A are B and F. However, if we retain only the feature $f_1$ coordinates, then the point

D becomes one of the closest points to A. Likewise, if we retain only the feature $f_2$ coordinates, then the point C becomes one of the closest points to A. In both cases, there is a loss of information when we move from a 2-dimensional space to a 1-dimensional space.

Figure 4.3: Example of 2-dimensional data

To preserve the original distances between data points, the data may be first transformed so that most of the information gets concentrated in a few dimensions. These dimensions are then identified and the coordinates of the data items in these dimensions are subsequently used for indexing to support fast searching. Figure 4.4 shows the data transformation, illustrated by the projection of data in a new space $\{f'_1, f'_2\}$, after which a reduction of dimensionality is applied that preserves the original distances between the different data points. The closest points to A are B and F, and remain the same when we retain only the feature $f'_1$ or $f'_2$.

In mathematical terms, the problem of dimensionality reduction using feature construction can be stated as follows: given the n-dimensional random variable $x = (x_1, \dots, x_n)^T$, find a lower dimensional representation of it, $s = (s_1, \dots, s_k)^T$ with $k \leq n$, that captures the content in the original data, according to some criteria. The components of $s$ are sometimes called the hidden components. Different fields use different names for the n multivariate vectors: the term variable is mostly used in statistics, while feature and attribute are alternatives commonly used in the computer science and machine learning literature.

Several dimensionality reduction techniques have been proposed for different problems. Carreira-Perpiñán [19] classifies dimensionality reduction problems into three categories: Hard, Soft and Visualisation. In Hard dimensionality reduction problems, the dimension of data objects ranges from hundreds to hundreds

of thousands of features. Those features are often repeated measures of a certain magnitude at different points in space or at different instants of time. This category of problems includes pattern recognition and classification studies involving images or audio data types. The Soft category of dimensionality problems includes data objects having at most a few tens of features. Most statistical analyses in fields such as social science and psychology fall into this category. Features describing the data objects are most of the time a set of observed or measured values. The number of these features is never very high, which makes the dimensionality reduction not very drastic. The last category of reduction problems is Visualisation, where the data objects are not high dimensional but need to be projected into 2, 3 or 4 dimensional spaces in order to plot them. Our approach falls into the first category of reduction problems since we are using multimedia data objects such as images and audio. It is considered a Hard problem because the dimensionality of data objects can reach thousands of features.

Figure 4.4: Axes change from $(f_1, f_2)$ to $(f'_1, f'_2)$

Each category of reduction problems has its appropriate dimensionality reduction techniques. In practical cases, the most widespread technique used for Hard problems is Principal Component Analysis (PCA). The Soft problems also employ the PCA algorithm and other techniques such as Factor Analysis [31], Discriminant Analysis [10] and Multidimensional Scaling [67]. For Visualisation purposes, many methods have been used in practice, including PCA, Projection Pursuit [51], Multidimensional Scaling [67] and Self-Organizing Maps [11], including their variants. These techniques are described in several papers that give a general overview [19, 90, 115, 36]. They are classified into two categories: Linear and Non-Linear.
Most dimensionality reduction methods are linear, meaning that the extracted features are linear functions of the input features. Each of the $k \leq n$ components of the new variable is a linear combination of the original

variables:

$s_i = w_{i,1} x_1 + \dots + w_{i,n} x_n$, for $i = 1, \dots, k$

or $S = WX$, where $W_{k \times n}$ is the linear transformation weight matrix. The same relationship can be expressed as $X = AS$, with $A_{n \times k}$.

Linear methods are easy to understand and very simple to implement, but the linearity assumption does not lead to good results in many cases. For example, images of handwritten digits do not conform to the linearity assumption. A rotation of objects can at best be approximated by linear functions only in a small neighborhood. These limitations have motivated the design of nonlinear methods in a general setting. Nonlinear methods try to preserve linearity in the locality of each point. They try to map the original data space to a lower subspace, preserving as much as possible the structure of the original space. Such techniques process each individual data point separately. Therefore, non-linear techniques might provide different subspaces.

Figure 4.5: Principal Components

Principal Component Analysis (PCA) is one of the most popular linear techniques. In various fields, it is also known as the Singular Value Decomposition (SVD) and the Empirical Orthogonal Function (EOF) method. PCA aims to reduce the dimensionality of data objects by finding a few orthogonal linear combinations (the PCs) of the original features with the largest variance. The first PC, represented by $s_1$, is the linear combination with the largest variance. Let $\mathrm{Var}(x)$ denote the variance of the variable $x$; then $s_1 = x^T w_1$,

where the n-dimensional coefficient vector $w_1 = (w_{1,1}, \dots, w_{1,n})^T$ solves:

$w_1 = \arg\max_{\|w\| = 1} \mathrm{Var}\{x^T w\}$

The second PC is the linear combination with the second largest variance that is orthogonal to the first PC, and so on. The number of PCs is equal to the number of the original features. For many datasets, the first few PCs explain most of the variance; therefore the rest can be disregarded with minimal loss of information. Figure 4.5 shows an example of two principal components.

Let us consider a sample $\{x_i\}_{i=1}^{m}$ in $\mathbb{R}^n$ with mean $\bar{x} = \frac{1}{m} \sum_{i=1}^{m} x_i$ and covariance matrix $\Sigma = E[(x - \bar{x})(x - \bar{x})^T]$. We can use a spectral decomposition to write $\Sigma$ as $\Sigma = U \Lambda U^T$, where $\Lambda = diag(\lambda_1, \dots, \lambda_n)$ is the diagonal matrix of the ordered eigenvalues $\lambda_1 \geq \dots \geq \lambda_n$, and $U$ is an $n \times n$ orthogonal matrix containing the eigenvectors. The PCs are then given by the $n \times k$ matrix $S$ where:

$S = U^T X$

Feature Selection

Feature selection has been the focus of interest in many fields and applications such as data mining [29, 30, 63], machine learning [16, 52, 64, 65], pattern recognition [48, 81, 105, 120], text categorization [31, 89, 121] and image retrieval [122, 111]. The main idea of feature selection is to reduce the number of features by removing irrelevant, redundant, or noisy information in order to improve application performance, such as computation time and result comprehensibility. In this section, we present the feature selection process and some of the feature selection algorithms related to our research area.

Feature Selection Process

Feature selection methods search through the subsets of N features, and try to find the best one among the $2^N$ competing candidate subsets according to some evaluation function. The search space of $2^N$ subsets grows exponentially with N, which makes an exhaustive search prohibitively expensive. Therefore, different practical methods use heuristic or random search to reduce the computational complexity and to find a near-optimal result [32, 84].
These methods need a stopping criterion to prevent an exhaustive search. A typical feature selection process consists of four basic steps [71]: Subset Generation, Subset Evaluation, Stopping Criterion and Result Validation, as shown in figure 4.6.

Subset Generation: consists of generating a candidate subset for evaluation. The efficiency of this process depends on the search starting point and the search strategy. First, the starting point of the search process takes different forms that influence the search direction. A search might start with an empty set and successively add features, or start with a full set and successively remove features. These two techniques are respectively called forward and backward. Another technique starts the search by selecting a random set of features to optimize the search computation. Second, the search strategy has to be defined in such a way as to avoid an exhaustive search, since the search space is exponentially large. Different search strategies have been proposed in the literature: complete, sequential and random search.
1. The complete search guarantees to find the optimal result, using heuristic functions to reduce the search space. Therefore, even if the order of the search space is $O(2^N)$, a smaller number of subsets is evaluated.
2. The sequential search gives up completeness and risks missing optimal subsets. Sequential methods add and remove features one at a time.
3. The random search selects the subset of features randomly in two different ways. The first one uses a sequential search by adding random values to the original subset, while the second one does not depend on any previous subset and proceeds in a completely random manner.

Subset Evaluation: each subset that results from the subset generation step needs to be evaluated using an evaluation criterion. Several criteria can be used to select the optimal subset. We note that an optimal subset selected using one criterion may not be optimal according to another criterion. Two groups of evaluation criteria can be distinguished.
Independent criteria are generally used in the filter model, where no mining algorithm is involved to evaluate the optimality of the feature subset. By contrast, Dependent criteria are used in the wrapper model, which requires a predetermined mining algorithm to define which features are selected.

Stopping Criterion: determines when the feature selection process should stop. The stopping criteria can take different forms. One example is to analyze the evolution of the subset of selected features: if the addition (or deletion) of any feature does not produce a better subset, the feature selection process can stop. Another example is to consider the minimum number of features or the maximum number of iterations.

Result Validation: is the last step of the feature selection process. It measures the final result using different methods. One example is to use prior knowledge about the data where the relevant features are known in advance. In this case, the known set of features is compared with the selected features to measure the quality of the result. Other methods, which are more appropriate to real world applications, do not rely on such prior knowledge. A possible result validation technique can use a classification error rate as a performance indicator. This technique can simply conduct a before-and-after experiment to compare the error rate of the classifier applied on the full set of features and on the selected subset of features.

Figure 4.6: Four Steps of Feature Selection

Feature Selection Algorithms

Feature selection can be found in many areas of data mining such as classification, clustering, association rules and regression. In our work we focus on studying feature selection algorithms that concern classification and clustering. Feature selection for classification uses labeled data and is called supervised feature selection, while feature selection for clustering uses unlabeled data and is called unsupervised feature selection. Feature selection algorithms use different evaluation criteria. They can be classified into three categories: Filter Model, Wrapper Model and Hybrid Model. The filter model relies on general characteristics of the data to evaluate and select feature subsets without involving any mining algorithm. The wrapper model requires a predetermined mining algorithm and uses its performance as the evaluation criterion. This model tends to be more computationally expensive than the filter model. Finally, the hybrid model aims to combine the advantages of the two models by exploiting their different evaluation criteria in different search stages.
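The first three steps of the selection process, combined with a filter-style criterion, can be sketched as a sequential forward search. This is a minimal illustration and not the weighted feature selection used in HON: the evaluation criterion (total variance of the chosen features), the function names `subset_score` and `forward_selection`, the `min_gain` threshold and the toy data are all assumptions made for the example.

```python
from statistics import pvariance


def subset_score(data, subset):
    """Filter-style criterion (assumed here): total variance of the
    features in the subset. No mining algorithm is involved."""
    return sum(pvariance([row[j] for row in data]) for j in subset)


def forward_selection(data, max_features, min_gain=1e-9):
    """Sequential forward search: start from the empty set and add,
    one at a time, the feature that improves the score the most."""
    n_features = len(data[0])
    selected = []
    best_score = 0.0
    while len(selected) < max_features:      # stop: enough features selected
        candidates = [j for j in range(n_features) if j not in selected]
        if not candidates:
            break
        # Subset generation and evaluation for every candidate extension.
        score, feature = max(
            (subset_score(data, selected + [j]), j) for j in candidates
        )
        if score - best_score <= min_gain:   # stop: no better subset found
            break
        selected.append(feature)
        best_score = score
    return selected


# Toy data set: 4 objects, 3 features. Feature 1 is constant (uninformative).
data = [
    [1.0, 5.0, 0.0],
    [2.0, 5.0, 10.0],
    [3.0, 5.0, 0.0],
    [4.0, 5.0, 10.0],
]
chosen = forward_selection(data, max_features=3)
```

With this data the search keeps features 2 and 0 and stops before adding the constant feature 1, because adding it does not improve the score. A wrapper variant would replace `subset_score` with the performance of a predetermined mining algorithm run on each candidate subset, at a higher computational cost.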

Discussion

We have presented two classes of dimensionality reduction techniques. In this section we describe their properties and discuss their possible application in P2P environments, and particularly in HON. The first class of dimensionality reduction is feature construction, which aims to project the data objects from a high dimensional space to a lower dimensional space. The resulting subspace is described by a set of new features that are combinations of the original features. Most feature construction techniques are centralized. They are called GDR techniques (Global Dimensionality Reduction) because they use the whole dataset for the projection. This projection results in a unique subspace that preserves the original information as much as possible. In a P2P context, the dataset is highly distributed among peers. Therefore, all the data would have to be collected on a central server to perform the projection, which limits the use of GDR techniques in P2P systems. It has been shown in [20] that GDR techniques provide good results only if the dataset is globally correlated, which means that the variation in the data can be captured by a few dimensions. In practice, datasets are often not globally correlated. Thus, GDR techniques lead to a significant loss of information. Even when global correlation does not exist, there might exist subsets of data that are locally correlated. The idea here is to use Local Dimensionality Reduction (LDR), which is applied individually to clusters of locally correlated data. As a result, we obtain a different subspace for each cluster. In P2P systems, peers belonging to different subspaces need to communicate. Hence, mapping techniques have to be introduced. Since the features that describe the new subspaces are combinations of the original features and are therefore meaningless to the user, the mapping requires a more complex schema.
Some efforts have been made to propose distributed dimensionality reduction techniques [76, 3]. These techniques are based on sampling. Each node of the network sends a sample of data representing its content to a central server. This server projects the set of samples into a new subspace and sends its description to all nodes of the network. When the nodes receive that description, they project all their data into the new subspace. In fact, these techniques have a distributed input but a centralized processing. The difference from the centralized techniques is that only samples, and not the whole dataset, are sent to the central server. We note that a sample is significant only for its local node. If we group these samples, we might obtain a set of data that are not globally correlated. Therefore, the dimensionality reduction causes a loss of information. To summarize, feature construction techniques can hardly be used in P2P systems

for several reasons. First, they are mainly centralized and require a global view of the existing dataset. Second, these techniques perform well only when the data are globally correlated, a condition that cannot always be satisfied in practice, as in P2P systems. Third, the local application of dimensionality reduction techniques to clusters or individual peers results in different subspaces. As mentioned before, the mapping between subspaces whose features are not meaningful to the user might be difficult. As a result, we use the feature selection model to reduce the dimensionality in HON. Feature selection can be applied in each peer independently of the other peers in the network. Moreover, it preserves the original features, which reduces the complexity of mapping between the resulting subspaces.

4.5 Dimensionality Reduction in HON

Dimensionality reduction in HON is based on feature selection, where each peer selects a subset of features that describe most of its data objects. To select this subset of features, we exploit a filter technique that uses a goodness criterion depending on cell densities and threshold values to eliminate insignificant features. The idea is to select features that describe the dense regions of the feature space. For this purpose, we use a variant of the Weighted Feature Selection algorithm proposed by Wang et al. [118]. This algorithm aims to (1) cluster data points using the k-means algorithm [49, 60] and (2) extract the relevant features for each cluster using histogram analysis to assign greater weight to relevant features. In our work, the selection of features is processed in each peer without any clustering consideration. We describe below how feature weights are defined.

Weighted Feature Selection

Let $f_1, \dots, f_n$ be the set of features describing the feature space.
We have defined a partition of the feature space into cells by dividing the range of values $[f_i^{min}, f_i^{max}]$ of a feature $f_i$ into $m_i$ intervals of size $(f_i^{max} - f_i^{min}) / m_i$, for $i = 1, 2, \dots, n$. We assume that the denser the distribution of a given feature $f_i$, the greater the probability that $f_i$ is the dominant feature in representing the dataset. Under this assumption, we start by computing for each feature $f_i$ the number of data objects in its partitions. Let $D_{ij}$ be the number of data objects in the partition $P_{ij}$ of the feature $f_i$. We then define

a region of a feature $f_i$ by:

\[ \mathrm{FeatureRegion}(i) = \sum_{j=1}^{m_i} D_{ij} \cdot \frac{f_i^{max} - f_i^{min}}{m_i} \]

The density for the feature $f_i$ is defined by:

\[ \mathrm{FeatureDensity}(i) = 1 - \frac{\mathrm{FeatureRegion}(i)}{\max_j(D_{ij}) \cdot (f_i^{max} - f_i^{min})} \]

Figure 4.7: Feature Density

$\mathrm{FeatureDensity}(i)$ represents the density value of the distribution of the feature $f_i$. Under a uniform distribution, the number of data objects in each partition $P_{ij}$ is almost the same, as shown in figure 4.7b. Consequently the ratio

\[ \frac{\mathrm{FeatureRegion}(i)}{\max_j(D_{ij}) \cdot (f_i^{max} - f_i^{min})} \]

tends to 1, resulting in a low feature density. By contrast, if the data objects are distributed following a Zipfian law, where they are highly concentrated in a few partitions as shown in figure 4.7a, the feature density value becomes higher. Therefore, the larger $\mathrm{FeatureDensity}(i)$ is, the denser the value distribution of $f_i$. In other words, the feature $f_i$ is a dominant feature in describing the dataset. The $\mathrm{FeatureDensity}(i)$ values are then used to compute the weight that we associate with each feature. This weight indicates the relevance of the feature. Let $\{w_1, \dots, w_n\}$ be the weights corresponding to the features $\{f_1, \dots, f_n\}$. We define the weight $w_i$ of the feature $f_i$ as:

\[ w_i = \frac{\mathrm{FeatureDensity}(i)}{\sum_{j=1}^{n} \mathrm{FeatureDensity}(j)} \]

Two different strategies can be used to select the important features in describing the dataset. The first strategy uses a threshold value $T_W$. If the weight $w_i$ of a feature $f_i$ is higher than the threshold value $T_W$, then the feature $f_i$ is selected.

Otherwise, it is discarded. In this case the number of selected features varies from one dataset to another, because the selection is based on the weight of the features and their importance in describing the content. The second strategy consists in defining the number $k$ of features to be selected. In this case, we rank the features according to their weight values. Then only the first $k$ features are selected as the most important ones. Each peer locally runs the Weighted Feature Selection technique to select the features most relevant to its data objects. Therefore, two peers can be described by the same set of features, by partially overlapping sets of features, or by completely different sets of features. Our goal is to build overlays with a limited number of features, where each overlay is associated with a subset of features $f_1, \dots, f_k$ that describe a common set of peers.

Overlays Creation

To define the features that compose each overlay, we employ the Jarvis-Patrick technique [50], which uses a Nearest Neighbor algorithm. The idea of this algorithm is to determine the $J$ nearest neighbors of each object in the dataset to be clustered. Two objects are placed in the same cluster if (1) they are contained in each other's neighbor list and (2) they have at least $k$ nearest neighbors in common. In our work we apply the Jarvis-Patrick approach to define the features that belong to the same overlay. We introduce a new definition of feature neighborhood and a specific measure to select the nearest neighbors of a given feature.

Figure 4.8: Proximity and Nearest Neighbor Graph

Definition 13 (Feature Neighborhood) Let $f_1$ and $f_2$ be two features that describe respectively two sets of peers $P_1 = \{p_{11}, \dots, p_{1r}\}$ and $P_2 = \{p_{21}, \dots, p_{2s}\}$. The two features $f_1$ and $f_2$ are neighbors if they share at least

one peer, that is, if $\exists\, p_i$ such that $p_i \in P_1$ and $p_i \in P_2$.

The neighborhood between features depends on the number of shared peers. Let us assume that the feature $f_1$ shares the description of 100 peers with the feature $f_2$ and the description of 10 peers with the feature $f_3$. We consider here that the feature $f_1$ is nearer to $f_2$ than to $f_3$. To select the nearest neighbors, we define a measure called Neighborhood Degree, described as follows:

Definition 14 (Neighborhood Degree) Let $f_1$ be a feature that describes the set of peers $P_1 = \{p_1, \dots, p_r\}$ and $f_2$ a feature that describes the set of peers $P_2 = \{p_1, \dots, p_s\}$. The neighborhood degree between the features $f_1$ and $f_2$ is defined by the number of their shared peers:

\[ ND(f_1, f_2) = \mathrm{COUNT}(p_i), \quad p_i \in P_1 \text{ and } p_i \in P_2 \]

Figure 4.9: Shared Nearest Neighbor Graph

We now describe how the Jarvis-Patrick algorithm is used to decide which features to put in the same overlay. The algorithm is composed of three main steps:

1. We find the $n$ nearest neighbors of all the features. We build a Weighted Proximity Graph where the nodes represent the features and the edges represent the neighboring relationship between features. An edge links two nodes if the corresponding features are neighbors. A weight indicating the degree of neighborhood is associated with each edge. Figure 4.8a shows the proximity graph representing the neighbors of all the features. After constructing the proximity graph, we break all links that have a weak weight. For this purpose we define a Threshold Degree value $T_D$. If the weight of an edge linking two features is lower than $T_D$, the edge is broken. Therefore, only strong links between nodes remain, and the graph is transformed into a Nearest Neighbor Graph. As shown in figure 4.8b, all the edges

having a weight less than 10 are deleted.

2. We define the number of nearest neighbors shared by any two features. To do that, we transform the Nearest Neighbor Graph into a Shared Nearest Neighbor Graph by replacing the weight of each edge between two features in the nearest neighbor graph by the number of their shared neighbors. In figure 4.9b, the edges linking two features show that the features are neighbors. The weight indicates the strength of the edge in the shared nearest neighbor graph.

3. We group features into overlays. If two features share more than $SHT$ neighbors, i.e., have a link in the shared nearest neighbor graph whose weight is higher than a predefined threshold $SHT$, then the two features and any overlays they belong to are merged. Hence, overlays are defined as connected components in the shared nearest neighbor graph. The number of features is an important criterion in deciding whether two overlays can be merged. Recall that an overlay needs to have a limited number of features to avoid high dimensionality problems. In figure 4.10, we show an example where we set the Shared Threshold value $SHT$ to 1. In this case two overlays can be defined: Overlay1, containing the features $\{f_2, f_3, f_6\}$, and Overlay2, containing the features $\{f_1, f_4, f_5, f_7\}$.

Figure 4.10: Overlays of Features

4.6 Overlays Architecture

We have presented in the previous sections how to select the most important features in describing data objects and how to organize them into a set of overlays. The next step is to organize peers according to these overlays and to define the routing mechanisms. We note here that two peers can be described by different sets of features, in which case they belong to different overlays.
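The three steps of the overlay creation procedure from section 4.5 can be summarized in a short Python sketch. It is a simplified illustration under stated assumptions, not the implementation used in HON: the function name `build_overlays`, its parameter names and the toy peers are hypothetical, edges are kept when their weight is at least $T_D$, and the limit on the number of features per overlay is not enforced.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, FrozenSet, List, Set


def build_overlays(
    peer_features: Dict[str, Set[str]],
    degree_threshold: int,  # T_D in the text
    shared_threshold: int,  # SHT in the text
) -> List[Set[str]]:
    # Step 1a: neighborhood degree ND(f1, f2) = number of shared peers.
    degree: Dict[FrozenSet[str], int] = defaultdict(int)
    for feats in peer_features.values():
        for f1, f2 in combinations(sorted(feats), 2):
            degree[frozenset((f1, f2))] += 1

    # Step 1b: break weak edges to obtain the nearest neighbor graph.
    features = sorted({f for feats in peer_features.values() for f in feats})
    neighbors: Dict[str, Set[str]] = {f: set() for f in features}
    for pair, weight in degree.items():
        if weight >= degree_threshold:
            f1, f2 = tuple(pair)
            neighbors[f1].add(f2)
            neighbors[f2].add(f1)

    # Union-find structure used to merge features into overlays.
    parent = {f: f for f in features}

    def find(f: str) -> str:
        while parent[f] != f:
            parent[f] = parent[parent[f]]
            f = parent[f]
        return f

    # Steps 2 and 3: weight each remaining edge by the number of shared
    # neighbors and merge the endpoints when it exceeds SHT.
    for f1 in features:
        for f2 in neighbors[f1]:
            shared = len(neighbors[f1] & neighbors[f2])
            if shared > shared_threshold:
                parent[find(f1)] = find(f2)

    groups: Dict[str, Set[str]] = defaultdict(set)
    for f in features:
        groups[find(f)].add(f)
    return sorted(groups.values(), key=lambda g: sorted(g))


# Toy input (assumption): each peer lists its selected features.
peers = {
    "p1": {"a", "b", "c"}, "p2": {"a", "b", "c"}, "p3": {"a", "b", "c"},
    "p4": {"d", "e", "f"}, "p5": {"d", "e", "f"}, "p6": {"c", "d"},
}
overlays = build_overlays(peers, degree_threshold=2, shared_threshold=0)
print(overlays)
```

In this toy input the weak link between `c` and `d` (one shared peer) is broken in step 1, so the two groups of features end up in separate overlays. Features with no sufficiently strong shared-neighbor link remain in overlays of their own.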

Figure 4.11: Architecture of Overlay Networks

Figure 4.11 shows the architecture of the system after the dimensionality reduction. An overlay is represented by a space of features. Each overlay maintains a description of the space of its constituent features. In each overlay, data and peers are organized in the feature space, and clusters of cells are created as presented previously in chapter 3. Peers belonging to the same overlay communicate through their shared feature space, while peers that belong to different overlays use their common peers as a bridge for routing messages. To create the overlay architecture, we assume that initially there is one super peer to which all peers connect. When a peer joins the network, it sends to the super peer the set of features that describes its content. The super peer creates a nearest neighbor graph of all the features that describe the existing peers and updates it after each peer join and departure. When the links between features in the nearest neighbor graph become strong, the super peer creates overlays of features and assigns their management to other super peers. Each of those super peers creates its own feature space and organizes the data and the peers in it as described previously.

4.7 Evaluation Results

We have evaluated the weighted feature selection technique that we proposed to reduce dimensionality in HON. An efficient dimensionality reduction technique should preserve data information as much as possible. Concretely, we use three metrics for the evaluation: Recall, Precision and F-Measure, described as follows:

1. Recall $r$: represents the percentage of retrieved responses out of the available responses in the network. Let $TR$ be the total number of relevant responses for a query $Q$ and $RR$ be the number of retrieved responses. The recall is computed by $RR/TR$.

2. Precision $p$: represents the percentage of relevant responses to the query out of the retrieved responses. Let $RR$ be the number of retrieved responses for a query $Q$ and $RS$ be the number of relevant responses. The precision is computed by $RS/RR$.

3. F-Measure: is a weighted harmonic mean of precision and recall. Let $\alpha \in (0, +\infty)$ be the weight of $r$ and 1 be the weight of $p$. The weighted harmonic mean of $r$ and $p$ is given by:

\[ F_\alpha = \frac{(\alpha + 1)\, r\, p}{r + \alpha p} \]

In our simulation we use the $F_1$ measure, which gives the same weight to the recall $r$ and the precision $p$. $F_1$ is given by:

\[ F_1 = \frac{2\, r\, p}{r + p} \]

We consider in our simulation a global set of features that describe peers' data objects (content). Each peer locally applies the weighted feature selection algorithm to select the important features in describing its content. The number of selected features varies from one peer to another depending on the distribution of its data objects. Two types of data distribution are used in this simulation: uniform data distribution and Zipfian data distribution, defined in a later section. Recall that under a uniform distribution, peers' data objects have an equal chance of being mapped to any cell of the feature space. By contrast, under a Zipfian distribution, data objects are mapped to a few cells of the feature space. We simulate 1000 peers described by 30 common features. Once peers have selected their relevant features using the weighted feature selection algorithm, they initiate 500,000 queries to evaluate the quality of the search in the reduced spaces by computing the average recall, precision and F-measure.
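The three metrics can be computed as in the following Python sketch. The function names and the sample sets are hypothetical. The sketch adopts the usual set-based reading in which the numerator of both recall and precision counts the responses that are both retrieved and relevant; that reading is an assumption made for illustration.

```python
from typing import Set


def recall(relevant: Set[int], retrieved: Set[int]) -> float:
    """Share of the relevant responses that were retrieved."""
    return len(relevant & retrieved) / len(relevant) if relevant else 0.0


def precision(relevant: Set[int], retrieved: Set[int]) -> float:
    """Share of the retrieved responses that are relevant."""
    return len(relevant & retrieved) / len(retrieved) if retrieved else 0.0


def f_measure(r: float, p: float, alpha: float = 1.0) -> float:
    """Weighted harmonic mean F_alpha = (alpha + 1) r p / (r + alpha p)."""
    denominator = r + alpha * p
    return (alpha + 1.0) * r * p / denominator if denominator > 0 else 0.0


# Toy query (assumption): four relevant objects, six retrieved, two in common.
relevant_ids = {1, 2, 3, 4}
retrieved_ids = {3, 4, 5, 6, 7, 8}
r = recall(relevant_ids, retrieved_ids)
p = precision(relevant_ids, retrieved_ids)
print(r, p, f_measure(r, p))

# F1 for a recall of 88% and a precision of 56%.
print(round(f_measure(0.88, 0.56), 2))
```

With `alpha = 1` the function reduces to the $F_1$ measure used in the simulation.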
To study the behavior of the feature selection algorithm under different data distributions, we run four different simulations corresponding to four possible cases. In the first case, we consider a uniform distribution of peers' content and a uniform distribution of queries. In the second case, we consider a Zipfian distribution of peers' content and a uniform distribution of queries. In the third case, we


consider a uniform distribution of peers' content and a Zipfian distribution of queries. Finally, in the fourth case, we consider a Zipfian distribution of peers' content and a Zipfian distribution of queries.

Figure 4.12: Weighted feature selection performance

In the first simulation, queries follow a uniform distribution where peers send queries randomly to cells. Figure 4.12 shows that with a uniform distribution of peers' content, the average recall is equal to 58% and the precision is equal to 32%. This means that 58% of the relevant answers for a given query are retrieved using the reduced spaces. In addition, the relevant answers correspond to 32% of the total answers. Thus, 68% of the answers are irrelevant and generate unnecessary messages over the network. The F-measure in this case is equal to 42%, indicating an inefficient search. Note that in the second simulation, a Zipfian distribution of peers' content increases the recall to 87% but does not improve the overall performance, since the precision is only 23%, as shown in the figure.

In the third and the fourth simulations, queries follow a Zipfian distribution where peers send queries to a few cells in the feature space. Figure 4.12 shows that the average precision and F-measure are significantly improved with a Zipfian distribution of queries. For example, with a uniform distribution of peers' content, the average precision is equal to 51%. Thus, the share of unnecessary messages generated over the network drops to 49%. The F-measure in this case is equal to 58%, indicating a better efficiency than in the first two simulations. Note that a Zipfian distribution of peers' content provides the highest performance, with 88% recall, 56% precision and 68% F-measure. According to the results presented above, a Zipfian distribution of data objects and queries improves the performance of the weighted feature selection algorithm compared to a uniform

distribution. This can be explained by the fact that the data objects of peers following a Zipfian distribution fall into a few cells of the feature space. The data objects of each peer are highly concentrated in a few partitions of some features. Therefore, the weighted feature selection algorithm can determine the features most relevant to a peer's content more efficiently than under a uniform distribution. A uniform distribution maps peers' data objects randomly to cells. In this case, most of the features have almost the same number of data objects in their partitions and are considered irrelevant. Consequently, a larger amount of information is lost, resulting in low precision and recall.

Figure 4.13: Precise search using discarded features

The results of the proposed weighted feature selection algorithm are not encouraging, because almost 50% of the responses returned when searching for a data object are irrelevant. To provide a more precise search, we propose that each peer make use of the discarded features. When a peer selects the features most relevant to its data objects, it saves the descriptions related to all the discarded features. The relevant features are used for assigning peers to overlays, organizing peers and their data in a feature space, creating clusters and routing queries. By contrast, discarded features can be used when computing similarities between queries and data objects to provide a precise search. Consider a query $Q$ described by the set of features $F_Q$. When a peer receives the query $Q$, it computes the similarity between $Q$ and each of its data objects in two steps:

1. In the first step, the peer computes the similarity using its relevant features that are common with the query $Q$ and builds a list $S$ of the objects similar to the query $Q$.

2. In the second step, the peer computes the similarity between each object of the list $S$ and the query $Q$ using the discarded features. The peer considers the discarded features that describe the query $Q$. The goal is to remove false positives and to keep, as far as possible, only the relevant answers to the query $Q$.

We have run a set of experiments using the same configuration as before to evaluate the efficiency of the search when taking advantage of discarded features. Figure 4.13 measures the search performance at the end of each step. With a uniform distribution, the search using relevant features provides 32% precision. At the end of the second step, which selects the relevant answers using the discarded features, the precision increases to 91%, improving the search efficiency. We can notice that the Zipfian distribution of data objects provides the highest precision, 95%.

To summarize, the weighted feature selection algorithm determines the important features in describing peers' contents based on densities. Thus, it provides better results with a Zipfian data distribution than with a uniform distribution. The peers are organized into overlays, and clusters are created using their relevant features. When searching for data objects, peers can use their discarded features to provide a precise search. Relying only on relevant features to compute similarities might lead to a loss of information.

4.8 Conclusion

We have discussed in this chapter the problems related to the high dimensionality of the feature space that is used to organize data and peers into similar clusters. The goal of this chapter was to build a multi-overlay architecture in which each overlay is associated with a subset of features of limited size to avoid high dimensionality problems. Therefore, each peer reduces its dimensionality by selecting a subset of dominant features.
We have adopted a weighted feature selection based on a filter technique that uses a goodness criterion depending on cell densities and threshold values to eliminate insignificant features. The idea was to select features that describe the dense regions of the feature space. Once the dominant features are selected for each peer, we create overlays of features using a variation of the Jarvis-Patrick algorithm to define the features that belong to the same overlay. This algorithm is based on the neighborhood between features, which is defined according to the number of their shared peers. We concluded the chapter by presenting an architecture of overlays with a limited number of features, and some evaluation results showing that the weighted feature selection provides good results with a Zipfian distribution of peers' data objects.
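As a compact recap, the density-based weighting of section 4.5 can be sketched in Python as follows. This is an illustrative sketch rather than the code used in the simulations: all function names are hypothetical, partitions are assumed to have equal width, and each feature is assumed to satisfy $f^{max} > f^{min}$.

```python
from typing import Dict, List, Sequence


def partition_counts(values: Sequence[float], f_min: float, f_max: float,
                     m: int) -> List[int]:
    """D_ij: number of data objects in each of the m partitions."""
    counts = [0] * m
    width = (f_max - f_min) / m
    for v in values:
        # The value f_max is assigned to the last partition.
        j = min(int((v - f_min) / width), m - 1)
        counts[j] += 1
    return counts


def feature_density(counts: Sequence[int], f_min: float, f_max: float) -> float:
    """FeatureDensity(i) = 1 - FeatureRegion(i) / (max_j D_ij * range)."""
    region = sum(counts) * (f_max - f_min) / len(counts)
    peak = max(counts) * (f_max - f_min)
    return 1.0 - region / peak if peak > 0 else 0.0


def feature_weights(densities: Dict[str, float]) -> Dict[str, float]:
    """w_i = FeatureDensity(i) / sum_j FeatureDensity(j)."""
    total = sum(densities.values())
    if total == 0:
        return {f: 0.0 for f in densities}
    return {f: d / total for f, d in densities.items()}


def select_by_threshold(weights: Dict[str, float], t_w: float) -> List[str]:
    """First strategy: keep the features whose weight exceeds T_W."""
    return [f for f, w in weights.items() if w > t_w]


def select_top_k(weights: Dict[str, float], k: int) -> List[str]:
    """Second strategy: keep the k features with the highest weights."""
    return sorted(weights, key=weights.get, reverse=True)[:k]


# Toy histograms (assumption): one evenly spread and one concentrated feature.
densities = {
    "uniform": feature_density([5, 5, 5, 5], 0.0, 10.0),
    "skewed": feature_density([17, 1, 1, 1], 0.0, 10.0),
}
weights = feature_weights(densities)
print(densities, weights, select_by_threshold(weights, 0.5))
```

The evenly spread feature obtains a density of 0 and is discarded, while the concentrated feature obtains a density of 12/17 and is selected, which matches the behavior described for uniform and Zipfian distributions.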

4.9 Bibliographic Remarks

The high dimensionality problem described by the curse of dimensionality has been the focus of many research studies. Jimenez et al. [51] and Hinneburg et al. [46] have studied the behavior of data in a high dimensional space. One of the most important consequences is the Nearest Neighbor Problem (NN), which has been analyzed in [18, 62, 14, 115]. Carreira-Perpiñán et al. [19] propose a classification of high dimensionality problems and discuss several solutions. Several dimensionality reduction techniques called Feature Construction have been proposed, such as Factor Analysis [31], Discriminant Analysis [10], Multidimensional Scaling [67], Projection Pursuit [51] and Self-Organizing Maps [11]. These techniques are also described in several papers that give a general overview [19, 90, 115, 36]. A second alternative for reducing dimensionality is Feature Selection. It has been the focus of interest of many fields and applications such as data mining [29, 30, 63], machine learning [16, 52, 64, 65], pattern recognition [48, 81, 105, 120], text categorization [31, 89, 121] and image retrieval [122, 111]. Some efforts have been made to distribute the dimensionality reduction [76, 3]. These techniques are based on sampling.

Clustering high dimensional data is a challenging issue. The most commonly used partitional algorithms are k-means [49, 60], PAM (Partitioning Around Medoids) [60], CLARA (Clustering LARge Applications) [60] and CLARANS (Clustering LARge ApplicatioNS) [88]. Since clustering based on the distance between objects suffers from several problems when the dimensionality increases, several clustering approaches have been proposed to solve this problem, such as CLIQUE [6], MAFIA [82], DENCLUE [45], OptiGrid [46], WaveCluster [103] and STING [119]. It has been shown in [46, 109] that these solutions suffer from other problems when the dimensionality increases.


Chapter 5
Similarity Search and Query Caching

Contents
5.1 Introduction
Query Model
Range Queries
Nearest Neighbor Queries
Query Processing
Routing Indexes
Query Routing
Similarity Search
Caching in HON
Overview
Web Caching
P2P Caching
Caching Schema
Cache Definition
Cache Management
Cache Admission Policy
Cache Replacement Policy
Conclusion
Bibliographic Remarks

Query routing is a challenging issue in distributed and dynamic environments. Effective query routing not only minimizes the query response time and the overall processing cost, but also eliminates unnecessary communication overhead over the global network. In this chapter, we define query models based on similarity and present the query routing protocol in HON. Moreover, we investigate caching mechanisms to improve the search performance. Section 1 introduces the main assumptions of HON search. Section 2 presents query models consisting of range and nearest neighbor queries. Section 3 presents the structure of the index tables used by peers for routing messages, and the routing protocol in HON. Section 4 introduces the caching concept and describes the design of our caching schema. Section 5 presents the cache management in HON. Finally, section 6 concludes this chapter.

5.1 Introduction

We have now seen how to (1) build overlays on top of the underlying architecture of a P2P network using a selection of data content features and (2) organize an overlay into clusters using cell densities. The problem now is how to process queries efficiently in the resulting cluster-based overlay architecture (1) by limiting the scope of flooding within clusters and (2) by using a caching mechanism to minimize the search cost. In this chapter we focus on the definition of query routing strategies and the design of distributed caching techniques in HON. The cluster-based architecture we defined in chapter 3 is a hierarchical hybrid P2P network composed of super peers associated with clusters and simple peers mapped to cells. In reality, the peers contained in a cell of a feature space are also organized in two levels, with an active peer that manages other peers called passive peers. Active and passive peers are described in detail in section 5.4.
We present a similarity search query model combining the characteristics of both range and nearest neighbor query methods. It is based on a routing protocol that takes advantage of the hierarchy of peers. Query routing is processed in two steps: first the query is sent by direct routing to the super peers of the relevant clusters, and then it is flooded within the selected clusters. The second issue addressed in this chapter is distributed caching. Caching is used in traditional databases to reduce disk accesses and to speed up data retrieval and transfer. More recently, caching has been used on the web to reduce latency and network traffic. In HON, caching is used at the cell level to limit the flooding among the peers of the cell. Caching in P2P systems differs in some ways from traditional web caching. The main difference is due to the fact that in traditional approaches the

cache data are stored on well-identified static servers, whereas in P2P systems the data can be stored on one or more peers that frequently connect to and disconnect from the network.

5.2 Query Model

We consider two forms of Similarity Query: Range Queries and Nearest Neighbors Queries. Range queries specify range values for each feature and retrieve their answers from a region of adjacent cells in the feature space. The second form is the nearest neighbors query, which is defined using a Similarity Threshold $ST$. The query is described by a single query point in the feature space and a distance measure used to define the similar objects that lie within a limit $ST$ of the query point. Both forms of similarity queries are based on the cell-similarity that we have defined previously. Recall that two objects belonging to the same cell are cell-similar.

Range Queries

A range query $Q$ is described by a set of range values $\{v_{Q1}, v_{Q2}, \dots, v_{Qn}\}$, where $v_{Qi}$ is a range value $[Min_{Qi}, Max_{Qi}]$ associated with the $i$-th feature. To answer a range query $Q$, we have to define the set of its target cells. Recall that a target cell is defined as a cell containing relevant answers of a query. To compute the target cells of a range query we use the following selection process. Let $D_\varphi = \{v_{\varphi 1}, v_{\varphi 2}, \dots, v_{\varphi n}\}$ be the description of the cell $\varphi$, where $v_{\varphi i} = [Min_{\varphi i}, Max_{\varphi i}]$ is the range value of the $i$-th feature for $\varphi$. We select a cell $\varphi$ as a target cell of the query $Q$ if $\forall i \in \{1, \dots, n\}$, $v_{\varphi i} \cap v_{Qi} \neq \emptyset$, where $n$ is the number of features. Figure 5.1 gives an example of a two-dimensional range query and the different relations it can have with a cell. Several cases can be distinguished. In Case 1, cell $\varphi$ is not a target cell since it does not intersect the query. In Cases 2 and 4, cell $\varphi$ intersects a part of the query, thus $\varphi$ is one of the target cells of the query. Other target cells must be defined to cover all the query range values.
Finally, in Case 3, cell $\varphi$ is the unique target cell since it completely covers the query. Note that the target cells of a range query are determined by the peer that initiates the query. Recall that each peer holds information about the feature spaces of its relevant overlays and is in charge of mapping its data and queries onto cells. Figure 5.2a shows the target cells of a range query in a two-dimensional feature space. The shaded area represents the two-dimensional range of the query, and the corresponding target cells are 5, 6, 7, 8, 9, 10, 11, 12. Note

that the query intersects a small part of cells 5 and 9. However, we do not have any information about the organization of peers inside a cell. That means that we cannot identify which peer belongs to which fraction of the cell. Therefore, for practical reasons, all the peers contained in both cells 5 and 9 are contacted to retrieve relevant answers. Most of those peers might not contain data objects that belong to the query range. Consequently, the efficiency of the search in this case depends mainly on the size of the cell. The smaller the cell, the larger the share of its peers that are relevant to the query $Q$.

Figure 5.1: Possible relations between cell and range query

Nearest Neighbor Queries

A nearest neighbors query $Q$ is described by a Query Point $Q_p$ and a Similarity Threshold $ST$. It consists in finding the objects closest to $Q_p$ within the similarity threshold $ST$. The query point $Q_p$ is described by the feature values $\{v_{p1}, \dots, v_{pn}\}$. It corresponds to one point in the feature space, as shown in figure 5.2b. In this example, the data objects closest to $Q_p$ reside in cell 7, which contains the query point, and also in its neighbor cells 6, 10 and 11. To define the target cells of a nearest neighbor query, we have to find all the cells that contain at least one object that is close to $Q_p$ within the similarity threshold $ST$. Let $\varphi$ be a cell described by its content $C_\varphi = \{O_1, O_2, \dots, O_k\}$, where $O_i$ is a data object. A cell $\varphi$ is a target cell of the nearest neighbor query $Q$ if $\exists\, O_i \in C_\varphi$ such that $Distance(O_i, Q_p) < ST$. Note that $Distance(O, Q_p)$ can be defined by a Euclidean Distance (ED) between the query point $Q_p$ and the object $O$, defined by:

Definition 15 (Query-Object Distance) Let $D_O = \{v_{o1}, v_{o2}, \dots, v_{on}\}$ and $D_{Q_p} = \{v_{p1}, v_{p2}, \dots, v_{pn}\}$ be the descriptions of the data object

108 Section 5.3. Query Processing 95 (a) Range query (b) Nearest neighbor query Figure 5.2: Query Forms O and the query point Q p respectively. The Euclidian distance between the data object O and the query point Q p is given by: ED(O, Q p ) = n (v oi v pi ) 2. i=1 Now we discuss how the target cells of a nearest neighbors queries are computed. Given a nearest neighbors query (Q p, ST ), the idea is to create a range query RQ around the query point Q p and within the similarity threshold ST. The range query RQ is represented by {r q1, r q2,...r qn }, where r qi is a range value associated to the i th feature. Each range value r qi is described by r qi = [v pi ST, v pi + ST ], where v pi is the query point value for the i th feature. The peer selects the target cells using the range query RQ as described in section Figure 5.3 shows dierent examples of relationship scenarios between a nearest neighbor query Q and a set of cells which aect the choice of target cells. We call ϕ p the cell that contains the query point Q p. In the rst example, the range query RQ is included in the cell ϕ p, thus only one target cell is dened, while in example the range query RQ intersects neighbor cells to ϕ p. 5.3 Query Processing This section presents the query processing in HON. We rst describe the routing index tables that are installed in each peer. Then we describe how the query in routed from the requesting peer to the nal destination.
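Before turning to the routing indexes, the target-cell computation for nearest neighbor queries can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the feature space is partitioned into a uniform grid of cells of width `cell_width` starting at the origin; the function names and the grid layout are hypothetical and are not taken from the HON prototype.

```python
import math
from itertools import product


def euclidean_distance(obj, query_point):
    """ED(O, Q_p): square root of the summed squared feature differences."""
    return math.sqrt(sum((vo - vp) ** 2 for vo, vp in zip(obj, query_point)))


def nn_to_range(query_point, st):
    """Build the range query RQ: one interval [v_pi - ST, v_pi + ST] per feature."""
    return [(v - st, v + st) for v in query_point]


def target_cells(range_query, cell_width):
    """Return the indexes of all grid cells intersected by the range query.

    Assumption: cells form a uniform grid, so the cell index along one
    dimension is floor(value / cell_width).
    """
    per_dimension = [
        range(math.floor(lo / cell_width), math.floor(hi / cell_width) + 1)
        for lo, hi in range_query
    ]
    return set(product(*per_dimension))


if __name__ == "__main__":
    rq = nn_to_range((2.5, 2.5), 0.75)
    print(sorted(target_cells(rq, 1.0)))
```

With a small threshold the range query stays inside the cell of the query point (a single target cell); with a larger threshold it spills into the neighbor cells, which matches the two situations described for figure 5.3.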

Figure 5.3: Possible relations between cell and nearest neighbors query. (a) Example 1 (b) Example 2

Routing Indexes

Three types of routing indexes are defined in HON: Super-Peer Index, Peer-Index and Cluster-Index. These indexes are described in the following:

Super-Peer Index: Each peer in HON maintains a Super-Peer Index that contains information about its clusters and super peers. The Super-Peer Index is a two-column table. The first column describes the clusters to which the peer belongs, while the second column contains the IP addresses of the relevant super peers. The Super-Peer Index is used to route queries from peers to super peers. Figure 5.4 shows an example of a Super-Peer Index maintained by peer $P_8$. This peer belongs to two clusters $C_1$ and $C_2$ and stores their descriptions. The description of a cluster is the set of value ranges that the features take inside the cluster. For example, cluster $C_1$ is described by $C_1 = \{f_1 [5, 10],\ f_2 [10, 20]\}$, meaning that in cluster $C_1$ feature $f_1$ ranges from 5 to 10 and feature $f_2$ ranges from 10 to 20. For clusters $C_1$ and $C_2$, peer $P_8$ stores the IP addresses of their relevant super peers $P_1$ and $P_2$ respectively.

Peer-Index: Each super peer maintains a Peer-Index containing the list of peers that belong to its cluster. A Peer-Index is a two-column table. The first column contains the descriptions of the cells composing the cluster. The second column contains the IP addresses of the peers of each cell. The Peer-Index is used by super peers to route queries inside clusters. Figure 5.4 illustrates an example of a Peer-Index of super peer $P_2$ managing cluster $C_1$. The Peer-Index stores the descriptions of cells $\varphi_4$ and $\varphi_8$ contained in cluster $C_1$ and the list of their peers. When a super peer manages more than one cluster, it maintains one Peer-Index per cluster.

Figure 5.4: Routing Indexes

Cluster-Index: A Cluster-Index is maintained by each super peer. It contains the descriptions of the clusters of the same overlay. This index helps to route queries between clusters. The Cluster-Index is composed of two columns. The first one contains the descriptions of the clusters and the second one contains the IP addresses of the relevant super peers. Figure 5.4 shows an example of a Cluster-Index maintained by super peer $P_2$. It contains the descriptions of clusters $C_1$ and $C_3$ and the IP addresses of their relevant super peers $P_1$ and $P_3$ respectively.

Query Routing

Query routing in HON is based on the routing indexes presented previously. Let Q be a query and $TC = \{\varphi_1, \ldots, \varphi_k\}$ be the set of its target cells. We call the peers belonging to target cells Target Peers and the clusters holding target cells Target Clusters. When a peer P issues the query Q, it computes the set of target cells TC. The computation of target cells depends on the query type and is presented in section 5.2. Depending on the target cells, the query can be of one of three different types: Intra-Cluster Query, Inter-Cluster Query or Overlay Query. Below we describe these three types of query.

Intra-Cluster Query: The query Q is an Intra-Cluster Query when its target cells belong to one or more clusters of the peer P. In this case, the query can be satisfied in local clusters. The peer P extracts from its Super-Peer Index the IP addresses of the super peers that are responsible for the target cells. Then, it sends the query Q to the relevant super peers, which forward it to the target peers using their Peer-Index tables.

Inter-Cluster Query: The query Q is an Inter-Cluster Query when its target cells do not belong to local clusters. In this case, the peer P sends the query to one of its super peers. When the super peer receives the query Q, it uses its Cluster-Index to route the query to the super peers of the target clusters. Those super peers then forward the query Q to the target peers using their Peer-Index tables.

Overlay Query: The query Q is an Overlay Query when only a part of its target cells belongs to local clusters. In this case, the query Q is a combination of two sub-queries, where the first one is an Intra-Cluster Query and the second one is an Inter-Cluster Query. Each sub-query is processed separately as presented above. The result of the query Q is the combination of the result sets obtained from both sub-queries.

Similarity Search

Nearest neighbors queries are used to provide an efficient similarity search in HON. The search takes different forms depending on whether the target cell of a query object is found or not. Let a query object Q be described by the coordinates $[v_{Q1}, v_{Q2}, \ldots, v_{Qn}]$ in the feature space and represented by the query point $Q_P$. When a peer initiates the query Q, it first computes the target cell $\varphi_P$ containing the query point $Q_P$. Two cases can be distinguished:

Case1: If $\varphi_P$ is found, then the query is processed by flooding the target peers. When a target peer receives the query Q, it computes the distance between the objects it contains and the query Q.
Two cases can be distinguished:

1. Case1.1: If the distance between an object and the query Q is equal to 0, then the object matches the query Q and is returned to the requesting peer.
2. Case1.2: If no object satisfies the query, a similarity threshold ST given by the requesting peer is used to return objects in the cell $\varphi_P$ that are similar to the query Q. If the distance between an object and the query Q is lower than ST, the object is returned to the requesting peer. The similarity threshold ST can be increased if no objects relevant to the query Q are found in the cell $\varphi_P$. This process is repeated until the objects most similar to the query Q are found in the cell $\varphi_P$.

Case2: If $\varphi_P$ is not found, i.e., $\varphi_P$ is empty, the requesting peer tries to find the regions of the feature space closest to $\varphi_P$ by generating a Similarity search Query SQ. The query SQ is a nearest neighbors query defined by the query point $Q_P$ of the query Q and a similarity threshold ST. The similarity threshold ST is initialized in such a way that the query SQ includes the adjacent cells of $\varphi_P$. If the query SQ does not intersect any cluster, meaning that all its target cells are empty, then it is extended with a larger similarity threshold value and propagated to the next adjacent cells. The same process is repeated until the query SQ intersects at least one cluster in the feature space. Once the target cells are defined, the query SQ is processed in the same manner as described in Case1.

Figure 5.5: Example of similarity search

Figure 5.5 illustrates an example of a query Q mapped to an empty cell. A similarity search query SQ is then generated and propagated to the adjacent cells. The query SQ in this example does not intersect any cluster because all the adjacent cells are empty. Therefore, the query SQ is extended with larger range values and propagated to the next adjacent cells. The second propagation of the query SQ intersects Cluster2 and Cluster3. Only the cells of Cluster2 and Cluster3 covered by the query SQ are considered as target cells. These cells are then queried to return the objects most similar to the initial query Q.
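The two cases of the similarity search can be sketched in Python as follows. This is a simplified, centralized illustration rather than the distributed protocol: it assumes a uniform grid of cells of width `w`, represents the feature space as a dictionary from cell index to the objects stored there, and models the extension of SQ to "the next adjacent cells" as growing rings of cells around the query cell. All names (`similarity_search`, `cells_within`, `max_rings`) are hypothetical.

```python
import math
from itertools import product


def cell_of(point, w):
    """Index of the grid cell containing `point` (uniform grid assumed)."""
    return tuple(math.floor(v / w) for v in point)


def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cells_within(center, k):
    """All cell indexes at most k cells away from `center` in each dimension."""
    return product(*(range(c - k, c + k + 1) for c in center))


def similarity_search(query_point, cells, w, st, st_step, max_rings=10):
    if st_step <= 0:
        raise ValueError("st_step must be positive")
    home = cell_of(query_point, w)
    # Case2: if the query cell is empty, widen SQ ring by ring until it
    # covers at least one non-empty cell (or give up after max_rings).
    targets = [home] if cells.get(home) else []
    k = 0
    while not targets and k < max_rings:
        k += 1
        targets = [c for c in cells_within(home, k) if cells.get(c)]
    if not targets:
        return []
    candidates = [obj for c in targets for obj in cells[c]]
    # Case1.1: exact matches (distance 0) are returned directly.
    exact = [obj for obj in candidates if distance(obj, query_point) == 0]
    if exact:
        return exact
    # Case1.2: increase ST until at least one similar object is found.
    while True:
        hits = [obj for obj in candidates if distance(obj, query_point) < st]
        if hits:
            return sorted(hits, key=lambda obj: distance(obj, query_point))
        st += st_step


if __name__ == "__main__":
    space = {(0, 0): [(0.5, 0.5), (0.2, 0.9)], (3, 3): [(3.5, 3.5)]}
    print(similarity_search((1.5, 1.5), space, 1.0, 0.5, 0.5))
```

In the usage example the query point falls in the empty cell (1, 1), so the search is extended to the adjacent cells, finds cell (0, 0), and then raises the threshold until the objects of that cell are returned in order of increasing distance.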

5.4 Caching in HON

Flooding within target cells might penalize the search performance if a cell is large and contains a high number of peers. In that case, a large portion of irrelevant peers is involved in the query processing, which increases the search cost. To limit the flooding overhead inside cells, we propose a caching mechanism that stores query results for future use. In this section, we first give a brief overview of caching techniques. Then, we present the caching model that we introduce in HON to improve the search efficiency. The presentation of the cache is divided into two parts. First, we formalize the cache service and define the admission policies. Second, we present the different replacement techniques and discuss their impact on the query recall.

Overview

A cache is a mechanism used to speed up data retrieval and transfer. It consists in storing copies of original data that are expensive in terms of access time. Once the data is stored in the cache, future requests can be served by accessing the cached copy rather than re-computing or re-fetching the original data, so that the average access time is lower. Caching was first used by the central processing unit (CPU) of a computer to reduce the average time to access memory or disk. Two types of cache are used by the CPU: Memory Cache and Disk Cache. A memory cache, or CPU cache, is a memory bank that bridges the main memory and the CPU. It is a smaller and faster memory which stores copies of the data from the most frequently used main memory locations, as shown in figure 5.6a. By contrast, a disk cache is a section of memory on the disk controller board used between the disk and the CPU. A larger block of data than what is immediately required is copied from the disk into the cache, as shown in figure 5.6b. The data are then retrieved from the cache rather than from the disk, which is slower to access.
A cache needs to be maintained and managed by defining policies of Admission and Replacement. The admission policies define whether a query can or cannot be served from the cache, while the replacement policies monitor the use of the cache storage space by removing expired data. We start by defining the two main admission policies:

1. Cache Hit: occurs when the relevant answer to a given query is stored in the cache. As shown in figure 5.7a, the answer is directly returned from the cache to the requester.
2. Cache Miss: occurs when the query does not find any answer stored in the cache.

As shown in figure 5.7b, the query is sent to the original source of data. Then the answer returned from the original source is stored in the cache for future use.

Figure 5.6: Memory and Disk Cache

The cache size is usually limited and needs to be managed by Replacement Policies. When the storage space of a cache is exhausted, some stored data are removed according to specific rules that vary from one replacement policy to another. Replacement algorithms were a hot topic of research and debate in the 1960s and 1970s. They were required by and applied to operating systems. Later on, these algorithms were adapted to many other areas such as Web applications. The most efficient caching algorithm discards the information that will not be needed for the longest time in the future. Since it is impossible to predict how far in the future information will be needed, simulations and experiments on the system behavior and query traces are required to define an efficient replacement policy. In the following, we give a brief description of some well-known replacement techniques. For all the further algorithm definitions, we call each data unit stored in the cache an Item. The type of an item depends on the application area. For example, it might be a memory page in operating systems or a web page in Web applications. The content of the cache is therefore seen as a set of Items.

1. First-In-First-Out (FIFO): the idea is to keep track of all the items in a queue, with the most recent arrival at the back and the earliest arrival at the front. When an item needs to be replaced, the oldest item, located at the front of the queue, is selected. FIFO is a cheap and intuitive algorithm. However, it is rarely used because it performs poorly in practical applications.
2. Least Recently Used (LRU): this algorithm keeps track of item usage over a short period of time. Then, it replaces the item that has not been used for the longest period of time. This replacement policy assumes that, in general, an item which has not been accessed for the longest time is the least likely to be accessed in the near future. Therefore, it is discarded and replaced by a new item having a higher probability of being referenced.
3. Not Frequently Used (NFU): this algorithm replaces the least popular items in the cache, meaning the items that are not frequently used. A counter is set for each item to keep track of how frequently it has been used.

Figure 5.7: Admission Policies

The efficiency of caching in reducing access and execution times led researchers to introduce caches in many other areas, such as Web applications and P2P systems, which have become a significant source of various types of information. Each type of cache application defines its own admission and replacement policies depending on the system characteristics and user needs. Here we give an overview of Web Caching and P2P Caching.

Web Caching

Many efforts have been directed towards extending the original caching approach to web caching methods [98, 8, 21, 35, 94, 96, 2] in order to reduce latency and network traffic. A Web Cache or Proxy Cache is installed between one or more Web servers (also


known as origin servers) and one or more clients, as shown in figure 5.8. It stores copies of the Web pages retrieved by the user for some period of time for future use. When a query is issued, it is first sent to the cache. If relevant answers are stored in the cache, they are directly returned to the user. Otherwise, the query is sent to the origin server.

Figure 5.8: Web Caching

Setting a single cache between clients and servers might lead to some problems, such as creating a single point of failure and increasing the load on the cache. Thus, a set of caches can be installed on several machines (called proxies) that collaborate and act as a single cache. In this way, the load is distributed among several proxies and a cache failure can be tolerated by using any other cache in the system. The proxies holding caches are interconnected to form a collaborative architecture and facilitate their interaction. Two main categories of web caching architectures can be distinguished: hierarchical web caching and distributed web caching.

In a hierarchical web caching architecture [21], the caches are organized in several levels. The bottom level contains client caches and the intermediate levels are devoted to proxies and their associated caches. When a query is not satisfied by the local cache, it is redirected to the upper level until there is a hit at a cache. If the requested document is not found in any cache, the query is submitted directly to the origin server. The returned document is sent down the cache hierarchy to the initial client cache, and a copy is left on all the intermediate caches to which the initial user request was submitted.

Several distributed caching approaches have been proposed to address one or more problems associated with hierarchical caching. In [96], the authors propose an extension of hierarchical caching where documents are stored on leaf caches only.
The upper level caches are used to index the contents of the lower level caches. When a query cannot be satisfied by the local cache, it is sent to the parent cache, which indicates the location of the required documents. In [35], Li Fan et al. propose a scalable distributed cache approach, called Summary Cache, in which each proxy stores a summary of its directory of cached documents on every other proxy. When a requested document is not found in the local cache, the proxy checks the summaries in order to determine the relevant proxies to which it sends the request to fetch the required documents.

P2P Caching

Techniques similar to web caching have been suggested to improve the performance of P2P systems by reducing user delay and network traffic. However, P2P caching differs from web caching in some respects. The main difference is that in traditional web caching the stored data is kept on well identified static web servers, whereas in P2P systems query results are combinations of partial results issued from one or more peers, which can frequently connect to and disconnect from the network. Thus, caching in P2P systems is a challenging task that has to take into account the following requirements.

1. The available answers to a given query in the network change dynamically because peers might join the network with new data at any time. In this case, relying on a cache that stores data for a long period of time leads to a partial retrieval of the existing answers. As a result, a P2P cache has to be continuously controlled and refreshed to avoid decreasing the system performance. Consequently, the replacement policy has to be based on removing the oldest data.
2. When a peer leaves the network, its indexes stored in the cache become invalid since the peer is no longer reachable. Therefore, state information indicating whether a peer is online or offline has to be kept in the cache to verify the validity of the stored information.
3. The placement of the cache in a P2P network has to be chosen carefully since peers are autonomous and might refuse to hold a cache that serves other peers.
In addition, there are no static servers in a P2P environment, which makes the choice of the cache location a challenging issue. The cache should not be placed in a central location, so as not to penalize the distributed nature of P2P systems. Besides, installing a cache in each peer introduces redundancy and consumes a large amount of storage space.

Cache-based P2P systems proposed in the literature can be classified into three approaches. In the first approach, a cache is placed in each peer of a Gnutella network, as shown in figure 5.9. Each peer caches the query strings and the results that pass through it. Sripanidkulchai et al. have studied in [107] the popularity of queries on Gnutella, which follows a Zipfian distribution. Markatos et al. in [79] have also studied the traffic of Gnutella and shown that queries tend to be frequently and repeatedly submitted. Moreover, Gnutella peers join and leave the network very frequently. Thus, the query responses may become out of date very quickly. To address this problem, the Gnutella caching mechanism caches query responses for only a small amount of time. The effectiveness of Gnutella caching has been studied in both [107] and [79].

Figure 5.9: Distributed P2P Caching

Another approach is to use a centralized server to cache data. In [93], Patro and Hu have shown that caching at the gateway of an organization can be far more effective than caching at the individual peers behind the gateway, because the gateway can keep the query traces of all the peers. In this manner, when a query Q is issued, it is more likely that similar queries are already stored in the cache.

In the third approach, the cache is distributed among selected peers. Wang et al. have analyzed in [116] the Uniform Index Caching (UIC), which stores query results in the peers along the returning path. The experiments show that UIC causes a large amount of duplicated and unnecessary cached results among neighboring peers. Therefore, Wang et al. [116] have attempted to cache the responses in some selected peers only. The distributed caching proposed in [116] is based on a hashing method. Peers are organized into M groups, where each group has a group ID. When a query is issued, its hash code is calculated as follows: $query\_ID = hash(query) \bmod M$. The query and the corresponding results are sent to the peers whose group ID is equal to the query ID.
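The group-based placement rule can be illustrated with a short Python sketch. The text only states that the query ID is `hash(query) mod M`; the choice of SHA-256 as the hash function, the string encoding of the query and the function names below are assumptions made for this example, not details of the scheme in [116].

```python
import hashlib


def query_group(query: str, m: int) -> int:
    """Map a query string to a group ID in [0, m) via hash(query) mod M.

    SHA-256 is an illustrative choice: it is deterministic across runs,
    unlike Python's built-in hash() for strings.
    """
    digest = hashlib.sha256(query.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % m


def caching_peers(query: str, peers_by_group):
    """Return the peers whose group ID equals the query ID.

    peers_by_group[g] is the list of peers in group g, so M is the
    number of groups.
    """
    return peers_by_group[query_group(query, len(peers_by_group))]


if __name__ == "__main__":
    groups = [["p0", "p4"], ["p1", "p5"], ["p2"], ["p3"]]
    print(caching_peers("blue sky image", groups))
```

Because the mapping is deterministic, repeated submissions of the same query string always reach the same group, which is what allows the cached results to be found again.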

Figure 5.10: Caching in HON

Caching Schema

To improve the search performance by reducing the flooding overhead inside cells, we apply a caching mechanism in HON. We place a cache in each non-empty cell of the feature space. The cache therefore keeps track of all the queries sent to the same cell. Queries sent to the same cell are similar, and storing them in the same cache might increase caching efficiency. Unlike traditional caches, a cache in HON does not store the retrieved data objects. Instead, a cache stores the addresses of the peers that answered each query. In this manner, further queries generating a cache hit are not flooded to all the peers in the cell but only to those that contain the relevant answers. Figure 5.10 shows an example of a cache Miss that leads the query to be flooded to all the peers of the cell. In addition, it shows an example of a cache Hit where the flooding overhead is limited to a subgroup of peers.

A cache is assigned to one peer in the cell, called the active peer (AP). Peers with no associated cache are called passive peers (PP). Each passive peer is registered with the active peer of its cell, as shown in the cache placement figure. A peer can be passive in one cell and active in another. Moreover, a peer can belong to several cells and may be chosen as the active peer for more than one cell. When a given query is sent to the target cell, the relevant active peer checks its cache. If there is a cache Hit, the query is served by the peers whose addresses are stored in the cache. Otherwise, if there is a cache Miss, the query is sent to all the peers contained in the target cell.

Placing the cache in each cell helps to collect all the similar queries in the same cache. Two queries that are mapped to the same cell are considered similar. Therefore, they share common target peers that are supposed to hold the relevant answers.
In this way, similar queries are always sent to and stored in the same cache, which helps to increase the success hit and to avoid redundancy.
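The behavior of an active peer can be sketched as follows. This is a minimal Python illustration under simplifying assumptions: queries are matched only when their keys are identical (similarity-based matching is not modeled), peers are plain string identifiers, and the class and method names (`ActivePeer`, `route`, `record`) are hypothetical.

```python
class ActivePeer:
    """Active peer of one cell: caches which peers answered each query."""

    def __init__(self, cell_peers):
        self.cell_peers = set(cell_peers)  # all peers registered in the cell
        self.cache = {}                    # query key -> set of answering peers

    def route(self, query_key):
        """Return ("hit", peers) or ("miss", all peers of the cell)."""
        cached = self.cache.get(query_key)
        if cached:
            # Cache Hit: contact only the peers known to hold answers.
            return "hit", set(cached)
        # Cache Miss: the query is flooded to every peer of the cell.
        return "miss", set(self.cell_peers)

    def record(self, query_key, answering_peers):
        """Store peer addresses (not data objects) after a query completes."""
        if answering_peers:
            self.cache[query_key] = set(answering_peers)


if __name__ == "__main__":
    ap = ActivePeer(["p1", "p2", "p3"])
    print(ap.route("q"))   # miss: all three peers are contacted
    ap.record("q", ["p2"])
    print(ap.route("q"))   # hit: only p2 is contacted
```

The sketch stores only peer addresses, as in the schema above, so a hit narrows the set of contacted peers but the answers themselves are still fetched from those peers.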

Figure 5.11: Cache Placement

Cache Definition

A cache consists of a set of cached items, called Query Segments. They are used to record both queries and the relevant peers used to compute the results of those queries.

Definition 16 (Query Segment) Let $Q = \{v_{Q1}, \ldots, v_{Qn}\}$ be a query defined over the set of features $\{f_1, \ldots, f_n\}$. A Query Segment S is a tuple $S = \langle S_Q, S_P \rangle$, where $S_Q$ is a submitted query and $S_P = \{P_i\}_{i=1,\ldots,m}$ is the set of peers that provide results for $S_Q$. An empty query segment is defined by $S = \langle \emptyset, \emptyset \rangle$.

A query segment can be valid or invalid depending on whether the corresponding peers are connected or not. The validity of a query segment is defined as follows:

Definition 17 (Query Segment Validity) A query segment $S = \langle S_Q, S_P \rangle$ is called:
(1) valid if $\exists P \in S_P$ such that P is connected;
(2) invalid if $\forall P \in S_P$, P is not connected.

Several relationships can be defined between two query segments for query processing and cache maintenance purposes. They are defined as follows:

Figure 5.12: Cache Content

Definition 18 (Query segment relationship) Two query segments $S_1 = \langle S_{1Q}, S_{1P} \rangle$ and $S_2 = \langle S_{2Q}, S_{2P} \rangle$ are:
(1) Similar, if $S_{1Q}$ is similar to $S_{2Q}$.
(2) Equivalent, if $S_{1Q}$ is similar to $S_{2Q}$ and $S_{1P} = S_{2P}$.
(3) Related, if $S_{1P} \subseteq S_{2P}$.
(4) Disjoint, if $S_{1P} \cap S_{2P} = \emptyset$.

A cache is then defined as a set of query segments. The query segments of two distinct components of the cache are not equivalent. When the system is first brought online, the cache consists of empty query segments. Formally, a HON cache is defined as:

Definition 19 (Cache) A Cache C is defined as a set of tuples $\{\langle S_i, S_{ACi}, S_{TSi} \rangle\}$, where $S_i$ is a query segment, $S_{ACi}$ is a counter used to record the number of times that the query segment $S_i$ has been accessed, and $S_{TSi}$ is a timestamp used to indicate when $S_i$ was last accessed. Two query segments $S_i$ and $S_j$ recorded in two distinct entries of a cache cannot be equivalent.

5.5 Cache Management

The granularity of cache management is the query segment: a query segment is replaced and invalidated as a whole. In this section, we first discuss the query admission policy and then we present cache consistency and replacement, which are the two key components of cache management.
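Definitions 16 to 19 can be mirrored by a small set of Python data structures. This is only a sketch under stated assumptions: query similarity is taken to be a Euclidean distance below a fixed threshold, the "Related" relationship is read as set inclusion of the peer sets, and all class and function names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class QuerySegment:
    """Definition 16: a submitted query and the peers that answered it."""
    query: tuple
    peers: frozenset


def is_valid(segment, online_peers):
    """Definition 17: valid if at least one peer of the segment is connected."""
    return any(p in online_peers for p in segment.peers)


def similar(q1, q2, threshold):
    """Assumed similarity test: Euclidean distance within a threshold."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q1, q2))) <= threshold


def relationships(s1, s2, threshold):
    """Definition 18: return the set of relationships holding between s1 and s2."""
    labels = set()
    sim = similar(s1.query, s2.query, threshold)
    if sim:
        labels.add("similar")
    if sim and s1.peers == s2.peers:
        labels.add("equivalent")
    if s1.peers <= s2.peers:          # inclusion reading (assumption)
        labels.add("related")
    if not (s1.peers & s2.peers):
        labels.add("disjoint")
    return labels


@dataclass
class CacheEntry:
    """Definition 19: segment plus access counter and last-access timestamp."""
    segment: QuerySegment
    access_count: int = 0
    last_access: float = 0.0


class Cache:
    def __init__(self, threshold):
        self.threshold = threshold
        self.entries = []

    def add(self, segment, now):
        """Store the segment unless an equivalent one is already cached."""
        for entry in self.entries:
            if "equivalent" in relationships(entry.segment, segment, self.threshold):
                return False
        self.entries.append(CacheEntry(segment, 0, now))
        return True
```

The `add` method enforces the constraint that two distinct entries of the cache are never equivalent; the counter and timestamp fields are kept per entry so that a replacement policy can later use them.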

Cache Admission Policy

When the cache receives a query Q, it computes the similarity between Q and the queries in the query segments of the cache using a predefined similarity threshold. The result is the query segment most similar to the query Q, from which a set of peers {P} capable of answering the query Q is extracted. The requesting peer contacts all the peers in {P} for results and selects the set of peers $S_P$ that satisfy the query Q. The segment $\langle S_Q, S_P \rangle$ is then used to update the cache as follows:

1. If $\exists \langle S'_Q, S'_P \rangle \in C$ such that $\langle S'_Q, S'_P \rangle$ is equivalent to $\langle S_Q, S_P \rangle$, then the segment $\langle S_Q, S_P \rangle$ is not stored in the cache.
2. If $\forall \langle S'_Q, S'_P \rangle \in C$, $\langle S'_Q, S'_P \rangle$ is not equivalent to $\langle S_Q, S_P \rangle$, then the segment $\langle S_Q, S_P \rangle$ is stored in the cache.

Cache management in HON is based on two key points: first, on similarity, which defines whether a query can be served from the cache or not; second, on the state of the peers, which can be online or offline. In the following, we give the definitions of cache hit and cache miss.

Definition 20 (Cache hit) There is a cache hit if at least one query segment in the cache has a query that is similar to Q and if at least one of the returned peers is connected. Formally:
$\exists \langle S_Q, S_P \rangle \in C \mid S_Q$ is similar to Q AND $\exists P \in S_P \mid P$ is connected.

The similarity threshold for a cache hit is predefined. This threshold must preserve the precision of the answers so as not to degrade the search performance.

Definition 21 (Cache miss) There is a cache miss if none of the query segments in the cache contains a query similar to Q, or if there are query segments in the cache with queries similar to Q but the corresponding peers are not connected. Formally:
$\forall \langle S_Q, S_P \rangle \in C$, $S_Q$ is not similar to Q, OR $\forall \langle S_Q, S_P \rangle \in C \mid S_Q$ is similar to Q, $\forall P \in S_P$, P is not connected.

We focus in our work on studying the success hit and the recall.
That is, we study how successful the cache is in answering queries and how efficient it is in providing the maximum of the existing answers in the network. A challenging issue in HON caching is the use of similarity in defining the cache hit and miss. The similarity between two queries does not mean that their answers are stored in the same set of peers. Therefore, a cache hit might sometimes lead to irrelevant peers or to only a part of the peers that contain query answers. In this case, we may lose a part of the information and generate partial results to queries, which decreases the recall.

Cache Replacement Policy

In our work we investigate two replacement policies to study their impact on the search performance and particularly on the recall. As described previously, the main idea in P2P caching is to store the data in the cache for a short period of time because of the dynamicity of the network. The first replacement technique we use in HON is LRU (Least Recently Used), which removes the data that has not been used for the longest period of time. The second policy is NFU (Not Frequently Used), which removes the least popular data in the cache. Before applying any replacement policy, we first remove the invalid segments, meaning the segments whose peers are all offline. Then, we apply one of the two replacement policies mentioned above.

For the NFU algorithm, we need to associate additional information with each segment stored in the cache. First, we use a counter $S_{AC}$ indicating the number of times the query segment has been accessed. The peers belonging to the segments with high $S_{AC}$ counters are considered the most used peers. For this reason, it is important to keep them in the cache for future use. Second, to ensure the consistency of the access numbers of query segments, we associate with each segment a timestamp $S_{TS}$ to avoid having falsely popular segments.
In other words, a segment S may have the highest access number $S_{AC}$ not because it is the most used one in the current time interval $[0, \rho]$, but because it is the oldest one. For this reason, the access number $S_{AC}$ of a segment S must be reset to zero when $(t - S_{TS})$ reaches a threshold $\rho$, where t is the current time. Formally: for each $S \in C$, if $(t - S_{TS}) = \rho$ then $S_{AC} = 0$.

We study in our work the impact of the replacement policies LRU and NFU in HON. Both LRU and NFU have advantages and disadvantages. Depending on the cache size, LRU keeps the data in the cache for a short period of time. LRU differs considerably from NFU, especially when new peers join the network and provide new content. Since LRU removes the oldest data, it helps to refresh the content of the cache and to take into account the new content that might join the network at any time. Therefore, the LRU replacement policy helps to increase the recall. By contrast, the NFU policy keeps the most popular data in the cache. This helps to increase the success hit by providing answers to a large portion of user queries. However, NFU might ignore the network dynamicity and provide answers without taking new content into account. We have studied both algorithms and run experiments to show their efficiency. Using different data distributions and cache sizes, we have shown that installing a cache in HON limits the flooding and significantly reduces the query scope, meaning the number of peers involved in the query processing. More details about our results are given in the evaluation chapter.

Conclusion

We have presented in this chapter two main parts of query processing in HON. First, we have introduced the different query models, consisting of range queries and nearest neighbors queries. Then, we have defined the architecture of the index tables used to route queries between peers and explained how queries are processed based on similarity measures. The second part of this chapter described the caching mechanism that we have introduced to limit the flooding inside the cells. We have defined the cache service and the related admission and replacement policies. To study the performance of query processing and caching, we have run a set of experiments studying different metrics such as query recall, query scope and success rate. We have shown through these experiments that HON achieves a high success rate and recall. In addition, caching significantly limits the flooding and improves the performance of the similarity search. A detailed presentation of our results is given in the evaluation chapter.

5.7 Bibliographic Remarks

Many web caching approaches have been proposed to reduce latency and network traffic [98, 8, 21, 35, 94, 96, 2]. Chankhunthod et al. [21] propose a hierarchical web caching architecture. Povey et al.
[96] propose an extension of the hierarchical architecture that consists in storing documents only in the leaves of the hierarchy. Fan et al. [35] propose a scalable distributed architecture of web caching where no centralized resources considered. Armon et al. [8] address the issues of how to operate a Cache Satellite Distribution System, how to design it, and how to estimate its eect. Paul et al. [94] investigate in detail the advantage and disadvantage of a distributed

125 112 Chapter 5. Similarity Search and Query Caching architecture of caches which are coordinated through a central controller. Ren et al. [98] introduce a novel caching scheme called semantic caching. It can be used in the mobile computing environment, heterogeneous systems and general client-server systems. Caching has been introduced in P2P systems to improve their performance by reducing user delay and network trac. Sripanidkulchai et al. [107] and Markatos et al [79] have studied the eectiveness of Gnutella caching [40] where a cache is maintained by each peer in the system. Wang et al. [116] propose a distributed architecture of caches that store query results in the peers along the returning path. Patro et al. [93] propose a centralized architecture of caching. They show that caching at centralized location keeps the query traces of all the peers, thus can be far more eective than caching at individual peers.
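Returning to the replacement policies discussed in this chapter, the reset rule for the access number $S_{AC}$ and the NFU eviction choice can be illustrated with a small Java sketch. The names `Segment`, `NfuCacheSketch`, `accessCount` and `timestamp` are hypothetical and are not taken from the HON prototype. The sketch assumes that $S_{TS}$ is the time at which the counter of a segment was last reset, and it resets a counter once the age reaches or exceeds $\rho$ rather than on strict equality, so that a skipped time step does not prevent the reset.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical cache segment: a cached query with its NFU bookkeeping. */
final class Segment {
    final String query;
    int accessCount;  // S_AC in the text
    long timestamp;   // S_TS in the text (assumed: time of the last counter reset)

    Segment(String query, long now) {
        this.query = query;
        this.timestamp = now;
    }
}

/** Minimal NFU cache sketch with the periodic reset of access numbers. */
final class NfuCacheSketch {
    private final List<Segment> segments = new ArrayList<>();
    private final int capacity;
    private final long rho; // reset threshold

    NfuCacheSketch(int capacity, long rho) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be at least 1");
        }
        this.capacity = capacity;
        this.rho = rho;
    }

    /** Resets S_AC for every segment whose age has reached rho. */
    void resetStaleCounters(long now) {
        for (Segment s : segments) {
            if (now - s.timestamp >= rho) {
                s.accessCount = 0;
                s.timestamp = now;
            }
        }
    }

    /** Returns true on a hit; on a miss, admits the query and evicts the least used segment if full. */
    boolean access(String query, long now) {
        resetStaleCounters(now);
        for (Segment s : segments) {
            if (s.query.equals(query)) {
                s.accessCount++;
                return true;
            }
        }
        if (segments.size() == capacity) {
            Segment leastUsed = segments.stream()
                    .min(Comparator.comparingInt((Segment s) -> s.accessCount))
                    .get();
            segments.remove(leastUsed);
        }
        Segment fresh = new Segment(query, now);
        fresh.accessCount = 1;
        segments.add(fresh);
        return false;
    }

    /** Returns the access number of a cached query, or -1 if it is not cached. */
    int accessCountOf(String query) {
        for (Segment s : segments) {
            if (s.query.equals(query)) {
                return s.accessCount;
            }
        }
        return -1;
    }
}
```

Without the reset, a segment that was popular long ago would keep a high counter and never be evicted, which is the situation the threshold $\rho$ is meant to avoid.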

Chapter 6

Evaluation methodology

Contents
6.1 Prototype Design
    Physical Layer
        Elements
        Distribution
    Routing Layer
        Descriptors
        Protocols
    Application Layer
        Join Event
        Departure Event
        Search Event
6.2 Simulation Setup
    Parameters
    Metrics
        Search Quality
        Search Cost
        Caching Impact
        Maintenance Cost
        Failure Cost
6.3 Simulation Results
    Control parameters
    Search
        Success Rate
        Recall
        Query Scope
        Clustering Impact of the Search Cost
    Caching Impact on Search Performance
        Data Distribution Impact on Success Hit
        Caching Impact on Query Scope
        Caching Impact on Recall
        Replacement Policies
    Scalability
    Tolerance to Failures
6.4 Conclusion

We have presented in the previous chapters the design of a cluster-based P2P system, including the underlying architecture, the creation of clusters and overlays, the similarity-based query models, the query routing protocol and search optimization using caching. In this chapter, we present the implementation of a prototype to measure the performance of HON. We focus on evaluating different aspects. First, we study the quality of the similarity search achieved by the organization of data and peers in the feature space and evaluate the impact of caching on search efficiency. Second, we address the maintenance cost and fault tolerance of HON. Routing and localization are implemented in HON by maintaining partial routing tables in each peer, which makes the system very sensitive to membership changes. Thus, maintenance overhead and fault tolerance capabilities strongly affect the performance of HON.

The remainder of this chapter is organized as follows. The next section presents the prototype design. Section 2 introduces the simulation parameters and metrics used to evaluate HON. Simulation results are discussed in section 3. Finally, section 4 concludes the chapter.

6.1 Prototype Design

We have developed a prototype in the Java programming language to simulate HON and evaluate its performance. The HON prototype is composed of three layers: Physical, Routing and Application. The physical layer contains the basic elements and parameters of HON, the routing layer defines a set of protocols between peers, and the application layer is used to simulate peer join, search and departure.

Physical Layer

The physical layer is the basic component of the HON prototype. It is divided into two parts: Elements and Distribution. Elements represent the structure of peers and feature spaces, while the Distribution part implements data distribution functions.

Elements

Two elements can be distinguished, as shown in figure 6.1: Feature Space and Peer.
Feature Space: described by a set of features $\{f_1, f_2, \ldots, f_n\}$. Its structure is composed of two indexes:
1. CellsIndex: contains the descriptions of all cells constituting the feature space. As presented previously, a cell description is a set of range values associated with each
feature. Figure 6.2a shows an example of a CellsIndex vector where each column corresponds to a feature and each row corresponds to a cell.
2. AdjacentCellsIndex: keeps information about adjacency between cells in the feature space, as shown in figure 6.2a.

Figure 6.1: Physical Layer

Peer: defined as a set of structures, operations and decisions. A peer contains the following structures:
1. PeerID: a unique identifier (numeric value) of a peer.
2. PeerType: indicates the type of a peer, which can be SimplePeer or SuperPeer.
3. PeerFunction: indicates an additional role that a peer may play. The function of a peer is set to PassivePeer, ActivePeer or MirrorPeer. An active peer holds a cache, and a mirror peer stores a backup copy of the content of a super peer.
4. State: indicates the status of a peer. State=1 when the peer is online and State=0 when the peer is offline.
5. DataIndex: contains information about the data objects of a peer. It is represented as a two-column vector, as shown in figure 6.2b. The first column contains the descriptions of the cells the peer belongs to. The second column contains the descriptions of the data objects for each cell.
6. CacheIndex: this structure is maintained in active peers. It is a three-column vector in which the first column contains the descriptions of the queries, the
second column contains the identifiers of the relevant peers and the third column contains the usage frequency of the query. Figure 6.2b illustrates an example of a CacheIndex.
7. RoutingIndexes: contains three types of routing indexes: (1) the Super-Peer Index, used to route queries from peers to super peers, (2) the Peer-Index, used by a super peer to route queries inside clusters, and (3) the Cluster Index, which contains the descriptions of the clusters of an overlay and is used for routing inter-cluster queries. Routing indexes are described in detail elsewhere in this thesis.

Figure 6.2: Index Structures. (a) Feature Space Structures, (b) Peer Structures

In the HON prototype, the peers carry out a set of Operations and Decisions.
1. Operations: a set of tasks performed by peers, including feature selection,
data indexing, mapping, signature computation, density computation, and index table construction.
2. Decisions: define operations that are performed when a specific event occurs. There are two main decisions: (1) Updating decisions ensure that the routing tables of peers contain coherent information. These decisions are made whenever a peer changes its interests, joins or leaves the system. (2) Clustering decisions, which are based on cell density values, are used to create, merge or split clusters. For instance, when the density of the cell of an active peer reaches a predefined density threshold, the peer starts the clustering process.

Distribution

The distribution of data objects in the feature space plays an important role in defining the behavior of the system. It has an impact on many performance measures, including the maintenance cost, the similarity search and the network overhead. Below, we examine two types of data distribution: Uniform and Zipfian.

Uniform: with a uniform distribution, data objects are distributed randomly over cells. Therefore, each peer has an equal chance of being mapped to any cell of the feature space. To generate a uniform distribution, we use the Java Random class. An instance of this class is used to generate a stream of pseudorandom numbers. If two instances of the Random class are created with the same seed, and the same sequence of method calls is made for each, then they generate and return identical sequences of numbers. To avoid generating identical sequences of numbers, a single instance of the Random class is used by all peers to provide a uniform distribution of data objects.

Zipfian: Zipf's law is an empirical law that describes item occurrences where a few items occur very often while many others occur rarely. For example, the relative frequency with which web pages are accessed follows Zipf's law, where 20% of web pages are requested by 80% of queries.
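Returning to the uniform case, the seeding behaviour of `java.util.Random` described above can be checked with a short sketch. The class name `UniformMappingSketch`, the method `assignCells` and the seed value are hypothetical; the sketch assumes that a data object is mapped to a cell by drawing a cell index with `nextInt(cellCount)`, which is one plausible reading of the text rather than the prototype's documented code.

```java
import java.util.Arrays;
import java.util.Random;

final class UniformMappingSketch {
    /** Draws one cell index in [0, cellCount) per data object from the given generator. */
    static int[] assignCells(Random rng, int objectCount, int cellCount) {
        int[] cells = new int[objectCount];
        for (int i = 0; i < objectCount; i++) {
            cells[i] = rng.nextInt(cellCount);
        }
        return cells;
    }

    public static void main(String[] args) {
        // Two generators created with the same seed return identical sequences.
        int[] a = assignCells(new Random(42L), 5, 16);
        int[] b = assignCells(new Random(42L), 5, 16);
        System.out.println(Arrays.equals(a, b)); // true

        // A single shared generator gives each peer a different part of one stream.
        Random shared = new Random(42L);
        int[] peer1 = assignCells(shared, 5, 16);
        int[] peer2 = assignCells(shared, 5, 16);
        System.out.println(Arrays.toString(peer1) + " " + Arrays.toString(peer2));
    }
}
```

Sharing one generator is what prevents two peers from receiving the same cell sequence, which would happen if each peer created its own `Random` with the same seed.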
Our goal is to explore the applicability of Zipf's law to query and data distribution in HON. Therefore, we need to define a Zipfian distribution. We start by giving the formal definition of Zipf's law:

$F_i = 1 / i^{\alpha}, \quad i = 1, \ldots, N$

where $F_i$ is the occurrence frequency of the $i$-th ranked item, $\alpha$ is a parameter close to 1 and $N$ is the number of distinct items. A Zipfian distribution is a set of values that
follow the above Zipf's law. We define a specific Zipfian distribution in HON. It follows a strict Zipf's law ($\alpha = 1$), is characterized by $\alpha$, $M$ and $K$, and is given by:

$F_i = K / i^{\alpha}, \quad i = 1, \ldots, M$

where $M$ is the number of events, $\alpha$ is equal to 1 and $K$ is the frequency, or popularity, of the most frequently occurring event in the distribution. Assuming that the feature space is composed of $M$ cells, an event $i$ is defined as the mapping of a data object to cell $i$. Two distinct data objects mapped to the same cell result in the same event. Here we use the 20/80 rule, where 80% of data objects are mapped to 20% of cells. Therefore, the parameter $K$ is computed by:

$K = DN \times 0.8$

where $DN$ represents the number of data objects. This number is defined according to the type of the Zipfian distribution. In our approach, we define two types of Zipfian distribution:
1. Global Zipfian Distribution: the data objects of all peers in the system follow the same Zipfian distribution. Therefore, the $DN$ parameter is equal to the total number of data objects in the whole system.
2. Local Zipfian Distribution: the data objects of each peer follow a different Zipfian distribution. Therefore, a different $DN$ parameter, equal to the number of data objects of the peer, is selected for each peer.

Routing Layer

The main task of this layer is to route packets, or descriptor messages, among peers and to define message exchange protocols. The protocols describe the way in which peers communicate. Figure 6.3 depicts the components of the routing layer.

Descriptors

The HON communication protocol is composed of 7 descriptors: Join, Accept-Join, Query, Query-Hit, Download, Disconnect and Update. The structure of a descriptor header is given in table 6.1.
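Returning to the Zipfian distribution defined in the Distribution section, the frequencies $F_i = K / i^{\alpha}$ and the mapping of data objects to cells can be sketched as follows. `ZipfSketch`, `frequencies` and `sampleCell` are hypothetical names. In this sketch $K$ is passed in as a parameter rather than derived from $DN$, and sampling normalizes the weights $1 / i^{\alpha}$ into probabilities; both choices are assumptions made for illustration, not the prototype's documented procedure.

```java
import java.util.Random;

final class ZipfSketch {
    /** Returns F_i = K / i^alpha for i = 1..m (index 0 holds F_1). */
    static double[] frequencies(int m, double k, double alpha) {
        double[] f = new double[m];
        for (int i = 1; i <= m; i++) {
            f[i - 1] = k / Math.pow(i, alpha);
        }
        return f;
    }

    /** Draws a cell rank in 1..m with probability proportional to 1 / i^alpha. */
    static int sampleCell(Random rng, int m, double alpha) {
        double[] weights = frequencies(m, 1.0, alpha);
        double total = 0.0;
        for (double w : weights) {
            total += w;
        }
        double u = rng.nextDouble() * total;
        double cumulative = 0.0;
        for (int i = 0; i < m; i++) {
            cumulative += weights[i];
            if (u < cumulative) {
                return i + 1;
            }
        }
        return m; // guards against floating-point rounding
    }

    public static void main(String[] args) {
        // With K = 12 and alpha = 1, the frequencies are 12, 6, 4 and 3.
        for (double f : frequencies(4, 12.0, 1.0)) {
            System.out.println(f);
        }
        // Map 1000 data objects to 10 cells and count how many land in each cell.
        Random rng = new Random(1L);
        int[] counts = new int[11];
        for (int n = 0; n < 1000; n++) {
            counts[sampleCell(rng, 10, 1.0)]++;
        }
        for (int rank = 1; rank <= 10; rank++) {
            System.out.println("cell " + rank + ": " + counts[rank]);
        }
    }
}
```

For a global Zipfian distribution the sampler would be called once per data object in the whole system, and for a local one it would be called per peer with that peer's own objects.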

Figure 6.3: Routing Layer

Table 6.1: Description of HON Protocol
Descriptor ID: A unique descriptor identifier on the network. It is used to make sure that the same descriptor does not pass twice through the same peers, which would create a loop. Loops may occur when the update operation is not carried out after a peer fails or departs without notification.
Descriptor Type: Indicates the type of the descriptor (0: Join, 1: Accept-Join, 2: Query, 3: Query-Hit, 4: Download, 5: Disconnect and 6: Update).
Hops: Number of times the descriptor has been forwarded.
IdSource: Identifier of the sender of the descriptor.
IdDestination: Identifier of the descriptor destination.

Join: a descriptor initiated by a peer when it joins the network. This descriptor is a signature mask representing the cells to which the data objects of the peer are mapped using the threshold value T.

Accept-Join: this descriptor is sent by a cluster super peer when it receives a Join descriptor from a connecting peer. It contains the IP address of the super peer and the description of its cluster.

Query: contains the description of a query. Recall from section 5.2 that a query is a set of range values (range queries) or a data object value, which is a point in the feature space, for similarity search and nearest neighbor queries.

Query-Hit: a descriptor sent to a requesting peer by a peer containing the required object. It contains the IP address of the responding peer. In our simulation, we do not take the download cost into account. We use the returned IP address to update the cache and to count the number of messages sent to process a query.

Download: a descriptor sent at the last step of a data search to get the required data object from the target peer.

Disconnect: when a peer disconnects from the network, it sends a Disconnect descriptor to all the peers in its index tables.

Update: this descriptor is used by a new super peer to send an update request to the peers of its cluster. It contains the description of the cluster and the IP address of the sender.

Protocols

In this section we describe the connecting, searching and disconnecting protocols in HON.

Connecting: a new peer connects to the network by broadcasting a Join descriptor. When a relevant super peer receives a Join descriptor, it updates its index tables and sends back an Accept-Join descriptor. When the new peer receives one or more Accept-Join descriptors, it builds its Cluster-Index using the information received from the super peers.

Searching: each super peer maintains routing information about the super peers of other clusters. When a peer initiates a query, it sends a Query descriptor to one of its super peers. When the super peer receives the Query descriptor, it forwards it to the super peers of the target clusters. Each super peer of a target cluster locates the target cells and floods the peers they contain. A Query-Hit is sent from a peer holding the required object to the requesting peer.
When the requesting peer gets the Query-Hit, it sends back a Download descriptor to simulate the download process.
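The descriptor header of table 6.1 and the forwarding step used during a search can be sketched in Java as follows. The type and method names (`DescriptorType`, `Descriptor`, `ForwardingSketch`, `shouldForward`) are hypothetical; only the field names and numeric type codes come from table 6.1. Keeping a set of already seen descriptor IDs in each peer is one assumed way of making sure that a descriptor does not pass twice through the same peer.

```java
import java.util.HashSet;
import java.util.Set;

/** Descriptor types with the numeric codes listed in table 6.1. */
enum DescriptorType {
    JOIN(0), ACCEPT_JOIN(1), QUERY(2), QUERY_HIT(3), DOWNLOAD(4), DISCONNECT(5), UPDATE(6);

    final int code;

    DescriptorType(int code) {
        this.code = code;
    }
}

/** Header fields of a descriptor, following table 6.1. */
final class Descriptor {
    final long id;             // Descriptor ID
    final DescriptorType type; // Descriptor Type
    final int hops;            // Hops
    final int idSource;        // IdSource
    final int idDestination;   // IdDestination

    Descriptor(long id, DescriptorType type, int hops, int idSource, int idDestination) {
        this.id = id;
        this.type = type;
        this.hops = hops;
        this.idSource = idSource;
        this.idDestination = idDestination;
    }

    /** Returns a copy of this descriptor after one more forwarding step. */
    Descriptor forwarded(int nextDestination) {
        return new Descriptor(id, type, hops + 1, idSource, nextDestination);
    }
}

/** Per-peer duplicate check based on the descriptor ID (an assumed mechanism). */
final class ForwardingSketch {
    private final Set<Long> seen = new HashSet<>();

    /** Returns true the first time this peer sees a descriptor ID, false afterwards. */
    boolean shouldForward(Descriptor d) {
        return seen.add(d.id);
    }
}
```

A super peer would call `shouldForward` before relaying a Query descriptor and drop the descriptor when the call returns false, which breaks forwarding loops.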

Disconnecting: when a peer leaves the system, it sends a Disconnect descriptor to its super peers. When a super peer receives a Disconnect descriptor, it simply removes all information related to the disconnected peer. When a super peer leaves the network, it first sends a Disconnect descriptor to one of its mirror peers, which are described elsewhere in this thesis. The mirror peer takes over the management of the cluster of the disconnected super peer and sends an Update descriptor to the member peers.

Application Layer

The goal of the physical layer and the routing layer of the prototype is to create the system with specific parameters and a predefined number of peers. Once the system is built, the application layer starts a mixture of events to simulate the behavior of HON. Three events can be distinguished: Join, Departure and Search (see figure 6.4).

Join Event

This event simulates peers joining the network, in order to evaluate the system performance by measuring maintenance costs, updating operations and the efficiency of the clustering algorithm. Two consecutive peer join events are separated by other types of events to simulate realistic operation traces. Join events represent 10% of the events occurring in the network.

Departure Event

Peer departure is a crucial event in P2P networks. It helps to measure the adaptability of the system to dynamic changes. Departures with notification allow testing the efficiency of the updating operation, while departures without notification help to evaluate the tolerance of the system to failures. Note that a departure of a peer without notification is considered a peer failure. A peer departure event is simulated by setting the State of a peer to 0. Departure events represent 10% of the events occurring in the network.

Search Event

The search event is used to evaluate the efficiency of the similarity search. Search events represent 80% of HON events, which is the highest share.
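The event mix described above (10% Join, 10% Departure, 80% Search) can be sketched as a small workload generator. `EventKind`, `WorkloadSketch`, `nextEvent` and `simulate` are hypothetical names. The sketch draws each event independently from a seeded `java.util.Random`, which is an assumption, and it does not reproduce the constraint that two Join events be separated by other types of events.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.Random;

enum EventKind { JOIN, DEPARTURE, SEARCH }

final class WorkloadSketch {
    /** Draws one event kind with the 10/10/80 split of the application layer. */
    static EventKind nextEvent(Random rng) {
        int r = rng.nextInt(100); // uniform in 0..99
        if (r < 10) {
            return EventKind.JOIN;      // 10% of events
        }
        if (r < 20) {
            return EventKind.DEPARTURE; // 10% of events
        }
        return EventKind.SEARCH;        // remaining 80% of events
    }

    /** Generates eventCount events and returns how often each kind occurred. */
    static Map<EventKind, Integer> simulate(long seed, int eventCount) {
        Random rng = new Random(seed);
        Map<EventKind, Integer> counts = new EnumMap<>(EventKind.class);
        for (EventKind kind : EventKind.values()) {
            counts.put(kind, 0);
        }
        for (int i = 0; i < eventCount; i++) {
            counts.merge(nextEvent(rng), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints the observed counts per event kind for 10,000 generated events.
        System.out.println(simulate(2007L, 10_000));
    }
}
```

In a full simulator each drawn event would trigger the corresponding protocol, for example setting the State of a peer to 0 for a Departure event.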

Figure 6.4: Application Layer

6.2 Simulation Setup

The simulation part uses several parameters and metrics to evaluate HON. Parameters represent a set of measurable factors, such as Threshold, that determine the system behavior. Metrics are measurement functions that facilitate the quantification of particular characteristics of the system, such as the success rate of queries.

Parameters

Two different types of parameters are used in the simulation of HON: Control parameters and Workload parameters. The control parameters, given in table 6.2, represent the system parameters defined at the physical layer of the prototype architecture. The values of these parameters are specified before starting a simulation and can change from one simulation to the next in order to evaluate the system in different situations. Default parameter values are used if no particular values are specified for a simulation. The workload parameters are related to the events occurring in the system. They are specified for the application level operations and are described in a separate table.

Metrics

We define the following metrics to evaluate similarity search, caching and clustering efficiency, system scalability and tolerance to failures. Five categories of metrics are defined in our simulation: Search Quality, Search Cost, Caching Impact, Maintenance Cost and Failure Cost.

Search Quality

To evaluate the quality of the similarity search in HON, we define the success rate and the recall metrics given by:
