Detecting and Recovering from Overlay Routing Attacks in Peer-to-Peer Distributed Hash Tables

A thesis for the degree of Master of Science in Computer Science

Keith Needels
keithn@csh.rit.edu
Department of Computer Science
Rochester Institute of Technology
February 22, 2008

Committee:
Professor James Minseok Kwon, Chair
Professor Alan Kaminsky, Reader
Professor Warren R. Carithers, Observer

Abstract

Distributed hash tables (DHTs) provide efficient and scalable lookup mechanisms for locating data in peer-to-peer (P2P) networks. A number of issues, however, prevent DHT-based P2P networks from being widely deployed. One of these issues is security. DHT protocols rely on the users of the system to cooperate so that lookup requests reach the correct destination. Users who fail to run the protocol correctly can severely limit the functionality of these systems. The fully distributed nature of DHTs compounds these security issues, as any security mechanism must be implemented in a noncentralized fashion for the system to remain truly P2P.

This thesis examines the security issues facing DHT protocols and proposes an extension to one such protocol, Chord, that mitigates the effects of attacks on the underlying lookup message routing mechanism when a minority of nodes in the system are malicious. Our modifications require no trust between nodes in the network except during the joining process. Instead, each node uses locally known information about the network to evaluate the validity of hops encountered during the lookup routing process. Hops that are determined to be invalid are avoided. These modifications to the Chord protocol have been implemented in a simulator and evaluated in the presence of malicious nodes. We present the results of this evaluation and compare them to the results obtained when running the unmodified Chord protocol.

Table of Contents

1. Introduction
2. Peer-to-Peer Protocols
   Overlay Networks
   Napster and Gnutella
   Distributed Hash Tables
      Chord
      Pastry
      Content Addressable Networks (CANs)
3. Peer-to-Peer Protocol Security Issues and Related Work
   Data Attacks
   Identifier Attacks
   Routing Attacks
4. Chord Secure Routing Design
   Threat Model
   Design Overview
   The Backtracking Algorithm
   The Hop Verification Algorithm
   Maintaining Statistical Data
   Joining the Network
   Updating Finger Table Entries
5. Simulator Design
   Using the GUI Utility
   Experiment Setup
   Viewing Experiment Results
   Writing Tests in Java
6. Evaluation
   Dropped lookup requests
   Incorrect Random Routing
   Malicious Sub-ring Routing
   Effect of the Standard Deviation Parameter
   Effect of the Pruning Parameter
7. Conclusion
References

1. Introduction

The popularity of peer-to-peer (P2P) networks took off in 1999 thanks to the success of Napster [9], and P2P networking has been an active research topic ever since. Although applications that could be considered peer-to-peer existed before Napster, giving the average Internet user the ability to easily obtain music and movies for free made P2P technology well known throughout the world. Research in this area has resulted in a powerful class of P2P lookup protocols called distributed hash tables. While these protocols are scalable and efficient, they suffer from security vulnerabilities that prevent them from being widely deployed in open networks.

A peer-to-peer network can be defined as a network where there are no central servers. Instead, each user of the system is both a client and a server, and is referred to as a peer. Peers connect directly to each other to transfer data. In a true peer-to-peer network, peers also locate data without using a central server, without using any kind of hierarchical organization, and without making some peers more important than others. There is no single point of failure. If a peer fails, other peers can continue to use the system. If a single peer is using all of its bandwidth, other peers are not affected.

While illegal file sharing is perhaps the best-known application of peer-to-peer networks, there are many useful legal and ethical uses for these decentralized systems. These uses include overlay multicast, data backup, distributed file systems, distributed databases, instant messaging, and DNS. Many large-scale distributed systems that benefit from an architecture with no single point of failure can make use of P2P technology.

Unfortunately for Napster, it was not a true peer-to-peer network, since file lookups were handled by a central server. This directory server was an easy legal target, and Napster was shut down.
In the wake of Napster's demise, many new peer-to-peer systems were developed. These systems were fully decentralized, but most of them relied on flooding techniques to locate peers containing desired data, which is inefficient and is not guaranteed to find the files being sought. To solve these problems, yet another class of P2P systems was developed, called Distributed Hash Tables (DHTs). DHTs are fully decentralized systems that locate data efficiently, and they are today a popular research topic in the field of distributed systems. A few of these systems will be discussed in Section 2.

The fundamental purpose of a DHT is to find the peer (also called a node) responsible for a resource given a key for that resource. Since it is not practical to keep track of every node in the network, each node should only be responsible for keeping track of a small subset of other nodes. The node responsible for a key is found by forwarding lookup requests through a structured overlay network. The nodes that a particular node keeps references to are the links that node has in the overlay. The number of hops needed to complete a lookup should also be small.

While DHTs are much more efficient and scalable than the flooding systems that were popular immediately after the fall of Napster, there are some serious issues that prevent them from being deployed in large, open networks, one of which is security. With DHTs, we must rely on other peers to correctly forward our lookup requests in order to find the peer responsible for a key. Unlike physical network routing, with overlay routing in an open DHT system, anybody can become a router.

An individual attacking a DHT has many types of attacks available. Attackers can modify, drop, and misroute lookup requests. They can take responsibility for certain data in order to deny its availability or provide modified data to other nodes. They can forward incorrect overlay routing table updates. DHTs, in their original forms, are an easy target for attackers. Large open P2P networks that use DHTs cannot exist until the fundamental security issues of the underlying DHT protocols are addressed.

The purpose of this thesis is to examine security issues facing DHT protocols and present an extension to the Chord DHT protocol that mitigates some of these attacks. Our goal is to allow Chord to use information that is readily obtained through the normal operation of the protocol to evaluate the lookup routing process and respond to attacks. This is done in a fully distributed fashion. We assume that a node trusts no node besides itself, except the bootstrap nodes that it uses to join the network. Our results show that our proposed extension to Chord is able to correctly complete lookups over 90% of the time with up to half of the network consisting of compromised nodes performing naïve attacks, and that our extension offers significant improvement over the base protocol in the face of sophisticated attacks.

In order to understand the security issues that DHT protocols face, we first must understand P2P/DHT protocols, which is the goal of section 2. In section 3, we will survey the vulnerabilities that are present in DHT protocols and solutions that have been proposed.
In section 4, we detail our extension to the Chord protocol to avoid some of these attacks. We give an overview of the architecture of the software simulator used for testing these proposed changes in section 5, and we evaluate these changes in the simulator in section 6. We summarize and conclude in section 7.

2. Peer-to-Peer Protocols

The purpose of this section is to provide background information on peer-to-peer networks. It is necessary to understand how these protocols work in order to understand the security issues involved with them. Although several protocols from the last century could be considered P2P (such as Usenet), we have chosen to focus on P2P protocols starting with Napster, which was released in 1999. This is when P2P protocols really became a popular research topic. While a few basic security issues will be laid out in this section, Section 3 will contain a detailed overview of these issues and of the research that has been performed.

2.1. Overlay Networks

A peer-to-peer network is a type of overlay network, meaning that nodes in the network are not linked together physically but are instead connected with virtual links over the underlying physical network (normally, the Internet). Peers typically communicate using TCP, and a link in a P2P network can be thought of as a TCP connection over the physical network rather than a direct physical link.

One major difference between an overlay network and a physical network is that overlay networks can be organized in any way desired by the designer. Physical links are generally restricted to machines that are in close proximity to each other and have a physical connection (such as an Ethernet connection) between them. It is not possible to create and change physical links between nodes on the fly. It is, however, very easy to create an overlay link between two nodes simply by initiating a TCP connection.

Peer-to-peer protocols attempt to create an overlay network with a structure that allows any node to quickly find other nodes responsible for desired data. These protocols must define how the overlay network is to be organized, how to handle nodes that join and leave the network, how to route lookup requests through the network, and so on.
To limit overhead, we want each node to have only a small number of links in the overlay network, but at the same time we want lookups to complete quickly. We also want to do this in a fully distributed manner where there are no single points of failure and no nodes that are more important than others in the operation of the system.

There have been many proposed peer-to-peer protocols. These protocols take various approaches to organizing an overlay network. Each has its own strengths and weaknesses, and some sacrifice being fully distributed for various reasons. The next few sections give brief overviews of five of these protocols: Napster [9], Gnutella [7], CAN [10], Chord [13], and Pastry [11]. We will focus on Chord since it has been chosen as the target protocol of this thesis.

2.2. Napster and Gnutella

Although Napster [9] was not the first protocol to make use of decentralized, distributed resources, it is the protocol that made P2P technology famous by making the multimedia files of many individuals accessible to the world. While the files shared on Napster were fully distributed, the directory of these files was not. The Napster overlay network consisted of a Napster file directory server with every Napster peer connected directly to it and only to it. When a user joined the network, they would send this directory server a list of all files that they were sharing. When a user wanted to find a file, they would send their search phrase to the directory server, and the directory server would return a list of peers that had this file. Peers would then connect directly to one another to transfer the file.

The use of a central directory server means that Napster is not considered to be a true peer-to-peer system. This single point of failure eventually did fail when legal action forced it down, and the Napster network ceased to exist.

After Napster failed, developers looked for a way to fully distribute the file lookup mechanism so that the network would not be vulnerable to a single point of failure. One of the better-known protocols developed was Gnutella. Gnutella did away with the central directory server that Napster relied on. Instead, each node was a directory server for its own files. Nodes in the overlay were connected to one another in a more or less arbitrary fashion. To perform a search, a node would send its search request to each node it was connected to in the overlay. Each node receiving the lookup request would then send the request to every node it was connected to, and so on. A node that contained the desired data would inform the searching node. This method is referred to as flooding.
To keep a lookup request from flooding indefinitely, each lookup request carried a time-to-live counter that each node would decrement before flooding the request to its neighbors. When this time-to-live counter reached zero, the request would no longer be forwarded.

There are several disadvantages to the flooding approach taken by the original Gnutella protocol. First, it is a very inefficient method of searching. A search request might result in thousands of messages being sent to and from thousands of nodes. Second, even if a file exists in the system, there is no guarantee that it will be found. If the file resides on a node outside of the time-to-live radius, it will never be discovered. This means Gnutella is fundamentally not scalable. As the network grows, the search capability of a node does not necessarily grow, as search is limited by the time-to-live counter. Later releases of Gnutella were able to address these limitations to some extent by using a hierarchy of regular nodes and supernodes. The regular nodes would communicate only with supernodes, and supernodes would communicate with each other.

Gnutella's initial scalability issues fueled even more research into peer-to-peer networks. This research resulted in what are known as Distributed Hash Table (DHT) protocols. These protocols use structured overlay networks, unlike Gnutella, which allowed the network to be arranged in any way imaginable. If the desired data exists, a DHT is guaranteed to locate it, and to do so within a bounded number of hops.
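The time-to-live limitation described above can be illustrated with a small sketch. This is not Gnutella's actual message format or implementation; the graph layout, the `flood_search` function, and the file name are hypothetical and chosen only to show why a file outside the time-to-live radius is never found.

```python
from collections import deque

def flood_search(graph, start, wanted, ttl):
    """Flood a query from `start`; return (nodes holding `wanted`, messages sent)."""
    found = set()
    seen = {start}                      # nodes that have already handled the query
    frontier = deque([(start, ttl)])    # (node, remaining time-to-live)
    messages = 0
    while frontier:
        node, remaining = frontier.popleft()
        if wanted in graph[node]["files"]:
            found.add(node)
        if remaining == 0:
            continue                    # TTL exhausted: do not forward any further
        for neighbor in graph[node]["links"]:
            messages += 1               # every forwarded copy costs one message
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, remaining - 1))
    return found, messages

# Hypothetical five-node chain A - B - C - D - E; only E shares the file.
graph = {
    "A": {"links": ["B"], "files": set()},
    "B": {"links": ["A", "C"], "files": set()},
    "C": {"links": ["B", "D"], "files": set()},
    "D": {"links": ["C", "E"], "files": set()},
    "E": {"links": ["D"], "files": {"song.mp3"}},
}

near, near_messages = flood_search(graph, "A", "song.mp3", ttl=2)
far, far_messages = flood_search(graph, "A", "song.mp3", ttl=4)
print(near, near_messages)
print(far, far_messages)
```

With a time-to-live of 2 the query never reaches node E, so the file is reported as missing even though it exists; raising the time-to-live finds it at the cost of more messages.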

2.3. Distributed Hash Tables

Unlike flooding protocols such as Gnutella, Distributed Hash Tables are structured. This means that nodes that want to participate cannot create arbitrary links in the overlay network. Each DHT protocol defines how nodes should form connections in the overlay, how those nodes should deal with new nodes joining, and how they should deal with nodes leaving and failing. More importantly, the protocols also define how lookup messages should be forwarded through the system. Three popular DHTs that have received a great deal of research attention are CAN [10], Chord [13], and Pastry [11].

Although DHTs can vary greatly from one to another, there are a few things that almost all of them have in common. The entire purpose of a DHT, like any other P2P system, is to locate the node or nodes responsible for a desired data item. Each data item has associated with it a key, which is known beforehand, such as a file name. Each key is associated with a node or a group of replica nodes that are responsible for maintaining the desired data or a reference to where the desired data might be found. Unlike Napster and Gnutella, most DHTs support only exact key searches and do not support keyword searches; supporting keyword searches is an open research topic. All functionality other than locating the node responsible for a key, such as actually retrieving the resource being sought, is the responsibility of higher layers of the peer-to-peer application.

Each node in the DHT network is responsible for storing a subset of the overall set of key-data pairs. Each node maintains overlay links to a small number of nodes in the network for the purpose of routing lookup requests. Lookups are performed by forwarding the lookup request through the overlay network as defined by the DHT protocol. Typically, the number of links maintained by each node is O(log n) and the number of hops for a lookup request to complete is also O(log n), where n is the number of nodes in the system.
Nodes and keys in DHTs are both mapped to identifiers, usually by using a hash function whose output values are well distributed. In Chord and Pastry, an identifier is simply an integer, and a hash function such as SHA-256 can be used to hash nodes and data keywords to their identifiers. In CAN, identifiers are points in a multi-dimensional coordinate system.

Identifiers are used to determine which node is responsible for which key. In Chord, a key is stored on the first node whose identifier is equal to or follows the key's identifier on the ring. In Pastry, a key is stored on the node with the identifier numerically closest to the key's identifier. In CAN, a key is stored on the node whose zone (centered on the node's identifier coordinates) contains that key's identifier coordinates.

For routing lookup requests, each node contains a routing table that holds some small subset of the nodes in the system. These routing tables are used to forward a lookup request progressively closer to its destination. With Chord, for example, a lookup request is forwarded to the node in the routing table with the identifier that most closely precedes the key's identifier.
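The identifier mapping and the Chord and Pastry responsibility rules described above can be sketched as follows. This is a minimal illustration under stated assumptions, not code from either protocol: the 16-bit identifier length, the function names, and the sample node identifiers are hypothetical, the Chord rule used is "first node equal to or following the key", and the Pastry rule is modeled as closest circular distance on the ring.

```python
import hashlib
from bisect import bisect_left

M = 16  # assumed identifier length in bits, kept small for readability

def identifier(name, m=M):
    """Hash a node address or data key to an m-bit integer identifier."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** m)

def chord_owner(node_ids, key_id):
    """Chord rule: first node identifier equal to or following key_id, wrapping."""
    ids = sorted(node_ids)
    return ids[bisect_left(ids, key_id) % len(ids)]

def pastry_owner(node_ids, key_id, m=M):
    """Pastry rule: node identifier numerically closest to key_id on the ring."""
    ring = 2 ** m
    def ring_distance(a, b):
        d = abs(a - b)
        return min(d, ring - d)
    # Ties are broken toward the smaller identifier (an arbitrary choice here).
    return min(node_ids, key=lambda n: (ring_distance(n, key_id), n))

# Hypothetical 10-bit ring with three nodes.
nodes = [100, 400, 900]
for key_id in (390, 410, 950):
    print(key_id, chord_owner(nodes, key_id), pastry_owner(nodes, key_id, m=10))
```

Keys 410 and 950 show how the two rules can disagree: Chord assigns each key to the next node clockwise, while the closest-identifier rule may pick the node just before the key.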

While these are common aspects of most DHTs, the details of each protocol vary, and so do the routing table size and average lookup hop count. Chord and Pastry, for example, have a routing table size of O(log n), and lookup requests take O(log n) hops to reach their destination. CAN, on the other hand, has a constant-sized routing table, but lookup requests take $O(n^{1/d})$ hops, where d is a constant system parameter (the number of dimensions of the CAN coordinate space). The next three subsections provide more detail on Chord, Pastry, and CAN, with most of the attention focused on Chord since it is the target system of this thesis.

2.3.1. Chord

Chord's identifiers are integers. The identifier for a key is obtained by hashing that key with a hash function that is used by all of the nodes in the system and that returns integers of some bit length m. This can be any well-distributed hash function; the original Chord paper uses SHA-1, for which m is 160. A node is assigned an identifier by hashing its IP address. Nodes and keys are then arranged in an identifier ring modulo $2^m$. Each key's value is stored on the first node with an identifier equal to or following that key's identifier in the ring in the clockwise direction.

In order to find nodes that are responsible for keys, each node has to store some routing information in a table. In Chord, this routing table is called a finger table. The Chord finger table for a node with identifier id contains m entries, numbered from 0 to m-1. The node stored in finger table entry i is the first node whose identifier is equal to or follows $(id + 2^i) \bmod 2^m$ in the clockwise direction. It is possible (and in practice common) to have duplicate entries in the finger table. Figure 2.1 shows a sample finger table with an illustration of how the finger table is derived for a node with identifier 770.
Node 770's last finger table entry should be the first node that succeeds the target identifier computed for that entry. This node is Node 275, so a reference to Node 275 is stored in the last finger table entry in Node 770's finger table. The rest of the finger table entries are filled in with the same process for i = 0 through 8. Note that the first several finger table entries all point to the same node. This is because the targets of those entries are all very close to Node 770, and they all fall between Node 770 and the first node that follows it on the ring.
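The finger table construction described above can be sketched in a few lines. The ring below is hypothetical (a 6-bit identifier space with ten nodes) and is not the ring shown in Figure 2.1; the function names are likewise illustrative. Entry i is taken to be the first node whose identifier is equal to or follows $(id + 2^i) \bmod 2^m$.

```python
from bisect import bisect_left

def successor(sorted_ids, target):
    """First node identifier equal to or following `target`, wrapping past zero."""
    return sorted_ids[bisect_left(sorted_ids, target) % len(sorted_ids)]

def finger_table(node_id, node_ids, m):
    """Entries 0..m-1: entry i is the successor of (node_id + 2**i) mod 2**m."""
    ids = sorted(node_ids)
    return [successor(ids, (node_id + 2 ** i) % (2 ** m)) for i in range(m)]

# Hypothetical ring: identifiers are 6 bits long, so the ring is modulo 64.
nodes = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]
print(finger_table(8, nodes, m=6))
print(finger_table(56, nodes, m=6))
```

In this sketch the first three entries of node 8's table all point to node 14, because the targets 9, 10, and 12 all fall between node 8 and its immediate successor, which is the duplicate-entry effect noted above.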

Figure 2.1. An illustration of a sample finger table.

As Figure 2.1 illustrates, each node only has information about a subset of the nodes in the overall system. As the system grows, the number of unique nodes in each node's finger table becomes a smaller fraction of the overall number of nodes. The number of distinct nodes in the finger table has been shown by [13] to be O(log n), where n is the number of nodes in the system.

The advantage of the finger table is that when performing a lookup, each hop can cover about half of the remaining distance between the node doing the routing and the node responsible for the key. This divide-and-conquer approach to routing lookup requests has been shown by [13] to use O(log n) hops for each route. The algorithm for routing a lookup request from a node is simple: forward the request to the last finger table entry that precedes the identifier of the key. The node preceding the destination node will detect that the key falls between itself and its successor, and will return information about its successor to the node performing the lookup.

Figure 2.2 shows an example of the route a lookup request might take through a Chord network. In this figure, Node 770 performs a lookup request for Key 665, which it finds stored at the node that succeeds the key on the ring.
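The routing rule described above can be sketched as a small simulation that has a global view of the ring, which a real node would not have. The ring, the function names, and the iterative style are assumptions made for illustration; this is not the ring of Figure 2.2 and not the simulator described later in the thesis.

```python
from bisect import bisect_left

def successor(sorted_ids, target):
    """First node identifier equal to or following `target`, wrapping past zero."""
    return sorted_ids[bisect_left(sorted_ids, target) % len(sorted_ids)]

def finger_table(node_id, sorted_ids, m):
    return [successor(sorted_ids, (node_id + 2 ** i) % (2 ** m)) for i in range(m)]

def in_interval(x, a, b, include_b=False):
    """True if x lies in the ring interval (a, b), or (a, b] when include_b is set."""
    if include_b and x == b:
        return True
    if a < b:
        return a < x < b
    return x > a or x < b  # the interval wraps past zero

def lookup(start, key_id, node_ids, m):
    """Return (node responsible for key_id, list of nodes that routed the request)."""
    ids = sorted(node_ids)
    node, path = start, [start]
    while True:
        succ = successor(ids, (node + 1) % (2 ** m))
        if in_interval(key_id, node, succ, include_b=True):
            return succ, path           # key falls between this node and its successor
        next_hop = succ                 # fallback if no finger precedes the key
        for finger in reversed(finger_table(node, ids, m)):
            if in_interval(finger, node, key_id):
                next_hop = finger       # last finger that precedes the key
                break
        node = next_hop
        path.append(node)

# Hypothetical 6-bit ring.
nodes = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]
owner, path = lookup(8, 54, nodes, m=6)
print(owner, path)
```

Each hop in the printed path is the last finger of the current node that still precedes the key, so the remaining distance to the key shrinks quickly with every hop.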

Figure 2.2. An example of the route taken by a lookup in a Chord network.

For a new node to join a Chord network, it needs to know of any one node that is already in the network. Finding a node that is already in the network is done out of band. The joining node then uses this node as a bootstrap node to perform a lookup on its own identifier. The node returned by this lookup will be the new node's successor in the Chord ring. The new node sends a message to its successor notifying it that the new node is now its predecessor, and the successor informs its former predecessor that the newly joined node is now its successor. The joining node then uses its successor to perform the appropriate lookups to fill in its finger table. Since nodes join and leave continuously, each node needs to periodically re-perform these lookups, using a method called fix_fingers(), in order to keep its finger table up to date.

2.3.2. Pastry

Pastry is in many respects similar to Chord. As with Chord, nodes and keys are hashed to integer identifiers that are placed on a ring and range between zero and the maximum hash output value. The identifier length used in [11] is 128 bits. Unlike Chord, the node responsible for a key is the node whose identifier is numerically closest to the key's identifier.

Nodes keep track of identifiers in base $2^b$, where b is a configurable parameter that usually has a value of 4. The first major difference between Pastry and Chord is the organization of each node's routing table. A Pastry routing table is arranged into $\log_{2^b} N$ rows, where N is the number of nodes in the system. Each row has $2^b - 1$ possible column entries. The entry at row i and column j is any node whose identifier shares the routing node's first i digits and has j as the next digit.

Each node also keeps track of additional nodes in a leaf set and a neighbor set. The leaf set contains L nodes, where L is another configuration parameter. The leaf set consists of the L nodes with the closest numeric identifiers to this node. Half of these nodes (L/2) precede this node on the ring and the other half succeed it. The neighbor set consists of the nodes that are physically (as opposed to numerically) closest to this node, where closeness is defined by a proximity metric such as ping time.

When routing a lookup request, the first place a node looks is its leaf set. If the identifier being sought lies between the identifier of the first node in the leaf set and that of the last node in the leaf set, then the node knows which node is responsible for the key and can forward the lookup request to its destination. If the identifier is not in range of the leaf set, the node uses its routing table instead. The lookup request is forwarded to the routing table entry whose identifier shares with the key the prefix that the key has in common with the current node, plus the key's next digit. If no such node exists, the lookup request is simply forwarded to the node closest to the destination from among all nodes in the routing table, leaf set, and neighbor set. The expected number of hops for a lookup to complete is $\log_{2^b} N$.

To join the network, a node sends a lookup request for its own identifier through a bootstrap node.
The joining node creates its routing table by copying, from each node along this route, the routing table row that node used in the routing process. Broken routing table entries are filled in reactively as the missing entries are detected.

The peers that a node can keep in its routing table are flexible in Pastry. Each routing table entry can be any node that meets the prefix requirement. This can be used to exploit locality, and each node can attempt to fill its routing table with the entries that offer the best performance. This is in contrast to Chord, where there is only one correct entry for each routing table row.

2.3.3. Content Addressable Networks (CANs)

A CAN is quite different from Chord and Pastry. In a CAN, node and key identifiers are points on a d-torus in a d-dimensional space, where d is a configuration parameter. Each node is responsible for a zone, which is a bounded area of the overall CAN space surrounding the node's identifier point. All keys that hash to points within a node's zone are the responsibility of that node.

For routing, each node needs to know only about the nodes with bordering zones. This is a fixed, small number of nodes. Nodes forward lookup requests to any node whose zone is closer to the destination. In many cases, this might be the node whose coordinates make the most progress toward the destination, but a routing node may also take into consideration a tradeoff between distance gained and locality. The number of hops required to reach the destination is $O(n^{1/d})$. Since d is fixed for any one CAN, this means that as network size increases, the number of hops on a CAN route increases faster than it does with Chord or Pastry.
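The difference in growth rates noted above can be made concrete with a short calculation. The functions below evaluate only the asymptotic orders $\log_2 n$ and $n^{1/d}$, ignoring the constant factors that each protocol's analysis includes, so the numbers indicate trends rather than actual hop counts; the function names and the chosen network sizes are illustrative assumptions.

```python
import math

def chord_hop_order(n):
    """Order of growth of Chord and Pastry lookups: log2(n), constants ignored."""
    return math.log2(n)

def can_hop_order(n, d):
    """Order of growth of CAN lookups with d dimensions: n**(1/d), constants ignored."""
    return n ** (1.0 / d)

for n in (2 ** 10, 2 ** 16, 2 ** 20, 2 ** 30):
    can_orders = [round(can_hop_order(n, d), 1) for d in (2, 4, 8)]
    print(n, round(chord_hop_order(n), 1), can_orders)
```

For a fixed d, the CAN term eventually overtakes the logarithmic term as n grows; a larger d postpones that point but requires each node to track more neighbors.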

3. Peer-to-Peer Protocol Security Issues and Related Work

Distributed hash tables face many security issues that Napster and Gnutella did not. Since the Napster directory was centrally managed, all of the security mechanisms for performing lookups needed to be in only one place. The central management, however, proved to be Napster's fatal weakness. With Gnutella, the lack of a network structure meant that all one needed to do was create overlay links to as many nodes as possible in order to be assured that flooded lookup attempts would actually be correctly sent through the system.

DHTs are both fully distributed and structured. Users must rely on other nodes in the system to follow the structured protocol correctly for the system to work. In the physical networks that make up the Internet, routers are controlled by trusted corporations and other entities that are unlikely and unable to attempt to attack the overall system. With an overlay network, on the other hand, end users control the virtual routers in the system. The relatively small size of an overlay network and the ability of a user to control multiple nodes in the overlay give attackers a great opportunity to compromise the system.

Survey papers have been written that enumerate the attacks available to a malicious user of a DHT system [12, 14]. These papers provide useful information that should be considered by anyone trying to secure a DHT. This section reviews those attacks and presents papers that have attempted to address them. We also discuss how those approaches motivate and differ from the approach taken in this thesis.

3.1. Data Attacks

A simple attack that can be performed on DHT protocols is an attack on the data stored in the system. An attacker can deny the existence of data that its nodes are responsible for and can modify any legitimate data it stores. The attacker can also introduce compromised data into the system. Data integrity is an application-level security issue.
The sole purpose of a DHT protocol is, given a key, to find the node responsible for that key. The behavior of that node after it is found is not the responsibility of the lookup protocol used to find it. However, DHT protocols can help in responding to these attacks once they are detected by associating multiple nodes with each key, a technique known as replication.

Replication is needed not only in case of an attack, but also in case of node failures. Many DHT protocols include replication features. In Chord, replicas are stored on the set of nodes that immediately succeed the node that the protocol specifies should store a given key. That way, if the responsible node fails, the node that becomes responsible for that key already holds the associated data at the time of the failure. Pastry stores replicas on the nodes closest to the responsible root node on the ring. CAN uses multiple hash functions to generate multiple identifiers for each key, and this results in a random distribution of replica identifiers throughout the network.

From a security standpoint, it can be argued, as is done by [14], that the replication approach taken by CAN is the most resistant to attacks. If replicas are all stored in a cluster of contiguous nodes, a malicious node in the area could potentially deny access to the entire replica set during the lookup process. By spreading the replica nodes out over the system with multiple hash functions (or by simply hashing the key multiple times with the same function), we can reduce the likelihood of an attack of this type being successful. This type of replication can be easily adapted to both Chord and Pastry, although doing so will result in more overhead.

Since an attacker can deny the existence of data that he or she should be responsible for, when performing a lookup we need to check multiple replicas to be sure that the data really does not exist. Looking up multiple replicas in systems using multiple hash functions can be done in parallel so that the process does not add significant waiting time to the lookup. Verifying the integrity of the received data is, again, outside the scope of a DHT protocol and outside the scope of this thesis.

3.2. Identifier Attacks

If an attacker can position the nodes he controls in the network in such a way as to control all replica nodes for a data item, then replication may be rendered ineffective. This type of attack is possible when nodes are allowed to choose their own identifiers. An attacker can simply compute the identifiers of all of the replicas of a key and create nodes with those identifiers. An attacker can also place nodes in strategic positions in order to force a victim to use the attacker's routers for all routing table entries. Any truly secure DHT protocol cannot allow nodes to choose their own identifiers.
Identifiers must be assigned in a secure and verifiable fashion. We also cannot allow a node to simply keep generating new identifiers quickly, as this would allow it to obtain identifiers near the keys it wishes to attack given enough time. One simple solution might be to force nodes to use the hash of their IP address as their identifier. This allows other nodes to easily verify the legitimacy of a node's identifier and to ignore messages from nodes that are not using the correct identifier. However, in some cases an attacker may have a large range of IP addresses at his disposal, especially if IPv6 is being used. In this case, the attacker could hash IP addresses until he finds one that is close to an identifier he seeks and then use that IP address.

Even when this is not the case, multiple users often run nodes behind the same NAT router and therefore share an IP address. We can hash the port that the P2P application is running on as well, but since users can choose their ports, this gives each user more available identifiers. In Chord, a node's identifier is the hash of its IP address, port, and virtual node number. Since each user can run many virtual nodes, this gives an attacker access to an even wider array of available identifiers.
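Both sides of this tradeoff can be sketched briefly: a peer can recompute the identifier a node ought to have, yet an attacker who is free to choose ports can still search for an identifier close to a target. The string encoding of address, port, and virtual node number, the function names, and the example address (taken from a range reserved for documentation) are assumptions made for illustration, not the encoding used by Chord.

```python
import hashlib

M = 160            # SHA-1 output length in bits
RING = 2 ** M

def expected_identifier(ip, port, virtual_node=0):
    """Identifier a node at this address should have (assumed encoding)."""
    material = f"{ip}:{port}:{virtual_node}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(material).digest(), "big")

def identifier_is_valid(claimed_id, ip, port, virtual_node=0):
    """A peer accepts a claimed identifier only if it matches the hash."""
    return claimed_id == expected_identifier(ip, port, virtual_node)

def closest_identifier_after(target_id, ip, ports):
    """Best (clockwise distance, port) obtainable at `ip` by choosing among `ports`."""
    return min(((expected_identifier(ip, p) - target_id) % RING, p) for p in ports)

ip = "192.0.2.10"
honest_id = expected_identifier(ip, 4000)
target = int.from_bytes(hashlib.sha1(b"some-key").digest(), "big")

one_port = closest_identifier_after(target, ip, [4000])
many_ports = closest_identifier_after(target, ip, range(4000, 5000))
print(identifier_is_valid(honest_id, ip, 4000))
print(many_ports[0] <= one_port[0])
```

Every additional port or virtual node number is another candidate identifier, so the more choices an attacker has, the closer to a targeted key it can place a node while still passing the hash check.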

The allocation of IP address blocks is technically centrally managed by ICANN. This means that any application using IP addresses over the Internet can never be truly fully distributed. Working solutions to the identifier assignment problem can be achieved if we are willing to give in to another centralized concept: certificate authorities (CAs). A certificate authority can take a public key from a user and bind it to a random identifier chosen by the certificate authority. Nodes can verify the authenticity of other nodes' identifiers by checking the CA's signature. This has the added benefit of providing a public key infrastructure that can be used for exchanging messages between peers. The disadvantage is, of course, that a CA is a single point of trust and a single point of failure.

An attack related to the identifier attack is the Sybil attack [4]. A Sybil attack is an attack where a single attacker joins a peer-to-peer network with numerous identities, giving that attacker control of a large portion of the network. If an attacker gains control of a large enough portion of the network, redundancy features that can be used to access denied or corrupted data can be rendered ineffective. An attacker who controls a large enough fraction of the network will be in control of almost all of the data in the overall system. The attacker will also be in control of most of the routers in the system, and can disrupt lookup requests travelling through the network.

This attack can occur when a system does not take measures to associate distinct entities with distinct identities. We would like each entity to be associated with a maximum of some small, constant number of identities. In a perfect world, we would also like identity assignment and verification to be performed in a completely distributed way, such as with a web of trust.
Unfortunately, [4] shows that no system that uses a fully distributed identity verification method will be completely invulnerable to a Sybil attack. While some papers have put forward distributed solutions that prevent Sybil attacks to a certain extent (for example, [2, 3]), the only solution that completely works is to use a central certificate authority. Trusted certificate authorities are proposed for DHTs by [14], [12], and [1]. Since it may be unreasonable to expect the trusted authority to verify real world identities, [1] proposes charging a fee to obtain a certificate to limit the number of identities that an attacker is willing to obtain.

Another idea put forward by some is to force nodes to solve puzzles. This idea is rejected by [1] and [4]. The argument against puzzles is that they must be easy enough for the slowest machines to be able to solve them in a reasonable amount of time and yet hard enough to prevent an attacker with a large amount of resources from obtaining many certificates quickly, which does not seem to be a reasonable requirement.

In this thesis, we are not proposing a defense against Sybil attacks. Our defense is for a system with some minority fraction of the nodes compromised. Our main goal is to prevent routing attacks in a system where an attacker (or group of attackers) manages to compromise a subset of the legitimate nodes in the system. We will assume some Sybil attack defense mechanism is in place, such as a certificate authority charging money for certificates. The centralized nature of the certificate authority is unfortunate, but unavoidable as shown by [4].

Routing Attacks

Building on Section 3.2, we will now assume that malicious nodes cannot choose their location in the overlay network and that an attacker cannot completely overwhelm the network by creating an unlimited number of identities. Even with these security mechanisms in place, an attacker controlling even a small fraction of randomly placed nodes can seriously disrupt the system. While these nodes can compromise the data they are supposed to be storing, replication allows us to find a good alternate node provided there are enough replicas. The real problem posed by the attacking nodes is their ability to compromise the DHT lookup routing protocol.

There are two ways to route through a DHT based peer-to-peer network: recursively and iteratively. Recursive routing means that a lookup request is sent from hop to hop through the overlay network until it reaches its destination, which can then respond either directly to the node performing the lookup or by sending a response backwards through the lookup's path. Iterative routing is when the node performing a lookup contacts each node on the route one by one and asks for the next hop towards the destination. The disadvantage of iterative routing is that we must send a query to and receive a response from every node on the route, so lookups take about twice as long as they do with the recursive method when the destination directly responds to the node performing the lookup. The advantage is that iterative routing gives the node performing a lookup complete control over the routing process.

Both recursive and iterative routing can be compromised if a malicious node is encountered on the path to a lookup's destination. A malicious node can drop the lookup request, forward it to the wrong node, or respond with the wrong destination.
With iterative routing, we are also vulnerable to an attack where malicious nodes can keep sending us from one incorrect malicious node to another indefinitely without ever reaching the destination. With recursive routing, this indefinite routing attack would be treated the same way as a dropped packet, as the lookup request was sent out and no response was received. It is important to note that since all DHTs have to be fault tolerant, they all must deal with dropped lookup requests to an extent, as dropped lookup requests will occur occasionally with non-malicious node failures.

A malicious node may choose not to behave the same way at all times or when handling lookup requests from different parts of the system. While a normal node that has failed would be removed from the network, a malicious node can behave just well enough to remain in the network and then drop all lookup requests that it receives.

A lookup request needs to reach only one malicious node before the lookup is compromised. If the average hop count is $h$ and the fraction of malicious nodes is $f$, then the probability of a route not containing any malicious nodes is $(1-f)^h$ [14]. In Chord, the average hop count in an $n$-node network is approximately $\frac{1}{2}\log_2 n$. With a 1,000 node network, this means we can expect an average hop count of around 5. If 25% of nodes in the system are compromised, the probability of a lookup request avoiding any malicious node is $(1-0.25)^5 = 0.75^5 \approx 0.24$, which is 24%. So in this case, an attacker only needs to control 25% of nodes to disrupt 76% of lookups.

The effects of a routing attack may be exacerbated in systems that do not have a constrained routing table. A constrained routing table is a table where each entry only has one possible correct value. Chord has a constrained routing table. For a node with identifier $n$, the only correct entry for finger table entry $i$ is the node that succeeds the value $(n + 2^i) \bmod 2^m$ on the identifier ring. Pastry, on the other hand, does not have a constrained routing table. For a particular routing table entry in Pastry, any node that meets the prefix requirements is a valid node for that entry. Pastry tries to fill these entries with the matching nodes that have the best locality measurement in order to optimize performance. This can allow attackers to fake locality and increase the odds that their nodes are used as routing table entries by others, as shown by [1]. Also shown by [1] is the fact that it is easier with Pastry for an attacker to provide malicious nodes as routing table updates, especially for the top rows, since it is likely that an attacker that has control of any significant fraction of the system will control at least one node for each short prefix.

Aside from the routing tables being constrained, the routing table entry selection should be constrained as well, as pointed out in [12]. Otherwise, a malicious node can simply use only the malicious nodes that appear in its constrained routing table when routing. CAN, for example, allows each node to decide which node to route to next based on a tradeoff between progress towards the destination and round trip time to the next hop.
Since next hop selection is not constrained, we cannot verify that our routing request is being routed correctly.

There are several design principles proposed by [12] for securing DHT protocols. The first of these is to define verifiable system invariants and verify them, and the second is to allow the node performing a lookup to observe lookup progress. The idea here is to use constrained, iterative routing. We should verify that those constraints are being met as we are routing. This is one of the major principles behind the proposals that we are making in this thesis. We propose here a way of verifying system invariants and of reacting to situations in which those invariants are not met.

One solution for avoiding routing attacks is proposed in [1]. This is a solution that works with the Pastry DHT. The goal is to successfully retrieve a set of replicas for a given key, where the replicas are a subset of the neighbor set for the root node responsible for a key. This is a contiguous set of nodes. A node performing a lookup will use its own neighbor set to compute the average numerical distance between node identifiers in the identifier ring. This value is then compared to the average distance between node identifiers in the replica node set that is returned from a lookup request. If the average distance between identifiers in the replica set is too large compared to our own computed average, then it is determined that a malicious replica set was received.
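The density comparison attributed to [1] above can be sketched in a few lines of Python. This is a rough illustration, not the test as specified in [1]: the 16-bit ring size, the function names, and the tolerance factor that decides what counts as "too large" are all assumptions introduced here.

```python
RING_SIZE = 2 ** 16  # assumption: a small ring, chosen only for readability


def average_spacing(identifiers, ring_size=RING_SIZE):
    """Mean gap between consecutive identifiers of a contiguous node set.

    The set may wrap around zero, so the largest gap on the ring is treated
    as the space outside the set and excluded from the average.
    """
    ids = sorted(set(identifiers))
    if len(ids) < 2:
        raise ValueError("need at least two distinct identifiers")
    gaps = [
        (ids[(i + 1) % len(ids)] - ids[i]) % ring_size
        for i in range(len(ids))
    ]
    covered_arc = ring_size - max(gaps)
    return covered_arc / (len(ids) - 1)


def replica_set_is_suspicious(own_neighbors, replica_set, tolerance=2.0):
    """Flag a replica set whose nodes are much sparser than our own neighbors.

    `tolerance` is a hypothetical threshold, not a value taken from [1].
    """
    return average_spacing(replica_set) > tolerance * average_spacing(own_neighbors)
```

A set made only of colluding nodes tends to be sparser than a genuine neighborhood, because the attackers control only a fraction of the identifiers in any region of the ring. That sparseness is what the comparison tries to detect.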

If a node performing a lookup determines that a replica set is malicious, numerous lookup requests are then sent through the node's neighbor set. These neighbor nodes will use a separate, constrained routing table to route the lookup requests through the network. The original Pastry protocol does not use constrained routing tables, so a separate table is kept. Since each node's constrained routing table is different and not based on performance metrics, when a lookup request is sent through different neighbors it should take a diverse set of routes towards the destination. The replica sets received in response are combined, and all of these nodes are contacted and asked to provide their neighbor set. Any new nodes found are then asked to provide their neighbor set as well, and as long as new neighbor nodes are provided, this process is repeated up to three times. When this is completed, the closest nodes found to the key's identifier are determined to be the correct replica set. This method was shown to find the correct replica set over 99.9% of the time when up to 30% of the nodes in a 100,000 node system are compromised.

Our proposed system also makes use of the average node identifier distance in the network, but it differs in that we use this information to actively avoid routing attacks during the routing process instead of reacting to them when they are detected after the lookup request has been completed. We use the average node identifier distance to verify system invariants as we are observing the lookup process, following two principles proposed by [12]. These ideas are explicitly rejected for Pastry by [1], which claims that this would add too many extra hops and not be very accurate. Our results with Chord, however, show a significant increase in routing success over the base Chord protocol. [1] relies on performing a large number of parallel lookups for the same identifier to find the responsible replica set, while our system does not.
A mechanism for defending against a Byzantine join attack is proposed by [5]. This proposed system, called S-Chord, modifies Chord routing tables to make use of swarms instead of individual peers. Swarms consist of the set of all nodes that are within $(C \ln n)/n$ of the swarm's point location, where $C$ is a configurable constant and $n$ is the number of nodes in the system. Each lookup request is forwarded from all of the nodes in one swarm to another. If the number of Byzantine peers joining in a time period is below some configurable threshold, then the correct successor swarm for a key may be found with high probability. This defense mechanism, however, requires that each node keeps track of $O(\log^2 n)$ nodes and that each lookup request consists of $O(\log^2 n)$ messages.

Another mechanism for secure routing with Pastry is proposed by [8]. The idea is to move untrusted nodes into separate Pastry rings that interface with the main Pastry ring via two anchor nodes. Messages going through the main Pastry ring can then bypass the untrusted rings. The mechanism for deciding which nodes should be trusted and which should not is left as future work. With the ability to perfectly detect untrustworthy nodes, the percent chance of a lookup request completing successfully is equal to the percentage of nodes in the system that are trustworthy. An actual trust system that can be used is not provided by [8], however, and creating one is an open problem. Our proposed modifications to Chord do not rely on any trust system.

4. Chord Secure Routing Design

This section describes the changes we are making to the Chord protocol in order to avoid routing attacks.

Threat Model

In order to design a defense, we first need to understand the attacks we are defending against. The purpose of this thesis is to propose a method of avoiding routing attacks. We will assume that attackers cannot choose their node identifiers. This can be done by using a certificate authority as shown in Section 3.2. We will assume that some fraction of the overall set of nodes is compromised and that all of these nodes can collude with each other. We will assume that the attacking nodes are a minority of nodes in the system. This is an important assumption: we are not designing a defense against Sybil attacks.

Here are the basic capabilities that we will assume an attacker has:

- Attackers can drop lookup requests.
- Attackers can forward lookup requests to incorrect nodes.
- Attackers can direct lookup requests to other malicious nodes in any manner they wish.
- Attackers can be selective in which lookup requests they respond to correctly and which they do not.

Basically, an attacker can do anything they want with a lookup request that is sent to their node. To evaluate the performance of this system, we will test against three different types of attacks. These attacks are designed to represent the most effective lines of attack available to a malicious node.

Dropping Lookup Requests. This is a simple type of attack. When a malicious node performing this type of attack receives a lookup request, the node simply does not respond. The system must be designed to recover from nodes that drop lookup requests.

Randomly Misrouting Lookup Requests. With this type of attack, the attacker does not drop the lookup request but instead tries to send the victim to some random next hop. This may be another misrouting node that sends the victim off in yet another random direction, preventing the victim from ever reaching the destination.
This is more difficult to defend against than the lookup dropping attacker, since it may appear that the attacking node is actually cooperating by giving a next hop.

Performing a Sub-Ring Attack. In this type of attack, a group of attackers collude to try to cause lookup requests to end up at a malicious node. Each attacker has two finger tables: one is the correct finger table that it uses for itself, and the other is a finger table that consists of the first succeeding malicious node of each node in its correct finger table. When an attacker receives a lookup request, that lookup request is forwarded using the malicious node finger table. The lookup request is therefore captured by the attackers, will only be forwarded through malicious nodes, and will reach a malicious destination, which will often not be the correct destination. This attack is the most difficult to detect since it appears that the attackers are cooperating and routing correctly. Each hop will make progress to the destination, but the ultimate destination will always be a compromised node.

Design Overview

The main idea behind the proposed system is to use locally known statistical data about the average numerical difference between consecutive node identifiers to detect routing attacks during the routing process and to recover from detected attacks. We store the identifiers of the successors and predecessors of nodes in our finger table for the purpose of computing the average numerical distance between node identifiers. We use a pruning mechanism to remove distance samples that are likely the result of malicious nodes in our finger table. As we are routing towards our destination, we use the computed average distance to determine whether the hops we encounter are likely valid or invalid based on their distances from routing table reference points called finger pointers. When an invalid hop is detected, we backtrack to the previous node on the route and request a next hop that makes less progress towards the destination in an attempt to avoid the node that provided us with the invalid hop.

In order to control the routing process, we will use iterative routing as opposed to recursive routing. A recursive lookup occurs when a node performing a lookup sends the request out into the network and lets the other nodes forward it towards its destination. This gives the user no control over the route their lookup request takes. With iterative routing, the node performing the lookup contacts each node on the route towards the destination one by one and requests the next hop. This gives the node performing the lookup control over the routing process.
Finger pointers are the identifier values that a node looks up in order to fill in a finger table entry. For a node with identifier $id$ and finger table row $i$, the value of the finger pointer is $id + 2^i$. Since a finger table entry's pointer clearly falls somewhere between two nodes, the difference between the pointer and the identifier for the node in the entry should be less than the distance between two nodes. We will compare this distance to the average distance between nodes that we will compute from our own finger table. If the distance is too large, the hop fails verification. We backtrack around nodes that either do not respond to our requests or provide a hop that fails verification.

The Backtracking Algorithm

Normally, the hop we take during the Chord routing process is the hop that most closely precedes the identifier of the key we are seeking. In our system, when a faulty node is detected during the routing process, we will fall back to the previous node on the route and use its next closest preceding node to the destination. This offers less progress, but gives us a way to route around the faulty node. If a node runs out of hops to give because it has no more nodes in its routing table that precede the destination, we will fall back to the previous node on the route, use its routing table, and repeat the process. All nodes that are determined to be faulty or malicious, or that have run out of hops to give, will be stored in a temporary blacklist that is created for each lookup request, so that we never use these nodes again during the lookup request. To prevent the need to query the same node multiple times, we request the entire finger table of a node the first time it appears on the path towards the destination and cache it for the duration of the lookup attempt.

There is a limitation in the Chord protocol that we must address. With Chord, every lookup request must be routed through the node that immediately precedes the node responsible for the key being sought. This means that if the node preceding the destination node is faulty, backtracking by itself cannot find a way around this node. To address this, each node is made aware of the identifier of the predecessor of every node in its finger table, and this identifier is stored as an extra column in the finger table. As long as we can find a non-faulty node with the destination node in its finger table, we can identify that node as the destination node by verifying that the key identifier we are seeking falls between that node's identifier and its predecessor's identifier. This allows us to bypass faulty nodes immediately preceding the destination. Since nodes often have many finger table entries for nearby identifiers, as we get closer to the destination we have a good chance of being able to find a node with the correct destination in its finger table, allowing us to bypass faulty nodes preceding the destination.

We need to be careful about the circumstances under which we bypass nodes on the route to the destination.
If we request a node's finger table and see that it contains a reference to what appears to be the destination node, we might be tempted to use this reference and bypass the rest of the routing process. The issue with this method of bypassing is that nodes further away from the destination are more likely to have out of date successor information: the destination may have changed before that node has called its fix_fingers() method. Therefore, we only want to use bypassing as a last resort. We will not immediately bypass the rest of the nodes on a route when a node on the route knows of the destination. Instead, we will only bypass if that node has run out of any other hops that we can use.

We illustrate the concept of backtracking and bypassing in Figure 4.1. In this situation, the first and second hops have passed verification. The node reached during the second hop has a reference to the destination node in its finger table, which we have verified by checking to see that the identifier of the key we are looking for falls between that node and its predecessor. We do not immediately bypass, and instead route to the closest preceding node (hop 3). Hop 3 provides a next hop that fails verification, so we fall back and ask for the second closest preceding node (hop 4). Hop 4 also provides a next hop that fails, so we fall back again and try to use the third closest preceding node to route towards the destination (hop 5), which again provides a next hop that fails verification.

This time when we fall back, the node we are using has no more preceding hops, so we now bypass and reach the destination with hop 6.

Figure 4.1. An illustration of backtracking and bypassing.

The modified closest_preceding_node() algorithm is shown in Figure 4.2. Since we are not always returning the closest preceding node, we have renamed this algorithm next_hop(). This function takes four input variables. The input variable id is the identifier being sought. The input variable index specifies which preceding node we want to obtain from the finger table. An index value of 1 means we want to find the closest preceding node, a value of 2 means find the second closest preceding node, and so on. The index variable will have a value greater than 1 when we are trying to route around the node that was the closest preceding node because it either provided a hop that failed verification or ran out of preceding hops that we could use. The input variable nodeid is the identifier of the node that we are obtaining the next hop from, and the input variable fingertable is the finger table of that node. The local variable uc in next_hop(), which is short for unique count, is the number of unique preceding nodes that have been counted while looking for the index-th closest preceding node.

The first thing the next_hop() algorithm does is check to see if the identifier we are looking for falls between the node we are using and its successor. If it does, and the value of index is 1, then we can return the successor of the node we are using, which is the destination node. If index is greater than 1, it means we are trying to find a preceding node further away than this node's successor, which is impossible, so we return null.

n.next_hop(id, index, nodeid, fingertable)
    uc = 0
    if id in range (nodeid, fingertable[1]]:
        if index == 1:
            return fingertable[1]
        else:
            return null
    bypassnode = null
    for i = m down to 1:
        if id in range (fingertable[i].predecessor, fingertable[i]]:
            bypassnode = fingertable[i]
        if fingertable[i] in range (nodeid, id]:
            if (i == m or fingertable[i] != fingertable[i+1]):
                uc = uc + 1
            if (uc == index):
                return fingertable[i]
    return bypassnode

Figure 4.2. Revised closest preceding node algorithm.

The next step is to go through the finger table, from the last entry down to the first entry, just as you would in the unmodified closest_preceding_node() algorithm described in [13]. If we look at a finger table entry and determine that the identifier we are looking for falls between it and its predecessor, then we will save a reference to that node and return it as the destination node later only if we cannot find any other valid next hop. As soon as we find a node that precedes the identifier we are looking for, we will start increasing our unique counter (uc). We only increase the unique counter after that when a finger table entry is different from the finger table entry that follows it. Once the unique counter is equal to the index we are looking for, we can return the finger table entry that we were checking. If we get all the way through our finger table and determine that we do not have enough entries for the requested index, then we return the saved bypass node if there is one, and null otherwise.

A return value of null indicates to the caller that this node has no more next hops to give. A return value that precedes the requested identifier indicates to the caller that the returned node is the next hop to the destination. A return value that succeeds the requested identifier indicates to the caller that the returned node is the destination that is responsible for the key being sought.

The next algorithm we will describe, shown in Figure 4.3, is the main routing algorithm, a modified version of find_successor().
This algorithm makes use of the next_hop() algorithm that was already described and of a verify_hop() algorithm that will be described later.
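As a concrete reading of the next_hop() procedure described above, here is a minimal Python sketch. It assumes a finger table stored as a zero-indexed list (so the successor is entry 0 instead of entry 1), entries that carry only a node identifier and that node's predecessor identifier, and half-open ring intervals of the form (a, b]. The Finger type, the in_half_open helper, and the explicit ring parameter are names introduced for this illustration only.

```python
from typing import NamedTuple, Optional, Sequence


class Finger(NamedTuple):
    node: int         # identifier of the node stored in this entry
    predecessor: int  # identifier of that node's predecessor


def in_half_open(x: int, a: int, b: int, ring: int) -> bool:
    """True if x lies in the ring interval (a, b]; (a, a] is the whole ring."""
    if a == b:
        return True
    return 0 < (x - a) % ring <= (b - a) % ring


def next_hop(key: int, index: int, node_id: int,
             fingers: Sequence[Finger], ring: int) -> Optional[int]:
    """Return the index-th closest unique preceding node for `key`, the
    destination if it can be identified, or None if no hop is left."""
    successor = fingers[0].node
    if in_half_open(key, node_id, successor, ring):
        # The successor is the destination; there is nothing further away.
        return successor if index == 1 else None

    bypass = None   # destination candidate, used only as a last resort
    unique = 0      # unique preceding nodes counted so far
    last = len(fingers) - 1
    for i in range(last, -1, -1):
        entry = fingers[i]
        if in_half_open(key, entry.predecessor, entry.node, ring):
            bypass = entry.node
        if in_half_open(entry.node, node_id, key, ring):
            # Count an entry only if it differs from the entry above it.
            if i == last or entry.node != fingers[i + 1].node:
                unique += 1
            if unique == index:
                return entry.node
    return bypass
```

For example, with a 64-identifier ring and a node 8 whose finger entries point at nodes 14, 14, 14, 21, 32, and 42, a lookup for key 54 yields 42, 32, 21, and 14 for index values 1 through 4, and None for index 5.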

Our find_successor() algorithm keeps track of a stack of Chord nodes that are used during the routing process, called the routing stack. We will use the word router to describe a node whose finger table is being used for routing purposes. This stack contains the identifier, finger table, and current backtracking index of every router on it. We show this as a stack of <id, fingertable, index> tuples that represent these three pieces of information. We also keep track of a blacklist, which is a list of routers that have either given us hops that failed verification or that have run out of nodes in their finger table to give us as next hops. We also keep track of the number of hops so far using the variable attempts, and allow the user to input a limit on the maximum number of hops that will be used. The first thing pushed onto the routing stack is the routing table information for the node that is performing the lookup.

n.find_successor(id, hoplim):
    routerstates = new STACK of <id, fingertable, index> tuples
    blacklist = new LIST of nodes
    attempts = 0
    routerstates.push(<n.id, n.fingertable, 0>)
    while !routerstates.isempty() and attempts < hoplim:
        currouterstate = routerstates.pop()
        currouterstate.index++
        nexthop = n.next_hop(id, currouterstate.index, currouterstate.id, currouterstate.fingertable)
        if (nexthop != null and verify_hop(currouterstate.id, nexthop.id) and !routerstates.contains(nexthop)) or currouterstate.id == n.id:
            if id in range (n.id, nexthop.id):
                return nexthop
            else if blacklist.contains(nexthop.id):
                routerstates.push(currouterstate)
            else:
                routerstates.push(currouterstate)
                routerstates.push(<nexthop.id, nexthop.fingertable, 0>)
                attempts++
        else:
            blacklist.add(currouterstate.id)
    return null

Figure 4.3. The algorithm for finding the successor node of an identifier.

Our main routing loop will run as long as there are routers still available in the routing stack and as long as we have not exceeded the maximum number of hops.
During each iteration, we pop the top router off the stack, increment its backtrack index, and perform a lookup with its information. The first time a router is popped off the stack, its index value will be set to 1, meaning our next_hop() algorithm will use the closest preceding node in that router's finger table. If it is ever popped off the stack for a second time, that value will be set to 2, meaning next_hop() will return the second unique preceding node in that router's finger table. This is how backtracking to route around problem nodes is performed.

Once we have performed a next hop lookup, we make sure the lookup actually returned a next hop, use verify_hop() to check that the hop is valid, and check that the next hop is not already on the routing stack. If it passes these tests (and we allow our own node to always pass these tests), then we perform another test. If the node returned succeeds the identifier we are seeking, then it is the destination node, and we return it. If the node returned is on the blacklist, then we simply push the router we were using back onto the stack, and during the next iteration its index will be increased and a prior finger table entry will be used. If the next hop is not the destination and is not on the blacklist, then we will contact that node and request its finger table. We then add this next hop to the routing stack, with an initial index of 0.

If a hop fails verification, then we add the router that provided the hop to the blacklist. We do nothing to the routing stack. The previous router on the path will be used on the next iteration, and it will be used with a higher index value, and thus we will use a finger table entry that offers less progress. If our routing stack ever empties or we exceed the maximum number of hops, the lookup has failed and we return null to indicate that this has happened.

Figure 4.4. A sample route using the modified algorithm.

Figure 4.4 shows an example of how routing works in the modified system. In this example, the dark nodes are behaving and the white nodes are compromised. The source node and the node responsible for the key being sought are both labeled. The first hop uses the closest preceding node in the source node's finger table. This node then provides a reference to a next hop that fails the verification algorithm. Our algorithm will then go back to the source node's finger table and call next_hop() again with an index value of 2, which returns a reference to the second closest preceding node, shown as hop 2. The rest of the routing process completes without incident and the destination node is found.

The Hop Verification Algorithm

As you will recall, a finger table is a table with $m$ entries, where $m$ is the length in bits of the identifiers used in the network. Entry number $i$ in the finger table points to the first node that succeeds the value $(id + 2^i) \bmod 2^m$; we call this the pointer value for entry $i$. We know that the pointer value points to a space in the Chord ring that falls between the finger table entry and its predecessor. In order to verify that a hop is legitimate, we verify that the distance between a finger table pointer and the identifier of the node for that entry falls within what we would expect the typical range to be between nodes in the Chord ring. This typical value is something we compute by averaging together locally known distance values that we obtain from the data stored in our finger table, which is described in the next subsection. The verify_hop() algorithm is shown in Figure 4.5.

n.verify_hop(firstnodeid, secondnodeid, indexused)
    fingerpointer = (firstnodeid + pow(2, indexused)) % pow(2, m)
    distance = (secondnodeid - fingerpointer) % pow(2, m)
    acceptabledistance = AVG_DISTANCE + (sd_mod * STD_DISTANCE)
    if (distance > acceptabledistance):
        return false
    else:
        return true

Figure 4.5. The verify_hop() algorithm.

The input parameters are the identifier of the node that gave the hop, the identifier of the node at the end of the hop, and the index we used to look up this hop in the routing table of the node providing the hop.
The distance variable that is computed is the difference between the identifier of the end node of the hop and the finger pointer that is used to point to it from the node that provided the hop. If this value is higher than the value we wish to accept, then we reject the hop; otherwise we accept it. The acceptable distance is computed using three variables. AVG_DISTANCE is the average distance between nodes computed from locally known distance samples. STD_DISTANCE is the standard deviation of the distance samples. The sd_mod parameter is a system parameter that controls how many standard deviations over the average we wish to allow a node/finger pointer distance to be. It provides a way to balance false positives against false negatives. The acceptable distance is, as shown, the average distance plus the standard deviation scaled by sd_mod.
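The check in Figure 4.5 can be sketched in Python as follows. This is a minimal illustration and not the simulator's Java code: the statistics and the identifier length are passed in as explicit arguments (avg_distance, std_distance, sd_mod, m) instead of being read from node state, and the example values are made up.

```python
def verify_hop(first_node_id, second_node_id, index_used,
               avg_distance, std_distance, sd_mod, m):
    """Return True if the hop lands acceptably close to the finger pointer.

    Assumed parameter names: avg_distance and std_distance stand for
    AVG_DISTANCE and STD_DISTANCE, and m is the identifier length in bits.
    """
    ring_size = 1 << m
    # Pointer value for entry `index_used` of the node that provided the hop.
    finger_pointer = (first_node_id + (1 << index_used)) % ring_size
    # Clockwise distance from the pointer to the node returned for that entry.
    distance = (second_node_id - finger_pointer) % ring_size
    acceptable_distance = avg_distance + sd_mod * std_distance
    return distance <= acceptable_distance


# Made-up example on an 8-bit ring: node 10, finger index 4, pointer 26.
near_hop_ok = verify_hop(10, 30, 4, avg_distance=5, std_distance=5, sd_mod=1, m=8)
far_hop_ok = verify_hop(10, 60, 4, avg_distance=5, std_distance=5, sd_mod=1, m=8)
```

Taking the difference modulo the ring size means a node that sits just before the pointer is treated as almost a full ring away, so it is rejected rather than accepted as a short hop.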

By forcing the next hop to fall within an acceptable distance of the routing node's finger table pointer value, we are tightly restricting where in the Chord ring the next hop may be. If an attacker does not have control of any nodes in that acceptable area, the attacker cannot fool us into using another attacking node as the next hop. Since nodes cannot arbitrarily place themselves wherever they wish in the Chord ring, it becomes much more difficult for a malicious node to have our lookup request forwarded to any node except the correct next node on the route.

Maintaining Statistical Data

In order to compute the average distance between nodes in the system, we store some extra information about each node in our finger table. Figure 4.6 shows what a row in our finger table looks like. The index, node identifier, and node remote reference are columns that appear in the normal Chord protocol's finger table. We also store the identifier of the node's predecessor and the identifiers of the nodes in its successor list. Knowing these identifiers allows us to generate samples for computing the average distance between nodes and the standard deviation of those distances. All of this data is obtained from a node when we call our fix_fingers() method to update a finger table entry.

Index | Node Identifier | Node Remote Reference | Node Predecessor Identifier | Node Successor List Identifiers

Figure 4.6. Row format for the modified finger table.

It is possible for a node to lie to us about the identifiers of its predecessor and successors if we do not take extra precautions. However, since we are operating under the assumption that nodes are not allowed to decide their own identifiers and that identifiers are verifiable, we can make sure that the provided identifiers are valid by storing some extra information.
For example, if a certificate authority is being used, we can store signed node certificates, which contain a node's identifier, IP address, public key, and the signature of the granting authority. We also need a mechanism to prevent a node from using certificates of nodes that are not currently in the network. This can be done by requiring that each node occasionally request the current date and time from its predecessor and successors, signed with their secret keys. This data can be requested by nodes when they update their finger tables, and the signature can be verified with the signing node's public key, allowing us to confirm that the node has recently been in the network. In the worst case, we can at least ping each of these nodes occasionally to make sure they really are in the network. Even with these protection mechanisms in place, the attacker can still provide real node certificates for nodes that are not actually its predecessor and successor. This would cause the victim to calculate an average node distance that is higher than the actual

average that should have been calculated. To prevent this type of attack, we prune our data set.

The distance between node IDs on the Chord ring is exponentially distributed, as it is for any ring-based DHT system with randomly assigned node IDs [1]. A useful property of exponential distributions is that the average difference of consecutive values is equal to the standard deviation of those differences. This means that the average distance between nodes and the standard deviation of these distance values should be approximately equal. Since a malicious node can provide successor and predecessor identifiers that are not consecutive to its own identifier, we would expect the distances derived from malicious nodes to be greater than they really are, which would cause our computed standard deviation to be greater than our computed average.

We prune our sample set by throwing out the highest values until the computed standard deviation is close enough to the average. We do this with another system parameter, called the pruning parameter. We remove the highest distance sample values until the standard deviation of what is left is no greater than the average distance scaled by the pruning parameter. This algorithm is shown in Figure 4.7.

n.calculate_statistics(distancesamples (LIST of <BigInt>), pruningparameter)
    done = false
    average = stdeviation = 0
    distancesamples.sortascending()
    while (!done):
        average = AVG(distancesamples)
        stdeviation = STDEV(distancesamples)
        if stdeviation > average * pruningparameter:
            distancesamples.remove(distancesamples.size() - 1)
        else:
            done = true
    return (average, stdeviation)

Figure 4.7. The calculate_statistics() algorithm

The justification for throwing out the largest distances when pruning (to attempt to get an average and standard deviation that are reasonably close) is that a node can lie and provide nodes that are further away than its actual successors and predecessor, but it cannot lie about a node that is too close, because no node exists that is closer than the correct successors and predecessor. The effectiveness of pruning is evaluated later in this thesis.

Joining the Network

In order to successfully join the Chord network in the presence of attackers, we need to make a few changes. We must know of some set of uncompromised bootstrap Chord nodes that are already in the system in order to join it securely. If we try to join with

compromised bootstrap nodes, they can simply put us into any Chord network they would like. These nodes must be found out of band, and this is the only situation in which we require that trust exist between nodes. We use a set of bootstrap nodes instead of a single node because, even with our security mechanisms in place, some incorrect lookups occur in the presence of attackers. In order to populate our initial finger table, we ask each bootstrap node to perform a lookup of our finger pointer identifiers (nodeid + 2^i for i from 0 to m-1). For each finger table entry, we use the returned node whose ID is closest to the finger table pointer for that entry. The reason for this is, again, that nodes can lie about nodes that are too far away, but cannot lie about nodes that are too close.

Updating Finger Table Entries

The next issue we have to deal with is the case where a finger table update call (fix_fingers()) gets an incorrect update for a finger table entry. To avoid this, we modify fix_fingers(). If we receive a new finger entry that is closer than the old one to the finger pointer during an update, we accept it and make the change. However, if a new finger is further away than the old one, then we check the old entry and the nodes in its successor list to make sure that all nodes with identifiers preceding the new node's identifier have actually left the network. If none of these nodes are in the network, we accept the update; otherwise we reject it and use the closest succeeding node that we were aware of as the finger table entry.
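The two selection rules above (prefer the candidate closest to the finger pointer when joining, and accept a farther replacement only if every known nearer node is gone) can be sketched as follows. This is an illustration under stated assumptions, not the simulator's code: the function names, the is_alive callback (for example, a ping), and the choice to fall back to the nearest known node that is still alive are all hypothetical.

```python
def clockwise_distance(start, end, m):
    """Distance travelled clockwise on a 2^m ring from start to end."""
    return (end - start) % (1 << m)


def choose_initial_finger(node_id, i, candidates, m):
    """From the bootstrap nodes' answers, keep the node closest after the pointer."""
    pointer = (node_id + (1 << i)) % (1 << m)
    return min(candidates, key=lambda c: clockwise_distance(pointer, c, m))


def accept_finger_update(pointer, old_finger, old_successors, new_finger,
                         is_alive, m):
    """Return the identifier to store for this finger table entry."""
    new_dist = clockwise_distance(pointer, new_finger, m)
    if new_dist <= clockwise_distance(pointer, old_finger, m):
        return new_finger  # a closer (or equal) candidate is always accepted
    # Farther candidate: every known node that precedes it must have left.
    known = [old_finger] + list(old_successors)
    preceding_alive = [n for n in known
                       if clockwise_distance(pointer, n, m) < new_dist
                       and is_alive(n)]
    if not preceding_alive:
        return new_finger
    # Assumption: fall back to the nearest known node that is still alive.
    return min(preceding_alive, key=lambda n: clockwise_distance(pointer, n, m))


# Made-up example on an 8-bit ring; node 10's entry 4 has pointer 26.
alive = {30, 40, 50}
first_finger = choose_initial_finger(10, 4, [30, 60, 200, 20], m=8)
kept = accept_finger_update(26, 30, [40, 50], 45, lambda n: n in alive, m=8)
```

Both functions measure distance clockwise from the pointer, so a candidate that lies just behind the pointer counts as far away and is never preferred.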

5. Simulator Design

In order to test and evaluate the proposed changes to the Chord protocol, we built our own Chord simulator from scratch. The simulator is written in Java and requires Java version 1.5 or later to compile and run. Individual tests can be set up and run with an included GUI utility, and batches of tests can be written in Java by using the ChordController class. The source code is available online.

Our simulation works in an iterative fashion rather than being multithreaded. Running a separate thread for each node is not reasonable with large Chord networks, and we are able to achieve the same functionality with an iterative simulation. Each node has a tick() method that is repeatedly called by the simulator, and the tick() method calls the functions that need to be called periodically for the Chord protocol to run correctly.

The original Chord protocol as described in [13] is implemented in the class ChordNode. Three malicious node classes have been written that extend ChordNode to test the effectiveness of three types of attacks against the unmodified Chord protocol. The first is MDropperChordNode, which is a node that drops all lookup requests. The second is MRandomChordNode, which is a node that works with other MRandomChordNodes to forward lookup requests around the Chord ring randomly without ever reaching the final destination. The last is MColludingChordNode, which is a node that colludes with other MColludingChordNodes to form a sub-ring of the main Chord ring in order to try to capture lookup requests and forward them through the sub-ring.

The changes that we propose are implemented in the class SecureChordNode, which extends ChordNode. The same types of attacks have also been implemented for the modified protocol in order to test their effectiveness against our changes, and the nodes that perform them are called MSDropperChordNode, MSRandomChordNode, and MSColludingChordNode.
These classes all extend SecureChordNode and implement the same attacks as the nodes that extend the base ChordNode class. This class structure is illustrated in Figure 5.1. It should be noted that the nodes of the default protocol are not compatible with the nodes of the extended protocol. ChordNodes cannot be mixed with SecureChordNodes due to protocol differences.
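The iterative, single-threaded design and the use of subclasses to model misbehaving nodes can be pictured with a small sketch. It is written in Python for brevity, although the simulator itself is Java, and every name in it (SimNode, DroppingSimNode, run_simulation, the maintenance interval) is a hypothetical stand-in, not part of the simulator's API.

```python
class SimNode:
    """Hypothetical stand-in for a simulated node (not the simulator's ChordNode)."""

    MAINTENANCE_INTERVAL = 3  # assumed: run periodic work every third tick

    def __init__(self, node_id):
        self.node_id = node_id
        self.ticks = 0
        self.maintenance_runs = 0

    def tick(self):
        # Called once per simulation step in place of a per-node thread.
        self.ticks += 1
        if self.ticks % self.MAINTENANCE_INTERVAL == 0:
            self.run_maintenance()

    def run_maintenance(self):
        # Placeholder for periodic protocol calls such as fix_fingers().
        self.maintenance_runs += 1

    def handle_lookup(self, key):
        # Placeholder: a well-behaved node answers with its own identifier.
        return self.node_id


class DroppingSimNode(SimNode):
    """Illustrative malicious subclass that drops every lookup request."""

    def handle_lookup(self, key):
        return None


def run_simulation(nodes, steps):
    """Advance every node by one tick per step, in a fixed order."""
    for _ in range(steps):
        for node in nodes:
            node.tick()


nodes = [SimNode(1), DroppingSimNode(2)]
run_simulation(nodes, 6)
```

Because the attack lives entirely in an overridden method, well-behaved and malicious nodes can share one scheduling loop, which mirrors how the malicious classes extend the base node class.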

Figure 5.1. An overview of the various Chord node classes

During a simulation, a ChordController object is responsible for managing all of the nodes in the system. ChordController objects store nodes in ChordRing objects. A ChordRing acts as a data structure for the nodes and contains convenient operations that may be performed on the entire system. A ChordController may also have a ChordGUIUtil and a StatKeeper object associated with it. The ChordGUIUtil displays graphical results for an experiment, and the StatKeeper collects statistics during an experiment and displays them when requested.

We designed and developed all classes ourselves except for BigSquareRoot, which is a free utility class for calculating the square root of Java BigIntegers taken from [6]. This class is necessary for computing the standard deviation of the node distances.

There are two main mechanisms for running tests. The first is to use the included ChordGUIUtil class, which contains a main method that displays a GUI allowing the user to run individual tests. The GUI utility will create a network with the specified number of nodes and the specified parameters, fill in the finger tables of all those nodes, and then simulate a specified number of tests. Output is displayed on the screen. The second test mechanism involves coding tests in Java that make use of the ChordController class. This allows batches of tests to be set up and run, and the output is in comma-separated value format, which allows the data to be imported easily into a spreadsheet application. We will explain how to use both of these mechanisms in the next two sections.

Using the GUI Utility

The included GUI utility is a convenient way to run simple experiments and to graphically observe how the unmodified Chord protocol and the secured Chord protocol operate.
The tests you can run with this utility are very useful, but to have full control over your tests you will need to code your own tests to make use of the ChordController class.

Figure 5.2. A screenshot of the GUI utility

To use the Chord GUI utility, you will need to obtain the simulator source code from http://www.csh.rit. and extract the provided archive file. On a machine with Java 1.5 or later installed, you can compile this code by typing:

    >javac *.java

And you can run the GUI utility by typing:

    >java ChordGUIUtil

When you execute the utility, the GUI window will appear. The utility consists of four main components:

The left side of the screen contains the experiment setup panel. This panel allows you to enter parameters for an experiment and then run it.

The middle of the screen is the Chord display panel. After an experiment is complete, you can graphically view the routes that lookups took during the experiment.

The right side of the screen is the lookup panel. After an experiment is complete, this will display some information about each lookup that occurred during the
