Localhost: A browsable peer-to-peer file sharing system


Aaron Harwood and Thomas Jacobs
December 17, 2005

Abstract

Peer-to-peer (P2P) file sharing is increasing in use on the Internet. This thesis proposes Localhost, a P2P file sharing system that allows users to find files in the system by browsing a hierarchical directory structure, not unlike a file system. The hierarchical directory structure is global among all the peers in the system. Localhost stores the hierarchical directory structure simply by placing semantics on the files in the system. Localhost allows users to collaboratively build the hierarchical directory structure by way of a popularity based system. The popularity based system allows every directory node in the structure to have multiple alternate versions, and lets users choose which version they prefer; the version with the highest number of viewer preferences is used as the default version. We model the popularity based system to find the range of model parameters that enable it to operate in a stable and progressive manner.

Contents

1 Introduction
    Peer-to-peer
    Key historical developments
    Methods of finding files
    Constructive collaboration
    Our contribution
    Legal and ethical issues
    Thesis organisation
2 Related work
    Hierarchical directory structure systems
    Constructive collaboration systems
    Shared file system systems
3 Localhost overview
    Interpreting files as directory nodes
    Browsing the hierarchical structure
    Downloading files from a directory node
    Editing and submitting a new version of a directory node
    Viewing different versions of each directory node
4 Enabling technologies for Localhost peer implementation
    BitTorrent
    Kademlia
    Azureus
5 Localhost peer design and implementation
    System implementation overview
    The web interface
    Directory node storage
    Directory node retrieval
    Global namespace
    Directory node display
    File retrieval
6 Results and discussion
    Comparison to systems similar to Localhost
    Theoretical behaviour of the popularity based system
    Observation of the Localhost system in use
    Further work
7 Conclusion


1 Introduction

Over the last two decades, the Internet has facilitated a number of widely accepted applications for such activities as global communication, collaboration, and data distribution. Email and the World Wide Web are now invaluable applications used by practically all Internet users. In the last five years, peer-to-peer (P2P) file sharing systems have steadily grown in usage [14]. Internet traffic data collected by CacheLogic [4] indicates that there is now a larger volume of P2P traffic on the Internet than web traffic [5], as figure 1 shows. CacheLogic collected the data by installing deep packet inspection devices at a number of Internet Service Providers (ISPs) around the world. These devices monitor packets and classify them not only based on their network port number, but also by inspecting the data contents of each packet.

Figure 1: Internet traffic by percentage of data volume, categorised by application.

1.1 Peer-to-peer

The term peer-to-peer (P2P) describes a network in which a significant proportion of the network's functionality is implemented by peers in a decentralised way, rather than being implemented by centralised servers [32]. A peer is a program that runs on each of a number of hosts, which interconnect to form a P2P network. Decentralised typically means that the implementation of functionality is spread across all or most of the peers in the network. Centralised typically means that functionality is implemented using programs that are not peers, running on a relatively small number of hosts compared to the number of peers in the network.

There are a number of applications of P2P. File sharing is the most widely used application; the P2P category of figure 1 consists only of seven P2P file sharing systems. Other applications of P2P include Internet telephony [34], instant messaging [19], grid computing [37], and decentralised gaming [38].
Internet telephony can be implemented without use of a P2P network, but using P2P networks can have advantages. One popular P2P Internet telephony and instant messaging program is Skype [39]. Skype allows users to make voice calls and send instant messages to each other. In a non-P2P system, all of the voice and message packets are routed through a central server. In Skype, the packets are sent directly from one peer to another, or routed through other peers in the Skype network when a direct connection between the two peers is not possible. As a consequence, the network can cheaply scale to millions of users because there is no need for costly centralised infrastructure. Currently, around two million users are using Skype at any given time [39].

P2P file sharing systems consist of one or more programs that are used to create and maintain P2P networks to facilitate the transmission of files between users. They allow users to download files from other users of the P2P network, and often also allow users to designate a set of files from their PC's file system to be shared. Sharing a file makes the file available to other users of the P2P network.

There are two key parts of a P2P file sharing system. The first part is the file distribution system. The file distribution system provides the means to transmit files between peers. It is the protocol used to dictate how peers in the system should behave in order to download and upload files. The second part is the file finding system. The file finding system is the means for users to find the files that are available on the P2P network. P2P file sharing systems typically provide the file finding system by maintaining some form of index of the files.

P2P file sharing systems differ in how and where they implement these two parts. Some maintain the file index in a centralised way, and others in a decentralised way. P2P file sharing systems implement the file distribution system in a decentralised way. This definition of P2P file sharing systems satisfies the definition of P2P given earlier: the most significant part of the network's functionality, the transfer of files, is done directly between the peers in the network, without the use of centralised servers.

1.2 Key historical developments

There are more than one hundred P2P file sharing systems listed online [40].
In this section we cover the most important ones in terms of technological development.

In 1999, a P2P file sharing system called Napster [6] was released. Napster was the first popular P2P file sharing system; at its peak it boasted a registered user base of 70 million and 1.57 million simultaneous users [16]. The Napster approach uses a decentralised file distribution system, and a centralised index for users to find files in the network. Specifically, the Napster approach uses several intercommunicating central servers to maintain a filename-based index of files from all of the peers logged into the system at any one time. Each peer logs in to the system by connecting to one of the central servers and sending it the list of filenames of the files that the peer has to share. The central servers maintain only the names of the shared files and the IP addresses of the peers storing those files, not the file contents. The file contents are transmitted directly between peers. In 2001, Napster was deemed to be illegal and was ordered to be shut down by the courts in response to a lawsuit from several major recording companies [33].

In 2000, a P2P file sharing protocol called Gnutella [28] was released. The Gnutella protocol is implemented in a variety of peers, including Limewire [17], Shareaza [31], and Morpheus [22]. The Gnutella approach is an important development in P2P technology because it was the first popular P2P system to have a decentralised index. The Gnutella protocol works by having each peer connect to a small set of remote peers. When a user wants to find a file, the user forms a query string from desired keywords, and the query string is flooded by the Gnutella peer. To flood a query string, the Gnutella peer sends the query string to each remote peer it is connected to. Each remote peer then forwards the query string to all the peers it is connected to, and those remote peers in turn forward the query string, and so on.
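
As a minimal sketch of the flooding idea, the following Python simulates a query spreading through a small overlay inside a single process. It is not Gnutella's wire protocol: the peer names, the `graph` and `shared` dictionaries, substring matching on filenames, and the use of a hop limit together with a set of already-visited peers to keep forwarding finite are all assumptions made for illustration.

```python
from collections import deque


def flood_query(graph, shared, origin, keywords, ttl=3):
    """Simulate flooding a keyword query from `origin`.

    graph  : dict mapping peer -> list of neighbouring peers (hypothetical overlay)
    shared : dict mapping peer -> list of shared filenames (hypothetical data)
    ttl    : maximum number of hops the query may travel
    Returns a dict mapping each answering peer to its matching filenames.
    """
    seen = {origin}                  # peers that have already received the query
    queue = deque([(origin, ttl)])   # (peer, hops the query may still travel)
    hits = {}
    while queue:
        peer, hops_left = queue.popleft()
        if peer != origin:
            matches = [name for name in shared.get(peer, [])
                       if all(k.lower() in name.lower() for k in keywords)]
            if matches:
                hits[peer] = matches
        if hops_left == 0:
            continue                 # hop limit reached: do not forward further
        for neighbour in graph.get(peer, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops_left - 1))
    return hits


# Hypothetical overlay containing a cycle (A-B-D-C-A) and a distant peer E.
graph = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"],
         "D": ["B", "C", "E"], "E": ["D"]}
shared = {"B": ["holiday video.avi"], "D": ["Holiday Photos.zip"],
          "E": ["holiday song.mp3"]}
nearby = flood_query(graph, shared, "A", ["holiday"], ttl=2)
```

With a hop limit of 2 the query reaches B, C and D but never E, which shows both why a hop limit bounds traffic and why it can hide files held by distant peers.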
If a peer has files that satisfy the query string, the peer sends a reply directly to the original querying peer. The graph of peers is cyclic, and precautions are taken to prevent infinite forwarding, such as including a Time To Live field on packets. The original Gnutella protocol suffered from scalability problems, because each query string generates several gigabytes of traffic [29]. The problem's impact has been reduced in a later Gnutella protocol [2] by the classification of peers into leaf peers and ultrapeers. Ultrapeers are peers that have high compute and bandwidth capacity and behave as peers do in the original Gnutella protocol. Leaf peers connect only to ultrapeers, and never connect to each other. Each ultrapeer has a number of connections to leaf peers, and maintains an index of the files available on all of the leaf peers that are connected to it. Using this arrangement, query strings need only be flooded through the high bandwidth ultrapeers, and not the leaf peers, reducing the traffic in the system.

In 2001, a P2P file sharing system called BitTorrent [3] was released. The BitTorrent P2P file sharing system was the first P2P file sharing system to separate the file distribution system from the file finding system. The BitTorrent P2P file sharing system consists of the BitTorrent peers that make up the file distribution system, and a number of websites that index the files available in the BitTorrent network and allow users to find them [26]. The index is maintained on hosts that are not peers, and not at all by the peers in the system, so we can say that the BitTorrent system has a centralised index. Files in the BitTorrent network can also be found by use of other Internet applications, such as Internet Relay Chat (IRC).

The BitTorrent file sharing system has a number of novel features. The first novel feature is that the indexes on the websites are maintained by a relatively small (18, in one case [26]) group of moderators, rather than being moderated by every user on the network. This has advantages, which are described in the next two subsections. The second novel feature is that the file distribution protocol dictates that files are divided up into pieces, so a BitTorrent peer can upload a piece of a downloading file to other interested BitTorrent peers as soon as it receives that full piece. This allows the peers to assist the distribution of the file by providing upload bandwidth to remote peers, even before the peer completes the download. The third novel feature is that the file distribution protocol employs a tit-for-tat policy that rewards peers that upload to remote peers by increasing the uploading peer's download speed.
The P2P file sharing systems Soulseek [35] and DirectConnect [10] were the first popular P2P file sharing systems that allowed the user to find files by browsing each individual node's shared files. The notion of browsing is important for this thesis and is expanded upon in the next subsection.

1.3 Methods of finding files

In order for a user to download files from a file sharing system, the user needs to find files that are available on the network. There are a number of possible methods to find files that are available on the network. The two we look at are query string search and browsing.

Query string search is characterised as a process in which the user describes a request by forming a query string that consists of one or more keywords and the system presents a set of filenames that match or satisfy the query string. Usually, the set gives further details of each file, such as file size and file type. The user can then select files from the set to download.

Browsing is another method to find files. For browsing to be possible, a collection of files must be organised into a browsable structure, such as a tree or directed graph. An example of a browsable structure is a file system on a Personal Computer (PC). Typically, PC file systems are organised as tree structures. When the user is browsing for a file in a PC file system, the user starts from some point in the tree, typically the root, and can see the list of all the files and subdirectories in the current directory. The user can select the file from there, or the user can choose to enter a subdirectory to list all of the files and subdirectories in the subdirectory, and so on. Another example of a browsable structure is the World Wide Web, which is organised as a directed graph.

The majority of P2P file sharing systems use query string search. Napster, Gnutella-based systems, eMule [11], and KaZaA [13] use query string search as their only means for finding files in their networks.
Several other P2P file sharing systems, such as DirectConnect and Soulseek, in addition to query string search, allow the user to browse each individual peer's shared files. However, these systems do not support a browsable namespace that is global among all peers because they do not directly provide a way of collaboratively organising files into a single, integrated, coherent categorical or hierarchical structure. Consequently, over 25 terabytes of files are fragmented across more than 8000 individual listings, with each listing having its own way of organising its files [25].

1.4 Constructive collaboration

In a P2P file sharing system, a scheme must exist that defines how files can be added to the network. Most of the popular P2P file sharing systems, such as those mentioned in the previous subsection, use a scheme that lets users simply designate a folder in their PC's file system and have all of its contents shared. Two major problems that occur in P2P file sharing systems that use this scheme are pollution and poisoning [7]. Pollution of a P2P network refers to the accidental injection of unusable copies of files into the network by non-malicious users. Poisoning is where a large number of fake files are deliberately injected into a P2P network by malicious users or groups. Fake files are specifically created by malicious users or groups to seem like certain files, but consist of rubbish data or are unusable in some way. Both of these problems reduce the perceived availability of files to users and reduce the usefulness of the system to users, because finding usable files is more difficult. A study [13] found that a significant proportion of files on the KaZaA network are unusable, due to poisoning and pollution.

A number of P2P file sharing systems employ a file rating system to attempt to combat these problems. File rating systems let users rate each file's quality: the theory is that enough users will find the fake and unusable files and rate them poorly, allowing other users to identify them before downloading them. These file rating systems have been shown to be largely ineffective [13].

The scheme used in the BitTorrent file sharing system is that any user can submit files to the index websites, and the file is checked by the moderators of the website before being added to the website's index. If the file is found to be fake or of unusable quality, it is not added to the index.
Although pollution and poisoning levels are difficult to measure, sources indicate that the BitTorrent system is virtually pollution and poisoning free because of this scheme [26]. The same paper [26] found the scheme to be a practical one: its authors were surprised that a mere 18 moderators are able to effectively manage the numerous daily content injections with such a simple system. The paper went on to state the drawback of this system: "Unfortunately, this system relies on a central server and is extremely difficult to distribute." In this thesis we address this drawback by developing a new file finding system for the BitTorrent file distribution system, to form a new P2P file sharing system.

1.5 Our contribution

The major contributions of this thesis are:

- The conversion of an existing file distribution system into a file sharing system by placing semantics on the downloadable files.
- A popularity based system that aims to allow constructive collaboration in a decentralised P2P file sharing system.
- Analysis of the popularity based system via simulation.

Motivated by the problems covered in the previous subsections, namely poisoning and pollution making indexes in P2P file sharing systems less usable, the non-coherent index of the Soulseek and DirectConnect file sharing systems, and the BitTorrent file sharing system's dependence on centralised websites to index the system's files, we designed and implemented a P2P file sharing system, which we call Localhost. The Localhost system in operation consists of Localhost peers running on a number of Internet hosts. The Localhost peer contains the BitTorrent file distribution system, and creates a hierarchical index of the files in the system by imposing semantics on some of the files. The semantics classify files into regular files and directory nodes. A directory node is a file that contains references to regular files and other directory nodes. The directory nodes form a hierarchical structure in which the regular files are indexed. The hierarchical structure is used by users to find the regular files available in the system.

To facilitate constructive collaboration between users in order to build a cohesive hierarchical structure, we developed a popularity based system. This system lets users choose to view any one of multiple alternate versions of each directory node. Users are initially shown the version that the highest number of users have chosen to view. New alternate versions of directory nodes can be created by any user, by adding files from their PC and/or new directory nodes to an existing version of a directory node.

Lastly, we modelled and simulated the popularity based system to predict its behaviour and find the range of parameters to the model that provide acceptable behaviour.

1.6 Legal and ethical issues

P2P file sharing systems have largely been used to distribute copyrighted material without the consent of the original copyright holder. The legal and ethical issues of P2P file sharing are not the focus of this thesis.
This thesis focuses on the technical issues of P2P file sharing.

1.7 Thesis organisation

The rest of this thesis is organised as follows. Section 2 covers work that is similar to this thesis. Section 3 overviews what Localhost does, without covering technical details. Section 4 introduces the technologies used in implementing Localhost. Section 5 details how Localhost is implemented, explaining design decisions along the way. It builds on the technologies introduced in section 4. Section 6 contains the discussion and results of this thesis. Section 7 summarises and concludes.

2 Related work

We have covered a number of P2P file sharing systems earlier, in subsection 1.3, so this section covers a number of systems that are similar in other ways to Localhost.

2.1 Hierarchical directory structure systems

The Open Directory Project (ODP) [24] is a human-edited directory structure which indexes websites. It indexes websites in a hierarchical structure, and is itself a website. The nodes in the hierarchical structure are categories, and the leaves are website links. The top level nodes are broad categories, such as Arts, Business, Computers, and News. The ODP is constructed and maintained by a global community of volunteer editors.

2.2 Constructive collaboration systems

Wikipedia [41] is a user-edited online encyclopedia. The system allows collaboration among its users to build its content. Any user can change and update the contents of any article in the encyclopedia. The system maintains a history of changes that allows any user to roll the article back to a previous version, in case of unwanted additions, such as vandalism.

2.3 Shared file system systems

Wayfinder [25] is a P2P file sharing system that provides a global namespace and automatic availability management. It allows any user to modify any portion of the namespace by modifying, adding, and deleting files and directories. Wayfinder's global namespace is constructed by the system automatically merging the local namespaces of individual nodes. Farsite [1] is a serverless distributed file system. Farsite logically functions as a centralised file server but its physical realisation is dispersed among a network of untrusted workstations. OceanStore [15] is a global persistent data store designed to scale to billions of users. It provides a consistent, highly-available, and durable storage utility atop an infrastructure comprised of untrusted servers. Cooperative File System [9] is a global distributed Internet file system that also focuses on scalability. Ivy [23] is a distributed file system that focuses on allowing multiple concurrent writers to files.

3 Localhost overview

This section gives a top-level overview of Localhost, without covering technical details or how it achieves its behaviour. We use the abstraction Localhost Distributed System (LDS) to refer to the system that is created by Localhost peers running on a number of Internet hosts. The LDS maintains a global-namespace hierarchical directory structure of files that can be downloaded by Localhost peers. No one peer is responsible for storing the complete hierarchical directory structure; that responsibility is distributed amongst all the peers in the LDS.
There is no central server or peer that has more importance than other peers in the LDS. Figure 2 shows one Localhost peer running on a host and the peer's interaction with a web browser and the rest of the LDS.

Figure 2: Overview of one Localhost peer running on a host.

The namespace of the hierarchical structure is global among all Localhost peers. Every new version of a directory node that any peer creates is viewable by all Localhost peers. Peers can create new versions of any directory node in the hierarchical structure, including the root directory node, so each directory node in the hierarchical structure may have any number of alternate versions.

Each peer communicates with a web browser that is running on the same host as the peer to display directory nodes to the user. Each user can view any version of each directory node, and the last version that they view is taken as their preference for that directory node, so each user can have at most one preference for each directory node. When a user views a different version of a directory node, the peer informs the LDS of the user's preference, which the LDS stores. When a user requests a directory node for the first time (i.e. the user has not viewed that directory node before), the peer gives the user the most popular version of that directory node. The popularity of a version is defined as the number of users whose preference is for that version. When a version has zero user preferences, it disappears.

The hierarchical directory structure is built up over time by new versions of the root directory node and its subdirectory nodes being created by peers. When a directory node includes a reference to another directory node, we call the referenced directory node a subdirectory node. Initially, the hierarchical structure begins with a single version of the root directory node that includes no files or subdirectory nodes. New versions of the root directory node are created by peers, with files and subdirectory nodes included in them. When a peer creates a new version of a directory node with a subdirectory node included in it, a single empty version of that subdirectory node is created as well. New versions of that subdirectory node can then be created by the same peer, or other peers in the system.

When a Localhost peer views a directory node, or completes the download of a file, the peer makes the file available to be downloaded from itself, which helps the distribution of the file by providing another complete copy of the file to the system. This makes the system more scalable than if every peer was forced to download a file or version of a directory node from only the peer that added that file or version of the directory node.

Each directory node and regular file in the hierarchical structure can be uniquely identified by its path. The path of the root directory node is "/". The path of all other directory nodes is formed by taking the path of the root directory node, "/", and concatenating the names of the directory nodes that are in the chain of directory nodes from the root directory node to that directory node.
When concatenating the names, each directory node name has "/" appended to it, in order to separate the names in the path. For example, if the root directory node included a subdirectory node called Videos, and that Videos subdirectory node included a subdirectory node called Trailers, then the path of the Trailers subdirectory node would be "/Videos/Trailers/". The path of a regular file is formed by concatenating the name of the file onto the path of the directory node that contains it.

From a top-level view, Localhost does the following:

- Interprets certain files as directory nodes and facilitates displaying them in a web browser. The subdirectory nodes and files that are listed in a directory node are presented as links in the web browser's display.
- Allows browsing of the hierarchical structure by responding to a user clicking a subdirectory node link by downloading the most popular version of that subdirectory node, and serving it to the web browser for display. This becomes an iterative process, because the displayed subdirectory node can include subdirectory node links of its own.
- Responds to a user clicking a file link by downloading the file.
- Allows the user to create new versions of directory nodes by adding and/or removing files and/or subdirectory nodes to/from an existing version of a directory node. The new version is then submitted to the LDS, where it becomes viewable by all of the users of the LDS.
- Allows the user to select any version of a directory node to view. The LDS maintains the references to all the versions of each directory node, and counts of how many users are viewing each version of each directory node.

3.1 Interpreting files as directory nodes

In order to build the hierarchical directory structure, the Localhost peer interprets certain files in the LDS as directory nodes. These files contain a listing of directory node names and/or file names.
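
As a rough illustration of the directory-node idea and the path rule described above, the following Python sketch models a directory node as a listing of names and builds paths by concatenation. The `DirectoryNode` class, its field names, and the file name `teaser.avi` are hypothetical; the actual directory node file format is not described in this section, and alternate versions are ignored here.

```python
from dataclasses import dataclass, field


@dataclass
class DirectoryNode:
    """Hypothetical in-memory view of one version of a directory node."""
    subdirs: dict = field(default_factory=dict)  # subdirectory name -> DirectoryNode
    files: list = field(default_factory=list)    # regular file names


def directory_path(names):
    """Path of a directory node, given the chain of names below the root."""
    # The root is "/", and every directory name has "/" appended to it.
    return "/" + "".join(name + "/" for name in names)


def file_path(names, filename):
    """Path of a regular file: the containing directory's path plus its name."""
    return directory_path(names) + filename


def walk(node, names=()):
    """Yield the path of every directory node and regular file under `node`."""
    yield directory_path(names)
    for filename in node.files:
        yield file_path(names, filename)
    for name, child in node.subdirs.items():
        yield from walk(child, names + (name,))


# The example from the text: /Videos/Trailers/ with one hypothetical file in it.
trailers = DirectoryNode(files=["teaser.avi"])
root = DirectoryNode(subdirs={"Videos": DirectoryNode(subdirs={"Trailers": trailers})})
all_paths = list(walk(root))
```

Because every directory name carries its own trailing "/", a directory path always ends in "/" while a file path does not, so the two kinds of entry can be told apart from the path alone.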
The Localhost peer serves these files to a web browser, along with formatting information and details of the six most popular versions of the directory node. This allows the web browser to render the directory page for the directory node. An example directory page is shown in figure 3(a). The directory page has the path of the directory node as the heading of the page. The directory node names and file names in the directory node are displayed as links on the directory page. Details of the six most popular versions of the directory node are displayed on its directory page. The details include the description of that version, and the number of users whose preference is for that version. In the example in figure 3(a) only two versions are shown because there are only two versions of the directory node. The directory page also includes an edit link and a versions link. The edit link is described later in this section. When the user clicks the versions link, the Localhost peer returns a web page that lists the details, as above, of all of the versions of the directory node.

Figure 3: Screenshots of a web browser displaying a directory page. (a) Directory page of /Videos/Trailers/. (b) The same directory page in editing-mode-format.

3.2 Browsing the hierarchical structure

When a user clicks a subdirectory link on a directory page for the first time, the Localhost peer finds and downloads the most popular version of that (sub)directory node from the LDS. After downloading the (sub)directory node, the peer does three things. First, it serves the (sub)directory node to the web browser to be displayed. Second, it informs the LDS that the user's preference is now for this version of the (sub)directory node. Third, it makes the (sub)directory node available to be downloaded from the peer by other peers in the LDS, to aid distribution of the (sub)directory node.

3.3 Downloading files from a directory node

When a user clicks a file link on a directory page, the Localhost peer downloads the file to a user-specified location on the user's PC. When the file has finished downloading, the peer makes the file available to be downloaded from the peer by other peers in the LDS.
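
The version-selection behaviour described in this section (the most popular version is served on a first request, the last viewed version becomes the user's preference, and a version with zero preferences disappears) can be sketched in a few lines of Python. This is a single-process illustration under stated assumptions, not the distributed LDS: the `DirectoryVersions` class, the string version identifiers, alphabetical tie-breaking between equally popular versions, and the rule that submitting a version also makes it the submitter's preference are all hypothetical.

```python
class DirectoryVersions:
    """Hypothetical bookkeeping for the alternate versions of one directory node."""

    def __init__(self, initial_version):
        self.versions = {initial_version}  # version ids that currently exist
        self.preference = {}               # user -> preferred version id

    def popularity(self, version):
        """Number of users whose preference is for `version`."""
        return sum(1 for v in self.preference.values() if v == version)

    def most_popular(self, limit=6):
        """The `limit` most popular versions; ties broken by id (an assumption)."""
        ranked = sorted(self.versions, key=lambda v: (-self.popularity(v), v))
        return ranked[:limit]

    def view(self, user, version):
        """Record that `user` viewed `version`, which becomes their preference."""
        if version not in self.versions:
            raise KeyError(version)
        self.preference[user] = version
        # A version with zero user preferences disappears.
        self.versions = {v for v in self.versions if self.popularity(v) > 0}

    def submit(self, user, version):
        """Add a new version; assumed here to become the submitter's preference."""
        self.versions.add(version)
        self.view(user, version)

    def request(self, user):
        """Version served when `user` opens the directory node."""
        if user in self.preference:
            return self.preference[user]   # returning user: their own preference
        choice = self.most_popular(1)[0]   # first visit: the most popular version
        self.view(user, choice)
        return choice
```

A short scenario: after one user requests the initial version and another submits an alternative, both versions survive with one preference each; once every user has moved to the alternative, the initial version drops out of `versions`.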

3.4 Editing and submitting a new version of a directory node

Each directory page includes an edit link. When it is clicked, the Localhost peer puts the directory node into editing mode and serves the editing-mode-format of the directory page to the browser. Entering editing mode copies the currently displayed version of the directory node to create a new version of the directory node, which can be edited and then submitted to the LDS. An example of an editing-mode-format of a directory page is shown in figure 3(b). The editing-mode-format of the directory page is served even if the directory page is re-requested by the web browser. This continues to happen until editing mode is exited.

The editing-mode-format of a directory page displays the word editing in the directory page's title, and provides options to edit the new version of the directory node. The first option, Add file, allows the user to add a file from their PC's file system to the new version of the directory node. The second option, Create empty folder, allows the user to create an empty subdirectory in the new version of the directory node. The third option, Add folder, allows the user to add an entire folder structure from their PC's file system to the new version of the directory node. Every file and subdirectory node link on the directory page has a Delete link next to it to allow the user to remove it from the new version of the directory node, even if they were not the user that added it.

Finally, once the desired changes have been made, the user can type a description of the new version in the text box provided, and click the Submit This As New Version button. The Localhost peer submits the new version of the directory node to the LDS, which makes the new version viewable by all of the users in the LDS.
If the user wants to cancel the editing without a new version being submitted, they can click the Cancel Editing link, which exits editing mode and serves the directory page without editing-mode-format.

3.5 Viewing different versions of each directory node

The LDS maintains the details of each version of each directory node. The details of a version consist of the description, a count of the number of users with a preference for that version, and a reference that allows peers to download the directory node. As described above, each directory page contains details of the six most popular versions of the directory node, and the versions link, which leads to a page with details of all of the versions. Each of the descriptions is a link which, when clicked, causes the Localhost peer to download and return that particular version of the directory node to the web browser. This is also the case for the aforementioned web page that lists all the versions of the directory node. Once the version of the directory node has downloaded, the Localhost peer informs the LDS of the user's new preference. If a user has previously specified their preferred version of a directory node, then when the (sub)directory node is requested as described in subsection 3.2, the user's preferred version is returned, rather than the most popular version.

4 Enabling technologies for Localhost peer implementation

The work done in this thesis builds on a number of technologies, which we detail in this section. These technologies are BitTorrent (a P2P protocol), Kademlia (a Distributed Hash Table protocol), and Azureus (an implementation of BitTorrent which uses Kademlia).

4.1 BitTorrent

The BitTorrent protocol is designed and used for P2P file distribution [8]. It was proposed by Bram Cohen, who also released a peer that implements the protocol. A number of other peers have also been released that implement the BitTorrent protocol.
The protocol's basic premise is to use the otherwise wasted upload bandwidth of downloaders to help distribute files. In the BitTorrent system, a file is broken up into pieces, which are transmitted between peers. A file's piece size is usually between 32 kilobytes and 128 kilobytes, inclusive.
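The piece-based transfer described above can be illustrated with a minimal sketch that breaks a byte string into fixed-size pieces and computes a SHA-1 hash for each piece. The 64 KiB piece size, the function names, and the zero-filled sample data are assumptions chosen for illustration; they are not taken from any particular BitTorrent peer.

```python
import hashlib

# Assumed piece size, chosen from within the 32-128 kilobyte range.
PIECE_SIZE = 64 * 1024


def split_into_pieces(data: bytes, piece_size: int = PIECE_SIZE) -> list[bytes]:
    """Break the data into consecutive pieces; the last piece may be shorter."""
    return [data[i:i + piece_size] for i in range(0, len(data), piece_size)]


def piece_hashes(pieces: list[bytes]) -> list[bytes]:
    """Compute the 20-byte SHA-1 digest of every piece."""
    return [hashlib.sha1(piece).digest() for piece in pieces]


# Hypothetical 200,000-byte file made of zero bytes.
data = bytes(200_000)
pieces = split_into_pieces(data)
hashes = piece_hashes(pieces)
```

A receiving peer can check each piece against its hash as soon as the piece arrives, which is what lets pieces be passed on before the whole file has been downloaded.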

Figure 4: An example scenario of BitTorrent protocol operation.

A user that wishes to publish a file or collection of files uses a program to create a torrent file. A torrent file contains the name(s) of the file(s), the SHA-1 hash of every piece of every file, the torrent file's infohash and the web address of one or more trackers to be used. The infohash is the SHA-1 hash of the torrent file's info section, which covers the file names and piece hashes and so changes whenever the files' data changes; it is used to uniquely identify a torrent file. A tracker is a server that maintains a list of IP addresses of peers in the swarm. The swarm is the set of peers currently involved in transmitting pieces of the file to each other. The term torrent refers to the collection of file(s) that the torrent file was created from. The torrent file is distributed to other users by some means external to the BitTorrent peer, such as via web sites.

The user publishing the file must then act as a seed for the file(s). In BitTorrent terminology, a seed is a peer that has the complete file(s). Initially, there is one seed in the swarm: the publisher of the file(s). After peers in the swarm complete the download, they become seeds for the file(s) as well. When a peer acts as a seed for the file(s), it goes through basically the same steps as for downloading the file, which are described in the following paragraph. There are only minor policy differences in its behaviour for seeding versus downloading.

A user interested in downloading a particular file in the system must first obtain the torrent file for that file. The torrent file is given to their BitTorrent peer. The BitTorrent peer then proceeds to download the file as follows. The peer first connects to the tracker to request a set of IP addresses of remote peers that are in the swarm. This is shown in figure 4 by dotted lines. The set returned from the tracker is a random subset of the full list the tracker maintains.
The request for a set of remote peer IP addresses allows the tracker to add the requesting peer to its list of IP addresses of peers that are in the swarm. After the peer has a partial list of peers in the swarm, it picks a certain number of them at random, and attempts to connect to them. For most peers this number ranges from four to around thirty, depending on user configurable settings and peer implementation. The peer also listens on a network port, by default 6881 (TCP) but also user configurable, so that remote peers that are themselves attempting to connect to other peers can connect to it. The peer should start to receive connections after it gives the tracker its IP address, because remote peers will receive this IP address from the tracker and start to connect to the peer, assuming there are other peers in the swarm. Each peer in the swarm aims to maintain this number of connections to remote peers, without consideration of which peer initiated the connection. The solid lines in figure 4 show an example interconnection between peers.

When a peer connects to a remote peer, the two peers exchange a bitmap that indicates which pieces of the file each does and does not have. This allows each peer to work out which pieces, if any, the local peer has that the remote peer does not have, and which pieces, if any, the remote peer has that the local peer does not have. Pieces are transmitted from those peers that have them to those peers that do not.

The result of this behaviour in each peer is the following. The initial seed peer connects to remote peers, and transmits pieces of the file(s) to the remote peers. These remote peers do the same again, that is, connect to remote peers, and transmit the pieces they have to them, all while still receiving other pieces from the initial seed peer. As soon as a peer receives a full piece, it can transmit the piece to the remote peers it is connected to. This allows the pieces to be propagated through the swarm. Figure ?? shows an example situation of piece transfer. Note that, assuming zero peer failures, the initial seed peer only needs to transmit each piece of the file once into the swarm, and it is possible for every peer to receive a complete copy of the file, even in a swarm of thousands of peers. In contrast, conventional web serving with thousands of requesting clients requires that the complete file is transmitted by the server to every client. This demonstrates BitTorrent's scalable nature.

File distribution systems have experienced problems caused by free-riders in the past [20]. Free-riders are users who download the file, but do not upload the file. Having free-riders results in low download speed for some peers, or even the inability of some peers to find any peers to download the file from. BitTorrent has a real-time tit-for-tat feedback scheme to reward local peers for uploading to remote peers [8]. The scheme dictates that each peer only uploads to a subset of the remote peers that it is connected to: the remote peers from which it is getting the highest current piece transfer rate. With most peers in the swarm following this rule, it is in each peer's best interest to upload to a remote peer that the peer is receiving pieces from [8]. If the peer does not, it is likely to stop receiving pieces from the remote peer, because the remote peer is not getting a high enough download rate from the peer. The peer will have to find another remote peer to download from if this happens.
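The tit-for-tat rule described above amounts to ranking connected remote peers by the rate at which they are currently sending pieces and uploading only to the top few. The following is a minimal sketch of that selection step; the number of upload slots, the peer names, and the rate values are hypothetical, and real peers re-evaluate the ranking periodically.

```python
def choose_upload_targets(download_rates: dict[str, float], slots: int = 4) -> list[str]:
    """Return the remote peers we currently receive pieces from fastest.

    download_rates maps a remote peer identifier to the current transfer
    rate from that peer to us. `slots` is an assumed limit on how many
    remote peers we upload to at once.
    """
    ranked = sorted(download_rates, key=lambda peer: download_rates[peer], reverse=True)
    return ranked[:slots]


# Hypothetical measured rates in kilobytes per second.
rates = {"peer_a": 12.0, "peer_b": 48.5, "peer_c": 0.0, "peer_d": 30.2, "peer_e": 5.1}
targets = choose_upload_targets(rates, slots=3)
```

In this example the free-riding peer_c, which sends nothing, is not among the peers selected to receive uploads.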
In addition to the certain number of remote peers each peer connects to, each peer also optimistically connects to other remote peers that it has the IP addresses of. This gives the peer the possibility of finding a remote peer that it can download pieces from at a higher rate than the currently connected remote peers. This scheme results in unprecedented download speeds for a P2P file distribution system [12].

4.2 Kademlia

Kademlia [21] is a Distributed Hash Table (DHT) protocol. A DHT based system provides services similar to those of a hash table, but distributes storage and lookups among a number of peers. A number of DHT protocols have been developed. The first four DHT protocols, Chord [36], CAN [27], Pastry [30], and Tapestry [43], were all developed in DHT based systems support a number of operations. The two major operations are:

- put(key, value). Stores the data string value under the key key in the DHT.
- value = get(key). Retrieves the data string value from the DHT that is stored under the key key.

Some DHTs allow multiple values to be stored under, and retrieved from, a single key. DHT based systems provide the abstraction of a hash table, which is accessed by these two operations. The work done in this thesis builds on the abstraction by using these two operations. The following describes how DHT based systems provide the abstraction.

DHT protocols distribute the key-value pairs among the peers as follows. Typically, keys are 160-bit values, and are usually calculated by hashing some data. The DHT protocol partitions the namespace of keys among the peers in the DHT. Each peer in the DHT is responsible for a certain subset of the key namespace, and so is responsible for storing a certain subset of key-value pairs. The peers in the DHT can join and leave the network freely.
The DHT protocol provides an algorithm for peers to enter the DHT, so that the correct key-value pairs can be transferred to a peer when it becomes responsible for a set of keys upon entering the DHT. The DHT protocol also provides an algorithm for peers to leave the network and transfer their key-value pairs to other peers so that the key-value pairs are not lost. Despite the apparent chaos of periodic random changes to the membership of the network, DHTs make provable guarantees about performance [42, 36, 21].
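The hash table abstraction that a DHT exposes can be sketched with a single-process stand-in. The class below is purely illustrative: ToyDHT and make_key are hypothetical names, all storage is in one local dictionary rather than distributed among peers, and it assumes the variant of the abstraction in which several values may be stored under one key.

```python
import hashlib
from collections import defaultdict


class ToyDHT:
    """In-memory stand-in for the put/get abstraction (not distributed)."""

    def __init__(self) -> None:
        self._store: dict[bytes, list[str]] = defaultdict(list)

    @staticmethod
    def make_key(data: str) -> bytes:
        """Derive a 160-bit key by hashing some data with SHA-1."""
        return hashlib.sha1(data.encode("utf-8")).digest()

    def put(self, key: bytes, value: str) -> None:
        """Store a value under the key, keeping any earlier values."""
        self._store[key].append(value)

    def get(self, key: bytes) -> list[str]:
        """Return every value stored under the key (empty if none)."""
        return list(self._store.get(key, []))


dht = ToyDHT()
key = ToyDHT.make_key("example key")
dht.put(key, "first value")
dht.put(key, "second value")
```

A real DHT offers the same two calls, but routes each one to whichever peer is responsible for the key.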

DHT based systems operate in a completely decentralised way. DHT protocols are able to provide the two operations described above by making the peers form a DHT overlay network. The DHT overlay network is formed by each peer maintaining a set of contacts. A contact is the peer ID and IP address of a remote peer in the DHT. Each peer has a peer ID, which is a number chosen from the namespace of keys. The set of contacts each peer maintains does not include every possible contact in the DHT. The specific DHT protocol used dictates which contacts each peer maintains. Using these contacts, DHT overlay networks such as those used in Chord and Kademlia allow each peer to locate the remote peer responsible for a certain key in O(log n) time. Once the correct peer has been located, gets and puts can be done by contacting that peer.

In a Kademlia DHT, each peer's peer ID is chosen randomly from the namespace of keys. Each peer is responsible for the set of keys that are closest to its peer ID. In Kademlia, closeness is defined by the XOR of two values, where the two values are a peer ID and a key. Kademlia divides the key namespace progressively into subtrees, by taking each bit of the key in turn, and forming a new subtree for both possible values of the bit, as shown in figure 5. Kademlia's contact maintenance policy dictates that each peer maintains one contact in each subtree in which it is not itself contained. This results in each peer maintaining only O(log n) contacts.

Figure 5: Example of a Kademlia DHT overlay network with 3-bit keys.

Figure 5 shows an example of the possible contacts the peer with ID 001 can have: the peer is required to maintain exactly one contact from each dotted oval. In this example, the contacts it is maintaining are 110, 011, and 000, as shown in figure 5 by the thin lines. Although not illustrated in figure 5, every peer follows the same rule for contact maintenance.
Peers find the remote peer responsible for a certain key by querying peers that are successively closer to the key, starting with themselves, as shown in figure 5 by the thick arrowed lines. In this example, the peer 001 is seeking the IP address of the peer 101. The first step is to query itself to find the peer from its contacts that is closest to the target peer. Of the three contact peers 110, 011, and 000, the peer 110 has the smallest XOR distance to 101 (011, compared with 110 for peer 011 and 101 for peer 000); equivalently, it shares the most leftmost bits with the target ID, so it is chosen. Then the peer 110 is queried to find the next closest peer to 101 from its set of contacts. The peer 100 is found. That peer is queried, and the IP address of the target peer is found, because that peer has the target peer as a contact. Each step halves the distance to the target peer, because an entire subtree is eliminated with each query. This gives the expected O(log n) hops to find the IP address of the target peer.

Peers require the IP address of one peer already in the DHT overlay network in order to join the DHT overlay network. The preceding description of Kademlia is a simplified one. Further details on Kademlia are not included in this thesis. We refer the interested reader to [21] for more information on Kademlia.
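The first step of the lookup in the 3-bit example can be sketched as follows. The code computes the XOR distance from each contact of peer 001 to the target 101 and selects the contact with the smallest distance, which is peer 110, matching the walk-through. The function names are hypothetical, and peer IDs are written as Python binary literals for readability.

```python
def xor_distance(a: int, b: int) -> int:
    """Kademlia closeness: the XOR of two IDs, compared as an integer."""
    return a ^ b


def closest_contact(contacts: list[int], target: int) -> int:
    """Pick the contact with the smallest XOR distance to the target."""
    return min(contacts, key=lambda contact: xor_distance(contact, target))


# Contacts maintained by peer 001 in the example, and the target peer.
contacts_of_001 = [0b110, 0b011, 0b000]
target = 0b101

# Distances: 110^101 = 011 (3), 011^101 = 110 (6), 000^101 = 101 (5).
next_hop = closest_contact(contacts_of_001, target)
```

Repeating the same selection at each queried peer moves the lookup into ever smaller subtrees around the target.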

4.3 Azureus

Azureus is a Java implementation of the BitTorrent protocol. It is a very popular BitTorrent peer: it has been downloaded from the open source repository over 30 million times. Azureus allows a number of files to be downloaded and seeded concurrently. As of version , Azureus also includes an implementation of the Kademlia protocol. All Azureus peers join the same DHT, by contacting a certain peer that is set up for the purpose and aims to always be online.

Azureus uses the Kademlia DHT to implement a feature called decentralised tracking. Decentralised tracking is an optional replacement for trackers. When decentralised tracking is enabled, an Azureus peer puts a data value that consists of its IP address and BitTorrent network port number into the DHT, for each file it is downloading or seeding. The key that the value is put under is the infohash of the file, which comes from the torrent file. Each Azureus peer with decentralised tracking enabled performs a get for each of the files it is downloading or seeding, where the key is the infohash of the file. The values returned from this get allow the BitTorrent protocol section of the peer to connect to remote peers in the swarm. The put and get are analogous to, and a replacement for, the tracker communication done by the standard BitTorrent protocol.

Version of Azureus also introduces a torrent file download feature. The torrent file download feature allows torrent files to be downloaded from remote Azureus peers to an Azureus peer using the User Datagram Protocol (UDP). To download a torrent file, the torrent file's infohash is required. Azureus uses port 6881 (UDP) to transfer torrent files and to make connections to other peers to operate the DHT protocol. The Kademlia implementation in Azureus allows each peer to store only a single value under each key.
When a peer performs a put(key, value) for a key under which the same peer has performed a put(key, value) earlier, the earlier value is overwritten. Multiple peers can each store a different value under the same key. A single peer can store different values under different keys.

5 Localhost peer design and implementation

We developed the Localhost peer, which implements the functionality described in this section. The peer and its source code are available for download at The peer was developed as a modification and extension of the Azureus source code base. Azureus was chosen as the source code base for the Localhost peer for two reasons. The first is that Azureus is the only BitTorrent peer with a torrent file download feature, which is required to implement the Localhost peer. Otherwise, this feature would have had to be developed, adding to the development time of the Localhost peer. The second is Azureus' popularity. Developing the Localhost peer from a program that has been shown to be usable by a large number of people is a good starting point for making Localhost a usable system.

5.1 System implementation overview

Figure 6 gives a modular overview of the Localhost system. The Localhost peer consists of the Azureus BitTorrent and Kademlia modules, and a module that contains logic which interacts with a web browser and these two modules. The logic and interactions are described in the following subsections.

5.2 The web interface

The Localhost peer has a minimal HTTP server built into it, which is designed to serve a web browser that is running on the same host, to provide the web interface to the program. The HTTP server is shown in figure 6. The HTTP server listens for and accepts connections on network port 8880 (TCP). Port 8880 was chosen arbitrarily from the set of port numbers that are not well-known port numbers.

Figure 6: Modular overview of the Localhost system.

After the peer has been started on a host, it can be accessed by a web browser that is started on the same host and pointed to: where the path is the path of the file or directory node to be retrieved. If the peer finds that the path is a directory node, then the peer downloads the directory node from the LDS and displays the directory page as described in section 3.1. If the peer finds that the path is a file, then the peer downloads the file from the LDS to a user specified location on the user's file system.

The path is also used to give commands to the peer. The commands are given in the format: where path is the path of the specified directory node, which is the directory node that the command is to be executed on. The command is the command name, as detailed below. The argname=argument specifies the argument to the command. Multiple arguments are separated by ampersands. The commands used in the web interface are listed in table 1. The four commands addfile, createfolder, addfolder and delete act on a new version of the directory node that is created when the directory node is placed into editing mode.

Table 1: Commands used in the web interface of Localhost.

Command | Description
versions | Returns a webpage that is dynamically generated by the program, which lists, for each version of the directory node, its description and the number of users whose preference is that version. Each version description is a link to the getversion command, with the version description and infohash supplied as the argument.
getversion | Takes the infohash of a version and its description as arguments. Returns the specified version of the directory node to the web browser, downloading it if required. The Localhost peer records the specified version as the user's preference, and informs the LDS of the user's preference.
edit | Places the specified directory into editing mode, and returns the editing-mode-format directory page for that specified directory node, as described in section 3. Makes a new version of the specified directory node for the following four commands to operate on, by copying the currently viewed version. The editing-mode-format directory page includes links to the following six commands, providing the correct arguments to the commands.
addfile | Takes a path to a file on the PC's file system as an argument. Adds the file located by the file path on the PC's file system to the specified directory node.
createfolder | Takes a name as an argument. Creates a new empty subdirectory node inside the specified directory node, giving it the name supplied.
addfolder | Takes a path of a folder on the PC's file system as an argument. Adds the folder, including all of its contents recursively by creating more subdirectory nodes, from the file system to the specified directory node.
delete | Takes either a file name or subdirectory node name as an argument. Removes that file or subdirectory node from the specified directory node.
submitversion | Submits the new version of the directory node that was created and modified by the preceding five commands, to the LDS.
canceledit | Exits the specified directory node from editing mode without submitting the new version of it to the LDS, losing the changes made.

5.3 Directory node storage

The conversion of a file distribution system into a file sharing protocol by placing semantics on files to interpret some of them as directory nodes is a central idea in this thesis. There are a number of possible designs that can be used to implement this. This subsection compares the considered possibilities.

The first design considered was a recursive "torrent files in torrent file" design. In this design, there is one torrent which represents the root directory of the hierarchical structure. This torrent contains only torrent files. Each of those torrent files represents either a file or a subdirectory node in the root directory node, depending on whether the torrent file represents a file or a collection of files. The torrent files that represent a collection of files are the subdirectory nodes. Each subdirectory node would also be a collection of torrent files, each of which is of the same nature as stated earlier, thus recursively building up the hierarchical directory structure.

The major problem with this strategy is that it suffers from a direct dependency problem. Recall that a torrent file contains various hashes that are calculated from the contents of the file(s) that the torrent file represents. Changing the contents of any file alters the torrent file that represents the file, because the hashes contained within the torrent file will have changed. Changing the contents of any file therefore also alters the torrent file of the directory node that the file is contained in, because the directory node contains the torrent file that represents that file. This in turn changes the torrent file of the containing directory node's containing directory node in the same way. This process happens all the way up to the root directory node, where the root directory node's torrent file has to be changed. This situation results in the root directory node having to change every time any file or directory node in the entire hierarchy is changed, which is impractical.

To solve this problem, the direct dependence between a directory and its subdirectories needed to be removed. A solution along this line of thinking was to add a level of indirection to the design by replacing the torrent files in the torrents with web URLs that pointed to torrent files hosted on websites. This solution made the system non-decentralised, so it was not viable given the decentralised aim of the system. Furthering the indirection idea, the solution in use in the final design is to include only the names of the other directory nodes and the names of files in each directory node.
The DHT is used to locate the infohash of the torrent file that represents a version of a directory node, using the directory node names, as described in the following subsection.

Each directory node is simply a file, stored in the Extensible Markup Language (XML) file format. The XML file contains the list of file names and other directory node names (which can be considered to be subdirectory nodes). The XML file only contains the list of the file names and subdirectory node names in the immediate directory node, i.e. the XML file does not contain the list of file names included in the subdirectory nodes that are included in itself. This list is stored in the XML file format so the Localhost peer can modify the list. The Localhost peer needs to modify the list to add and remove subdirectory node names and file names when creating new versions.

5.4 Directory node retrieval

When a Localhost peer receives a request from a web browser to retrieve a directory page, the peer needs to return the correct version of the directory node to the web browser. Algorithm 1 details the logic used by the Localhost peer to serve a request for retrieving a directory node. Most of the logic in algorithm 1 exists to let the peer keep state information across multiple browser requests. The logic makes the peer record which version of each directory node the user prefers. The logic also makes the peer record which directory nodes are in editing mode.

The PathHash used in the algorithm is a string used as a key in the DHT and is created by taking the SHA-1 hash of the path of the directory node being retrieved, for example, SHA-1("/Videos/Trailers/"). The ViewingPreference used in the algorithm is a string used as a value in the DHT that consists of the version's description and its infohash, for example, "Version with game trailers;126ff7a15f7f4f9025a12eae0ff3547c227c355e".
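The construction of these two strings can be sketched as below. The function names are hypothetical, and the UTF-8 encoding of the path and the hexadecimal form of the hash are assumptions; the source only states that the path is SHA-1 hashed and that the description and infohash are joined with a semicolon.

```python
import hashlib


def path_hash(path: str) -> str:
    """SHA-1 hash of a directory node path, as a 40-character hex string.

    Assumes the path is encoded as UTF-8 before hashing.
    """
    return hashlib.sha1(path.encode("utf-8")).hexdigest()


def make_viewing_preference(description: str, infohash: str) -> str:
    """Join a version description and its infohash with a semicolon."""
    return f"{description};{infohash}"


def parse_viewing_preference(value: str) -> tuple[str, str]:
    """Split at the last semicolon, so descriptions may contain semicolons."""
    description, _, infohash = value.rpartition(";")
    return description, infohash


key = path_hash("/Videos/Trailers/")
value = make_viewing_preference(
    "Version with game trailers", "126ff7a15f7f4f9025a12eae0ff3547c227c355e"
)
```

Because every peer hashes the same path string, all preferences for one directory node end up under the same DHT key.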
The result of each peer storing their ViewingPreference under the key PathHash is that all the infohashes of all of the versions of a directory node are stored under the key that is the SHA-1 hash of the directory node's path. To retrieve a version of a directory node, the Localhost peer SHA-1 hashes the directory node's path to get the PathHash, and performs a DHT get(pathhash) to retrieve the descriptions and infohashes of all of the versions of that directory node. The infohashes are then used to download that version, by use of Azureus' torrent file download feature and decentralised tracking feature.

Algorithm 1 Logic used by the Localhost peer to serve a request to retrieve a directory node.

    if a specific version is requested via the getversion command then
        Fetch that specific version of the directory node and return it to the web browser.
        Perform a DHT put(pathhash, ViewingPreference) to notify the LDS that the user is viewing that version.
        Record that version as the chosen version.
    else if the directory is in editing mode then
        Return editing-mode-format of the directory page.
    else if a version has been recorded as the one chosen to view then
        Fetch that version of the directory node and return it to the web browser.
    else
        Perform a DHT get(pathhash) to retrieve the ViewingPreferences of the directory node.
        Tally the ViewingPreferences by combining together ones with identical infohash and description (i.e. the ones that are the same version).
        Find the version from the tally that has the highest number of viewers.
        Fetch that version of the directory node and return it to the web browser.
        Perform a DHT put(pathhash, ViewingPreference) to notify the LDS that the user is viewing that version.
        Record that version as the chosen version.
    end if

The Localhost peer uses a cache to avoid re-downloading versions of directory nodes that it has downloaded previously. When the Localhost peer fetches a version of a directory node, as done in algorithm 1, the Localhost peer first checks its cache. At this point in time, the Localhost peer knows the infohash of the version of the directory node, from a previous DHT get(pathhash), as listed in algorithm 1. The Localhost peer's cache system is stored as folders in the file system of the PC that the Localhost peer is running on. Each version of a directory node's XML file is stored in the cache in a folder named infohash, where infohash is the infohash of the torrent file for the version of the directory node. An earlier design of the peer had each XML file stored in the cache in a folder that was named by taking the SHA-1 hash of the directory node's path. This meant that only one version of a particular directory node could be stored at a time.
This was inadequate, because users need to be able to switch between different versions of a particular directory node quickly, to review the differences and make their choice. If the directory node XML file is found in the cache, the cached version is used. If the directory node XML file is not found in the cache, the directory node XML file is downloaded from remote Localhost peers that have a copy of it, placed into the cache, and used.

To download the directory node XML file, the Localhost peer needs the torrent file for the XML file. The torrent file is downloaded from a remote Localhost peer using the torrent file download feature of Azureus. To download the torrent file, the Localhost peer requires the infohash of the torrent file, which it has at this point in time from a previous DHT get(pathhash). The Localhost peer downloads the torrent file by performing a DHT get(infohash), where infohash is the infohash of the torrent file, to locate peers that are seeding and downloading the file; these peers will also be willing to transmit the torrent file. The torrent file is downloaded from these peers and is then used to download the directory node XML file. The directory node XML file is downloaded using the decentralised tracking feature of Azureus, and the result of the previous DHT get(infohash) to locate peers that are seeding the file.

After the torrent file and XML file have been downloaded, the Localhost peer acts as a seed for the XML file. The XML file is seeded using decentralised tracking, so seeding it involves performing a DHT put(infohash, IPaddress) to add the peer's address details to the DHT, where infohash is the infohash of the torrent and IPaddress is the peer's IP address. A DHT get(infohash) is also performed, to locate remote peers that are attempting to download the file, so that the peer can connect to them.
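The tally step of algorithm 1, which picks the default version from the ViewingPreferences returned by the DHT, can be sketched as follows. The function name and the sample infohashes are hypothetical, and the sketch assumes that two preferences refer to the same version exactly when their strings are identical, i.e. both description and infohash match.

```python
from collections import Counter
from typing import Optional

# Hypothetical 40-character infohashes, used only for this example.
HASH_A = "a" * 40
HASH_B = "b" * 40


def most_popular_version(viewing_preferences: list[str]) -> Optional[tuple[str, str]]:
    """Return (description, infohash) of the version with the most viewers.

    Returns None when no preferences were collected. Ties are broken in
    favour of the preference that was encountered first.
    """
    if not viewing_preferences:
        return None
    tally = Counter(viewing_preferences)
    value, _count = tally.most_common(1)[0]
    description, _, infohash = value.rpartition(";")
    return description, infohash


prefs = [
    f"Version with game trailers;{HASH_A}",
    f"Version with film trailers;{HASH_B}",
    f"Version with game trailers;{HASH_A}",
]
winner = most_popular_version(prefs)
```

Here two of the three collected preferences name the same version, so that version is returned as the default.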
When a Localhost peer is seeding a file, Azureus' torrent file download feature allows the peer to transfer the file's torrent file via UDP to any remote peer that requests that torrent file.

Algorithm 1 states that a DHT get(pathhash) is performed to retrieve the ViewingPreferences of a directory node. The number of ViewingPreferences submitted for any directory node in the system grows as O(n), where n is the number of users in the system. To put a limit on the time taken by the DHT get(pathhash) operation to collect the viewing preferences, the operation is limited to four seconds. Four seconds was chosen as a trade-off between taking no time, which would collect no viewing preferences, and taking too long, which would make users wait too long to view directory nodes.

5.5 Global namespace

When the Localhost peer is started on a user's PC, the peer launches a web browser which is pointed to This URL requests the peer to retrieve the directory node of path "/", which is the global root directory node of the hierarchical structure. Having this URL requested by every Localhost peer gives each user a starting point to browse from, and ensures that each peer uses the same string ("/") to hash when finding the root directory node.

5.6 Directory node display

Each directory node XML file contains a reference to an Extensible Stylesheet Language (XSL) file. The XSL file is used to define how the directory page is displayed in the web browser. When an XML file is returned to the web browser, the reference to the XSL file causes the web browser to make a request for the XSL file. The Localhost peer has an XSL file which it returns to the web browser in response to this request. This sequence is shown in figure 6 for one directory page transfer. The XSL file is used by the web browser to visually format the XML file, as shown in figure 3.

It was theorised that it would be difficult for a directory node to move on to the next best version, because the most popular version of a particular directory node is returned to a user who has not chosen to view any other version. The solution implemented for this problem was to add a listing of the six most popular versions on the directory page.
Six was chosen as the number of versions to show so that the directory page was not too long, while still including a reasonable number of other versions. Users should more easily be able to choose other versions to view, because the versions are visible as the users are viewing the directory page, rather than requiring a visit to the versions page first.

5.7 File retrieval

The files listed in the directory nodes are downloaded in the same way as the directory nodes: the files are downloaded using the BitTorrent protocol, using the decentralised tracking and torrent file download features of Azureus. When a file has finished downloading, the Localhost peer becomes a seed for that file, helping to distribute the file.

6 Results and discussion

We developed a website [18] to release the Localhost peer, and placed it online on the 23rd of August The website describes the peer, and allows users to download a program that installs the peer onto the user's PC. The website includes a link to the web output of a modified Localhost peer running on a host, so that users can have a preview of how the system looks, and what files are on the system, before they download and run the peer on their own PC. When the user clicks a file link, the modified peer notifies the user that they have to download and run the peer on their own PC in order to download files from the directory structure. The modification also disallows editing and creating new versions of the directory nodes. We created new versions of subdirectory nodes in the root directory node, such as Pictures, Software, and Audio, with a range of (legal) files added to them, to encourage users to use the peer. The modified peer was set to view those versions so that at least one peer's preference was for those versions.

6.1 Comparison to systems similar to Localhost

Table 2 compares Localhost to various P2P file sharing systems. Gnutella and Localhost are the only systems listed in the table that have decentralised indexes. Three of the systems covered in the table have the ability to find files using a browsable structure. The DirectConnect and Soulseek systems allow each peer's files to be browsed individually, the standard BitTorrent system has a number of separate browsable indexes, and Localhost has a single browsable index of all of the files in the system.

Localhost has a number of advantages over some of the other P2P file sharing systems. The first is that Localhost is a completely decentralised system. Both the transfer of files and the maintenance of the index are done by the peers in the system, without use of any centralised server. Therefore Localhost does not have any single point of failure (see footnote 1). The second is that the single browsable structure indexes all of the files in the system, rather than only a subset of them as with the standard BitTorrent file sharing system websites.

Localhost has a number of disadvantages compared to other P2P file sharing systems. The first is that it has limited real world use. Since Localhost's release, the users of the Localhost system have not created enough new versions of directory nodes for us to draw conclusions about the behaviour of the popularity system. The second is that its browsing speed is relatively slow. From our observations of a running peer, the process of downloading a directory node and displaying it usually takes between 10 and 50 seconds on a 1.5 megabits/second home broadband connection. The standard BitTorrent system's speed of browsing is faster than this, as the index is maintained on websites, the pages of which can usually be viewed within 10 seconds of requesting them.
System | Index placement | Index type | Strengths | Weaknesses
Napster | Centralised | Query string searchable | | Pollution/poisoning level, centralised index
Gnutella based systems | Decentralised | Query string searchable | No centralised components required | Pollution/poisoning level, centralised index
Soulseek / DirectConnect | Centralised | Query string searchable, and individual peer browsable | Peers are browsable | Browsable index isn't cohesive
Standard BitTorrent system | Centralised | A number of separate browsable categorical indexes | Pollution/poisoning level, download performance | Centralised indexes
Localhost | Decentralised | Global browsable hierarchical structure | Pollution/poisoning level (theoretical), no centralised components, collaboratively created index | Limited real world use, browsing speed

Table 2: Comparison of P2P file sharing systems

1 There is of course the one bootstrap peer that is used by all of the peers in the system to join the DHT overlay network. Each peer requires the IP address of any one peer that is already in the DHT, to join the DHT. There are options other than using the one single bootstrap peer, such as including a list of known remote peer IP addresses with the peer when the peer is downloaded, or word of mouth transfer of IP addresses of other remote peers.

6.2 Theoretical behaviour of the popularity based system

In this section we model the popularity system and run a simulation to study its behaviour. The simulation looks at the number of user preferences for each version of one directory node, over time. We wish to have one dominant version of the directory node at all points in time in the simulation, and for that dominant version to be of higher quality than the other versions of the directory node. The dominant version is the version that stays the most popular version for a period of time. The simulation looks at two properties of the popularity system:

- The popularity system's ability to change the dominant version of the directory node from one version to other, higher quality, versions. If the system is able to change from a poor quality version (e.g. a version that lists fake or unusable files) to a higher quality version, it should have less of a problem with pollution and poisoning.
- The stability of the directory node. If two or more versions quickly alternate between being the most popular version of the directory node, then the directory node is not considered stable. An unstable directory node does not have a single dominant version.

The simulation is driven by time ticks, where each time tick represents one minute in the real system. The simulation covers 10,000 minutes, and 2000 users enter the system during this time. Each user's entry time is a random time from time 1 to time 10,000. Each user stays in the system for an average of 120 minutes, and then leaves. This value was selected based on the uptime of BitTorrent peers in the standard BitTorrent file sharing system [26]. At time tick 1, the directory node has one version, and the system has one user in it. The model considers that for every user, there is a constant probability of 0 < P_c < 1 that the user will create a new version of the directory node.
In the model, users that create a new version of the directory node do so when they enter the system, and then they choose to view that version. The model considers that for every time tick, each user has a constant probability of 0 < P_v < 1 of changing their viewing preference of the directory node. When a version has no user preferences, it can never be chosen again, and is considered dead.

Each version is given a quality, which is a real value between 0 and 1. Each version's quality value is chosen randomly using a uniform distribution, and stays constant throughout the simulation. In the model, the version that a user changes their preference to is chosen by a random selection that favours each version according to its quality. This is done by placing all the versions along a number line, with each version taking up the length of its quality, so that the number line's length is the sum of all the qualities. A number is chosen at random, using a uniform distribution, from the length of the number line, and the version on which that number falls is the version that is chosen.

The simulation measures the popularity system's ability to change the most popular version of a directory node to a better version. The simulation does this by counting the number of time ticks taken from when a version that is better than the current most popular version is created, to when that version becomes most popular. This is done for every version that is better than the current most popular version. These counts are summed to give the measure of the system's ability to change. A lower value represents a higher ability to change quickly.

The simulation measures the popularity system's stability by counting how many times each version becomes the most popular version. The count is increased every time a version becomes the most popular version, even if that version has become the most popular version before.
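The quality-weighted choice and the stability count described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used to produce the results: the function names choose_version and count_leader_changes, the rng parameter, and the decision to count the initial version as having become most popular once are all hypothetical choices made for this sketch.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def choose_version(qualities, rng=random):
    """Pick the index of a live version, favouring higher quality.

    Each version occupies a segment of the number line whose length is its
    quality; a point is drawn uniformly from the whole line and the version
    whose segment contains the point is returned.
    """
    boundaries = list(accumulate(qualities))  # right edge of each segment
    point = rng.uniform(0.0, boundaries[-1])  # uniform point on the line
    # bisect_right returns the first segment whose right edge exceeds the
    # point; min() guards the rare case where the point equals the total.
    return min(bisect_right(boundaries, point), len(qualities) - 1)


def count_leader_changes(leaders):
    """Count how many times any version becomes the most popular version.

    `leaders` holds the most popular version at each time tick. A version
    that regains the lead is counted again. Assumption of this sketch: the
    version leading at the first tick is counted as one change.
    """
    changes = 0
    previous = None
    for leader in leaders:
        if leader != previous:
            changes += 1
            previous = leader
    return changes
```

For example, a tick-by-tick leader sequence of A, A, B, A gives a count of three under this convention, and a lower count corresponds to a more stable directory node.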
The counts from all versions are summed to give an overall value that measures the system's stability. The lower the value is, the more stable the system is.

The simulation was run for values of P_c from 1/16 to [...], and P_v from 1/16 to [...]. Figure 7 shows the results. Figure 7(a) shows that the system becomes unstable for values of P_v larger than 1/64, as indicated by the peaks in the graph. This is because, as users change their preferences more frequently, no version stays the most popular for long. This instability is increased with larger values of P_c, because there are more versions that can potentially become the most popular.

Figure 7: Results of the simulation, with parameters P_v and P_c varying. (a) The popularity system's stability. (b) The popularity system's ability to change the directory node's most popular version to be a better version.

Figure 7(b) shows that with a high P_c and a low P_v, that is, with users infrequently changing their viewing preferences and a high percentage of users creating new versions, the system takes longer to make the higher quality versions the most popular version. The figure shows zero time ticks taken for better versions to become the most popular, for values of P_v where P_c is 1/1024, because in these cases only about 10 versions were created over the 10,000 minutes, and only a few versions, at most, were alive at any one time.

The two aims of the popularity system are that it is stable, and that it has the ability to change the most popular version of a directory node to be a higher quality version. From inspection of figures 7(a) and 7(b), the range of values of P_c and P_v that achieve these two aims to a reasonable degree is P_v < 1/64 and P_c < [...]. The reasonable degree includes having fewer than 100 changes of most popular version over the 10,000 time ticks, and fewer than 20 total time ticks to bring new, higher quality versions to be the most popular. Of course, the value of P_c will have to be high enough for new versions to be created, and the value of P_v will have to be high enough to allow some versions to be viewed by users other than their creators.

6.3 Observation of the Localhost system in use

In this subsection we provide some observations on how the Localhost system worked after being released, and look at some of the feedback received from users. A number of third party websites linked to the Localhost website.
The peer has been downloaded over 10,000 times since its release, according to our web server log statistics. Despite this, only a relatively small number of new versions of directory nodes were created by users. One of these was a version of the root directory node that was created with the description Videos. This version includes a directory node also called Videos, and in that Videos directory node is a directory node called Trailers. In the Trailers directory node are a number of video files of game trailers. The directory page for this is shown in figure 2. Another was a new version of the Pictures subdirectory node that included a number of
