Broadcast Updates with Local Look-up Search (BULLS): A New Peer-to-Peer Protocol

G. Perera and K. Christensen
Department of Computer Science and Engineering
University of South Florida
Tampa, FL 33620
+1 813 974 3652
{gpererao, christen}@cse.usf.edu

ABSTRACT

Peer-to-Peer (P2P) networks based on Gnutella locate files by flooding the network with query messages (a flooding query search). In this paper, a new P2P search paradigm is presented. The network is flooded with the list of shared files and corresponding updates instead of by queries. Novel P2P applications such as power management and ethical file sharing are now possible with this new method. A new protocol named Broadcast Updates with Local Look-up Search (BULLS) enables new applications and reduces overhead traffic by enabling a local look-up of queries (i.e., queries are not broadcast). Nodes periodically broadcast changes in their list of files shared and build a table containing the list of shared files by each node. BULLS and Gnutella are represented using finite state machines (FSM). Flow models are developed to determine the overhead traffic in messages per second. For a representative P2P network scenario, BULLS can reduce Gnutella's overhead traffic by 19%.

Categories and Subject Descriptors
C.2.4 [Computer Communication Networks]: Distributed Systems; C.2.2 [Computer Communication Networks]: Network Protocols

General Terms
Design, Performance

Keywords
Peer-to-Peer, P2P, performance evaluation, protocol design, Gnutella, BULLS

1. INTRODUCTION

Unstructured Peer-to-Peer (P2P) networks such as Gnutella [4] distribute content (files) in a decentralized manner, are self-organized, and are robust.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ACMSE '06, March 10-12, 2006, Melbourne, Florida, USA. Copyright 2006 ACM 1-58113-000-0/00/0006 $5.00.

P2P file sharing applications including Limewire, Kazaa, and BitTorrent comprise the majority of the Internet's traffic [2, 7]. Much of this P2P traffic is overhead from the flooding of query messages and the associated queryhit response messages from searches for popular files. Users in a Gnutella file sharing P2P network search for files by flooding the network with queries. A file search requires the user to know the entire name of the file searched for or a substring contained in the filename. Queries in Gnutella are thus substring searches over filenames, and searching for one targeted file is equivalent to one substring search. For multiple targeted files with no common substrings in their filenames, a query must be made for each file. Valid searches in Gnutella have a substring of length greater than three and do not contain any wildcards.

In addition to these search restrictions, a user at a node cannot determine what files are being shared in the network. That is, it is not possible to have knowledge of the entire set of files shared in the network. There are two reasons for this. The first is that a node lacks a method to make its list of shared files available to all other nodes. The second is that it is not possible to make a single query for all the shared files in the network. Thus, multiple queries would be needed and a large number of nodes would have to be queried. Even if this could be done, the overhead traffic in queries and queryhits would be very high.
If it were possible for all nodes to have knowledge of the files shared by all other nodes, then new and significant applications could be implemented. Such novel applications include:

Power management: Nodes sharing redundant content could be powered down and energy savings could be achieved.
Ethical file sharing: Since nodes make their shared files explicit, it is unlikely they would want to share illegal content.
Affinity groups: Users can establish social connections based on knowledge of the similar content shared by other users (e.g., based on common musical tastes).

In this paper we explore a new protocol that enables all nodes in the P2P network to acquire knowledge of the files shared by all other nodes. The remainder of this paper is organized as follows. Section 2 describes the Gnutella and BULLS protocols. Section 3 presents the flow models for Gnutella and BULLS. Section 4 compares the performance of Gnutella and BULLS in terms of the overhead traffic rate. Section 5 describes the related work. Section 6 contains the conclusion and describes the future work.

[Figure 1. Gnutella FSM]

2. GNUTELLA AND BULLS PROTOCOLS

In this section the Gnutella and BULLS protocols are described using finite state machine (FSM) representations of the critical protocol behavior. The notation for the FSM diagrams shows states as vertical lines and transitions as horizontal arrows indicating the direction of the transition. A transition is initiated when the input or condition specified above the arrow is met. The outputs or actions are specified below the arrow and occur while the transition is made. The dotted arrows are the initial and final transitions of the diagram. The initial transition does not have an originating state and final transitions do not have a destination state.

The FSMs cover the protocol behavior related to the exchange of messages (i.e., the queries and queryhits for Gnutella and the updates for BULLS). Message exchange is the overhead traffic of a P2P network. Low data rate protocol operations such as the exchange of ping and pong messages are omitted from the FSMs. The final file download (one per queried file found) is the only useful, non-overhead traffic.

2.1 Description of Gnutella

Gnutella is an unstructured P2P file sharing network that uses permanent TCP/IP connections between neighboring nodes (neighbors) and downloads files via HTTP [4]. Before a node connects to the network it downloads from a bootstrapping node a set of IP addresses from which neighbors are selected. A bootstrapping node in Gnutella is always available and caches the IP addresses of nodes that have been or are connected to the network.
Each Gnutella node in the network maintains permanent TCP connections to approximately six neighbors (node degree of six) [4]. When a user (at a node) wants to find and download a file, a query message is broadcast. A node that has not already received the query through another neighbor forwards it (repeats the query to all neighbors except the neighbor that sent it). The response to a query is a queryhit message. The queryhit is routed back to the requesting node by the nodes that forwarded the query. The requesting node may receive many queryhits (e.g., if it is searching for a popular file). The user of the requesting node manually chooses a node from which to download the file and does so via HTTP. The main functionality of Gnutella can be summarized by two operations: file search by query broadcast, and selection of the file to download from the multiple queryhit responses.

Figure 1 shows an FSM representation of the Gnutella protocol. Four states are defined: INITIALIZE, IDLE, SEARCH, and SELECT. The states and their transitions are:

INITIALIZE: A node enters this state by requesting neighbor addresses from a specialized bootstrapping node. On receiving a response from the bootstrapping node, it establishes permanent TCP/IP connections using the neighbor addresses received and transitions to IDLE.

IDLE: In this state a node can 1) initiate a file search by sending a query message and transition to SEARCH, 2) receive a query for which it has a file, repeat the query to all of its neighbors, and respond with a queryhit, 3) receive a query for a file it does not have and repeat the query message, or 4) quietly depart the network.

SEARCH: In this state the node waits to receive query responses (queryhits) and it can 1) transition to IDLE if no responses are received or 2) transition to SELECT if one or more responses are received.
The time spent waiting for responses before transitioning to SELECT or IDLE is not considered in this paper because it does not impact the amount of overhead traffic.

SELECT: In this state a node from which to download a file is selected. The user manually chooses the node from the set of nodes that responded with a queryhit. Once the node downloads the selected file, it updates the shared file list and transitions to IDLE.

There are two transitions that impact the amount of overhead traffic generated. The first is the transition from IDLE to SEARCH, resulting in the broadcast of a query message to all nodes. The second is the transition from IDLE that results when a queried file is found and a queryhit is returned. Queryhit messages are routed back along the path on which the query arrived using a queryhit routing table. Each node keeps a routing table with the query id and the id of the node from which the query was received. Because a queryhit carries the same id as the query it responds to, it can be routed back through the same path the query traveled. This generates significant traffic for popular files (i.e., files that have many replicas).

2.2 Description of BULLS

BULLS is a protocol for unstructured P2P networks offering the same functionality as Gnutella, but allowing nodes to know what files are shared by all other nodes. All nodes connect to the network in Gnutella style. A BULLS node stores a global directory data structure that contains the information of the files shared by each node in the network. Once a node has established permanent TCP/IP connections with its neighbors, it floods the network with the complete listing of its shared files. The shared file listing is repeated by nodes via update messages. Since there is one update message for each entry (filename) in the listing of shared files, the network is flooded as many times as there are files shared. Similarly, the network is flooded with an update message each time a file is downloaded or a shared file is added or deleted by the user. The main functionality of BULLS can be summarized by two operations: 1) a local look-up file search (no overhead traffic is generated in the network) and 2) broadcast of update messages. All nodes repeat the update messages received, cache the updates, and receive and repeat depart messages.

[Figure 2. BULLS FSM]

2.2.1 Data Structure for BULLS

The global directory data structure stored by each node in BULLS is a table. Each row represents the data stored for a node in the network. The columns represent the two basic types of data stored. The first column is the nodename, which is used to uniquely identify a node in the network (IP address or node identification number). The second column is the list of filenames. This column stores the shared file listing (the set of filenames shared) of the node in a given row, in lexicographical order.
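The table described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the class name `GlobalDirectory`, its method names, and the case-insensitive substring match are all hypothetical choices, and removing a departing node's row is one possible reading of how a row is modified on departure.

```python
from bisect import insort


class GlobalDirectory:
    """Illustrative table with one row per node: nodename -> sorted filenames."""

    def __init__(self):
        # nodename (e.g., an IP address string) -> list of shared filenames
        self.rows = {}

    def add_file(self, nodename, filename):
        files = self.rows.setdefault(nodename, [])
        if filename not in files:
            insort(files, filename)  # keep the row in lexicographical order

    def remove_file(self, nodename, filename):
        files = self.rows.get(nodename, [])
        if filename in files:
            files.remove(filename)

    def remove_node(self, nodename):
        # Assumption: a departed node's row is simply dropped.
        self.rows.pop(nodename, None)

    def lookup(self, substring):
        """Local look-up: (nodename, filename) pairs whose filename contains substring."""
        needle = substring.lower()  # assumption: matching ignores case
        return [(node, name)
                for node, files in sorted(self.rows.items())
                for name in files
                if needle in name.lower()]
```

A search is then a purely local call such as `directory.lookup("song")`, which generates no network traffic; the nodes it returns are the candidates from which a file could be downloaded.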
The storage requirements for the global directory data structure are evaluated later in this paper and are shown to be reasonable, even for large P2P networks.

2.2.2 Finite State Machine Description

Figure 2 is the FSM representation of the BULLS protocol. As in Gnutella, four states are defined: INITIALIZE, IDLE, SEARCH, and SELECT. The global directory data structure described above is referred to as the data structure in the FSM's description and in the rest of the paper, since it is the only data structure used by BULLS. The states and transitions are:

INITIALIZE: A node enters this state by requesting neighbor addresses and downloading the data structure from a specialized bootstrapping node. On reception of a response with the requested neighbor addresses, the node connects to its neighbors, forwards its own shared file list (one update message per file shared), and transitions to IDLE.

IDLE: In this state a node can 1) make a file search by a local look-up in the data structure and transition to SEARCH, 2) detect a change in the data structure, broadcast the change via an update message (one update per change), and remain in IDLE, 3) receive an update message, modify the data structure with the update received, store it in the cache, repeat it (send the update message to all neighbors except the one from which it was received), and remain in IDLE, 4) receive a depart message, update the data structure by modifying the departing node's row entry, and repeat the depart message, or 5) disconnect from the network by sending a depart message.

SEARCH: In this state the node waits for results from a local look-up and it can 1) transition to SELECT if the local look-up is successful or 2) transition to IDLE if the local look-up does not return results.

SELECT: In this state a node from which to download a file is selected.
The set of possible nodes to select from is returned by the successful local look-up executed in the SEARCH state. The node downloads the file, updates its shared files, updates its data structure, and transitions to IDLE.

The transitions that impact the amount of overhead traffic generated are:
1) The transition from INITIALIZE to IDLE, in which one broadcast message per shared file entry is issued to all nodes. Broadcast is done as in Gnutella.
2) The transition from IDLE that occurs on a change in the shared files, in which an update message is broadcast.
3) The transition from IDLE that occurs when a depart message is received and then broadcast.
4) The transition from IDLE that occurs when an update message is received and broadcast.
5) The transition that allows nodes in the IDLE state to disconnect by broadcasting a depart message.
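The update and depart handling in the IDLE state can be illustrated with a small in-memory sketch. Everything named here is hypothetical (`BullsNode`, `connect`, `flood`, and the message fields `id`, `kind`, `node`, `file`); duplicate suppression by message id is assumed to follow Gnutella-style flooding, and the update cache mentioned in the text is left out for brevity.

```python
from collections import deque


class BullsNode:
    """Toy BULLS node: repeats each new message to all neighbors except the sender."""

    def __init__(self, name):
        self.name = name
        self.neighbors = []   # directly connected BullsNode objects
        self.seen = set()     # ids of messages already processed
        self.directory = {}   # nodename -> set of shared filenames
        self.received = 0     # copies of flooded messages received (overhead)

    def apply(self, msg):
        """Modify the local data structure according to one message."""
        if msg["kind"] == "add":
            self.directory.setdefault(msg["node"], set()).add(msg["file"])
        elif msg["kind"] == "delete":
            self.directory.get(msg["node"], set()).discard(msg["file"])
        elif msg["kind"] == "depart":
            self.directory.pop(msg["node"], None)

    def handle(self, msg, sender):
        """Process one received copy; return the forwards it triggers."""
        self.received += 1
        if msg["id"] in self.seen:
            return []         # duplicate copy: drop it
        self.seen.add(msg["id"])
        self.apply(msg)
        return [(n, msg, self) for n in self.neighbors if n is not sender]


def connect(a, b):
    """Create a bidirectional neighbor link."""
    a.neighbors.append(b)
    b.neighbors.append(a)


def flood(origin, msg):
    """Broadcast msg from origin and deliver every copy it generates."""
    origin.seen.add(msg["id"])
    origin.apply(msg)
    queue = deque((n, msg, origin) for n in origin.neighbors)
    while queue:
        node, m, sender = queue.popleft()
        queue.extend(node.handle(m, sender))
```

In a fully connected three-node network, one update flooded by the first node reaches both other nodes, and each of them receives two copies of it, one per neighbor.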

3. PERFORMANCE MODEL

The models developed in this section result in expressions for the storage requirement of the data structure of BULLS ($S_{bulls}$) in bytes and the overhead traffic per node in messages per second for Gnutella ($X_{gnutella}$) and BULLS ($X_{bulls}$). The models are developed as a function of the ten independent variables shown in Figure 3. The variables are defined for both Gnutella and BULLS.

Independent variables:
$D$ = Node degree
$M$ = Number of files shared per node
$P$ = Probability of a node having a given file
$N_{filename}$ = Number of bytes required to store a filename
$N_{hops}$ = Number of hops (nodes) a queryhit travels
$N_{nodename}$ = Number of bytes required to store a node name
$N$ = Number of nodes in the P2P network
$R_{search}$ = Rate of searches per node (messages/sec)
$R_{updates}$ = Rate of file list updates per node (messages/sec)
$T_{stay}$ = Time a node stays in the P2P network (sec)

Dependent variables:
$S_{bulls}$ = Storage required per node for BULLS (bytes)
$X_{bulls}$ = BULLS overhead message rate per node (messages/sec)
$X_{gnutella}$ = Gnutella overhead message rate per node (messages/sec)

Figure 3. Model variables

There are three assumptions:

1. The first assumption is that the number of nodes in the P2P network ($N$) remains constant. This makes the behavior of both protocols independent of the number of nodes that connect to the network or of which nodes connect first. Overhead traffic can be analyzed when the P2P network is in a stable state where the same number of nodes enter and depart.

2. The second assumption is that a single message from either BULLS or Gnutella is equivalent to sending one packet in the network. This allows the comparison of the overhead traffic to be based on the flow of messages and not on the specific characteristics of the links and nodes of the network.

3. The third assumption defines each search to be equivalent to one file search (searches are used to locate one file in the network). Multiple file searches can be modeled as multiple single file searches.
The total number of files shared in the network is $MN$, as each node shares $M$ files. The variable $P$ is defined as the measure of popularity of a file (the probability that a node has a requested file). If $P = 0$ no node has the file, and if $P = 1$ every node has the file. Thus, $P$ determines the number of queryhit responses for Gnutella. The node degree $D$ is the number of neighbors maintained by a node. The rate of file searches per node, $R_{search}$, corresponds to the total file query search activity initiated by the user at a node (successful and unsuccessful searches). A successful file search response in Gnutella is a queryhit. Each queryhit is routed back via the nodes from which the query was received. The number of hops (nodes) the queryhit travels in the network is $N_{hops}$. All successful file searches are assumed to result in a complete file download (causing an update to the shared file list of the node). In addition to downloads, it is possible for the shared file list of a node to be changed by users removing files or adding files from sources other than the P2P network. The rate of shared file list additions and deletions is the rate of updates, $R_{updates}$.

[Figure 4. Number of times a flooded message is received (panels (a) and (b), each showing nodes 1 to 5)]

The consequence of the second assumption is that a single message (or packet) is used to send a request to a neighbor. The following five events describe the situations in which BULLS or Gnutella send a single message: 1) file search query (Gnutella), 2) queryhit response (Gnutella), 3) file update message (BULLS), 4) node departing message (BULLS), and 5) broadcasting the entire shared file list (BULLS) when a node connects to the network. It is assumed that $M$ messages are required to broadcast the entire shared file list (i.e., each filename requires one message). This is an extreme assumption. The shared file list could be compressed and require far fewer than $M$ messages.
3.1 BULLS Storage Requirement

A Gnutella node does not require any local storage other than for the shared files. In BULLS each node must store the data structure that contains the names of all files shared in the network by all nodes. The size of this data structure (in bytes) is:

$$S_{bulls} = N \left( N_{nodename} + M N_{filename} \right) \qquad (1)$$

3.2 Flow Models for Overhead Traffic

The traffic overhead for both Gnutella and BULLS is generated by the flooding of messages. Each node that receives a unique (not already received) message repeats the message to all of its neighbors, except the neighbor it received the message from. Messages are determined to be unique by the use of a Globally Unique Identifier (GUID). Nodes store the GUIDs of previously received messages in a table (the GUID table). Each time a message is received, its GUID is compared against the values stored in the GUID table. If a match is found, the node drops the message. Otherwise, the node adds the GUID to the GUID table and repeats the message to its neighbors. Thus, a node can receive a given message up to, but never more than, $D$ times. The actual number of times a node receives a given message is a function of the network topology and message forwarding delay. Figure 4 shows two cases: (a) each message sent by node 1 is received only once by node 2, and (b) the message is received four times by node 2. In this paper, we consider the worst case of each node receiving a flooded message $D$ times. In any case, this behavior will be the same between Gnutella and BULLS (both use the same rules to repeat messages and have the same network topologies), so relative comparisons are similar.

The overhead message rate per node for Gnutella is

$$X_{gnutella} = R_{search} D (N - 1) + R_{search} N_{hops} P (N - 1) \qquad (2)$$

The first term is the rate of query messages seen by each node. Each node receives $D$ copies of each query sent by every other node. The second term is an approximation of the rate of queryhit response messages seen by each node. Queryhit messages are returned via the backward path on which a query was received, thus each queryhit message travels on average $N_{hops}$ hops and is received by $N_{hops}$ nodes.

The overhead message rate per node for BULLS is

$$X_{bulls} = R_{updates} D (N - 1) + D \left( N / T_{stay} \right) (M + 1) \qquad (3)$$

The first term is the rate of flooded directory update messages seen by each node as a result of adding or deleting a shared file. When all searches are successful (i.e., a file is found) and files are not otherwise added to or deleted from a node, $R_{updates}$ will clearly be the same as $R_{search}$. The second term is the rate of flooded update messages seen by each node as a result of nodes entering the network (flooding their entire directory listing of shared files to all nodes) and of goodbye messages from departing nodes (by the first assumption, the rates at which nodes enter and depart the network are the same).

Clearly, the trade-off in overhead traffic between Gnutella and BULLS is a function of the rate at which nodes enter and depart the network ($N / T_{stay}$) and of $M$. BULLS will have lower overhead than Gnutella when $N / T_{stay}$ and $M$ are low. Taking $R_{updates} = R_{search}$, so that the first terms of equations (2) and (3) cancel, this is the case if

$$R_{search} N_{hops} P (N - 1) > D \left( N / T_{stay} \right) (M + 1) \qquad (4)$$

4. PERFORMANCE ANALYSIS

The models for the storage requirements of BULLS and the overhead traffic of BULLS and Gnutella need to be parameterized for a performance comparison of Gnutella and BULLS. Figure 5 shows the values (and range) for the independent variables. The fixed value for each of the variables is representative, and the range for $T_{stay}$ is reasonable to study the dynamics of nodes. The values for $M$, $P$, and $D$ were selected from the literature: $M$ from [6] and $P$ and $D$ from [4].

$D$ = 6
$M$ = 100 files
$P$ = 0.00125
$N_{filename}$ = 50 bytes
$N_{hops}$ = 3.5
$N_{nodename}$ = 16 bytes
$N$ = 78125
$R_{search}$ = $4.17 \times 10^{-3}$ messages/sec
$R_{updates}$ = $3.21 \times 10^{-3}$ messages/sec
$T_{stay}$ = 12 hours to 7 days

Figure 5. Numerical values for model variables

The estimates for the other variables are:

$N$ is calculated from $D$. Given that each node has $D$ different neighbors and that the maximum number of hops a message travels is 7, based on the standard Gnutella time-to-live value of 7 [4], then $N = (D - 1)^7 = 78125$.

$N_{hops}$ is the average path length a message travels. It can be estimated as half the maximum number of hops a message travels (i.e., 7 hops), so $N_{hops} = 3.5$.

$R_{search}$ is roughly the inverse of the sum of the average times for a user to search (30 seconds), select the file to download (30 seconds), and download a file (3 minutes), that is, one search every 240 seconds. This is an extreme case where a user does not consume (e.g., listen to or view) a file before initiating another search and download.

$R_{updates}$ is the sum of the rate of downloads (successful searches) and the rate at which a user adds or deletes a shared file. It has been estimated that 77% of searches are successful [3]. The rate at which a user adds or deletes a shared file is approximated at one every few hours, which is negligible with respect to the rate of downloads. The rate of updates is then $R_{updates} = 0.77 R_{search}$.

$T_{stay}$ is estimated to be in the range of many hours to several days. This models P2P applications as pervasive and always on, as is currently the case with shared disks in desktop PCs. The value of $T_{stay}$ has a significant effect on BULLS overhead (in the second term of equation (3)).

[Figure 6. Impact of $T_{stay}$ on overhead traffic (y-axis: messages/second; x-axis: $T_{stay}$ in units of $10^3$ seconds; curves: Gnutella and BULLS)]
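To make the dependence on $T_{stay}$ concrete, equations (1) to (3) can be evaluated directly with the Figure 5 values. The following is a minimal sketch rather than the authors' tooling: the function and constant names are hypothetical, and $R_{search}$ is taken as exactly 1/240 searches per second instead of the rounded $4.17 \times 10^{-3}$.

```python
HOUR = 3600.0  # seconds

# Values from Figure 5 and the estimates in the text.
D = 6                            # node degree
M = 100                          # files shared per node
P = 0.00125                      # probability a node has a given file
N_FILENAME = 50                  # bytes per filename
N_NODENAME = 16                  # bytes per node name
N = (D - 1) ** 7                 # nodes in the network
N_HOPS = 7 / 2                   # average hops a queryhit travels
R_SEARCH = 1 / (30 + 30 + 180)   # searches per second per node
R_UPDATES = 0.77 * R_SEARCH      # updates per second per node


def s_bulls():
    """Equation (1): bytes of directory storage per node."""
    return N * (N_NODENAME + M * N_FILENAME)


def x_gnutella():
    """Equation (2): Gnutella overhead, messages per second per node."""
    return R_SEARCH * D * (N - 1) + R_SEARCH * N_HOPS * P * (N - 1)


def x_bulls(t_stay):
    """Equation (3): BULLS overhead for a stay time of t_stay seconds."""
    return R_UPDATES * D * (N - 1) + D * (N / t_stay) * (M + 1)


if __name__ == "__main__":
    print(f"S_bulls = {s_bulls():.3e} bytes")
    print(f"X_gnutella = {x_gnutella():.0f} messages/second")
    for hours in (12, 24, 48, 7 * 24):
        print(f"T_stay = {hours:3d} h: X_bulls = {x_bulls(hours * HOUR):.0f} messages/second")
```

With these inputs the sketch gives a storage requirement of 391,875,000 bytes, about 1955 messages/second for Gnutella, and about 2600 messages/second for BULLS at a stay time of 12 hours. Only the second term of `x_bulls` changes with `t_stay`, which is why longer stay times favor BULLS.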
$N_{filename}$ is 50 bytes. Filenames are not usually longer than 50 characters (1 byte per character).

$N_{nodename}$ is 16 bytes because the IP address of a node is used as the node name.

The analysis results for the representative values in Figure 5 with $T_{stay}$ = 12 hours and equations (1), (2), and (3) are:

$S_{bulls}$ = $3.92 \times 10^{8}$ bytes
$X_{gnutella}$ = 1956 messages/second
$X_{bulls}$ = 2600 messages/second

The data structure size is about 374 MBytes. Given that hard drive sizes are usually 100 GBytes or larger, the BULLS storage requirement can easily be satisfied. Given that storage costs decrease with time, it is probable that within a few years the amount of storage required for BULLS will be entirely negligible with respect to the capacity of a commodity hard drive. The message rate corresponds to less than 200 Kb/sec, which is reasonable for broadband connections of several Mb/sec data rate. If $T_{stay}$ = 12 hours, the BULLS overhead traffic rate is 33% greater than Gnutella's. Figure 6 shows the overhead traffic rate as a function of the rate of nodes entering (and leaving) the network. The variables $M$, $P$, $D$, $N$, $R_{search}$, and $R_{updates}$ are fixed and $T_{stay}$ is varied. Figure 6 demonstrates that the overhead traffic rate for BULLS decreases as $T_{stay}$ increases. For example, if $T_{stay}$ is doubled to 24 hours, then the BULLS overhead is only 1% greater than Gnutella's.

4.1 Discussion of Analysis Results

Using the values from Figure 5 and the results from Figure 6, the following question about the network dynamics can be answered:

Question: How do the Gnutella and BULLS overhead traffic rates compare when the time a node stays in the network varies?

Figure 6 shows that the BULLS overhead traffic rate is higher than Gnutella's when $T_{stay}$ < 30 hours. As P2P becomes a pervasive Internet application, users are likely to remain connected for longer periods of time. When $T_{stay}$ > 30 hours, BULLS reduces Gnutella's overhead traffic by a minimum of 0.6% and a maximum of 19% ($T_{stay}$ = 7 days).

It is possible to further reduce the BULLS overhead traffic generated when nodes broadcast their entire list of files shared. The broadcasting of $M$ messages (one message for each file shared) can be reduced by compressing the text stored in the data structure. Text files can be compressed by up to 90% depending on text redundancy. Also, the broadcast of updates can be reduced by batching update messages together instead of broadcasting updates separately. Thus BULLS has roughly the same, or lower, overhead than Gnutella.

5. RELATED WORK

Flooding is suitable for a wide range of applications that have not been explored by existing P2P protocols. Many P2P protocols focus on limiting query flooding and do not allow nodes to know what files are shared by other nodes. Systems like FastTrack (i.e., Kazaa) use the concept of super nodes that proxy search requests from other nodes, called leaves, to limit flooding [1]. Flooding excludes from file searches the leaves with a low probability of responding to queries. Super nodes store the directory of the files shared by each of their assigned leaves.
Although super nodes know the files shared by their leaves, they do not know what files are shared by other super nodes. Protocols such as Kazaa cannot determine the entire set of files shared in the network.

Three successful applications of flooding the network with local information are presented in [5], in [8], and by the Open Shortest Path First (OSPF) protocol. The first is a decentralized replica location mechanism for scientific data analysis projects. The index of a node's content is disseminated using a soft state protocol and queries are supported by Bloom filters. The protocol in [5] has been evaluated for one specific data set and has not been proven to be portable to P2P networks. The second approach, in [8], proposes a wide-area file system in which distributed users can share data. Flooding is used to propagate the content of file updates instead of the name of the file being updated. OSPF is a link-state routing protocol that periodically floods the network with link state updates within its broadcast domain. Total knowledge of link state is possible within the broadcast domain.

Gnutella-based P2P protocols cannot determine the entire set of files shared in the network. BULLS differs from the existing approaches in two aspects: 1) all nodes explicitly know what the other nodes in the network share, and 2) a different perspective from most unstructured P2P protocols is explored. Efforts to develop new applications under a different P2P paradigm remain a challenge.

6. CONCLUSIONS AND FUTURE WORK

BULLS is a new paradigm for P2P, one where file lists are broadcast instead of queries for files. With BULLS all nodes have knowledge of what files all other nodes are sharing. This new paradigm enables new applications. A flow model showed that BULLS overhead is very close to that of Gnutella and is significantly less (about 19% less) as P2P applications remain always on for longer periods of time. The storage requirement for BULLS is reasonable for commodity hard drives.
Future work includes improving BULLS by limiting the number of updates broadcast. Updates can be broadcast only when file replicas fall below a threshold that affects file availability. Also, multiple update messages can be merged into a single update message. BULLS can be extended to enable P2P networks to be energy efficient by powering down nodes with redundant file content. This is an algorithmic set cover problem, and it is the immediate future work that will be undertaken as BULLS is investigated further.

ACKNOWLEDGMENTS

The authors thank Chamara Gunaratne and Cesar Guerrero, graduate students at USF, for their valuable comments.

REFERENCES

[1] Androutsellis-Theotokis, S. and Spinellis, D. A Survey of Peer-To-Peer Content Distribution Technologies, ACM Computing Surveys, 36,4 (December 2004), 335-371.
[2] Karagiannis, T., Broido, A., Brownlee, N., Claffy, K. and Faloutsos, M. Is P2P Dying or Just Hiding?, Proceedings of GLOBECOM, (December 2004), 1532-1538.
[3] Klemm, A., Lindemann, C., Vernon, M. and Waldhorst, O. Characterizing the Query Behavior in Peer-To-Peer File Sharing Systems, Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement, (October 2004), 55-67.
[4] Lv, Q., Cao, P., Cohen, E., Li, K. and Shenker, S. Search and Replication in Unstructured Peer-to-Peer Networks, Proceedings of the 16th International Conference on Supercomputing, (June 2002), 84-95.
[5] Ripeanu, M. and Foster, I. A Decentralized, Adaptive Replica Location Mechanism, Proceedings of the 11th IEEE International Symposium on High Performance Distributed Computing, (July 2002), 24-32.
[6] Saroiu, S., Gummadi, P. and Gribble, S. A Measurement Study of Peer-to-Peer File Sharing Systems, Proceedings of SPIE in Multimedia Computing and Networking, 4673,1 (January 2002), 156-170.
[7] Sen, S. and Wang, J. Analyzing Peer-To-Peer Traffic Across Large Networks, IEEE/ACM Transactions on Networking, 12,2 (April 2004), 137-150.
[8] Saito, Y., Karamanolis, C., Karlsson, M. and Mahalingam, M. Taming Aggressive Replication in the Pangaea Wide-Area File System, Proceedings of the 5th Symposium on Operating Systems Design and Implementation, (December 2002), 15-30.