Scalable Content-Based Publish/Subscribe Services over Structured Peer-to-Peer Networks


Xiaoyu Yang, Department of ECECS, University of Cincinnati
Yingwu Zhu, Department of CSSE, University of Seattle
Yiming Hu, Department of ECECS, University of Cincinnati

Abstract

Scalability remains a challenge in the design of distributed publish/subscribe systems. In this paper we propose a novel solution to this problem for content-based pub/sub systems built on top of a distributed hash table. The main objective is to ensure an appropriate number of rendezvous point nodes in the system and to maintain an even load distribution among them. An attribute-vector based publish/subscribe scheme and related load balancing mechanisms (ID space partitioning, attribute grouping, dynamic ID space split-merge) are proposed to achieve this goal. The experimental results show that our approach achieves good scalability by efficiently distributing and balancing load among an adaptive number of rendezvous point nodes, while retaining very small overhead and latency.

1. Introduction

The publish/subscribe (pub/sub) system has become a prevalent paradigm for delivering data and events from publishers (data/event producers) to subscribers (data/event consumers) across large-scale distributed networks in a decoupled fashion. In such a system, subscribers register their interests with the system as a set of subscriptions. Publishers can be completely unaware of the subscribers and simply submit information to the system as a set of publications. On receiving a publication, the system matches it against the subscriptions and delivers it to the interested subscribers.

Pub/sub system designs are typically classified into two classes: topic-based and content-based. Topic-based pub/sub systems require a predefined set of topics. A subscriber registers the topics in which it is interested and is then notified of all events associated with these topics.
In contrast, content-based pub/sub systems allow users to flexibly specify complex interests with a set of predicates over the content of the publication [5].

Pub/sub systems can be either centralized or distributed. The centralized solution lacks scalability and suffers from a single point of failure, so distributed solutions are more practical and preferred. The distributed hash table (DHT [14, 12, 19, 11]) paradigm is appropriate for building large-scale distributed applications because of its scalability, fault tolerance and self-organization. As a result, many pub/sub systems [7, 4, 17, 16, 15, 9, ] have been built on top of DHTs. Peers in the system cooperate in storing subscriptions and delivering events in a fully distributed manner. The main challenge in implementing DHT-based pub/sub systems is the design of an efficient, lightweight event delivery algorithm and a uniform distribution of load among the peers.

Previous work in our group introduced Ferry [2], a framework for building content-based pub/sub services on top of a DHT. It exploits the trees embedded in the DHT to aggregate and deliver events from the rendezvous point (RP for short) nodes. Such a design eliminates the cost of constructing and maintaining dissemination trees. However, as an initial design, Ferry could only adopt a small set of peers as RP nodes: for a pub/sub schema with n attributes, the maximum number of RP nodes is n. This restriction limits the scalability of Ferry in large distributed networks, because the few RP nodes are easily overloaded.

In this paper we present an enhanced solution, called Eferry, to address this problem. The basic idea is to maintain a suitable number of RP nodes in the system and to keep an even load distribution among them.
Eferry achieves this goal in three ways: (1) We propose a novel subscription installation algorithm that chooses RP nodes evenly distributed in the ID space. (2) The ID Space Partitioning and Attributes Grouping schemes are designed to flexibly adjust the number of RP nodes as well as their load. (3) Eferry adopts a self-adaptive load balancing algorithm, Dynamic ID Space Split-Merge, to make sure that no node is unduly loaded.

Another contribution of this paper is an optimization of the previous event delivery mechanism: by treating RP nodes as a special category of subscribers, event publication becomes more flexible and convenient. A cache and delay mechanism is also proposed to further aggregate event messages and reduce bandwidth usage.

The performance of Eferry has been extensively evaluated using the metrics of load distribution, overlay hops, latency, overhead and bandwidth. The results show that good scalability can be achieved by effectively distributing and balancing load among an adaptive number of RP nodes, while retaining very small overhead and latency.

The rest of this paper is structured as follows. Section 2 surveys related work. Section 3 describes the key features of our design. Section 4 discusses load balancing issues. Section 5 presents and discusses the experiments and results. Finally, Section 6 gives the conclusion and future work.

2. Related Work

Compared to topic-based systems (e.g. ISIS [2] and IBus [8]), content-based pub/sub systems are preferable because they give subscribers the flexibility to specify more complex interests with a set of predicates over the entire content of the publication. Events whose content satisfies all the specified predicates are delivered to the corresponding subscribers. As a result, subscriptions are more expressive but the system is harder to implement. Many content-based pub/sub systems are based on a spanning tree of all brokers (e.g. SIENA [3], Gryphon [1]). However, such a spanning tree is not feasible when the system involves a large number of brokers that join and leave the system dynamically. In addition, the overhead on brokers is so high that it may limit system scalability and impose uneven load among nodes.
Structured P2P systems, such as Chord [14], Pastry [12], Tapestry [19], and CAN [11], use distributed hash tables to construct the overlay network and provide efficient lookups. Many attempts have been made to design P2P-based pub/sub systems [16, 17, 15, 9, 4,, 13, 21, 7]. Scribe [13] and Bayeux [21] are topic-based pub/sub systems built on top of Pastry and Tapestry respectively; they cannot directly support content-based pub/sub services. Tam et al. [15] built a content-based pub/sub system from Scribe, but their system still restricts the expressiveness of subscriptions. Terpstra et al. [16] presented a content-based pub/sub system on top of Chord. To function correctly it must maintain invariants for its filters, which is inefficient when nodes join and depart frequently. Triantafillou et al. [17] also built a content-based pub/sub system on top of Chord. The main drawback of their system is that subscription installation and update involve a large number of nodes and messages, which is very expensive in a distributed environment. Reach [9] and HOMED [4] are content-based pub/sub systems built on top of a P2P overlay that maintains high-level semantic relationships. They have a load balancing problem, since unevenly distributed subscriptions cause unevenly distributed nodes in the overlay ID space. Meghdoot [7] is based on CAN. Considering the skewed distributions of real applications, it addresses load balancing by zone splitting and replication. Its main limitation is that the overlay's dimension is proportional to the number of event attributes, which limits scalability. Our earlier work Ferry [2] presented an architecture for content-based pub/sub services built on top of Chord. It exploits the trees embedded in the DHT structure to deliver events, so that the cost of constructing and maintaining dissemination trees is eliminated.
However, Ferry can only use a small set of RP nodes, which raises a serious scalability concern for large-scale networks. The attribute partitioning scheme [18] cannot effectively alleviate this problem, given the very limited number of RP nodes it can introduce and the high overhead it adds to subscription installation and update. In this paper, we propose an enhanced solution, Eferry, to pursue the expected scalability and uniform load distribution. Other contributions of this paper include an optimization of the event delivery mechanism of the earlier Ferry design, and a cache & delay send algorithm that further aggregates event messages and reduces bandwidth usage.

3. System Design

In this section, we present the design of Eferry. First, we give a brief description of the content-based pub/sub model, as well as the representation of events and subscriptions. Next, we describe the subscription installation and event delivery mechanisms mentioned in the introduction.

3.1. Pub/Sub Model in Eferry

To be coherent with Ferry, the pub/sub model is still based on the pub/sub schema proposed by Fabret et al. [5], which is defined as $S = \{A_1, A_2, \ldots, A_n\}$, where each $A_i$ represents an attribute. In this schema, an attribute consists of a name, a type and a domain, and can be described by a tuple [name : type, min, max]. An event is a set of equalities on attributes in schema $S$, e.g. $e = \{\ldots, A_i = c_i, \ldots\}$, where $A_i \in S$ and $c_i$ is a domain value of $A_i$. A subscription is a conjunction of predicates, each of which specifies a constant value or a range for an attribute using common operators ($=$, $>$, $\geq$, $<$, $\leq$, etc.). For a string attribute, the predicate can also specify a regular expression with a pattern-matching operator. A subscription example is $s = \{(A_i = v_i) \wedge (v_{j1} < A_j \leq v_{j2}) \wedge (A_k \text{ matches } /(dog|cat|bird)[a\text{-}Z0\text{-}9]/i)\}$. An event e matches a subscription s if and only if each predicate of s is satisfied by the value of the corresponding attribute contained in e.

In this paper, we introduce two new definitions, Attribute Set and Attribute Vector, which are important in Eferry's implementation. The attribute set attr_set of an event e (or subscription s) is the set containing the names of all attributes that appear in e or s. An attribute vector attr_vec is a lexicographically ordered list of the elements in attr_set.

3.2. Subscription Installation

In a pub/sub system, subscription installation deals with storing subscriptions efficiently. For this purpose, Eferry provides an efficient, attribute-vector based subscription installation algorithm. In this algorithm, each subscription has a unique subid composed of the subscriber's nodeid and an internalid (footnote 1). Given a subscription s, an attribute vector attr_vec is first extracted from s, then a key k is produced by hashing attr_vec. Subscription s, together with its subid, is then stored on the node that is the immediate successor of k. Algorithm 1 briefly describes the procedure of subscription installation.

Algorithm 1: subscribe(Subscription s)
  internalid ← generate an internal ID for this subscription
  register internalid and s in the local repository
  subid ← (nodeid, internalid)
  attr_set ← extract attribute names from s
  attr_vec ← sort elements in attr_set into lexicographic order
  k ← hash(attr_vec)
  R ← lookup(k)
  R.sub_register(subid, s)   {remote procedure call on node R}

The consistent hash function and the underlying DHT protocol ensure that keys are evenly distributed among different peers.
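The key derivation in Algorithm 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the paper does not say how a subscription is represented, which hash function is used, or how the vector is serialized, so the `(name, operator, value)` tuples, SHA-1, the `"|"` separator, and the helper names `attr_vector`, `vector_key` and `successor` are all hypothetical choices. `successor` stands in for the DHT's `lookup()` on a locally known list of node IDs.

```python
import hashlib

M = 160  # assumed identifier length in bits (SHA-1 sized, as is common for Chord)


def attr_vector(subscription):
    """Return the attribute names of a subscription in lexicographic order.

    A subscription is assumed to be a list of (name, operator, value) tuples.
    """
    return sorted({name for name, _op, _value in subscription})


def vector_key(attr_vec, m=M):
    """Hash an attribute vector to an m-bit key (SHA-1 is an illustrative choice)."""
    digest = hashlib.sha1("|".join(attr_vec).encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << m)


def successor(key, node_ids):
    """Return the first node ID equal to or following key on the ring."""
    ring = sorted(node_ids)
    for nid in ring:
        if nid >= key:
            return nid
    return ring[0]  # wrap around the ring
```

Two subscriptions that constrain the same attributes, in any order and with any values, produce the same attribute vector and therefore the same key, so they are stored on the same RP node.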
Compared to the schemes that hash only one attribute (RndRP & PredRP [2]), our attribute-vector based algorithm has two distinct advantages: (1) Since the attribute vector is hashed, an adequate number of RP nodes can be formed: at most $2^n - 1$, one per non-empty attribute subset, given a schema with n attributes. The load (storing, matching, delivering) per RP node is significantly decreased. (2) Since one RP node stores subscriptions with the same attribute set, subscription management becomes more convenient. Local optimizations and index mechanisms can therefore be easily implemented to speed up event matching. However, it should be pointed out that when the pub/sub schema has few attributes, the attribute-vector based scheme cannot significantly improve scalability (since n is small). We have developed a novel load balancing mechanism, called ID Space Partitioning (discussed in section 4.1), to address this problem.

Footnote 1: internalids are positive integers used to identify different subscriptions of the same subscriber. Numbers that have not been used for a long time can be reused.

In Algorithm 1, a subscriber invokes the hash() and lookup() functions to get the RP node R. Next, it remotely invokes the procedure sub_register on node R to register its subscription s with the related subid. It takes O(log N) overlay hops for lookup() to locate the RP node, where N is the number of nodes in the system.

In Eferry, an RP node organizes all subscriptions stored on it into m buckets based on the subscribers' nodeids, where m is the number of bits of the ID space. Subscriptions with $subid.nodeid \in [k + 2^{i-1}, k + 2^i)$ are put into bucket[i] (all arithmetic is modulo $2^m$), where k is the hash value of the attribute vector and $1 \leq i \leq m$. Each bucket maintains a summary filter [18], and only the events accepted by the filter continue their matching process. Moreover, indexes are created and maintained in a bucket to facilitate event matching. The details of indexing are not discussed in this paper.
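As a small worked illustration of the bucket rule, the sketch below computes the bucket index for a subscriber from the clockwise distance between the key k and the subscriber's nodeid. The function name `bucket_index` is hypothetical. For a positive integer d, $\lfloor \log_2 d \rfloor + 1$ equals the number of bits of d, so Python's `int.bit_length()` is used instead of floating-point logarithms. The stated ranges do not cover a subscriber whose nodeid equals k exactly, so the sketch rejects that case rather than guessing how it is handled.

```python
def bucket_index(subscriber_id, k, m):
    """Return i such that subscriber_id lies in [k + 2**(i-1), k + 2**i) mod 2**m."""
    distance = (subscriber_id - k) % (1 << m)
    if distance == 0:
        # Not covered by the bucket ranges given in the text (assumption: undefined).
        raise ValueError("subscriber ID equals the key k")
    return distance.bit_length()  # same value as floor(log2(distance)) + 1
```

With m = 4 and k = 3, subscribers 4, 5, 6, 11 and 2 fall into buckets 1, 2, 2, 4 and 4 respectively: bucket[m] alone covers the far half of the ring, which is why bucket sizes double with i.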
Algorithm 2 outlines the subscription registration process on an RP node. Note that more than one attribute vector may be stored on the same RP node. In this case, the vectors are managed individually, with the RP node acting as several virtual nodes.

Algorithm 2: sub_register(SubID sid, Subscription s)
Require: bucket[1..m] {store subscriptions in a bucket based on the subscriber's nodeid}
  k ← hash the attribute vector extracted from s
  i ← $\lfloor \log_2((sid.nodeid - k) \bmod 2^m) \rfloor + 1$
  put (sid, s) in bucket[i]
  update summary filter of bucket[i]
  update other related data structures

Although the consistent hash function can guarantee a uniform distribution of attribute vectors among peers, the load on different RP nodes may still not be well balanced because real-world data are skewed. The RP nodes of hot attribute vectors tend to be unduly loaded. A dynamic load balancing mechanism, called Dynamic ID Space Split-Merge, is presented in section 4.3 to deal with this issue.

In Eferry, a subscriber can unregister or renew its subscriptions by sending a simple message containing only subids to the corresponding RP nodes. The message does not need to include any content of the subscriptions, which reduces bandwidth usage.

3.3. Event Publication and Delivery

When an event is generated, the pub/sub system finds all matching subscriptions and delivers the event to the corresponding subscribers. Our earlier work [2] presented an efficient event delivery algorithm that aggregates and delivers event messages along DHT links. In this paper, we give an improved event publication and delivery mechanism. First, by treating RP nodes as special subscribers, event publication is combined into the event delivery process. Second, a cache&delay send mechanism is introduced to further aggregate event messages within a short time period.

Algorithm 3: publish(Event e)
  internalid ← generate an internal publish ID
  pubid ← (this node's ID, internalid)
  subidlist ← { }
  attr_set ← extract attribute set of e
  for each non-empty subset a_s ⊆ attr_set do
    a_v ← sort a_s into a lexicographically ordered vector
    k ← hash(a_v)
    sid ← (k, 0)
    subidlist.pushback(sid)
  end for
  Message M ← (pubid, e, subidlist)
  route_message(M)

Each publication has a pubid, composed of the publisher's nodeid and an internalid (footnote 2). The event publication procedure is as follows. Given an event e, the attribute set attr_set is first extracted from e. For each non-empty subset of attr_set (including attr_set itself), an attribute vector is produced and a key k is generated by hashing the attribute vector. A special subid is generated by using k as the subscriber's nodeid and 0 as the internalid. In this way an initial subidlist pointing to the related RP nodes is created on the publishing node. Finally, an event message, composed of pubid, e, and subidlist, is sent to the route_message module for processing and routing. Algorithm 3 outlines this procedure.

Note that if the pub/sub schema has many attributes, so may the event, and a long initial subidlist is generated, which makes event publication inefficient. The Attributes Grouping mechanism discussed in section 4.2 addresses this problem.

The event publication process above does nothing but generate the subidlist, in order to initialize event message processing and delivery. The route_message process can be divided into two phases. The first phase is event processing and matching. The event and its subidlist are first extracted from the message.
If there are subids targeting the current node, the event is either delivered to the local application/user (when the current node is a subscriber of the event) or matched against the subscriptions stored on this node (when the current node is an RP node related to this event). Event matching returns a list of matched subids, which is then merged with the remainder of the old subidlist (from which all subids targeting the current node have been removed) to generate a new subidlist for event delivery.

The second phase deals with event delivery and is inherited from Ferry. By exploiting DHT links, the subidlist is divided into several subidlists based on the target node IDs. All subids whose target nodes share a common DHT link are put into the same list, according to Chord's routing protocol. The message carrying each subid list is then delivered through the corresponding DHT link. This mechanism efficiently aggregates event messages and reduces network bandwidth usage.

Footnote 2: pubids are used to identify events in the system, so the internalid in a pubid can be reused after a sufficient time period.

Algorithm 4: route_message(Message M)
Require: nid is this node's ID
Require: event_match(e) matches event e with subscriptions stored on the current node and returns the matched subidlist
 1: pid ← extract pubid from M
 2: e ← extract event from M
 3: i_subidlist ← extract subidlist from M
 4: o_subidlist ← {} {initialize to empty}
 5: while i_subidlist is not empty do
 6:   sid ← pop a subid from i_subidlist
 7:   if sid.internalid = 0 && successor(sid.nodeid) = nid then
 8:     matchidlist ← event_match(e)
 9:     i_subidlist ← i_subidlist + matchidlist
10:   else if sid.internalid ≠ 0 && sid.nodeid = nid then
11:     deliver e to local application
12:   else
13:     push sid to o_subidlist
14:   end if
15: end while
16:
17: if event e is not in the event buffer then
18:   allocate buffer for event e, identified by pid
19:   init RT_entry_sidlist[1..k] for event e, corresponding to Chord's k neighbor nodes in the routing table
20:   set up timer to call deliver_event(pid)
21: end if
22:
23: for each sid ∈ o_subidlist do
24:   find neighbor node N_j whose node ID is equal to or immediately precedes sid.nodeid
25:   put sid into RT_entry_sidlist[j]
26: end for

Eferry enhances the delivery scheme by introducing a novel cache&delay send mechanism (Algorithm 4, lines 17-21), which caches incoming messages for a short time in order to further aggregate messages carrying the same event but different subidlists. The total number of messages is therefore reduced, at the cost of a slight increase in average event delivery latency.

Algorithm 5: deliver_event(PubID pid)
  locate event e in the event buffer by pid
  for i = 1 to k do
    if RT_entry_sidlist[i] is not empty then
      Message M ← (pid, e, RT_entry_sidlist[i])
      N_i.route_message(M)
    end if
  end for
  remove the event with pubid = pid from the event buffer

Algorithms 4 and 5 outline the procedure of event message processing and routing. Two distinct features of Eferry's event publication and delivery algorithms are: (1) By treating RP nodes as special subscribers, the event publication process is integrated with the event delivery process, which suits cases with a large number of RP nodes. (2) The cache&delay send mechanism reduces bandwidth usage by further aggregating messages.

4. Load Balancing

Load balancing is an important issue in distributed systems. As discussed in sections 3.2 and 3.3, Eferry needs to solve the following load balancing problems:

(1) If the pub/sub schema has few attributes, the system has only a small number of RP nodes, which are easily overloaded; this limits system scalability.
(2) If the pub/sub schema has too many attributes, the system has a large number of underloaded RP nodes, which makes event publication inefficient.
(3) Because real-world data are skewed, the load on different RP nodes is not well balanced. The RP nodes with hot attribute vectors may be unduly loaded.

In this section, three load balancing schemes, called ID Space Partitioning, Attributes Grouping, and Dynamic ID Space Split-Merge, are proposed to address these problems respectively.

4.1. ID Space Partitioning

Let m be the number of bits in the key/node identifiers. The ID space can be partitioned into $2^n$ parts by regarding the highest n bits of the identifier as the partition number. When a node registers a subscription s in the system, a key k is first generated by hashing the attribute vector extracted from s. Then a new key $k'$ is calculated as (arithmetic is modulo $2^m$):

$$x = (2^n - 1) \ll (m - n)$$
$$k' = \underbrace{(\text{subscriber's nodeid} \mathbin{\&} x)}_{\text{highest } n \text{ bits of subscriber's nodeid}} + \underbrace{(k \mathbin{\&} \lnot x)}_{\text{lowest } m-n \text{ bits of } k}$$

The subscription is installed on the node that is the successor of $k'$ along the Chord ring. When an event is published, for each hashed key k, $2^n$ keys are generated by substituting the highest n bits with each partition number $p \in [0 .. 2^n - 1]$. With this mechanism, the number of RP nodes is expected to be augmented by a factor of $2^n$. However, the value of n should be chosen carefully, since excessive RP nodes make event publication inefficient.

4.2. Attributes Grouping

Too many attributes in the pub/sub schema introduce excessive, underloaded RP nodes. By grouping the relatively unpopular attributes, a large number of underloaded RP nodes can be merged into a few nodes. For example, consider a pub/sub schema $S = \{A_1, A_2, \ldots, A_{20}\}$ with 20 attributes. By default, an excessive number of RP nodes, up to $2^{20} - 1$, might be introduced to the system. If the unpopular attributes are grouped together, say $S = \{A_1, A_2, \ldots, A_6, G_1(A_7, \ldots, A_{14}), G_2(A_{15}, \ldots, A_{20})\}$, the attributes in the same group are replaced by a compound attribute in the attribute vector. As a result, the redundant underloaded RP nodes are merged, and there are at most $2^8 - 1$ RP nodes in the system.

4.3. Dynamic ID Space Split-Merge

The ID Space Partitioning and Attributes Grouping schemes ensure that the pub/sub system has an appropriate number of RP nodes. However, real-world data tend to be skewed, so RP nodes holding popular (hot) attribute vectors may be overloaded. In this subsection, a dynamic load balancing approach is proposed to address this problem.

[Figure 1. ID Space Split Merge]

As illustrated in figure 1, given an attribute vector attr_vec, all related subscriptions are stored on node N, the successor node of key k = hash(attr_vec) on the Chord ring. When attr_vec becomes hot, more and more subscriptions are stored on N, which overloads N. A load balancing mechanism is needed to offload part of the subscriptions on node N to other nodes. Recall that subscriptions on RP nodes are stored in m buckets according to the subscriber's node ID, where m is the number of bits of the ID space. Subscriptions with the subscriber's nodeid in $[k + 2^{i-1}, k + 2^i)$ are stored in bucket[i]. By moving the subscriptions in bucket[m] to node A, which is the successor node of key $k + 2^{m-1}$, the ID space is divided into two halves: subscriptions with the subscriber's node ID in $[k, k + 2^{m-1})$ stay on node N, while subscriptions with the subscriber's node ID in $[k + 2^{m-1}, k)$ are transferred to node A. The load on node N is therefore significantly reduced. Here node A is an auxiliary node to node N.
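The two ID-space computations introduced so far, the partitioned key of ID Space Partitioning and the first split step of Dynamic ID Space Split-Merge, can be sketched as follows. This is a hedged illustration, not the authors' code: the function names are hypothetical, the missing operator in the definition of x is read as a left shift, and $\lnot x$ is applied as an explicit low-bit mask because Python integers are unbounded.

```python
def partitioned_key(subscriber_id, k, m, n):
    """Combine the top n bits of the subscriber ID with the low m - n bits of k."""
    x = ((1 << n) - 1) << (m - n)        # mask selecting the highest n bits
    low_mask = (1 << (m - n)) - 1        # "not x", restricted to m bits
    return (subscriber_id & x) | (k & low_mask)


def publication_keys(k, m, n):
    """Return the 2**n keys an event has to reach for one hashed key k."""
    low = k & ((1 << (m - n)) - 1)
    return [(p << (m - n)) | low for p in range(1 << n)]


def split_target(subscriber_id, k, m):
    """After one split, return the key whose successor stores this subscriber.

    Subscribers in [k, k + 2**(m-1)) stay with key k (node N); the rest move
    to key k + 2**(m-1) (auxiliary node A). All arithmetic is modulo 2**m.
    """
    half = 1 << (m - 1)
    distance = (subscriber_id - k) % (1 << m)
    return k if distance < half else (k + half) % (1 << m)
```

For example, with m = 8, n = 2, k = 78 and a subscriber ID of 181, the partitioned key is 142, which is one of the four keys (14, 78, 142, 206) that a publisher generates for k. The trade-off stated in the text is visible here: each extra partition bit doubles the number of keys per attribute vector that every publication must address.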
Any new subscription whose subscriber node ID lies in $[k + 2^{m-1}, k)$ is also forwarded to node A. When an event matches the summary filter of bucket[m], N puts a special subid $(k + 2^{m-1}, 0)$ into the matched subidlist, so that A receives this event in the next hop and matches it against the subscriptions stored on it. This splitting procedure can be invoked again, as illustrated in figure 1, if N and A are still overloaded. Splitting can also facilitate event delivery, because the original long matched subidlist (for subscribers in $[k + 2^{m-1}, k)$) is replaced by a single subid $(k + 2^{m-1}, 0)$ in the event message sent from N. Note that subscription installation is slightly affected by the additional hops introduced.

ID space merging is the reverse procedure of splitting. The ID space is merged when the load on these nodes falls to a low threshold. For example, in figure 1, once the load on the leaf auxiliary nodes B, C, and D is lower than a threshold, they periodically send a merging request to the corresponding upper layer node N or A. The upper layer nodes give a positive response if their load is also low. How to determine the thresholds for splitting and merging is outside the scope of this paper.

5. Experimental Evaluation

In this section, we evaluate the performance of Eferry through simulations. We start by describing the experimental setup and the metrics used for evaluation. Then the experimental results are presented.

5.1. Experimental Setup

We built Eferry on top of P2PSim (footnote 3), a discrete-event packet-level simulator. Currently, P2PSim can simulate many DHT protocols with various parameters. We use the Chord-PNS (proximity neighbor selection) protocol and its default parameter configuration. The network model in our simulation is derived from the King dataset (footnote 4), which includes the pairwise latencies of 24 DNS servers in the Internet measured by the King method [6]. The average RTT of the simulated network is 198ms.

The schema used in our simulation was derived from a stock quotes model proposed in Meghdoot [7]. The schema is defined as follows:

S = {
  Symbol : STRING [aaa, zzzzz]
  High : FLOAT [,.]
  Low : FLOAT [,.]
  Open : FLOAT [,.]
  Close : FLOAT [,.]
  Volume : INTEGER [, 3]
  Date : STRING [2/Jan/98, 31/Dec/4]
}

Each attribute in the schema is assigned a popularity degree, which indicates the probability with which the attribute appears in a subscription. The popularity degrees used in our simulations are: {p(symbol) = 95%; p(high) = %; p(low) = %; p(open) = 45%; p(close) = 3%; p(volume) = %; p(date) = 5%}. We use stocks in the simulation. Subscriptions are generated based on the attributes' popularity degrees and some predefined templates, as proposed in [7]. Events are randomly generated, and the interarrival times of events are exponentially distributed with an average value of 12s.
A set of cost metrics is used to evaluate the performance of Eferry: (1) latency: the average time to deliver an event to all corresponding subscribers; (2) overlay hops: the average number of overlay hops needed to deliver an event to its subscribers; (3) overhead: the ratio of the number of intermediate nodes to the number of subscribers per event delivery; (4) bandwidth cost: the ratio of total bandwidth consumption to the number of nodes involved per event delivery. The message size in the simulation is derived from the following assumptions: 2 bytes for header, byte for events, 6 bytes for pubid, and 6 bytes for each subid. In addition to these common performance metrics, we also evaluate the load distribution among RP nodes.

5.2. Experimental Results

We evaluated the performance of Eferry through detailed simulations. Due to space limitations, we show only part of the results in this section.

We first present results for a 24-node network with inter-node latencies derived from 24 DNS servers. , subscriptions and 115, events are used in the simulation. The average number of subscribers per event is . Figure 2 illustrates the distribution of events with respect to latency, overlay hops, overhead, and bandwidth cost respectively.

[Figure 2. Distribution of events with respect to latency, overlay hops, overhead, and bandwidth cost. Panels: percentage of events versus latency (ms), overhead, overlay hops, and bandwidth (bytes/node).]

About 87% of events can be delivered to all corresponding subscribers within 2ms, and the average delivery latency per event is ms. 88% of events can be delivered via 3 to 4 overlay hops, with an average of 3.6 hops. The average bandwidth cost and overhead are .45 bytes/node and 1.23 per event respectively.
The results show that Eferry can efficiently deliver events to corresponding subscribers with small bandwidth cost and latency, which implies that although more RP nodes are introduced by our attribute-vector based scheme, the performance of the event delivery has not been negatively impacted. Figure 3 shows the performance of Eferry with various number of subscribers (per event). As the number of subscribers increases from % to % of the total number of nodes in the system, the overlay hops keep approximately constant around 3.59 and the average latency has a slight increase from ms to 1.12ms. The bandwidth cost has a reasonable increase from.45 to (bytes/node) and the overhead drops from 1.23 to.4. The results show

that Eferry is scalable to a large number of subscribers in the system.

Figure 3. Performance with various percentages of nodes as subscribers per event (24 nodes,, subscriptions). Panels: overhead, latency (ms), overlay hops, and bandwidth (bytes/node).

The performance of the cache&delay send mechanism is also evaluated. As shown in Figure 4, with ms delay, the average bandwidth cost decreases from .45 to when the number of subscribers per event is small (% as subscribers), and from to when the number of subscribers per event is large (% as subscribers), while the latency increases slightly, by about 8ms. The bandwidth cost is measured per node and per event, so the reductions shown here translate into a large reduction in the bandwidth consumption of the whole system. The results show that, given a small delay time, the cache&delay send mechanism can significantly reduce the bandwidth cost at the price of a slight increase in event delivery latency.

Figure 4. Effect of cache delay send mechanism. Axes: delay time (ms) against bandwidth cost (bytes/node) and latency (ms); series: bandwidth and latency, each at two subscriber percentages.

The load balancing schemes are studied in a large network of, nodes (derived from the 24-DNS server measurements) with, subscriptions. The load on an RP node is measured as the ratio of the subscriptions stored on it to the total number of subscriptions in the system. Figure 5 shows the load distribution on RP nodes under different load balancing schemes. The nodes are sorted in decreasing order of load, and only the first nodes are plotted. Clearly, the attribute-vector based pub/sub scheme introduces more RP nodes than the single-attribute based scheme in Ferry, so the average load per RP node is expected to be low. However, because the data is skewed, the load on these RP nodes is highly imbalanced. As depicted in Figure 5 (the solid line), the maximum load is about 15.34%, and there are also many underloaded nodes in the system.

Figure 5. Load distribution in the system. Axes: nodes ranked by load against percentage of subscriptions; series: Normal, Attributes Grouping, Grouping & ID Space Partition, Grouping & Dynamic ID Space Split.

Attributes Grouping (the unpopular attributes volume, close, and date are grouped) can merge the excess underloaded nodes into a few nodes, which makes event publication more efficient. As shown in Figure 5, with attributes grouping a large number of underloaded nodes (less than 1% load) are removed from the RP node set, and their load is transferred to the moderately loaded nodes (ranked 3 to 16). In addition, ID Space Partition can further decrease the load on each node by introducing more RP nodes. With a partition factor of 2, the maximum load decreases to 7.8%. Moreover, the load of RP nodes can be adjusted adaptively through Dynamic ID Space Split & Merge. As shown in Figure 5, the dynamic scheme combined with grouping achieves a well-balanced load distribution, with a threshold of subscriptions. By evaluating these load-balancing mechanisms with different partition factors and threshold values, our results show that appropriate combinations of the mechanisms help maintain an adaptive quantity of evenly loaded RP nodes.

Figure 6. Performance with various percentages of nodes as subscribers per event (, nodes,, subscriptions). Panels: latency (ms), overhead, overlay hops, and bandwidth (bytes/node).

Finally, we evaluate the performance of Eferry in the large network of, nodes, with various percentages
of nodes randomly chosen as subscribers. All three load balancing mechanisms are deployed. As shown in Figure 6, when the number of interested subscribers per event increases from % to % of the total number of nodes in the system, the number of overlay hops stays almost constant. The average latency increases slightly, from ms to ms. The bandwidth cost increases moderately, from to (bytes/node), while the overhead drops significantly, from 1.25 to .6. The results in Figures 5 and 6 show that Eferry is scalable to a large network size.

6. Conclusion and Future Work

In this paper we propose a novel approach to address the scalability problem in content-based pub/sub systems on top of DHT. The basic idea is to keep an adequate number of rendezvous point (RP) nodes in the system while maintaining an even load distribution among them. To achieve this goal, we have designed an attribute-vector based scheme and related load balancing mechanisms (ID Space Partitioning, Attributes Grouping, and Dynamic ID Space Split-Merge). Moreover, an optimized event delivery mechanism and a cache and delay algorithm are proposed to facilitate event publication and further aggregate event messages. The experimental results show that Eferry can efficiently distribute and balance load among a suitable number of RP nodes with very small overhead and latency. Moreover, Eferry can scale to a large number of subscribers and a large network size.

Currently, Eferry can deal with node joins, departures, and failures. However, the performance of Eferry under a high node churn rate has not been explored; this will be one of our future tasks. We also need to evaluate Eferry using real-world datasets, based on which more optimizations may be proposed.
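As an illustration of the load metric used in the evaluation (the ratio of subscriptions stored on an RP node to the total number of subscriptions in the system), the following minimal Python sketch computes the per-node load and ranks nodes in decreasing order of load, as plotted in Figure 5. The input format (a mapping from subscription ID to RP node), the function names, and the `threshold` cutoff are assumptions made for illustration; they are not taken from the Eferry implementation.

```python
from collections import Counter


def load_distribution(assignments):
    """Return (node, load) pairs sorted by decreasing load.

    `assignments` maps each subscription ID to the RP node that stores it
    (a hypothetical input format). The load of a node is the fraction of
    all subscriptions stored on it.
    """
    counts = Counter(assignments.values())
    total = sum(counts.values())
    if total == 0:
        return []
    # Sort by subscription count (descending), then node name for stable ties.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(node, n / total) for node, n in ranked]


def overloaded(ranked, threshold):
    """Return nodes whose load exceeds `threshold` (an illustrative cutoff)."""
    return [node for node, load in ranked if load > threshold]


if __name__ == "__main__":
    demo = {"s1": "rp-a", "s2": "rp-a", "s3": "rp-a", "s4": "rp-b", "s5": "rp-c"}
    ranked = load_distribution(demo)
    for node, load in ranked:
        print(f"{node}: {load:.0%}")
    print("overloaded:", overloaded(ranked, 0.5))
```

Under a skewed assignment such as the one in the demo, a single node dominates the ranking, which is the kind of imbalance that the grouping, partitioning, and split-merge mechanisms are meant to reduce.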
