Matching Algorithms for User Notification in Digital Libraries


Matching Algorithms for User Notification in Digital Libraries

H. Belhaj Frej 1, P. Rigaux 2 and N. Spyratos 1

1 LRI, Univ. Paris-Sud, 91400 Orsay
2 Lamsade, Univ. Paris-Dauphine, 75016 Paris

This work is conducted in the context of the KP-LAB (Knowledge Practices Laboratory) project, Work Package 5: "Middleware".

Abstract

We consider a publish/subscribe system for digital libraries which continuously evaluates queries over a large repository containing document descriptions. The subscriptions, the query expressions and the document descriptions all rely on a taxonomy, that is, a hierarchically organized set of keywords, or terms. The digital library supports insertion, update and removal of a document. Each of these operations is seen as an event that must be notified only to those users whose subscriptions match the document's description. The paper addresses the problem of efficiently supporting the notification process, and makes contributions in three directions: (a) definition of a formal model for the publish/subscribe process; (b) introduction of a semi-lattice structure on subscriptions allowing non-matching subscriptions to be filtered out; (c) experimental results that show the cost benefits obtained by our approach.

1 Introduction

The publish/subscribe interaction paradigm provides subscribers with the ability to express their interest in classes of events generated by publishers. A system that supports this paradigm must be able to find, for each incoming event e, the subscriptions that match e, in order to determine which subscribers should be notified. Many typical web applications can be seen as variants of this general framework, including auction sites, on-line rental offices, virtual bookshops, etc. They act as brokers which store only descriptions of the published items.

In the present paper we focus on digital libraries (DL) which maintain (in a repository) descriptions of documents and pointers to the documents' contents. In this context a publisher is an author who provides to the DL descriptions of his documents and ways to access their contents (e.g. their URIs), whereas a subscriber is a user willing to be informed of any event affecting a document that relates to his topics of interest. We consider a DL model with the following characteristics:

There is a taxonomy to which both the authors of documents and the subscribers of the library adhere; this taxonomy is just a set of keywords, or terms, structured as a tree. An example of a taxonomy is the well-known ACM Computing Classification System [1].

A document is represented in the DL repository by a description of its content together with an identifier (say, the document's URI) allowing access to the document's content; the description is just a set of terms from the taxonomy.

A query against the library is just a conjunction of terms from the taxonomy (i.e. a conjunctive query).

A user is represented by an identifier together with a subscription; a subscription is just a query defining (intensionally) the documents of interest to the user.

This simple model covers a wide range of DL organizations, document structures and document formats. In particular, the DL repository can be centralized or distributed without affecting our model. Here, what we call a centralized repository is one in which the

library stores not only the descriptions of documents but also their contents; and what we call a distributed repository is one in which the library stores only the descriptions of documents plus pointers to their contents, the contents themselves being stored at the local repositories of the authors.

Figure 1: A taxonomy (Programming has children Theory, Languages and Algorithms; Languages has child OOL, with children C++ and Java; Java has children JSP and JavaBeans; Algorithms has child Sort, with children MergeSort, QuickSort and BubbleSort)

A subscriber must be informed, or notified, whenever an event matching his subscription occurs at the DL. The problem addressed in this paper can be stated as follows: assuming a large number of subscriptions and/or a high rate of events, how to design a fast algorithm for finding those subscribers that should be notified when an event occurs at the DL. A considerable amount of work has been devoted in the recent past to designing such algorithms for large-scale systems [3]. These works apply mostly to relational databases, where subscriptions are seen as conjunctions of predicates (i.e. of relational variables). Their main common ideas can be summarized as follows: (i) if a predicate appears in many subscriptions, avoid its repeated evaluation, and (ii) try to filter out large sets of subscriptions by evaluating first the most selective predicates.

The present paper considers specifically the context of Digital Libraries and makes the following contributions.

1. we define a refinement relation over the set of subscriptions and provide simple algorithms for (a) testing whether a subscription refines another, and (b) computing the least upper bound (lub) of a set of subscriptions;

2. we propose a cost model and an incremental aggregation algorithm in order to maintain a structure of the set of subscriptions that minimizes the matching cost;

3. we provide a set of experimental results which show the gain obtained by our method.

The main idea behind our work is to exploit a refinement relation between subscriptions so as to avoid useless computations. Roughly speaking, if a subscription S2 refines a subscription S1, this means that the set of documents satisfying S2 is a subset of the set of documents satisfying S1. Therefore, if a matching test for S1 fails, there is no need to carry out the test for S2. This can be generalized to sets of subscriptions C = {S1, S2, ..., Sn} that we call clusters: if we can find an aggregate subscription S such that any event that matches some Si also matches S, then we make an initial test for S; if this initial test fails, then no further test is needed for the subscriptions in C. On the other hand, if the test for S succeeds then it will succeed for at least one subscription in C. However, in practice the set of events that match S is a superset of those that match the subscriptions in C. Our algorithm aims at minimizing this loss of precision by tightening the representation of the clusters.

In the following, Section 2 presents our model and Section 3 presents our algorithms. We present experimental results in Section 4 and discuss related work in Section 5. In Section 6 we offer some concluding remarks and discuss perspectives.

2 The Model

As we mentioned in the introduction, we assume a single taxonomy over which several basic concepts of a digital library are defined. To begin with, we denote the library taxonomy as (T, ≤), where T is a set of keywords, or terms, and ≤ is a subsumption relation over T, that is, a reflexive and transitive binary relation over T, which we assume to be a tree. Given two terms s and t, if s ≤ t then we say that s is subsumed by t, or that t subsumes s. We represent the taxonomy as a tree in which the nodes are the terms of T and where there is an arrow from term t to term s iff t subsumes s. Figure 1 shows an example of a taxonomy, in which the term Languages subsumes the term OOL, the term OOL subsumes the terms C++ and Java, and so on.

In order to make a document sharable by the community of library users, its author must register the document at the library. This is done by providing a description of the document's content together with an identifier (say, the document's URI) allowing access to the document's content. As already mentioned in the introduction, the description is just a set of terms from the taxonomy.

Definition 1 (Description). Given a taxonomy (T, ≤) we call description in T any set of terms from T.

For example, if the document contains the quick sort algorithm written in Java then the terms QuickSort and Java can be used to describe its content. In this case the set of terms {QuickSort, Java} is the description of the document. During document registration, what is actually stored at the DL repository is the document identifier and the document's description. Conceptually, the repository can be thought of as a binary relation between terms and document identifiers (i.e. as a set of pairs <term, doc-id>) defined as follows: during registration of a document with identifier d, the library stores a pair (t, d) for each term t appearing in the description of d. Figure 2 shows an example of a repository over the taxonomy. The dotted lines indicate the pairs (t, d) of the repository, relating terms with document identifiers.

A description can be redundant if some of the terms it contains are subsumed by other terms. For example, the description {QuickSort, Java, Sort} is redundant, as QuickSort is subsumed by Sort. If we remove either Sort or QuickSort then we obtain a non-redundant description: either {QuickSort, Java} or {Sort, Java}, respectively. Redundant descriptions are undesirable as they can lead to redundant computations during subscription evaluation. We shall therefore limit our attention to non-redundant descriptions. More generally, we shall limit our attention to reduced sets of terms, defined as follows:

Definition 2 (Reduced Set of Terms). A set of terms D from T is called reduced if for any two distinct terms s and t in D, s ≰ t and t ≰ s.

Following the above definition one can reduce a description in (at least) two ways: removing all but the minimal terms, or removing all but the maximal terms. In this work we adopt the first approach, that is, we reduce a description by removing all but its minimal terms.

Definition 3 (Reduction). Given a description D in T we call reduction of D, denoted reduce(D), the set of minimal terms in D with respect to the subsumption ≤.

The reason for our choice is that by removing all but minimal terms we obtain a more accurate description of the document. This should be clear from our previous example, where the description {QuickSort, Java} is more accurate than {Sort, Java}.
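To make these definitions concrete, here is a small Python sketch (ours, not part of the paper: the PARENT map encodes the taxonomy of Figure 1, and the helper names are hypothetical) that implements term subsumption and the reduction of a description:

    # Taxonomy of Figure 1 as a child -> parent map (the root Programming has no parent).
    PARENT = {
        "Theory": "Programming", "Languages": "Programming", "Algorithms": "Programming",
        "OOL": "Languages", "Sort": "Algorithms",
        "C++": "OOL", "Java": "OOL", "JSP": "Java", "JavaBeans": "Java",
        "MergeSort": "Sort", "QuickSort": "Sort", "BubbleSort": "Sort",
    }

    def subsumes(t, s):
        """True iff t subsumes s, i.e. s <= t (t is s itself or an ancestor of s)."""
        while s is not None:
            if s == t:
                return True
            s = PARENT.get(s)          # climb towards the root
        return False

    def reduce_description(terms):
        """reduce(D) of Definition 3: keep only the minimal terms of a description."""
        terms = set(terms)
        return {s for s in terms
                if not any(s != t and subsumes(s, t) for t in terms)}

    # The redundant description of the example becomes non-redundant:
    assert reduce_description({"QuickSort", "Java", "Sort"}) == {"QuickSort", "Java"}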
Henceforth, when using the term description we shall always mean reduced description.

A query in our model is just a conjunction of terms from the taxonomy, and its answer is a set of documents, as defined below. In the following definition the symbol tail(t) stands for the set of all terms in the taxonomy T strictly subsumed by t, that is, tail(t) = {s | s ≤ t and s ≠ t}, and R stands for the repository of the digital library. Queries are defined over T and they are answered based on R.

Definition 4 (Query over T and its answer). A query q over T is either a single term or a conjunction of terms from T. Its answer, denoted by ans(q), is a set of documents defined as follows:

Figure 2: A repository (the dotted lines relate terms of the taxonomy of Figure 1 to the document identifiers d1, ..., d6)

Case 1: q is a single term t from T, i.e., q = t.
ans(q) = {d | (t, d) ∈ R} if tail(t) = ∅, and ans(q) = ∪ {ans(s) | s ∈ tail(t)} otherwise.

Case 2: q is a conjunction of terms, i.e., q = t1 ∧ t2 ∧ ... ∧ tn.
ans(q) = ans(t1) ∩ ans(t2) ∩ ... ∩ ans(tn)

As an example, consider the query q = C++ ∧ Sort. Referring to Figure 2 and applying the above definition, we find ans(q) = {d5, d6} ∩ {d3, d4, d6} = {d6}.

A subscription in our model is just a query describing (intensionally) the set of documents of interest to a user. In practice, a user can define his subscription by selecting terms from the taxonomy. The conjunction of the selected terms is the user's subscription.

Definition 5 (Subscription). Given a taxonomy (T, ≤) we call subscription in T any reduced set of terms from T.

Henceforth, we shall think of a subscription either as a set of terms (e.g., {C++, Java}) or as a query (e.g., C++ ∧ Java). The answer to this query at a given moment in time represents all documents of interest to the user that the library contains at that particular moment in time. However, the answer changes over time, as new documents are inserted in the library, or existing documents are modified or deleted. These changes that occur at the library over time are precisely what we call events:

Definition 6 (Event). In a digital library, we call event the insertion, modification or removal of a document. An event is represented by the description of the document being inserted, modified, or removed.

When an event occurs the system must inform, or notify, each user whose subscription is matched by the event.

Definition 7 (Matching). Let e be an event and q a subscription. We say that e matches q if one of the following holds:

1. A document d is removed and d ∈ ans(q) before the event occurs.

2. A document d is inserted and d ∈ ans(q) after the event has occurred.

The case of document modification is treated as a deletion followed by an insertion.

Clearly, when an event e occurs, the system must decide which users are to be notified. A naive approach is to examine every subscription q and test whether e matches q. However, if the set of subscriptions is large, and/or the rate of events is high, the system might quickly become overwhelmed. In what follows, we shall refer to this simplistic approach as Naive.
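Continuing the sketch above (the repository contents below are only an illustrative assumption chosen to be consistent with the example ans(C++ ∧ Sort) = {d6}; the real Figure 2 may assign terms to documents differently), Definition 4 and the matching test can be written as:

    # Repository R as a set of (term, doc-id) pairs.
    R = {("C++", "d5"), ("C++", "d6"),
         ("MergeSort", "d3"), ("QuickSort", "d4"), ("BubbleSort", "d6")}

    def tail(t):
        """Terms strictly subsumed by t."""
        return {s for s in PARENT if s != t and subsumes(t, s)}

    def ans_term(t):
        below = tail(t)
        if not below:                                          # Case 1, t is a leaf
            return {d for (u, d) in R if u == t}
        return set().union(*(ans_term(s) for s in below))      # union over tail(t)

    def ans(query):
        """Case 2: a query is a set of terms, interpreted conjunctively."""
        return set.intersection(*(ans_term(t) for t in query))

    def matches(event_description, subscription):
        """Event e matches q iff every term of q subsumes some term of e's description."""
        return all(any(subsumes(t, s) for s in event_description)
                   for t in subscription)

    assert ans({"C++", "Sort"}) == {"d6"}
    assert matches({"QuickSort", "Java"}, {"OOL", "Sort"})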

In this paper, we introduce a more efficient approach based on the observation that testing whether an event matches a subscription is basically a set membership test (i.e. testing whether a document belongs to a given set of documents, which happens to be the answer to a query). The idea that we exploit here is the following: if we have to perform the membership test for every set in a collection of sets, we can save computations by starting with the maximal sets first (maximality with respect to set inclusion). Indeed, if a document does not belong to a maximal set then we don't need to test membership for any of its subsets. In order to implement this idea, we first need to define a notion of refinement between subscriptions. In fact, we need a definition that translates the following intuition: if subscription q1 refines subscription q2 then every event that matches q1 also matches q2.

Definition 8 (Refinement Relation). Let q1 and q2 be two subscriptions. We say that q1 is finer than q2, denoted q1 ⪯ q2, iff for every term t2 ∈ q2 there is a term t1 ∈ q1 such that t1 ≤ t2.

In other words, q1 is finer than q2 if every term of q2 subsumes some term of q1. For example, the subscription q1 = {QuickSort, Java, BubbleSort} is finer than q2 = {Sort, OOL}, whereas q2 is not finer than q1. If the sets of terms q1 and q2 are reduced sets then the refinement relation just defined becomes a partial order. In fact, it can be shown that: (i) there exists a least upper bound for any subset subS of the set S of all subscriptions; (ii) the set S is an upper semi-lattice.

Proposition 1. Let subS be a subset of the set S of all subscriptions. There exists a least upper bound for subS, denoted lub(subS).

Proposition 2 ([5]). (S, ⪯) is an upper semi-lattice.

Proof. Let subS = {S1, ..., Sn} be a set of subscriptions. Let U be the set of all subscriptions S such that Si ⪯ S, i = 1, 2, ..., n. We show that U has a unique minimal element, which we shall denote as lub(subS). Let P = S1 × S2 × ... × Sn be the Cartesian product of the subscriptions in subS, and suppose that there are k tuples in this product, say P = {L1, L2, ..., Lk}. Let A = {lub≤(L1), lub≤(L2), ..., lub≤(Lk)}, where lub≤(Li) denotes the least upper bound of the terms in Li with respect to ≤. As (T, ≤) is a tree, this least upper bound exists for all i = 1, 2, ..., k. Now, let R be the reduction of A, i.e., R = reduce(A). R is the smallest element of U. Indeed, it follows from the definition of R that Si ⪯ R, for i = 1, 2, ..., n. Moreover, let S be any subscription in U, and let t be a term in S. It follows from the definition of U that there is a term vi in each subscription Si such that vi ≤ t. Consider now the tuple v = <v1, v2, ..., vn>. By the definition of least upper bound, lub≤(v) ≤ t; as lub≤(v) belongs to A, the reduction R contains a term r with r ≤ lub≤(v) ≤ t, and it follows that R ⪯ S, which completes the proof.

The proof suggests a simple algorithm for computing the lub of a set of subscriptions {S1, ..., Sn}:

1. Form the Cartesian product C of all subscriptions in {S1, ..., Sn};

2. Apply lub≤ to each tuple of terms in C to find the set L of lubs;

3. Remove all but the minimal terms in L.

The output of this algorithm is a reduced set of terms which is the lub of {S1, ..., Sn}. Consider the subscriptions S1 = {Java, C++, QuickSort} and S2 = {JSP, MergeSort}. In order to find their lub we proceed as follows:

1. Cartesian product C of S1 and S2: C = S1 × S2 = {{Java, JSP}, {Java, MergeSort}, {C++, JSP}, {C++, MergeSort}, {QuickSort, JSP}, {QuickSort, MergeSort}}

2. L = {Java, Programming, OOL, Sort}

3. Reduce(L) = {Java, Sort}
Note that this direct approach to finding the lub of a set of subscriptions runs in O(|S1| × |S2| × ... × |Sn|). We shall see a linear algorithm in the next section.

Intuitively, refining a subscription can be seen as imposing tighter matching constraints. There exist two possible ways of doing so: by simply adding some terms to a subscription, or by replacing a term in the subscription by one of its descendants in the taxonomy.

For example, referring to the subscriptions shown in Figure 3, q1 = OOL ∧ Sort is refined by the subscription q2 = Theory ∧ OOL ∧ Sort, since any event that matches q2 also matches q1. Generally, if q1 and q2 are subscriptions, and q1 is a subset of q2, then q2 is a refinement of q1. A subscription can also be refined through term subsumption. Thus q1 is also refined by q3 = Java ∧ Sort, because Java ≤ OOL. Note also that adding to a subscription q a term that subsumes a term already appearing in q yields an equivalent subscription. For example, q1 = OOL ∧ Sort is equivalent to OOL ∧ Sort ∧ Languages. The refinement relation defined earlier also holds on equivalence classes of subscriptions (reduced subscriptions being just representatives of equivalence classes). Note that a subscription can be the refinement of several other subscriptions. For example, q4 is a refinement of both q2 and q3 (and, by transitivity, it is also a refinement of q1). Figure 3 shows the refinement relation in the form of a graph.

Figure 3: The refinement relation (q1 = OOL ∧ Sort is refined by q2 = Theory ∧ OOL ∧ Sort and by q3 = Java ∧ Sort, which are in turn both refined by q4 = Theory ∧ Java ∧ QuickSort)

Finally, if q is a subscription and d is a document with description D, then d ∈ ans(q) iff D ⪯ q. Referring to the graph of Fig. 3, we can now explain how we exploit the refinement relation in our approach: if an event e does not match the subscription q1, then it cannot match any of the subscriptions q2, q3 or q4. Therefore the idea is to first evaluate e with respect to q1 (which is the most general subscription): if the matching is successful, then the evaluation continues with respect to both q2 and q3 (otherwise evaluation stops); and if the evaluation is successful for either q2 or q3, then evaluation continues with respect to q4 (otherwise evaluation stops). Assuming that, in general, a large fraction of all events will fail to match any particular subscription, this strategy is expected to save a significant number of computations. Of course, the failure rate of event matching depends on the filtering rate of each particular subscription, a concept that we shall see shortly.

The set of user-submitted subscriptions can be organized as a directed acyclic graph that we call the subscription graph. Its acyclicity follows from the fact that the refinement relation is a partial order (up to subscription equivalence). However, although the subscription graph is acyclic, it may have several roots (i.e. more than one maximal element). Henceforth, we shall assume that the subscription graph has a single root, by adding to it (if necessary) the lub of all maximal subscriptions. In principle, it would be possible to maintain the transitive reduction of the subscription graph, but its construction is too costly (the computation of the transitive reduction runs in O(n) for a graph of size n, for each node insertion [4]). However, it turns out that constructing the transitive reduction is not really necessary. In fact, it is sufficient to maintain one and only one path from the root to every node (the root being the most general subscription). This follows from the observation that if a path to a node q is successful (i.e., an event e matches all the predecessors of q along the path), then every path leading to q will be successful as well. As a consequence, it is sufficient to test matching along just one path. As this holds for every node in the subscription graph, it is sufficient to construct a spanning tree in order to be able to test matching for every node in the graph.
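A minimal sketch of Definition 8 and of the pruning just described (ours; it reuses the subsumes helper of the earlier sketch, and the event description is an arbitrary example):

    def refines(q1, q2):
        """q1 refines q2: every term of q2 subsumes some term of q1 (naive, quadratic test)."""
        return all(any(subsumes(t2, t1) for t1 in q1) for t2 in q2)

    q1 = {"OOL", "Sort"}
    q2 = {"Theory", "OOL", "Sort"}
    q3 = {"Java", "Sort"}
    q4 = {"Theory", "Java", "QuickSort"}
    assert refines(q2, q1) and refines(q3, q1) and refines(q4, q2) and refines(q4, q3)

    # Pruning along a path of the graph of Figure 3: one failed test against q1
    # is enough to discard q2, q3 and q4 as well.
    event = {"C++", "JavaBeans"}
    assert not refines(event, q1)      # hence no test is needed for q2, q3, q4

Section 3.1 replaces this quadratic term-by-term test by a linear, label-based one.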
Now, as there are in general many spanning trees, the question is how to choose the best one. To this end, we introduce the notion of filtering rate. The filtering rate of a subscription q, denoted σ(q), is the probability of an event matching q. It can be determined by various means, including a cost model that we present later on. So a best spanning tree will be one that optimizes the amount of filtering during a matching test. As an example, referring to the upper part of Figure 4, let us assume the following filtering rates for q1, q2 and q3:

σ(q1) = 9, σ(q2) = 2, σ(q3) = 8.

These filtering rates are shown on the edges of Figure 4, where every edge is labelled by the filtering rate of its source node. It is easy to check that there are only two spanning trees in the subscription graph, as shown in the bottom part of Figure 4. The spanning tree on the left is minimal, in the sense that only 2 of the matching tests against q4 will succeed, instead of 8 if the spanning tree on the right were used (in the following we call such a spanning tree a minimal spanning tree [5]).

Figure 4: Issues in subscription tree construction (top: the subscription graph of Figure 3 with its filtering rates; bottom: the two possible spanning trees, with q4 attached either to q2 (first choice) or to q3 (second choice))

From the point of view of our application, selecting the minimal spanning tree is tantamount to selecting, for each subscription q, the most filtering path leading to q. As usual with publish/subscribe applications, we assume that subscriptions may be added or removed dynamically. We must therefore construct and maintain the minimal spanning tree incrementally. However, incremental maintenance incurs a very significant cost, as in the worst case the whole tree must be reconstructed after a subscription is inserted, updated or deleted. In the next section, we propose an algorithm which provides an approximate solution to the problem. Then, we show through experiments that it still provides an effective reduction of the overall evaluation cost with respect to the trivial solution (i.e., the one that tests matching against every submitted subscription).

3 Computational issues

In what follows, we first present a linear algorithm for deciding whether a subscription is a refinement of another subscription, and for computing the least upper bound of a set of subscriptions. Next we describe the insertion algorithm. An important issue during the insertion process relates to how and when we can cluster subscriptions which are close to one another, and whether we must materialize the lub of a cluster and put it in the tree. Note that by doing so, we might introduce some artificial nodes which have not been subscribed by any user. This is justified because a lub, if it is close enough to the subscriptions that it covers, is a means to factorize computations that would otherwise be carried out independently. We present a simple cost model and show that adding lubs to the subscription tree is almost always beneficial. Finally we end the section with the notification algorithm.

3.1 Computing ⪯ and Lub

The evaluation of the relation ⪯ and the computation of the lub of a set of subscriptions are the basic operations involved in the maintenance of the subscription graph. Given two subscriptions S1 and S2, a naive implementation compares each term in S1 with each term in S2 and runs in O(|S1| × |S2|), both for ⪯ and for Lub. We present an optimized computation which uses an appropriate encoding of the terms of the taxonomy and avoids the Cartesian product of the naive solution. Its cost is linear in the size of the subscriptions.

Our encoding extends the labelling scheme presented in [2] and further investigated in [7]. In our labelling scheme the successors of each node are assumed to be linearly ordered (by their position, say from left to right), and a node in the taxonomy tree is identified by its position with respect to the parent node. The label label(t) of a term t is obtained by concatenating the label of the parent followed by the position of the term (with respect to the parent). For example, referring to Figure 5, if 1 is the label of the node Programming, then 1.1, 1.2 and 1.3 are respectively the labels of its child nodes Theory, Languages and Algorithms. This encoding defines a total lexicographic order <_l on the set of terms. Evaluating the subsumption relation t1 ≤ t2, using this encoding, reduces to checking whether label(t2) is a prefix of label(t1). The label of the least upper bound of two terms, lub≤(t1, t2), is the longest common prefix of label(t1) and label(t2). Recall that since (T, ≤) is a tree, this least upper bound always exists.

A subscription S is encoded as the list of the labels of its terms, sorted in lexicographic order. For example the subscription {Theory, Algorithms} is encoded [1.1, 1.3]. The order of the term labels in the labelling of a subscription helps reduce the number of computations required to evaluate the refinement relation S1 ⪯ S2, and also to compute the lub Lub(S1, S2), since merge-like algorithms can be applied. The correctness of these algorithms is based on the following properties:

Proposition 3. Let S1 = [t1_1, t1_2, ..., t1_n] and S2 = [t2_1, t2_2, ..., t2_m] be two (reduced) subscriptions, encoded in lexicographic order. Then:

1. Property 1: if t1_i ≤ t2_j for some i ≤ n and j ≤ m, then t1_{i+1} ≰ t2_k for all k < j.

2. Property 2: let i > 1 and j < m be such that t1_i <_l t2_j; then for all k < i and all l > j, lub≤(t1_i, t2_j) ≤ lub≤(t1_k, t2_l).

The first property states that if a term t1_i from S1 is subsumed by a term t2_j from S2, then the next term t1_{i+1} cannot be subsumed by any term placed before t2_j in the lexicographic order. Therefore it is possible to check the refinement relation by always advancing one position in the two lists S1 and S2.

Algorithm Refine
Input: S1 = [t1_1, t1_2, ..., t1_n] and S2 = [t2_1, t2_2, ..., t2_m], two reduced subscriptions.
Output: true if S1 ⪯ S2, else false.
begin
  i := 1; j := 1
  while (i ≤ n and j ≤ m) do
    // If t2_j subsumes t1_i, advance on S1 and S2
    if (t1_i ≤ t2_j) then i := i + 1; j := j + 1
    else
      // If t1_i is less than t2_j: advance on S1
      if (t1_i <_l t2_j) then i := i + 1
      // If t2_j is less than t1_i: no subsumed term for t2_j in S1:
      // S1 does not refine S2
      if (t2_j <_l t1_i) then return false
    end if
  end do
  // When j ≤ m, some terms in S2 do not have a subsumed term in S1
  if (j ≤ m) then return false else return true
end

The second property ensures the correctness of a merge-like algorithm which always advances on the subscription that contains the smallest term in lexicographic order.
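In Python, the Refine test over sorted label lists can be sketched as follows (an illustrative rendering, not the authors' code; labels are represented as tuples of integers so that the prefix test and the lexicographic order <_l behave correctly even with multi-digit positions):

    def encode(label):
        """'1.2.1' -> (1, 2, 1)."""
        return tuple(int(x) for x in label.split("."))

    def term_le(t1, t2):
        """t1 <= t2 (t2 subsumes t1) iff label(t2) is a prefix of label(t1)."""
        return t1[:len(t2)] == t2

    def refine(s1, s2):
        """True iff S1 refines S2; s1 and s2 are lists of encoded labels, sorted."""
        i, j = 0, 0
        while i < len(s1) and j < len(s2):
            if term_le(s1[i], s2[j]):      # t2_j subsumes t1_i: advance on both lists
                i, j = i + 1, j + 1
            elif s1[i] < s2[j]:            # advance on S1 only
                i += 1
            else:                          # t2_j before t1_i: no witness for t2_j in S1
                return False
        return j == len(s2)                # true iff every term of S2 found a subsumed term

    # {DB, Logic, Sort} = [1.1.1, 1.1.2, 1.3.1] refines {Theory, Algorithms} = [1.1, 1.3].
    s1 = sorted(map(encode, ["1.1.1", "1.1.2", "1.3.1"]))
    s2 = sorted(map(encode, ["1.1", "1.3"]))
    assert refine(s1, s2) and not refine(s2, s1)

Each iteration advances at least one of the two cursors, which gives the linear behaviour discussed below.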
Algorithm Lub
Input: S1 = [t1_1, t1_2, ..., t1_n] and S2 = [t2_1, t2_2, ..., t2_m], two reduced subscriptions.
Output: the least upper bound of (S1, S2).
begin

  i := 1; j := 1; k := 1; L[k] := lub≤(t1_i, t2_j)
  while (i < n or j < m) do
    if (i = n or (j < m and t2_j <_l t1_i)) then
      // advance on S2
      j := j + 1
    else
      // advance on S1
      i := i + 1
    endif
    current := lub≤(t1_i, t2_j)
    // Check whether the current lub must replace the previous one
    if (current ≤ L[k]) then L[k] := current
    else if (L[k] ≤ current) then
      // Ignore the current lub
    else
      // Add the current lub to the result
      k := k + 1; L[k] := current
    endif
  end do
  return L
end

Figure 5: The taxonomy labelling (Programming = 1; Theory = 1.1, Languages = 1.2, Algorithms = 1.3; DB = 1.1.1, Logic = 1.1.2; OOL = 1.2.1; C++ = 1.2.1.1, Java = 1.2.1.2; JSP = 1.2.1.2.1, JavaBeans = 1.2.1.2.2; Sort = 1.3.1; MergeSort = 1.3.1.1, QuickSort = 1.3.1.2, BubbleSort = 1.3.1.3)

Thanks to this technique, Refine(S1, S2) runs in O(max(|S1|, |S2|)), and Lub(S1, S2) in O(|S1| + |S2|). The detailed algorithms are given in the appendix.

3.2 The cost model

The unit cost that we consider is the evaluation of the relation ≤, called term comparison in the following (or comparison for short). As mentioned above, the evaluation of S1 ⪯ S2 requires at most max(|S1|, |S2|) comparisons. For clarity, we assume in the following that the size of an event e (i.e. the number of terms in the description of the document being inserted, modified or removed) is larger than the size of a subscription. The impact of this assumption is marginal.

In order to estimate the gain obtained when using the lub of a set of subscriptions as a filter, we first define the filtering rate of a subscription. It estimates the percentage of successful matchings with respect to events occurring at the library. The filtering rate of a subscription is computed based on the selectivity σ(t) of a term t, which represents the probability for the description of a document to contain either t or a term subsumed by t. Term selectivity can be obtained by analyzing the current repository instance, or simply by assuming a uniform distribution. The analysis that follows is based on a uniform distribution. Let L be the set of leaves of the taxonomy. The selectivity of t ∈ T is estimated by the following inductive formula:

σ(t) = 1/|T|, if t ∈ L
σ(t) = 1/|T| + Σ_{t' ∈ children(t)} σ(t'), otherwise

Given a subscription S = {t1, t2, ..., tn}, the filtering rate of S is σ(S) = σ(t1) × σ(t2) × ... × σ(tn) (conjunctive semantics). Note that this simple model satisfies the obvious constraint that S ⪯ S' implies σ(S) ≤ σ(S').
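The inductive formula can be checked on the taxonomy of Figure 1 with a few more lines of the running Python sketch (again ours, reusing the PARENT map of the first sketch; Fraction is used only to keep the arithmetic exact):

    from fractions import Fraction

    TERMS = set(PARENT) | {"Programming"}            # all terms, including the root
    CHILDREN = {t: [s for s, p in PARENT.items() if p == t] for t in TERMS}

    def selectivity(t):
        """sigma(t): 1/|T| for a leaf, 1/|T| plus the children's selectivities otherwise."""
        return Fraction(1, len(TERMS)) + sum((selectivity(c) for c in CHILDREN[t]),
                                             Fraction(0))

    def filtering_rate(subscription):
        """Conjunctive semantics: product of the term selectivities."""
        rate = Fraction(1)
        for t in subscription:
            rate *= selectivity(t)
        return rate

    assert selectivity("Programming") == 1                   # the root matches every event
    assert selectivity("C++") == Fraction(1, len(TERMS))     # a leaf: 1/|T|
    assert filtering_rate({"OOL", "Sort"}) < filtering_rate({"OOL"})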

For instance, the filtering rate of a subscription S = {rootT}, where rootT is the root of T, is equal to 1: any event matches this subscription. The filtering rate of a subscription S = {l}, where l is a leaf of T, is 1/|T|.

We can now measure the gain of adding the lub for a subset of subscriptions in the tree. Consider a set of subscriptions q1, q2, ..., qm and their lub, denoted qlub. The number of comparisons when testing an event with n terms can be estimated as:

Cost_cl = (1 - σ(qlub)) × n + σ(qlub) × (n + m × n)   (1)

The cost of checking whether e matches qlub is n, with a success probability σ(qlub), in which case we must also consider its children. In the worst case, all the terms have to be tested for each child, hence m × n comparisons. The cost of an independent evaluation of each subscription (without using qlub as a filter) is

Cost_multi = m × n   (2)

The difference between the two solutions depends both on the filtering rate of qlub (the smaller the better), and on the number of children (the higher the better). Using the lub as a filter saves comparisons when Cost_cl < Cost_multi, i.e., when (1 - σ(qlub)) × n + σ(qlub) × (n + m × n) < m × n, that is, when

σ(qlub) < (m - 1) / m   (3)

Formula (3) shows that the efficiency of the evaluation process heavily depends on the presence of qlub as a subscription. If this subscription does not appear in the subscription graph GS, some degenerate cases may occur. The following example illustrates an extreme case.

Example 1. Let the set of subscriptions S be {q1 = {l1}, q2 = {l2}, ..., qm = {lm}}. Each li is a leaf in T and their common parent is l.

1. Assume that there is no other subscription; then the number of comparisons with an event of size n is n × m.

2. Now the subscription qlub = {l} is submitted. Its filtering rate is (m + 1)/|T| and the cost for an event can be estimated as n + ((m + 1)/|T|) × m × n. Provided that m ≪ |T|, the filtering is quite effective.

In conclusion, the closer the lub to the subscriptions of the cluster, the higher the amount of filtering which is obtained. Formula (3) is also sufficient to conclude that in practice the evaluation cost is almost always reduced by using the lub of two subscriptions, the only exception being when the lub consists solely of the root rootT of the taxonomy. The construction of the subscription spanning tree, presented below, systematically attempts to add lubs to the tree during the insertion of a new subscription.

3.3 Insertion algorithm

The insertion algorithm incrementally constructs a tree TS, where each node consists of a subscription S associated with a bucket containing the set of identifiers of the users that have subscribed S. Initially the tree consists of the subscription {rootT}, containing the root term of the taxonomy, with an empty bucket. When a user u subscribes to S, one first searches for the location of S in the tree. This is performed in two steps:

Candidate parent selection. A node N in TS is a candidate parent for S if the following conditions hold: (i) S ⪯ N, and (ii) for each child N' of N, S does not refine N'; i.e., S strictly refines N but does not refine any child of N. The algorithm performs a top-down search, looking for a candidate parent. Starting from the root, it chooses at each level the most selective child that S refines. When such a child no longer exists, the candidate parent is found. Note that this is a heuristic which avoids following an unbounded number of paths in the tree, but does not guarantee that the best candidate parent (i.e., the most selective one) is found.

Lub selection.
Once the candidate parent N is found, the second step inserts S as a child or grandchild of N as follows. First, for each child N' of N, one computes Lub(S, N') and keeps only those lubs which strictly refine N. Now:

1. if at least one such lub l = Lub(S, N') has been found, the most selective one is chosen, and a new subtree l(S, N') is inserted under the parent of N' (that is, under N);

2. else S is inserted as a child of N.

In all cases the id of user u is simply added to the bucket of S.

Consider the tree in the top part of Figure 6, and suppose that the subscription S = {JSP, QuickSort} is to be inserted. We first search for the candidate parent. During the top-down traversal, the most selective path must be chosen. At level 3, the possible paths are {OOL, QuickSort} and {Java, Sort}: the node with the minimum filtering rate, {OOL, QuickSort}, is chosen. Since S refines none of its children, this node is the candidate parent. Subsequently, we compute successively the lub of S and every child of {OOL, QuickSort}:

1. L1 = Lub({JSP, QuickSort}, {C++, QuickSort}) = {OOL, QuickSort}

2. L2 = Lub({JSP, QuickSort}, {JavaBeans, QuickSort}) = {Java, QuickSort}.

Since L1 already exists in the tree, the new subscription tree is the one in the bottom part of Figure 6. Note that when the lub of S and a node N is computed, we know for sure that this lub does not refine any sibling N' of N, otherwise Lub(N, N') would have been inserted in the tree in the first place. This is a consequence of the tree construction algorithm, which is summarized below.

Insert(u, S, TS)
Input: a user u, a subscription S and the tree TS
Output: the new tree after the insertion
begin
  // First step: search the candidate parent using a depth-first search
  parent := CandidateParent(S, rootT)
  // Second step: compute the candidate lubs
  C := ∅
  for each Ni ∈ parent.children do
    if Lub(S, Ni) <> parent then C := C ∪ {Ni}
  endfor
  if C = ∅ then
    // No candidate lub: S is a child of parent
    parent.children := parent.children ∪ {S}
  else
    // Add the lub of the two subscriptions.
    // Choose a node N such that Lub(N, S) is the most selective
    Choose a node N in C
    Add the subtree [Lub(N, S)(N, S)] to parent
    Remove N from parent's children
  endif
  Add the id of the subscriber u to the bucket of S
  return TS
end

The following special cases are not represented in the algorithm: (i) a subscription S is already represented in TS, and (ii) the lub of S and a node N is S itself. The extension is straightforward.

3.4 Removal algorithm

A leaf in the subscription tree whose bucket becomes empty can be removed (note that an empty internal node can still play the role of a filter, and must be kept in the structure). The removal of an empty leaf S, with parent node P, is outlined below:

1. first compute the lub L of the siblings of S (if S has no sibling, then L = ∅), and remove S from the children of P; then:

2. (a) if S has at least one sibling, the second step depends on the bucket of P: if it is empty, P is replaced by L, else L becomes the child of P.
(b) else S has no sibling, and P may become an empty leaf in turn if its bucket is empty. The procedure must then be called recursively, bottom-up.

In the worst case, the removal of S may affect all the nodes along the path from S up to the root. Note however that lazy updating can be used (i.e., the adjustment of S's ancestors is not done immediately), since the tree still supports the insert and search operations correctly after step (1).

3.5 Notification algorithm

Whenever a new event e arrives, the algorithm scans the tree top-down, starting from the root of the tree.

Figure 6: The insertion algorithm (a. the initial tree; b. the tree after insertion of {JSP, QuickSort}, where the new node {Java, QuickSort} is placed under {OOL, QuickSort} with children {JavaBeans, QuickSort} and {JSP, QuickSort})

Table 1: Structure of the subscriptions tree for different subscriptions datasets (for each dataset, from 30,000 to 180,000 subscriptions: depth, insertion level and average fanout, for Cmatch and NCmatch)

The main procedure, Match(N), is called recursively and proceeds as follows:

1. if e does not match N, the scan stops; there is no need to access the children of N;

2. else the users of the bucket associated to N can be notified, while Match is called recursively for each child of N.

The cost of the algorithm is strongly influenced by the average number of children of a node (fanout). If this number is very large, many of the children will not be matched by the event e, and this results in useless evaluations of the refinement relation. When the fanout of the subscriptions tree decreases, the global amount of filtering increases, and our algorithm is expected to greatly reduce the number of dead-ends during the tree traversal. The following section, on experimentation, is intended to validate this analysis.

4 Experiments

We analyze the behavior of our clustered graph structure, called Cmatch ("clustered matching"), and compare it with the following competitors:

1. Naive is the trivial solution which stores the subscriptions in a linear structure.

2. NCmatch ("non-clustered matching") relies on a subscriptions tree without clustering, i.e., we never introduce the lub of users' subscriptions during an insertion. The NCmatch implementation is mostly intended to assess experimentally the gain of the clustering in Cmatch.

The impact of the number of users who registered a given subscription S is neutral, because once a subscription that matches an event is found, all the notification variants (Naive, NCmatch and Cmatch) merely scan the bucket of users, sending a notification to each of them. For clarity we ignore the cost of this specific operation in the presentation of our experimental results, and focus on the cost of finding the set of relevant subscriptions. The evaluation cost is measured with respect to the following indicators: (i) the number of term comparisons, and (ii) the number of nodes visited, for the tree-based solutions Cmatch and NCmatch. We analyze successively the two main operations: the insertion of new subscriptions ("subscribe") and the

search of the subscriptions that match an event ("notify").

4.1 Experimental setting

The structure has been implemented in Java on a Pentium IV processor (3,000 MHz) with 1,024 MB of main memory. The implementation conforms to the specifications given in the previous section, except for the following optimization used in Cmatch and NCmatch. During the top-down traversal of the subscriptions tree, the same term comparison may have to be carried out repeatedly. Consider the three subscriptions S1 = {OOL, QuickSort, BubbleSort}, S2 = {C++, QuickSort, BubbleSort}, S3 = {OOL, MergeSort, BubbleSort}. The tree for these subscriptions is shown in Figure 7. If an event e refines S1, we know for sure that for each term in S1 we found a subsumed term in e. We must now evaluate e ⪯ S2 and e ⪯ S3. If one or several of the terms of S1 are also present in S2 and S3, it is useless to search again for a subsumed term in e. We maintain, at each node N in the graph, a mask of bits which indicates the terms shared by N and one of its ancestors. This is illustrated in Figure 7. The parent node is the subscription {OOL, QuickSort, BubbleSort}. The two children share respectively with their parent the terms BubbleSort and QuickSort (left child), and BubbleSort (right child). A bit is set to 1 if the corresponding term is shared with the parent, and to 0 otherwise. During the matching process we need to evaluate the term comparison only for the 0-bit terms. This saves 2/3 of the comparisons for the left child in Figure 7, and 1/3 for the right child.

Figure 7: Avoiding redundant comparisons (the parent node {OOL, QuickSort, BubbleSort} and its two children, each annotated with a bit mask marking the terms shared with the parent)

Our experimental setting simulates a Digital Library storing a set of scientific documents described by terms from the ACM Computing Classification System [1] taxonomy. The taxonomy contains more than a thousand terms, and its maximal depth is 5. We produced randomly several sets of distinct subscriptions, with a cardinality ranging from 30,000 to 180,000.
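The shared-term optimization can be sketched as follows (our own simplified rendering, reusing subsumes from the first sketch; for brevity the mask records sharing with the parent only, and the Node class and field names are hypothetical):

    class Node:
        def __init__(self, terms, children=()):
            self.terms = list(terms)
            self.children = list(children)
            self.bucket = []                       # ids of the subscribed users
            self.mask = [False] * len(self.terms)  # True: term also present in the parent

    def set_masks(node):
        for child in node.children:
            child.mask = [t in node.terms for t in child.terms]
            set_masks(child)

    def match(event, node):
        """Top-down notification: only the 0-bit (non-shared) terms are compared."""
        for term, shared in zip(node.terms, node.mask):
            if shared:
                continue                           # already matched at the parent
            if not any(subsumes(term, s) for s in event):
                return                             # the whole subtree is filtered out
        for user in node.bucket:
            print("notify", user)
        for child in node.children:
            match(event, child)

    # The three subscriptions of Figure 7:
    root = Node(["OOL", "QuickSort", "BubbleSort"],
                [Node(["C++", "QuickSort", "BubbleSort"]),
                 Node(["OOL", "MergeSort", "BubbleSort"])])
    root.children[0].bucket = ["u1"]
    set_masks(root)
    match({"C++", "QuickSort", "BubbleSort"}, root)   # notifies u1 only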
4.2 Cost of subscriptions

The cost of an insertion for the Naive approach is negligible, since it consists only of an insertion in a linear structure, performed in constant time. We focus therefore on the comparison of Cmatch and NCmatch. Table 1 summarizes the structural properties of the subscriptions tree. The average number of terms in a subscription is 5, with a close clustering around this value (i.e., most subscriptions have 4, 5, 6 or 7 terms). The table gives, for each dataset, the depth of the subscriptions tree, the average insertion level for a new subscription and the average fanout. The most striking feature is the very large fanout of nodes for the non-clustered solution NCmatch. This clearly relates to the large number of terms in the taxonomy, which reduces the probability of finding a refinement relationship between two submitted subscriptions. As a result one obtains a tree with few levels, where the refinement relation is sparsely represented. It can be expected that with a very large taxonomy the tree degenerates to an almost linear structure, where all the subscriptions tend to be children of the root node. Clearly such a structure loses most of the benefits of the approach, since the insertion is costly, yet the amount of filtering remains low. On the other hand, the larger the subscriptions set, the higher the probability of finding a subscription refined by another one. This explains why the fanout decreases as new subscriptions are added.

The Cmatch structure, which clusters the subscriptions and represents these clusters as subtrees rooted at their lub, achieves a quite significant reduction of the fanout with respect to NCmatch. Table 1 shows that the average number of children of a node is about two orders of magnitude lower for Cmatch than for NCmatch. This is clearly a quite desirable property since it reduces both the cost of insertions

and the cost of search operations. We made several experiments that vary the size of the subscriptions, and the obtained results show that the above conclusions still hold.

Table 2 shows the cost of inserting a subscription in an existing subscriptions tree, for different subscriptions datasets. We measure the number of nodes visited by the insertion algorithm, and the number of term comparisons. As expected from the tree properties, the gain of the clustered solution is quite impressive. In particular the small number of comparisons which are necessary to insert a new subscription shows that the structure can support a high ratio of updates. This constitutes an important property for a publish/subscribe system. The results of Table 2 also confirm that the clustering is essential to exploit the refinement relation and filter out most of the comparisons which are made by the non-clustered NCmatch structure. Again, this is related to the size of the taxonomy. With a smaller one, the refinement relation would be more represented in the subscriptions tree, with a probable reduction of the gap between Cmatch and NCmatch. However, the main conclusion of our analytical and experimental study on that matter is the high benefit, in all cases, of the lub-based clustering approach.

Figure 8: Evolution of insertions cost (number of comparisons and number of nodes visited as a function of the number of subscriptions)

4.3 Cost of notifications

We now turn our attention to the notification process. The results obtained for the three solutions Naive, NCmatch and Cmatch are given in Table 3 for our 6 subscriptions datasets. We compare both the average number of nodes visited and the average number of term comparisons (the latter being more representative of the actual cost) for processing a single event (note that an event generates several notifications in general). For the naive solution, the number of visited nodes is equal to the number of subscriptions. The gain of the non-clustered solution is not very significant, since about 50% of the computations are saved. This is easily explained by the shape of the tree, with a small number of levels and a large fanout. The notification algorithm must visit all the nodes matched by the incoming event, and test all the children of these nodes. The Cmatch algorithm benefits strongly from the clustering. The lubs play their role of filters and allow most of the irrelevant computations to be avoided. The number of comparisons remains very small with respect to Naive. Table 3 summarizes these properties for different subscriptions datasets.

An important aspect of the behavior of the algorithms is clearly illustrated by the curves of Figure 8. Whereas NCmatch and Naive exhibit a linear degradation of their computation costs with respect to the size of the subscriptions set, the performance of Cmatch degrades very slowly (Naive is not shown on the curve because of the very large values of its figures). Actually the cost of the processing required to deliver a notification turns out to be almost constant, and independent of the number of subscriptions. This shows that the pruning effect of the tree is quite effective and removes most of the unnecessary computations.

5 Related work

Several data structures and algorithms represent the subscriptions and the descriptions by sets

Table 2: Cost of insertions for different subscriptions datasets (for each dataset size, from 30,000 to 180,000 subscriptions: number of visited nodes and number of comparisons, for Cmatch and NCmatch)

Table 3: Cost of notifications for different subscriptions datasets (for each dataset size: number of visited nodes and number of comparisons, for Cmatch, NCmatch and Naive)

of predicates, where a predicate is a triple (attribute, op, value) and op is a comparison operator (such as ≤ or =). Two main techniques are used in this context. The first one relies on a two-step approach: first the predicates are evaluated with respect to the event's values, and second the matching subscriptions are determined by counting their number of satisfied predicates [10, 2, 3, 6, 7]. [10] uses indexing of equality predicates to speed up the matching of atomic formulas and clusters subscriptions to minimize cache failures. A similar approach is used in [2] and in [3]. In the SIFT system [7, 6], the subscriptions are composed of a set of weighted keywords. The matching algorithm is based on techniques of similarity computation. These techniques do not apply to our problem: we do not have attributes, and the indexing of equality predicates does not extend to the subsumption relation.

The second technique also works in two steps. The first step organizes the subscription directory in a special structure, while the second step uses this special structure to filter the incoming events. For the Elvin publish/subscribe system, Gough and Smith [] present an algorithm translating the subscriptions into a tree. When an event occurs every predicate is tested only once, but it is stored in a redundant way, leading to a combinatorial explosion of its occurrences. This is not the case of our tree structure, since the number of nodes is in the worst case 2|S|, where |S| is the number of subscriptions. Furthermore, the maintenance of the tree structure used by Gough and Smith [] is very costly, compared to our approach. The algorithm presented in [3] is also based on a tree structure where subscriptions are stored in the leaves and each non-leaf node represents a predicate comparison. The space requirements are important: each new subscription with K predicates adds K + 1 nodes, and several paths may have to be followed during the matching process. Moreover the structure is suitable only for equality tests. A similar structure and filtering algorithm are proposed in [6]. More recently, XML-based filtering systems have been proposed [8, 9, 4].

In summary we are not aware of a publish/subscribe technique that considers a simple keyword-based query language and a subsumption relation over the terms. Compared to other works in related areas, the solution proposed in the present paper presents a reasonable storage cost (at most twice the number of subscriptions) and achieves a nice trade-off between the performance of subscription (insertion in the structure) and notification (search in the structure).


V Advanced Data Structures V Advanced Data Structures B-Trees Fibonacci Heaps 18 B-Trees B-trees are similar to RBTs, but they are better at minimizing disk I/O operations Many database systems use B-trees, or variants of them,

More information

Treaps. 1 Binary Search Trees (BSTs) CSE341T/CSE549T 11/05/2014. Lecture 19

Treaps. 1 Binary Search Trees (BSTs) CSE341T/CSE549T 11/05/2014. Lecture 19 CSE34T/CSE549T /05/04 Lecture 9 Treaps Binary Search Trees (BSTs) Search trees are tree-based data structures that can be used to store and search for items that satisfy a total order. There are many types

More information

An algorithm for Performance Analysis of Single-Source Acyclic graphs

An algorithm for Performance Analysis of Single-Source Acyclic graphs An algorithm for Performance Analysis of Single-Source Acyclic graphs Gabriele Mencagli September 26, 2011 In this document we face with the problem of exploiting the performance analysis of acyclic graphs

More information

3.7 Denotational Semantics

3.7 Denotational Semantics 3.7 Denotational Semantics Denotational semantics, also known as fixed-point semantics, associates to each programming language construct a well-defined and rigorously understood mathematical object. These

More information

Hierarchical Intelligent Cuttings: A Dynamic Multi-dimensional Packet Classification Algorithm

Hierarchical Intelligent Cuttings: A Dynamic Multi-dimensional Packet Classification Algorithm 161 CHAPTER 5 Hierarchical Intelligent Cuttings: A Dynamic Multi-dimensional Packet Classification Algorithm 1 Introduction We saw in the previous chapter that real-life classifiers exhibit structure and

More information

COMP 250 Fall recurrences 2 Oct. 13, 2017

COMP 250 Fall recurrences 2 Oct. 13, 2017 COMP 250 Fall 2017 15 - recurrences 2 Oct. 13, 2017 Here we examine the recurrences for mergesort and quicksort. Mergesort Recall the mergesort algorithm: we divide the list of things to be sorted into

More information

Scan Scheduling Specification and Analysis

Scan Scheduling Specification and Analysis Scan Scheduling Specification and Analysis Bruno Dutertre System Design Laboratory SRI International Menlo Park, CA 94025 May 24, 2000 This work was partially funded by DARPA/AFRL under BAE System subcontract

More information

Lecture 6: Analysis of Algorithms (CS )

Lecture 6: Analysis of Algorithms (CS ) Lecture 6: Analysis of Algorithms (CS583-002) Amarda Shehu October 08, 2014 1 Outline of Today s Class 2 Traversals Querying Insertion and Deletion Sorting with BSTs 3 Red-black Trees Height of a Red-black

More information

arxiv: v2 [cs.ds] 30 Sep 2016

arxiv: v2 [cs.ds] 30 Sep 2016 Synergistic Sorting, MultiSelection and Deferred Data Structures on MultiSets Jérémy Barbay 1, Carlos Ochoa 1, and Srinivasa Rao Satti 2 1 Departamento de Ciencias de la Computación, Universidad de Chile,

More information

Trees. Q: Why study trees? A: Many advance ADTs are implemented using tree-based data structures.

Trees. Q: Why study trees? A: Many advance ADTs are implemented using tree-based data structures. Trees Q: Why study trees? : Many advance DTs are implemented using tree-based data structures. Recursive Definition of (Rooted) Tree: Let T be a set with n 0 elements. (i) If n = 0, T is an empty tree,

More information

Propagate the Right Thing: How Preferences Can Speed-Up Constraint Solving

Propagate the Right Thing: How Preferences Can Speed-Up Constraint Solving Propagate the Right Thing: How Preferences Can Speed-Up Constraint Solving Christian Bessiere Anais Fabre* LIRMM-CNRS (UMR 5506) 161, rue Ada F-34392 Montpellier Cedex 5 (bessiere,fabre}@lirmm.fr Ulrich

More information

Multi Domain Logic and its Applications to SAT

Multi Domain Logic and its Applications to SAT Multi Domain Logic and its Applications to SAT Tudor Jebelean RISC Linz, Austria Tudor.Jebelean@risc.uni-linz.ac.at Gábor Kusper Eszterházy Károly College gkusper@aries.ektf.hu Abstract We describe a new

More information

CS301 - Data Structures Glossary By

CS301 - Data Structures Glossary By CS301 - Data Structures Glossary By Abstract Data Type : A set of data values and associated operations that are precisely specified independent of any particular implementation. Also known as ADT Algorithm

More information

A CSP Search Algorithm with Reduced Branching Factor

A CSP Search Algorithm with Reduced Branching Factor A CSP Search Algorithm with Reduced Branching Factor Igor Razgon and Amnon Meisels Department of Computer Science, Ben-Gurion University of the Negev, Beer-Sheva, 84-105, Israel {irazgon,am}@cs.bgu.ac.il

More information

Lecture 2 - Graph Theory Fundamentals - Reachability and Exploration 1

Lecture 2 - Graph Theory Fundamentals - Reachability and Exploration 1 CME 305: Discrete Mathematics and Algorithms Instructor: Professor Aaron Sidford (sidford@stanford.edu) January 11, 2018 Lecture 2 - Graph Theory Fundamentals - Reachability and Exploration 1 In this lecture

More information

9/29/2016. Chapter 4 Trees. Introduction. Terminology. Terminology. Terminology. Terminology

9/29/2016. Chapter 4 Trees. Introduction. Terminology. Terminology. Terminology. Terminology Introduction Chapter 4 Trees for large input, even linear access time may be prohibitive we need data structures that exhibit average running times closer to O(log N) binary search tree 2 Terminology recursive

More information

V Advanced Data Structures

V Advanced Data Structures V Advanced Data Structures B-Trees Fibonacci Heaps 18 B-Trees B-trees are similar to RBTs, but they are better at minimizing disk I/O operations Many database systems use B-trees, or variants of them,

More information

Model Checking I Binary Decision Diagrams

Model Checking I Binary Decision Diagrams /42 Model Checking I Binary Decision Diagrams Edmund M. Clarke, Jr. School of Computer Science Carnegie Mellon University Pittsburgh, PA 523 2/42 Binary Decision Diagrams Ordered binary decision diagrams

More information

Intro to DB CHAPTER 12 INDEXING & HASHING

Intro to DB CHAPTER 12 INDEXING & HASHING Intro to DB CHAPTER 12 INDEXING & HASHING Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B+-Tree Index Files B-Tree Index Files Static Hashing Dynamic Hashing Comparison of Ordered Indexing

More information

Database System Concepts, 6 th Ed. Silberschatz, Korth and Sudarshan See for conditions on re-use

Database System Concepts, 6 th Ed. Silberschatz, Korth and Sudarshan See  for conditions on re-use Chapter 11: Indexing and Hashing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files Static

More information

Optimal k-anonymity with Flexible Generalization Schemes through Bottom-up Searching

Optimal k-anonymity with Flexible Generalization Schemes through Bottom-up Searching Optimal k-anonymity with Flexible Generalization Schemes through Bottom-up Searching Tiancheng Li Ninghui Li CERIAS and Department of Computer Science, Purdue University 250 N. University Street, West

More information

Figure 4.1: The evolution of a rooted tree.

Figure 4.1: The evolution of a rooted tree. 106 CHAPTER 4. INDUCTION, RECURSION AND RECURRENCES 4.6 Rooted Trees 4.6.1 The idea of a rooted tree We talked about how a tree diagram helps us visualize merge sort or other divide and conquer algorithms.

More information

Analytical Modeling of Parallel Systems. To accompany the text ``Introduction to Parallel Computing'', Addison Wesley, 2003.

Analytical Modeling of Parallel Systems. To accompany the text ``Introduction to Parallel Computing'', Addison Wesley, 2003. Analytical Modeling of Parallel Systems To accompany the text ``Introduction to Parallel Computing'', Addison Wesley, 2003. Topic Overview Sources of Overhead in Parallel Programs Performance Metrics for

More information

MergeSort, Recurrences, Asymptotic Analysis Scribe: Michael P. Kim Date: April 1, 2015

MergeSort, Recurrences, Asymptotic Analysis Scribe: Michael P. Kim Date: April 1, 2015 CS161, Lecture 2 MergeSort, Recurrences, Asymptotic Analysis Scribe: Michael P. Kim Date: April 1, 2015 1 Introduction Today, we will introduce a fundamental algorithm design paradigm, Divide-And-Conquer,

More information

Handout 9: Imperative Programs and State

Handout 9: Imperative Programs and State 06-02552 Princ. of Progr. Languages (and Extended ) The University of Birmingham Spring Semester 2016-17 School of Computer Science c Uday Reddy2016-17 Handout 9: Imperative Programs and State Imperative

More information

Faloutsos 1. Carnegie Mellon Univ. Dept. of Computer Science Database Applications. Outline

Faloutsos 1. Carnegie Mellon Univ. Dept. of Computer Science Database Applications. Outline Carnegie Mellon Univ. Dept. of Computer Science 15-415 - Database Applications Lecture #14: Implementation of Relational Operations (R&G ch. 12 and 14) 15-415 Faloutsos 1 introduction selection projection

More information

Operations on Heap Tree The major operations required to be performed on a heap tree are Insertion, Deletion, and Merging.

Operations on Heap Tree The major operations required to be performed on a heap tree are Insertion, Deletion, and Merging. Priority Queue, Heap and Heap Sort In this time, we will study Priority queue, heap and heap sort. Heap is a data structure, which permits one to insert elements into a set and also to find the largest

More information

6. Lecture notes on matroid intersection

6. Lecture notes on matroid intersection Massachusetts Institute of Technology 18.453: Combinatorial Optimization Michel X. Goemans May 2, 2017 6. Lecture notes on matroid intersection One nice feature about matroids is that a simple greedy algorithm

More information

Fast algorithms for max independent set

Fast algorithms for max independent set Fast algorithms for max independent set N. Bourgeois 1 B. Escoffier 1 V. Th. Paschos 1 J.M.M. van Rooij 2 1 LAMSADE, CNRS and Université Paris-Dauphine, France {bourgeois,escoffier,paschos}@lamsade.dauphine.fr

More information

Advanced Algorithms. Class Notes for Thursday, September 18, 2014 Bernard Moret

Advanced Algorithms. Class Notes for Thursday, September 18, 2014 Bernard Moret Advanced Algorithms Class Notes for Thursday, September 18, 2014 Bernard Moret 1 Amortized Analysis (cont d) 1.1 Side note: regarding meldable heaps When we saw how to meld two leftist trees, we did not

More information

Slides for Faculty Oxford University Press All rights reserved.

Slides for Faculty Oxford University Press All rights reserved. Oxford University Press 2013 Slides for Faculty Assistance Preliminaries Author: Vivek Kulkarni vivek_kulkarni@yahoo.com Outline Following topics are covered in the slides: Basic concepts, namely, symbols,

More information

Horn Formulae. CS124 Course Notes 8 Spring 2018

Horn Formulae. CS124 Course Notes 8 Spring 2018 CS124 Course Notes 8 Spring 2018 In today s lecture we will be looking a bit more closely at the Greedy approach to designing algorithms. As we will see, sometimes it works, and sometimes even when it

More information

7.3 Spanning trees Spanning trees [ ] 61

7.3 Spanning trees Spanning trees [ ] 61 7.3. Spanning trees [161211-1348 ] 61 7.3 Spanning trees We know that trees are connected graphs with the minimal number of edges. Hence trees become very useful in applications where our goal is to connect

More information

Chapter 11: Indexing and Hashing" Chapter 11: Indexing and Hashing"

Chapter 11: Indexing and Hashing Chapter 11: Indexing and Hashing Chapter 11: Indexing and Hashing" Database System Concepts, 6 th Ed.! Silberschatz, Korth and Sudarshan See www.db-book.com for conditions on re-use " Chapter 11: Indexing and Hashing" Basic Concepts!

More information

Thus, it is reasonable to compare binary search trees and binary heaps as is shown in Table 1.

Thus, it is reasonable to compare binary search trees and binary heaps as is shown in Table 1. 7.2 Binary Min-Heaps A heap is a tree-based structure, but it doesn t use the binary-search differentiation between the left and right sub-trees to create a linear ordering. Instead, a binary heap only

More information

Computational Optimization ISE 407. Lecture 16. Dr. Ted Ralphs

Computational Optimization ISE 407. Lecture 16. Dr. Ted Ralphs Computational Optimization ISE 407 Lecture 16 Dr. Ted Ralphs ISE 407 Lecture 16 1 References for Today s Lecture Required reading Sections 6.5-6.7 References CLRS Chapter 22 R. Sedgewick, Algorithms in

More information

FINAL EXAM SOLUTIONS

FINAL EXAM SOLUTIONS COMP/MATH 3804 Design and Analysis of Algorithms I Fall 2015 FINAL EXAM SOLUTIONS Question 1 (12%). Modify Euclid s algorithm as follows. function Newclid(a,b) if a

More information

18.3 Deleting a key from a B-tree

18.3 Deleting a key from a B-tree 18.3 Deleting a key from a B-tree B-TREE-DELETE deletes the key from the subtree rooted at We design it to guarantee that whenever it calls itself recursively on a node, the number of keys in is at least

More information

1 The range query problem

1 The range query problem CS268: Geometric Algorithms Handout #12 Design and Analysis Original Handout #12 Stanford University Thursday, 19 May 1994 Original Lecture #12: Thursday, May 19, 1994 Topics: Range Searching with Partition

More information

Core Membership Computation for Succinct Representations of Coalitional Games

Core Membership Computation for Succinct Representations of Coalitional Games Core Membership Computation for Succinct Representations of Coalitional Games Xi Alice Gao May 11, 2009 Abstract In this paper, I compare and contrast two formal results on the computational complexity

More information

Chapter 12: Query Processing

Chapter 12: Query Processing Chapter 12: Query Processing Overview Catalog Information for Cost Estimation $ Measures of Query Cost Selection Operation Sorting Join Operation Other Operations Evaluation of Expressions Transformation

More information

Trees. (Trees) Data Structures and Programming Spring / 28

Trees. (Trees) Data Structures and Programming Spring / 28 Trees (Trees) Data Structures and Programming Spring 2018 1 / 28 Trees A tree is a collection of nodes, which can be empty (recursive definition) If not empty, a tree consists of a distinguished node r

More information

1 (15 points) LexicoSort

1 (15 points) LexicoSort CS161 Homework 2 Due: 22 April 2016, 12 noon Submit on Gradescope Handed out: 15 April 2016 Instructions: Please answer the following questions to the best of your ability. If you are asked to show your

More information

We will show that the height of a RB tree on n vertices is approximately 2*log n. In class I presented a simple structural proof of this claim:

We will show that the height of a RB tree on n vertices is approximately 2*log n. In class I presented a simple structural proof of this claim: We have seen that the insert operation on a RB takes an amount of time proportional to the number of the levels of the tree (since the additional operations required to do any rebalancing require constant

More information

Lecture Notes: External Interval Tree. 1 External Interval Tree The Static Version

Lecture Notes: External Interval Tree. 1 External Interval Tree The Static Version Lecture Notes: External Interval Tree Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong taoyf@cse.cuhk.edu.hk This lecture discusses the stabbing problem. Let I be

More information

Quiz 1 Solutions. (a) f(n) = n g(n) = log n Circle all that apply: f = O(g) f = Θ(g) f = Ω(g)

Quiz 1 Solutions. (a) f(n) = n g(n) = log n Circle all that apply: f = O(g) f = Θ(g) f = Ω(g) Introduction to Algorithms March 11, 2009 Massachusetts Institute of Technology 6.006 Spring 2009 Professors Sivan Toledo and Alan Edelman Quiz 1 Solutions Problem 1. Quiz 1 Solutions Asymptotic orders

More information

Lecture Notes on Binary Decision Diagrams

Lecture Notes on Binary Decision Diagrams Lecture Notes on Binary Decision Diagrams 15-122: Principles of Imperative Computation William Lovas Notes by Frank Pfenning Lecture 25 April 21, 2011 1 Introduction In this lecture we revisit the important

More information

Lecture 7 February 26, 2010

Lecture 7 February 26, 2010 6.85: Advanced Data Structures Spring Prof. Andre Schulz Lecture 7 February 6, Scribe: Mark Chen Overview In this lecture, we consider the string matching problem - finding all places in a text where some

More information

CHAPTER 3 LITERATURE REVIEW

CHAPTER 3 LITERATURE REVIEW 20 CHAPTER 3 LITERATURE REVIEW This chapter presents query processing with XML documents, indexing techniques and current algorithms for generating labels. Here, each labeling algorithm and its limitations

More information

Search. The Nearest Neighbor Problem

Search. The Nearest Neighbor Problem 3 Nearest Neighbor Search Lab Objective: The nearest neighbor problem is an optimization problem that arises in applications such as computer vision, pattern recognition, internet marketing, and data compression.

More information

Algorithms in Systems Engineering ISE 172. Lecture 16. Dr. Ted Ralphs

Algorithms in Systems Engineering ISE 172. Lecture 16. Dr. Ted Ralphs Algorithms in Systems Engineering ISE 172 Lecture 16 Dr. Ted Ralphs ISE 172 Lecture 16 1 References for Today s Lecture Required reading Sections 6.5-6.7 References CLRS Chapter 22 R. Sedgewick, Algorithms

More information

Properties of red-black trees

Properties of red-black trees Red-Black Trees Introduction We have seen that a binary search tree is a useful tool. I.e., if its height is h, then we can implement any basic operation on it in O(h) units of time. The problem: given

More information

Hash-Based Indexing 165

Hash-Based Indexing 165 Hash-Based Indexing 165 h 1 h 0 h 1 h 0 Next = 0 000 00 64 32 8 16 000 00 64 32 8 16 A 001 01 9 25 41 73 001 01 9 25 41 73 B 010 10 10 18 34 66 010 10 10 18 34 66 C Next = 3 011 11 11 19 D 011 11 11 19

More information

Final Exam in Algorithms and Data Structures 1 (1DL210)

Final Exam in Algorithms and Data Structures 1 (1DL210) Final Exam in Algorithms and Data Structures 1 (1DL210) Department of Information Technology Uppsala University February 0th, 2012 Lecturers: Parosh Aziz Abdulla, Jonathan Cederberg and Jari Stenman Location:

More information

Efficient subset and superset queries

Efficient subset and superset queries Efficient subset and superset queries Iztok SAVNIK Faculty of Mathematics, Natural Sciences and Information Technologies, University of Primorska, Glagoljaška 8, 5000 Koper, Slovenia Abstract. The paper

More information

Database System Concepts

Database System Concepts Chapter 13: Query Processing s Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2008/2009 Slides (fortemente) baseados nos slides oficiais do livro c Silberschatz, Korth

More information

Lecture: Analysis of Algorithms (CS )

Lecture: Analysis of Algorithms (CS ) Lecture: Analysis of Algorithms (CS583-002) Amarda Shehu Fall 2017 1 Binary Search Trees Traversals, Querying, Insertion, and Deletion Sorting with BSTs 2 Example: Red-black Trees Height of a Red-black

More information

Introduction. for large input, even access time may be prohibitive we need data structures that exhibit times closer to O(log N) binary search tree

Introduction. for large input, even access time may be prohibitive we need data structures that exhibit times closer to O(log N) binary search tree Chapter 4 Trees 2 Introduction for large input, even access time may be prohibitive we need data structures that exhibit running times closer to O(log N) binary search tree 3 Terminology recursive definition

More information

Binary Trees

Binary Trees Binary Trees 4-7-2005 Opening Discussion What did we talk about last class? Do you have any code to show? Do you have any questions about the assignment? What is a Tree? You are all familiar with what

More information

Binary Trees. BSTs. For example: Jargon: Data Structures & Algorithms. root node. level: internal node. edge.

Binary Trees. BSTs. For example: Jargon: Data Structures & Algorithms. root node. level: internal node. edge. Binary Trees 1 A binary tree is either empty, or it consists of a node called the root together with two binary trees called the left subtree and the right subtree of the root, which are disjoint from

More information

Binary Search Trees, etc.

Binary Search Trees, etc. Chapter 12 Binary Search Trees, etc. Binary Search trees are data structures that support a variety of dynamic set operations, e.g., Search, Minimum, Maximum, Predecessors, Successors, Insert, and Delete.

More information

Contents Contents Introduction Basic Steps in Query Processing Introduction Transformation of Relational Expressions...

Contents Contents Introduction Basic Steps in Query Processing Introduction Transformation of Relational Expressions... Contents Contents...283 Introduction...283 Basic Steps in Query Processing...284 Introduction...285 Transformation of Relational Expressions...287 Equivalence Rules...289 Transformation Example: Pushing

More information

A counter-example to the minimal coverability tree algorithm

A counter-example to the minimal coverability tree algorithm A counter-example to the minimal coverability tree algorithm A. Finkel, G. Geeraerts, J.-F. Raskin and L. Van Begin Abstract In [1], an algorithm to compute a minimal coverability tree for Petri nets has

More information

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Sorting lower bound and Linear-time sorting Date: 9/19/17

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Sorting lower bound and Linear-time sorting Date: 9/19/17 601.433/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Sorting lower bound and Linear-time sorting Date: 9/19/17 5.1 Introduction You should all know a few ways of sorting in O(n log n)

More information

Trees. Chapter 6. strings. 3 Both position and Enumerator are similar in concept to C++ iterators, although the details are quite different.

Trees. Chapter 6. strings. 3 Both position and Enumerator are similar in concept to C++ iterators, although the details are quite different. Chapter 6 Trees In a hash table, the items are not stored in any particular order in the table. This is fine for implementing Sets and Maps, since for those abstract data types, the only thing that matters

More information

CHENNAI MATHEMATICAL INSTITUTE M.Sc. / Ph.D. Programme in Computer Science

CHENNAI MATHEMATICAL INSTITUTE M.Sc. / Ph.D. Programme in Computer Science CHENNAI MATHEMATICAL INSTITUTE M.Sc. / Ph.D. Programme in Computer Science Entrance Examination, 5 May 23 This question paper has 4 printed sides. Part A has questions of 3 marks each. Part B has 7 questions

More information

Precomputation Schemes for QoS Routing

Precomputation Schemes for QoS Routing 578 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 11, NO. 4, AUGUST 2003 Precomputation Schemes for QoS Routing Ariel Orda, Senior Member, IEEE, and Alexander Sprintson, Student Member, IEEE Abstract Precomputation-based

More information