Matching Algorithms for User Notification in Digital Libraries


Matching Algorithms for User Notification in Digital Libraries

H. Belhaj Frej 1, P. Rigaux 2 and N. Spyratos 1

1 LRI, Univ. Paris-Sud, 91400 Orsay
2 Lamsade, Univ. Paris-Dauphine, 75016 Paris

This work is conducted in the context of the KP-LAB (Knowledge Practices Laboratory) project, Work Package 5: "Middleware".

Abstract

We consider a publish/subscribe system for digital libraries which continuously evaluates queries over a large repository containing document descriptions. The subscriptions, the query expressions and the document descriptions all rely on a taxonomy, that is, a hierarchically organized set of keywords, or terms. The digital library supports insertion, update and removal of a document. Each of these operations is seen as an event that must be notified only to those users whose subscriptions match the document's description. The paper addresses the problem of efficiently supporting the notification process, and makes contributions in three directions: (a) definition of a formal model for the publish/subscribe process; (b) introduction of a semi-lattice structure on subscriptions allowing non-matching subscriptions to be filtered out; (c) experimental results that show the cost benefits obtained by our approach.

1 Introduction

The publish/subscribe interaction paradigm provides subscribers with the ability to express their interest in classes of events generated by publishers. A system that supports this paradigm must be able to find, for each incoming event e, the subscriptions that match e, in order to determine which subscribers should be notified. Many typical web applications can be seen as variants of this general framework, including auction sites, on-line rental offices, virtual bookshops, etc. They act as brokers which store only descriptions of the published items.

In the present paper we focus on digital libraries (DL) which maintain (in a repository) descriptions of documents and pointers to the documents' contents. In this context a publisher is an author who provides to the DL descriptions of his documents and ways to access their contents (e.g. their URIs), whereas a subscriber is a user willing to be informed of any event affecting a document that relates to his topics of interest. We consider a DL model with the following characteristics:

There is a taxonomy to which both the authors of documents and the subscribers of the library adhere; this taxonomy is just a set of keywords, or terms, structured as a tree. An example of a taxonomy is the well-known ACM Computing Classification System [1].

A document is represented in the DL repository by a description of its content together with an identifier (say, the document's URI) allowing access to the document's content; the description is just a set of terms from the taxonomy.

A query against the library is just a conjunction of terms from the taxonomy (i.e. a conjunctive query).

A user is represented by an identifier together with a subscription; a subscription is just a query defining (intensionally) the documents of interest to the user.

This simple model covers a wide range of DL organizations, document structures and document formats. In particular, the DL repository can be centralized or distributed without affecting our model. Here, what we call a centralized repository is one in which the

library stores not only the descriptions of documents but also their contents; and what we call a distributed repository is one in which the library stores only the descriptions of documents plus pointers to their contents, the contents themselves being stored at the local repositories of the authors.

Figure 1: A taxonomy (Programming has children Theory, Languages and Algorithms; Languages has child OOL, with children C++ and Java; Java has children JSP and JavaBeans; Algorithms has child Sort, with children MergeSort, QuickSort and BubbleSort)

A subscriber must be informed, or notified, whenever an event matching his subscription occurs at the DL. The problem addressed in this paper can be stated as follows: assuming a large number of subscriptions and/or a high rate of events, how to design a fast algorithm for finding those subscribers that should be notified when an event occurs at the DL. A considerable amount of work has been devoted in the recent past to designing such algorithms for large-scale systems [3]. These works apply mostly to relational databases, where subscriptions are seen as conjunctions of predicates (i.e. of relational variables). Their main common ideas can be summarized as follows: (i) if a predicate appears in many subscriptions, avoid its repeated evaluation, and (ii) try to filter out large sets of subscriptions by evaluating first the most selective predicates.

The present paper considers specifically the context of Digital Libraries and makes the following contributions.

1. we define a refinement relation over the set of subscriptions and provide simple algorithms for (a) testing whether a subscription refines another, and (b) computing the least upper bound (lub) of a set of subscriptions;

2. we propose a cost model and an incremental aggregation algorithm in order to maintain a structure of the set of subscriptions that minimizes the matching cost;

3. we provide a set of experimental results which show the gain obtained by our method.

The main idea behind our work is to exploit a refinement relation between subscriptions so as to avoid useless computations. Roughly speaking, if a subscription S2 refines a subscription S1, this means that the set of documents satisfying S2 is a subset of the set of documents satisfying S1. Therefore, if a matching test for S1 fails, there is no need to carry out the test for S2. This can be generalized to sets of subscriptions C = {S1, S2, ..., Sn} that we call clusters: if we can find an aggregate subscription S such that any event that matches some Si also matches S, then we make an initial test for S; if this initial test fails, then no further test is needed for the subscriptions in C. On the other hand, if the test for S succeeds then it will succeed for at least one subscription in C. However, in practice the set of events that match S is a superset of those that match the subscriptions in C. Our algorithm aims at minimizing this loss of precision by tightening the representation of the clusters.

In the following, Section 2 presents our model and Section 3 presents our algorithms. We present experimental results in Section 4 and discuss related work in Section 5. In Section 6 we offer some concluding remarks and discuss perspectives.

2 The Model

As we mentioned in the introduction, we assume a single taxonomy over which several basic concepts of a digital library are defined. To begin with, we denote the library taxonomy as (T, ≤), where T is a set of keywords, or terms, and ≤ is a subsumption relation over T, that is, a reflexive and transitive binary relation over T, which we assume to be a tree. Given two terms s and t, if s ≤ t then we say that s is subsumed by t, or that t subsumes s. We represent the taxonomy as a tree in which the nodes are the terms of T and where there is an arrow from term t to term s iff t subsumes s. Figure 1 shows an example of a taxonomy, in which the term Languages subsumes the term OOL, the term OOL subsumes the terms C++ and Java, and so on.

In order to make a document sharable by the community of library users, its author must register the document at the library. This is done by providing a description of the document's content together with an identifier (say, the document's URI) allowing access to the document's content. As already mentioned in the introduction, the description is just a set of terms from the taxonomy.

Definition 1 (Description). Given a taxonomy (T, ≤) we call description in T any set of terms from T.

For example, if the document contains the quick sort algorithm written in Java then the terms QuickSort and Java can be used to describe its content. In this case the set of terms {QuickSort, Java} is the description of the document. During document registration, what is actually stored at the DL repository is the document identifier and the document's description. Conceptually, the repository can be thought of as a binary relation between terms and document identifiers (i.e. as a set of pairs <term, doc-id>) defined as follows: during registration of a document with identifier d, the library stores a pair (t, d) for each term t appearing in the description of d. Figure 2 shows an example of a repository over the taxonomy. The dotted lines indicate the pairs (t, d) of the repository, relating terms with document identifiers.

A description can be redundant if some of the terms it contains are subsumed by other terms. For example, the description {QuickSort, Java, Sort} is redundant, as QuickSort is subsumed by Sort. If we remove either Sort or QuickSort then we obtain a non-redundant description: either {QuickSort, Java} or {Sort, Java}, respectively. Redundant descriptions are undesirable as they can lead to redundant computations during subscription evaluation. We shall therefore limit our attention to non-redundant descriptions. More generally, we shall limit our attention to reduced sets of terms, defined as follows:

Definition 2 (Reduced Set of Terms). A set of terms D from T is called reduced if for any two distinct terms s and t in D, s ≰ t and t ≰ s.

Following the above definition one can reduce a description in (at least) two ways: removing all but the minimal terms, or removing all but the maximal terms. In this work we adopt the first approach, that is, we reduce a description by removing all but its minimal terms.

Definition 3 (Reduction). Given a description D in T we call reduction of D, denoted reduce(D), the set of minimal terms in D with respect to the subsumption ≤.

The reason for our choice is that by removing all but minimal terms we obtain a more accurate description of the document. This should be clear from our previous example, where the description {QuickSort, Java} is more accurate than {Sort, Java}.
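To make these definitions concrete, here is a small Python sketch (ours, not part of the paper: the PARENT map encodes the taxonomy of Figure 1, and the helper names are hypothetical) that implements term subsumption and the reduction of a description:

    # Taxonomy of Figure 1 as a child -> parent map (the root Programming has no parent).
    PARENT = {
        "Theory": "Programming", "Languages": "Programming", "Algorithms": "Programming",
        "OOL": "Languages", "Sort": "Algorithms",
        "C++": "OOL", "Java": "OOL", "JSP": "Java", "JavaBeans": "Java",
        "MergeSort": "Sort", "QuickSort": "Sort", "BubbleSort": "Sort",
    }

    def subsumes(t, s):
        """True iff t subsumes s, i.e. s <= t (t is s itself or an ancestor of s)."""
        while s is not None:
            if s == t:
                return True
            s = PARENT.get(s)          # climb towards the root
        return False

    def reduce_description(terms):
        """reduce(D) of Definition 3: keep only the minimal terms of a description."""
        terms = set(terms)
        return {s for s in terms
                if not any(s != t and subsumes(s, t) for t in terms)}

    # The redundant description of the example becomes non-redundant:
    assert reduce_description({"QuickSort", "Java", "Sort"}) == {"QuickSort", "Java"}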
Henceforth, when using the term description we shall always mean reduced description.

A query in our model is just a conjunction of terms from the taxonomy, and its answer is a set of documents, as defined below. In the following definition the symbol tail(t) stands for the set of all terms in the taxonomy T strictly subsumed by t, that is, tail(t) = {s | s ≤ t and s ≠ t}, and R stands for the repository of the digital library. Queries are defined over T and they are answered based on R.

Definition 4 (Query over T and its answer). A query q over T is either a single term or a conjunction of terms from T. Its answer, denoted by ans(q), is a set of documents defined as follows:

Figure 2: A repository (the dotted lines relate terms of the taxonomy of Figure 1 to the document identifiers d1, ..., d6)

Case 1: q is a single term t from T, i.e., q = t.
ans(q) = {d | (t, d) ∈ R} if tail(t) = ∅, and ans(q) = ∪ {ans(s) | s ∈ tail(t)} otherwise.

Case 2: q is a conjunction of terms, i.e., q = t1 ∧ t2 ∧ ... ∧ tn.
ans(q) = ans(t1) ∩ ans(t2) ∩ ... ∩ ans(tn)

As an example, consider the query q = C++ ∧ Sort. Referring to Figure 2 and applying the above definition, we find ans(q) = {d5, d6} ∩ {d3, d4, d6} = {d6}.

A subscription in our model is just a query describing (intensionally) the set of documents of interest to a user. In practice, a user can define his subscription by selecting terms from the taxonomy. The conjunction of the selected terms is the user's subscription.

Definition 5 (Subscription). Given a taxonomy (T, ≤) we call subscription in T any reduced set of terms from T.

Henceforth, we shall think of a subscription either as a set of terms (e.g., {C++, Java}) or as a query (e.g., C++ ∧ Java). The answer to this query at a given moment in time represents all documents of interest to the user that the library contains at that particular moment in time. However, the answer changes over time, as new documents are inserted in the library, or existing documents are modified or deleted. These changes that occur at the library over time are precisely what we call events:

Definition 6 (Event). In a digital library, we call event the insertion, modification or removal of a document. An event is represented by the description of the document being inserted, modified, or removed.

When an event occurs the system must inform, or notify, each user whose subscription is matched by the event.

Definition 7 (Matching). Let e be an event and q a subscription. We say that e matches q if one of the following holds:

1. A document d is removed and d ∈ ans(q) before the event occurs.

2. A document d is inserted and d ∈ ans(q) after the event has occurred.

The case of document modification is treated as a deletion followed by an insertion.

Clearly, when an event e occurs, the system must decide which users are to be notified. A naive approach is to examine every subscription q and test whether e matches q. However, if the set of subscriptions is large, and/or the rate of events is high, the system might quickly become overwhelmed. In what follows, we shall refer to this simplistic approach as Naive.
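Continuing the sketch above (the repository contents below are only an illustrative assumption chosen to be consistent with the example ans(C++ ∧ Sort) = {d6}; the real Figure 2 may assign terms to documents differently), Definition 4 and the matching test can be written as:

    # Repository R as a set of (term, doc-id) pairs.
    R = {("C++", "d5"), ("C++", "d6"),
         ("MergeSort", "d3"), ("QuickSort", "d4"), ("BubbleSort", "d6")}

    def tail(t):
        """Terms strictly subsumed by t."""
        return {s for s in PARENT if s != t and subsumes(t, s)}

    def ans_term(t):
        below = tail(t)
        if not below:                                          # Case 1, t is a leaf
            return {d for (u, d) in R if u == t}
        return set().union(*(ans_term(s) for s in below))      # union over tail(t)

    def ans(query):
        """Case 2: a query is a set of terms, interpreted conjunctively."""
        return set.intersection(*(ans_term(t) for t in query))

    def matches(event_description, subscription):
        """Event e matches q iff every term of q subsumes some term of e's description."""
        return all(any(subsumes(t, s) for s in event_description)
                   for t in subscription)

    assert ans({"C++", "Sort"}) == {"d6"}
    assert matches({"QuickSort", "Java"}, {"OOL", "Sort"})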

In this paper, we introduce a more efficient approach based on the observation that testing whether an event matches a subscription is basically a set membership test (i.e. testing whether a document belongs to a given set of documents, which happens to be the answer to a query). The idea that we exploit here is the following: if we have to perform the membership test for every set in a collection of sets, we can save computations by starting with the maximal sets first (maximality with respect to set inclusion). Indeed, if a document does not belong to a maximal set then we don't need to test membership for any of its subsets. In order to implement this idea, we first need to define a notion of refinement between subscriptions. In fact, we need a definition that translates the following intuition: if subscription q1 refines subscription q2 then every event that matches q1 also matches q2.

Definition 8 (Refinement Relation). Let q1 and q2 be two subscriptions. We say that q1 is finer than q2, denoted q1 ⪯ q2, iff for every term t2 ∈ q2 there is a term t1 ∈ q1 such that t1 ≤ t2.

In other words, q1 is finer than q2 if every term of q2 subsumes some term of q1. For example, the subscription q1 = {QuickSort, Java, BubbleSort} is finer than q2 = {Sort, OOL}, whereas q2 is not finer than q1. If the sets of terms q1 and q2 are reduced sets then the refinement relation just defined becomes a partial order. In fact, it can be shown that: (i) there exists a least upper bound for any subset subS of the set S of all subscriptions; (ii) the set S is an upper semi-lattice.

Proposition 1. Let subS be a subset of the set S of all subscriptions. There exists a least upper bound for subS, denoted lub(subS).

Proposition 2 ([5]). (S, ⪯) is an upper semi-lattice.

Proof. Let subS = {S1, ..., Sn} be a set of subscriptions. Let U be the set of all subscriptions S such that Si ⪯ S, i = 1, 2, ..., n. We show that U has a unique minimal element, which we shall denote as lub(subS). Let P = S1 × S2 × ... × Sn be the Cartesian product of the subscriptions in subS, and suppose that there are k tuples in this product, say P = {L1, L2, ..., Lk}. Let A = {lub≤(L1), lub≤(L2), ..., lub≤(Lk)}, where lub≤(Li) denotes the least upper bound of the terms in Li with respect to ≤. As (T, ≤) is a tree, this least upper bound exists for all i = 1, 2, ..., k. Now, let R be the reduction of A, i.e., R = reduce(A). R is the smallest element of U. Indeed, it follows from the definition of R that Si ⪯ R, for i = 1, 2, ..., n. Moreover, let S be any subscription in U, and let t be a term in S. It follows from the definition of U that there is a term vi in each subscription Si such that vi ≤ t. Consider now the tuple v = <v1, v2, ..., vn>. By the definition of least upper bound, lub≤(v) ≤ t; as lub≤(v) belongs to A, the reduction R contains a term r with r ≤ lub≤(v) ≤ t, and it follows that R ⪯ S, which completes the proof.

The proof suggests a simple algorithm for computing the lub of a set of subscriptions {S1, ..., Sn}:

1. Form the Cartesian product C of all subscriptions in {S1, ..., Sn};

2. Apply lub≤ to each tuple of terms in C to find the set L of lubs;

3. Remove all but the minimal terms in L.

The output of this algorithm is a reduced set of terms which is the lub of {S1, ..., Sn}. Consider the subscriptions S1 = {Java, C++, QuickSort} and S2 = {JSP, MergeSort}. In order to find their lub we proceed as follows:

1. Cartesian product C of S1 and S2: C = S1 × S2 = {{Java, JSP}, {Java, MergeSort}, {C++, JSP}, {C++, MergeSort}, {QuickSort, JSP}, {QuickSort, MergeSort}}

2. L = {Java, Programming, OOL, Sort}

3. Reduce(L) = {Java, Sort}
Note that this direct approach to finding the lub of a set of subscriptions runs in O(|S1| × |S2| × ... × |Sn|). We shall see a linear algorithm in the next section.

Intuitively, refining a subscription can be seen as imposing tighter matching constraints. There exist two possible ways of doing so: by simply adding some terms to a subscription, or by replacing a term in the subscription by one of its descendants in the taxonomy.

For example, referring to the subscriptions shown in Figure 3, q1 = OOL ∧ Sort is refined by the subscription q2 = Theory ∧ OOL ∧ Sort, since any event that matches q2 also matches q1. Generally, if q1 and q2 are subscriptions, and q1 is a subset of q2, then q2 is a refinement of q1. A subscription can also be refined through term subsumption. Thus q1 is also refined by q3 = Java ∧ Sort, because Java ≤ OOL. Note also that adding to a subscription q a term that subsumes a term already appearing in q yields an equivalent subscription. For example, q1 = OOL ∧ Sort is equivalent to OOL ∧ Sort ∧ Languages. The refinement relation defined earlier also holds on equivalence classes of subscriptions (reduced subscriptions being just representatives of equivalence classes). Note that a subscription can be the refinement of several other subscriptions. For example, q4 is a refinement of both q2 and q3 (and, by transitivity, it is also a refinement of q1). Figure 3 shows the refinement relation in the form of a graph.

Figure 3: The refinement relation (q1 = OOL ∧ Sort is refined by q2 = Theory ∧ OOL ∧ Sort and by q3 = Java ∧ Sort, which are in turn both refined by q4 = Theory ∧ Java ∧ QuickSort)

Finally, if q is a subscription and d is a document with description D, then d ∈ ans(q) iff D ⪯ q. Referring to the graph of Fig. 3, we can now explain how we exploit the refinement relation in our approach: if an event e does not match the subscription q1, then it cannot match any of the subscriptions q2, q3 or q4. Therefore the idea is to first evaluate e with respect to q1 (which is the most general subscription): if the matching is successful, then the evaluation continues with respect to both q2 and q3 (otherwise evaluation stops); and if the evaluation is successful for either q2 or q3, then evaluation continues with respect to q4 (otherwise evaluation stops). Assuming that, in general, a large fraction of all events will fail to match any particular subscription, this strategy is expected to save a significant number of computations. Of course, the failure rate of event matching depends on the filtering rate of each particular subscription, a concept that we shall see shortly.

The set of user-submitted subscriptions can be organized as a directed acyclic graph that we call the subscription graph. Its acyclicity follows from the fact that the refinement relation is a partial order (up to subscription equivalence). However, although the subscription graph is acyclic, it may have several roots (i.e. more than one maximal element). Henceforth, we shall assume that the subscription graph has a single root, by adding to it (if necessary) the lub of all maximal subscriptions. In principle, it would be possible to maintain the transitive reduction of the subscription graph, but its construction is too costly (the computation of the transitive reduction runs in O(n) for a graph of size n, for each node insertion [4]). However, it turns out that constructing the transitive reduction is not really necessary. In fact, it is sufficient to maintain one and only one path from the root to every node (the root being the most general subscription). This follows from the observation that if a path to a node q is successful (i.e., an event e matches all the predecessors of q along the path), then every path leading to q will be successful as well. As a consequence, it is sufficient to test matching along just one path. As this holds for every node in the subscription graph, it is sufficient to construct a spanning tree in order to be able to test matching for every node in the graph.
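A minimal sketch of Definition 8 and of the pruning just described (ours; it reuses the subsumes helper of the earlier sketch, and the event description is an arbitrary example):

    def refines(q1, q2):
        """q1 refines q2: every term of q2 subsumes some term of q1 (naive, quadratic test)."""
        return all(any(subsumes(t2, t1) for t1 in q1) for t2 in q2)

    q1 = {"OOL", "Sort"}
    q2 = {"Theory", "OOL", "Sort"}
    q3 = {"Java", "Sort"}
    q4 = {"Theory", "Java", "QuickSort"}
    assert refines(q2, q1) and refines(q3, q1) and refines(q4, q2) and refines(q4, q3)

    # Pruning along a path of the graph of Figure 3: one failed test against q1
    # is enough to discard q2, q3 and q4 as well.
    event = {"C++", "JavaBeans"}
    assert not refines(event, q1)      # hence no test is needed for q2, q3, q4

Section 3.1 replaces this quadratic term-by-term test by a linear, label-based one.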
Now, as there are in general many spanning trees, the question is how to choose the best one. To this end, we introduce the notion of filtering rate. The filtering rate of a subscription q, denoted σ(q), is the probability of an event matching q. It can be determined by various means, including a cost model that we present later on. So a best spanning tree will be one that optimizes the amount of filtering during a matching test. As an example, referring to the upper part of Figure 4, let us assume the following filtering rates for q1, q2 and q3:

σ(q1) = 9, σ(q2) = 2, σ(q3) = 8.

These filtering rates are shown on the edges of Figure 4, where every edge is labelled by the filtering rate of its source node. It is easy to check that there are only two spanning trees in the subscription graph, as shown in the bottom part of Figure 4. The spanning tree on the left is minimal, in the sense that only 2 of the matching tests against q4 will succeed, instead of 8 if the spanning tree on the right were used (in the following we call such a spanning tree a minimal spanning tree [5]).

Figure 4: Issues in subscription tree construction (top: the subscription graph of Figure 3 with its filtering rates; bottom: the two possible spanning trees, with q4 attached either to q2 (first choice) or to q3 (second choice))

From the point of view of our application, selecting the minimal spanning tree is tantamount to selecting, for each subscription q, the most filtering path leading to q. As usual with publish/subscribe applications, we assume that subscriptions may be added or removed dynamically. We must therefore construct and maintain the minimal spanning tree incrementally. However, incremental maintenance incurs a very significant cost, as in the worst case the whole tree must be reconstructed after a subscription is inserted, updated or deleted. In the next section, we propose an algorithm which provides an approximate solution to the problem. Then, we show through experiments that it still provides an effective reduction of the overall evaluation cost with respect to the trivial solution (i.e., the one that tests matching against every submitted subscription).

3 Computational issues

In what follows, we first present a linear algorithm for deciding whether a subscription is a refinement of another subscription, and for computing the least upper bound of a set of subscriptions. Next we describe the insertion algorithm. An important issue during the insertion process relates to how and when we can cluster subscriptions which are close to one another, and whether we must materialize the lub of a cluster and put it in the tree. Note that by doing so, we might introduce some artificial nodes which have not been subscribed by any user. This is justified because a lub, if it is close enough to the subscriptions that it covers, is a means to factorize computations that would otherwise be carried out independently. We present a simple cost model and show that adding lubs to the subscription tree is almost always beneficial. Finally we end the section with the notification algorithm.

3.1 Computing ⪯ and Lub

The evaluation of the relation ⪯ and the computation of the lub of a set of subscriptions are the basic operations involved in the maintenance of the subscription graph. Given two subscriptions S1 and S2, a naive implementation compares each term in S1 with each term in S2 and runs in O(|S1| × |S2|), both for ⪯ and for Lub. We present an optimized computation which uses an appropriate encoding of the terms of the taxonomy and avoids the Cartesian product of the naive solution. Its cost is linear in the size of the subscriptions.

Our encoding extends the labelling scheme presented in [2] and further investigated in [7]. In our labelling scheme the successors of each node are assumed to be linearly ordered (by their position, say from left to right), and a node in the taxonomy tree is identified by its position with respect to the parent node. The label label(t) of a term t is obtained by concatenating the label of the parent followed by the position of the term (with respect to the parent). For example, referring to Figure 5, if 1 is the label of the node Programming, then 1.1, 1.2 and 1.3 are respectively the labels of its child nodes Theory, Languages and Algorithms. This encoding defines a total lexicographic order <_l on the set of terms. Evaluating the subsumption relation t1 ≤ t2, using this encoding, reduces to checking whether label(t2) is a prefix of label(t1). The label of the least upper bound of two terms, lub≤(t1, t2), is the longest common prefix of label(t1) and label(t2). Recall that since (T, ≤) is a tree, this least upper bound always exists.

A subscription S is encoded as the list of the labels of its terms, sorted in lexicographic order. For example the subscription {Theory, Algorithms} is encoded [1.1, 1.3]. The order of the term labels in the labelling of a subscription helps reduce the number of computations required to evaluate the refinement relation S1 ⪯ S2, and also to compute the lub Lub(S1, S2), since merge-like algorithms can be applied. The correctness of these algorithms is based on the following properties:

Proposition 3. Let S1 = [t1_1, t1_2, ..., t1_n] and S2 = [t2_1, t2_2, ..., t2_m] be two (reduced) subscriptions, encoded in lexicographic order. Then:

1. Property 1: if t1_i ≤ t2_j for some i ≤ n and j ≤ m, then t1_{i+1} ≰ t2_k for all k < j.

2. Property 2: let i > 1 and j < m be such that t1_i <_l t2_j; then for all k < i and all l > j, lub≤(t1_i, t2_j) ≤ lub≤(t1_k, t2_l).

The first property states that if a term t1_i from S1 is subsumed by a term t2_j from S2, then the next term t1_{i+1} cannot be subsumed by any term placed before t2_j in the lexicographic order. Therefore it is possible to check the refinement relation by always advancing one position in the two lists S1 and S2.

Algorithm Refine
Input: S1 = [t1_1, t1_2, ..., t1_n] and S2 = [t2_1, t2_2, ..., t2_m], two reduced subscriptions.
Output: true if S1 ⪯ S2, else false.
begin
  i := 1; j := 1
  while (i ≤ n and j ≤ m) do
    // If t2_j subsumes t1_i, advance on S1 and S2
    if (t1_i ≤ t2_j) then i := i + 1; j := j + 1
    else
      // If t1_i is less than t2_j: advance on S1
      if (t1_i <_l t2_j) then i := i + 1
      // If t2_j is less than t1_i: no subsumed term for t2_j in S1:
      // S1 does not refine S2
      if (t2_j <_l t1_i) then return false
    end if
  end do
  // When j ≤ m, some terms in S2 do not have a subsumed term in S1
  if (j ≤ m) then return false else return true
end

The second property ensures the correctness of a merge-like algorithm which always advances on the subscription that contains the smallest term in lexicographic order.
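In Python, the Refine test over sorted label lists can be sketched as follows (an illustrative rendering, not the authors' code; labels are represented as tuples of integers so that the prefix test and the lexicographic order <_l behave correctly even with multi-digit positions):

    def encode(label):
        """'1.2.1' -> (1, 2, 1)."""
        return tuple(int(x) for x in label.split("."))

    def term_le(t1, t2):
        """t1 <= t2 (t2 subsumes t1) iff label(t2) is a prefix of label(t1)."""
        return t1[:len(t2)] == t2

    def refine(s1, s2):
        """True iff S1 refines S2; s1 and s2 are lists of encoded labels, sorted."""
        i, j = 0, 0
        while i < len(s1) and j < len(s2):
            if term_le(s1[i], s2[j]):      # t2_j subsumes t1_i: advance on both lists
                i, j = i + 1, j + 1
            elif s1[i] < s2[j]:            # advance on S1 only
                i += 1
            else:                          # t2_j before t1_i: no witness for t2_j in S1
                return False
        return j == len(s2)                # true iff every term of S2 found a subsumed term

    # {DB, Logic, Sort} = [1.1.1, 1.1.2, 1.3.1] refines {Theory, Algorithms} = [1.1, 1.3].
    s1 = sorted(map(encode, ["1.1.1", "1.1.2", "1.3.1"]))
    s2 = sorted(map(encode, ["1.1", "1.3"]))
    assert refine(s1, s2) and not refine(s2, s1)

Each iteration advances at least one of the two cursors, which gives the linear behaviour discussed below.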
Algorithm Lub
Input: S1 = [t1_1, t1_2, ..., t1_n] and S2 = [t2_1, t2_2, ..., t2_m], two reduced subscriptions.
Output: the least upper bound of (S1, S2).
begin

  i := 1; j := 1; k := 1; L[k] := lub≤(t1_i, t2_j)
  while (i < n or j < m) do
    if (i = n or (j < m and t2_j <_l t1_i)) then
      // advance on S2
      j := j + 1
    else
      // advance on S1
      i := i + 1
    endif
    current := lub≤(t1_i, t2_j)
    // Check whether the current lub must replace the previous one
    if (current ≤ L[k]) then L[k] := current
    else if (L[k] ≤ current) then
      // Ignore the current lub
    else
      // Add the current lub to the result
      k := k + 1; L[k] := current
    endif
  end do
  return L
end

Figure 5: The taxonomy labelling (Programming = 1; Theory = 1.1, Languages = 1.2, Algorithms = 1.3; DB = 1.1.1, Logic = 1.1.2; OOL = 1.2.1; C++ = 1.2.1.1, Java = 1.2.1.2; JSP = 1.2.1.2.1, JavaBeans = 1.2.1.2.2; Sort = 1.3.1; MergeSort = 1.3.1.1, QuickSort = 1.3.1.2, BubbleSort = 1.3.1.3)

Thanks to this technique, Refine(S1, S2) runs in O(max(|S1|, |S2|)), and Lub(S1, S2) in O(|S1| + |S2|). The detailed algorithms are given in the appendix.

3.2 The cost model

The unit cost that we consider is the evaluation of the relation ≤, called term comparison in the following (or comparison for short). As mentioned above, the evaluation of S1 ⪯ S2 requires at most max(|S1|, |S2|) comparisons. For clarity, we assume in the following that the size of an event e (i.e. the number of terms in the description of the document being inserted, modified or removed) is larger than the size of a subscription. The impact of this assumption is marginal.

In order to estimate the gain obtained when using the lub of a set of subscriptions as a filter, we first define the filtering rate of a subscription. It estimates the percentage of successful matchings with respect to events occurring at the library. The filtering rate of a subscription is computed based on the selectivity σ(t) of a term t, which represents the probability for the description of a document to contain either t or a term subsumed by t. Term selectivity can be obtained by analyzing the current repository instance, or simply by assuming a uniform distribution. The analysis that follows is based on a uniform distribution. Let L be the set of leaves of the taxonomy. The selectivity of t ∈ T is estimated by the following inductive formula:

σ(t) = 1/|T|, if t ∈ L
σ(t) = 1/|T| + Σ_{t' ∈ children(t)} σ(t'), otherwise

Given a subscription S = {t1, t2, ..., tn}, the filtering rate of S is σ(S) = σ(t1) × σ(t2) × ... × σ(tn) (conjunctive semantics). Note that this simple model satisfies the obvious constraint that S ⪯ S' implies σ(S) ≤ σ(S').
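The inductive formula can be checked on the taxonomy of Figure 1 with a few more lines of the running Python sketch (again ours, reusing the PARENT map of the first sketch; Fraction is used only to keep the arithmetic exact):

    from fractions import Fraction

    TERMS = set(PARENT) | {"Programming"}            # all terms, including the root
    CHILDREN = {t: [s for s, p in PARENT.items() if p == t] for t in TERMS}

    def selectivity(t):
        """sigma(t): 1/|T| for a leaf, 1/|T| plus the children's selectivities otherwise."""
        return Fraction(1, len(TERMS)) + sum((selectivity(c) for c in CHILDREN[t]),
                                             Fraction(0))

    def filtering_rate(subscription):
        """Conjunctive semantics: product of the term selectivities."""
        rate = Fraction(1)
        for t in subscription:
            rate *= selectivity(t)
        return rate

    assert selectivity("Programming") == 1                   # the root matches every event
    assert selectivity("C++") == Fraction(1, len(TERMS))     # a leaf: 1/|T|
    assert filtering_rate({"OOL", "Sort"}) < filtering_rate({"OOL"})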

For instance, the filtering rate of a subscription S = {rootT}, where rootT is the root of T, is equal to 1: any event matches this subscription. The filtering rate of a subscription S = {l}, where l is a leaf of T, is 1/|T|.

We can now measure the gain of adding the lub for a subset of subscriptions in the tree. Consider a set of subscriptions q1, q2, ..., qm and their lub, denoted qlub. The number of comparisons when testing an event with n terms can be estimated as:

Cost_cl = (1 - σ(qlub)) × n + σ(qlub) × (n + m × n)   (1)

The cost of checking whether e matches qlub is n, with a success probability σ(qlub), in which case we must also consider its children. In the worst case, all the terms have to be tested for each child, hence m × n comparisons. The cost of an independent evaluation of each subscription (without using qlub as a filter) is

Cost_multi = m × n   (2)

The difference between the two solutions depends both on the filtering rate of qlub (the smaller the better), and on the number of children (the higher the better). Using the lub as a filter saves comparisons when Cost_cl < Cost_multi, i.e., when (1 - σ(qlub)) × n + σ(qlub) × (n + m × n) < m × n, that is, when

σ(qlub) < (m - 1) / m   (3)

Formula (3) shows that the efficiency of the evaluation process heavily depends on the presence of qlub as a subscription. If this subscription does not appear in the subscription graph GS, some degenerate cases may occur. The following example illustrates an extreme case.

Example 1. Let the set of subscriptions S be {q1 = {l1}, q2 = {l2}, ..., qm = {lm}}. Each li is a leaf in T and their common parent is l.

1. Assume that there is no other subscription; then the number of comparisons with an event of size n is n × m.

2. Now the subscription qlub = {l} is submitted. Its filtering rate is (m + 1)/|T| and the cost for an event can be estimated as n + ((m + 1)/|T|) × m × n. Provided that m ≪ |T|, the filtering is quite effective.

In conclusion, the closer the lub to the subscriptions of the cluster, the higher the amount of filtering which is obtained. Formula (3) is also sufficient to conclude that in practice the evaluation cost is almost always reduced by using the lub of two subscriptions, the only exception being when the lub consists solely of the root rootT of the taxonomy. The construction of the subscription spanning tree, presented below, systematically attempts to add lubs to the tree during the insertion of a new subscription.

3.3 Insertion algorithm

The insertion algorithm incrementally constructs a tree TS, where each node consists of a subscription S associated with a bucket containing the set of identifiers of the users that have subscribed S. Initially the tree consists of the subscription {rootT}, containing the root term of the taxonomy, with an empty bucket. When a user u subscribes to S, one first searches for the location of S in the tree. This is performed in two steps:

Candidate parent selection. A node N in TS is a candidate parent for S if the following conditions hold: (i) S ⪯ N, and (ii) for each child N' of N, S does not refine N'; i.e., S strictly refines N but does not refine any child of N. The algorithm performs a top-down search, looking for a candidate parent. Starting from the root, it chooses at each level the most selective child that S refines. When such a child no longer exists, the candidate parent is found. Note that this is a heuristic which avoids following an unbounded number of paths in the tree, but does not guarantee that the best candidate parent (i.e., the most selective one) is found.

Lub selection.
Once the candidate parent N is found, the second step inserts S as a child or grandchild of N as follows. First, for each child N' of N, one computes Lub(S, N') and keeps only those lubs which strictly refine N. Now:

1. if at least one such lub l = Lub(S, N') has been found, the most selective one is chosen, and a new subtree l(S, N') is inserted under the parent of N' (that is, under N);

2. else S is inserted as a child of N.

In all cases the id of user u is simply added to the bucket of S.

Consider the tree in the top part of Figure 6, and suppose that the subscription S = {JSP, QuickSort} is to be inserted. We first search for the candidate parent. During the top-down traversal, the most selective path must be chosen. At level 3, the possible paths are {OOL, QuickSort} and {Java, Sort}: the node with the minimum filtering rate, {OOL, QuickSort}, is chosen. Since S refines none of its children, this node is the candidate parent. Subsequently, we compute successively the lub of S and every child of {OOL, QuickSort}:

1. L1 = Lub({JSP, QuickSort}, {C++, QuickSort}) = {OOL, QuickSort}

2. L2 = Lub({JSP, QuickSort}, {JavaBeans, QuickSort}) = {Java, QuickSort}.

Since L1 already exists in the tree, the new subscription tree is the one in the bottom part of Figure 6. Note that when the lub of S and a node N is computed, we know for sure that this lub does not refine any sibling N' of N, otherwise Lub(N, N') would have been inserted in the tree in the first place. This is a consequence of the tree construction algorithm, which is summarized below.

Insert(u, S, TS)
Input: a user u, a subscription S and the tree TS
Output: the new tree after the insertion
begin
  // First step: search the candidate parent using a depth-first search
  parent := CandidateParent(S, rootT)
  // Second step: compute the candidate lubs
  C := ∅
  for each Ni ∈ parent.children do
    if Lub(S, Ni) <> parent then C := C ∪ {Ni}
  endfor
  if C = ∅ then
    // No candidate lub: S is a child of parent
    parent.children := parent.children ∪ {S}
  else
    // Add the lub of the two subscriptions.
    // Choose a node N such that Lub(N, S) is the most selective
    Choose a node N in C
    Add the subtree [Lub(N, S)(N, S)] to parent
    Remove N from parent's children
  endif
  Add the id of the subscriber u to the bucket of S
  return TS
end

The following special cases are not represented in the algorithm: (i) a subscription S is already represented in TS, and (ii) the lub of S and a node N is S itself. The extension is straightforward.

3.4 Removal algorithm

A leaf in the subscription tree whose bucket becomes empty can be removed (note that an empty internal node can still play the role of a filter, and must be kept in the structure). The removal of an empty leaf S, with parent node P, is outlined below:

1. first compute the lub L of the siblings of S (if S has no sibling, then L = ∅), and remove S from the children of P; then:

2. (a) if S has at least one sibling, the second step depends on the bucket of P: if it is empty, P is replaced by L, else L becomes the child of P.
(b) else S has no sibling, and P may become an empty leaf in turn if its bucket is empty. The procedure must then be called recursively, bottom-up.

In the worst case, the removal of S may affect all the nodes along the path from S up to the root. Note however that lazy updating can be used (i.e., the adjustment of S's ancestors is not done immediately), since the tree still supports the insert and search operations correctly after step (1).

3.5 Notification algorithm

Whenever a new event e arrives, the algorithm scans the tree top-down, starting from the root of the tree.

Figure 6: The insertion algorithm (a. the initial tree; b. the tree after insertion of {JSP, QuickSort}, where the new node {Java, QuickSort} is placed under {OOL, QuickSort} with children {JavaBeans, QuickSort} and {JSP, QuickSort})

Table 1: Structure of the subscriptions tree for different subscriptions datasets (for each dataset, from 30,000 to 180,000 subscriptions: depth, insertion level and average fanout, for Cmatch and NCmatch)

The main procedure, Match(N), is called recursively and proceeds as follows:

1. if e does not match N, the scan stops; there is no need to access the children of N;

2. else the users of the bucket associated to N can be notified, while Match is called recursively for each child of N.

The cost of the algorithm is strongly influenced by the average number of children of a node (fanout). If this number is very large, many of the children will not be matched by the event e, and this results in useless evaluations of the refinement relation. When the fanout of the subscriptions tree decreases, the global amount of filtering increases, and our algorithm is expected to greatly reduce the number of dead-ends during the tree traversal. The following section, on experimentation, is intended to validate this analysis.

4 Experiments

We analyze the behavior of our clustered graph structure, called Cmatch ("clustered matching"), and compare it with the following competitors:

1. Naive is the trivial solution which stores the subscriptions in a linear structure.

2. NCmatch ("non-clustered matching") relies on a subscriptions tree without clustering, i.e., we never introduce the lub of users' subscriptions during an insertion. The NCmatch implementation is mostly intended to assess experimentally the gain of the clustering in Cmatch.

The impact of the number of users who registered a given subscription S is neutral, because once a subscription that matches an event is found, all the notification variants (Naive, NCmatch and Cmatch) merely scan the bucket of users, sending a notification to each of them. For clarity we ignore the cost of this specific operation in the presentation of our experimental results, and focus on the cost of finding the set of relevant subscriptions. The evaluation cost is measured with respect to the following indicators: (i) the number of term comparisons, and (ii) the number of nodes visited, for the tree-based solutions Cmatch and NCmatch. We analyze successively the two main operations: the insertion of new subscriptions ("subscribe") and the

search of the subscriptions that match an event ("notify").

4.1 Experimental setting

The structure has been implemented in Java on a Pentium IV processor (3,000 MHz) with 1,024 MB of main memory. The implementation conforms to the specifications given in the previous section, except for the following optimization used in Cmatch and NCmatch. During the top-down traversal of the subscriptions tree, the same term comparison may have to be carried out repeatedly. Consider the three subscriptions S1 = {OOL, QuickSort, BubbleSort}, S2 = {C++, QuickSort, BubbleSort}, S3 = {OOL, MergeSort, BubbleSort}. The tree for these subscriptions is shown in Figure 7. If an event e refines S1, we know for sure that for each term in S1 we found a subsumed term in e. We must now evaluate e ⪯ S2 and e ⪯ S3. If one or several of the terms of S1 are also present in S2 and S3, it is useless to search again for a subsumed term in e. We maintain, at each node N in the graph, a mask of bits which indicates the terms shared by N and one of its ancestors. This is illustrated in Figure 7. The parent node is the subscription {OOL, QuickSort, BubbleSort}. The two children share respectively with their parent the terms BubbleSort and QuickSort (left child), and BubbleSort (right child). A bit is set to 1 if the corresponding term is shared with the parent, and to 0 otherwise. During the matching process we need to evaluate the term comparison only for the 0-bit terms. This saves 2/3 of the comparisons for the left child in Figure 7, and 1/3 for the right child.

Figure 7: Avoiding redundant comparisons (the parent node {OOL, QuickSort, BubbleSort} and its two children, each annotated with a bit mask marking the terms shared with the parent)

Our experimental setting simulates a Digital Library storing a set of scientific documents described by terms from the ACM Computing Classification System [1] taxonomy. The taxonomy contains more than a thousand terms, and its maximal depth is 5. We produced randomly several sets of distinct subscriptions, with a cardinality ranging from 30,000 to 180,000.
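The shared-term optimization can be sketched as follows (our own simplified rendering, reusing subsumes from the first sketch; for brevity the mask records sharing with the parent only, and the Node class and field names are hypothetical):

    class Node:
        def __init__(self, terms, children=()):
            self.terms = list(terms)
            self.children = list(children)
            self.bucket = []                       # ids of the subscribed users
            self.mask = [False] * len(self.terms)  # True: term also present in the parent

    def set_masks(node):
        for child in node.children:
            child.mask = [t in node.terms for t in child.terms]
            set_masks(child)

    def match(event, node):
        """Top-down notification: only the 0-bit (non-shared) terms are compared."""
        for term, shared in zip(node.terms, node.mask):
            if shared:
                continue                           # already matched at the parent
            if not any(subsumes(term, s) for s in event):
                return                             # the whole subtree is filtered out
        for user in node.bucket:
            print("notify", user)
        for child in node.children:
            match(event, child)

    # The three subscriptions of Figure 7:
    root = Node(["OOL", "QuickSort", "BubbleSort"],
                [Node(["C++", "QuickSort", "BubbleSort"]),
                 Node(["OOL", "MergeSort", "BubbleSort"])])
    root.children[0].bucket = ["u1"]
    set_masks(root)
    match({"C++", "QuickSort", "BubbleSort"}, root)   # notifies u1 only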
4.2 Cost of subscriptions

The cost of an insertion for the Naive approach is negligible, since it consists only of an insertion in a linear structure, performed in constant time. We focus therefore on the comparison of Cmatch and NCmatch. Table 1 summarizes the structural properties of the subscriptions tree. The average number of terms in a subscription is 5, with a close clustering around this value (i.e., most subscriptions have 4, 5, 6 or 7 terms). The table gives, for each dataset, the depth of the subscriptions tree, the average insertion level for a new subscription and the average fanout. The most striking feature is the very large fanout of nodes for the non-clustered solution NCmatch. This clearly relates to the large number of terms in the taxonomy, which reduces the probability of finding a refinement relationship between two submitted subscriptions. As a result one obtains a tree with few levels, where the refinement relation is sparsely represented. It can be expected that with a very large taxonomy the tree degenerates to an almost linear structure, where all the subscriptions tend to be children of the root node. Clearly such a structure loses most of the benefits of the approach, since the insertion is costly, yet the amount of filtering remains low. On the other hand, the larger the subscriptions set, the higher the probability of finding a subscription refined by another one. This explains why the fanout decreases as new subscriptions are added.

The Cmatch structure, which clusters the subscriptions and represents these clusters as subtrees rooted at their lub, achieves a quite significant reduction of the fanout with respect to NCmatch. Table 1 shows that the average number of children of a node is about two orders of magnitude lower for Cmatch than for NCmatch. This is clearly a quite desirable property since it reduces both the cost of insertions

and the cost of search operations. We made several experiments that vary the size of the subscriptions, and the obtained results show that the above conclusions still hold.

Table 2 shows the cost of inserting a subscription in an existing subscriptions tree, for different subscriptions datasets. We measure the number of nodes visited by the insertion algorithm, and the number of term comparisons. As expected from the tree properties, the gain of the clustered solution is quite impressive. In particular the small number of comparisons which are necessary to insert a new subscription shows that the structure can support a high ratio of updates. This constitutes an important property for a publish/subscribe system. The results of Table 2 also confirm that the clustering is essential to exploit the refinement relation and filter out most of the comparisons which are made by the non-clustered NCmatch structure. Again, this is related to the size of the taxonomy. With a smaller one, the refinement relation would be more represented in the subscriptions tree, with a probable reduction of the gap between Cmatch and NCmatch. However, the main conclusion of our analytical and experimental study on that matter is the high benefit, in all cases, of the lub-based clustering approach.

Figure 8: Evolution of insertions cost (number of comparisons and number of nodes visited as a function of the number of subscriptions)

4.3 Cost of notifications

We now turn our attention to the notification process. The results obtained for the three solutions Naive, NCmatch and Cmatch are given in Table 3 for our 6 subscriptions datasets. We compare both the average number of nodes visited and the average number of term comparisons (the latter being more representative of the actual cost) for processing a single event (note that an event generates several notifications in general). For the naive solution, the number of visited nodes is equal to the number of subscriptions. The gain of the non-clustered solution is not very significant, since about 50% of the computations are saved. This is easily explained by the shape of the tree, with a small number of levels and a large fanout. The notification algorithm must visit all the nodes matched by the incoming event, and test all the children of these nodes. The Cmatch algorithm benefits strongly from the clustering. The lubs play their role of filters and allow most of the irrelevant computations to be avoided. The number of comparisons remains very small with respect to Naive. Table 3 summarizes these properties for different subscriptions datasets.

An important aspect of the behavior of the algorithms is clearly illustrated by the curves of Figure 8. Whereas NCmatch and Naive exhibit a linear degradation of their computation costs with respect to the size of the subscriptions set, the performance of Cmatch degrades very slowly (Naive is not shown on the curve because of the very large values of its figures). Actually the cost of the processing required to deliver a notification turns out to be almost constant, and independent of the number of subscriptions. This shows that the pruning effect of the tree is quite effective and removes most of the unnecessary computations.

5 Related work

Several data structures and algorithms represent the subscriptions and the descriptions by sets

Table 2: Cost of insertions for different subscriptions datasets (for each dataset size, from 30,000 to 180,000 subscriptions: number of visited nodes and number of comparisons, for Cmatch and NCmatch)

Table 3: Cost of notifications for different subscriptions datasets (for each dataset size: number of visited nodes and number of comparisons, for Cmatch, NCmatch and Naive)

of predicates, where a predicate is a triple (attribute, op, value) and op is a comparison operator (such as ≤ or =). Two main techniques are used in this context. The first one relies on a two-step approach: first the predicates are evaluated with respect to the event's values, and second the matching subscriptions are determined by counting their number of satisfied predicates [10, 2, 3, 6, 7]. [10] uses indexing of equality predicates to speed up the matching of atomic formulas and clusters subscriptions to minimize cache failures. A similar approach is used in [2] and in [3]. In the SIFT system [7, 6], the subscriptions are composed of a set of weighted keywords. The matching algorithm is based on techniques of similarity computation. These techniques do not apply to our problem: we do not have attributes, and the indexing of equality predicates does not extend to the subsumption relation.

The second technique also works in two steps. The first step organizes the subscription directory in a special structure, while the second step uses this special structure to filter the incoming events. For the Elvin publish/subscribe system, Gough and Smith [] present an algorithm translating the subscriptions into a tree. When an event occurs every predicate is tested only once, but it is stored in a redundant way, leading to a combinatorial explosion of its occurrences. This is not the case of our tree structure, since the number of nodes is in the worst case 2|S|, where |S| is the number of subscriptions. Furthermore, the maintenance of the tree structure used by Gough and Smith [] is very costly, compared to our approach. The algorithm presented in [3] is also based on a tree structure where subscriptions are stored in the leaves and each non-leaf node represents a predicate comparison. The space requirements are important: each new subscription with K predicates adds K + 1 nodes, and several paths may have to be followed during the matching process. Moreover the structure is suitable only for equality tests. A similar structure and filtering algorithm are proposed in [6]. More recently, XML-based filtering systems have been proposed [8, 9, 4].

In summary we are not aware of a publish/subscribe technique that considers a simple keyword-based query language and a subsumption relation over the terms. Compared to other works in related areas, the solution proposed in the present paper presents a reasonable storage cost (at most twice the number of subscriptions) and achieves a nice trade-off between the performance of subscription (insertion in the structure) and notification (search in the structure).


V Advanced Data Structures V Advanced Data Structures B-Trees Fibonacci Heaps 18 B-Trees B-trees are similar to RBTs, but they are better at minimizing disk I/O operations Many database systems use B-trees, or variants of them,

More information

Treaps. 1 Binary Search Trees (BSTs) CSE341T/CSE549T 11/05/2014. Lecture 19

Treaps. 1 Binary Search Trees (BSTs) CSE341T/CSE549T 11/05/2014. Lecture 19 CSE34T/CSE549T /05/04 Lecture 9 Treaps Binary Search Trees (BSTs) Search trees are tree-based data structures that can be used to store and search for items that satisfy a total order. There are many types

More information

An algorithm for Performance Analysis of Single-Source Acyclic graphs

An algorithm for Performance Analysis of Single-Source Acyclic graphs An algorithm for Performance Analysis of Single-Source Acyclic graphs Gabriele Mencagli September 26, 2011 In this document we face with the problem of exploiting the performance analysis of acyclic graphs

More information

3.7 Denotational Semantics

3.7 Denotational Semantics 3.7 Denotational Semantics Denotational semantics, also known as fixed-point semantics, associates to each programming language construct a well-defined and rigorously understood mathematical object. These

More information

Hierarchical Intelligent Cuttings: A Dynamic Multi-dimensional Packet Classification Algorithm

Hierarchical Intelligent Cuttings: A Dynamic Multi-dimensional Packet Classification Algorithm 161 CHAPTER 5 Hierarchical Intelligent Cuttings: A Dynamic Multi-dimensional Packet Classification Algorithm 1 Introduction We saw in the previous chapter that real-life classifiers exhibit structure and

More information

COMP 250 Fall recurrences 2 Oct. 13, 2017

COMP 250 Fall recurrences 2 Oct. 13, 2017 COMP 250 Fall 2017 15 - recurrences 2 Oct. 13, 2017 Here we examine the recurrences for mergesort and quicksort. Mergesort Recall the mergesort algorithm: we divide the list of things to be sorted into

More information

Scan Scheduling Specification and Analysis

Scan Scheduling Specification and Analysis Scan Scheduling Specification and Analysis Bruno Dutertre System Design Laboratory SRI International Menlo Park, CA 94025 May 24, 2000 This work was partially funded by DARPA/AFRL under BAE System subcontract

More information

Lecture 6: Analysis of Algorithms (CS )

Lecture 6: Analysis of Algorithms (CS ) Lecture 6: Analysis of Algorithms (CS583-002) Amarda Shehu October 08, 2014 1 Outline of Today s Class 2 Traversals Querying Insertion and Deletion Sorting with BSTs 3 Red-black Trees Height of a Red-black

More information

arxiv: v2 [cs.ds] 30 Sep 2016

arxiv: v2 [cs.ds] 30 Sep 2016 Synergistic Sorting, MultiSelection and Deferred Data Structures on MultiSets Jérémy Barbay 1, Carlos Ochoa 1, and Srinivasa Rao Satti 2 1 Departamento de Ciencias de la Computación, Universidad de Chile,

More information

Trees. Q: Why study trees? A: Many advance ADTs are implemented using tree-based data structures.

Trees. Q: Why study trees? A: Many advance ADTs are implemented using tree-based data structures. Trees Q: Why study trees? : Many advance DTs are implemented using tree-based data structures. Recursive Definition of (Rooted) Tree: Let T be a set with n 0 elements. (i) If n = 0, T is an empty tree,

More information

Propagate the Right Thing: How Preferences Can Speed-Up Constraint Solving

Propagate the Right Thing: How Preferences Can Speed-Up Constraint Solving Propagate the Right Thing: How Preferences Can Speed-Up Constraint Solving Christian Bessiere Anais Fabre* LIRMM-CNRS (UMR 5506) 161, rue Ada F-34392 Montpellier Cedex 5 (bessiere,fabre}@lirmm.fr Ulrich

More information

Multi Domain Logic and its Applications to SAT

Multi Domain Logic and its Applications to SAT Multi Domain Logic and its Applications to SAT Tudor Jebelean RISC Linz, Austria Tudor.Jebelean@risc.uni-linz.ac.at Gábor Kusper Eszterházy Károly College gkusper@aries.ektf.hu Abstract We describe a new

More information

CS301 - Data Structures Glossary By

CS301 - Data Structures Glossary By CS301 - Data Structures Glossary By Abstract Data Type : A set of data values and associated operations that are precisely specified independent of any particular implementation. Also known as ADT Algorithm

More information

A CSP Search Algorithm with Reduced Branching Factor

A CSP Search Algorithm with Reduced Branching Factor A CSP Search Algorithm with Reduced Branching Factor Igor Razgon and Amnon Meisels Department of Computer Science, Ben-Gurion University of the Negev, Beer-Sheva, 84-105, Israel {irazgon,am}@cs.bgu.ac.il

More information

Lecture 2 - Graph Theory Fundamentals - Reachability and Exploration 1

Lecture 2 - Graph Theory Fundamentals - Reachability and Exploration 1 CME 305: Discrete Mathematics and Algorithms Instructor: Professor Aaron Sidford (sidford@stanford.edu) January 11, 2018 Lecture 2 - Graph Theory Fundamentals - Reachability and Exploration 1 In this lecture

More information

9/29/2016. Chapter 4 Trees. Introduction. Terminology. Terminology. Terminology. Terminology

9/29/2016. Chapter 4 Trees. Introduction. Terminology. Terminology. Terminology. Terminology Introduction Chapter 4 Trees for large input, even linear access time may be prohibitive we need data structures that exhibit average running times closer to O(log N) binary search tree 2 Terminology recursive

More information

V Advanced Data Structures

V Advanced Data Structures V Advanced Data Structures B-Trees Fibonacci Heaps 18 B-Trees B-trees are similar to RBTs, but they are better at minimizing disk I/O operations Many database systems use B-trees, or variants of them,

More information

Model Checking I Binary Decision Diagrams

Model Checking I Binary Decision Diagrams /42 Model Checking I Binary Decision Diagrams Edmund M. Clarke, Jr. School of Computer Science Carnegie Mellon University Pittsburgh, PA 523 2/42 Binary Decision Diagrams Ordered binary decision diagrams

More information

Intro to DB CHAPTER 12 INDEXING & HASHING

Intro to DB CHAPTER 12 INDEXING & HASHING Intro to DB CHAPTER 12 INDEXING & HASHING Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B+-Tree Index Files B-Tree Index Files Static Hashing Dynamic Hashing Comparison of Ordered Indexing

More information

Database System Concepts, 6 th Ed. Silberschatz, Korth and Sudarshan See for conditions on re-use

Database System Concepts, 6 th Ed. Silberschatz, Korth and Sudarshan See  for conditions on re-use Chapter 11: Indexing and Hashing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files Static

More information

Optimal k-anonymity with Flexible Generalization Schemes through Bottom-up Searching

Optimal k-anonymity with Flexible Generalization Schemes through Bottom-up Searching Optimal k-anonymity with Flexible Generalization Schemes through Bottom-up Searching Tiancheng Li Ninghui Li CERIAS and Department of Computer Science, Purdue University 250 N. University Street, West

More information

Figure 4.1: The evolution of a rooted tree.

Figure 4.1: The evolution of a rooted tree. 106 CHAPTER 4. INDUCTION, RECURSION AND RECURRENCES 4.6 Rooted Trees 4.6.1 The idea of a rooted tree We talked about how a tree diagram helps us visualize merge sort or other divide and conquer algorithms.

More information

Analytical Modeling of Parallel Systems. To accompany the text ``Introduction to Parallel Computing'', Addison Wesley, 2003.

Analytical Modeling of Parallel Systems. To accompany the text ``Introduction to Parallel Computing'', Addison Wesley, 2003. Analytical Modeling of Parallel Systems To accompany the text ``Introduction to Parallel Computing'', Addison Wesley, 2003. Topic Overview Sources of Overhead in Parallel Programs Performance Metrics for

More information

MergeSort, Recurrences, Asymptotic Analysis Scribe: Michael P. Kim Date: April 1, 2015

MergeSort, Recurrences, Asymptotic Analysis Scribe: Michael P. Kim Date: April 1, 2015 CS161, Lecture 2 MergeSort, Recurrences, Asymptotic Analysis Scribe: Michael P. Kim Date: April 1, 2015 1 Introduction Today, we will introduce a fundamental algorithm design paradigm, Divide-And-Conquer,

More information

Handout 9: Imperative Programs and State

Handout 9: Imperative Programs and State 06-02552 Princ. of Progr. Languages (and Extended ) The University of Birmingham Spring Semester 2016-17 School of Computer Science c Uday Reddy2016-17 Handout 9: Imperative Programs and State Imperative

More information

Faloutsos 1. Carnegie Mellon Univ. Dept. of Computer Science Database Applications. Outline

Faloutsos 1. Carnegie Mellon Univ. Dept. of Computer Science Database Applications. Outline Carnegie Mellon Univ. Dept. of Computer Science 15-415 - Database Applications Lecture #14: Implementation of Relational Operations (R&G ch. 12 and 14) 15-415 Faloutsos 1 introduction selection projection

More information

Operations on Heap Tree The major operations required to be performed on a heap tree are Insertion, Deletion, and Merging.

Operations on Heap Tree The major operations required to be performed on a heap tree are Insertion, Deletion, and Merging. Priority Queue, Heap and Heap Sort In this time, we will study Priority queue, heap and heap sort. Heap is a data structure, which permits one to insert elements into a set and also to find the largest

More information

6. Lecture notes on matroid intersection

6. Lecture notes on matroid intersection Massachusetts Institute of Technology 18.453: Combinatorial Optimization Michel X. Goemans May 2, 2017 6. Lecture notes on matroid intersection One nice feature about matroids is that a simple greedy algorithm

More information

Fast algorithms for max independent set

Fast algorithms for max independent set Fast algorithms for max independent set N. Bourgeois 1 B. Escoffier 1 V. Th. Paschos 1 J.M.M. van Rooij 2 1 LAMSADE, CNRS and Université Paris-Dauphine, France {bourgeois,escoffier,paschos}@lamsade.dauphine.fr

More information

Advanced Algorithms. Class Notes for Thursday, September 18, 2014 Bernard Moret

Advanced Algorithms. Class Notes for Thursday, September 18, 2014 Bernard Moret Advanced Algorithms Class Notes for Thursday, September 18, 2014 Bernard Moret 1 Amortized Analysis (cont d) 1.1 Side note: regarding meldable heaps When we saw how to meld two leftist trees, we did not

More information

Slides for Faculty Oxford University Press All rights reserved.

Slides for Faculty Oxford University Press All rights reserved. Oxford University Press 2013 Slides for Faculty Assistance Preliminaries Author: Vivek Kulkarni vivek_kulkarni@yahoo.com Outline Following topics are covered in the slides: Basic concepts, namely, symbols,

More information

Horn Formulae. CS124 Course Notes 8 Spring 2018

Horn Formulae. CS124 Course Notes 8 Spring 2018 CS124 Course Notes 8 Spring 2018 In today s lecture we will be looking a bit more closely at the Greedy approach to designing algorithms. As we will see, sometimes it works, and sometimes even when it

More information

7.3 Spanning trees Spanning trees [ ] 61

7.3 Spanning trees Spanning trees [ ] 61 7.3. Spanning trees [161211-1348 ] 61 7.3 Spanning trees We know that trees are connected graphs with the minimal number of edges. Hence trees become very useful in applications where our goal is to connect

More information

Chapter 11: Indexing and Hashing" Chapter 11: Indexing and Hashing"

Chapter 11: Indexing and Hashing Chapter 11: Indexing and Hashing Chapter 11: Indexing and Hashing" Database System Concepts, 6 th Ed.! Silberschatz, Korth and Sudarshan See www.db-book.com for conditions on re-use " Chapter 11: Indexing and Hashing" Basic Concepts!

More information

Thus, it is reasonable to compare binary search trees and binary heaps as is shown in Table 1.

Thus, it is reasonable to compare binary search trees and binary heaps as is shown in Table 1. 7.2 Binary Min-Heaps A heap is a tree-based structure, but it doesn t use the binary-search differentiation between the left and right sub-trees to create a linear ordering. Instead, a binary heap only

More information

Computational Optimization ISE 407. Lecture 16. Dr. Ted Ralphs

Computational Optimization ISE 407. Lecture 16. Dr. Ted Ralphs Computational Optimization ISE 407 Lecture 16 Dr. Ted Ralphs ISE 407 Lecture 16 1 References for Today s Lecture Required reading Sections 6.5-6.7 References CLRS Chapter 22 R. Sedgewick, Algorithms in

More information

FINAL EXAM SOLUTIONS

FINAL EXAM SOLUTIONS COMP/MATH 3804 Design and Analysis of Algorithms I Fall 2015 FINAL EXAM SOLUTIONS Question 1 (12%). Modify Euclid s algorithm as follows. function Newclid(a,b) if a

More information

18.3 Deleting a key from a B-tree

18.3 Deleting a key from a B-tree 18.3 Deleting a key from a B-tree B-TREE-DELETE deletes the key from the subtree rooted at We design it to guarantee that whenever it calls itself recursively on a node, the number of keys in is at least

More information

1 The range query problem

1 The range query problem CS268: Geometric Algorithms Handout #12 Design and Analysis Original Handout #12 Stanford University Thursday, 19 May 1994 Original Lecture #12: Thursday, May 19, 1994 Topics: Range Searching with Partition

More information

Core Membership Computation for Succinct Representations of Coalitional Games

Core Membership Computation for Succinct Representations of Coalitional Games Core Membership Computation for Succinct Representations of Coalitional Games Xi Alice Gao May 11, 2009 Abstract In this paper, I compare and contrast two formal results on the computational complexity

More information

Chapter 12: Query Processing

Chapter 12: Query Processing Chapter 12: Query Processing Overview Catalog Information for Cost Estimation $ Measures of Query Cost Selection Operation Sorting Join Operation Other Operations Evaluation of Expressions Transformation

More information

Trees. (Trees) Data Structures and Programming Spring / 28

Trees. (Trees) Data Structures and Programming Spring / 28 Trees (Trees) Data Structures and Programming Spring 2018 1 / 28 Trees A tree is a collection of nodes, which can be empty (recursive definition) If not empty, a tree consists of a distinguished node r

More information

1 (15 points) LexicoSort

1 (15 points) LexicoSort CS161 Homework 2 Due: 22 April 2016, 12 noon Submit on Gradescope Handed out: 15 April 2016 Instructions: Please answer the following questions to the best of your ability. If you are asked to show your

More information

We will show that the height of a RB tree on n vertices is approximately 2*log n. In class I presented a simple structural proof of this claim:

We will show that the height of a RB tree on n vertices is approximately 2*log n. In class I presented a simple structural proof of this claim: We have seen that the insert operation on a RB takes an amount of time proportional to the number of the levels of the tree (since the additional operations required to do any rebalancing require constant

More information

Lecture Notes: External Interval Tree. 1 External Interval Tree The Static Version

Lecture Notes: External Interval Tree. 1 External Interval Tree The Static Version Lecture Notes: External Interval Tree Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong taoyf@cse.cuhk.edu.hk This lecture discusses the stabbing problem. Let I be

More information

Quiz 1 Solutions. (a) f(n) = n g(n) = log n Circle all that apply: f = O(g) f = Θ(g) f = Ω(g)

Quiz 1 Solutions. (a) f(n) = n g(n) = log n Circle all that apply: f = O(g) f = Θ(g) f = Ω(g) Introduction to Algorithms March 11, 2009 Massachusetts Institute of Technology 6.006 Spring 2009 Professors Sivan Toledo and Alan Edelman Quiz 1 Solutions Problem 1. Quiz 1 Solutions Asymptotic orders

More information

Lecture Notes on Binary Decision Diagrams

Lecture Notes on Binary Decision Diagrams Lecture Notes on Binary Decision Diagrams 15-122: Principles of Imperative Computation William Lovas Notes by Frank Pfenning Lecture 25 April 21, 2011 1 Introduction In this lecture we revisit the important

More information

Lecture 7 February 26, 2010

Lecture 7 February 26, 2010 6.85: Advanced Data Structures Spring Prof. Andre Schulz Lecture 7 February 6, Scribe: Mark Chen Overview In this lecture, we consider the string matching problem - finding all places in a text where some

More information

CHAPTER 3 LITERATURE REVIEW

CHAPTER 3 LITERATURE REVIEW 20 CHAPTER 3 LITERATURE REVIEW This chapter presents query processing with XML documents, indexing techniques and current algorithms for generating labels. Here, each labeling algorithm and its limitations

More information

Search. The Nearest Neighbor Problem

Search. The Nearest Neighbor Problem 3 Nearest Neighbor Search Lab Objective: The nearest neighbor problem is an optimization problem that arises in applications such as computer vision, pattern recognition, internet marketing, and data compression.

More information

Algorithms in Systems Engineering ISE 172. Lecture 16. Dr. Ted Ralphs

Algorithms in Systems Engineering ISE 172. Lecture 16. Dr. Ted Ralphs Algorithms in Systems Engineering ISE 172 Lecture 16 Dr. Ted Ralphs ISE 172 Lecture 16 1 References for Today s Lecture Required reading Sections 6.5-6.7 References CLRS Chapter 22 R. Sedgewick, Algorithms

More information

Properties of red-black trees

Properties of red-black trees Red-Black Trees Introduction We have seen that a binary search tree is a useful tool. I.e., if its height is h, then we can implement any basic operation on it in O(h) units of time. The problem: given

More information

Hash-Based Indexing 165

Hash-Based Indexing 165 Hash-Based Indexing 165 h 1 h 0 h 1 h 0 Next = 0 000 00 64 32 8 16 000 00 64 32 8 16 A 001 01 9 25 41 73 001 01 9 25 41 73 B 010 10 10 18 34 66 010 10 10 18 34 66 C Next = 3 011 11 11 19 D 011 11 11 19

More information

Final Exam in Algorithms and Data Structures 1 (1DL210)

Final Exam in Algorithms and Data Structures 1 (1DL210) Final Exam in Algorithms and Data Structures 1 (1DL210) Department of Information Technology Uppsala University February 0th, 2012 Lecturers: Parosh Aziz Abdulla, Jonathan Cederberg and Jari Stenman Location:

More information

Efficient subset and superset queries

Efficient subset and superset queries Efficient subset and superset queries Iztok SAVNIK Faculty of Mathematics, Natural Sciences and Information Technologies, University of Primorska, Glagoljaška 8, 5000 Koper, Slovenia Abstract. The paper

More information

Database System Concepts

Database System Concepts Chapter 13: Query Processing s Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2008/2009 Slides (fortemente) baseados nos slides oficiais do livro c Silberschatz, Korth

More information

Lecture: Analysis of Algorithms (CS )

Lecture: Analysis of Algorithms (CS ) Lecture: Analysis of Algorithms (CS583-002) Amarda Shehu Fall 2017 1 Binary Search Trees Traversals, Querying, Insertion, and Deletion Sorting with BSTs 2 Example: Red-black Trees Height of a Red-black

More information

Introduction. for large input, even access time may be prohibitive we need data structures that exhibit times closer to O(log N) binary search tree

Introduction. for large input, even access time may be prohibitive we need data structures that exhibit times closer to O(log N) binary search tree Chapter 4 Trees 2 Introduction for large input, even access time may be prohibitive we need data structures that exhibit running times closer to O(log N) binary search tree 3 Terminology recursive definition

More information

Binary Trees

Binary Trees Binary Trees 4-7-2005 Opening Discussion What did we talk about last class? Do you have any code to show? Do you have any questions about the assignment? What is a Tree? You are all familiar with what

More information

Binary Trees. BSTs. For example: Jargon: Data Structures & Algorithms. root node. level: internal node. edge.

Binary Trees. BSTs. For example: Jargon: Data Structures & Algorithms. root node. level: internal node. edge. Binary Trees 1 A binary tree is either empty, or it consists of a node called the root together with two binary trees called the left subtree and the right subtree of the root, which are disjoint from

More information

Binary Search Trees, etc.

Binary Search Trees, etc. Chapter 12 Binary Search Trees, etc. Binary Search trees are data structures that support a variety of dynamic set operations, e.g., Search, Minimum, Maximum, Predecessors, Successors, Insert, and Delete.

More information

Contents Contents Introduction Basic Steps in Query Processing Introduction Transformation of Relational Expressions...

Contents Contents Introduction Basic Steps in Query Processing Introduction Transformation of Relational Expressions... Contents Contents...283 Introduction...283 Basic Steps in Query Processing...284 Introduction...285 Transformation of Relational Expressions...287 Equivalence Rules...289 Transformation Example: Pushing

More information

A counter-example to the minimal coverability tree algorithm

A counter-example to the minimal coverability tree algorithm A counter-example to the minimal coverability tree algorithm A. Finkel, G. Geeraerts, J.-F. Raskin and L. Van Begin Abstract In [1], an algorithm to compute a minimal coverability tree for Petri nets has

More information

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Sorting lower bound and Linear-time sorting Date: 9/19/17

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Sorting lower bound and Linear-time sorting Date: 9/19/17 601.433/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Sorting lower bound and Linear-time sorting Date: 9/19/17 5.1 Introduction You should all know a few ways of sorting in O(n log n)

More information

Trees. Chapter 6. strings. 3 Both position and Enumerator are similar in concept to C++ iterators, although the details are quite different.

Trees. Chapter 6. strings. 3 Both position and Enumerator are similar in concept to C++ iterators, although the details are quite different. Chapter 6 Trees In a hash table, the items are not stored in any particular order in the table. This is fine for implementing Sets and Maps, since for those abstract data types, the only thing that matters

More information

CHENNAI MATHEMATICAL INSTITUTE M.Sc. / Ph.D. Programme in Computer Science

CHENNAI MATHEMATICAL INSTITUTE M.Sc. / Ph.D. Programme in Computer Science CHENNAI MATHEMATICAL INSTITUTE M.Sc. / Ph.D. Programme in Computer Science Entrance Examination, 5 May 23 This question paper has 4 printed sides. Part A has questions of 3 marks each. Part B has 7 questions

More information

Precomputation Schemes for QoS Routing

Precomputation Schemes for QoS Routing 578 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 11, NO. 4, AUGUST 2003 Precomputation Schemes for QoS Routing Ariel Orda, Senior Member, IEEE, and Alexander Sprintson, Student Member, IEEE Abstract Precomputation-based

More information