Evaluating Conjunctive Triple Pattern Queries over Large Structured Overlay Networks

Size: px
Start display at page:

Download "Evaluating Conjunctive Triple Pattern Queries over Large Structured Overlay Networks"

Transcription

1 Evaluating Conjunctive Triple Pattern Queries over Large Structured Overlay Networks Erietta Liarou 1, Stratos Idreos 2, and Manolis Koubarakis 3 1 Technical University of Crete, Chania, Greece 2 CWI, Amsterdam, The Netherlands 3 National and Kapodistrian University of Athens, Athens, Greece Abstract. We study the problem of evaluating conjunctive queries composed of triple patterns over RDF data stored in distributed hash tables. Our goal is to develop algorithms that scale to large amounts of RDF data, distribute the query processing load evenly and incur little network traffic. We present and evaluate two novel query processing algorithms with these possibly conflicting goals in mind. We discuss the various tradeoffs that occur in our setting through a detailed experimental evaluation of the proposed algorithms. 1 Introduction Research at the frontiers of P2P networks and Semantic Web has recently received a lot of interest [23]. One of the most interesting open problems in this area is how to evaluate queries expressed in Semantic Web query languages (e.g., RDQL [22], RQL [17], SPARQL [21] or OWL-QL [10]) on top of P2P networks [9, 4, 19, 20, 25, 18]. In this paper we study the problem of evaluating conjunctive queries composed of triple patterns on top of RDF data stored in distributed hash tables. Distributed hash tables (DHTs) are an important class of P2P networks that offer distributed hash table functionality, and allow one to develop scalable, robust and fault-tolerant distributed applications [2]. DHTs have recently been used for the distributed storage and retrieval of various kinds of data e.g., relational [12, 14], textual [27], RDF [9] etc. Conjunctions of triple patterns are core constructs of some RDF query languages (e.g., RDQL [22] and SPARQL [21]) and used implicitly in all others (e.g., in the generalized path expressions of RQL [17]). The contributions of this paper are the following. We present two novel algorithms for the evaluation of conjunctive RDF queries composed of triple patterns on top of the distributed hash table Chord [24]. This has been an open problem since the proposal of RDFPeers [9] where only atomic triple patterns and conjunctions of triple patterns with the same variable or constant subject and possibly different constant predicates have been studied. Extending these query classes considered by RDFPeers to full conjunctive queries is an important issue if we want to deal effectively with the full functionality of existing RDF query This work was supported in part by the European Commission project Ontogrid (

2 languages [22, 17, 21]. But notice that the resulting query class is more challenging than the ones considered in RDFPeers. In the terminology of relational databases: we now have to deal with arbitrary selections, projections and joins on a virtual ternary relation consisting of all triples. The focus of our work is on the experimental evaluation of the proposed algorithms. We concentrate on three parameters that are critical in a distributed setting: amount of data stored in the network, load distribution and generated network traffic. Our algorithms are designed so that they involve in the query evaluation as many network nodes as possible, store as little date in the network as possible, and minimize the amount of network traffic they create. Trying to achieve all of these goals involves a tradeoff, and we demonstrate how we can sacrifice good load distribution to keep data storage and network traffic low and vice versa. The rest of the paper is organized as follows. Section 2 presents a synopsis of the underlying assumptions regarding network architecture, data model and query language. Sections 3 and 4 present the alternative data indexing and query processing algorithms. Then, in Section 5, we present an optimization to further reduce the network traffic generated by the algorithms. In Section 6, we show a detailed experimental evaluation and comparison of our algorithms under various parameters that affect performance. Finally, Section 7 discusses related work, and Section 8 presents conclusions and future work directions. 2 System model and data model System model. We assume an overlay network where all nodes are equal, they run the same software and have the same rights and responsibilities. Each node n has a unique key (e.g., its public key), denoted by key(n). Nodes are organized according to the Chord protocol [24] and are assumed to have synchronized clocks. This property is necessary for the time semantics we describe later on in this section. In practice, nodes will run a protocol such as NTP and achieve accuracies within few milliseconds [6]. Each data item i has a unique key, denoted by key(i). Chord uses consistent hashing to map keys to identifiers. Each node and item is assigned an m-bit identifier, that should be large enough to avoid collisions. A cryptographic hash function, such as SHA-1 or MD5 is used: function Hash(k) returns the m-bit identifier of key k. The identifier of a node n is denoted as id(n) and is computed by id(n) = Hash(key(n)). Similarly, the identifier of an item i is denoted by id(i) and is computed by id(i) = Hash(key(i)). Identifiers are ordered in an identifier circle (ring) modulo 2 m, i.e., from 0 to 2 m 1. Key k is assigned to the first node which is equal or follows Hash(k) clockwise in the identifier space. This node is called the successor node of identifier Hash(k) and is denoted by Successor(Hash(k)). We will often say that this node is responsible for key k. A query for locating the node responsible for a key k can be done in O(log N) steps with high probability [24], where N is the number of nodes in the network. Chord is described in more detail in [24]. The algorithms we describe in this paper use the API defined in [26, 14, 13]. This API provides two functionalities not given by the standard DHT protocols:

3 (i) send a message to multiple nodes (multicast) and (ii) send d messages to d nodes where each node receives exactly one of these messages (this can be thought of as a variation of the multicast operation). Let us now briefly describe this API. Function send(msg, id), where msg is a message and id is an identifier, delivers msg from any node to node Successor(id) in O(log N) hops. Function multisend(msg, I), where I is a set of d > 1 identifiers I 1,..., I d, delivers msg to nodes n 1, n 2,..., n d such that n j = Successor(I j ), where 1 < j d. This happens in O(d log N) hops. Function multisend() can also be used as multisend(m, I), where M is a set of d messages and I is a set of d identifiers. In this case, for each I j, message M j is delivered to Successor(I j ) in O(d log N) hops in total. A detailed description and evaluation of alternative ways to implement this API can be found in [13]. Data model. In the application scenarios we target, each network node is able to describe in RDF the resources that it wants to make available to the rest of the network, by creating and inserting metadata in the form of RDF triples. In addition, each node can submit queries that describe information that this node wants to receive all possible answers that are available at this time. We use a very simple concept of schema equivalent to the notion of a namespace. Thus, we do not deal with RDFS and the associated reasoning about classes and instances. Different schemas can co-exist but we do not support schema mappings. Each node uses some of the available schemas for its descriptions and queries. We will use the standard RDF concept of a triple 4. Let D be a countably infinite set of URIs and RDF literals. A triple is used to represent a statement about the application domain and is a formula of the form (subject, predicate, object). The subject of a triple identifies the resource that the statement is about, the predicate identifies a property or a characteristic of the subject, while the object identifies the value of the property. The subject and predicate parts of a triple are URIs from D, while the object is a URI or a literal from D. For a triple t, we will use subj(t), pred(t) and obj(t) to denote the string value of the subject, the predicate and the object of t respectively. As in RDQL [22], a triple pattern is an expression of the form (s, p, o) where s and p are URIs or variables, and o is a URI, a literal or a variable. A conjunctive query q is a formula?x 1,...,?x n : (s 1, p 1, o 1 ) (s 2, p 2, o 2 ) (s n, p n, o n ) where?x 1,...,?x n are variables, each (s i, p i, o i ) is a triple pattern, and each variable?x i appears in at least one triple pattern (s i, p i, o i ). Variables will always start with the? character. Variables?x 1,...,?x n will be called answer variables when we want to distinguish them from other variables of the query. A query will be called atomic if it consists of a single conjunct. Let us now define the concept of valuation (so we can talk about values that satisfy a query). Let V be a finite set of variables. A valuation v over V is a total function v from V to the set D. In the natural way, we extend a valuation v to be identity on D and to map triple patterns (s i, p i, o i ) to triples, and conjunctions of triple patterns to conjunctions of triple patterns. 4

4 We will find it useful to use various concepts from relational database theory in the presentation of our work. In particular, the operations of the relational algebra utilized in algorithm QC below follow the unnamed perspective of the relational model (i.e., tuples are elements of Cartesian products and co-ordinate numbers are used instead of attribute names) [5]. An RDF database is a set of triples. Let DB be an RDF database and q a conjunctive query q 1 q n where each q i is a triple pattern. The answer to q over database DB consists of all n-tuples (v(?x 1 ),..., v(?x n )) where v is a valuation over the set of variables of q and v(q i ) DB for each i = 1,..., n. In the algorithms we will describe below, each query q has a unique key, denoted by key(q), that is created by concatenating an increasing number to the key of the node that posed q. 3 The QC algorithm Let us now describe our first query processing algorithm, the query chain algorithm (QC). The main characteristic of QC is that the query is evaluated by a chain of nodes. Intermediate results flow through the nodes of this chain and finally the last node in the chain delivers the result back to the node that submitted the query. We will first describe how triples are stored in the network and then how an incoming query is evaluated by QC. Indexing a new triple. Assume a node x that wants to make a resource available to the rest of the network. Node x creates an RDF description d that characterizes this resource and publishes it. Since, we are not interested in a centralized solution, we do not store the whole description d to a single node. Instead, we choose to split d into triples and disperse it in the network, trying to distribute responsibility of storing descriptions and answering future conjunctive queries to several nodes. Each triple is handled separately and is indexed to three nodes. Let us explain the exact details for a triple t = (s, p, o). Node x computes the index identifiers of t as follows: I 1 = Hash(s), I 2 = Hash(p) and I 3 = Hash(o). These identifiers are used to locate the nodes r 1, r 2 and r 3, that will store t. In Chord terminology, these nodes are the successors of the relevant identifiers, e.g., r 1 = Successor(I 1 ). Then, x uses the multisend() function to index t to these 3 nodes. Each node that receives a triple t stores it in its local triple table T T. In the discussion below, T T will be formally treated as a ternary relation (in the sense of the relational model). Evaluating a query. Assume a node x that poses a conjunctive query q which consists of triple patterns q 1,..., q k. Each triple pattern of q will be evaluated by a (possibly) different node; these nodes form the query chain for q. The order we use to evaluate the different triple patterns is crucial and we discuss the issues involved later on. Now, for simplicity, we assume that we first evaluate the first triple pattern, then the second and so on. Query evaluation proceeds as follows. Node x determines the node that will evaluate triple pattern q 1 by using one of the constants in q 1. For example, if q 1 = (?s 1, p 1,?o 1 ) then x computes identifier I 1 = Hash(pred(q j )) since the predicate part is the only constant part of q j. This identifier is used to locate the

5 node r 1 (the successor of I 1 ) that may have triples that satisfy q 1, since according to the way we index triples, all triples that have pred(q j ) as their predicate will be stored in r j. Thus, n sends the message QEval(q, i, R, IP (x)) to node r 1 where q is the query, i is the index of the triple pattern to be evaluated by node r 1, IP (x) is the IP address of node x that posed the query, and R is the relation that will be used to accumulate triples that are intermediate results towards the computation of the answer to q. In this call, R receives its initial value (formally, the trivial relation {()} i.e., the relation that consists of an empty tuple over an empty set of attributes). In case that q 1 has multiple constants, x will heuristically prefer to use first the subject, then the object and finally the predicate to determine the node that will evaluate q 1. Intuitively, there will be more distinct subject or object values than distinct predicates values in an instance of a given schema. Thus, our decision help us to achieve a better distribution of the query processing load. Local processing at each chain node. Assume now that a node n receives a message QEval(q, i, R, IP (x)). First, n evaluates the i-th triple pattern of q using its local triple table i.e., it computes the relation L = π X (σ F (T T )) where F is a selection condition and X is a (possibly empty) list of natural numbers between 1 and 3. F and X are formed in the natural way by taking into account the constants and variables of q i e.g., if q i is (?s i, p i, o i ) then L = π 1 (σ 2=pi 3=o i (T T )). Then, n computes a new relation with intermediate results R = π Y (R L) where Y is the (possibly empty) list of positive integers identifying columns of R and L that correspond to answer variables or variables with values that are needed in the rest of the query evaluation (i.e., variables appearing in a triple pattern q j of q such that j > i). Note that the special case of i = 1 (when R = π Y (L)) is covered by the above formula for R, given the initial value {()} of R. If R is not the empty relation then n creates a message QEval(q, i + 1, R, IP (x)) and sends it to the node that will evaluate triple pattern q i+1. If R is the empty relation then the computation stops and an empty answer is returned to node x. In the case that i = k, the last triple pattern of q is evaluated. Then, n simply returns relation R back to x using a message Answer(q, R ). Now R is indeed a relation with arity equal to the number of answer variables and contains the answer to query q over the database of triples in the network. In the current implementation, R = π Y (R π X (σ F (T T ))) is computed as follows. For each tuple t of R, we first rewrite q i by substituting variables of q i by their corresponding values in R. Then, we use q i to probe T T for matching triples. For each matching triple, the appropriate tuple of R is computed on the fly. Access to T T can be made vary fast (essentially constant time) using hashing. In relational terminology, this is a nested loops join using a hash index for the inner relation T T. This is a good implementation strategy given that we expect a good evaluation order for the triple patterns of q to minimize the number of tuples in intermediate relation R (see relevant discussion at the end of this section). Example. QC is shown in operation in Figure 1. Each event in this figure represents an event in the network, i.e., either the arrival of a new triple or the arrival of a new triple pattern. Events are drawn from left to right which

6 Fig. 1. The algorithm QC in operation represents the chronological order in which these events have happened. In each event, the figure shows the steps of the algorithm that take place due to this event. For readability, in each event we draw only the nodes that do something due to this event, i.e., store or search triples, evaluate a query etc. Finally, note that we use S for the function Successor(), H for the function Hash() and we use comma to denote a conjunction between two triple patterns. In Event 1, node n inserts three triples t 1, t 2 and t 3 in the network. In Event 2, node n submits a conjunctive query q that consists of three triple patterns. The figure shows how the query travels from node n to r 2, then to r 4 and finally to r 7, where the answer is computed and returned to n. Order of nodes in a query chain. The order in which the different triple patterns of a query are evaluated is crucial, and affects network traffic, query processing load or any other resource that we try to optimize. For example, if we want to minimize message size for QC, we would like to put early in the query chain nodes that are responsible for triple patterns with low selectivity. Selectivity information can be made available to each node if statistics regarding the contents of TTs are available. Then, when a node n determines the next triple pattern q i+1 to be evaluated, n has enough statistical information to determine a good node to continue the query evaluation. The details of how to make our algorithms adaptive in the above sense are the subject of future work. 4 The SBV algorithm Let us now present our second algorithm, the algorithm spread by value (SBV). SBV extends the ideas of QC to achieve a better distribution of the query processing load. It does not create a single chain for a query as QC does, but by exploiting the values of matching triples found while processing the query incrementally, it distributes the responsibility of evaluating a query to more nodes than QC. In other words, it is essentially constructing multiple chains for each query. A quick understanding of the difference between QC and SBV can be obtained from Figure 2. There, we draw for each algorithm, all the nodes that participate in query processing for a query q that consists of 3 triple patterns.

7 Fig. 2. Comparing the query chains in QC and SBV QC creates a single chain that consists of only 3 nodes and query evaluation is carried out by these nodes only. On the contrary, SBV creates multiple chains which can collectively be seen as a tree. Now the query processing load for q is spread among the nodes of this tree. Each path in this tree is determined by the values used by triples that match the respective triple patterns at the different nodes (thus the name of the algorithm). Indexing a new triple. Assume a new triple t = (s, p, o). In SBV t will be stored at the successor nodes of the identifiers Hash(s), Hash(p), Hash(o), Hash(s + p), Hash(s + o), Hash(p + o) and Hash(s + p + o). We will exploit these replicas of triple t to achieve a better query load distribution. Evaluating a query. As in QC, the node that poses a new query q of the form q 1 q k sends q to a node r 1 that is able to evaluate the first triple pattern q 1. From this point on, the query plan produced by SBV is created dynamically by exploiting the values of the matching triples that nodes find at each step in order to achieve a better distribution of the query processing load. For example, r 1 will use the values for variables of q 1, that it will find in local triples matching q 1, to bind the variables of q 2 q k that are common with q 1 and produce a new set of queries that will jointly determine the answer to the original query q. Since we expect to have multiple matching values for the variables of q 1, we also expect to have multiple next nodes where the new queries will continue their evaluation. Thus, multiple chains of nodes take responsibility for the evaluation of q. The nodes at the leafs of these chains will deliver answers back to the node that submitted q. Our previous discussion on the order of nodes/triple patterns in a query chain is also valid for SBV. For simplicity, in the formal description of SBV below, we assume again that the evaluation order is determined by the order that the triple patterns appear in the query. To determine which node will evaluate a triple pattern in SBV, we use the constant parts of the triple pattern as in QC. The difference is that if there are multiple constants in a triple pattern, we use the combination of all constant parts. For example, if q j = (?s j, p j, o j ), then I j = Hash(pred(q j )+obj(q j )) where the operator + denotes concatenation of string values. We use the concatenation of constant parts whenever possible, since the number of possible identifiers that can be created by a combination of constant parts is definitely higher and will allow us to achieve a better distribution of the query processing load.

8 Fig. 3. The algorithm SBV in operation Assume a node x that wants to submit a query q with set of answer variables V. x creates a message Eval(q, V, u, IP (x)), where u is the empty valuation. x computes the identifier of the node that will evaluate the first triple pattern and sends the message to it with the send() function in O(log N) hops. When a node r receives a message Eval(q, V, u, IP (x)) where q is a query q 1 q n and n > 1, r searches its local T T for stored triples that satisfy triple pattern q 1. Assume m matching triples are found. For each satisfying triple t i, there is a valuation v i such that t i = v i (q 1 ). For each v i, r computes a new valuation v i = u v i and a new query q i v i(q 2 q n ). Then r decides the node that will continue the algorithm with the evaluation of q i (as we described in the previous paragraph), and creates a new message msg i = Eval(q i, V, v i, IP (x)) for that node. As a result, we have a set of at most m messages and r uses the multisend() function to deliver them in O(m log N) hops. Each node that receives one of these messages reacts as described in this paragraph. In the case that a node r receives a message Eval(q, V, u, IP (x)) where q consists of a single triple pattern q 1 (i.e., r is the last node in this query chain), then the evaluation of q finishes at r. Thus, r simply computes all triples t in T T and valuations v such that t = v(q n ) and sends the set of all such valuations v back to node x that posed the original query in one hop (after projecting them on the answer variables of the initial query). These valuations are part of the answer to the query. This case covers the situation where n = 1 as well (i.e., q consists of a single conjunct). Figure 3 shows an example of SBV in operation. 5 Optimizing network traffic In this section we introduce a new routing table, called IP cache (IPC) [14] that can be used by our algorithms to significantly reduce network traffic. In both our algorithms, the evaluation of a query goes through a number of nodes. The observation is that similar queries will follow a route with some nodes in common and we can exploit this information to decrease network traffic. Assume a node x j that participates in the evaluation of a query q and needs to send a message to a next node x j+1 that costs O(log N) overlay hops. After the first time that

9 Fig. 4. The schema used in our experiments node x j has sent a message to node x j+1, x j can keep track of the IP address of x j+1 and use it in the future when the same query or a similar one obliges it to communicate with the same node. Then, x j can send a message to x j+1 in just 1 hop instead of O(log N). The cost for the maintenance of the IPC is only local. As we will show in the experiments section, the use of IPCs significantly improves network traffic. Another effect of IPC, is that we reduce the routing load incurred by nodes in the network. The routing load of a node n is defined as the number of messages that n receives so as to forward them closer towards their destination, i.e., these are messages not sent to n but through n. Without using the IPC, each message that forwards intermediate results will pass through O(log N) nodes while with IPCs, it will go directly to the receiver node. 6 Experiments In this section, we experimentally evaluate the algorithms presented in this paper. We implemented a simulator of Chord in Java on top of which we developed our algorithms. Our metrics are: (a) the amount of network traffic that is created and (b) how well the query processing load and storage load are distributed among the network nodes. Each metric will be carefully described in the relevant experiment. We create a uniform workload of queries and data triples. We synthetically create RDF triples and queries assuming an RDFS schema of the form shown in Figure 4, i.e., a balanced tree with depth d and branching factor k. We assume that each class has a set of k properties. Each property of a class C which is at level l < d 1 ranges over another class which belongs to level l + 1. Each class of level d 1 has also k properties which have values that range over XSD datatypes. These data types are located at the last level d. To create an RDF triple t, we first randomly choose a depth of the tree of our schema. Then, we randomly choose a class C i among the classes of this depth. After that, we randomly choose an instance of C i to be subj(t), a property p of C i to be pred(t) and a value from the range of p to be obj(t). If the range of the selected property p are instances of a class C j that belongs to the next level, then obj(t) is a resource, otherwise it is a literal. For our experiments, we use conjunctive path queries of the following form:?x : (?x, p 1,?o 1 ) (?o 1, p 2,?o 2 ) (?o n 1, p n, o n )

10 In other words, we want to know the nodes in the graph?x for which there is a path of length n to node o 1 labeled by predicates p 1,..., p n. Path queries are an important type of conjunctive queries for which database and query workloads over the schema of Figure 4 can be created easily. To create a query of this type, we randomly choose a property p 1 of class C 0. Property p 1 leads us to a class C 1 from the next level. Then we randomly choose a property p 2 of class C 1. This procedure is repeated until we create n triple patterns. For the last triple pattern, we also randomly choose a value (literal) from the range of p n to be o n. Our experiments use the following parameters. The depth of our schema is d = 4. The number of instances of each class is 500, the number of properties that each one has is k = 3 while the a literal can take up to 200 different values. Finally, the number of triple patterns in each query we create is 5. E1: Network traffic and IPC effect. This experiment provides a comparison of our algorithms in terms of the network traffic that they create. To estimate better the network traffic, we use weighted hops, i.e., each hope has as weight the amount of intermediate results that it carries. Furthermore, we investigate the effect of the IPC in each algorithm and the cost of this optimization. We set up this experiment as follows. We create a network of 10 4 nodes and install 10 4 triples. Then, in order to count how expensive it is to insert and evaluate a query, in terms of network traffic, we pose a set Q of 100 queries and calculate the average cost of answering them. In order to understand the effect of IPCs the experiment continues as follows. We train IPCs with a varying number of queries, starting from 5 queries up to 640. After each training phase, we insert the same set of queries Q and count (a) the average amount of network traffic that is created and (b) the average size of IPCs in the network. Each training phase, as we call it, has two effects: query insertions cause the algorithms to work so query chains are created and the rewritten queries are transferred through these chains, but also because of these forwarding actions IPCs are filled with information that can reduce the cost of a subsequent forwarding operation. After each training phase, we measure the cost of inserting a query in the network after all the queries inserted so far, by exploiting the content of IPCs. In Figure 5(a) we show the network traffic that each algorithm creates. The point 0 on the x-axis has the maximum cost, since it represents the cost to insert the first query in the network. In this case all IPCs are empty and their use has no effect. Thus, this point reflects the cost of the algorithms if we do not use IPCs. However, in the next phases where IPCs have information that we can exploit, we see that the network traffic required to answer a query is decreased. For example, observe that after the last phase the cost of QC is 87% lower than it was at point 0. Another important observation is that QC causes less network traffic than SBV. In QC the nodes that participate in query chains are successors of a single value (of the predicate value for the queries we use in these experiments), so it is more possible that a query can use the IPC. SBV always creates more network traffic since the nodes that participate in query chains are successors of the combination of two values (a subject plus a predicate value). Since the combinations of these values are more then just a single one, it is less possible to use the IPC. QC is also cheaper at the point 0 on the x-axis since SBV has to sent the information though multiple chains.

11 (a) Traffic cost (b) IPC effect Fig. 5. (E1) Traffic cost and IPC effect as more queries are submitted In Figure 5(b) we show the average storage cost of the IPCs. Note, that here for readability we use a logarithmic scale for the y-axis. During the training phases, nodes fill their IPCs so we see that the size of IPC increases, as the number of submitted queries increases. Since even a small IPC size can significantly reduce network traffic, we can allow each node to fill its IPC as long as it can handle its size. The IPC cost in SBV is much more greater than in QC which happens again because SBV creates multiple chains for each query. E2: Load distribution. In this experiment we compare the algorithms in terms of load distribution. We distinguish between two types of load: query processing load and storage load. The query processing load that a node n incurs is defined as the number of triple patterns that arrive to n and are compared against its locally stored triples. Note that for algorithm QC the comparison of a triple pattern with the triples stored in T T happens for each tuple of relation R when R is computed. Thus, the query processing load of a node n in QC is equal to the number of tuples in R whenever a message QEval() is received. The storage load of a node n is defined as the sum of triples that n stores locally. For this experiment, we create a network of 10 4 nodes where we insert triples. Then we insert 10 3 queries and after that, we count the query processing and the storage load of each node in the network. In Figure 6(a) we show the query processing load for both algorithms. On the x-axis of this graph, nodes are ranked starting from the node with the highest load. The y-axis represents the cumulative load, i.e, each point (a, b) in the graph represents the sum of load b for the a most loaded nodes. First, we observe that both algorithms create the same total query processing load in the network. SBV achieves to distribute the query processing load to a significantly higher portion of network nodes, i.e, in QC there are 306 nodes (out of 10 4 ) participating in query processing, while in SBV there are 9666 nodes. SBV achieves this nice distribution since it exploits the values used to create rewritten queries by forwarding the produced intermediate results to nodes that are the successors of a combination of two or three constant parts.

12 (a) Cumulative query processing load (b) Cumulative storage load Fig. 6. (E2) Query processing and storage load distribution Finally, in Figure 6(b) we present the storage load distribution for both algorithms. As before, nodes are ranked starting from the node with the highest load while the y-axis represents the cumulative storage load. We observe that in QC the total storage load is less than in SBV. This happens because in QC we store each triple according to the values of its subject, its predicate and its object, while in SBV we also use the combinations per two and three of these values. Thus, in SBV a triple is indexed/stored four more times than in QC. The highest total storage load in the network is a price we have to pay for the better distribution of the query processing load in SBV. Notice that our load balancing techniques are at the application level. Thus, they can be used together with DHT-level load balancing techniques, e.g., [16]. 7 Related work The recent book [23] is an up-to-date collection of papers on work at the frontiers of P2P networks and Semantic Web. In the rest of this section, we only survey works that are closely related to our own. In [9], Min Cai et al. studied the problem of evaluating RDF queries in a scalable distributed RDF repository, named RDFPeers. RDFPeers is implemented on top of MAAN [8], which extends the Chord protocols [24] to efficiently answer multi-attribute and range queries. [9] was the first work to consider RDF queries on top of a DHT. The authors of [9] propose algorithms for evaluating triple pattern queries, range queries and conjunctive multi-predicate queries for the one-time query processing scenario. Furthermore, a simple replication algorithm is used to improve load distribution. Finally, [9] sketches some ideas regarding publish/subscribe scenarios in RDFPeers. In previous work [18], we have presented algorithms that go beyond the preliminary ideas of [9] regarding publish/subscribe for conjunctive multi-predicate queries. The ideas in [9] have influenced the design of QC. However, we deal with the full class of conjunctive queries which is an extension of the class of conjunctive

13 multi-predicate queries considered in [9]. In addition, we have presented the more advanced algorithm SBV which achieves efficient load distribution in a novel way. The other interesting work in the area of RDF query processing on top of DHTs is GridVine [4]. GridVine is built on top of P-Grid [1] and can deal with the same kind of queries as RDFPeers. In addition, it has an original approach to global semantic interoperability by utilizing gossiping techniques [3]. Another distributed RDF repository that provides a general RDF-based metadata infrastructure for P2P applications is the Edutella system [19, 20]. Edutella has two differences with our proposal: it is based on super-peers (while our proposal assumes that all nodes are equal) and concentrates on data integration issues (while we do not study this topic). Edutella uses HyperCup in its super-peer layer to achieve efficient routing of messages, but it does not consider issues such as the distribution of triples in the network to achieve scalability, load balancing etc. as in our approach. The paper [25] is another interesting work on distributed RDF query processing focusing on the optimization of path queries over multiple sources. Our research is also closely related with work on P2P databases based on the relational model [7, 11, 12, 14]. Currently, one can distinguish two orthogonal research directions in this area: work that emphasizes semantic interoperability of peer databases [7, 11] and work that attempts to push the capabilities of current database query processors to new large-scale Internet-wide applications by utilizing DHTs [12]. Our work can be categorized in the latter direction since it studies the processing of a subclass of conjunctive relational queries on top of DHTs. The only existing study of conjunctive relational queries on top of DHTs is [12] where join queries are studied. The ideas in this paper complement the ones in [12] and could also be used profitably in the relational case. This is an avenue that we plan to explore in future work together with extensions of our current results on continuous relational queries [14]. 8 Conclusions and future work In this paper we presented two novel algorithms for the distributed evaluation of conjunctive RDF queries composed of triple patterns. The algorithms manage to distribute the query processing load to a large part of the network while trying to minimize network traffic and keep storage cost low. The key idea is to decompose each conjunctive query to the triple patterns that it consists of, and then handle each triple pattern separately at a different node. The first algorithm establishes a chain of nodes that carry out the query evaluation. The second algorithm dynamically exploits matching triples to determine the next node in the query plan and creates multiple node chains that carry out the query evaluation. As a result, it achieves a better distribution of the query processing load at the expense of extra network traffic and storage load in the network. Our future work concentrates on extending our algorithms so that they can be adaptive to changes in the environment (e.g., changes in the data distribution), be able to handle skewed workloads efficiently, take into account network proximity etc. We also plan to extend our algorithms to deal with RDFS reasoning. Eventually, we want to support the complete functionality of languages such

14 RDQL [22], RQL [17] and SPARQL [21]. The algorithms will be incorporated in our system Atlas [15] which is developed in the context of the Semantic Grid project OntoGrid 5 (Atlas currently implements QC). References [1] K. Aberer. P-Grid: A Self-Organizing Access Structure for P2P Information Systems. In CoopIS 01. [2] K. Aberer, L. O. Alima, A. Ghodsi, S. Girdzijauskas, M. Hauswirth, and S. Haridi. The essence of P2P: A reference architecture for overlay networks. In IEEE P2P [3] K. Aberer, P. Cudré-Mauroux, and M. Hauswirth. Start making sense: The chatty web approach for global semantic agreements. Journal of Web Semantics, 1(1), December [4] K. Aberer, P. Cudre-Mauroux, M. Hauswirth, and T. V. Pelt. GridVine: Building Internet-Scale Semantic Overlay Networks. In WWW 04. [5] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison Wesley, [6] M. Bawa, A. Gionis, H. Garcia-Molina, and R. Motwani. The Price of Validity in Dynamic Networks. In SIGMOD 04. [7] P. A. Bernstein, F. Giunchiglia, A. Kementsietsidis, J. Mylopoulos, L. Serafini, and I. Zaihrayeu. Data Management for Peer-to-Peer Computing: A Vision. In WebDB 02. [8] M. Cai, M. Frank, and J. C. P. Szekely. MAAN: A Multi-Attribute Addressable Network for Grid Information Services. In Grid 03. [9] M. Cai, M. R. Frank, B. Yan, and R. M. MacGregor. A Subscribable Peer-to-Peer RDF Repository for Distributed Metadata Management. Journal of Web Semantics, 2(2): , December [10] R. Fikes, P. Hayes, and I. Horrocks. OWL-QL: A Language for Deductive Query Answering on the Semantic Web. Journal of Web Semantics, 2(1):19 29, December [11] S. Gribble, A. Halevy, Z. Ives, M. Rodrig, and D. Suciu. What Can Peer-to-Peer Do for Databases, and Vice Versa? In WebDB 01. [12] R. Huebsch, J. M. Hellerstein, N. Lanham, B. T. Loo, S. Shenker, and I. Stoica. Querying the Internet with PIER. In VLDB 03. [13] S. Idreos. Distributed evaluation of continuous equi-join queries over large structured overlay networks. Master s thesis, [14] S. Idreos, C. Tryfonopoulos, and M. Koubarakis. Distributed Evaluation of Continuous Equijoin Queries over Large Structured Overlay Networks. In ICDE 06. [15] Z. Kaoudi, I. Miliaraki, M. Magiridou, A. Papadakis-Pesaresi, and M. Koubarakis. Storing and querying RDF data in Atlas. In Demo Papers ESWC 06. [16] D. Karger and M. Ruhl. Simple Efficient Load Balancing Algorithms for Peer to Peer Systems. In SPAA 04. [17] G. Karvounarakis, S. Alexaki, V. Christophides, D. Plexousakis, and M. Scholl. RQL: A Declarative Query Language for RDF. In WWW 02. [18] E. Liarou, S. Idreos, and M. Koubarakis. Publish-Subscribe with RDF Data over Large Structured Overlay Networks. In DBISP2P 05. [19] W. Nejdl, B. Wolf, C. Qu, S. Decker, M. Sintek, A. Naeve, M. Nilsson, M. Palmer, and T. Risch. EDUTELLA: A P2P Networking Infrastructure Based on RDF. In WWW 02. [20] W. Nejdl, M. Wolpers, W. Siberski, C. Schmitz, M. Schlosser, I. Brunkhorst, and A. Loser. Super-Peer-Based Routing and Clustering Strategies for RDF-Based Peer-To-Peer Networks. In WWW 03. [21] E. Prud hommeaux and A. Seaborn. SPARQL Query Language for RDF [22] A. Seaborne. Rdql - a query language for RDF. W3C Member Submission, [23] S. Staab and H. Stuckenschmidt. Semantic Web and Peer-to-Peer. Springer, [24] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. In SIGCOMM 01. [25] H. Stuckenschmidt, R. Vdovjak, J. Broekstra, and G.-J. Houben. Towards Distributed Processing of RDF Path Queries. International Journal of Web Engineering and Technology, 2(2/3): , [26] C. Tryfonopoulos, S. Idreos, and M. Koubarakis. LibraRing: An Architecture for Distributed Digital Libraries Based on DHTs. In ECDL 05. [27] C. Tryfonopoulos, S. Idreos, and M. Koubarakis. Publish/Subscribe Functionality in IR Environments using Structured Overlay Networks. In SIGIR

Publish/Subscribe with RDF Data over Large Structured Overlay Networks

Publish/Subscribe with RDF Data over Large Structured Overlay Networks Publish/Subscribe with RDF Data over Large Structured Overlay Networks Erietta Liarou, Stratos Idreos, and Manolis Koubarakis Department of Electronic and Computer Engineering Technical University of Crete,

More information

Distributed Evaluation of Continuous Equi-join Queries over Large Structured Overlay Networks

Distributed Evaluation of Continuous Equi-join Queries over Large Structured Overlay Networks Distributed Evaluation of Continuous Equi-join Queries over Large Structured Overlay Networks by Stratos Idreos A thesis submitted in fulfillment of the requirements for the Master of Computer Engineering

More information

Storage Balancing in P2P Based Distributed RDF Data Stores

Storage Balancing in P2P Based Distributed RDF Data Stores Storage Balancing in P2P Based Distributed RDF Data Stores Maximiliano Osorio and Carlos Buil-Aranda Universidad Técnica Federico Santa María, Valparaíso, Chile. {mosorio,cbuil}@inf.utfsm.cl Abstract.

More information

Distributed Evaluation of Continuous Equi-join Queries over Large Structured Overlay Networks

Distributed Evaluation of Continuous Equi-join Queries over Large Structured Overlay Networks Distributed Evaluation of Continuous Equi-join Queries over Large Structured Overlay Networks Stratos Idreos Christos Tryfonopoulos Manolis Koubarakis CWI Amsterdam, The Netherlands S.Idreos@cwi.nl Department

More information

Continuous Multi-Way Joins over Distributed Hash Tables

Continuous Multi-Way Joins over Distributed Hash Tables Continuous Multi-Way Joins over Distributed Hash Tables Stratos Idreos CWI, Amsterdam The Netherlands s.idreos@cwi.nl Erietta Liarou CWI, Amsterdam The Netherlands erietta@cwi.nl Manolis Koubarakis National

More information

LibraRing: An Architecture for Distributed Digital Libraries Based on DHTs

LibraRing: An Architecture for Distributed Digital Libraries Based on DHTs LibraRing: An Architecture for Distributed Digital Libraries Based on DHTs Christos Tryfonopoulos, Stratos Idreos, and Manolis Koubarakis Dept. of Electronic and Computer Engineering Technical University

More information

MonetDB/DataCell: leveraging the column-store database technology for efficient and scalable stream processing Liarou, E.

MonetDB/DataCell: leveraging the column-store database technology for efficient and scalable stream processing Liarou, E. UvA-DARE (Digital Academic Repository) MonetDB/DataCell: leveraging the column-store database technology for efficient and scalable stream processing Liarou, E. Link to publication Citation for published

More information

Publish/Subscribe Functionality in IR Environments using Structured Overlay Networks

Publish/Subscribe Functionality in IR Environments using Structured Overlay Networks Publish/Subscribe Functionality in IR Environments using Structured Overlay Networks Christos Tryfonopoulos Stratos Idreos Manolis Koubarakis Dept. of Electronic and Computer Engineering Technical University

More information

Design Issues and Challenges

Design Issues and Challenges P2P Research Design Issues and Challenges for RDF- and Schema-Based Peer-to-Peer Systems Wolfgang Nejdl and Wolf Siberski and Michael Sintek Abstract Databases have employed a schema-based approach to

More information

Super-Peer-Based Routing Strategies for RDF-Based Peer-to-Peer Networks

Super-Peer-Based Routing Strategies for RDF-Based Peer-to-Peer Networks Super-Peer-Based Routing Strategies for RDF-Based Peer-to-Peer Networks Wolfgang Nejdl a Martin Wolpers a Wolf Siberski a Christoph Schmitz a Mario Schlosser a Ingo Brunkhorst a Alexander Löser b a Learning

More information

Publish/Subscribe Functionality in IR Environments using Structured Overlay Networks

Publish/Subscribe Functionality in IR Environments using Structured Overlay Networks Publish/Subscribe Functionality in IR Environments using Structured Overlay Networks Christos Tryfonopoulos Stratos Idreos Manolis Koubarakis Dept. of Electronic and Computer Engineering, Technical University

More information

Semantic Overlay Networks

Semantic Overlay Networks VLDB 2005 Semantic Overlay Networks Karl Aberer and Philippe Cudré-Mauroux School of Computer and Communication Sciences EPFL -- Switzerland Overview of the Tutorial I. P2P Systems Overview II. Query Evaluation

More information

A Framework for Peer-To-Peer Lookup Services based on k-ary search

A Framework for Peer-To-Peer Lookup Services based on k-ary search A Framework for Peer-To-Peer Lookup Services based on k-ary search Sameh El-Ansary Swedish Institute of Computer Science Kista, Sweden Luc Onana Alima Department of Microelectronics and Information Technology

More information

P2P-DIET: Ad-hoc and Continuous Queries in Peer-to-Peer Networks using Mobile Agents

P2P-DIET: Ad-hoc and Continuous Queries in Peer-to-Peer Networks using Mobile Agents P2P-DIET: Ad-hoc and Continuous Queries in Peer-to-Peer Networks using Mobile Agents Stratos Idreos and Manolis Koubarakis Intelligent Systems Laboratory Dept. of Electronic and Computer Engineering Technical

More information

March 10, Distributed Hash-based Lookup. for Peer-to-Peer Systems. Sandeep Shelke Shrirang Shirodkar MTech I CSE

March 10, Distributed Hash-based Lookup. for Peer-to-Peer Systems. Sandeep Shelke Shrirang Shirodkar MTech I CSE for for March 10, 2006 Agenda for Peer-to-Peer Sytems Initial approaches to Their Limitations CAN - Applications of CAN Design Details Benefits for Distributed and a decentralized architecture No centralized

More information

CS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University

CS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University CS 555: DISTRIBUTED SYSTEMS [P2P SYSTEMS] Shrideep Pallickara Computer Science Colorado State University Frequently asked questions from the previous class survey Byzantine failures vs malicious nodes

More information

Scalability In Peer-to-Peer Systems. Presented by Stavros Nikolaou

Scalability In Peer-to-Peer Systems. Presented by Stavros Nikolaou Scalability In Peer-to-Peer Systems Presented by Stavros Nikolaou Background on Peer-to-Peer Systems Definition: Distributed systems/applications featuring: No centralized control, no hierarchical organization

More information

Choosing Appropriate Peer-to-Peer Infrastructure for Your Digital Libraries

Choosing Appropriate Peer-to-Peer Infrastructure for Your Digital Libraries Choosing Appropriate Peer-to-Peer Infrastructure for Your Digital Libraries Hao Ding and Ingeborg Sølvberg Dept. of Computer and Information System, Norwegian Univ. of Sci. &Tech. Sem Sælands vei 7-9,

More information

Distributed RDF Query Processing and Reasoning in Peer-to-Peer Networks

Distributed RDF Query Processing and Reasoning in Peer-to-Peer Networks Distributed RDF Query Processing and Reasoning in Peer-to-Peer Networks Zoi Kaoudi Dept. of Informatics and Telecommunications National and Kapodistrian University of Athens, Greece Abstract. With the

More information

A semantic-based P2P resource organization model R-Chord

A semantic-based P2P resource organization model R-Chord The Journal of Systems and Software 79 (2006) 1619 1631 www.elsevier.com/locate/jss A semantic-based P2P resource organization model R-Chord Jie Liu a,b, *, Hai Zhuge a a China Knowledge Research Group,

More information

Distributed Systems. 16. Distributed Lookup. Paul Krzyzanowski. Rutgers University. Fall 2017

Distributed Systems. 16. Distributed Lookup. Paul Krzyzanowski. Rutgers University. Fall 2017 Distributed Systems 16. Distributed Lookup Paul Krzyzanowski Rutgers University Fall 2017 1 Distributed Lookup Look up (key, value) Cooperating set of nodes Ideally: No central coordinator Some nodes can

More information

Query Processing Over Peer-To-Peer Data Sharing Systems

Query Processing Over Peer-To-Peer Data Sharing Systems Query Processing Over Peer-To-Peer Data Sharing Systems O. D. Şahin A. Gupta D. Agrawal A. El Abbadi Department of Computer Science University of California at Santa Barbara odsahin, abhishek, agrawal,

More information

SPARQL Query Optimization on Top of DHTs

SPARQL Query Optimization on Top of DHTs SPARQL Query Optimization on Top of DHTs Zoi Kaoudi, Kostis Kyzirakos, and Manolis Koubarakis Dept. of Informatics and Telecommunications National and Kapodistrian University of Athens, Greece Abstract.

More information

SPARQL Query Optimization on Top of DHTs

SPARQL Query Optimization on Top of DHTs SPARQL Query Optimization on Top of DHTs Zoi Kaoudi, Kostis Kyzirakos, and Manolis Koubarakis Dept. of Informatics and Telecommunications National and Kapodistrian University of Athens, Greece Abstract.

More information

Building a low-latency, proximity-aware DHT-based P2P network

Building a low-latency, proximity-aware DHT-based P2P network Building a low-latency, proximity-aware DHT-based P2P network Ngoc Ben DANG, Son Tung VU, Hoai Son NGUYEN Department of Computer network College of Technology, Vietnam National University, Hanoi 144 Xuan

More information

A Scalable Content- Addressable Network

A Scalable Content- Addressable Network A Scalable Content- Addressable Network In Proceedings of ACM SIGCOMM 2001 S. Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker Presented by L.G. Alex Sung 9th March 2005 for CS856 1 Outline CAN basics

More information

Super-Peer-Based Routing and Clustering Strategies for RDF-Based Peer-To-Peer Networks

Super-Peer-Based Routing and Clustering Strategies for RDF-Based Peer-To-Peer Networks Super-Peer-Based Routing and Clustering Strategies for RDF-Based Peer-To-Peer Networks Wolfgang Nejdl, Martin Wolpers, Wolf Siberski, Christoph Schmitz, Mario Schlosser, Ingo Brunkhorst, Alexander Löser

More information

Modular P2P-Based Approach for RDF Data Storage and Retrieval

Modular P2P-Based Approach for RDF Data Storage and Retrieval Modular P2P-Based Approach for RDF Data Storage and Retrieval Imen Filali, Laurent Pellegrino, Francesco Bongiovanni, Fabrice Huet, Françoise Baude To cite this version: Imen Filali, Laurent Pellegrino,

More information

Humboldt Discoverer: A semantic P2P index for PDMS

Humboldt Discoverer: A semantic P2P index for PDMS Humboldt Discoverer: A semantic P2P index for PDMS Sven Herschel and Ralf Heese Humboldt-Universität zu Berlin, Databases and Information Systems Unter den Linden 6, D-10099 Berlin, Germany, {herschel

More information

Semantic Query Routing and Processing in P2P Database Systems: The ICS-FORTH SQPeer Middleware

Semantic Query Routing and Processing in P2P Database Systems: The ICS-FORTH SQPeer Middleware Semantic Query Routing and Processing in P2P Database Systems: The ICS-FORTH SQPeer Middleware George Kokkinidis and Vassilis Christophides Institute of Computer Science - FORTH Vassilika Vouton, P.O.

More information

L3S Research Center, University of Hannover

L3S Research Center, University of Hannover , University of Hannover Dynamics of Wolf-Tilo Balke and Wolf Siberski 21.11.2007 *Original slides provided by S. Rieche, H. Niedermayer, S. Götz, K. Wehrle (University of Tübingen) and A. Datta, K. Aberer

More information

08 Distributed Hash Tables

08 Distributed Hash Tables 08 Distributed Hash Tables 2/59 Chord Lookup Algorithm Properties Interface: lookup(key) IP address Efficient: O(log N) messages per lookup N is the total number of servers Scalable: O(log N) state per

More information

Joint Entity Resolution

Joint Entity Resolution Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute

More information

Distriubted Hash Tables and Scalable Content Adressable Network (CAN)

Distriubted Hash Tables and Scalable Content Adressable Network (CAN) Distriubted Hash Tables and Scalable Content Adressable Network (CAN) Ines Abdelghani 22.09.2008 Contents 1 Introduction 2 2 Distributed Hash Tables: DHT 2 2.1 Generalities about DHTs............................

More information

Distributed Hash Table

Distributed Hash Table Distributed Hash Table P2P Routing and Searching Algorithms Ruixuan Li College of Computer Science, HUST rxli@public.wh.hb.cn http://idc.hust.edu.cn/~rxli/ In Courtesy of Xiaodong Zhang, Ohio State Univ

More information

Information Integration in Schema-Based Peer-To-Peer Networks

Information Integration in Schema-Based Peer-To-Peer Networks Information Integration in Schema-Based Peer-To-Peer Networks Alexander Löser 1, Wolf Siberski, Martin Wolpers, and Wolfgang Nejdl 2 1 CIS, Technical University Berlin, 10587 Berlin, Germany aloeser@cs.tu-berlin.de

More information

P2P Schema-Mapping over Network-bound XML Data

P2P Schema-Mapping over Network-bound XML Data Fourth International Conference on Semantics, Knowledge and Grid P2P Schema-Mapping over Network-bound XML Data Carmela Comito 1, Domenico Talia 2 DEIS - University of Calabria Via P. Bucci 41 c,87036,

More information

A subscribable peer-to-peer RDF repository for distributed metadata management

A subscribable peer-to-peer RDF repository for distributed metadata management Web Semantics: Science, Services and Agents on the World Wide Web 2 (2004) 109 130 A subscribable peer-to-peer RDF repository for distributed metadata management Min Cai, Martin Frank, Baoshi Yan, Robert

More information

A General Approach to Query the Web of Data

A General Approach to Query the Web of Data A General Approach to Query the Web of Data Xin Liu 1 Department of Information Science and Engineering, University of Trento, Trento, Italy liu@disi.unitn.it Abstract. With the development of the Semantic

More information

CS514: Intermediate Course in Computer Systems

CS514: Intermediate Course in Computer Systems Distributed Hash Tables (DHT) Overview and Issues Paul Francis CS514: Intermediate Course in Computer Systems Lecture 26: Nov 19, 2003 Distributed Hash Tables (DHT): Overview and Issues What is a Distributed

More information

FedX: A Federation Layer for Distributed Query Processing on Linked Open Data

FedX: A Federation Layer for Distributed Query Processing on Linked Open Data FedX: A Federation Layer for Distributed Query Processing on Linked Open Data Andreas Schwarte 1, Peter Haase 1,KatjaHose 2, Ralf Schenkel 2, and Michael Schmidt 1 1 fluid Operations AG, Walldorf, Germany

More information

Using Statistics for Computing Joins with MapReduce

Using Statistics for Computing Joins with MapReduce Using Statistics for Computing Joins with MapReduce Theresa Csar 1, Reinhard Pichler 1, Emanuel Sallinger 1, and Vadim Savenkov 2 1 Vienna University of Technology {csar, pichler, sallinger}@dbaituwienacat

More information

Load Sharing in Peer-to-Peer Networks using Dynamic Replication

Load Sharing in Peer-to-Peer Networks using Dynamic Replication Load Sharing in Peer-to-Peer Networks using Dynamic Replication S Rajasekhar, B Rong, K Y Lai, I Khalil and Z Tari School of Computer Science and Information Technology RMIT University, Melbourne 3, Australia

More information

Early Measurements of a Cluster-based Architecture for P2P Systems

Early Measurements of a Cluster-based Architecture for P2P Systems Early Measurements of a Cluster-based Architecture for P2P Systems Balachander Krishnamurthy, Jia Wang, Yinglian Xie I. INTRODUCTION Peer-to-peer applications such as Napster [4], Freenet [1], and Gnutella

More information

Exploiting Semantic Clustering in the edonkey P2P Network

Exploiting Semantic Clustering in the edonkey P2P Network Exploiting Semantic Clustering in the edonkey P2P Network S. Handurukande, A.-M. Kermarrec, F. Le Fessant & L. Massoulié Distributed Programming Laboratory, EPFL, Switzerland INRIA, Rennes, France INRIA-Futurs

More information

Query Processing in Super-Peer Networks with Languages Based on Information Retrieval: the P2P-DIET Approach

Query Processing in Super-Peer Networks with Languages Based on Information Retrieval: the P2P-DIET Approach Query Processing in Super-Peer Networks with Languages Based on Information Retrieval: the P2P-DIET Approach Stratos Idreos 1, Christos Tryfonopoulos 1, Manolis Koubarakis 1, and Yannis Drougas 2 1 Intelligent

More information

Minimizing Churn in Distributed Systems

Minimizing Churn in Distributed Systems Minimizing Churn in Distributed Systems by P. Brighten Godfrey, Scott Shenker, and Ion Stoica appearing in SIGCOMM 2006 presented by Todd Sproull Introduction Problem: nodes joining or leaving distributed

More information

Publish/Subscribe for RDF-based P2P Networks

Publish/Subscribe for RDF-based P2P Networks Publish/Subscribe for RDF-based P2P Networks Paul - Alexandru Chirita 1, Stratos Idreos 2, Manolis Koubarakis 2, and Wolfgang Nejdl 1 1 L3S and University of Hannover Deutscher Pavillon Expo Plaza 1 30539

More information

! Parallel machines are becoming quite common and affordable. ! Databases are growing increasingly large

! Parallel machines are becoming quite common and affordable. ! Databases are growing increasingly large Chapter 20: Parallel Databases Introduction! Introduction! I/O Parallelism! Interquery Parallelism! Intraquery Parallelism! Intraoperation Parallelism! Interoperation Parallelism! Design of Parallel Systems!

More information

Chapter 20: Parallel Databases

Chapter 20: Parallel Databases Chapter 20: Parallel Databases! Introduction! I/O Parallelism! Interquery Parallelism! Intraquery Parallelism! Intraoperation Parallelism! Interoperation Parallelism! Design of Parallel Systems 20.1 Introduction!

More information

Chapter 20: Parallel Databases. Introduction

Chapter 20: Parallel Databases. Introduction Chapter 20: Parallel Databases! Introduction! I/O Parallelism! Interquery Parallelism! Intraquery Parallelism! Intraoperation Parallelism! Interoperation Parallelism! Design of Parallel Systems 20.1 Introduction!

More information

Peer-to-Peer Discovery of Semantic Associations

Peer-to-Peer Discovery of Semantic Associations Peer-to-Peer Discovery of Semantic Associations Matthew Perry, Maciej Janik, Cartic Ramakrishnan, Conrad Ibañez, Budak LSDIS Lab, Department of Computer Science, The University of Georgia 415 Boyd Graduate

More information

Semantic Overlay Networks

Semantic Overlay Networks Semantic Overlay Networks Arturo Crespo and Hector Garcia-Molina Write-up by Pavel Serdyukov Saarland University, Department of Computer Science Saarbrücken, December 2003 Content 1 Motivation... 3 2 Introduction

More information

SEMANTIC GRID RESOURCE DISCOVERY IN ATLAS*

SEMANTIC GRID RESOURCE DISCOVERY IN ATLAS* SEMANTIC GRID RESOURCE DISCOVERY IN ATLAS* Zoi Kaoudi, Iris Miliaraki, Matoula Magiridou Dept. of Informatics and Telecommunications National and Kapodistrian University of Athens, Greece {zoi, iris, matoula)

More information

Query Processing in Super-Peer Networks with Languages Based on Information Retrieval: the P2P-DIET Approach

Query Processing in Super-Peer Networks with Languages Based on Information Retrieval: the P2P-DIET Approach Query Processing in Super-Peer Networks with Languages Based on Information Retrieval: the P2P-DIET Approach Stratos Idreos 1, Christos Tryfonopoulos 1, Manolis Koubarakis 1, and Yannis Drougas 2 1 Intelligent

More information

An Architecture for Peer-to-Peer Information Retrieval

An Architecture for Peer-to-Peer Information Retrieval An Architecture for Peer-to-Peer Information Retrieval Karl Aberer, Fabius Klemm, Martin Rajman, Jie Wu School of Computer and Communication Sciences EPFL, Lausanne, Switzerland July 2, 2004 Abstract Peer-to-Peer

More information

COMP718: Ontologies and Knowledge Bases

COMP718: Ontologies and Knowledge Bases 1/35 COMP718: Ontologies and Knowledge Bases Lecture 9: Ontology/Conceptual Model based Data Access Maria Keet email: keet@ukzn.ac.za home: http://www.meteck.org School of Mathematics, Statistics, and

More information

Finding Data in the Cloud using Distributed Hash Tables (Chord) IBM Haifa Research Storage Systems

Finding Data in the Cloud using Distributed Hash Tables (Chord) IBM Haifa Research Storage Systems Finding Data in the Cloud using Distributed Hash Tables (Chord) IBM Haifa Research Storage Systems 1 Motivation from the File Systems World The App needs to know the path /home/user/my pictures/ The Filesystem

More information

Chapter 18: Parallel Databases

Chapter 18: Parallel Databases Chapter 18: Parallel Databases Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 18: Parallel Databases Introduction I/O Parallelism Interquery Parallelism Intraquery

More information

Chapter 18: Parallel Databases. Chapter 18: Parallel Databases. Parallelism in Databases. Introduction

Chapter 18: Parallel Databases. Chapter 18: Parallel Databases. Parallelism in Databases. Introduction Chapter 18: Parallel Databases Chapter 18: Parallel Databases Introduction I/O Parallelism Interquery Parallelism Intraquery Parallelism Intraoperation Parallelism Interoperation Parallelism Design of

More information

L3S Research Center, University of Hannover

L3S Research Center, University of Hannover , University of Hannover Structured Peer-to to-peer Networks Wolf-Tilo Balke and Wolf Siberski 3..6 *Original slides provided by K. Wehrle, S. Götz, S. Rieche (University of Tübingen) Peer-to-Peer Systems

More information

Thematic Schema Building for Mediation-based Peer-to-Peer Architecture 1

Thematic Schema Building for Mediation-based Peer-to-Peer Architecture 1 Electronic Notes in Theoretical Computer Science 150 (2006) 21 36 www.elsevier.com/locate/entcs Thematic Schema Building for Mediation-based Peer-to-Peer Architecture 1 Nicolas Lumineau 2 Anne Doucet 2

More information

DRing: A Layered Scheme for Range Queries over DHTs

DRing: A Layered Scheme for Range Queries over DHTs DRing: A Layered Scheme for Range Queries over DHTs Nicolas Hidalgo, Erika Rosas, Luciana Arantes, Olivier Marin, Pierre Sens and Xavier Bonnaire Université Pierre et Marie Curie, CNRS INRIA - REGAL, Paris,

More information

Chapter 17: Parallel Databases

Chapter 17: Parallel Databases Chapter 17: Parallel Databases Introduction I/O Parallelism Interquery Parallelism Intraquery Parallelism Intraoperation Parallelism Interoperation Parallelism Design of Parallel Systems Database Systems

More information

Improving the Dependability of Prefix-Based Routing in DHTs

Improving the Dependability of Prefix-Based Routing in DHTs Improving the Dependability of Prefix-Based Routing in DHTs Sabina Serbu, Peter Kropf and Pascal Felber University of Neuchâtel, CH-9, Neuchâtel, Switzerland {sabina.serbu, peter.kropf, pascal.felber}@unine.ch

More information

Processing Rank-Aware Queries in P2P Systems

Processing Rank-Aware Queries in P2P Systems Processing Rank-Aware Queries in P2P Systems Katja Hose, Marcel Karnstedt, Anke Koch, Kai-Uwe Sattler, and Daniel Zinn Department of Computer Science and Automation, TU Ilmenau P.O. Box 100565, D-98684

More information

Diminished Chord: A Protocol for Heterogeneous Subgroup Formation in Peer-to-Peer Networks

Diminished Chord: A Protocol for Heterogeneous Subgroup Formation in Peer-to-Peer Networks Diminished Chord: A Protocol for Heterogeneous Subgroup Formation in Peer-to-Peer Networks David R. Karger 1 and Matthias Ruhl 2 1 MIT Computer Science and Artificial Intelligence Laboratory Cambridge,

More information

Designing Views to Answer Queries under Set, Bag,and BagSet Semantics

Designing Views to Answer Queries under Set, Bag,and BagSet Semantics Designing Views to Answer Queries under Set, Bag,and BagSet Semantics Rada Chirkova Department of Computer Science, North Carolina State University Raleigh, NC 27695-7535 chirkova@csc.ncsu.edu Foto Afrati

More information

: Scalable Lookup

: Scalable Lookup 6.824 2006: Scalable Lookup Prior focus has been on traditional distributed systems e.g. NFS, DSM/Hypervisor, Harp Machine room: well maintained, centrally located. Relatively stable population: can be

More information

Mapping a Dynamic Prefix Tree on a P2P Network

Mapping a Dynamic Prefix Tree on a P2P Network Mapping a Dynamic Prefix Tree on a P2P Network Eddy Caron, Frédéric Desprez, Cédric Tedeschi GRAAL WG - October 26, 2006 Outline 1 Introduction 2 Related Work 3 DLPT architecture 4 Mapping 5 Conclusion

More information

Dynamo. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Motivation System Architecture Evaluation

Dynamo. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Motivation System Architecture Evaluation Dynamo Smruti R. Sarangi Department of Computer Science Indian Institute of Technology New Delhi, India Smruti R. Sarangi Leader Election 1/20 Outline Motivation 1 Motivation 2 3 Smruti R. Sarangi Leader

More information

Peer-to-Peer Discovery of Semantic Associations

Peer-to-Peer Discovery of Semantic Associations Peer-to-Peer Discovery of Semantic Associations Matthew Perry, Maciej Janik, Cartic Ramakrishnan, Conrad Ibañez, Budak LSDIS Lab, Department of Computer Science, The University of Georgia 415 Boyd Graduate

More information

Decentralized Object Location In Dynamic Peer-to-Peer Distributed Systems

Decentralized Object Location In Dynamic Peer-to-Peer Distributed Systems Decentralized Object Location In Dynamic Peer-to-Peer Distributed Systems George Fletcher Project 3, B649, Dr. Plale July 16, 2003 1 Introduction One of the key requirements for global level scalability

More information

Aggregation of a Term Vocabulary for P2P-IR: a DHT Stress Test

Aggregation of a Term Vocabulary for P2P-IR: a DHT Stress Test Aggregation of a Term Vocabulary for P2P-IR: a DHT Stress Test Fabius Klemm and Karl Aberer School of Computer and Communication Sciences Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland

More information

Content Overlays. Nick Feamster CS 7260 March 12, 2007

Content Overlays. Nick Feamster CS 7260 March 12, 2007 Content Overlays Nick Feamster CS 7260 March 12, 2007 Content Overlays Distributed content storage and retrieval Two primary approaches: Structured overlay Unstructured overlay Today s paper: Chord Not

More information

A Peer-to-peer Framework for Caching Range Queries

A Peer-to-peer Framework for Caching Range Queries A Peer-to-peer Framework for Caching Range Queries O. D. Şahin A. Gupta D. Agrawal A. El Abbadi Department of Computer Science University of California Santa Barbara, CA 9316, USA {odsahin, abhishek, agrawal,

More information

tjuanli, ubc. ca

tjuanli, ubc. ca An Ontological Framework for Large-Scale Grid Resource Discovery Juan Li, Son Vuong Computer Science Department, University ofbritish Columbia tjuanli, vuong}@cs. ubc. ca Abstract This paper presents GOIDS

More information

Providing File Services using a Distributed Hash Table

Providing File Services using a Distributed Hash Table Providing File Services using a Distributed Hash Table Lars Seipel, Alois Schuette University of Applied Sciences Darmstadt, Department of Computer Science, Schoefferstr. 8a, 64295 Darmstadt, Germany lars.seipel@stud.h-da.de

More information

Addressed Issue. P2P What are we looking at? What is Peer-to-Peer? What can databases do for P2P? What can databases do for P2P?

Addressed Issue. P2P What are we looking at? What is Peer-to-Peer? What can databases do for P2P? What can databases do for P2P? Peer-to-Peer Data Management - Part 1- Alex Coman acoman@cs.ualberta.ca Addressed Issue [1] Placement and retrieval of data [2] Server architectures for hybrid P2P [3] Improve search in pure P2P systems

More information

Distributed Systems. 17. Distributed Lookup. Paul Krzyzanowski. Rutgers University. Fall 2016

Distributed Systems. 17. Distributed Lookup. Paul Krzyzanowski. Rutgers University. Fall 2016 Distributed Systems 17. Distributed Lookup Paul Krzyzanowski Rutgers University Fall 2016 1 Distributed Lookup Look up (key, value) Cooperating set of nodes Ideally: No central coordinator Some nodes can

More information

PAIS: A Proximity-aware Interest-clustered P2P File Sharing System

PAIS: A Proximity-aware Interest-clustered P2P File Sharing System : A Proximity-aware Interest-clustered P2P File Sharing System Haiying Shen Department of Computer Science and Computer Engineering University of Arkansas, Fayetteville, AR 7271 hshen@uark.edu Abstract

More information

SIL: Modeling and Measuring Scalable Peer-to-Peer Search Networks

SIL: Modeling and Measuring Scalable Peer-to-Peer Search Networks SIL: Modeling and Measuring Scalable Peer-to-Peer Search Networks Brian F. Cooper and Hector Garcia-Molina Department of Computer Science Stanford University Stanford, CA 94305 USA {cooperb,hector}@db.stanford.edu

More information

Range queries in trie-structured overlays

Range queries in trie-structured overlays Range queries in trie-structured overlays Anwitaman Datta, Manfred Hauswirth, Renault John, Roman Schmidt, Karl Aberer Ecole Polytechnique Fédérale de Lausanne (EPFL) CH-11 Lausanne, Switzerland Abstract

More information

Advanced Database Systems

Advanced Database Systems Lecture IV Query Processing Kyumars Sheykh Esmaili Basic Steps in Query Processing 2 Query Optimization Many equivalent execution plans Choosing the best one Based on Heuristics, Cost Will be discussed

More information

Chapter 13: Query Processing

Chapter 13: Query Processing Chapter 13: Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 13.1 Basic Steps in Query Processing 1. Parsing

More information

Chapter 12: Query Processing. Chapter 12: Query Processing

Chapter 12: Query Processing. Chapter 12: Query Processing Chapter 12: Query Processing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 12: Query Processing Overview Measures of Query Cost Selection Operation Sorting Join

More information

Peer-to-Peer (P2P) Systems

Peer-to-Peer (P2P) Systems Peer-to-Peer (P2P) Systems What Does Peer-to-Peer Mean? A generic name for systems in which peers communicate directly and not through a server Characteristics: decentralized self-organizing distributed

More information

Semantic Query Routing Experiences in a PDMS

Semantic Query Routing Experiences in a PDMS Semantic Query Routing Experiences in a PDMS Federica Mandreoli, Riccardo Martoglia, Wilma Penzo, and Simona Sassatelli DII University of Modena and Reggio Emilia, Italy {fmandreoli,rmartoglia,sassatelli}@unimo.it

More information

Locality in Structured Peer-to-Peer Networks

Locality in Structured Peer-to-Peer Networks Locality in Structured Peer-to-Peer Networks Ronaldo A. Ferreira Suresh Jagannathan Ananth Grama Department of Computer Sciences Purdue University 25 N. University Street West Lafayette, IN, USA, 4797-266

More information

DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN. Chapter 1. Introduction

DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN. Chapter 1. Introduction DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN Chapter 1 Introduction Modified by: Dr. Ramzi Saifan Definition of a Distributed System (1) A distributed

More information

The Encoding Complexity of Network Coding

The Encoding Complexity of Network Coding The Encoding Complexity of Network Coding Michael Langberg Alexander Sprintson Jehoshua Bruck California Institute of Technology Email: mikel,spalex,bruck @caltech.edu Abstract In the multicast network

More information

A Survey of Peer-to-Peer Systems

A Survey of Peer-to-Peer Systems A Survey of Peer-to-Peer Systems Kostas Stefanidis Department of Computer Science, University of Ioannina, Greece kstef@cs.uoi.gr Abstract Peer-to-Peer systems have become, in a short period of time, one

More information

Lecture 1: Conjunctive Queries

Lecture 1: Conjunctive Queries CS 784: Foundations of Data Management Spring 2017 Instructor: Paris Koutris Lecture 1: Conjunctive Queries A database schema R is a set of relations: we will typically use the symbols R, S, T,... to denote

More information

Similarity Queries on Structured Data in Structured Overlays

Similarity Queries on Structured Data in Structured Overlays Similarity Queries on Structured Data in Structured Overlays Marcel Karnstedt, Kai-Uwe Sattler, Manfred Hauswirth, Roman Schmidt Department of Computer Science and Automation, TU Ilmenau, Germany School

More information

Inference in Hierarchical Multidimensional Space

Inference in Hierarchical Multidimensional Space Proc. International Conference on Data Technologies and Applications (DATA 2012), Rome, Italy, 25-27 July 2012, 70-76 Related papers: http://conceptoriented.org/ Inference in Hierarchical Multidimensional

More information

Chapter 12: Query Processing

Chapter 12: Query Processing Chapter 12: Query Processing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Overview Chapter 12: Query Processing Measures of Query Cost Selection Operation Sorting Join

More information

Architectures for Distributed Systems

Architectures for Distributed Systems Distributed Systems and Middleware 2013 2: Architectures Architectures for Distributed Systems Components A distributed system consists of components Each component has well-defined interface, can be replaced

More information

A Chord-Based Novel Mobile Peer-to-Peer File Sharing Protocol

A Chord-Based Novel Mobile Peer-to-Peer File Sharing Protocol A Chord-Based Novel Mobile Peer-to-Peer File Sharing Protocol Min Li 1, Enhong Chen 1, and Phillip C-y Sheu 2 1 Department of Computer Science and Technology, University of Science and Technology of China,

More information

! A relational algebra expression may have many equivalent. ! Cost is generally measured as total elapsed time for

! A relational algebra expression may have many equivalent. ! Cost is generally measured as total elapsed time for Chapter 13: Query Processing Basic Steps in Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 1. Parsing and

More information

Chapter 13: Query Processing Basic Steps in Query Processing

Chapter 13: Query Processing Basic Steps in Query Processing Chapter 13: Query Processing Basic Steps in Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 1. Parsing and

More information