Efficient Management and Querying of RDF data in a P2P Framework


Dissertation for the degree of Doctor of Engineering (Dr.-Ing.) at the Faculty of Engineering (Technische Fakultät) of the Albert-Ludwigs-Universität Freiburg im Breisgau

Efficient Management and Querying of RDF data in a P2P Framework

by Liaquat Ali

Freiburg im Breisgau, 2014

Dean: Prof. Dr. Yiannos Manoli
Referees: Prof. Dr. Georg Lausen, Prof. Dr. Christian Schindelhauer
Date of the doctoral examination: 18 Sep. 2014

To: Zainab Haider & Abbas Ali


Table of Contents

List of Tables
List of Figures

Introduction
  Problem Definition
  Contribution
  Thesis Organization

Foundations
  Data Representation
  XML
  RDF
  RDF building blocks
  Formalization
  RDF Serialization
  Ontologies
  RDF Schema
  OWL
  SPARQL
  Peer-to-Peer Networks
  Chord
  3nuts

P2P based RDF Storing and Querying
  P2P based RDF data Management
  P2P based RDF stores Architecture
  Storing RDF triples in DHTs
  Query Evaluation in DHTs
  Evaluation of Atomic Triple Patterns
  Evaluation of Conjunctive Triple Pattern Queries
  Analysis of Query Processing Time
  3.5 Effects of Network Structure Improvements on Query Evaluation
  Information locality
  Network locality
  Interest locality
  Prototype Architecture
  3rdf architecture
  3rdf peer architecture
  Performance Analysis
  Discussion of Queries
  Query Q
  Query Q
  Query Q
  Query Q
  Analysis of Routing
  Routing Analysis for Query Q
  Routing Analysis for Query Q
  Routing Analysis for Query Q
  Routing Analysis for Query Q
  Analysis of Query Response Time
  Response Time Analysis for Query Q
  Response Time Analysis for Query Q
  Response Time Analysis for Query Q
  Response Time Analysis for Query Q
  Related Work
  Centralized RDF Data Management
  Jena
  Sesame
  3store
  RDF Data Management Using Vertical Partitioning
  Distributed RDF Data Management
  RDF stores using unstructured Peer-to-Peer Networks
  RDF stores using structured Peer-to-Peer Networks
  Summary

Load Balancing
  Related Work
  Limitations of Indexing on Individual Keys
  State-of-the-art individual keys index scheme
  Solution: Indexing on Compound Keys
  Improving Query Processing
  Performance Analysis
  Discussion of Queries
  Query Q
  Query Q
  Analysis of Load distribution
  Analysis of Query Load distribution for Query Q
  Analysis of Query Load distribution for Query Q
  Analysis of Storage Load distribution
  Analysis of Query Response Time
  Analysis of Query Transmission time for Query Q
  Analysis of Query Transmission time for Query Q
  Summary

Evaluating SPARQL Subqueries
  Related Work
  Rationale for Correlated Query Transformation
  Subquery Evaluation
  Abstract Syntax for SPARQL Subqueries
  Evaluation and Optimization of SPARQL Subqueries
  Evaluating Subqueries as graph patterns
  Evaluating Subqueries in filter constraints
  Query processing
  Summary

Conclusion and Future Work
  Conclusion
  Future Works

Bibliography

A Statutory Declaration (Eidesstattliche Versicherung)

List of Tables

3.1 Possible atomic query patterns
Three compound keys are needed to cover all triple patterns

List of Figures

1.1 Architecture of Peer to Peer based RDF data stores
Example of storage and evaluation RDF data on DHT based RDF data stores
RDF graph representation of triples in Listing
A 6-bit Chord network consisting of 8 peers
Example 3nuts overlay
Routing in 3nuts overlay
Example of random networks in 3nuts tree
A balanced search tree example in 3nuts network
Using P2P overlay for storing and querying RDF data
Storing triples in Listing 3.1 into a DHT-based RDF store of eight peers
Evaluating SPARQL query in Listing 3.2 over a DHT-based RDF store of eight peers
Evaluation of the query on peer n
Evaluation of the query on peer n
Evaluation of the query on peer n
RDF triples storage in 3nuts search tree
Routing shortcuts in 3rdf
Some sample RDF triples, possible triple patterns, and statistics of relevant keys kept at peer responsible for the key (Professor)
Example Peers Links Graph describing the relationship between instances of class Professor
Implementation overview of a 3rdf peer in a distributed environment
Measurement results for resolving the query in Listing
Hops needed for the evaluation of given query in Listing
Measurement results for resolving the query in Listing
Hops needed for the evaluation of given query in Listing
Measurement results for resolving the query in Listing
Hops needed for the evaluation of given query in Listing
3.18 Measurement results for resolving the query in Listing
Hops needed for the evaluation of given query in Listing
Measurement results for the performance boost using 3nuts localities
Measurement results for the performance boost using 3nuts localities
Measurement results for the performance boost using 3nuts localities
Measurement results for the performance boost using 3nuts localities
indexing on individual terms
Comparison of data distribution on two different 3-tuples indexes
peers load distribution
tuples indexing in DHTs
individual/compound indexes comparison
#triples/peer for individual/compound indexes
query chain in parallel query processing
Cumulative query processing load for query in Listing
Cumulative query processing load for query in Listing
Cumulative storage load
Comparison of Sequential and Parallel processing of the query in Listing
Comparison of Sequential and Parallel processing of the query in Listing
Performance Comparison of Queries in Listings 5.10 and

1 Introduction

The World Wide Web today is a huge hub of information, containing several billion documents that are used by more than 300 million users globally. This huge amount of information can play an effective role in our lives by helping us to make better decisions. However, the continued rapid growth of information on the Web makes it more and more difficult to find the relevant information that supports us in our tasks. The more information is available, the harder it is to find any particular part of it. Important information is often scattered across the Web, and keyword-based search engines such as Google and Yahoo are the main tools used on today's Web to retrieve the relevant documents. However, a serious problem with these search engines is that they return result sets with high recall but low precision: in addition to retrieving a large part of the relevant pages (high recall), they also retrieve irrelevant documents that contain the search terms in a different meaning (low precision). Often they also fail to retrieve relevant pages in which different terms with the same meaning have been used (low or no recall). Even if the relevant pages are located by search engines, human browsing and reading is required to extract the required information from these documents.

An important factor in providing better user support for data reuse is the extent to which the data is well structured. Data with a more regular and well-defined structure can be processed more easily by users' tools for reuse. The current Web consists to a large extent of unstructured or semi-structured text designed for human readability. Web sites are mainly created in HTML, which is better suited to structuring textual information than data.

The markup provided by HTML only specifies how information should be structured and presented on web pages, but does not refer to its content. This restriction makes the understanding and interpretation of information by machines almost impossible. To provide better user support for extracting useful information from the Web, a viable approach is to encode Web content in a machine-readable format and to use intelligent techniques in search engines to take advantage of these encodings. In May 2001, Tim Berners-Lee described his vision of the Semantic Web in [21] as an extension of the current Web where, in addition to being human readable, information on the Web is made more accessible by giving it well-defined meaning in a machine-processable way. The motive behind the Semantic Web vision is to establish a global data space (Web of Data) [22] by linking information from different sources together in order to improve search and data discovery on the Web.

The Resource Description Framework (RDF) [76] data model has been developed to represent data distributed across the Web and to provide a mechanism for linking this data. This mechanism specifies the existence and meaning of connections between items of data. Although a number of additional technologies and meta-data notation languages such as the Resource Description Framework Schema (RDFS) [27] and the Web Ontology Language (OWL) [78] are also part of the Semantic Web initiative, we mainly tackle the issue of efficient storage and retrieval of RDF data in this thesis.

The rapidly rising interest in the Semantic Web calls for the development of RDF data stores customized for the efficient management of such data. High-performance centralized RDF data stores such as Sesame [30], Jena [34] and 3store [51] have been developed that store and evaluate RDF data on a single machine.
These developments for the efficient management of RDF data on a single machine have been continued in many recent works on non-distributed RDF systems [101, 5, 82, 99, 95]. Although these centralized RDF data stores can handle data sets containing millions of triples, the continuously growing size of RDF data on the Web will exceed the capabilities of these stores. Centralized processing of huge amounts of data at Web scale would create an immediate bottleneck in terms of storage capacity and network throughput. In view of the inherently distributed nature of the Semantic Web and the aforementioned scalability issues of centralized solutions, there is a need for a distributed RDF infrastructure.

To cope with the anticipated load of Semantic Web data, several projects have emerged that propose using Peer-to-Peer (P2P) networks for the distributed management of RDF data. Such Peer-to-Peer based RDF data stores include our 3rdf system [8, 11, 9, 10], BabelPeers [17, 18, 58, 15, 16], Atlas [64, 73, 66, 74, 65], RDFPeers [31], GridVine [7], RDFCube [77], and many others. In keeping with the inherently distributed nature of the Semantic Web, these Peer-to-Peer based RDF data stores provide an RDF infrastructure that supports the real distribution of information sources. Instead of using centralized servers for the management of RDF data, the participating peers in the network collaboratively manage RDF data without any central control.

Figure 1.1 gives a conceptual overview of the architecture of Peer-to-Peer based RDF data stores. In the base layer, called the Internet layer in the figure, computers are connected to a physical network (e.g., the Internet or an intranet) to participate as peers in a logical Peer-to-Peer network. In the overlay layer, peers self-organize into a Peer-to-Peer overlay for the efficient storage and retrieval of RDF data. Finally, the semantic application layer at the top exploits the overlay layer to efficiently store and evaluate RDF data across the network. RDF triples are managed and stored in databases at the application layer, whereas the indexing and search facilities for these triples are provided by the overlay layer.

Figure 1.1: Architecture of Peer to Peer based RDF data stores.

1.1 Problem Definition

The distributed management of RDF data in Peer-to-Peer based RDF data stores alleviates many issues associated with centralized RDF stores (e.g., limited scalability, lack of fault tolerance, and bottlenecks). The responsibility for data management and query evaluation is distributed among many peers, which increases both the storage capacity and the CPU resources available to the system. However, distributing RDF data across the network imposes new challenges for the processing of complex RDF queries (e.g., conjunctive queries). An important factor for the performance of Peer-to-Peer based RDF data stores is therefore how to index (store) RDF data among the network peers so that queries can be processed efficiently. To tackle the issue of efficient distributed RDF query processing, we focus in this dissertation on how to store (index) RDF data and how to evaluate complex queries expressed in RDF query languages on top of a Peer-to-Peer network. In the context of the distributed evaluation of RDF queries, we consider the processing of basic graph patterns, which are a basic building block of the SPARQL query language [85].

Distributed Hash Tables (DHTs), originally introduced in [67] for relieving hot spots in the Internet, have become the data distribution method of choice for these Peer-to-Peer based distributed RDF data stores. The majority of state-of-the-art distributed RDF data stores, such as RDFPeers, Atlas, and BabelPeers, use DHTs as the overlay network to store and query RDF data in a distributed manner. To support an efficient search for RDF triples with the same subject, predicate, or object, each triple is indexed three times in these distributed RDF databases, once for each triple component (subject, predicate, and object).
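The three-fold indexing described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of any of the cited systems: the ring size `RING_BITS`, the peer names `peer-0` to `peer-7`, and the helper names `ring_hash`, `responsible_peer`, `index_triple` and `lookup` are all hypothetical, and SHA-1 is used merely as a convenient hash function.

```python
import hashlib
from bisect import bisect_left

RING_BITS = 16  # size of the identifier space; an arbitrary choice for this sketch


def ring_hash(term: str) -> int:
    """Hash a term (or a peer name) to an identifier on the ring."""
    digest = hashlib.sha1(term.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** RING_BITS)


def responsible_peer(key: int, peer_ids: list) -> int:
    """Successor rule: the first peer whose identifier is >= key, wrapping around."""
    position = bisect_left(peer_ids, key)
    return peer_ids[position % len(peer_ids)]


def index_triple(triple: tuple, peer_ids: list, store: dict) -> None:
    """Store one copy of the triple at the peer responsible for each component."""
    for term in triple:
        peer = responsible_peer(ring_hash(term), peer_ids)
        store.setdefault(peer, set()).add(triple)


def lookup(term: str, peer_ids: list, store: dict) -> set:
    """Ask only the peer responsible for `term` for the triples containing it."""
    peer = responsible_peer(ring_hash(term), peer_ids)
    return {t for t in store.get(peer, set()) if term in t}


# Eight hypothetical peers; the set removes (unlikely) duplicate identifiers.
peer_ids = sorted({ring_hash(f"peer-{i}") for i in range(8)})
triples = [
    ("S1", "rdf:type", "Student"),
    ("S1", "ub:name", "Alex"),
    ("S1", "ub:age", "22"),
]
store = {}
for triple in triples:
    index_triple(triple, peer_ids, store)

print(lookup("S1", peer_ids, store))       # every triple that mentions S1
print(lookup("ub:name", peer_ids, store))  # only the ub:name triple
```

Because every triple is stored under the hash of each of its components, a lookup by subject, predicate, or object contacts exactly one responsible peer, at the price of storing up to three copies of each triple.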
DHTs provide fair load balancing and easy data management under churn (peers entering or leaving the network only invoke local changes and take over data from, or shed data to, their neighbors). However, by hashing they also destroy the ordering of the index and, along with it, the grouping of semantically related data; e.g., data of a university domain cannot be stored on a contiguous interval and is spread over the complete table. This can cause more routing when collecting data from the same domain to evaluate a query. When the bandwidth of the underlying network in these distributed RDF data stores is high but the latency (ping) of each data transmission is significant, the response time of queries is dominated by these delays, because each routing step in the chain of query processing adds a further delay.

Example 1.1 Consider the indexing (storage) of RDF triples belonging to a particular university domain on a DHT-based RDF data store in Figure 1.2. We only consider the case where triples are indexed on their predicates. Although the predicates of the triples (name, , age), used as indexing keys, share a common namespace or prefix ub, the use of a hash function to map these keys to the peers in the network destroys the ordering of these keys and stores the corresponding related triples in a highly fragmented manner. This fragmentation of related triples causes more routing for collecting triples from distant peers (n3, n4, n7, n1), and consequently the execution of the query given in Figure 1.2 takes longer.

[Figure 1.2 shows a ring of eight peers n1 to n8, the query below, and the triples stored under each hashed predicate.]

  SELECT ?name, ? , ?age
  WHERE { ?s rdf:type Student
          ?s ub:name ?name
          ?s ub: ?
          ?s ub:age ?age }

  H(rdf:type): (S1, rdf:type, Student) (S2, rdf:type, Student) (S5, rdf:type, Student)
  H(ub:name): (S1, ub:name, Alex) (S2, ub:name, Ali)
  H(ub:age): (S1, ub:age, 22) (S3, ub:age, 24)
  H(ub: ): (S1, ub: , Alex@abc.com) (S4, ub: , Ali@abc.com)

Figure 1.2: Example of storage and evaluation RDF data on DHT based RDF data stores.

While state-of-the-art distributed RDF data stores aim to achieve the principle of data independence [59] and focus on enhancing query processing at the application layer, we see real potential in the interaction of the application and the network. We investigate in this thesis how the distributed evaluation of RDF queries can be optimized by improving the structure of the underlying overlay network.

We observe another problem, load imbalance, when RDF triples are partitioned based on the indexing of their subject, predicate, and object terms in state-of-the-art distributed RDF data stores. The frequency distribution of terms in RDF triples is highly skewed: some URIs and literals occur very often while others occur only rarely. This term-based triple partitioning raises the question of scalability for huge data sets; e.g., the peer responsible for triples with the predicate rdf:type can easily be overloaded, resulting in poor performance. Such highly loaded peers receive too many query requests (they become bottlenecks) and consequently slow down the evaluation of queries. We also address the problem of load imbalance in this thesis and investigate new triple partitioning techniques to obtain a better distribution of data and query load.

As RDF triples in Peer-to-Peer based RDF systems are stored on multiple peers, the distributed evaluation of correlated SPARQL queries, where the inner query block is evaluated once for each solution of the outer query, may become very expensive in terms of query response time. We address the problem of evaluating SPARQL 1.1 subqueries and their proposed extensions in a Peer-to-Peer environment, and propose algorithms that transform correlated queries into equivalent uncorrelated ones, which makes the distributed evaluation of such queries efficient.

1.2 Contribution

Within this thesis, we propose solutions to overcome the aforementioned adverse effect that the use of hash functions in the underlying network has on the routing time of queries. We also investigate new triple partitioning techniques to mitigate the problem of load imbalance caused by partitioning triples on their individual terms (subject, predicate, object).

We propose the use of a search-tree based Peer-to-Peer network for the efficient distributed management and evaluation of RDF data. The underlying overlay network we use in our work provides features that allow the network structure to be adapted to the search structure in order to reduce communication time and traffic on the application layer.
It provides a distributed search tree for order-preserving indexing: domain-related prefixes (namespaces) in the subjects, predicates, and objects of RDF triples place triples of the same domain in the same branches of the search tree; e.g., triples indexed on the predicates (ub:name, ub: , ub:age) in Figure 1.2 are placed in the same branches. As a result, triples belonging to the same domain (e.g., triples sharing the namespace ub in their subject, predicate, or object) are stored on nearby peers, and consequently the lookup operations for the evaluation of these triples need fewer hops (less routing time). We will also show how we can further speed up the evaluation of queries and reduce the network traffic by creating routing shortcuts between triple components that tend to be queried together in RDF queries. These routing shortcuts can be established in the underlying network with only a constant increase in the size of the peers' routing tables.
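The locality argument can be illustrated with a small, hedged Python sketch. It does not model the 3nuts search tree; order-preserving placement is approximated by cutting the sorted key list into contiguous chunks, and the key set, the number of peers, and the helper names `hash_peer`, `range_peer` and `peers_touched` are assumptions made only for this illustration.

```python
import hashlib


def hash_peer(key: str, n_peers: int) -> int:
    """DHT-style placement: the peer is chosen from a hash of the key."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % n_peers


def range_peer(key: str, sorted_keys: list, n_peers: int) -> int:
    """Order-preserving placement: contiguous chunks of the sorted keys per peer."""
    chunk = -(-len(sorted_keys) // n_peers)  # ceiling division
    return sorted_keys.index(key) // chunk


# Illustrative predicate keys from three namespaces (hypothetical data).
keys = sorted([
    "foaf:knows", "foaf:name", "rdf:type", "rdfs:label",
    "ub:advisor", "ub:age", "ub:name", "ub:takesCourse",
])
N_PEERS = 4


def peers_touched(prefix: str, place) -> set:
    """Peers that must be contacted to reach every key with the given prefix."""
    return {place(key) for key in keys if key.startswith(prefix)}


ordered = peers_touched("ub:", lambda k: range_peer(k, keys, N_PEERS))
hashed = peers_touched("ub:", lambda k: hash_peer(k, N_PEERS))

print("order-preserving placement:", sorted(ordered))  # neighbouring peers only
print("hashed placement:", sorted(hashed))             # depends on the hash values
```

With order-preserving placement, the four `ub:` keys occupy adjacent positions and therefore land on neighbouring peers, whereas the hashed placement carries no such guarantee and may scatter them over the whole network.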

Publication 1 L. Ali, T. Janson, and G. Lausen: 3rdf: Storing and Querying RDF data on Top of the 3nuts Overlay Network, In 10th International Workshop on Web Semantics, Toulouse, France, August 2011, vol. 0, pp

Publication 2 L. Ali, T. Janson, G. Lausen, and C. Schindelhauer: Effects of Network Structure Improvement on Distributed RDF Querying, In 6th International Conference on Data Management in Cloud, Grid and P2P Systems (Globe 2013), Prague, Czech Republic, September 2013, Springer-Verlag Berlin Heidelberg 2013, LNCS 8059, pp

The frequency of subject, predicate, and object occurrences in triples is not uniformly distributed: while the majority of URIs and literals occur very rarely, some of them occur very frequently (e.g., rdf:type as predicate). Using these individual terms as index keys for the partitioning of triples in Peer-to-Peer networks results in a very unfair load distribution (e.g., the peer responsible for rdf:type is subjected to a very high storage load). We propose a triple partitioning technique to balance the query and storage load of peers. The basic idea is to extend the index keys such that the sets of triples with the same key become smaller and the load balancing of the Peer-to-Peer network performs better. We will see in the relevant section that this method of fair triple partitioning is not practical in DHT-based Peer-to-Peer networks.

Publication 3 L. Ali, T. Janson, and C. Schindelhauer: Towards Load Balancing and Parallelizing of RDF Query Processing in P2P Based Distributed RDF Data Stores, In 22nd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP 2014), Turin, Italy, February 2014, vol. 0, pp

The introduction of subqueries and negation is among the most interesting features included in the latest SPARQL 1.1 [53] specification. Existing Peer-to-Peer based RDF storage and querying systems have not studied the evaluation of these newly included SPARQL query features.
The evaluation of subqueries in a Peer-to-Peer environment may be very inefficient and expensive in terms of query response time, particularly for correlated queries, where the inner query block is evaluated once for each solution of the outer query. In this thesis, we also study the problem of evaluating SPARQL 1.1 subqueries and their proposed extensions, such as subqueries in filter constraints, over RDF data stored in Peer-to-Peer networks. We apply optimization techniques based on the idea of transforming correlated queries into equivalent uncorrelated ones in order to make the distributed evaluation of these queries efficient.
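The cost argument behind this transformation can be made concrete with a small Python sketch. It is a generic illustration of decorrelation and not the transformation algorithm of this thesis: the class `CountingStore`, the predicate `ub:takesCourse`, and the sample triples are hypothetical, and each call to `match` stands in for one lookup that would have to be routed through the network.

```python
from collections import Counter


class CountingStore:
    """In-memory triple store that counts pattern lookups (stand-ins for remote calls)."""

    def __init__(self, triples):
        self.triples = list(triples)
        self.lookups = 0

    def match(self, s=None, p=None, o=None):
        """Return all triples matching the pattern; None acts as a wildcard."""
        self.lookups += 1
        return [
            t for t in self.triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        ]


def courses_per_student_correlated(store):
    """Inner block evaluated once for every solution of the outer block."""
    result = {}
    for student, _, _ in store.match(p="rdf:type", o="Student"):
        result[student] = len(store.match(s=student, p="ub:takesCourse"))
    return result


def courses_per_student_decorrelated(store):
    """Inner block evaluated once, grouped by subject, then joined with the outer block."""
    students = [s for s, _, _ in store.match(p="rdf:type", o="Student")]
    counts = Counter(s for s, _, _ in store.match(p="ub:takesCourse"))
    return {student: counts.get(student, 0) for student in students}


DATA = [
    ("S1", "rdf:type", "Student"),
    ("S2", "rdf:type", "Student"),
    ("S3", "rdf:type", "Student"),
    ("S1", "ub:takesCourse", "C1"),
    ("S1", "ub:takesCourse", "C2"),
    ("S2", "ub:takesCourse", "C1"),
    ("X1", "ub:takesCourse", "C3"),  # not a student, must not appear in the result
]

correlated_store = CountingStore(DATA)
decorrelated_store = CountingStore(DATA)
correlated_result = courses_per_student_correlated(correlated_store)
decorrelated_result = courses_per_student_decorrelated(decorrelated_store)

print(correlated_result, correlated_store.lookups)
print(decorrelated_result, decorrelated_store.lookups)
```

Both variants return the same answer, but the correlated form issues one lookup per outer solution in addition to the outer lookup, while the decorrelated form always issues two lookups. In a distributed setting every saved lookup also saves the routing hops needed to reach the responsible peer.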

Publication 4 L. Ali and G. Lausen: Evaluating SPARQL Subqueries Over P2P Overlay Networks, In 11th International Workshop on Web Semantics, Vienna, Austria, September 2012, vol. 0, pp

1.3 Thesis Organization

The rest of the thesis is organized as follows. Chapter 2 gives the foundations of the relevant Semantic Web technologies and Peer-to-Peer networks that enable us to achieve our goal of building an efficient Peer-to-Peer based RDF system. This includes a discussion of RDF(S), ontologies, and the query language SPARQL, as well as a description of the state of the art in Peer-to-Peer systems. Chapter 3 constitutes the first main chapter of the thesis and shows the effects of network structure improvements on the optimization of query evaluation in a Peer-to-Peer environment. In this chapter, we present the management of RDF data in state-of-the-art DHT-based RDF stores and contrast it with our tree-based RDF system. Chapter 4 deals with the problem of load imbalance caused by the term-based partitioning of triples in state-of-the-art Peer-to-Peer based RDF stores. In this chapter, we present solutions for load balancing based on the idea of partitioning triples on combinations of their terms. In Chapter 5 we study the efficient evaluation of SPARQL subqueries in a Peer-to-Peer environment. Finally, Chapter 6 concludes the work with a summary of the contributions.

2 Foundations

In this chapter, we discuss the relevant Semantic Web technologies and Peer-to-Peer networks that are used within this work to establish the foundation of an efficient Peer-to-Peer based distributed RDF data store. The chapter comprises two sections. In the first section, we first show why RDF is a suitable data model for representing Web content. Then, after a general introduction to the idea of ontologies and their representation, we discuss SPARQL, the most prominent query language for RDF data. In the second section, we describe the Peer-to-Peer systems that are used as overlay networks in our work for the distributed management of RDF data. The way peers in the underlying Peer-to-Peer network are organized and used to locate and access RDF triples is an important aspect of distributed RDF data stores. In this respect, we discuss the features of Peer-to-Peer systems on the basis of their network structure, i.e., structured networks and unstructured networks. As the main representatives of structured overlay networks, we explain Chord (a DHT-based network) and 3nuts (a search-tree based network).

2.1 Data Representation

The main problem in processing and interpreting Web information is the representation of data on the Web. The information on the current Web is weakly structured and consists of free and structured text, images, audio and video. The information is formatted for human readers using HTML. The markup provided by HTML makes it possible to structure and present highly heterogeneous information, but does not refer to the content of the information provided. This is what we mean when we say that the content of the Web is not machine processable. The job of understanding the content and extracting useful information from it is left to the user. Consider the HTML code in Listing 2.1, which describes the offered course Databases and Information Systems.

  <h2>Databases and Information Systems</h2>
  <i>by Prof. Dr. Georg Lausen</i><br>
  in WS2008/2009<br>
  at DBIS

Listing 2.1: HTML code describing the offered course Databases and Information Systems.

A human reader can understand this information quite easily, but machines will have problems retrieving the teacher's name, the semester in which the course is offered, and the name of the chair. This is because the given HTML code does not contain structural information, i.e., information about the pieces of the code and their relationships.

The Semantic Web approach to solving these problems is to replace HTML with more appropriate languages that represent Web content in a more machine-processable form. The Extensible Markup Language (XML) [26] and the Resource Description Framework (RDF) [76] are among the most popular standard languages proposed by the W3C [2] to represent semi-structured data on the Web. We briefly describe XML and then focus on the RDF data model.

XML

XML, a tree-like data representation, was proposed to overcome the shortcomings of HTML, which provides only a fixed set of tags for the visual representation of information. XML, on the other hand, allows users to define their own tags in order to indicate the type of content annotated by a tag. An XML document is far more machine-processable because every piece of information and its relations are described through user-defined tags. For example, consider the XML code in Listing 2.2.
The <name>, <teacher>, <semester> and <chair> tags appear within the <course> tags, describing the properties (attributes) of a particular offered course. A machine processing this XML document can deduce that the name, teacher, semester and chair elements refer to the enclosing course element.
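As a hedged illustration of this deduction, the following Python sketch parses the course document of Listing 2.2 with the standard library module `xml.etree.ElementTree` and collects the child elements of the enclosing course element. The constant `COURSE_XML` and the helper `course_properties` are names introduced only for this example.

```python
import xml.etree.ElementTree as ET

# The course description of Listing 2.2, embedded as a string for this sketch.
COURSE_XML = """
<course>
  <name>Databases and Information Systems</name>
  <teacher>Prof. Dr. Georg Lausen</teacher>
  <semester>WS2008/2009</semester>
  <chair>DBIS</chair>
</course>
"""


def course_properties(xml_text: str) -> dict:
    """Map each child tag of the root element to its text content."""
    root = ET.fromstring(xml_text.strip())
    return {child.tag: (child.text or "").strip() for child in root}


properties = course_properties(COURSE_XML)
for tag, value in properties.items():
    print(f"{tag}: {value}")
```

The program can relate every property to the course only because the nesting happens to follow this particular convention; nothing in XML itself states what the nesting means, which is the limitation discussed next.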

  <course>
    <name> Databases and Information Systems </name>
    <teacher> Prof. Dr. Georg Lausen </teacher>
    <semester> WS2008/2009 </semester>
    <chair> DBIS </chair>
  </course>

Listing 2.2: XML code describing the offered course Databases and Information Systems.

Document Type Definition (DTD) [26] and XML Schema [42, 44, 84] are schema languages that provide means of defining the structure of XML documents, emphasizing the idea of exchanging data on the Web in a structured way. Query languages such as XPath [38] and XQuery [25] are used to access and retrieve parts of an XML document. As described above, XML provides means to define the structure of data through schema languages; however, it lacks means to express the semantics (meaning) of data. For example, there is no semantics associated with the nesting of tags, and consequently the interpretation of the nesting is left to the applications. The three different XML representations in Listing 2.3 express the same statement, Georg Lausen is a teacher of Databases and Information Systems, which shows that there is no standard way to assign meaning to the nesting of tags.

  <course name="Databases and Information Systems">
    <teacher> Georg Lausen </teacher>
  </course>

  <teacher name="Georg Lausen">
    <teaches> Databases and Information Systems </teaches>
  </teacher>

  <teachingoffering>
    <course> Databases and Information Systems </course>
    <teacher> Georg Lausen </teacher>
  </teachingoffering>

Listing 2.3: 3 XML representations of the statement Georg Lausen is a teacher of Databases and Information Systems.

This shortcoming of XML prepared the ground for the RDF data model, which provides a simple way to define the semantics (meaning) of data.

RDF

A goal of the Semantic Web initiative, presented in [21], is to integrate data from Web resources into machine-driven evaluation. The Resource Description Framework (RDF) [76] is the data model that has been proposed by the W3C [2] to encode such heterogeneously structured data originating from multiple sources. The use of URIs (Universal Resource Identifiers) as a global naming convention in RDF makes it possible to express information about things on the Web and their relationships. As pointed out by Shadbolt et al. [94], this approach shifts the trend from a document-centric Web representation to a data-centric Web representation. The structured data locked in relational databases can be unlocked by making it globally addressable on the Web.

The basic building block of RDF is a statement. The sentence about Lausen in the previous section, Georg Lausen is a teacher of Databases and Information Systems, is such a statement. Statements in RDF are expressed as so-called triples, where each triple is of the form (subject, predicate, object). The subject is the URI that identifies the resource about which the statement is made. The object is either a literal value or the URI of another resource that is related to the subject. The predicate represents the kind of relation between the subject and the object. Consider the RDF statements about the resources GraduateStudent7 and GraduateStudent8 in Listing 2.4, which contains seven RDF triples.

  @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
  @prefix ubd0v0: <http://www.lehigh.edu/zhp2/2004/0401/univ-bench. ...> .
  @prefix d0v0: <http://www.Department0.University0.edu#> .

  d0v0:G7 rdf:type ubd0v0:GraduateStudent .
  d0v0:G7 ubd0v0:name "Peter" .
  d0v0:G7 ubd0v0: "peter@ub.com" .
  d0v0:G8 rdf:type ubd0v0:GraduateStudent .
  d0v0:G8 ubd0v0:name "Thomas" .
  d0v0:Course1 rdf:type ubd0v0:Course .
  d0v0:G8 ubd0v0:takescourse d0v0:Course1 .

Listing 2.4: Example RDF statements about resources GraduateStudent7 and GraduateStudent8 encoded in RDF/N3 format.

The triples describe two graduate students named Peter and Thomas.
The students are identified by the URIs d0v0:G7 and d0v0:G8, whereas their names are represented as string values (literals). For the first student (d0v0:G7), information about his name and address is provided, whereas for the second student (d0v0:G8) there is no information about his address; instead, in addition to his name, the course he takes is given. RDF can model and integrate data from various domains through the use of vocabulary from different namespaces, e.g., the standard RDF namespace rdf and the user-defined namespaces d0v0 and ubd0v0 in the RDF triples in Listing 2.4. In addition, the example triples show that RDF provides a very flexible data format without imposing a rigid a priori restriction on the structure of data. Although both graduate students, d0v0:G7 and d0v0:G8, are of the same type ubd0v0:GraduateStudent, they come with different structures: one of them comes with a name and an address attribute, whereas for the other one the name and takescourse are provided. This means that it is not necessary to model data using a fixed schema, as required in the relational data model. This characteristic makes RDF a flexible and suitable data model for representing semi-structured and unstructured data.

A set of RDF triples can also be represented as a labeled directed graph, where subjects and objects are the nodes of the graph and each triple is a directed edge (arc) that connects the subject with the object. Figure 2.1 shows the graph corresponding to the RDF triples in Listing 2.4, where URI-type subject and object nodes are represented by ellipses and objects with string literals are enclosed in quotation marks. As we will see later in the formal definition of RDF triples, subject and predicate positions are represented by URIs (ellipses and edges in the graph), and objects can be represented either by URIs or by literals such as strings or numbers. An object position carrying a URI represents a resource and may be linked (by edges) to other resources. For example, the object with the URI d0v0:Course1 in the given graph represents the value of the predicate ubd0v0:takescourse for the resource d0v0:G8, and is further connected to another resource, ubd0v0:Course, through the predicate rdf:type.

RDF building blocks

The sets of URI references (URIs), literals, and blank nodes form the basis for describing RDF data. We now give a formal presentation of RDF with the help of these disjoint sets of URIs, literals and blank nodes.

URI references: A URI from the set U of URIs [3] is a globally scoped identifier that identifies a logical or physical resource. URI references are expressed as ASCII strings consisting of a namespace and an optional fragment identifier.
For example, the URI http://www.w3.org/1999/02/22-rdf-syntax-ns#type is composed of the namespace http://www.w3.org/1999/02/22-rdf-syntax-ns# and the identifier type. As many URIs share common prefixes, we typically use prefix notation to denote URIs for the sake of brevity and readability. In the above example the prefix rdf: can be used for the namespace, which allows us to write rdf:type for the given URI.

Literals: The set of literals L consists of values such as numbers, strings or dates. Literals are mainly used as object values to describe properties of resources, for example the name or age of a person. To distinguish their representation from URIs and

blank nodes they are enclosed in quotation marks. RDF [71] distinguishes between plain literals and typed literals. Plain literals consist of a Unicode string combined with an optional language tag, whereas typed literals consist of a Unicode string combined with a data type URI. The data type URI identifies the data type of the literal, e.g. "10"^^xsd:integer (footnote 1).

[Figure 2.1: RDF graph representation of triples in Listing 2.4.]

Blank nodes: Elements from the set of blank nodes B represent resources that do not hold a globally scoped label (fixed URI). As described in the RDF semantics document [57], blank nodes can be considered as existential variables that may be used to identify resources for which the URI is not given. They hold a locally scoped identifier in an RDF document. The use of blank nodes in RDF is a big hurdle for merging data from multiple sources, because there is no URI that can be used as a common key.

Footnote 1: The XML Schema datatypes specification is used to define datatype URIs for common data types such as integers, floating point numbers and dates.

Formalization

Given the sets U, B and L of URIs, blank nodes and literals respectively, a triple $(s, p, o) \in (U \cup B) \times U \times (U \cup B \cup L)$ is called an RDF triple. The first component s of an RDF triple is called its subject, the second component p its predicate, and the third component o its object. From the definition of an RDF triple we can note that literals can only be assigned as object values, blank nodes may appear as subject or object, and URIs can appear in all three positions of an RDF triple.

RDF Serialization

RDF as an abstract data model needs a concrete syntax in order to be represented and transmitted over the Web. There are different ways of serializing RDF, among which the most common formats are N-Triples, Notation 3 (N3), and RDF/XML.

N-Triples: This is the simplest way of serializing RDF [46]. It encodes RDF data by listing its triples one by one, each terminated by a trailing period.

Notation 3 (N3): It extends the N-Triples format with special constructs that allow triples to be encoded in a more compact way [20], e.g. support for namespace prefixes and a listing syntax that lets triples share common subjects, or common subjects and predicates.

RDF/XML: It is a complex but probably the most prominent RDF serialization format [19], standardized by the W3C. Being based on XML, this representation inherits the benefits associated with XML. RDF data is encoded as an XML tree by using nested XML tags. Resources in RDF data are represented by either rdf:Description XML tags or literals. The attributes rdf:about and rdf:resource are used in these XML tags to name RDF resources and to refer from one resource to another. The RDF/XML format was primarily designed to be processed by machines and is now widely supported by tools that consume Linked Data [22].

Resource Description Framework in attributes (RDFa): In this serialization method RDF triples are embedded in HTML documents [60].
The use of RDFa markup in HTML code makes it possible to expose structured data on the Web.

RDF provides the possibility to make statements about statements through a mechanism known as reification. It can be used to describe belief in other statements or to write meta-data about individual RDF statements. In the reification process a unique identifier is assigned to each statement, which can

then be used to refer to the statement. The other vocabulary used in reification includes rdf:Statement, rdf:type, rdf:subject, rdf:predicate and rdf:object. An example is given in Listing 2.5. It contains a reification of the statement (d0v0:g8 ubd0v0:takescourse d0v0:course1).

    @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix ubd0v0: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#> .
    @prefix d0v0: <http://www.Department0.University0.edu#> .
    _:1 rdf:type rdf:Statement .
    _:1 rdf:subject d0v0:g8 .
    _:1 rdf:predicate ubd0v0:takescourse .
    _:1 rdf:object d0v0:Course1 .

Listing 2.5: Example for RDF reification of the triple (d0v0:g8 ubd0v0:takescourse d0v0:course1).

Reification did not become popular for publishing meta-data about RDF triples, because reified statements induce a huge overhead and are cumbersome to query with the SPARQL query language [85].

RDF also provides vocabulary for creating so-called container and collection classes. They are used to collect a number of resources about which we may want to make statements as a whole. For example, for the triples in Listing 2.4 we may want to talk about the courses taken by graduate student d0v0:g8. RDF provides three types of container classes, namely rdf:Bag, rdf:Seq and rdf:Alt. The difference between these containers is that the order of the contained elements is irrelevant for rdf:Bag and relevant for rdf:Seq, whereas rdf:Alt provides alternative values for some property. In addition, a set of predefined predicates rdf:_1, rdf:_2, ... is provided to connect elements to their corresponding containers. As an example, consider the set of triples in Listing 2.6, which describe the courses taken by a particular graduate student.

    @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix ubd0v0: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#> .
    @prefix d0v0: <http://www.Department0.University0.edu#> .
    d0v0:G8 ubd0v0:takescourses d0v0:Courses .
    d0v0:Courses rdf:type rdf:Bag .
    d0v0:Courses rdf:_1 d0v0:Course1 .
    d0v0:Courses rdf:_2 d0v0:Course3 .
    d0v0:Courses rdf:_3 d0v0:Course5 .

Listing 2.6: Example for the RDF container type rdf:Bag.

Collections differ from containers in that they represent closed sets. A collection is represented as a linked list (of type rdf:List), where the head and the tail of the list are represented by rdf:first and rdf:rest respectively. The evaluation of SPARQL queries over RDF data that contain container and collection classes is also problematic.
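The linked-list representation of collections can be made concrete with a small sketch. It assumes triples are plain Python tuples of strings and that the list is terminated by the standard rdf:nil resource; the blank node labels `_:l1` to `_:l3` and the helper name `collection_items` are hypothetical and only serve the illustration.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FIRST, REST, NIL = RDF + "first", RDF + "rest", RDF + "nil"


def collection_items(triples, head):
    """Walk an rdf:List starting at `head` and return its members in order."""
    # Index only the list-related triples: (node, predicate) -> object.
    index = {(s, p): o for (s, p, o) in triples if p in (FIRST, REST)}
    items, node, seen = [], head, set()
    while node != NIL:
        # A well-formed list node has exactly one rdf:first and one rdf:rest.
        if node in seen or (node, FIRST) not in index or (node, REST) not in index:
            raise ValueError(f"malformed collection at {node!r}")
        seen.add(node)
        items.append(index[(node, FIRST)])
        node = index[(node, REST)]
    return items


# Hypothetical closed list holding the three courses used in Listing 2.6.
triples = [
    ("_:l1", FIRST, "d0v0:Course1"), ("_:l1", REST, "_:l2"),
    ("_:l2", FIRST, "d0v0:Course3"), ("_:l2", REST, "_:l3"),
    ("_:l3", FIRST, "d0v0:Course5"), ("_:l3", REST, NIL),
]
print(collection_items(triples, "_:l1"))
```

Retrieving all members requires following one rdf:rest link per element, which hints at why such structures are awkward to query with a fixed number of triple patterns.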

Ontologies

RDF as an abstract data model provides the facility to describe resources using triples (subject, predicate, object). For this purpose RDF provides a shared vocabulary (e.g. rdf:type, rdf:Property), relying on a common understanding of resources. The problem with resource description in RDF is that much information about resources remains implicit. When we say that "Georg Lausen is a teacher of Databases and Information Systems", we also have in mind that "Georg Lausen is an academic staff member". This kind of implicit information cannot be made explicit in RDF. The function of explicating implicit and hidden knowledge is served by ontologies. This corresponds to one of the most cited definitions of ontologies, by Tom Gruber [47]: "An ontology is an explicit specification of a conceptualization."

In general, an ontology defines the concepts (classes of objects) in a domain of discourse and the relationships between these concepts. For example, in a university domain, students, teachers, staff members and courses are important concepts, and we can relate teachers and staff members with a subclass relationship, i.e. teacher is a subclass of academic staff member. These relationships between objects can be used to derive the implicit information contained in the data. For example, given the information "Georg Lausen is a teacher of Databases and Information Systems", the above subclass relationship makes the hidden knowledge "Georg Lausen is an academic staff member" explicit. Apart from describing relationships between objects, ontologies describe the properties of objects and specify the values of these properties. Further, they enable the exchange and integration of data from heterogeneous Web sources by providing a shared and common understanding of a domain.
The list of ontologies used in the context of the Semantic Web includes FOAF [28], an ontology describing persons, their social activities and connections; Dublin Core [62], used for describing bibliographic entities; and Gene Ontology [14], which describes biological and molecular processes. In the following, we concentrate on two of the most important ontology languages, RDF Schema and OWL.

RDF Schema

RDF is a data modeling language that provides a syntax specification and a minimalistic set of vocabularies for describing resources of any application domain. However, it does not define the semantics of that particular domain. It is extended by the RDF Schema specification (RDFS) [27], which provides additional vocabulary with predefined semantics. The importance of RDFS can be illustrated with the help of an example. Consider the following information about academic staff members and a course offered in a particular university domain, encoded in XML

and RDF format, in Listings 2.7 and 2.8 respectively.

    <academicstaffmember> Bernd Becker </academicstaffmember>
    <professor> Peter Thiemann </professor>
    <course name="DBIS">
      <teacher> Georg Lausen </teacher>
    </course>

Listing 2.7: XML code describing academic staff members and the offered course DBIS.

    @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix ubd0v0: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#> .
    @prefix d0v0: <http://www.Department0.University0.edu#> .
    d0v0:P7 rdf:type ubd0v0:AcademicStaffMember .
    d0v0:P7 ubd0v0:name "Bernd Becker" .
    d0v0:P8 rdf:type ubd0v0:Professor .
    d0v0:P8 ubd0v0:name "Peter Thiemann" .
    d0v0:C1 ubd0v0:name "DBIS" .
    d0v0:P9 ubd0v0:teaches d0v0:C1 .
    d0v0:P9 ubd0v0:name "Georg Lausen" .

Listing 2.8: Example RDF statements describing academic staff members and the offered course DBIS.

Suppose we are interested in all academic staff members. We can find them by evaluating the following XPath query or atomic triple pattern query on the corresponding XML or RDF data in Listings 2.7 and 2.8 respectively.

    //academicstaffmember
    ?x rdf:type ubd0v0:AcademicStaffMember

The result of both queries includes only Bernd Becker as an academic staff member, which is semantically unsatisfactory. The result should also include Peter Thiemann and Georg Lausen, because professors are also academic staff members and only academic staff members can teach a course. This kind of implicit information relates to the semantics of a particular domain and cannot be represented in XML or RDF. We will see later in this section that the primitives provided by RDFS for describing relationships between entities (such as rdfs:subClassOf and rdfs:subPropertyOf) and for restricting property domains and ranges (rdfs:domain, rdfs:range) are extremely useful means for deriving the implicit information contained in RDF data.
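How such primitives can recover the missing answers can be illustrated with a minimal sketch. It assumes three schema statements (professors are a subclass of academic staff members, and the property teaches has academic staff members as domain and courses as range) and applies only the corresponding three entailment rules by naive fixpoint iteration. The tuple encoding and the function name `rdfs_closure` are illustrative assumptions and not part of any RDFS library.

```python
TYPE = "rdf:type"
SUBCLASS, DOMAIN, RANGE = "rdfs:subClassOf", "rdfs:domain", "rdfs:range"


def rdfs_closure(data, schema):
    """Apply the subclass, domain and range rules until no new triple appears."""
    triples = set(data)
    changed = True
    while changed:
        new = set()
        for (s, p, o) in triples:
            for (a, rel, b) in schema:
                if rel == SUBCLASS and p == TYPE and o == a:
                    new.add((s, TYPE, b))      # instance of a is instance of b
                elif rel == DOMAIN and p == a:
                    new.add((s, TYPE, b))      # subject of a has type b
                elif rel == RANGE and p == a:
                    new.add((o, TYPE, b))      # object of a has type b
        changed = not new <= triples
        triples |= new
    return triples


# The triples of Listing 2.8 (literals written as plain strings).
data = [
    ("d0v0:P7", TYPE, "ubd0v0:AcademicStaffMember"),
    ("d0v0:P7", "ubd0v0:name", "Bernd Becker"),
    ("d0v0:P8", TYPE, "ubd0v0:Professor"),
    ("d0v0:P8", "ubd0v0:name", "Peter Thiemann"),
    ("d0v0:C1", "ubd0v0:name", "DBIS"),
    ("d0v0:P9", "ubd0v0:teaches", "d0v0:C1"),
    ("d0v0:P9", "ubd0v0:name", "Georg Lausen"),
]
# Assumed schema statements for the university domain.
schema = [
    ("ubd0v0:Professor", SUBCLASS, "ubd0v0:AcademicStaffMember"),
    ("ubd0v0:teaches", DOMAIN, "ubd0v0:AcademicStaffMember"),
    ("ubd0v0:teaches", RANGE, "ubd0v0:Course"),
]
closure = rdfs_closure(data, schema)
staff = sorted(s for (s, p, o) in closure
               if p == TYPE and o == "ubd0v0:AcademicStaffMember")
print(staff)
```

With the derived triples in place, the atomic triple pattern for academic staff members matches all three persons instead of only the first one.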

RDFS provides vocabularies for describing the concepts of a domain in terms of classes and their properties. It then provides the opportunity to define the individual objects of that domain as instances of these classes. A resource can be declared to be a class by typing it as an instance of rdfs:Class using the predicate rdf:type. For instance, the resource ubd0v0:AcademicStaffMember in Listing 2.8 can be declared as ubd0v0:AcademicStaffMember rdf:type rdfs:Class. In the same way, we can assert an object to be an instance of this class, e.g. d0v0:P7 rdf:type ubd0v0:AcademicStaffMember. The following primitives are also provided by RDFS for describing relationships between classes and between properties. The RDF Schema entailment rules [57] use these relational primitives to derive new information from given RDF data.

rdfs:subClassOf is a relation that allows one class to be defined as a subclass of another, e.g. the class ubd0v0:Professor in Listing 2.8 can be declared a subclass of ubd0v0:AcademicStaffMember. With this statement, all instances of class ubd0v0:Professor also become instances of class ubd0v0:AcademicStaffMember. This introduces a new triple d0v0:P8 rdf:type ubd0v0:AcademicStaffMember into the given set of RDF triples.

rdfs:subPropertyOf is used between properties to state that resources related by one property are also related by another. For example, the property hassister can be stated as a subproperty of hassibling, i.e. hassister rdfs:subPropertyOf hassibling. From this statement we can deduce that if a person is related to another by the hassister property, then it is also related to the other by the hassibling property.

rdfs:domain is used to specify the domain of a property, i.e. the resource which has the given property is an instance of one or more classes.
For instance, specifying the domain of property ubd0v0:teaches (ubd0v0:teaches rdfs:domain ubd0v0:AcademicStaffMember) asserts that all resources having the property ubd0v0:teaches are instances of ubd0v0:AcademicStaffMember. This adds a new triple d0v0:P9 rdf:type ubd0v0:AcademicStaffMember to the triples of Listing 2.8.

rdfs:range is used to fix the range of a property, i.e. all values of a given property are instances of one or more classes. For example, the range restriction (ubd0v0:teaches rdfs:range ubd0v0:Course) states that all objects appearing as values of property ubd0v0:teaches are instances of class ubd0v0:Course. This introduces a new triple d0v0:C1 rdf:type ubd0v0:Course into the given set of triples.

The use of the RDFS primitives rdfs:subClassOf, rdfs:domain and rdfs:range discussed above introduces a new set of triples into the given RDF data, as shown in

Listing 2.9.

    @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix ubd0v0: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#> .
    @prefix d0v0: <http://www.Department0.University0.edu#> .
    d0v0:P7 rdf:type ubd0v0:AcademicStaffMember .
    d0v0:P7 ubd0v0:name "Bernd Becker" .
    d0v0:P8 rdf:type ubd0v0:Professor .
    d0v0:P8 rdf:type ubd0v0:AcademicStaffMember .
    d0v0:P8 ubd0v0:name "Peter Thiemann" .
    d0v0:C1 ubd0v0:name "DBIS" .
    d0v0:C1 rdf:type ubd0v0:Course .
    d0v0:P9 ubd0v0:teaches d0v0:C1 .
    d0v0:P9 rdf:type ubd0v0:AcademicStaffMember .
    d0v0:P9 ubd0v0:name "Georg Lausen" .

Listing 2.9: Example RDF statements describing academic staff members and the offered course DBIS.

Evaluating the aforementioned query for all academic staff members on the new set of RDF triples in Listing 2.9 now returns all academic staff members: Bernd Becker, Peter Thiemann and Georg Lausen.

OWL

In line with the previous discussion on ontologies, we briefly describe the Web Ontology Language (OWL) [78] in this section. As we have seen in the previous section, the main modeling primitives provided by RDFS are limited to subclass and subproperty relationships, domain and range restrictions, and instances of classes. However, the W3C has identified a number of Semantic Web use cases that require a much more expressive representation of data than RDF(S) offers. In response to this need, OWL has been developed, which extends the expressiveness of RDF(S) with an additional set of constructs. Examples of such additional features are the definition of equivalence of classes and properties (owl:equivalentClass, owl:equivalentProperty), disjointness of classes (e.g., male and female classes), cardinality restrictions, the inverse of a property (owl:inverseOf), transitive, functional and inverse functional relations (owl:InverseFunctionalProperty), and many more.
We do not go deeper into the discussion of ontologies, as the focus of our work is on the efficient evaluation of RDF queries in a distributed environment and does not deal with RDF reasoning.

2.3 SPARQL

After agreeing on the representation of Web data in RDF(S) and OWL, there is a need for a query language to extract data from these databases. Considering the

graph nature of the RDF data model, traditional query languages such as SQL [36], XPath [38] or XQuery [25] are not appropriate candidates for querying RDF data. Thus we need a specific RDF query language. To fulfill this demand, different RDF query languages such as RQL [69], SeRQL, and RDQL [91] have been proposed over the last few years. A survey of different RDF query languages is given in [49]. In order to standardize the best-known and most common features of these RDF-based languages, the W3C recommended the SPARQL Protocol and RDF Query Language [85], abbreviated as SPARQL. SPARQL syntactically resembles the relational query language SQL, having SELECT-FROM-WHERE statements.

The basic building blocks of a SPARQL query are so-called triple patterns, which are in essence RDF triples in which some URIs or literals are replaced by variables. Recall the definition of a triple $t = (s, p, o) \in (U \cup B) \times U \times (U \cup B \cup L)$, where U, B and L represent the sets of URIs, blank nodes and literals respectively. Triple patterns extend the definition of triples by introducing variables in subject, predicate, or object position, i.e. $t = (s, p, o) \in (U \cup B \cup V) \times (U \cup V) \times (U \cup B \cup L \cup V)$, where V denotes the set of variables. During query evaluation, these triple patterns are matched against the RDF triples in the underlying RDF data (RDF graph). The result is a valuation (an assignment of URIs or literals to variables) such that replacing the variables in the query graph (triple patterns) with their assigned values makes this query graph a subgraph of the underlying RDF graph. As an example, consider the query in Listing 2.10. The query returns academic staff members and the courses they teach.

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX ubd0v0: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
    SELECT ?asm ?course
    WHERE { ?asm rdf:type ubd0v0:AcademicStaffMember .
            ?asm ubd0v0:teaches ?course }

Listing 2.10: Example SPARQL query returning academic staff members and the courses they teach.

The example query comprises the important constructs of SPARQL queries. The SELECT clause specifies the variables whose valuations (bindings) are to be returned to the user. SPARQL also supports a FROM clause to specify the RDF database to be queried. Keeping the global nature of the RDF data model in view, SPARQL queries without a FROM clause (e.g. the query in Listing 2.10) treat RDF data as a global knowledge base. In our work, we deal with this type of query, as the RDF data consists of all triples stored in the underlying Peer-to-Peer

network. The main construct of a SPARQL query is the WHERE clause, which is given as a graph pattern. In the example query in Listing 2.10, the WHERE clause is given as a basic graph pattern, which is a conjunctive sequence of triple patterns. During the evaluation of the query, the graph pattern is matched against the triples of the RDF database, and the variables inside the graph pattern are assigned the respective values (URIs, literals) of the matching triples. Consider the evaluation of the SPARQL query in Listing 2.10 on the RDF data in Listing 2.9. The evaluation of the first triple pattern against the given RDF database binds the variable ?asm to the URIs d0v0:P7, d0v0:P8, and d0v0:P9. In the same way, the evaluation of the second triple pattern binds the variable ?asm to d0v0:P9 and the variable ?course to d0v0:C1. The operator "." acts as a join operator between these two triple patterns. The evaluation of the given query thus results in a solution set with the valuation ?asm = d0v0:P9 and ?course = d0v0:C1.

The other graph patterns that can be given in the WHERE clause include UNION and OPTIONAL graph patterns. The UNION graph pattern combines graph patterns in such a way that any one of several alternative graph patterns may match. OPTIONAL graph patterns allow components of a graph pattern to be matched optionally. If the part of the query graph without the optional part has matches in the underlying RDF database, then these matches are included in the solution of the query. If no match is found for the optional part of the query graph, the existing matches are not eliminated from the solution; instead, no binding is created for the non-matching components.
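The matching, join and OPTIONAL behaviour described above can be sketched in a few lines. This is a simplified in-memory illustration under stated assumptions, not the evaluation strategy of any particular SPARQL engine: triples and patterns are tuples of strings, variables are strings starting with "?", and the helper names `match_pattern`, `evaluate_bgp` and `evaluate_optional` are hypothetical.

```python
def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` equals `triple`, or return None."""
    result = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return result


def evaluate_bgp(patterns, data, bindings=None):
    """Join the triple patterns from left to right over the data set."""
    bindings = [{}] if bindings is None else bindings
    for pattern in patterns:
        bindings = [b2 for b in bindings for t in data
                    if (b2 := match_pattern(pattern, t, b)) is not None]
    return bindings


def evaluate_optional(required, optional, data):
    """Left join: keep every required match and extend it where possible."""
    solutions = []
    for b in evaluate_bgp(required, data):
        extended = evaluate_bgp(optional, data, [b])
        solutions.extend(extended if extended else [b])
    return solutions


# The rdf:type and teaches triples of Listing 2.9.
data = [
    ("d0v0:P7", "rdf:type", "ubd0v0:AcademicStaffMember"),
    ("d0v0:P8", "rdf:type", "ubd0v0:Professor"),
    ("d0v0:P8", "rdf:type", "ubd0v0:AcademicStaffMember"),
    ("d0v0:C1", "rdf:type", "ubd0v0:Course"),
    ("d0v0:P9", "ubd0v0:teaches", "d0v0:C1"),
    ("d0v0:P9", "rdf:type", "ubd0v0:AcademicStaffMember"),
]
type_pattern = ("?asm", "rdf:type", "ubd0v0:AcademicStaffMember")
teaches_pattern = ("?asm", "ubd0v0:teaches", "?course")

joined = evaluate_bgp([type_pattern, teaches_pattern], data)
with_optional = evaluate_optional([type_pattern], [teaches_pattern], data)
print(joined)
print(with_optional)
```

The conjunctive query yields the single valuation for d0v0:P9, while the variant with an optional second pattern also keeps d0v0:P7 and d0v0:P8 and simply leaves ?course unbound for them.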
For example, consider the following change to the query in Listing 2.10, where we specify the second triple pattern as optional, as shown in Listing 2.11. The resulting query also includes the academic staff members d0v0:P7 and d0v0:P8 in the solution, even though the course specification is missing for these entities in the underlying database (the variable ?course remains unbound for these entities in the solution). A more detailed description of the semantics of SPARQL graph patterns can be found in […].

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX ubd0v0: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
    SELECT ?asm ?course
    WHERE { ?asm rdf:type ubd0v0:AcademicStaffMember .
            OPTIONAL { ?asm ubd0v0:teaches ?course } }

Listing 2.11: Example SPARQL query returning academic staff members and optionally the courses they teach.

In addition to SELECT queries, which return a set of variable bindings, SPARQL also supports CONSTRUCT, DESCRIBE, and ASK queries. CONSTRUCT queries create and return an RDF graph for every result, according to the rules given in the CONSTRUCT clause. DESCRIBE queries return an RDF graph that describes additional information related to the resulting bindings in the solution. Finally, SPARQL ASK queries are boolean queries that check for the presence of a match for a graph pattern in the underlying RDF database and return a boolean answer accordingly.

The current working draft of SPARQL 1.1 [53] also allows a SELECT or aggregate query within the graph pattern of another query as a possible type of subquery, and introduces graph patterns {P FILTER EXISTS/NOT EXISTS {P'}}, where P' is a graph pattern to test, as a form of negation. In addition to SPARQL basic graph pattern queries, we also study the distributed evaluation of SPARQL subqueries and negation queries in this thesis.

2.4 Peer-to-Peer Networks

Peer-to-Peer systems are distributed systems that allow individual peers to exchange resources (e.g. data) without centralized control. In contrast to the client-server architecture, where a distinguished server serves many clients, the peers in a Peer-to-Peer network act as server, client, and router at the same time. This property distributes the computing, storage, and network load among many peers and improves failure resilience, since there is no single point of failure. Among the features provided by Peer-to-Peer applications, information sharing, scalability, and fault tolerance are of most interest in this thesis.

Peer-to-Peer networks are overlay networks in which peers form logical connections with each other. Topology is the term used for the structure of these overlay networks. On the basis of their topology, Peer-to-Peer networks are commonly classified into unstructured and structured networks. In unstructured overlay networks, e.g. Gnutella [1], there is no constraint on data placement and network topology.
The overlay is built by establishing random links among network peers, resulting in a topology which is robust and easy to maintain under churn [75], i.e. it recovers quickly when peers join and leave the network without prior notice. On the other hand, the major drawbacks of unstructured overlays are their limited scalability and longer search time. In these networks, peers do not know in which direction to send a search request, so search requests are broadcast over the network and each peer receiving a search request scans its database for a possible match. This way of processing a search request leads to a high response time and imposes a lot of network traffic.

Structured Peer-to-Peer overlays have been developed to overcome the inherent problems of unstructured overlays. In these overlays, peers are organized in a well-defined geometric structure, which allows for more efficient query processing,

since each peer knows the network structure and can forward the search request in the right direction. CAN [86], Chord [98], Pastry [88], Tapestry [61], P-Grid [6], and SkipNet [56] are prominent representatives of structured overlays. Most of them use Distributed Hash Table (DHT) functionality for data placement in the network. In a DHT, each peer and data item has an identifier, e.g. network address and file name, which is hashed to a hash key in the key space $[0, 2^m)$ for a typical constant $m = 128$ for 128-bit keys. A peer is then assigned all data whose hash key lies between its own hash key and the next larger hash key of another peer in the key space ring. In the case of RDF, we will see that URIs and literals are hashed to produce a hash key for the storage of triples. It can be shown that the key space range assigned to any peer is at most a factor of $O(\log n)$ larger than the expected key space range, which is $2^m/n$ for $n$ peers in the network. This results in a fair load balancing of data IDs. The DHT implementations then provide a routing structure to store and look up data items with a lookup time of, for instance, $O(\log n)$ for Chord [98]. Fair load balancing and logarithmic lookup time achieve scalability, in the sense that the network performance does not decrease considerably with the size of the network.

However, the use of hash functions in DHTs destroys possible semantic relations between data items, and the data stored at the same peer are usually completely unrelated. Their support for the evaluation of queries is thus limited to exact match queries. Structured Peer-to-Peer networks are also generally considered to be less robust and harder to maintain under churn than unstructured overlays [90].

To overcome the aforementioned shortcomings of structured and unstructured Peer-to-Peer networks while keeping their individual strengths, the 3nuts [63] overlay network has been proposed.
This network combines the features of unstructured and structured networks through the use of random networks for robustness, a search tree for efficient lookup, and DHTs for load balancing. As in P-Grid [6], the distributed search tree of 3nuts preserves key ordering. Through their ordering, domain-related prefixes in data keys ensure that semantically related data within the same domain is stored on nearby peers or even on the same peer. Besides this common basis with P-Grid, 3nuts comes with further features that we want to exploit for the efficient evaluation of RDF queries in this thesis. It offers a link structure optimization on the network layer for small latencies between peers during the search. It also allows peers with a special interest in a particular search key or path to voluntarily participate in managing that path, which enables such a peer to retain fast routing in the path through direct links to other peers in the branch of the path.

In this thesis, we evaluate the performance of distributed RDF systems when using either the DHT-based overlay Chord or the search-tree-based overlay 3nuts. These two representative implementations of DHT-based and tree-based networks are described in the following subsections.
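The DHT placement rule described earlier in this section can be sketched as follows. The sketch assumes MD5 as the hash function, chosen only because it yields 128-bit keys ($m = 128$); the peer addresses are made up, and the function names `hash_key` and `responsible_peer` are hypothetical. It follows the rule as stated here, where a peer owns the keys from its own hash key up to the next larger peer key; Chord, described next, assigns a key to its successor instead.

```python
import bisect
import hashlib


def hash_key(identifier: str) -> int:
    """Map an identifier to the key space [0, 2**128) (MD5 is an assumption)."""
    digest = hashlib.md5(identifier.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def responsible_peer(peers, item):
    """Return the peer whose hash key is the largest one not exceeding the item key."""
    ring = sorted((hash_key(p), p) for p in peers)
    keys = [k for k, _ in ring]
    index = bisect.bisect_right(keys, hash_key(item)) - 1
    # index == -1 means the item key precedes every peer key, so the
    # assignment wraps around to the last peer on the ring.
    return ring[index][1]


# Made-up peer addresses and a few RDF terms used as data identifiers.
peers = [f"192.0.2.{i}:4000" for i in range(1, 9)]
for term in ["d0v0:g7", "ubd0v0:name", "rdf:type"]:
    print(term, "->", responsible_peer(peers, term))
```

Because the hash spreads identifiers uniformly over the ring, terms that are closely related on the application level generally end up on unrelated peers, which is the loss of semantic locality discussed above.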

Chord

Efficient location of a data item is one of the main features of Peer-to-Peer systems. Due to its simple lookup mechanism, Chord is often used to explain DHT-based overlays. Chord maps data and peers into a one-dimensional circular identifier space modulo $2^m$, i.e. from 0 to $2^m - 1$, where $m$ represents the number of bits used in peer and data identifiers. Every peer and data key in Chord is assigned a unique m-bit identifier by using a hash function such as SHA-1 [4]. Data key k is assigned to the first peer whose identifier is equal to or follows the identifier of k clockwise in the identifier ring. This peer is called the successor peer of data key k and is denoted by successor(k). We will often say in this thesis that this peer is responsible for key k. Chord, like all other DHTs, supports the Put(key, data) and Get(key) operations, responsible for the storage and retrieval of data in the network. For the storage of a data item, i.e. Put(key, data), the data is stored on the successor peer of its key.

Each peer maintains direct links to its successor and predecessor in the ring. A peer's successor list contains the peers which immediately follow the peer in the identifier ring. Routing a search request with the use of successor lists alone would be possible, by forwarding the request message around the identifier circle until the responsible peer is reached. However, this routing mechanism may require contacting all peers in the network to find the responsible peer. To make routing more efficient, Chord maintains additional routing information with at most $m$ entries, called the finger table. The finger table of a peer maintains links to the closest peers located at a distance of at least $2^0, 2^1, 2^2, \ldots, 2^{m-1}$ from the peer on the identifier space. We can note that each peer only needs to maintain links to $O(\log n)$ peers in a Chord network with $n$ peers. Figure 2.2 shows an 8-peer Chord network with a 6-bit circular identifier space.
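The finger-table construction and the resulting lookup can be sketched as follows, using the eight peer identifiers that appear in Figure 2.2. This is a centralized simulation for illustration only: every function sees the full peer list, whereas a real Chord peer knows only its own fingers, and the stabilization protocol is omitted. The function names are hypothetical.

```python
M = 6                                    # identifier bits, as in the 6-bit example
PEERS = [3, 9, 16, 35, 40, 45, 50, 60]   # peer identifiers appearing in Figure 2.2


def successor(ident):
    """First peer whose identifier equals or follows `ident` clockwise."""
    ident %= 2 ** M
    return next((p for p in PEERS if p >= ident), PEERS[0])


def finger_table(peer):
    """Entry i points to successor(peer + 2**i) for i = 0 .. M-1."""
    return [successor(peer + 2 ** i) for i in range(M)]


def in_interval(x, a, b):
    """True if x lies in the ring interval (a, b], wrapping past zero if needed."""
    return a < x <= b if a < b else x > a or x <= b


def lookup(start, key):
    """Return the peers visited while routing `key` from `start` to successor(key)."""
    path, node = [start], start
    while True:
        next_peer = finger_table(node)[0]          # immediate successor
        if in_interval(key, node, next_peer):
            path.append(next_peer)                 # next_peer is responsible
            return path
        # Otherwise forward to the finger that most closely precedes the key.
        preceding = [f for f in finger_table(node)
                     if f != key and in_interval(f, node, key)]
        node = (max(preceding, key=lambda f: (f - node) % 2 ** M)
                if preceding else next_peer)
        path.append(node)


print(finger_table(3))
print(lookup(3, 48))
```

Each forwarding step picks the finger closest to the key without passing it, so the remaining ring distance shrinks quickly from hop to hop.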
As m = 6 in this example Chord network, it can contain up to 64 peers, i.e. N0, N1, ..., N63. The figure visualizes the Chord finger tables of peers N3, N35, and N45. To understand the structure of a finger table, consider the table of peer N3. The first three entries in the table would be the neighbor peers N4, N5, and N7, located at distances of $2^0$, $2^1$, and $2^2$ from N3 respectively. As none of these peers exists, the fingers point to the next available peer N9. The peer for the following entry, N11, does not exist either, thus the finger points to peer N16.

Assuming peer N3 in Figure 2.2 issues a lookup request for a data item with key k48, the routing mechanism works as follows. Since peer N3 does not find the successor peer of key k48 in its successor list, it forwards the request to the peer in its finger table whose identifier most immediately precedes k48. Thus N3 forwards the request to peer N35, which is the closest one to k48 in the identifier space. In the same way N35 forwards the request to peer N45 that will forward it to

N50. N50 must be the responsible peer for key k48, as there is no other peer between N45 and N50 on the identifier space. Since the peers in the finger table of a peer n are spaced exponentially around the identifier space, each routing step from n to the next peer covers on average half of the remaining distance to the destination peer. So in general, the average routing for a lookup needs $O(\log n)$ hops, where $n$ is the number of peers in the network. In our given example, we reached the responsible peer in three routing steps, while there were 8 peers in total.

[Figure 2.2: A 6-bit Chord network consisting of 8 peers (N3, N9, N16, N35, N40, N45, N50, N60), showing the lookup for key K48 and the following finger tables.]

    N3:  N3+1 => N9,   N3+2 => N9,   N3+4 => N9,   N3+8 => N16,  N3+16 => N35,  N3+32 => N35
    N35: N35+1 => N40, N35+2 => N40, N35+4 => N40, N35+8 => N45, N35+16 => N60, N35+32 => N3
    N45: N45+1 => N50, N45+2 => N50, N45+4 => N50, N45+8 => N60, N45+16 => N3,  N45+32 => N16

The use of finger tables in Chord for an efficient search process, however, requires extra maintenance effort when the network changes as a result of peers joining and leaving. For this purpose Chord uses a stabilization algorithm to update the finger tables when a peer joins or leaves the network. For a detailed description of this stabilization protocol for routing tables in Chord, we refer to the original publication [98].

3nuts

While the use of uniform hash functions in DHTs, e.g. in Chord as described above, provides fair storage load balancing and efficient search for exact match queries, it destroys possible semantic relations between data with similar keys on the application level. Since semantically related data items are stored in a highly

CHAPTER 2. FOUNDATIONS 27 fragmented manner in DHTs, the e ciency of range or prefix queries, asking for all data with keys sharing a certain prefix, is significantly spoiled.
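The k48 lookup in the Chord example of Figure 2.2 can be replayed with a short simulation. The following is a minimal sketch, not the Chord protocol itself: it assumes a global view of all peers to compute successors (a real peer only knows its own fingers and successor list), and the function names are chosen here for illustration.

```python
M = 6                                   # identifier bits, as in the example
RING = 2 ** M                           # 64 identifiers
PEERS = [3, 9, 16, 35, 40, 45, 50, 60]  # the eight peers of Figure 2.2


def successor(ident):
    """First peer at or after `ident` on the ring (global view, for brevity)."""
    ident %= RING
    for peer in PEERS:
        if peer >= ident:
            return peer
    return PEERS[0]                     # wrap around past the highest peer


def finger_table(n):
    """Finger i points to the successor of n + 2^i."""
    return [successor(n + 2 ** i) for i in range(M)]


def between(x, a, b):
    """True if x lies in the open ring interval (a, b)."""
    if a < b:
        return a < x < b
    return x > a or x < b


def lookup(start, key):
    """Return the list of peers visited while routing `key` from `start`."""
    path = [start]
    n = start
    while True:
        succ = successor(n + 1)
        if key == succ or between(key, n, succ):
            path.append(succ)           # succ is responsible for the key
            return path
        # Closest preceding finger: scan the fingers from far to near.
        nxt = next((f for f in reversed(finger_table(n))
                    if between(f, n, key)), succ)
        path.append(nxt)
        n = nxt


print(finger_table(3))   # fingers of N3
print(lookup(3, 48))     # peers visited for key k48
```

Under these assumptions the sketch reproduces the finger table of N3 and the route N3, N35, N45, N50 described in the Chord example.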

The 3nuts [63] Peer-to-Peer network addresses this fragmentation problem by providing a distributed search tree (trie) for order-preserving indexing. A key benefit of this trie-based network is that it clusters data items with similar keys, so that range queries can be processed more efficiently. 3nuts abstracts a prefix tree (trie) structure defined by the identifiers of the data available in the network. Each peer holds only part of the overall tree through a mechanism in which the peer is assigned to a path from the root to a leaf of the tree. The path of a peer represents its identifier (ID) in the search tree. Every peer is responsible for managing the data in the subtree rooted at the leaf of its path (i.e. data with the prefix given by its path). For example, the path (ID) of peer P2 in Figure 2.3 is 010, so it manages all data items whose keys begin with 010.

Figure 2.3: Example 3nuts overlay. (Leaves of the tree: P1 at 00*, P2 at 010*, P4 at 011*, P3 at 10*, P5 at 11*; P2 holds the data with prefix 010.)

To allow efficient routing, at each level of the trie a peer maintains branch links (references) to some other peers that are responsible for the other part of the tree at that level. In each routing step, the query is forwarded to a peer whose common prefix with the identifier key of the data element is longer than that of the current peer's ID. For example, in Figure 2.4, any search key with prefix 00 (00*) received by peer P2 is forwarded by P2 to peer P1. 3nuts, like any other Peer-to-Peer overlay, supports two basic operations: Get(key) for searching a certain key (or key range) and retrieving the associated data

items and Put(key, value) for storing new data items. The key describes a path in the distributed prefix tree of 3nuts. The peer p that receives the search request determines the peer responsible for storing key by following the path key in its local view of the tree until a leaf node is reached. This leaf node can be either a branch link (a link to a random peer of the subtree neighboring p's prefix) or the last node in peer p's path. In the former case, the search request is forwarded to the random peer p' in the branch link, which shares a longer prefix with key than peer p. In the latter case, the request is forwarded to the peer responsible for the last node.

Figure 2.4: Routing in 3nuts overlay. (Local view of peer P2, with branch links to other peers such as P1, P4 and P5, and the leaf 010* holding the data with prefix 010.)

We can note that the distributed search trees of the P-Grid [6] and 3nuts [63] overlays preserve the ordering of data identifiers (keys). The ordering in the tree can represent the semantic proximity of closely related data (e.g. closely related data items are stored on peers that are close in the network). A lookup between two peers sharing the same prefix consequently takes fewer hops, which is a precondition for the efficient processing of range or prefix queries. Janson et al. [63] have shown that the distributed search tree of 3nuts answers point and range queries in O(log n) routing hops with high probability, where n denotes the number of peers in the network. (Multi-hop routing in an overlay network means that a request for a search key is routed over several peers, i.e. hops, from the requesting peer to the target peer, which is responsible for the search key and generates the response.) Besides sharing this feature of information locality with P-Grid, 3nuts comes

with further characteristics that make it a promising overlay for the efficient management of RDF data. With the network locality feature of 3nuts, the routing structure of the network is optimized for links with low turn-around times (ping); e.g. peers choose communication partners with a low ping from their branch links to keep latencies short. 3nuts provides another feature, so-called interest locality, where peers with a special interest in a particular search key or path can voluntarily participate in managing these paths. When co-managing a path and establishing routing there, a peer increases its routing table but retains fast routing in that path through direct links to other peers in the branch of the path. Additionally, the peer may also voluntarily participate in managing data in a path (which we do not make use of for the efficient evaluation of RDF queries in this thesis).

To improve the churn and fault resilience of the network, all peers that have been assigned to a subtree of the 3nuts distributed search tree are connected by a random network. Peers in these random networks update their routing information through the Pointer-Push&Pull operation. For example, the nodes of the prefix tree in Figure 2.3 are replaced by random networks, as shown in Figure 2.5 (example adapted from [63]). The root of the tree is replaced by a random network containing all five peers (P1, P2, P3, P4, P5); the peers are then recursively assigned to subtrees, with random networks (P1, P2, P4), (P3, P5), and (P2, P4), until there is only a single peer left in each subtree.

Figure 2.5: Example of random networks in 3nuts tree.
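The prefix-based responsibility and routing described for the 3nuts example can be sketched as follows. This is an illustration under stated assumptions, not the 3nuts implementation: the peer paths are taken from the Figure 2.3 example, keys are assumed to be binary strings at least as long as the longest path, and each branch link picks the last peer by name in the neighboring subtree instead of a random one, so that the run is reproducible. All function names are hypothetical.

```python
# Leaf paths of the five peers in the Figure 2.3 example.
PEER_PATHS = {"P1": "00", "P2": "010", "P4": "011", "P3": "10", "P5": "11"}


def responsible_peer(key):
    """Peer whose path is a prefix of the (binary string) key."""
    for peer, path in PEER_PATHS.items():
        if key.startswith(path):
            return peer
    raise KeyError(key)


def peers_for_prefix(prefix):
    """All peers that may hold keys starting with `prefix` (prefix query)."""
    return sorted(peer for peer, path in PEER_PATHS.items()
                  if path.startswith(prefix) or prefix.startswith(path))


def branch_links(peer):
    """One link per level into the sibling subtree at that level.

    3nuts links to a random peer of the subtree; picking the last peer by
    name is an assumption made only to keep this sketch deterministic.
    """
    path = PEER_PATHS[peer]
    links = {}
    for i, bit in enumerate(path):
        sibling = path[:i] + ("1" if bit == "0" else "0")
        candidates = sorted(p for p, q in PEER_PATHS.items()
                            if q.startswith(sibling))
        if candidates:
            links[sibling] = candidates[-1]
    return links


def route(start, key):
    """Follow branch links until the peer responsible for `key` is reached."""
    hops = [start]
    current = start
    while not key.startswith(PEER_PATHS[current]):
        links = branch_links(current)
        # Exactly one sibling prefix matches a key outside the own path.
        current = next(p for prefix, p in links.items()
                       if key.startswith(prefix))
        hops.append(current)
    return hops


print(route("P2", "0011"))     # key with prefix 00 received by P2
print(peers_for_prefix("01"))  # peers covering the prefix 01
```

Each hop strictly lengthens the prefix shared between the current peer's path and the key, which is why a prefix query only has to contact the peers below one subtree.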

The order-preservation feature of the P-Grid and 3nuts distributed search trees may lead to a non-uniform data distribution. Some branches of the tree are likely to hold more data than others, which causes a higher storage load on the peers responsible for these branches. The load-balancing mechanism used by P-Grid in such situations is based on heuristics, whereas 3nuts offers a simple and elegant load-balancing technique through the use of distributed heterogeneous hash tables (DHHT) [89]. DHHT, an extended form of DHT, is used to support non-uniform weights during the recursive assignment of peers to subtrees in the 3nuts overlay. The number of peers assigned to a subtree with DHHT depends on the load of this subtree, i.e. a peer independently chooses subtree v with weight w_v with probability

\[ P[v] = \frac{w_v}{\sum_i w_i} \]

Figure 2.6 shows an example of a balanced search tree in 3nuts, generated with the simulator for 10 peers. We can see that the number of peers assigned to each subtree is proportional to the weight of the corresponding subtree.

Figure 2.6: A balanced search tree example in 3nuts network. (Generated with the simulator and plotted with Graphviz; each tree node is labeled with its weight and its number of peers.)
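The weighted choice P[v] = w_v / sum_i w_i can be illustrated with a few lines of code. This sketch only models the probability distribution: it draws subtrees with a seeded pseudo-random generator and does not reproduce the DHHT construction of [89]. The weights 4 and 6 and the peer names are illustrative values chosen here.

```python
import random


def subtree_probabilities(weights):
    """P[v] = w_v / sum_i w_i for every subtree v."""
    total = sum(weights.values())
    return {v: w / total for v, w in weights.items()}


def assign_peers(peers, weights, rng):
    """Each peer independently picks a subtree with probability P[v]."""
    names = list(weights)
    w = [weights[v] for v in names]
    return {peer: rng.choices(names, weights=w)[0] for peer in peers}


weights = {"0": 4, "1": 6}             # illustrative subtree weights
print(subtree_probabilities(weights))  # probabilities 0.4 and 0.6

rng = random.Random(0)                 # fixed seed for a reproducible run
peers = ["P%d" % i for i in range(1, 6)]
print(assign_peers(peers, weights, rng))
```

With five peers and these weights, two peers are expected in subtree 0 and three in subtree 1; a single random run may deviate from these expected counts.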

3 P2P based RDF Storing and Querying

After the presentation of the foundations of the Semantic Web and Peer-to-Peer networks, this chapter deals with the idea of using Peer-to-Peer overlays for the distributed storage and querying of RDF data. In this RDF data management setting, several heterogeneous sources of RDF data are geographically distributed. We present the distributed management of RDF data in DHT-based RDF data stores and contrast it with the management of RDF data in our tree-based RDF data store (3rdf).

The remainder of this chapter is organized as follows. We start with an overview of the architecture of Peer-to-Peer based RDF stores in Section 3.1. Sections 3.2 and 3.3 then describe the storage and querying of RDF data in state-of-the-art DHT-based RDF stores. In Section 3.4 we analyze the query processing time in a Peer-to-Peer environment and show the significance of routing delays in the processing time of queries. Section 3.5 shows the effects of the underlying network structure on the processing time of queries, and demonstrates a speed-up in query processing obtained by exploiting features of 3nuts, the overlay network we use in our 3rdf system. Section 3.6 then presents our 3rdf system architecture and the mechanisms used to store and query RDF data. Based on our simulator developed for the performance comparison of DHT-based and search-tree-based overlays, we present in Section 3.7 simulation results on the performance of the distributed RDF system, using routing steps and RDF query evaluation time as performance metrics. Section 3.8 presents related work in the area of RDF data management. The chapter then concludes with a summary.

3.1 P2P based RDF data Management

With the proliferation of RDF data on the Web accelerating, there is an urgent need for RDF data stores customized for the efficient management of such data. High-performance centralized RDF data stores such as Sesame [30], Jena [34] and 3store [51] have been developed that store and evaluate RDF data on a single machine. These developments for the efficient management of RDF data on a single machine have been continued by many new works on non-distributed RDF systems [101, 5, 82, 99, 95] studied during the last few years. These centralized RDF data stores have demonstrated great performance in handling data sets containing millions of triples. However, as the size of RDF data on the Web grows continuously, it is no longer feasible to manage RDF data on a single machine with reasonable performance. In view of the inherently distributed nature of the Semantic Web and the aforementioned scalability issues of centralized solutions, there is a need for a distributed RDF infrastructure. Several projects have emerged that propose using Peer-to-Peer (P2P) networks for the distributed management of RDF data. Such Peer-to-Peer based RDF data stores include our 3rdf system [8, 11, 9, 10], BabelPeers [17, 18, 58, 15, 16], Atlas [64, 73, 66, 74, 65], RDFPeers [31, 33], GridVine [7, 40], RDFCube [77], and many others.

P2P based RDF stores Architecture

In the application scenarios on the Semantic Web supported by Peer-to-Peer based RDF stores, we distinguish two kinds of users: the resource/metadata provider and the resource/metadata consumer. As described in section 2.1.2, the W3C has recommended RDF as a standard format for representing resources (metadata) in the Semantic Web; thus the resource descriptions within the network are stored and queried as RDF triples.
As Figure 3.1 shows, the resource description provider provides descriptions of Web resources by submitting RDF data to the system, and the Web resource consumer discovers resources by submitting SPARQL queries to the system. The distributed nature of the Semantic Web in annotating and querying Web resources matches well with the use of Peer-to-Peer networks as the underlying overlay in RDF data stores. Peers in the underlying overlay network have symmetrical functionality. They can distribute and store RDF triples, submit and evaluate SPARQL queries, and route store and query requests to the appropriate (responsible) peers. The essence of the underlying Peer-to-Peer overlay is that a peer in the network can directly exploit resources (RDF triples) present at other peers of the network without central control. The majority of existing work towards a distributed infrastructure for RDF data management chooses structured Peer-to-Peer networks as the underlying overlay

(e.g. BabelPeers [17], Atlas [64], RDFPeers [31], GridVine [7], RDFCube [77]). Unlike unstructured Peer-to-Peer networks, in these structured Peer-to-Peer overlays peers are organized in a well-defined geometric topology. They give stronger guarantees on search (lookup) time and also help to decrease the maintenance cost of the overlays [35]. DHT-based structured overlays such as CAN [86], Chord [98], Pastry [88], and Tapestry [61] are mainly used for the storage and evaluation of RDF data in existing distributed RDF stores. By providing the put and get methods, they offer a simple and efficient mechanism to store and retrieve data in a distributed environment. The main drawback of using these DHTs as underlying overlays is that they only support exact-match queries, i.e., searching for data items matching a given key with the method get(key). They are limited when it comes to efficiently evaluating complex queries, such as range and conjunctive queries, supported by the RDF data model.

Figure 3.1: Using P2P overlay for storing and querying RDF data.

The format of the RDF data model influences the choice of the underlying network topology, as can be seen from the data indexing model of most distributed RDF systems in the next sections. Given this relation between the network topology and the data indexing mechanism, the selection of the underlying network structure greatly affects data lookup and consequently the query processing mechanism. In the remaining parts of this chapter, we will show the effects of the chosen network topology on RDF query evaluation in existing distributed RDF stores, and will also show how to exploit the format of the RDF data model for a possible improvement

in the structure of the underlying network.

In the following sections 3.2 and 3.3, we describe the distribution of RDF data over the peers of a DHT-based network and then the evaluation of SPARQL queries over these data. We will show that the indexing and routing mechanisms used in DHTs do not allow these overlays to exploit the format of the RDF data model for the efficient evaluation of SPARQL queries.

3.2 Storing RDF triples in DHTs

To enable the storage and querying of RDF data in distributed RDF stores, peers are connected via a structured Peer-to-Peer network which implements a DHT, as shown in Figure 3.2. The ID space of the DHT is distributed among the peers of the overlay network to support efficient insert and lookup operations. Each peer in the system can publish (insert) RDF resources into the network. As described in section 2.1.2, in the RDF data model resources are expressed as subject-predicate-object expressions, called triples in RDF terminology. The subject of an RDF triple is a URI which denotes the resource, the object is either a literal value or the URI of another resource, and the predicate expresses the kind of relationship between the subject and the object. RDF data is inserted into the network as RDF documents, which can be in RDF/XML or RDF/N3 format. For example, the following RDF document (RDF/N3 format) in Listing 3.1 contains information (seven triples) about the students, the courses they take, and their advisors.

    @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix ub: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#> .
    @prefix d0v0: <http://www.Department0.University0.edu#> .

    d0v0:S1 rdf:type ub:Student .
    d0v0:S4 rdf:type ub:Student .
    d0v0:S1 ub:takescourse d0v0:C1 .
    d0v0:S1 ub:takescourse d0v0:C3 .
    d0v0:S1 ub:advisor d0v0:T1 .
    d0v0:S4 ub:advisor d0v0:T1 .
    d0v0:T1 ub:teacherof d0v0:C3 .
Listing 3.1: Example RDF statements about students Student1, Student4, their courses Course1, Course3, and advisor Teacher1, encoded in RDF/N3 format.

Each RDF document is decomposed into a collection of RDF triples in order to store them in the network. Since the majority of RDF query languages, including SPARQL, are based on constraint search over a triple's subject, predicate or object components, each triple is indexed and thus stored three times, using its subject, predicate, and object as different keys. This triple storage method is by now standard in Peer-to-Peer based RDF stores. The hash values of the subject, predicate and object are used to compute the identifiers that point to the responsible peers in the ID space of a DHT for storing the corresponding triples. Triples are stored in the network by using the DHT Put(id, t) operation for each triple t ∈ triples, using as key the subject, predicate and object respectively, and the triple itself as the item. A peer interested in storing a set of triples hashes the subject, predicate and object of each triple, using a hash function such as SHA-1, to create the identifier id that leads to the responsible peer, and then transmits the triple there. The responsible peer then manages the triple and inserts it into its local database, which consists of a single relation with three columns. A peer that publishes triples in the network may occasionally itself be the responsible peer for some of them. However, the larger the network grows, the less likely this becomes.

The distribution of the triples of Listing 3.1 over the DHT network, using the aforementioned triple storing method, is visualized in Figure 3.2. Each triple is stored three times, using its subject, predicate, and object as key for the DHT network. For each of the ten indices (keys) occurring in the triples of Listing 3.1, a block of corresponding triples has to be stored. The DHT lookup mechanism determines which block of triples is stored on which peer. A possible distribution of the given triples over a DHT network with eight peers is shown in the figure.

3.3 Query Evaluation in DHTs

This section presents the query evaluation process developed for state-of-the-art DHT-based distributed RDF stores (RDFPeers, Atlas, BabelPeers, GridVine, etc.).
First we describe the basic protocol used in these systems to evaluate a query consisting of a single triple pattern; then we describe the distributed evaluation of a more challenging class of queries, namely conjunctive triple pattern queries.

Evaluation of Atomic Triple Patterns

The indexing (storage) of RDF triples in distributed RDF stores, described in the previous section, supports the evaluation of all atomic query patterns. An atomic query pattern is a triple pattern in which each of subject, predicate and object is either specified or a variable. The total number of query patterns follows from the fact that there are 2 possibilities for each triple component, so there are 2^3 = 8 possible query patterns in total. Table 3.1 shows all possible query patterns for RDF triple lookups. Query patterns 2 through 7 in Table 3.1 are the most practical ones used in distributed RDF querying, where at least one triple component is specified, and

with at least one unspecified component (variable) used to obtain the desired results. Query pattern 1 is the most expensive query, as it matches all triples. This pattern contains no restriction whatsoever and thus has to be propagated to all peers in the network for evaluation.

Figure 3.2: Storing triples in Listing 3.1 into a DHT-based RDF store of eight peers. The figure shows the following assignment of keys and triple blocks to peers (peers n2 and n8 store no triples):

  Peer  Key(s)                    Stored triples
  n1    ub:teacherof, d0v0:S4     (d0v0:T1, ub:teacherof, d0v0:C3); (d0v0:S4, rdf:type, ub:Student); (d0v0:S4, ub:advisor, d0v0:T1)
  n3    ub:Student                (d0v0:S1, rdf:type, ub:Student); (d0v0:S4, rdf:type, ub:Student)
  n4    d0v0:S1                   (d0v0:S1, rdf:type, ub:Student); (d0v0:S1, ub:takescourse, d0v0:C1); (d0v0:S1, ub:takescourse, d0v0:C3); (d0v0:S1, ub:advisor, d0v0:T1)
  n5    ub:takescourse, d0v0:T1   (d0v0:S1, ub:takescourse, d0v0:C1); (d0v0:S1, ub:takescourse, d0v0:C3); (d0v0:S1, ub:advisor, d0v0:T1); (d0v0:S4, ub:advisor, d0v0:T1); (d0v0:T1, ub:teacherof, d0v0:C3)
  n6    rdf:type, d0v0:C1         (d0v0:S1, rdf:type, ub:Student); (d0v0:S4, rdf:type, ub:Student); (d0v0:S1, ub:takescourse, d0v0:C1)
  n7    ub:advisor, d0v0:C3       (d0v0:S1, ub:advisor, d0v0:T1); (d0v0:S4, ub:advisor, d0v0:T1); (d0v0:S1, ub:takescourse, d0v0:C3); (d0v0:T1, ub:teacherof, d0v0:C3)

Because each triple is stored three times in distributed RDF stores, based on the hash values of its subject, predicate and object, the routing algorithm of the underlying DHT network can be used to resolve query patterns 2 through 8. Since each of these seven query patterns contains at least one constant value, the requesting peer chooses this constant value and hashes it to create the identifier that will lead the corresponding query pattern to the responsible peer.
When there is more than one constant, the requesting peer heuristically selects the key in the order (subject, object, predicate), based on expected selectivity and the assumption that there are more distinct subject or object values than distinct predicate values. The destination peer then matches this query pattern against the triples in its local triple table, and the resulting triples

are returned to the requesting peer. For example, if peer n1 in Figure 3.2 asks the query (?x, ub:takescourse, ?y), the hash value of the constant part (ub:takescourse) is used to route the query to the responsible peer n5. Peer n5 filters its triples locally using this query pattern and sends the matching triples (d0v0:S1, ub:takescourse, d0v0:C1) and (d0v0:S1, ub:takescourse, d0v0:C3) back to the requesting peer n1.

  No  Triple pattern
  1   (? : ? : ?)
  2   (subject : ? : ?)
  3   (subject : predicate : ?)
  4   (subject : ? : object)
  5   (? : predicate : ?)
  6   (? : predicate : object)
  7   (? : ? : object)
  8   (subject : predicate : object)

Table 3.1: Possible atomic query patterns

Evaluation of Conjunctive Triple Pattern Queries

In this section we describe the evaluation of conjunctive queries composed of triple patterns over RDF data stored in DHTs. As already described in section 2.3, the core construct of the SPARQL query language is a conjunction of triple patterns (basic graph pattern). Let U, L, and V represent the pairwise disjoint sets of URIs, literals, and variables. A triple $(v_1, v_2, v_3) \in (U \cup V) \times (U \cup V) \times (U \cup V \cup L)$ is called a SPARQL triple pattern [85]. A basic graph pattern of a SPARQL query is a conjunction of triple patterns, and following [73] we define it as a formula

\[ ?x_1, \ldots, ?x_n : (s_1, p_1, o_1) \wedge \cdots \wedge (s_n, p_n, o_n) \]

where $?x_1, \ldots, ?x_n$ are variables and each $(s_i, p_i, o_i)$ is a triple pattern. The variables $?x_1, \ldots, ?x_n$ represent the answer variables, and each variable $?x_k$ appears in at least one triple pattern. During the evaluation of a conjunctive triple pattern query, these triple patterns are matched against the RDF triples in the underlying input RDF data (RDF graph). The result is a valuation (an assignment of URIs or literals to variables) such that replacing the variables in these triple patterns (the query graph) with their assigned values makes this query graph a subgraph of the input RDF graph.
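The three-fold indexing of Section 3.2 and the lookup of an atomic triple pattern can be sketched together. The following is a simplified model under stated assumptions, not code from any of the cited systems: keys are mapped to one of eight hypothetical peers by taking the SHA-1 digest modulo the number of peers (a real DHT assigns identifier ranges instead), and the pattern matcher does not check repeated variables. All function names are chosen for illustration.

```python
import hashlib

NUM_PEERS = 8  # hypothetical network of eight peers, numbered 0..7

TRIPLES = [  # the seven triples of Listing 3.1
    ("d0v0:S1", "rdf:type", "ub:Student"),
    ("d0v0:S4", "rdf:type", "ub:Student"),
    ("d0v0:S1", "ub:takescourse", "d0v0:C1"),
    ("d0v0:S1", "ub:takescourse", "d0v0:C3"),
    ("d0v0:S1", "ub:advisor", "d0v0:T1"),
    ("d0v0:S4", "ub:advisor", "d0v0:T1"),
    ("d0v0:T1", "ub:teacherof", "d0v0:C3"),
]


def peer_for(key):
    """Hash a key with SHA-1 and map it to a peer (modulo is a simplification)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % NUM_PEERS


def build_store(triples):
    """Store every triple three times: under its subject, predicate and object."""
    peers = {p: {} for p in range(NUM_PEERS)}
    for t in triples:
        for key in set(t):
            peers[peer_for(key)].setdefault(key, []).append(t)
    return peers


def is_var(term):
    return term.startswith("?")


def routing_key(pattern):
    """Pick the first constant in the order subject, object, predicate."""
    s, p, o = pattern
    for term in (s, o, p):
        if not is_var(term):
            return term
    return None  # pattern 1: no constant, the query must reach all peers


def matches(pattern, triple):
    """Constants must be equal; variables match anything (no repeat check)."""
    return all(is_var(q) or q == t for q, t in zip(pattern, triple))


def evaluate(peers, pattern):
    key = routing_key(pattern)
    if key is None:
        candidates = [t for peer in peers.values()
                      for block in peer.values() for t in block]
    else:
        candidates = peers[peer_for(key)].get(key, [])
    return sorted(t for t in set(candidates) if matches(pattern, t))


store = build_store(TRIPLES)
print(sum(len(blocks) for blocks in store.values()))  # number of index keys
print(evaluate(store, ("?x", "ub:takescourse", "?y")))
```

The sketch stores ten key blocks for the seven triples, and the pattern (?x, ub:takescourse, ?y) is answered by the single peer responsible for ub:takescourse.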

Data from several peers has to be combined for the distributed evaluation of conjunctive triple pattern queries in a Peer-to-Peer environment. The majority of existing DHT-based RDF management systems, namely Atlas [64, 73, 66, 74, 65], RDFPeers [31], GridVine [7], and BabelPeers in [16], use query processing strategies based on the Query Chain (QC) algorithm presented in [73]. In the QC algorithm, the triple patterns contained in the query are iteratively resolved by a chain of nodes. The algorithm evaluates one triple pattern at a time and transfers the intermediate results from one peer to the next. Query evaluation starts with a single triple pattern of the query by doing a lookup for the peer responsible for evaluating this triple pattern. This peer adds to an intermediate result all triples of its local database that qualify for the evaluated triple pattern. The intermediate result is then extended by doing a lookup for a second triple pattern and joining the results. This operation is repeated until all triple patterns of the query have been processed.

To explain the QC algorithm, we use a running example. The example SPARQL query shown below in Listing 3.2 is evaluated over the distributed RDF repository given in Figure 3.2. The query asks for the students taking courses taught by their advisors.

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX ub: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
    SELECT ?x ?y ?z
    WHERE {
      ?x rdf:type ub:Student .
      ?x ub:takescourse ?z .
      ?x ub:advisor ?y .
      ?y ub:teacherof ?z
    }

Listing 3.2: Example SPARQL basic graph pattern query.

For the sake of simplicity, we assume that the triple patterns are evaluated in the given order. Owing to the triple distribution method and the DHT algorithm described in section 3.2, a constant value in the triple pattern is used as the key for the DHT routing.
If multiple constants are present in the triple pattern, the QC algorithm heuristically prefers the value of the subject over the object, and the value of the object over the predicate, as the key for determining the peer responsible for evaluating the triple pattern. This selection is based on the assumption that there are more distinct subject or object values than distinct predicate values in given RDF data. Figure 3.3 shows the evaluation of the query in Listing 3.2 over the DHT-based RDF store given in Figure 3.2. The requesting peer n2 chooses the constant ub:Student as the key and sends the query to the responsible peer n3 for the evaluation of the first triple pattern.

Figure 3.3: Evaluating SPARQL query in Listing 3.2 over a DHT-based RDF store of eight peers. (The figure repeats the key and triple assignment of Figure 3.2 and marks the query chain: n2 sends the query to n3 using the key ub:Student, n3 forwards it to n5 using ub:takescourse, n5 to n7 using ub:advisor, n7 to n1 using ub:teacherof, and n1 returns the results to n2.)

The responsible peer n3 finds the valuations of variable ?x (d0v0:S1, d0v0:S4) in its local database that match the first triple pattern. Peer n3 then sends the query, along with the values of variable ?x as intermediate results, to peer n5, which is responsible for the evaluation of the second triple pattern. Peer n5 finds the values of variables ?x and ?z that match the second triple pattern and joins them with the intermediate results from peer n3, as shown in Figure 3.4.

  Intermediate results    Local matches on n5     Join result
  ?x                      ?x        ?z            ?x        ?z
  d0v0:S1                 d0v0:S1   d0v0:C1       d0v0:S1   d0v0:C1
  d0v0:S4                 d0v0:S1   d0v0:C3       d0v0:S1   d0v0:C3

Figure 3.4: Evaluation of the query on peer n5.

For the evaluation of the third triple pattern, peer n5 picks the constant ub:advisor as the key and sends the query and the intermediate results to the

peer n7. Peer n7 correspondingly finds the matches for the triple pattern and joins them with the received intermediate results. Figure 3.5 shows the processing of the query on peer n7.

  Intermediate results    Local matches on n7     Join result
  ?x        ?z            ?x        ?y            ?x        ?y        ?z
  d0v0:S1   d0v0:C1       d0v0:S1   d0v0:T1       d0v0:S1   d0v0:T1   d0v0:C1
  d0v0:S1   d0v0:C3       d0v0:S4   d0v0:T1       d0v0:S1   d0v0:T1   d0v0:C3

Figure 3.5: Evaluation of the query on peer n7.

Finally, peer n1, responsible for the key ub:teacherof, evaluates the last triple pattern and joins the valuations of variables ?y and ?z with the intermediate results, as shown in Figure 3.6. The answer tuple (d0v0:S1, d0v0:T1, d0v0:C3) is then sent by peer n1 to the query-originating peer n2.

  Intermediate results            Local matches on n1     Join result
  ?x        ?y        ?z          ?y        ?z            ?x        ?y        ?z
  d0v0:S1   d0v0:T1   d0v0:C1     d0v0:T1   d0v0:C3       d0v0:S1   d0v0:T1   d0v0:C3
  d0v0:S1   d0v0:T1   d0v0:C3

Figure 3.6: Evaluation of the query on peer n1.

Given that the aforementioned query processing algorithm combines RDF triples distributed across several peers of the network to evaluate user queries, a challenging issue is how to optimize its processing. We have to investigate which optimizations, at the network and application level, are applicable in a Peer-to-Peer environment to speed up the query evaluation process. In the remaining parts of this chapter, we analyze the effects of the chosen network structure on the performance of query evaluation and propose improvements in the structure of the underlying network to optimize the distributed evaluation of RDF queries.

3.4 Analysis of Query Processing Time

We can note that each step of the aforementioned distributed query processing incurs a network cost for determining the responsible peer and transferring the intermediate results there.
The processing time of the query is therefore mainly determined by the time the lookup operations take plus the transfer time of the intermediate results.
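The query chain walked through in Figures 3.3 to 3.6 can be replayed in a few lines, recording for each step the routing key and the size of the intermediate result, the two quantities that drive the network cost discussed here. This is a minimal single-process sketch, not the QC implementation of [73]: the DHT lookup is replaced by filtering the full triple list for the block stored under the chosen key, and all function names are hypothetical.

```python
TRIPLES = [  # the seven triples of Listing 3.1
    ("d0v0:S1", "rdf:type", "ub:Student"),
    ("d0v0:S4", "rdf:type", "ub:Student"),
    ("d0v0:S1", "ub:takescourse", "d0v0:C1"),
    ("d0v0:S1", "ub:takescourse", "d0v0:C3"),
    ("d0v0:S1", "ub:advisor", "d0v0:T1"),
    ("d0v0:S4", "ub:advisor", "d0v0:T1"),
    ("d0v0:T1", "ub:teacherof", "d0v0:C3"),
]

QUERY = [  # the triple patterns of Listing 3.2, in the given order
    ("?x", "rdf:type", "ub:Student"),
    ("?x", "ub:takescourse", "?z"),
    ("?x", "ub:advisor", "?y"),
    ("?y", "ub:teacherof", "?z"),
]


def is_var(term):
    return term.startswith("?")


def routing_key(pattern):
    """Prefer subject over object over predicate as the routing key."""
    s, p, o = pattern
    for term in (s, o, p):
        if not is_var(term):
            return term
    return None


def match(pattern, triple):
    """Return the variable binding if the triple matches, else None."""
    binding = {}
    for q, t in zip(pattern, triple):
        if is_var(q):
            if binding.setdefault(q, t) != t:
                return None
        elif q != t:
            return None
    return binding


def join(left, right):
    """Natural join of two lists of bindings on their shared variables."""
    return [{**a, **b} for a in left for b in right
            if all(a[v] == b[v] for v in a.keys() & b.keys())]


def query_chain(triples, patterns):
    results = [{}]  # a single empty binding is the neutral element of the join
    trace = []
    for pattern in patterns:
        key = routing_key(pattern)
        # Block stored under `key` at the responsible peer (lookup not simulated).
        block = [t for t in triples if key in t]
        local = [b for b in (match(pattern, t) for t in block) if b is not None]
        results = join(results, local)
        trace.append((key, len(results)))
    return results, trace


results, trace = query_chain(TRIPLES, QUERY)
print(trace)    # routing key and intermediate result size per step
print(results)  # final answer bindings
```

Run on the example data, the chain visits the keys ub:Student, ub:takescourse, ub:advisor and ub:teacherof and ends with the single answer binding ?x = d0v0:S1, ?y = d0v0:T1, ?z = d0v0:C3.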

Let $k = \lambda \cdot b$ denote the product of delay (latency) $\lambda$ and bandwidth $b$ in a given physical network model. The processing time $t$ of a query with $m$ triple patterns $p_i$ is then

\[ t = \sum_{i=1}^{m} \Bigl( \underbrace{c_i}_{\text{comp-time}} + \underbrace{\mathrm{hops}_i \cdot \lambda}_{\text{lookup time}} + \underbrace{\Bigl\lceil \frac{\mathrm{data}_i}{k} \Bigr\rceil \cdot \lambda}_{\text{transmission time}} \Bigr) \]

where $c_i$ is the computation time for evaluating pattern $p_i$, $\mathrm{hops}_i$ is the number of hops needed to look up the peer providing the triples for pattern $p_i$, and $\mathrm{data}_i$ is the size of the data for pattern $p_i$ that is transferred to the next peer as intermediate results. We neglect the computation time $c_i$. The processing time is mainly driven by routing delays and the data transfer time.

We can note in the above equation that the network latency $\lambda$ and bandwidth $b$ of the given physical network are the main factors determining query processing time. Bandwidth in a network can always be increased by adding more pipes, but latency cannot be decreased. When the bandwidth of the underlying network is very high, routing delays ($\mathrm{hops}_i$) are the dominant factor in the processing time of queries. Furthermore, we have to consider that routing a lookup request for triple pattern $p_i$ to a target peer in a DHT network typically costs $\mathrm{hops}_i = O(\log n)$ routing steps. In scenarios where the network latency $\lambda$ and bandwidth $b$ of the physical network are fixed, query processing time can only be improved by decreasing the number of routing hops ($\sum_{i=1}^{m} \mathrm{hops}_i$) during lookup operations, and by minimizing the size of the intermediate results ($\sum_{i=1}^{m} \mathrm{data}_i$) transmitted during query evaluation. Existing works on distributed SPARQL query optimization, [65] and [16], mainly focus on minimizing the size of intermediate results by finding a sensible order in which the triple patterns are processed.
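To make the cost model concrete, the following sketch evaluates the processing time for a four-pattern query. It assumes that every routing hop costs one latency unit and that every chunk of k = latency x bandwidth bytes of intermediate results costs one further latency unit; the numeric values (50 ms latency, 100 bytes per ms, hop counts and data sizes) are invented for illustration and are not measurements from this thesis.

```python
import math


def processing_time(steps, latency, bandwidth):
    """Sum of comp-time, lookup time and transmission time over all patterns.

    `steps` is a list of (c_i, hops_i, data_i) tuples; `latency` is the delay
    per hop and `bandwidth` the amount of data transferable per time unit.
    """
    k = latency * bandwidth  # bandwidth-delay product
    return sum(c + hops * latency + math.ceil(data / k) * latency
               for c, hops, data in steps)


LATENCY = 50     # ms per hop (illustrative)
BANDWIDTH = 100  # bytes per ms (illustrative), so k = 5000 bytes

# Four triple patterns, computation time neglected (c_i = 0).
dht_like = [(0, 3, 2000), (0, 3, 2000), (0, 3, 2000), (0, 3, 1000)]
fewer_hops = [(0, 1, 2000), (0, 1, 2000), (0, 1, 2000), (0, 1, 1000)]

print(processing_time(dht_like, LATENCY, BANDWIDTH))    # 3 hops per lookup
print(processing_time(fewer_hops, LATENCY, BANDWIDTH))  # 1 hop per lookup
```

With small intermediate results, the lookup term dominates in this model: cutting the hops per lookup from three to one halves the total time in the example, while the transmission term stays unchanged.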
They use selectivity-based heuristics [97] in their optimization algorithms, which try to minimize the size of the transmitted data (intermediate results) produced during query evaluation. The number of routing hops ($\sum_{i=1}^{m} \mathrm{hops}_i$) taken during lookup operations to resolve a query is the dominant performance metric in a Peer-to-Peer environment. To the best of our knowledge, none of the existing works in the area of Peer-to-Peer based RDF querying has presented an optimization technique based on minimizing the number of lookup operations. While state-of-the-art distributed RDF stores focus on optimizing query evaluation at the application layer, we see real potential in the interaction of the application and the network. Mapping

the application data relations onto the network structure could improve the routing time in the network and consequently the RDF query latency on the application layer. We show in the next sections how the lookup operation, and consequently the query evaluation in a Peer-to-Peer environment, can be optimized by improving the structure of the underlying network.

3.5 Effects of Network Structure Improvements on Query Evaluation

The RDF data model allows application domains to use their own vocabularies (collections of URIs) for the representation of their resources. We assume that an application domain chooses a specific set of URIs to represent its domain-specific data. For example, DBpedia [23], a Wikipedia-based data set, uses a DBpedia namespace to describe things that are the subject of Wikipedia articles in German. One possible way to optimize the lookup operation is to store the data of the same application domain (triples sharing the same namespaces in their subjects, predicates or objects) on nearby peers of the underlying network.

As discussed in Section 3.1, state-of-the-art Peer-to-Peer based RDF data stores such as RDFPeers, Atlas, and BabelPeers use Distributed Hash Tables (DHTs) to store and query RDF data in a distributed manner. To attain an efficient search for RDF triples with the same subject, predicate, or object, these stores index every triple three times, once for each triple component (subject, predicate, and object). Triples with the same index key, such as the subject, are stored on the same peer. Traditional DHT-based networks such as Chord [98] or Pastry [88], which are used as the underlying network in these distributed RDF stores, apply uniform hash functions to map data keys to the peers in the network. This achieves good storage load balancing, i.e., keys are evenly distributed among the peers of the network, but sacrifices the relationship of the keys based on their order.
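The trade-off between uniform hashing and order preservation can be made concrete with a small sketch. It is not the placement scheme of Chord or 3nuts: the peer count, the key list and the rank-based partitioning below are simplifying assumptions chosen only to show that sorted placement keeps keys with a common namespace on neighbouring peers, while hashed placement ignores the namespace entirely.

```python
import hashlib

# Hypothetical index keys; the ub: keys belong to one application domain.
KEYS = ["d0v0:C1", "d0v0:S1", "rdf:type", "ub:advisor",
        "ub:student", "ub:takescourse", "ub:teacherof"]
NUM_PEERS = 4  # assumed network size for this example


def hashed_peer(key, num_peers=NUM_PEERS):
    """Uniform placement: the peer depends only on the hash of the key."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_peers


def ordered_peers(keys, num_peers=NUM_PEERS):
    """Order-preserving placement: split the sorted key list into contiguous ranges."""
    ranked = sorted(keys)
    return {key: rank * num_peers // len(ranked) for rank, key in enumerate(ranked)}


hashed = {key: hashed_peer(key) for key in KEYS}
ordered = ordered_peers(KEYS)

# Peers holding ub: keys under the order-preserving scheme form one contiguous run.
ub_ordered = sorted({peer for key, peer in ordered.items() if key.startswith("ub:")})

for key in KEYS:
    print(f"{key:16s} hashed -> peer {hashed[key]}   ordered -> peer {ordered[key]}")
```

The order-preserving variant gives up the load-balancing guarantee of the hash function: a domain with many keys simply occupies more of the key space.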
The use of hash functions destroys the ordering of the index keys on the application level, and along with it the grouping of semantically related data; e.g., data of a university domain (sharing the same namespaces in their subject, predicate or object) cannot be stored on a contiguous interval and is spread over the complete table. This can cause more routing when collecting data from the same domain to evaluate a query. Keys which are semantically close at the application level are heavily fragmented in the DHT, and thus the data with different index keys stored on the same peer is usually unrelated. Since the related data items belonging to a
particular application domain are stored in a highly fragmented manner in DHTs, the efficiency of range queries or queries posed on semantically related attributes is significantly impaired. For example, consider the storage of RDF triples belonging to a particular university domain over a DHT network in Figure 3.2. The triples are stored with their subject, predicate and object values used as index keys. We can note that, although the related index keys share the same order (namespace), e.g. ub:student, ub:advisor, ub:takescourse and ub:teacherof share the common namespace ub, the use of hash functions destroys the ordering of these keys and stores the relevant triples on distant peers of the network. As a result, as shown in Figure 3.3, the lookup operation on these related index keys (ub:student, ub:advisor, ub:takescourse and ub:teacherof) for the evaluation of the triple patterns of the query in Listing 3.2 costs a high number of routing hops.

GridVine [7] is another distributed RDF system proposed for the storage and querying of RDF data. GridVine addresses the aforementioned problem by using a search-tree based Peer-to-Peer network, P-Grid [6], instead of a DHT-based search structure. P-Grid is a Peer-to-Peer lookup system based on a virtual distributed search tree. It clusters semantically close data items by applying order-preserving hash functions. However, in contrast to other networks like Pastry [88] and 3nuts [63], P-Grid does not provide a routing structure with latency-optimized links for reducing the search time in the network. Aberer et al. [7] have not investigated the effect of using the search-tree based overlay network (P-Grid) on the performance of distributed query evaluation in their GridVine RDF system.

In this thesis we design and implement a scalable and distributed RDF repository, called 3rdf, for storing and querying RDF data. 3rdf is built on top of the 3nuts [63] Peer-to-Peer network.
Like P-Grid, 3nuts offers a distributed search tree and a distributed data storage, usually for extended meta information besides search keys, which we call index data. The distributed search tree of 3nuts provides order-preserving indexing. The ordering in the tree can represent the semantic proximity of closely related RDF triples. Domain-related prefixes (namespaces) in subjects, predicates, and objects ensure with their ordering that semantically related data (triples) within the same domain is stored on nearby peers or even on the same peer. In contrast to P-Grid, 3nuts comes with further features that we exploit in our 3rdf system to reduce the traffic and response time of SPARQL queries.

There are two reasons for choosing 3nuts as the underlying network in our 3rdf system. First, there is an implementation in Java that we can use for our system. Most other semantic networks that provide range queries, except for P-Grid, are only theoretical. Second, the 3nuts network provides further features which allow adapting the network structure to the search structure in order to reduce the communication time and traffic on the application layer. While other RDF systems aim to achieve the principle of data independence [59] and focus on enhancing the query processing on the application layer, we see real potential in the interaction of application and network. We exploit the so-called information, network and interest localities provided in the underlying 3nuts network for the efficient management of RDF data in our 3rdf distributed system.

3.5.1 Information locality

The distributed search tree of 3nuts preserves key ordering. Each peer manages a contiguous part of the key space, e.g. all data elements sharing the same prefix. The similarity of data elements can be mapped one-dimensionally to the data ordering, which we call information locality; e.g., nearby elements get the same prefix key (see Listing 3.1). Domain-related prefixes (namespaces) in subjects, predicates, and objects of RDF triples order triples of the same domain in the same branches of the search tree. In return, the data belonging to the same domain is stored on nearby peers (in the metric of the overlay routing structure) or even on the same peer. The benefit is that a lookup between two peers sharing the same prefix takes fewer hops. So if a SPARQL query contains triple patterns related to several keys sharing the same prefix (see Listing 3.2), the number of hops required to reach all these keys is reduced (at best, some keys are managed by the same peer). For example, Figure 3.7 shows the storage of the RDF triples in Listing 3.1. We can see that the domain-related prefixes in subjects, predicates, and objects of the given triples (i.e. rdf, ub and d0v0) order triples of the same domain in the same branches of the tree. In return, for the evaluation of triple patterns with lookup keys sharing the same prefixes (e.g.
Student, takescourse, advisor and teacherof lookup keys of the query in Listing 3.2 share the common prefix ub), the number of routing hops required to reach all these keys is reduced. The evaluation of the query in Listing 3.2 starts at an arbitrary peer, which might not participate in the subtree with prefix ub in the 3nuts tree and thus needs $O(\log n)$ routing hops for the first triple pattern; but once the query enters this subtree, it stays inside for the rest of the query steps, with fast routing.

3.5.2 Network locality

Minimizing the number of routing hops during the lookup operation through information locality alone is not sufficient to improve the routing time, because a hop connecting peers located in two different countries has a higher latency than a hop connecting peers in the same building. As discussed in the description of 3nuts in Section 2.4.2, to allow efficient routing at each level of the 3nuts search tree
a peer maintains branch links (references) to some other peers that are responsible for the other part of the tree at that level. The goal is to improve the routing time by selecting, out of these referenced peers, a peer with small latency. With the network locality feature provided in 3nuts, the routing structure of the network is optimized for links with low turn-around times (ping); e.g., peers choose communication partners with a low ping from their branch links (references) to obtain short latencies.

Figure 3.7: RDF triples storage in 3nuts search tree. (Tree nodes shown: rdf:, ub:, d0v0:; rdf:type, ub:takescourse, ub:advisor, ub:teacherOf, d0v0:S1, d0v0:C1.)

3.5.3 Interest locality

While information locality already places triples with the same index prefix on nearby nodes in the network, resulting in a fast routing time between them, triples indexed with more diverse prefixes still need a routing time of $O(\log n)$ hops in the underlying Peer-to-Peer network (e.g. rdf:type and ub:... in Listing 3.2). The 3nuts network also allows a peer to have additional routing structures in certain paths in the tree. With this so-called interest locality, peers with a special interest in a certain search key or prefix range can voluntarily manage data there, or simply obtain fast routing in these paths through an additional routing structure. The interest locality feature supported by the 3nuts network gives us the opportunity to speed up the routing between more diverse prefixes by placing routing shortcuts. For
example, as shown in Figure 3.8, the peer responsible for the triple component rdf:type places a routing shortcut to the path with prefix ub:... in order to obtain fast routing between these diverse prefixes.

Figure 3.8: Routing shortcuts in 3rdf.

Of course, this benefit comes with the extra cost of maintaining additional routing links. The more routing shortcuts we place, the more the network structure is extended, which results in higher maintenance costs for routing links. However, the traffic produced during query execution can be reduced by using the routing shortcuts instead of the original, longer routing paths. Thus, when a routing shortcut is heavily used during query evaluation, the routing time is shortened and the overall traffic is reduced at the same time. To reach the three goals of low query latency, reasonably small traffic and a reasonably small network structure, we can only place a limited number of routing shortcuts, chosen according to how frequently the routing paths between the relevant index keys are used during query evaluation. In a nutshell, if we know that certain routing paths between some RDF keys (index keys) are frequently used, we can establish routing shortcuts with interest locality to reduce traffic and query response time.

One could also keep such shortcut links between applications on the RDF layer (application layer). However, on the network layer the shortcuts are automatically maintained in the dynamic network scenario and are integrated in the query algorithm of the network with all its backup techniques for failed routing. The creation of shortcuts on the network layer requires the interaction of the application and the network. For the creation of shortcuts, the related keys in the set of RDF triples stored in the network are identified, and peers maintain
statistics of these related keys. For example, in the set of stored triples describing professor instances in Figure 3.9(a), the keys type, name, teach, author, P1...P5, Pub1, Pub2, Course1, etc. are identified as related keys. Peers create statistics on these related RDF keys with the help of the keys they are responsible for. For example, in Figure 3.9(c) the peer responsible for the key Professor maintains statistics of the other related keys (name, author, teach, Course1, Course2, Pub1, Pub2). In the second step, we apply an algorithm that looks for keys, among these related keys, that often occur together; i.e., sets of subjects tend to have these RDF keys as predicates or objects. For example, for the triples in Figure 3.9(a), type, name, and author tend to be defined as predicates for similar subjects that represent professor entities.

The related RDF keys (search keys) normally appear as constants in successively occurring triple patterns of SPARQL queries. For example, considering the evaluation of a triple pattern at the peer responsible for the key Professor, shown in Figure 3.9(b), the RDF keys name, author, or Course1 are the keys relevant to the key Professor, and might be used as lookup keys for the evaluation of the next triple pattern. For the efficient evaluation of SPARQL queries in our tree-based 3rdf system, a peer responsible for an RDF key (index key) creates and maintains statistics of other related RDF keys, and the keys with high occurrences (frequencies) are selected as potential candidates for shortcuts. For instance, in Figure 3.9(c), the peer responsible for the index key Professor keeps statistics on the frequencies of the other relevant RDF keys used as subjects, predicates or objects of the triples in Figure 3.9(a).
On the basis of the occurrences of these relevant keys, the peer might choose the name and author RDF keys for the creation of routing shortcuts to the peers responsible for these index keys.

A peer keeps statistics only for those RDF keys which have some relation to the keys it is responsible for. In these statistics, we define the frequency of a key with value v as the total number of occurrences of value v in the set of relevant triples stored in the network. For example, in the statistics maintained at the peer responsible for the key Professor in Figure 3.9(c), the frequency of the key name (i.e., freq=5) denotes the number of professor instances having the name property in the given set of stored triples. The local statistics maintained by the individual peers together form the global statistics required by the network layer for the creation of routing shortcuts. These global statistics can be represented through Peers Links Graphs, defined as follows:

Definition 3.1 A Peers Links Graph G is a tuple (N, E) comprising a set N of vertices or nodes together with a set E of edges. Peers and RDF keys (index keys) comprise the set of nodes N, and E is the set of directed weighted edges from peers
to their related RDF keys in graph G. The weight of an edge in E from a peer to a key with value v is the total number of occurrences of value v in the set of relevant stored triples.

In the set E of directed weighted edges from peers to relevant RDF keys defined above, the number of tail nodes is bounded by the total number of peers in the system, and the number of head nodes is bounded by the number of distinct terms (keys) that appear as subject, predicate and object values of the stored triples. Several Peers Links Graphs are created in this way for different clusters of related triples. Each of these Peers Links Graphs is also connected with the rest of the peers in the network. An example of a partial Peers Links Graph, created for the peers sharing the instances of the Professor entity, is shown in Figure 3.10. The edge from Peer1 to a relevant RDF key (course1) with weight 2 indicates that only two professor instances teach the course course1.

(a) Some example RDF triples

    Subj    Pred      Obj
    P1      type      Professor
    P1      name      ABC
    P1      teach     Course1
    Pub1    type      Publication
    Pub1    author    P1
    P2      type      Professor
    P2      name      XYZ
    Pub2    type      Publication
    Pub2    author    P2
    P3      type      Professor
    P3      name      LMN
    P3      teach     Course2
    Pub2    author    P3
    P4      type      Professor
    P4      name      PQR
    P5      type      Professor
    P5      name      RST

(b) Next possible triple patterns at the peer responsible for key (Professor)

    ?P      type      Professor
    ?P      name      ?n
    ?Pub    author    ?P
    ?P      teach     Course1

(c) Statistics kept at the peer responsible for key (Professor)

    Key     name    author    teach    Course1    Course2    Pub1    Pub2
    freq    5       3         2        1          1          1       2

Figure 3.9: Some sample RDF triples, possible triple patterns, and statistics of relevant keys kept at the peer responsible for the key (Professor).
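The statistics of Figure 3.9(c) can be reproduced from the triples of Figure 3.9(a) with a short sketch. The counting rule used here is an assumption inferred from the figure rather than the algorithm of the prototype: for every non-type triple that has a Professor instance as subject or object, the predicate is counted once, and the term at the other end is counted once unless it is a literal. The function names and the quoting convention for literals are likewise hypothetical.

```python
from collections import Counter

# Triples of Figure 3.9(a); literals are written with surrounding double quotes.
TRIPLES = [
    ("P1", "type", "Professor"), ("P1", "name", '"ABC"'), ("P1", "teach", "Course1"),
    ("Pub1", "type", "Publication"), ("Pub1", "author", "P1"),
    ("P2", "type", "Professor"), ("P2", "name", '"XYZ"'),
    ("Pub2", "type", "Publication"), ("Pub2", "author", "P2"),
    ("P3", "type", "Professor"), ("P3", "name", '"LMN"'), ("P3", "teach", "Course2"),
    ("Pub2", "author", "P3"),
    ("P4", "type", "Professor"), ("P4", "name", '"PQR"'),
    ("P5", "type", "Professor"), ("P5", "name", '"RST"'),
]


def is_literal(term):
    return term.startswith('"')


def related_key_frequencies(triples, class_key):
    """Count the keys that co-occur with instances of class_key."""
    instances = {s for s, p, o in triples if p == "type" and o == class_key}
    freq = Counter()
    for s, p, o in triples:
        if p == "type":
            continue  # the type triple itself defines the instances
        if s in instances:
            other = o
        elif o in instances:
            other = s
        else:
            continue  # triple is unrelated to the class
        freq[p] += 1
        if not is_literal(other):
            freq[other] += 1
    return freq


def shortcut_candidates(freq, limit):
    """Return the `limit` most frequent keys (ties broken alphabetically)."""
    ranked = sorted(freq.items(), key=lambda item: (-item[1], item[0]))
    return [key for key, _ in ranked[:limit]]


freq = related_key_frequencies(TRIPLES, "Professor")
candidates = shortcut_candidates(freq, limit=2)
print(dict(freq))
print(candidates)
```

With a limit of two shortcuts, the sketch selects name and author, the same two keys the text names as likely shortcut targets for the peer responsible for Professor.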
We assume that RDF keys that tend to occur together in a given set of RDF triples also tend to be queried together over these triples. We are therefore interested in the keys that often occur together, and thus only consider the top-weighted edges of a Peers Links Graph for the creation of shortcuts. For example, in Figure 3.10, the edge from Peer1 to the predicate name with weight 80 indicates
that the majority of professor instances have the name property, and this edge thus has a higher chance of being selected for the creation of a shortcut than the edge from Peer1 to the object course1 with weight 2. These routing shortcuts considerably speed up the query routing process, but the trade-off is the extra cost of maintaining additional routing links. Each peer in our system is thus allowed to place only a limited number of routing shortcuts. In the performance evaluation in Section 3.7, we will show a reduction of up to 50% in query time with the creation of only 3 shortcuts per peer in the underlying 3nuts overlay.

Figure 3.10: Example Peers Links Graph describing the relationship between instances of class Professor. (Nodes shown: Peer 1 with key Professor, Peer 2 with key name, Peer 3 with key course1, Peer 4 with key author; edge weights shown: 80, 100 and 2.)

3.6 Prototype Architecture

In this section we elaborate on our system 3rdf, a Peer-to-Peer based RDF repository for the distributed management of RDF data. We describe the general architecture of 3rdf, the architectural design of each 3rdf peer, and the API supported by 3rdf.

3.6.1 3rdf architecture

To enable the efficient storage and querying of RDF data, peers in the 3rdf system are organized according to the search-tree based 3nuts network [63] protocol. Any peer in the system can accept requests from providers to store RDF data in the system, and can also accept requests from consumers to evaluate SPARQL queries over the data stored in the network. Providers can insert RDF data into the network as RDF documents, which can be in RDF/XML or RDF/N3 format. The peer which receives a request for storing an RDF document first decomposes it into a collection of RDF triples and then stores these triples using the indexing technique described in Section 3.2.

Alternatively, when a peer receives a SPARQL query request, it transforms the query into a sequence of triple patterns, and the peers in 3rdf then cooperate to find the RDF triples that match the triple patterns of the query. The query processing algorithm we use for the evaluation of SPARQL queries in our system is described earlier in this chapter.

3.6.2 3rdf peer architecture

A high-level view of each 3rdf peer's architecture is shown in Figure 3.11. It is a two-layer model with the 3nuts network layer at the bottom, as the basis for the distributed application, and the 3rdf layer for RDF storing and querying on top. In the 3nuts network, each 3nuts peer has a local view of the Search Tree, which enables the peer to search in the distributed tree of the entire network. For routing, the peers use the UDP protocol. On top of the overlay network, there is a distributed Index Management which provides operations for putting triples into and getting triples from the distributed network. Here, the search functionality of the 3nuts network in the search tree is used to place triples at the peers responsible for the corresponding index keys and, conversely, to download triples for a certain index key from the responsible peers. TCP/IP connections are used for the triple exchange. The same network connections are shared by the 3rdf query processing for exchanging queries and results between 3rdf peers.

The input for the distributed RDF storage consists of RDF documents, which are converted to tuples (key, triple) in the RDF Triple Processor in order to inject them into the Index Management with the Put operation. Three tuples are created for each triple, with the subject, the predicate, and the object as the respective keys, so that all three components are indexed. Each 3rdf peer is then responsible for a range of index keys, and the Index Management stores the corresponding tuples for these index keys.
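The threefold indexing can be sketched in a few lines. This is a single-process stand-in and not the 3rdf API: the class and method names are hypothetical, one dictionary plays the role of the whole distributed index, and the two example triples are invented for the illustration.

```python
from collections import defaultdict


class IndexManagement:
    """Hypothetical in-memory stand-in for the distributed (key, triple) index."""

    def __init__(self):
        self._entries = defaultdict(list)

    def put(self, key, triple):
        # In a real overlay the key would first be routed to the responsible peer.
        self._entries[key].append(triple)

    def get(self, key):
        return list(self._entries.get(key, []))


def index_triple(index, triple):
    """Create three (key, triple) tuples: one each for subject, predicate, object."""
    for key in triple:
        index.put(key, triple)


index = IndexManagement()
index_triple(index, ("d0v0:S1", "ub:advisor", "d0v0:P3"))
index_triple(index, ("d0v0:S1", "ub:takescourse", "d0v0:C1"))

# A triple pattern such as (?x, ub:advisor, ?y) is answered by a lookup on its constant.
print(index.get("ub:advisor"))
```

Storing every triple under each of its three components is what allows any triple pattern with at least one constant to be resolved by a single key lookup.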
As we cannot directly perform SPARQL queries on the internal data structures of the Index Management, we synchronize the triples from the Index Management with a local database. This enables us to pose SQL queries on the triples in the database. We have used SQLite, a lightweight relational database, for this purpose. To perform SPARQL queries, we first transform a SPARQL statement into a sequence of so-called triple patterns in the Query Parser. This separation of the query into smaller partial queries reflects the single steps of execution at different 3rdf peers, each working only with its local database and some intermediate results. The triple pattern sequence is then passed to the Query Processor, which controls the distributed execution of the query. There are basically two cases in the distributed execution. In the first case, the 3rdf peer executes the next triple pattern in the sequence if it is capable of resolving the pattern with its local database, because the peer is responsible for the given subject, predicate, or object and has
the corresponding triples in its database. Otherwise, it uses the search operation of the 3nuts peer to find the peer that can execute the query and transmits the query and some intermediate results to that 3rdf peer. For the transmission it uses the TCP connections of the 3nuts network layer, which can be established between the peers in the network on demand.

Figure 3.11: Implementation overview of a 3rdf peer in a distributed environment. (Components shown in the 3rdf layer: Query Parser, Query Processor, RDF Triple Processor, DB with triples; in the 3nuts layer: Index Management, TCP Connections, UDP Socket, 3nuts Peer, Search Tree. Labelled inputs and interactions: SPARQL request, RDF documents, query and partial result transmission, local SQL query, search(key):address, put(key,triple), synchronize local triples.)

3.7 Performance Analysis

In this section, we analyze the performance of distributed RDF systems in a Peer-to-Peer environment. We compare the performance of Peer-to-Peer networks based on DHTs with that of search-tree based networks. For this, we have implemented a DHT (Chord) layer as an extension to the prototype of our tree-based 3rdf system described in Section 3.6. The performance difference between DHTs and search-tree based networks shows more plainly for networks with a large number of peers, as the routing mechanism of DHTs gets more expensive the more peers are involved. Owing to the lack of access to resources with a large number of peers, we use the prototype of our 3rdf system for a simulation in which we can run multiple peers on one machine. We simulate our distributed RDF system using
either the DHT-based overlay Chord or the search-tree based overlay 3nuts and compare the performance of both. The results obtained by the experimental performance evaluation of Chord and 3nuts in a simulator might be transferable to the complete classes of DHT-based and search-tree based Peer-to-Peer networks.

Our Java-implemented simulator includes a simulation of the physical network and of the overlay network, with the distributed RDF system on top. It shares the triple storage and query evaluation code with the 3rdf prototype, and maintains a separate triple store and query processor for each peer in the network. During the triple storage and query processing operations, when a message is sent via the network, this message is intercepted and answered by a simulated instance of the peer running on the same machine. In our analysis, the simulation supports the experiments by giving results for networks with large numbers of peers. We repeated all experiments several times, but the variation of the result values was negligible and is thus not presented here. We are mainly interested in determining the effects of the underlying overlay structures on the response times of queries.

For testing we use the Lehigh University Benchmark (LUBM) by Guo et al. [48]. LUBM consists of a university domain ontology that describes an academic setting with professors, students, courses, departments, publications and other concepts found in such a setting. LUBM provides synthetic RDF datasets of arbitrary sizes, and we generated a data set of two universities, each with 16 departments. There were 223,510 triples in total, which were indexed 3 times in the system. The network contained up to 16,285 peers, and all peers ran on a server machine with two processors of type 6c Intel Xeon E and 96 GB memory.
As discussed in Section 3.5, application domains usually mark their own ontologies in RDF data with a unique namespace. Accordingly, each university department in the test set gets a distinctive namespace. In the example triples in Listing 3.3, department 0 of university 0 uses the namespaces ubd0v0 and d0v0 for the description of its resources.

    @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix ubd0v0: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#> .
    @prefix d0v0: <http://www.Department0.University0.edu#> .

    d0v0:P3 rdf:type ubd0v0:Professor .
    d0v0:P3 ubd0v0:name "Georg" .
    d0v0:P3 ubd0v0: georg@ub.com .
    d0v0:P3 ubd0v0:teach d0v0:course1 .

Listing 3.3: RDF triples about a resource d0v0:P3 encoded in RDF/N3 format.

We have generated 5 query sets (based on LUBM queries) that sum up to 130 unique queries in total. Each set shares queries of common structures with distinctive namespaces for each department. The first query set, for example, comprises queries of the types discussed in Section 3.7.1, and we vary, for instance, the namespace ubd0v0 across the 32 departments to generate more queries. In order to demonstrate more clearly the effects of the information and interest localities provided by the underlying 3nuts network on query processing at the application layer, each query in our sets of custom queries had between 5 and 10 triple patterns. The queries were posted sequentially, with only a single query active at any time.

3.7.1 Discussion of Queries

With the help of the following representative queries, we analyze several aspects of our search-tree based RDF system, namely the effect of information and interest locality on the number of routing steps, and the effect of information and interest locality on the query processing time.

Query Q1

SPARQL query Q1 in Listing 3.4 is a chain-type query. As described in Section 3.2, a constant value in the triple pattern is used as the key for routing: the constant ubd0v0:UndergraduateStudent is used as the lookup key to determine the peer responsible for the evaluation of the first triple pattern. After the evaluation of the first triple pattern, the corresponding peer uses the constant ubd0v0:name in the second triple pattern to route the query to the peer responsible for the evaluation of this triple pattern. In the same fashion, the constants ubd0v0:advisor, ubd0v0:name, ubd0v0:teacherof and ubd0v0:takescourse in the third, fourth, fifth and sixth triple patterns, respectively, are used as keys to send the query to the peers responsible for the evaluation of these triple patterns. In the seventh triple pattern we have two constants, rdf:type as predicate and ubd0v0:Course as object value.
For the evaluation of this triple pattern we use the constant ubd0v0:Course as lookup key, because our query algorithm, discussed in Section 3.3.2, prefers the value of the subject over that of the object, and the value of the object over that of the predicate, as the key for query routing. Finally, the constant ubd0v0:name in the last triple pattern is used to find the peer responsible for its evaluation.

We can note that all of the aforementioned bound components (constants) in the successive triple patterns of the query, which are used as lookup keys in the underlying network, share the common namespace ubd0v0. Since the corresponding triples indexed on these components are stored in 3nuts in the same subtree with the path ubd0v0, the peers managing this subtree have fast lookup to each other (i.e.,
information locality). Hence, we expect that only a small number of routing hops is needed for the evaluation of the triple patterns in the given query. In addition, the lookup keys (ubd0v0:UndergraduateStudent, ubd0v0:name, ubd0v0:advisor and ubd0v0:takescourse) in the triple patterns of the given query are triple components that often occur together in the description of undergraduate student instances. So, there is a high chance of having routing shortcuts between these lookup keys in the underlying 3nuts overlay. In return, we can foresee a significant reduction in the number of routing hops, and consequently in the response time of the given query, in our 3nuts-based RDF system.

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX ubd0v0: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
    SELECT ?std_name ?teacher_name ?course_name
    WHERE {
      ?X rdf:type ubd0v0:UndergraduateStudent .
      ?X ubd0v0:name ?std_name .
      ?X ubd0v0:advisor ?Y .
      ?Y ubd0v0:name ?teacher_name .
      ?Y ubd0v0:teacherof ?Z .
      ?X ubd0v0:takescourse ?Z .
      ?Z rdf:type ubd0v0:Course .
      ?Z ubd0v0:name ?course_name
    }

Listing 3.4: SPARQL query returning the names of the students, their advisors and the courses taken, provided that the courses taken are taught by their supervisors.

Query Q2

Query Q2 in Listing 3.5 is a star-shaped query in which all triple patterns share the same subject variable, while not all lookup keys of successive triple patterns share a common namespace. Following the query processing described above, the constant ubd0v0:GraduateStudent in the object position of the first triple pattern, and the constants ubd0v0:name and ubd0v0:address in the second and third triple patterns, respectively, are used as lookup keys for determining the responsible peers. For the evaluation of the fourth triple pattern, the object value <http://www.Department0.University0.edu> is used as lookup key.
Lastly, for the evaluation of the remaining two triple patterns, the constants ubd0v0:advisor and ubd0v0:takescourse in the corresponding triple patterns are respectively used as routing keys. We can note that the lookup keys of the successive triple patterns before the fourth one share the common prefix ubd0v0. For the processing of the fourth triple pattern, the bound component <http://www.Department0.University0.edu> is used as key, which does not share a common prefix with the lookup keys of its predecessor and successor triple patterns. The last triple pattern again shares the common prefix ubd0v0 with its predecessor triple pattern.

For the evaluation of such queries, our system exploits the information locality of 3nuts to some extent; for instance, the first three triple patterns share the common prefix ubd0v0 in their lookup keys, and thus the corresponding peers have fast routing between each other. The lookup key in the fourth triple pattern does not share a common prefix with the key of its predecessor triple pattern, and similarly the lookup key of the fifth triple pattern does not share a common prefix with that of its predecessor; thus the evaluation of these two triple patterns cannot exploit the information locality of the underlying 3nuts overlay and needs a high number of routing hops. The last triple pattern again shares the common prefix ubd0v0 with its predecessor triple pattern and thus needs only a small number of routing hops for its evaluation. In addition, the lookup keys in the triple patterns of the given query (ubd0v0:GraduateStudent, ubd0v0:name, ubd0v0:address, ubd0v0:advisor and ubd0v0:takescourse) are triple components which frequently occur together in the description of graduate students. Hence, there is a high chance of having routing shortcuts between these lookup keys in the underlying 3nuts overlay, and our system would exploit these shortcuts to reduce the routing hops and consequently the response time of the given query.

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX ubd0v0: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
    PREFIX d0v0: <http://www.Department0.University0.edu/>
    SELECT ?name ?address ?adv ?course
    WHERE {
      ?X rdf:type ubd0v0:GraduateStudent .
      ?X ubd0v0:name ?name .
      ?X ubd0v0:address ?address .
      ?X ubd0v0:memberof <http://www.Department0.University0.edu> .
      ?X ubd0v0:advisor ?adv .
      ?X ubd0v0:takescourse ?course
    }

Listing 3.5: SPARQL query returning the names of the students, their addresses, advisors and courses taken.
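The choice of lookup keys in the discussion of Q1 and Q2 follows the stated preference of subject over object and object over predicate. A minimal sketch of that rule, applied to the triple patterns of Listing 3.5, is given below; the function names and the tuple representation of triple patterns are assumptions for illustration, not code from the 3rdf prototype.

```python
def is_variable(term):
    """SPARQL variables start with a question mark."""
    return term.startswith("?")


def lookup_key(pattern):
    """Pick the routing key of a triple pattern: subject, else object, else predicate."""
    subject, predicate, obj = pattern
    for term in (subject, obj, predicate):
        if not is_variable(term):
            return term
    return None  # pattern without constants: no key to route on


# Triple patterns of query Q2 (Listing 3.5) as (subject, predicate, object) tuples.
Q2 = [
    ("?X", "rdf:type", "ubd0v0:GraduateStudent"),
    ("?X", "ubd0v0:name", "?name"),
    ("?X", "ubd0v0:address", "?address"),
    ("?X", "ubd0v0:memberof", "<http://www.Department0.University0.edu>"),
    ("?X", "ubd0v0:advisor", "?adv"),
    ("?X", "ubd0v0:takescourse", "?course"),
]

keys = [lookup_key(pattern) for pattern in Q2]
for pattern, key in zip(Q2, keys):
    print(pattern, "->", key)
```

The fourth key is the only one outside the ubd0v0 namespace, which is why the text expects the hops into and out of that pattern to be the expensive ones.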
Query Q3

For query Q3 in Listing 3.6, the constants ubd0v0:GraduateStudent in the object position of the first triple pattern, GraduateStudent100 in the object position of the second triple pattern, ubd0v0:takesCourse in the predicate position of the third triple pattern, d0v0:AssociateProfessor2 in the object position of the fourth triple pattern and ubd0v0:publicationAuthor in the predicate position of the fifth triple pattern are respectively used as lookup keys. Note that none of these lookup keys in successive triple patterns of the given query shares a common namespace (prefix) with its predecessor. Since corresponding triples

indexed on these components (keys) are stored in different subtrees of 3nuts, the peers managing these subtrees do not have fast lookup to each other (i.e., absence of information locality). We will see in the next section that the information locality feature of the 3nuts network does not have any effect on the optimization of such queries. However, since every peer in 3nuts is allowed to create a constant number of shortcuts to relevant triple components, there is only the possibility of a shortcut from the peer responsible for the key GraduateStudent100 to the peer managing the key ubd0v0:takesCourse. We will see in the evaluation section that the 3nuts overlay has almost no performance gain over the Chord overlay for the processing of such unusually posed queries.

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX ubd0v0: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
    PREFIX d0v0: <http://www.Department0.University0.edu/>
    SELECT ?course ?publication
    WHERE {
      ?X rdf:type ubd0v0:GraduateStudent .
      ?X ubd0v0:name "GraduateStudent100" .
      ?X ubd0v0:takesCourse ?course .
      ?X ubd0v0:advisor d0v0:AssociateProfessor2 .
      ?publication ubd0v0:publicationAuthor ?X
    }

Listing 3.6: SPARQL query returning the courses and publications for GraduateStudent100.

Query Q4

Query Q4 in Listing 3.7 is a star-shaped query. The constants ubd0v0:UndergraduateStudent, <http://www.Department0.University0.edu>, ubd0v0:name, d0v0:Course4, ubd0v0:address and d0v0:AssistantProfessor1 are used as lookup keys in successive triple patterns of the given query. As in the case of query Q3, none of these lookup keys in query Q4 shares a common prefix (namespace) with its predecessor lookup key. So, there is no chance of exploiting the information locality feature of 3nuts for the optimization of such queries.
However, a large portion of the triples stored in the network that describe undergraduate students share the components (keys) ubd0v0:UndergraduateStudent, <http://www.Department0.University0.edu>, ubd0v0:name, d0v0:Course4 and ubd0v0:address in their predicate or object positions. Therefore, there is a high chance of having routing shortcuts in the 3nuts network from the key ubd0v0:UndergraduateStudent to the relevant key <http://www.Department0.University0.edu>, from <http://www.Department0.University0.edu> to the key ubd0v0:name, and a shortcut between the keys d0v0:Course4

and ubd0v0:address. We will see that although information locality has no effect on the processing of this query, the routing shortcuts between these relevant lookup keys reduce the processing time of such queries to a large extent.

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX ubd0v0: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
    PREFIX d0v0: <http://www.Department0.University0.edu/>
    SELECT ?name ?address
    WHERE {
      ?X rdf:type ubd0v0:UndergraduateStudent .
      ?X ubd0v0:memberOf <http://www.Department0.University0.edu> .
      ?X ubd0v0:name ?name .
      ?X ubd0v0:takesCourse d0v0:Course4 .
      ?X ubd0v0:address ?address .
      ?X ubd0v0:advisor d0v0:AssistantProfessor1
    }

Listing 3.7: SPARQL query returning the names and addresses of particular undergraduate students.

In the next section we experimentally evaluate the performance of DHT-based and search-tree-based RDF systems with respect to the performance metrics routing steps and time for processing the aforementioned SPARQL queries.

Analysis of Routing

Distributed query processing in an RDF system consists of searching, transferring and evaluating RDF data. The distributed search structure of P2P networks achieves scalability but slows the search down, typically to O(log n) routing steps (called hops) in a network with n peers, compared to a server with sufficient resources. When the bandwidth of modern networks is high but the ping (delay) of a data transmission is comparatively large, the response time of queries can be driven by these delays, because each routing step in the chain produces an additional delay. As mentioned above, RDF resources are normally represented by URIs, and resources belonging to a particular application domain usually share common namespaces or prefixes (e.g. the Professor, name, , and teach keys in the triples of Listing 3.3 share the common prefix ubd0v0).
Thus, the support of efficient range or prefix queries in tree-based overlays, which is equivalent to short lookup times between network keys with the same prefix, achieves short query times when RDF data with the same prefix is associated in a query (see the example query in Listing 3.4).

Routing Analysis for Query Q1

Figure 3.12a shows the mean number of hops of the lookup operation for resolving query Q1 of Listing 3.4 depending on the network size n. The linear functions in the log-linear plot, especially for Chord, indicate a logarithmic hop number. The triple components of this query are all in the domain ubd0v0. Consequently, the corresponding triples are stored in 3nuts in the same subtree with path ubd0v0, and the peers managing this subtree have fast lookup to each other. This keeps the hop numbers down for small networks of up to 64 peers, where a single peer or only a few peers manage the data for that domain. For larger networks more routing is needed within the subtree, and starting at 128 peers we see a linear increase in routing hops as in Chord.

Figure 3.12: Measurement results for resolving the query in Listing 3.4. (a) Comparison of 3nuts and Chord (routing hops over #peers, log scale). (b) Impact of information locality (hops reduction ratio over the triple patterns of the query).

While Figure 3.12a shows the total number of hops during the processing of a query, Figure 3.12b shows the number of hops needed for routing to the next triple pattern in the chain of triple patterns of the query (as a percentage of the expected hop number O(log n)). In the Chord overlay, we need almost the same number of hops to look up all indexed triple components. On the contrary, in the 3nuts overlay, the hops for a lookup decrease after the first step by 50%, and the reason is again the information locality: the first step starts at an arbitrary peer, which might not participate in the subtree of domain ubd0v0 in the 3nuts tree, and needs O(log n) hops, but once the query has entered this subtree, it stays inside for the rest of the query steps with fast routing. As discussed for query Q1 in Section 3.7.1, there is a high chance of having routing shortcuts between the lookup keys (ubd0v0:UndergraduateStudent, ubd0v0:name, ubd0v0:advisor and ubd0v0:takesCourse) in the underlying 3nuts overlay.
Figure 3.13 shows further reductions in the number of hops needed for evaluating query Q1 in Listing 3.4 with the creation of up to 5 shortcuts by each peer in the 3nuts network. If a peer interested in the evaluation of a triple pattern maintains a shortcut to the lookup key of this triple pattern, then it takes 0 hops to reach this lookup key. The establishment of a constant number of shortcuts for each peer to frequently occurring relevant triple components (RDF keys) in the network can result in a significant reduction in the number of hops, e.g. the number of

hops needed for our example query is reduced from 26 to 10 with the creation of only 3 shortcuts.

Figure 3.13: Hops needed for the evaluation of the given query in Listing 3.4 (routing hops over #shortcuts).

Routing Analysis for Query Q2

Figure 3.14: Measurement results for resolving the query in Listing 3.5. (a) Comparison of 3nuts and Chord (routing hops over #peers, log scale). (b) Impact of information locality (hops reduction ratio over the triple patterns of the query).

The triple components (ubd0v0:GraduateStudent, ubd0v0:name and ubd0v0:address) used as lookup keys in the first three triple patterns of query Q2 in Listing 3.5 are all in the domain ubd0v0. The triple component used as lookup key for the evaluation of the fourth triple pattern is <http://www.Department0.University0.edu>, and the lookup keys (ubd0v0:advisor and ubd0v0:takesCourse) for the last two triple patterns are again in the domain ubd0v0. The information locality of 3nuts keeps the number of hops of the lookup operation

down for the evaluation of successive triple patterns sharing the prefix ubd0v0 in their lookup keys (components). Consequently, Figure 3.14a shows that the total number of hops needed for the evaluation of the given query on 3nuts is considerably lower than the number of hops needed in the Chord overlay, in particular for large networks starting with 64 peers. Figure 3.14b shows the number of hops needed for routing to the next triple pattern in the chain of triple patterns of the query in Listing 3.5. In the Chord overlay, we need almost the same number of hops to look up all indexed triple components. On the contrary, in the 3nuts overlay, the lookup hops after the first step decrease by 40% for the evaluation of the second and third triple patterns (information locality). For the fourth and fifth triple patterns, we need almost the same number of hops as in the Chord overlay; the reason is again that the lookup keys of these triple patterns do not share a common prefix with their predecessors. The lookup hops for the last triple pattern decrease again by 40%. As discussed for query Q2 (Listing 3.5) in the previous section, the majority of the triple components of the example query (ubd0v0:GraduateStudent, ubd0v0:name, ubd0v0:address, ubd0v0:advisor and ubd0v0:takesCourse) frequently occur together in triples stored in the network. The presence of routing shortcuts between these relevant triple components (lookup keys) in the network further reduces the number of hops needed. Figure 3.15 shows a reduction in the number of hops needed for evaluating the query in Listing 3.5 from 25 to 12 with the creation of only 3 shortcuts.

Figure 3.15: Hops needed for the evaluation of the given query in Listing 3.5 (routing hops over #shortcuts).
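The effect of shortcuts on the hop count follows from the rule stated for Figure 3.13: a lookup key reachable through a shortcut costs 0 hops. A minimal sketch of that accounting is given below. The function name `total_hops` is hypothetical, and the per-pattern hop counts are made-up illustrative values, not the measurements behind Figures 3.13 or 3.15.

```python
def total_hops(hops_per_pattern, has_shortcut):
    """Sum the lookup hops of a query's triple patterns.

    A pattern whose lookup key is reachable via a shortcut contributes
    0 hops; every other pattern contributes its routed hop count.
    """
    if len(hops_per_pattern) != len(has_shortcut):
        raise ValueError("one shortcut flag is needed per triple pattern")
    return sum(0 if shortcut else hops
               for hops, shortcut in zip(hops_per_pattern, has_shortcut))


# Illustrative (invented) hop counts for a query with six triple patterns.
hops = [5, 4, 4, 5, 5, 4]

without_shortcuts = total_hops(hops, [False] * 6)
with_three_shortcuts = total_hops(hops, [False, True, True, False, False, True])

print(without_shortcuts, with_three_shortcuts)  # 27 15
```

The first pattern never benefits from a shortcut in this sketch because the query starts at an arbitrary peer; that choice is an assumption made only for the example.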

Figure 3.16: Measurement results for resolving the query in Listing 3.6. (a) Comparison of 3nuts and Chord (routing hops over #peers, log scale). (b) Impact of information locality (hops reduction ratio over the triple patterns of the query).

Routing Analysis for Query Q3

The triple components (ubd0v0:GraduateStudent, GraduateStudent100, ubd0v0:takesCourse, d0v0:AssociateProfessor2 and ubd0v0:publicationAuthor) used as lookup keys in successive triple patterns of query Q3 in Listing 3.6 do not share any common prefixes with their predecessors. Consequently, the corresponding triples indexed on these keys are stored in different subtrees of 3nuts, and the peers managing these subtrees do not have fast lookup to each other (absence of information locality). Figure 3.16a shows that for all network sizes (starting from 2 peers) both the 3nuts and the Chord overlay take almost the same number of hops to resolve the example query. Figure 3.16b also shows the same lookup performance for both overlays: in both networks, the number of hops needed for routing to the next triple pattern in the chain of triple patterns of the example query is almost the same for all indexed triple components. The aforementioned triple components of this query also occur together very rarely in triples stored in the network. Given that each peer may establish only a constant number of shortcuts to relevant triple components, there is only the possibility of a shortcut from the peer responsible for the key GraduateStudent100 to the peer managing the key ubd0v0:takesCourse in the next triple pattern. Figure 3.17 shows a slight reduction in the number of hops needed for the evaluation of the example query with the creation of up to 5 shortcuts in the 3nuts network.

Figure 3.17: Hops needed for the evaluation of the given query in Listing 3.6 (routing hops over #shortcuts).

Figure 3.18: Measurement results for resolving the query in Listing 3.7. (a) Comparison of 3nuts and Chord (routing hops over #peers, log scale). (b) Impact of information locality (hops reduction ratio over the triple patterns of the query).

Routing Analysis for Query Q4

Analogous to the query of Listing 3.6, the triple components (ubd0v0:UndergraduateStudent, <http://www.Department0.University0.edu>, ubd0v0:name, d0v0:Course4, ubd0v0:address and d0v0:AssistantProfessor1) in successive triple patterns of query Q4 in Listing 3.7 also do not share a common prefix with their predecessor keys. Consequently, the peers responsible for these keys (components) do not have fast lookup to each other. Figures 3.18a and 3.18b show that the 3nuts overlay has no performance gain, in terms of the lookup operation, over the Chord overlay for the processing of the given query. On the other hand, the majority of the triple components of this query frequently occur together in triples stored in the network. Based on the occurrences of relevant triple components in stored triples, we can foresee a shortcut from the peer responsible for the key ubd0v0:UndergraduateStudent to the key <http://www.Department0.University0.edu>, and a further shortcut from this key to the peer managing the component ubd0v0:name. There is the possibility of another shortcut between the keys d0v0:Course4 and ubd0v0:address. Although the triple components of this query used as lookup keys cannot exploit the information locality of the 3nuts overlay, the shortcuts between the aforementioned components reduce the required hops to a large extent. Figure 3.19 shows that the number of hops needed for the processing of the query in Listing 3.7 is reduced from 35 to 16 with the creation of only 2 shortcuts in the 3nuts overlay.

Figure 3.19: Hops needed for the evaluation of the given query in Listing 3.7 (routing hops over #shortcuts).

Analysis of Query Response Time

Our simulation does not simulate a real-world physical network, and it is impossible to find one suitable model covering all possible fields of application such as the Internet, an intranet, and so on. In our analysis we use connections between peers with homogeneous delay and bandwidth. Therefore, we present in this section how to derive the query response time from the given experimental routing data for a given physical network model. Let k denote the product of the delay and the bandwidth b in the physical network, and let p_i represent a triple pattern. Then the response time t of a query with m triple patterns evaluated on top of the P2P network is the sum of computation

time, routing time, and data transmission time, defined as follows:

\[
t = \sum_{i=1}^{m} \Bigl( \underbrace{c_i}_{\text{comp-time}} + \underbrace{\mathit{hops}_i}_{\text{lookup time}} + \underbrace{\Bigl\lceil \frac{\mathit{data}_i}{k} \Bigr\rceil}_{\text{transmission time}} \Bigr)
\]

where c_i is the computation time for evaluating pattern p_i locally, hops_i is the number of hops to look up the peer providing the triples for pattern p_i, and data_i is the data size of the triples of pattern p_i to transfer to the next peer as intermediate results. The computation time c_i is comparatively very small compared to the lookup and transmission times, and we neglect it in our calculation of the query time. The query time t thus becomes:

\[
t = \sum_{i=1}^{m} \Bigl( \mathit{hops}_i + \Bigl\lceil \frac{\mathit{data}_i}{k} \Bigr\rceil \Bigr).
\]

Response Time Analysis for Query Q1

According to the query time equation discussed above, the response time of a query is mainly driven by routing delays and data transfer time. When the bandwidth is very high, routing delays are the dominant factor, and reducing routing hops with the 3nuts overlay has a big influence. We can reduce the response time of query Q1 in Listing 3.4 to 65% for networks with delay 0.2 s and bandwidth > 100 kb/s, see Figure 3.20a, where we calculate the ratio of the query response time for the actual routing of the query to the expected time for an arbitrary routing in the network. The routing during query processing in the Chord overlay needs almost the same time as arbitrary routing steps between arbitrary peers, i.e., there is nearly no effect. The improvement in query time due to the information locality of 3nuts, shown in Figure 3.20a, can be enhanced with the creation of a constant number of shortcuts for each peer to frequently occurring relevant triple components (RDF keys) in the underlying 3nuts network, e.g. the query time in our example can be further reduced to 30% with the creation of 4 shortcuts for each peer, as shown in the same Figure 3.20a.
The creation of shortcuts also reduces the network traffic and the routing load of the peers, because a peer maintaining shortcuts to relevant RDF keys has direct links to the peers responsible for these keys. Figure 3.20b shows the ratio of improvements in the response time of the query in Listing 3.4 with the creation of up to 5 shortcuts by each peer in the 3nuts overlay. The response time of the example query is reduced almost to 50% with the creation of 3 shortcuts in the underlying 3nuts overlay.
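The simplified query time equation (with the computation time neglected) can be evaluated directly. The following is a minimal sketch under stated assumptions: the function name `query_time` and its parameters are hypothetical, the result is counted in units of one network delay, and the hop counts and intermediate result sizes are invented for illustration. Only the delay of 0.2 s and the bandwidth of 100 kb/s are taken from the text.

```python
import math


def query_time(hops, data_sizes, delay, bandwidth):
    """Evaluate t = sum_i (hops_i + ceil(data_i / k)) with k = delay * bandwidth.

    hops[i]       routing hops needed to reach the peer for triple pattern i
    data_sizes[i] size of the intermediate result shipped after pattern i
                  (same data unit as used in `bandwidth`)
    The returned value counts network delays, not seconds.
    """
    k = delay * bandwidth
    return sum(h + math.ceil(d / k) for h, d in zip(hops, data_sizes))


# Delay 0.2 s and bandwidth 100 kb/s as in the text, so k = 20 kb.
# Hop counts and data sizes (in kb) are invented example values.
t = query_time(hops=[4, 2, 2], data_sizes=[50, 10, 0], delay=0.2, bandwidth=100)
print(t)  # 12
```

Multiplying the result by the delay would give an estimate in seconds under this reading of the formula; that conversion is an interpretation, not something the text states explicitly.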

Figure 3.20: Measurement results for the performance boost using 3nuts localities (for the query of Listing 3.4). (a) Impact of information locality and shortcuts (time reduction [%] over bandwidth [kb/s] for Chord, 3nuts locality and 3nuts shortcuts). (b) Time reduction using shortcuts (time reduction [%] over #shortcuts).

Response Time Analysis for Query Q2

For the processing of query Q2 in Listing 3.5, the information locality of 3nuts keeps the number of lookup hops down for those successive triple patterns sharing the common prefix ubd0v0 in their lookup keys and consequently, as shown in Figure 3.21a, reduces the response time of the example query to 80%. As discussed previously, the majority of the triple components of the example query frequently occur together in stored triples. The creation of shortcuts between these relevant components (lookup keys) in 3nuts can further improve the processing time of the given query, e.g., the query time shown in Figure 3.21a can be further reduced to 45% with the creation of 3 shortcuts for each peer, as shown in the same figure. Figure 3.21b shows the ratio of improvements in the response time of the query in Listing 3.5, and we can see a reduction in query time to 60% with the creation of 3 shortcuts in the underlying 3nuts overlay.

Response Time Analysis for Query Q3

The triple components (ubd0v0:GraduateStudent, GraduateStudent100, ubd0v0:takesCourse, d0v0:AssociateProfessor2 and ubd0v0:publicationAuthor) in successive triple patterns of query Q3 in Listing 3.6 do not share a common prefix with their predecessors. Consequently, the information locality of 3nuts has no influence on the processing time of such queries. Figure 3.22a shows that the query response times for the actual routings in both

3nuts and Chord overlays are the same as the expected time for an arbitrary routing in the network, i.e., there is no improvement in query response time. The triple components of this query also occur together very rarely in stored triples and, as discussed for query Q3 in the previous sections, there is only a chance of a shortcut between the lookup keys GraduateStudent100 and ubd0v0:takesCourse. Figure 3.22a shows a slight reduction in the response time of the example query (i.e., a reduction to 85%) with the creation of 5 shortcuts for each peer in the network. Figure 3.22b shows that the response time of the query in Listing 3.6 can be reduced only to 85% with the creation of up to 5 shortcuts in the underlying 3nuts overlay.

Figure 3.21: Measurement results for the performance boost using 3nuts localities (for the query of Listing 3.5). (a) Impact of information locality and shortcuts (time reduction [%] over bandwidth [kb/s] for Chord, 3nuts locality and 3nuts shortcuts). (b) Time reduction using shortcuts (time reduction [%] over #shortcuts).

Response Time Analysis for Query Q4

Analogous to query Q3 in Listing 3.6, the triple components (ubd0v0:UndergraduateStudent, <http://www.Department0.University0.edu>, ubd0v0:name, d0v0:Course4, ubd0v0:address and d0v0:AssistantProfessor1) in successive triple patterns of query Q4 in Listing 3.7 also do not share a common prefix with their predecessor keys. Therefore, the information locality of 3nuts cannot be utilized to improve the processing time of this query. Figure 3.23a shows the same performance for both the 3nuts and the Chord overlay in terms of query processing time for the example query. However, the majority of the triple components of this query (ubd0v0:UndergraduateStudent, <http://www.Department0.University0.edu>, ubd0v0:name, d0v0:Course4 and ubd0v0:address) frequently occur together in

stored triples. The creation of shortcuts between these frequently occurring relevant components (lookup keys) in 3nuts can improve the response time of the example query. Figure 3.23a shows that the response time of the given query can be reduced to 55% with the creation of only 2 shortcuts for each peer in the network. Figure 3.23b also shows a reduction to 55% in the processing time of the query in Listing 3.7 with the creation of 2 shortcuts in the underlying 3nuts network.

Figure 3.22: Measurement results for the performance boost using 3nuts localities (for the query of Listing 3.6). (a) Impact of information locality and shortcuts (time reduction [%] over bandwidth [kb/s] for Chord, 3nuts locality and 3nuts shortcuts). (b) Time reduction using shortcuts (time reduction [%] over #shortcuts).

Figure 3.23: Measurement results for the performance boost using 3nuts localities (for the query of Listing 3.7). (a) Impact of information locality and shortcuts (time reduction [%] over bandwidth [kb/s] for Chord, 3nuts locality and 3nuts shortcuts). (b) Time reduction using shortcuts (time reduction [%] over #shortcuts).

Related Work

RDF data management has been the focus of much research activity during the past few years. The following sections give an overview of RDF stores developed for the efficient management of RDF data.

Centralized RDF Data Management

RDF, being just a logical data model, does not prescribe a physical storage organization. The majority of RDF data stores developed in the past use relational DBMSs to store RDF data. The simplest representation of RDF in these centralized solutions is a three-column triples table, where a triple is stored in a row with the columns subject, predicate, and object. In the following sections we give a very brief introduction to well-known centralized RDF stores that store RDF data in relational databases.

Jena2

Jena2 [101, 34] uses relational databases in the back-end and uses special property tables in combination with a three-column triples table. The idea of using property tables is to denormalize the table of the three-column schema in order to save storage and to avoid the cost of many self-joins during query evaluation. A property table contains properties (attributes) of subjects that commonly occur together (e.g. title, author, and isbn are common properties of subjects representing book entities). Thus a property table containing the subject as the key and the attributes title, author, and isbn as further columns might be created to store book entities in Jena2. The use of property tables gives a small storage saving because the property URI is not stored in the table, and for the cluster of properties in the property table the subject is stored only once. It also speeds up query processing over the stored triples, because no joins are required on such a table if all attributes mentioned in a query are found in a single property table.

Sesame

Sesame [30] is a centralized RDF store developed for the efficient storage and querying of huge amounts of RDF data.
It reduces the storage cost by mapping URIs and literals to integer identifiers; the identifiers are stored as values of triple components in one table and the corresponding mappings in another table. This reduces the storage cost because long URIs or strings no longer need to be stored several times in the triples table; instead, the shortened versions (identifiers) are stored in the triples table. To allow Sesame to be used on top of a variety of storage

devices, the persistence layer (DBMS-specific code) is encapsulated in a Storage And Inference Layer (SAIL). The SAIL provides an interface for add, delete, and query operations on RDF triples and translates these operations into specific calls of the underlying DBMS.

3store

3store [51, 52] uses an SQL engine (MySQL) as back-end storage. Like Sesame, it also maps URIs and literals to identifiers to reduce the storage cost of triples. However, the identifiers (64-bit numbers) in 3store are created by applying a hash function to each URI and literal. The resulting hash values are stored as the subject, predicate and object components of triples in one huge relational database table, whereas the URIs and literals are stored in a separate two-column table with the hash of their values used as the key. For the evaluation of SPARQL queries, the queries are translated into SQL queries and submitted to the underlying relational database system.

RDF Data Management Using Vertical Partitioning

Abadi et al. review in [5] the use of relational databases for the storage of RDF data in the aforementioned RDF stores (Jena, Sesame, 3store) and in Oracle [37]. In their review they particularly address the pros and cons of the use of property tables, and highlight the following issues associated with this technique:

RDF data is semi-structured, and not all subjects share a common set of properties. Consequently, there would be many NULL entries for the attributes of subjects stored in a property table. For example, for Student subjects there would be many NULL entries in the supervisor column. These NULLs cause a substantial space overhead. On the other hand, if less sparse property tables are created only for the most highly correlated attributes, then many property tables might need to be joined for the evaluation of a query. It is not easy to find a balance between these two factors.
A second issue with property tables is the presence of multi-valued attributes in RDF data. For example, a student would have many values for the column coursetaken in the property table.

To address the above-mentioned limitations of property tables, Abadi et al. [5] proposed vertically partitioned databases, where a two-column table is created for each unique predicate (property) of the RDF triples. In such a table for a specific predicate, the first column contains the subjects and the second column contains the objects of all triples that share this predicate. For the students record example,

separate tables would be created for the name, coursetaken, and address properties. The advantages of this technique are its support for multi-valued attributes, its support for heterogeneous records (i.e., without NULL entries), and efficient query evaluation, i.e. only those properties mentioned in a query are read. Although many joins are needed to evaluate queries over multiple properties, fast (linear) merge joins can be used if each table is sorted by its subject column. They further demonstrated that column-oriented databases are appropriate for this kind of data representation.

Distributed RDF Data Management

The use of Peer-to-Peer networks for managing RDF data has become an active research topic during the past few years. A comprehensive survey of this research area can be found in [96]. Based on the way the underlying Peer-to-Peer overlay is organized and used to store and locate triples, Peer-to-Peer based RDF stores are broadly classified into those using unstructured Peer-to-Peer networks and those using structured Peer-to-Peer networks.

RDF stores using unstructured Peer-to-Peer Networks

Edutella [80] is a distributed RDF repository which uses a Gnutella-like [1] unstructured Peer-to-Peer network as the underlying overlay. In this system RDF data remains at the peer controlled by the information provider, and the network is only used to increase data access. The advantages of this kind of RDF store are that data modification remains under the control of the information providers, and that several queries might find the matching triples (answers) in the local database of a single peer, instead of accessing and joining relevant triples stored across several peers. A disadvantage is, however, that the underlying overlay does not provide any search strategy, so a query needs to be flooded through the entire network and each peer receiving the query has to scan its local database for possible matches.
This approach to query evaluation leads to a high response time and imposes a lot of network traffic. A subsequent work on Edutella provides better scalability by introducing a super-peer based network architecture [81]. In this new architecture, some peers are chosen to act as super-peers, while the other peers (client peers) are connected in a star-like fashion to these super-peers. Each client peer is connected to a single super-peer, and the super-peers are in turn connected in a super-peer network. Super-peers are responsible for maintaining indices and for query routing. Each peer provides the indices (descriptions) of its data to its super-peer, while the actual data is held by the peer itself. Super-peers use these data indices to route a query request to the peers who may provide the data. Furthermore, this system also supports schema-based routing, i.e. peers are clustered at super-peers according

to the schema of the data they provide, and this schema information is used for query optimization and routing.

Bibster [50] is a semantics-based Peer-to-Peer system implemented as an example application of the SWAP platform, which is an unstructured Peer-to-Peer system based on JXTA. This Peer-to-Peer system is built for exchanging bibliographic metadata (e.g. BibTeX entries) in academic communities. Bibster peers share ontologies, such as the ACM topic hierarchy and the Web Research Community ontology (SWRC), for importing data and for query processing. It maintains a Sesame-based RDF repository [30] for each peer in the network to store the bibliographic data, and uses the ACM topic hierarchy to advertise the semantic description of each repository in the network. In addition, each peer uses the Sesame RDF Query Language (SeRQL) [29] for data querying. During the processing of a query, a peer first evaluates it against its local database and, in case no answer is found, decides to which peer the query should be forwarded. The subject of the query, which specifies the expertise required to evaluate the query, is checked, and the query is sent to the peer with the appropriate expertise.

RDF stores using structured Peer-to-Peer Networks

The majority of existing distributed RDF stores are based on structured Peer-to-Peer overlays. In these systems peers do not only store RDF triples, but also cooperate to find the triples asked for in a query. The triples are distributed over the network in a manner that allows them to be searched as efficiently as possible. Although these RDF systems share almost the same data indexing technique (i.e., hashing the RDF triple components), they notably differ in their underlying overlay topology and in their query processing strategies.

RDFPeers [31, 33] was the first work to propose the use of DHTs for the distributed storage and querying of RDF data in a Peer-to-Peer environment.
It is built on top of MAAN [32], which extends the Chord protocol [98] to efficiently evaluate multi-attribute and range queries. RDFPeers uses each RDF triple component as a DHT key, and stores each triple three times in the network by applying a hash function to the subject, predicate and object. For triple components with string values, the SHA-1 hash function [4] is used. To support range queries on triple components with numeric values, an order-preserving hash function is used to index numeric values. This system supports the evaluation of atomic triple pattern queries, disjunctive and range queries, and conjunctive queries with the same variable subject and possibly different constant predicates. The general idea of the algorithms used to evaluate the aforementioned query types is that they use constants

in the triple patterns of queries to create identifiers that lead to the peers responsible for storing the relevant triples. The Atlas project [64, 73, 66, 74, 65] takes a similar approach to RDFPeers for distributed RDF management and uses the Bamboo overlay network [87], a DHT optimized for high churn rates. Triples are stored at three locations, as in RDFPeers, using the hash values of subject, predicate, and object. Triples with a specific subject, predicate, or object are then obtained during query evaluation by computing the hash value of that key again to resolve the peer storing these triples. The query processing of Atlas [73] extended the evaluation of conjunctive multi-predicate queries implemented in [31] to the full class of conjunctive queries and presented a query processing algorithm, Query Chain (QC), for this purpose. In the QC algorithm, the triple patterns contained in the conjunctive query are iteratively resolved by a chain of nodes. The query evaluation starts with a single triple pattern of the query by doing a lookup for the peer responsible for the evaluation of this triple pattern. This peer adds to an intermediate result all triples of its local database that qualify for the evaluated triple pattern. The intermediate result is then extended by doing a lookup for a second triple pattern and joining the results. This operation is repeated until all triple patterns of the query have been processed. The query processing in Atlas is further improved in [65] with the introduction of new query optimization techniques for reducing query response time and bandwidth usage. These optimization algorithms use selectivity-based heuristics [97], which try to minimize the size of the intermediate results produced during query evaluation. BabelPeers [17, 18, 58, 15, 16] is another DHT-based system for the distributed management of RDF data.
As in the RDFPeers and Atlas systems described above, triples in BabelPeers are indexed three times in the network. The query processing of the system for conjunctive triple pattern queries is described in [58]. It works in two phases: in the first phase, the set of RDF triples that are potentially relevant for the query is retrieved from the network to the query originating peer; in the second phase, these relevant triples are processed locally to find the actual answer to the query. A set of rules based on a look-ahead technique and on Bloom filters [24] is also used in the first phase to avoid the transfer of useless triples that will never be used in the final answer of the query. Battré et al. [16] propose the additional effort of querying the network to determine a sensible order of triple pattern evaluation, such that the cost of network transfer during query processing is minimized. Since a peer can now forecast the number of triples for a triple pattern with the help of the query pre-processing technique proposed in [16], it can also decide at run time whether to fetch the triples matching the triple pattern or to migrate the query processing, together with the intermediate results, to the peer storing these triples.
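The indexing scheme shared by RDFPeers, Atlas, and BabelPeers (each triple stored three times, once under the hash of each component) can be illustrated with a small Python sketch. It is a simplified model under our own assumptions, not the implementation of any of these systems: the 16-bit identifier space, the Chord-style successor rule, and the names `ring_id`, `responsible_peer`, `index_triple`, and `lookup` are all hypothetical.

```python
import hashlib

RING_BITS = 16  # assumption: a small identifier space keeps the example readable


def ring_id(term: str) -> int:
    """Map a term to an identifier on the ring using SHA-1."""
    digest = hashlib.sha1(term.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** RING_BITS)


def responsible_peer(key_id: int, peer_ids: list) -> int:
    """Chord-style successor: the first peer id >= key id, wrapping around."""
    for pid in sorted(peer_ids):
        if pid >= key_id:
            return pid
    return min(peer_ids)


def index_triple(triple: tuple, peer_ids: list, stores: dict) -> None:
    """Store the triple once per component (subject, predicate, object)."""
    for term in triple:
        peer = responsible_peer(ring_id(term), peer_ids)
        stores.setdefault(peer, set()).add(triple)


def lookup(term: str, peer_ids: list, stores: dict) -> set:
    """Resolve a pattern with one constant by hashing that constant again."""
    peer = responsible_peer(ring_id(term), peer_ids)
    return {t for t in stores.get(peer, set()) if term in t}
```

In this sketch every triple with predicate `rdf:type` ends up on the single peer responsible for `ring_id("rdf:type")`, which is the grouping effect that all three systems have in common.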

The GridVine project by Aberer et al. [7, 40], like our 3rdf system, uses a search-tree-based overlay network. It is based on the P-Grid [6] network. The fundamental difference when using a search tree instead of a hash table is that any hashing of data keys is omitted in order to preserve the order of data keys in the key space. In addition to logarithmic point queries, this achieves efficient range queries in the key space, whereas hash tables do not provide an efficient implementation of range queries and the complete table has to be searched. Therefore, when applications such as the GridVine RDF system can organize data in the key space so that the RDF triples needed during query evaluation have nearby keys (e.g., a similar prefix in the search key, equivalent to a small key range), the advantage of range queries can be exploited and the lookup time is significantly reduced. GridVine also supports the semantic integration of heterogeneous data by giving each peer the possibility to create a mapping between two schemas. In this way, a query posed against a given schema can be reformulated into a new query against a semantically similar schema. Yet Another RDF Store (YARS) by Harth et al. [54, 55] extends the standard RDF data model (subject, predicate, object) to a data model called quad (subject, predicate, object, context), where context denotes the source (provenance) of the RDF triple. The YARS query language explicitly deals with the context concept by using the yars:context predicate. To support the evaluation of Select-Project-Join queries, it implements a B+-tree [39] and shows that only 6 indexes are needed to cover all possible (16) access patterns, where an access pattern is a quad in which any combination of subject, predicate, object, and context is either specified or a variable.
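The statement that six indexes cover all 16 quad access patterns can be checked mechanically. The Python sketch below uses the six orderings that Chapter 4 of this thesis quotes for [54] (s+p+o+c, p+o+c, o+c+s, c+s+p, c+p, o+s). It assumes that an index answers an access pattern when the bound positions form a leading prefix of the index ordering; this prefix criterion is our reading and is not taken from [54], and the function names are hypothetical.

```python
from itertools import combinations

POSITIONS = "spoc"  # subject, predicate, object, context
# Index orderings as quoted in the text for [54].
INDEXES = ["spoc", "poc", "ocs", "csp", "cp", "os"]


def access_patterns():
    """All 2^4 = 16 sets of bound (specified) positions."""
    return [frozenset(c) for r in range(5) for c in combinations(POSITIONS, r)]


def covering_index(bound):
    """Return an index whose leading columns are exactly the bound positions."""
    if not bound:
        return INDEXES[0]  # nothing bound: a full scan of any index works
    for idx in INDEXES:
        if frozenset(idx[:len(bound)]) == bound:
            return idx
    return None


if __name__ == "__main__":
    for pattern in access_patterns():
        print("".join(sorted(pattern)) or "(none)", "->", covering_index(pattern))
```

Under the prefix assumption every one of the 16 patterns finds a covering index, which is consistent with the claim reported for YARS.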
3.9 Summary

In this chapter we presented our 3rdf system for the distributed management of RDF data and contrasted it with the state-of-the-art DHT-based RDF stores. We showed how the network, information, and interest locality provided by the underlying 3nuts overlay can be exploited to improve query processing time in the 3rdf system. The effects of this network structure improvement on the optimization of distributed RDF querying have been analyzed, and their practical application has been shown by simulation with the LUBM benchmark data and queries. We have shown that using search-tree-based instead of DHT-based Peer-to-Peer networks improves the query response time by up to a factor of two if RDF data of the same domain, for instance data of the same university, is grouped in the same branch of the tree and the query only combines data of such a limited domain. We also tested the usage of shortcuts in the 3nuts overlay network, where the peers increase their

routing structure by a constant factor. This technique achieves a further speed-up of query response times, for instance by a factor of 2.5 with 3 shortcuts per peer.

4 Load Balancing

The Peer-to-Peer based approaches for the distributed management of RDF data presented in the previous chapter distribute the storage and query load over several peers by partitioning RDF triples across the peers of the network. The partitioning of triples in these distributed RDF systems is typically based on hashing (indexing) the terms of the triples (subject, predicate, object), which maps these terms to peers. Triples with the same term are grouped on the same peer, with the advantage that this offers a constrained search for a specific subject, predicate, or object in the local database of a single peer. However, this comes with the drawback that we undermine the load-balancing techniques of most overlays, because the triples may not be stored uniformly on the peers of the underlying network due to the non-uniform frequency distribution of subject, predicate, and object occurrences in these triples. The frequency distribution of terms in RDF triples is highly skewed: some URIs and literals occur very often (e.g., the peer responsible for rdf:type is overwhelmed with RDF triples), while others occur only rarely. Such an unbalanced triple distribution would limit the scalability of the distributed RDF system [41, 72]. In this chapter, we study a solution based on grouping triples on combinations of their terms (e.g., subject+predicate, predicate+object, object+subject). Partitioning triples on combinations of their terms reduces the size of the triple sets obtained by grouping triples on their individual terms (subject, predicate, object), and hence reduces the load of the peers responsible for the highly occurring terms. The remainder of this chapter is organized as follows: Section 4.1 provides an overview of term-based triple partitioning in the state of the art, and

discusses the solutions proposed for the problem of load imbalances in distributed RDF stores. In Section 4.2, we describe the limitations of the state-of-the-art individual keys index scheme for the distribution of triples in Peer-to-Peer based RDF stores. We then discuss the compound keys index scheme for a balanced load distribution in Section 4.3. The following Section 4.4 explains the parallel query processing technique for a fair query load distribution, including speed-ups obtained by bundling computation resources and bandwidth. Based on this parallel query execution technique, we present in Section 4.5 our simulation results regarding the performance of Peer-to-Peer based RDF stores, with query load distribution and response time for RDF query evaluation as performance metrics. Finally, we conclude in the last section of the chapter.

4.1 Related Work

Since RDF query languages mainly support constrained search on the terms of triples (subject, predicate, or object), existing distributed RDF stores [64, 17, 31, 7, 8] index (store) triples 3 times in the network, once for each of these terms. This indexing technique makes it possible to find triples based on any search criterion as long as there exists at least one constant (term) in a triple pattern. Partitioning triples in this way sends all triples with a given term to the same peer. The triples with the same subject, for example, will be stored on the same peer, with the advantage that this offers a constrained search for the given subject value in the local database. In practice, however, the frequency distribution of terms in RDF data is highly skewed, and this term-based triple partitioning leads to load imbalances. Several works have already addressed the problem of load imbalances in DHTs [98, 68, 45].
However, these solutions are mainly concerned with the even distribution of keys among the peers of the network by adjusting the key ranges of these peers, whereas the load imbalance problems in RDF systems are caused by the uneven distribution of the data associated with keys. RDF data are highly skewed, and none of the above-mentioned load-balancing approaches is capable of handling the hotspots created by the frequently occurring terms in RDF triples. Cai et al. [31] identified hotspots in DHT-based RDF stores and addressed this load-balancing issue by limiting the storage of overly popular URIs and literals based on the local capacity of peers. This comes, of course, at the cost of possibly losing the complete result when the query is on these popular values. The load-balancing strategy proposed by Battré et al. [16] is based on data relocation, and an overlay tree among peers is used to keep track of the relocated triples. The overlay tree is constructed over the DHT positions of only the overly popular terms in RDF triples, e.g., the DHT position which stores triples with predicate

rdf:type. This triple relocation strategy splits the triples stored on an overloaded peer into two or more triple sets and disseminates them to less loaded peers. The queries posed on these overly popular terms have to be forwarded into all branches of the corresponding overlay trees and have to be executed on each peer in the path of these branches. This type of load balancing is also very fragile in the case of a peer failure in the overlay tree, which results in the loss of a whole branch of the tree. The solution of Mietz et al. [79] to the load imbalance issue is based on the use of a random hash depth for indexing triples in the network. They showed a huge difference among the peers' data loads (number of triples stored per peer) when triples are indexed 3 times, as in the state-of-the-art distributed RDF stores, using a fixed hash depth of 1, i.e., hash(subject), hash(predicate), hash(object). To improve the data load distribution among peers, they proposed indexing triples using a random hash depth; for example, with a random hash depth of h=4 there would be 4 potential location keys for each term in RDF triples. This means that with a random hash depth of h=4, triples containing a particular term in their subject position are distributed among 4 peers instead of being grouped on one peer. They showed that the higher the chosen hash depth, the better the triple distribution among peers. However, this comes at the cost of network communication during query evaluation, since many peers (as many as the hash depth used) have to be queried for the evaluation of a triple pattern. Their work also does not address how to find an optimal hash depth. The terms which frequently occur in RDF triples can be assumed to frequently occur in RDF queries as well.
Thus the peers responsible for the frequently occurring terms will become hotspots, i.e., they will contain a large portion of the RDF dataset and their data will be requested very frequently during query evaluation. Liarou et al. [73] addressed this problem of query load imbalances by replicating triples to several peers and distributing the evaluation of a query among these peers. For this purpose, they additionally indexed triples on the combinations of terms subject+predicate, subject+object, predicate+object, and subject+predicate+object, with 7 replications of each triple in total. In contrast, our compound keys index scheme for load balancing, proposed in Section 4.3, indexes triples only three times on combinations of terms, i.e., subject+predicate, predicate+object, object+subject. Since triples are indexed on compound keys, the evaluation of triple patterns with two known terms, e.g., predicate and object, is very simple in our indexing mechanism. RDF queries where only one term is known become more challenging with compound indexing, because not all triples for a specific term are on the same peer and they have to be collected from many peers. To evaluate such queries, we leverage the fact that our search-tree-based overlay (3nuts) provides support for range or prefix queries, whereas range queries are

not practical in traditional DHT-based overlays. Therefore, Liarou et al. [73] index triples on the individual terms subject, predicate, and object as well. They used this extra triple storage overhead for distributing the query processing load among many peers, but have not studied how this overhead can be utilized to improve the response time of queries. The use of combinations of terms as keys for indexing RDF data in centralized RDF stores is proposed in the literature [100, 54]. Harth et al. [54] used the notion of a quad (subject, predicate, object, context) to represent RDF data and proposed an optimized index structure to support the evaluation of queries over these quads. They showed that only 6 indexes (s+p+o+c, p+o+c, o+c+s, c+s+p, c+p, o+s) are needed to cover all possible (16) access patterns of quads, where an access pattern is a quad in which any combination of subject, predicate, object, and context is either specified or a variable. Indexing triples on compound keys (combinations of terms) for a balanced data distribution is not an option in the state-of-the-art DHT-based RDF stores. With such a triple indexing method, DHTs can provide only limited query evaluation functionality, i.e., the evaluation of triple patterns with only a single term (1 constant) is possible only through range queries, which are not practical in traditional DHT-based overlays. In our 3rdf system, in contrast, indexing triples on combinations of terms is possible, leveraging the fact that the underlying search-tree-based overlay (3nuts) provides support for range or prefix queries.

4.2 Limitations of Indexing on Individual Keys

RDF triples in distributed RDF data stores are indexed (stored) in such a way that all possible triple patterns can be answered. A triple pattern is a triple where any combination of subject, predicate, and object is either specified or a variable.
For example, a triple pattern could be a triple where only the predicate is specified, and subject and object are variables. The total number of triple patterns can be determined by considering that there exist 2 possibilities (specified or variable) for each triple component. Therefore, there are 2^3 = 8 possible triple patterns in total. Table 3.1 shows all possible triple patterns for RDF triple lookups. To find all triples with a given term (subject, predicate, or object), the triples are stored in a Peer-to-Peer network using the terms as location keys. Since any of the three terms of a triple can be specified, each triple is stored three times, under the keys given by its subject, predicate, and object identifiers. The search structures of Peer-to-Peer networks are originally designed to store single data elements at a unique key, and they make no provision for balancing several data elements with the same key across several peers for a better load distribution. In distributed RDF stores, a peer stores a set of triples with a given term as key.
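The count of 2^3 = 8 triple patterns, and the constants that each pattern offers as location keys under individual-key indexing, can be enumerated with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and the order in which the patterns are produced is an artifact of the enumeration and need not match the numbering of Table 3.1.

```python
from itertools import product


def triple_patterns():
    """Enumerate the 2^3 patterns; '?' marks a variable position."""
    names = ("subject", "predicate", "object")
    return [
        tuple(name if bound else "?" for name, bound in zip(names, mask))
        for mask in product((False, True), repeat=3)
    ]


def routing_keys(pattern):
    """Constants of a pattern that can serve as location keys."""
    return [term for term in pattern if term != "?"]


if __name__ == "__main__":
    for pattern in triple_patterns():
        print(pattern, "->", routing_keys(pattern) or "no key: ask all peers")
```

Exactly one of the eight patterns has no constant at all, so it is the only one that cannot be routed by a location key.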

The frequency distribution of terms in RDF triples is highly skewed: some terms occur very often (e.g., rdf:type), while others occur only rarely. In their respective works [31, 72], the authors measured term frequencies in typical RDF datasets (Open Directory Project, DBpedia, Geonames, DBLP) and found that the term occurrences in these datasets follow a power law. These skewed occurrences of terms in RDF triples lead to load imbalances in distributed RDF stores; for instance, the peer responsible for the term rdf:type is subjected to a high storage load, and the built-in load balancing is not able to balance this higher load. State-of-the-art distributed RDF data stores (see Section 4.2.1) do not tackle this load-balancing problem. In Section 4.3, we present our own solution, with the basic idea of extending the keys such that the sets of triples with the same key are smaller and the load balancing of the Peer-to-Peer network performs better.

4.2.1 State-of-the-art individual keys index scheme

To attain an efficient search for triple patterns, triples are indexed (stored) three times in existing distributed RDF data stores, once for each term (subject, predicate, object) separately. However, this comes with the drawback that we undermine the load-balancing techniques of most overlays, because the frequency of subject, predicate, and object occurrences in triples is not uniformly distributed and triples with the same subject, predicate, or object are managed on the same peer. The peers responsible for the highly occurring terms will have to store a large portion of all RDF triples, while other peers store only a few triples. To analyze the load distribution under the state-of-the-art triple indexing, we ran various simulations. For the measurement, the data of the Lehigh University Benchmark (LUBM [48]) for one university, with 100,000 triples in total, were stored in a network of 1000 peers.
The effect of this unfair partitioning of RDF triples is illustrated in Figure 4.1. The bars for individual-keys show the statistics of the data load (number of triples) per peer using this index scheme. We observe huge differences among the maximum, average, and median load. The number of triples on the most heavily loaded peer (21489 triples) is about 73 times higher than the peers' average load (289 triples); this makes up almost 7% of the total number of stored triples.

4.3 Solution: Indexing on Compound Keys

When a set of data elements is mapped to the same key in the network, we can only achieve a fair load distribution if either the network supports mapping several peers to the same key and distributing the load of the key among them, or we make the keys

unique on the application layer, so that the load balancing of Peer-to-Peer networks, which supports only one peer per key, fully applies. In our solution, we opt for the second option and reduce the set of triples with the same key by extending the keys, which are originally the subject, predicate, and object terms. The first idea that comes to mind is adding an arbitrary key extension, for instance a small hash value, so that the keys are divided into subsets, resulting in a better load distribution. However, these subsets would be unstructured, and to evaluate a query containing the original key we would have to process it at all extended keys, for instance subject+hash. We have therefore decided to extend the keys with another term of the triple, e.g., the key consists of subject+predicate. A possible drawback might be a non-uniform fragmentation of the triple sets, but the big advantage is that the triple sets are structured: if subject and predicate are already specified in a triple pattern, it has to be evaluated only at the location of the subject+predicate key.

[Figure 4.1: Indexing on individual terms. Median, average, and maximum number of triples per peer for individual-keys indexing.]

To achieve a fair triple distribution with the same triple storage cost as in the state-of-the-art index scheme, where triples are indexed 3 times on their individual terms (subject, predicate, object), we present an index scheme, 3-tuples indexing, based on the idea of using compound indexes in [54] to cover all possible access patterns for RDF triples. In this indexing technique, 3 compound indexes are created on combinations of terms in RDF triples (subject+predicate, predicate+object, and object+subject). This 3-tuples index scheme is based on the notion of a tuple index.

Definition 4.1 (Tuple index) A tuple index concatenates the identifiers of two terms as key for the storage of a triple.
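Definition 4.1 can be made concrete with a short Python sketch. A sorted list stands in for an order-preserving key space such as the one a search-tree overlay provides, so that all keys sharing a first term form one contiguous range. The separator character, the class `OrderedIndex`, and the helpers `tuple_keys` and `lookup_po` are assumptions made for illustration and are not part of the 3rdf implementation.

```python
import bisect

SEP = "\x00"  # assumption: a separator that does not occur inside term identifiers


def tuple_keys(s, p, o):
    """The three compound keys of the 3-tuples scheme: s+p, p+o, o+s."""
    return {"sp": s + SEP + p, "po": p + SEP + o, "os": o + SEP + s}


class OrderedIndex:
    """Sorted key space standing in for an order-preserving overlay."""

    def __init__(self):
        self.entries = []  # sorted list of (key, triple) pairs

    def put(self, key, triple):
        bisect.insort(self.entries, (key, triple))

    def prefix(self, prefix):
        """Return all triples whose key starts with prefix (one contiguous range)."""
        start = bisect.bisect_left(self.entries, (prefix,))
        result = []
        for key, triple in self.entries[start:]:
            if not key.startswith(prefix):
                break
            result.append(triple)
        return result


def lookup_po(index, p, o=None):
    """Two constants: point lookup on p+o. One constant: prefix query on p."""
    if o is None:
        return index.prefix(p + SEP)
    return [t for t in index.prefix(p + SEP + o) if t[2] == o]
```

With predicate and object given, the lookup touches a single key; with only the predicate given, it scans the range of keys that begin with that predicate. A hash of the compound key would scatter that range, which is why this sketch keeps the keys in order.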

[Figure 4.2: Comparison of data distribution on two different 3-tuples indexes (s+p, p+o, o+s versus s+o, p+s, o+p). Axes: peer's rank [log] and #triples [log].]

The selection of a particular ordering of terms in triples for the creation of compound keys (indexes) has no impact on the distribution of triples in the network. For example, Figure 4.2 shows that the natural fragmentation of the data under the indexes subject+predicate, predicate+object, object+subject and under the indexes subject+object, predicate+subject, object+predicate leads to nearly the same data distribution, and both are beneficial for a fair triple distribution. We will use the compound indexes on subject+predicate, predicate+object, and object+subject for the triple distribution in the rest of the chapter. The creation of such compound routing indexes on combinations of subject, predicate, and object divides the data load of a heavily loaded peer among many peers. For example, all triples containing the predicate type are subdivided according to all possible classes, and in the example of Listing 4.1 there is one specific peer managing all type triples with object GraduateStudent, but not necessarily the type triples for other objects. So if we already have a constraint for this object in a query, the triples on this peer are sufficient.

    @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix ub: <http://www.lehigh.edu/zhp2/univ-bench.owl#> .
    @prefix d0v0: <http://www.Department0.University0.edu#> .
    d0v0:S3 rdf:type ub:GraduateStudent .
    d0v0:S3 ub:name Alex .
    d0v0:S3 ub: alex@ub.com

Listing 4.1: RDF triples about a student d0v0:S3 encoded in RDF/N3 format.

[Figure 4.3: Peers load distribution. The single key type is split into the compound keys type+UndergradStudent, type+GraduateStudent, ..., type+Publication.]

Figure 4.3 shows that in our experiment, when triples are indexed on their predicate, for instance, there was only one peer responsible for the storage of triples with predicate type. Indexing triples on predicate+object, however, subdivided this storage load among many peers, each storing the triples of one class (UndergradStudent with 5916 triples, GraduateStudent with 1874 triples, Publication with 5999 triples, and so on). The evaluation of triple patterns with two known terms, e.g., predicate and object, is very simple in this new compound keys indexing mechanism. The search functionality of the underlying 3nuts network is used to locate, in the search tree, the peer responsible for the compound key predicate+object, and the corresponding triples are then downloaded. RDF queries where only one term is known become more challenging with this new indexing technique, because not all triples for a specific term are on the same peer and they have to be collected from many peers. To evaluate such queries, we leverage the fact that our search-tree-based overlay (3nuts) provides support for range or prefix queries. For example, a lookup for the triple pattern (?:predicate:?) resolves to a prefix query for a specific predicate on the predicate+object index, where we go to an arbitrary peer in the predicate's path in the search tree and scan the subtree for all predicate-object combinations using only direct routing links. In contrast, as mentioned in Section 4.1, range queries are not practical in traditional DHT-based overlays. Therefore, we see a trade-off in DHT-based systems between two options: either a more balanced data distribution with compound indexes but limited functionality (the evaluation of triple patterns with only 1 constant is not possible), or full functionality but a more unbalanced data load with the state-of-the-art individual keys index scheme.

[Figure 4.4: 3-tuples indexing in DHTs. A ring of peers n1 to n8 on which the type triples of UndergradStudent (S1, S4, S6), GraduateStudent (S12, S15, S18), and Publication (P1, P2, P3) are stored on different peers, together with the example queries SELECT ?x WHERE {?x, type, ?y} and SELECT ?x WHERE {?x, type, Publication}.]

For example, consider the partitioning of triples based on the combination of their predicates and objects (predicate+object) in a traditional DHT overlay, shown in Figure 4.4. Traditional DHTs apply random hash functions to the keys of triples (e.g., Hash(type+Publication)) to locate the responsible peers (e.g., n5) that will store the corresponding triples. In the given example, if predicate and object are given in a query (e.g., ?x, type, Publication), then the routing algorithm of the DHT will easily determine the responsible peer n5. However, for queries where only the predicate is given (e.g., ?x, type, ?y), the hash function Hash(type) cannot be used to reach the matching triples on the responsible peers n3, n5, n8. Table 4.1 shows that the 3-tuples index scheme covers the triple patterns 2 through 7 in Table 3.1. There is no restriction whatsoever in triple pattern 1, and thus we have to propagate it to all peers in the network for evaluation. A lookup for triple pattern 8 resolves to a query on, for instance, the subject+predicate index, where we go to a peer responsible for the subject+predicate path, and use

the object constraint to scan the local database for the matching triples.

    subject+predicate        predicate+object         object+subject
    (subject:predicate:?)    (?:predicate:object)     (subject:?:object)
    (subject:?:?)            (?:predicate:?)          (?:?:object)

Table 4.1: Three compound keys are needed to cover all triple patterns

[Figure 4.5: Individual/compound indexes comparison. Median, average, and maximum number of triples per peer for individual-keys and 3-tuples indexing.]

The effect of using the 3-tuples index scheme is shown in Figure 4.5. For the measurement we stored the LUBM [48] dataset for one university, with 100,000 triples in total, in a network of 1000 peers. The bars of 3-tuples indexing show the number of triples per peer using this index scheme. Comparing it with the state-of-the-art individual-keys index scheme, we can observe a significant reduction in the differences among the maximum, average, and median loads. The number of triples stored on the most heavily loaded peer was 21489 in individual-keys indexing and is reduced to 8330 in 3-tuples indexing. The load of the median peer rises to 200 triples, from 85 triples in individual-keys indexing, which shows that the triples are now spread over comparatively more peers of the network. Figure 4.6 shows the triples managed by the peers, with the peers ranked in decreasing order of the number of triples per peer; e.g., the peer with rank 1 has 21489 triples using individual-keys indexing and 8330 triples using 3-tuples indexing. When we use compound keys for indexing (3-tuples indexing) instead of individual keys and compare the top ranked peers, we can in fact prevent hotspots, where peers are overloaded by data and query requests and slow down the system. The curve of 3-tuples indexing shows that the triples are distributed to more peers of the network than with individual-keys indexing. Certainly, in an optimal case, all peers

would manage the same amount of triples, indicated by the constant function of the average value. In both individual-keys indexing and 3-tuples indexing, the average number of triples managed by a peer is 289 (i.e., both store the same number of triples in the network). The median peer holds 200 triples in the 3-tuples index scheme and only 85 triples in individual-keys indexing, indicating a better load distribution for the 3-tuples index scheme without any extra triple storage cost.

[Figure 4.6: #triples/peer for individual/compound indexes. Axes: peer's rank [log] and #triples [log]; curves for individual keys and 3-tuples.]

4.4 Improving Query Processing

We can assume that a term which frequently occurs in RDF triples will also frequently occur in RDF queries. Thus, in addition to causing an unbalanced storage load distribution by partitioning triples on their individual subject, predicate, and object components, this state-of-the-art triple distribution technique also leads to a very unfair distribution of the query processing load and results in a high query processing time. The peers responsible for the frequently occurring terms will become hotspots, i.e., they will contain a large portion of the RDF dataset and their triples will be requested very frequently during RDF query evaluation. Because large portions of the RDF triples are stored on heavily loaded peers, these peers need a long processing time to find the matching triples for a query. The state-of-the-art individual keys index scheme allows only a single term of a triple pattern to be used as the key for routing the query to the responsible peer. For example, the query given in Listing 4.2 is evaluated by sending it to the peer responsible for the predicate rdf:type. The presence of a large portion of RDF triples with predicate rdf:type on the corresponding peer causes

a long processing time to return these triples. In contrast to the aforementioned state-of-the-art term-based triple partitioning, compound keys indexing reduces the hotspots in the context of query evaluation. We can exploit the 3-tuples index scheme to improve the response time and processing load of queries. Triples indexed on combinations of subjects, predicates, and objects are distributed over a relatively larger portion of the network peers; consequently, RDF queries are evaluated on a larger part of the network peers (fair query load distribution). The peers responsible for the compound keys (combinations of subjects, predicates, and objects) are also expected to contain a relatively small number of triples. The corresponding peers thus have to spend less time computing the answer triples. For example, as a result of partitioning triples on compound keys, the query given in Listing 4.2 will be evaluated on the peer responsible for the key rdf:type+ub:GraduateStudent. The corresponding peer contains fewer triples than the one responsible for the key rdf:type, and thus needs a relatively shorter query processing time.

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX ub: <http://www.lehigh.edu/zhp2/univ-bench.owl#>
    SELECT ?X WHERE { ?X rdf:type ub:GraduateStudent }

Listing 4.2: SPARQL query returning graduate students.

Partitioning triples on a single term (subject, predicate, object) also limits the processing of RDF queries to sequential processing. The existing distributed RDF data stores, despite their differences in query processing strategies, evaluate RDF queries sequentially. The majority of these RDF stores (RDFPeers, Atlas, GridVine, 3rdf) use the Query Chain (QC) query processing algorithm, originally presented in [73], which moves the query processing in sequence from one peer to another and intersects the candidate sets in this way.
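The sequential behaviour of Query Chain can be sketched in Python as follows. This is a single-process model under our own assumptions and not the algorithm of [73] as implemented: a dictionary from term to triple set plays the role of the peers of an individual-key index, the routing key of a pattern is simply its first constant, and the names `match`, `build_index`, and `query_chain` are hypothetical. Every pattern is assumed to contain at least one constant.

```python
def is_var(term):
    return term.startswith("?")


def match(pattern, triple, binding):
    """Extend binding so that pattern matches triple, or return None."""
    new = dict(binding)
    for p, t in zip(pattern, triple):
        if is_var(p):
            if new.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return new


def build_index(triples):
    """Individual-key index: each term maps to the triples held by 'its' peer."""
    index = {}
    for triple in triples:
        for term in triple:
            index.setdefault(term, set()).add(triple)
    return index


def query_chain(patterns, index):
    """Resolve the patterns one after another, carrying the intermediate result."""
    results = [{}]
    hops = []  # routing keys of the peers visited, in order
    for pattern in patterns:
        key = next(term for term in pattern if not is_var(term))
        hops.append(key)
        local = index.get(key, set())
        # Join the intermediate result with the local matches of this peer.
        results = [b for r in results for t in local
                   if (b := match(pattern, t, r)) is not None]
    return results, hops
```

The list `hops` has exactly one entry per triple pattern, which reflects the sequential nature of the chain: one peer per step, each waiting for the intermediate result of its predecessor.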
In QC, the triple patterns contained in the query are iteratively resolved by a chain of nodes. The query evaluation starts with a single triple pattern of the query, by doing a lookup for the peer responsible for the evaluation of this triple pattern. This peer adds all triples of its local database that match the evaluated triple pattern to an intermediate result. Then, the intermediate result is extended by doing a lookup for a second triple pattern and joining the results. This operation is repeated until all triple patterns of the query have been processed.

We exploit the compound keys indexing technique (3-tuples indexing) to parallelize the processing of RDF queries. Parallelism could be a real boost for local computation (computation time) and data transfer (intermediate results transfer time), since we can bundle the computation resources and bandwidths of several

peers in parallel. It also helps to achieve a better query load distribution, because the execution of a query is distributed among many peers. For the parallel processing of RDF queries, we adopt the Spread By Value (SBV) query processing algorithm, originally presented in [73]. SBV extends the ideas of QC by exploiting the values of matching triples found while processing the triple patterns incrementally: it rewrites the next triple pattern and distributes the responsibility for evaluating it to more peers than QC does.

Figure 4.7 shows the parallel processing of an example query. As in QC, the first triple pattern in the query is evaluated by the peer responsible for the key type+professor. From this point on, the query plan produced by SBV is created dynamically by using the values of the matching triples that peers find at each step. For example, the peer responsible for the evaluation of the first triple pattern uses the matched values of variable ?x (p1, p2, p3, p4) to bind variable ?x in the second triple pattern and produces a new set of queries that jointly find the answers to the second triple pattern. The peers responsible for the evaluation of the second triple pattern then use the newly found values of variable ?y (c1, c2, c3, c4, c5) to bind variable ?y in the third triple pattern and send the resulting query set to the responsible peers for evaluation. In this way, multiple chains of peers are involved in the query evaluation, and the peers at the leaves of these chains deliver partial results back to the peer that submitted the initial query. Because in parallel query execution each peer on the path of the query execution has to send the query and the relevant data (intermediate results) through multiple chains, the network traffic increases.
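The SBV rewriting step described above can be sketched in Python as follows. This is a single-process simulation under stated assumptions, not the algorithm of [73] as implemented in 3rdf: all triples live in one list instead of being spread over peers, the function names are hypothetical, the routing key is assumed to be the bound terms joined with "+", and the assignment of courses to professors and the course names are invented so that they are consistent with the example (4 professors, 2 to 3 courses each).

```python
def is_var(term):
    return term.startswith("?")

def substitute(pattern, binding):
    """Rewrite a triple pattern by replacing already bound variables."""
    return tuple(binding.get(term, term) for term in pattern)

def routing_key(pattern):
    """Assumed compound key: all bound terms of the pattern joined with '+'."""
    return "+".join(term for term in pattern if not is_var(term))

def match(pattern, triple):
    """Return the new variable bindings if the triple matches, else None."""
    new = {}
    for p, v in zip(pattern, triple):
        if is_var(p):
            if new.get(p, v) != v:
                return None
            new[p] = v
        elif p != v:
            return None
    return new

def spread_by_value(patterns, triples):
    """Evaluate patterns in order, rewriting each one per intermediate binding."""
    bindings = [{}]
    keys_per_step = []
    for pattern in patterns:
        keys, next_bindings = set(), []
        for binding in bindings:
            rewritten = substitute(pattern, binding)
            keys.add(routing_key(rewritten))  # peer that would receive this query
            for triple in triples:
                found = match(rewritten, triple)
                if found is not None:
                    next_bindings.append({**binding, **found})
        keys_per_step.append(sorted(keys))
        bindings = next_bindings
    return bindings, keys_per_step

# Invented example data: which professor teaches which course is an assumption.
teach = {"p1": ["c1", "c2"], "p2": ["c1", "c3", "c4"],
         "p3": ["c2", "c4", "c5"], "p4": ["c1", "c3"]}
triples = [(p, "type", "professor") for p in teach]
triples += [(p, "teach", c) for p, courses in teach.items() for c in courses]
triples += [(c, "name", "Course " + c) for c in ["c1", "c2", "c3", "c4", "c5"]]

query = [("?x", "type", "professor"), ("?x", "teach", "?y"), ("?y", "name", "?z")]
results, keys_per_step = spread_by_value(query, triples)
for step, keys in enumerate(keys_per_step, start=1):
    print(step, keys)
```

In this sketch the first step uses the single key type+professor, while the second and third steps fan out to one key per bound professor and per bound course, which is the source of the parallelism. Several chains may reach the same key (for example c1+name), so the number of distinct peers contacted can be smaller than the number of rewritten queries.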
The additional traffic caused by parallel execution is expected to be of little influence, however: the size of the intermediate results for the evaluation of a triple pattern in sequential execution is equal to the size of all partial results for the same triple pattern in parallel execution, and the size of a triple pattern itself is much smaller than the corresponding intermediate results.

4.5 Performance Analysis

In this section, we compare the performance of state-of-the-art individual keys indexing and compound keys indexing (3-tuples indexing) in the context of load balancing and query processing in Peer-to-Peer based RDF stores. Again, owing to the lack of access to resources with a large number of peers, we use the prototype of our 3rdf system for the simulation, where we can run multiple peers on one machine. In our performance analysis, we present an experimental evaluation done with a simulator for our RDF system, using either the state-of-the-art indexing or the novel index scheme, and 3nuts [63] as the overlay network. We repeated all experiments several times, but the variation of the result values was negligible and is thus not presented here.


Select ?x ?z Where { ?x type professor . ?x teach ?y . ?y name ?z }

Figure 4.7: query chain in parallel query processing. (Tree of routing keys: type+professor at the root; p1+teach, p2+teach, p3+teach and p4+teach at the second level; c1+name to c5+name at the leaves. ?x has 4 matching values for (?x teach ?y), and ?y has 2 to 3 matching values for each value of ?x.)

For the testing we use the Lehigh University Benchmark (LUBM [48]) data-set of one university. There were triples in total, which were indexed 3 times in the system for both the individual keys and the compound keys index schemes. The network contained up to 1000 peers. We have generated 5 query sets (based on LUBM queries) that sum up to 130 queries in total. The example query in Listing 4.3 represents one type of such queries.

Discussion of Queries

Through the evaluation of the following representative queries, we analyze the effect of parallel query processing on the query load distribution and on the data (intermediate results) transfer time. The performance difference between sequential and parallel query processing can be shown more clearly through the evaluation of queries which carry a large amount of intermediate results during their processing. For this, we have selected the following two representative queries based on the size of their intermediate results.
