Answering Aggregate Queries Over Large RDF Graphs


Lei Zou, Peking University
Ruizhe Huang, Peking University
Lei Chen, Hong Kong University of Science and Technology
M. Tamer Özsu, University of Waterloo
Dongyan Zhao, Peking University

Technical Report: TR-DB-ICST-PKU
Institute of Computer Science and Technology, Peking University, Beijing, China

Lei Zou (1), Ruizhe Huang (1), Lei Chen (2), M. Tamer Özsu (3), Dongyan Zhao (1)
(1) Peking University, China
(2) Hong Kong University of Science and Technology, China
(3) University of Waterloo, Canada
{zoulei,huangruizhe,zhaody}@pku.edu.cn, leichen@cse.ust.hk, tozsu@cs.uwaterloo.ca

ABSTRACT

In this paper, we develop a graph-based methodology for processing aggregate SPARQL queries. The recent version of the SPARQL specification (SPARQL 1.1) includes aggregation queries as a core requirement. However, few existing works and SPARQL engines consider aggregate queries. We decompose each query into a set of star aggregate queries whose results are combined to obtain the final aggregate results. We develop index structures and processing algorithms for efficient evaluation of aggregation queries. Extensive experiments on both real and benchmark datasets confirm the efficiency of our method.

1. INTRODUCTION

As a core concept in the Web of Data, RDF (Resource Description Framework) has attracted considerable interest. Generally speaking, an RDF dataset is a collection of triples, each of which is denoted as SPO (subject, property, object). Figure 1 shows a flattened representation of an RDF dataset that we use as a running example. In order to query RDF repositories, the SPARQL query language has been proposed by the W3C.

The number of RDF datasets and their volumes have been increasing significantly. For example, in the data.gov project, many open government datasets have been released in RDF format. There are also a number of biological RDF datasets, such as Bio2RDF (bio2rdf.org) and Uniprot RDF (dev.isb-sib.ch/projects/uniprot-rdf). In order to answer SPARQL queries efficiently, a number of query engines (e.g., Jena [17], RDF-3x [11]) and processing algorithms (e.g., [1, 19, 3, 20, 9, 14]) have been developed. However, so far, existing SPARQL query optimization techniques have not considered aggregation operators.
The W3C has recently proposed aggregate functions for SPARQL that extend the way the solution set is constructed. Aggregate functions within SPARQL provide the ability to partition a solution set into one or more groups based on rows that share specified values, and then to create a new solution set that contains one tuple per aggregated group. Example 1 demonstrates the semantics of SPARQL aggregation.

Example 1. Consider the following aggregate queries over the RDF dataset in Figure 1.

Figure 1: RDF Triple Table T

    Subject   Property     Object
    Person1                Male
    Person1                $50,000
    Person1
    Person1                School1
    Person1   authored     Paper1
    Person2
    Person2                Male
    Person2                $100,000
    Person2   worksin      School2
    Person2                School1
    Person2   authored     Paper2
    Person2                American
    Person3
    Person3                Male
    Person3                $40,000
    Person3                School2
    Person3   worksin      School1
    Person3   authored     Paper1
    Person4                Female
    Person4                $45,000
    Person4                Canadian
    Person4   worksin      School2
    Person4                School1
    Person4
    Person4   authored     Paper3
    Person5                Male
    Person5                $50,000
    Person5                American
    Person5                Canadian
    Person6                Associate
    Person6                Canadian
    Person6                $45,000
    Person6   authored     Paper2
    Person6                School1
    Person6                School2
    School1                China
    School1                No.1 University
    School2   located in   USA
    School2                Top University
    School3   located in   China
    School3                Rank 1 University
    Paper
    Paper1                 CIKM
    Paper1                 Mining Algorithm
    Paper
    Paper2                 VLDB
    Paper2                 Spatial Database
    Paper
    Paper3                 ICDM
    Paper3                 Frequent Mining

(Q1) What is the total salary for different groups of individuals who have the same title and gender?

    SELECT ?t ?g SUM(?s)
    WHERE { ?m <title> ?t . ?m <> ?g . ?m <salary> ?s . }
    GROUP BY ?t, ?g

(Q2) Group all individuals by their titles, nationalities, and the locations of the schools that they graduated from. Report all groups whose total salaries are higher than $95,000.

    SELECT ?t ?n ?p SUM(?s)
    WHERE { ?m <title> ?t . ?m <nationality> ?n . ?m <salary> ?s . ?m <> ?g . ?g <locatedIn> ?p . }
    GROUP BY ?t, ?n, ?p
    HAVING SUM(?s) > 95000
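The grouping semantics described above can be illustrated with a minimal Python sketch. It assumes the solution set is already available as a list of variable bindings; the function name `aggregate`, the variable keys and the toy rows are hypothetical and are not the data of Figure 1.

```python
from collections import defaultdict


def aggregate(rows, group_by, measure, having=None):
    """Partition solution rows on the group-by variables, sum the measure
    per group, and keep only the groups that pass the HAVING predicate."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[var] for var in group_by)
        totals[key] += row[measure]
    if having is not None:
        return {key: total for key, total in totals.items() if having(total)}
    return dict(totals)


# Hypothetical solution rows (one row per match of the query pattern).
rows = [
    {"t": "Professor", "g": "Male", "s": 60000},
    {"t": "Professor", "g": "Male", "s": 40000},
    {"t": "Professor", "g": "Female", "s": 50000},
]

# GROUP BY ?t, ?g with HAVING SUM(?s) > 95000
result = aggregate(rows, ["t", "g"], "s", having=lambda total: total > 95000)
print(result)
```

Each distinct combination of group-by values yields at most one aggregate tuple, and the HAVING predicate is applied to the aggregate value of the whole group rather than to individual rows.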

Each aggregate SPARQL query corresponds to one query graph. The query graphs and corresponding aggregate result sets (R(Qi)) of Q1 and Q2 are shown in Figures 2a and 2b, respectively. Note that one group is eliminated by the HAVING statement in R(Q2).

Figure 2: Aggregate Result Tables. (a) Query Q1. (b) Query Q2.

    SUM()
    Male     $100,000
    Male     $90,000
    Female   $50,000

    SUM()
    Male     China   $100,000
    Male     China   $50,000
    Male     USA     $40,000
    Female   USA     $50,000

Obviously, applications can take an original solution set and calculate aggregate values by themselves. However, this leads to expensive data transfer between SPARQL endpoints and applications, which should be avoided. It is preferable to compute aggregates within the RDF engine, since aggregate queries usually result in significantly smaller solution sets. In this paper, we focus on enabling SPARQL engines to calculate aggregates efficiently. This topic has not yet been sufficiently studied.

1.1 Related Work and Possible Solutions

Few SPARQL query engines consider aggregate queries, and to the best of our knowledge only two proposals exist in the literature [10, 12]. We discuss these below.

1.1.1 Using Existing SPARQL Query Engines

Given an aggregate SPARQL query Q, a straightforward method to answer Q is to transform it into a SPARQL query Q' without aggregation predicates, find the solution set to Q' using existing query engines, then partition the solution set into one or more groups based on rows that share specified values, and finally compute the aggregate values for each group [10]. Although it is easy for existing RDF engines to implement aggregate functions this way, the approach is problematic, since it misses opportunities for query optimization. For example, although there are only two aggregate tuples in the final aggregate result set R(Q2), this method needs to find the whole solution set for query Q2 (disregarding aggregate predicates in Q2) and then partition them into three groups.
Then, it computes an aggregate value for each group and filters out the group that cannot satisfy the HAVING condition. Obviously, this is not the optimal way to process this query.

Furthermore, it has been pointed out [12] that this method may produce incorrect answers. For example, if we convert Q2 to a traditional SPARQL query Q2' by removing the aggregate predicate in the SELECT clause and the GROUP BY, the solution set R(Q2') is given in Figure 3(a). Computing aggregation over R(Q2') does not produce the correct result of Q2: the total of the third group is computed as $135,000, but the correct result is $90,000. The reason is that Person6 graduated from two schools that are both located in China. Thus, the salary of Person6 is included in the aggregation twice in Figure 3(b). Simply removing duplicates in R(Q2') also leads to incorrect answers (the total for the third group is $45,000), since Person4 has exactly the same dimension values as Person6.

Figure 3: Problem in Using Existing SPARQL Query Engine

    (a) R(Q2')
    ?t   ?n         ?p      ?s
         American   USA     $100,000
         American   China   $100,000
         Canadian   China   $45,000
         Canadian   China   $45,000
         Canadian   China   $45,000

    (b) Computing aggregation over R(Q2')
    ?t   ?n         ?p      ?s
         American   USA     $100,000
         American   China   $100,000
         Canadian   China   $135,000

Seid and Mehrotra [12] study the semantics of group-by and aggregation in RDF graphs and how to extend SPARQL to express grouping and aggregation queries. They do not address the physical implementation or query optimization techniques.

1.1.2 Using Relational Systems

An alternative approach is to dump RDF triples into a relational system and convert the SPARQL query to SQL. Then, we can use aggregate SQL queries over tables [8, 6] to answer aggregate SPARQL queries. Generally speaking, there are three ways to store RDF triples in relational tables:

1) One big triples table: Build a single large three-column (SPO) table (like Figure 1) and convert the SPARQL query to an SQL equivalent, which is run on this table.
In order to speed up query processing, an exhaustive index based on all permutations of the S, P, O columns can be built [11, 16]. Let us consider how to answer Q1 by this method. Assume that there are 10,000 faculty members in table T (as shown in Figure 1). Although there are at most 6 groups ({male, female} × {assistant professor, associate professor, professor}) in the final aggregate result set, we need to join two temporary tables with 10,000 triples. Obviously, it is expensive to do that in an online query. Traditional data warehousing systems materialize aggregate views (i.e., compute data cubes) to speed up online aggregate queries. However, it is impossible to define a data cube over one big triples table, since RDF data may have too many dimensions (i.e., properties). For example, DBPedia data has more than 1,000 dimensions.

2) Vertical partitioning: Vertical partitioning, as proposed in the SW-store system [1], builds a two-column (S-O) table for each property. Each table is ordered by column S, allowing fast S-S merge joins. However, this solution also has an expensive online join cost for S-O joins.

3) Property table + leftover table: Jena2 [17] proposes the use of property tables to speed up query processing over RDF data. Two types of property tables are proposed. The first type is called a clustered property table, and contains clusters of properties that tend to be defined together. The second type, called a property class table, exploits the rdf:type property of subjects to cluster similar sets of subjects together in the same table. Both of these have a similar structure. The property table approach reduces the number of join steps. Furthermore, for each property table Ti, we can also materialize the data cube defined by all dimensions in Ti. However, if an aggregate query involves more than one property table, this approach cannot be used to speed up query processing, since it requires joins or unions to combine data from several tables.
Furthermore, property tables may include too many NULL values, and they cannot handle multi-valued attributes [1].

Finally, RDF data tend not to be very structured. For example, subjects of the same type need not have the same properties. In Figure 1, Person4 has property nationality, but Person1 does not. This facilitates pay-as-you-go data integration, but prohibits the application of classical relational approaches to speed up aggregate query processing. For example, materialized views, which are a commonly used optimization approach [8], may not be used easily. Assume that we have a materialized view V1 over dimensions (A, B, C). In this case, given an aggregate query over dimensions (A, B), we can get the solution set by only scanning view V1 instead of scanning the original table. However, we

cannot do the same thing in RDF, as Example 2 demonstrates.

Example 2. Considering the RDF dataset in Figure 1, assume we have the following aggregate query:

(Q3) What is the total salary for different groups of individuals who have the same gender, title and nationality?

    SELECT ?g ?t ?n SUM(?s)
    WHERE { ?m <> ?g . ?m <title> ?t . ?m <salary> ?s . ?m <nationality> ?n }
    GROUP BY ?g, ?t, ?n

Figure 4: Query Q3. (a) GROUP BY Pattern P3. (b) Aggregate Result Set R(Q3):

    SUM()
    Female   Canadian   $45,000
    Male     American   $100,000

Figure 4(b) shows the aggregate result set R(Q3) for Q3. Although the group-by dimensions in Q1 are a subset of the group-by dimensions in Q3, it is impossible to get the aggregate result set R(Q1) by scanning R(Q3), as would be possible with relational materialized views. The main reason is that SPARQL semantics is based on subgraph matching, which is not well captured in the relational representation. Consider Person1 in the RDF graph, which can match query Q1 but cannot match query Q3, since Person1 does not have the property (i.e., dimension) nationality.

1.1.3 Others

There is an extensive body of work on group-by and aggregation on XML data (e.g., [18, 5, 2]). These mainly focus on supporting group-by and aggregation at the logical [2] or physical levels [18, 5]. A key difference between XML and RDF is that XML is tree-structured, whereas RDF is graph-based. Thus, some aggregation methods, such as the merge algorithm [5], cannot be used in aggregate SPARQL queries.

1.2 Proposed Approach and Contributions

In order to answer aggregate SPARQL queries efficiently, we follow the graph matching approach, where both the RDF data and the SPARQL query are represented as graphs and the result is found by subgraph matching. Specifically, given an aggregate query Q, we first decompose it into several star aggregate queries Si (i = 1, ..., n), where each star aggregate query is formed by one vertex (called the center) and its adjacent properties (i.e., adjacent edges).
For example, queries Q1 and Q3 are star aggregate queries whose graph patterns are shown in Figures 2a and 4(a), respectively. The formal definition of a star aggregate query is given in Section 4. We propose T-index to process star aggregate queries efficiently without performing joins. A T-index is a trie, where each node N has a materialized set of tuples M(N). A star aggregate query can be answered by grouping the materialized tables associated with nodes in T-index. T-index and star aggregate query processing are discussed in Sections 4.1 and 4.2, respectively.

Once the results of the star aggregate queries Si (i = 1, ..., n) are obtained, we employ gStore [20], which we had previously proposed as a graph matching-based SPARQL query engine, to find all relevant nodes for each star center. Then, based on these relevant nodes, we can find the final result of Q. Experiments that we have conducted (Section 7) demonstrate that the performance of our method is superior to existing methods in answering aggregate queries over large RDF data.

2. PROBLEM FORMULATION

In this section, we review the terminology that we use in the paper, and formally define our problem.

[Figure 5: RDF Graph]

DEFINITION 2.1.
An RDF graph is denoted as $G = \langle V, L_V, E, L_E \rangle$, where (1) $V = V_e \cup V_l$ is a collection of vertices that correspond to all subjects and objects in the RDF graph, where $V_e$ is a collection of entity and class vertices that are represented by URIs, and $V_l$ is a collection of literal vertices; (2) $L_V$ is a collection of vertex labels assigned as follows: the label of a vertex $v \in V_l$ is its literal value, and the label of a vertex $v \in V_e$ is its corresponding URI; (3) $E$ is a collection of directed edges that connect the corresponding subjects and objects; (4) $L_E$ is a collection of edge labels, where the label of an edge $(v_1, v_2) \in E$ is its corresponding property. If $v_1 \in V_e \wedge v_2 \in V_e$, $(v_1, v_2)$ is called a link property edge. If $v_1 \in V_e \wedge v_2 \in V_l$, $(v_1, v_2)$ is called an attribute property edge.

Note that we do not distinguish class entity vertices from other entity vertices in this definition, since class entities play the same role as other entity vertices in our solution. In the remainder, the term entity vertex refers to this more general definition unless clarification is needed. Figure 5 shows the RDF graph corresponding to the data in Figure 1, in which rectangle nodes denote entity vertices, and the others are literal vertices. The number beside an entity vertex is the vertex ID, which is introduced to simplify the description. For example, properties such as title, salary and nationality are attribute properties, whereas worksin and authored are link properties.

DEFINITION 2.2. An aggregate SPARQL query Q consists of three components:
1) Query pattern P is a set of triple statements that form one query graph.
2) Group-by dimensions and measure dimensions are pre-defined object variables in query pattern P.
3) (Optional) HAVING condition specifies the condition(s) that each group must satisfy in the solution set.

Figure 6 demonstrates the three components of the aggregate query Q2, and the corresponding answer set R(Q2) is given in Figure 2b.
Note that we first assume that all group-by dimensions correspond to attribute property edges (e.g., ?t, ?n and ?p in Figure 6); we relax this assumption in Section 6.
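The distinction between link property edges and attribute property edges in Definition 2.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function `classify_edges`, the explicit set of entity vertices, and the property names `salary` and `nationality` attached to the literal values are hypothetical.

```python
def classify_edges(triples, entity_vertices):
    """Split (subject, property, object) triples into link property edges
    (object is an entity vertex) and attribute property edges (object is
    a literal vertex)."""
    link_edges, attribute_edges = [], []
    for subject, prop, obj in triples:
        if obj in entity_vertices:
            link_edges.append((subject, prop, obj))
        else:
            attribute_edges.append((subject, prop, obj))
    return link_edges, attribute_edges


# Hypothetical fragment of an RDF graph; entity vertices are listed explicitly.
entity_vertices = {"Person1", "Person2", "Paper1", "School2"}
triples = [
    ("Person1", "authored", "Paper1"),
    ("Person2", "worksin", "School2"),
    ("Person1", "salary", "$50,000"),
    ("Person2", "nationality", "American"),
]

link_edges, attribute_edges = classify_edges(triples, entity_vertices)
print([p for _, p, _ in link_edges])       # properties on link edges
print([p for _, p, _ in attribute_edges])  # properties on attribute edges
```

Under the initial assumption stated above, only the properties in the second list would be eligible as group-by dimensions.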

    SELECT ?t ?n ?p SUM(?s)                                    [measure dimension]
    WHERE { ?m <title> ?t . ?m <nationality> ?n .
            ?m <salary> ?s . ?m <> ?g . ?g <locatedIn> ?p . }  [query pattern]
    GROUP BY ?t, ?n, ?p                                        [grouping dimension]
    HAVING SUM(?s) > 95000                                     [HAVING condition (optional)]

Figure 6: Three Components in Aggregate Queries

DEFINITION 2.3. Given an aggregate query Q and its query pattern P, R(P) and R(Q) denote the solution sets to P and Q, respectively. R(P) is a set of tuples that project all matches of P in the RDF graph G onto the group-by dimensions and measure dimensions. R(Q) partitions R(P) into one or more groups based on tuples that share values on the specified group-by dimensions. Each group corresponds to an aggregate tuple, which is formed by the group-by dimension values and the aggregate value in this group. R(Q) is the set of these aggregate tuples.

Figure 7: Overview of Our Solution. Query Q (centers u1 and u2) is decomposed into S1 and S2.

    R(S1)                      R(S2)
               L                         L
    Canadian   {004,006}       USA       {011}
    American   {002}           China     {010,012}

    R(Q)
                       L1          L2          SUM()
    American   USA     {002}       {011}       100,000
    American   China   {002}       {010}       100,000
    Canadian   China   {004,006}   {010,012}   90,000

3. OVERVIEW OF OUR SOLUTION

We illustrate our method using query Q in Figure 7, which is the same as Q2 without the HAVING clause. We decompose Q into two star aggregate queries S1 and S2, defined as follows.

DEFINITION 3.1. A Star Aggregate (SA) query $(u, g = \{p_1, \ldots, p_d\}, m = \{p_{d+1}, \ldots, p_n\})$ consists of a central vertex $u$, a set of group-by dimensions $\{p_1, \ldots, p_d\}$, and a set of measure dimensions $\{p_{d+1}, \ldots, p_n\}$, where $\{p_1, \ldots, p_d, \ldots, p_n\}$ are all the attribute properties adjacent to $u$.

The query pattern P of an SA query is a star, and each match of star S over RDF graph G is a single entity vertex (see Definition 2.1) together with its neighbors. In Figure 7, S1 is an SA query, which can be denoted as (u1, {title, nationality}, {salary}). Measure dimensions are optional for an SA query, since users may be interested in the count aggregation (the number of matches to SA queries).
For example, S2 has no measure dimension.

In order to answer a star aggregate query efficiently and without joins, we propose T-index, which is a trie (i.e., a prefix tree). T-index is built over transactions, where a transaction consists of the attribute properties adjacent to one entity vertex in the RDF graph. Figure 8a shows all transactions in the RDF graph of Figure 5, which form the transaction database D. Any path beginning from the root is the prefix (Definition 4.2) of at least one transaction. For example, the path from the root to node N3 is the prefix of four transactions: 001, 002, 003 and 004. Furthermore, each node has a materialized set for these transactions (Figure 8b).

We answer a star aggregate query by accessing the relevant materialized sets in T-index without performing joins. For example, we answer star query S1 by accessing materialized sets M(N4) and M(N7). Since M(N4) is grouped on more dimensions than the group-by dimensions of query S1, we compute a temporary aggregate set M'(N4) on the group-by dimensions of S1. Similarly, we project M(N7) over the same dimensions to get M'(N7). Finally, we combine M'(N4) and M'(N7) to answer star query S1. Details of the SA query algorithm are in Section 4.

Figure 7 shows result sets R(S1) and R(S2) for the two star aggregate queries, respectively. Specifically, R(Si) groups all tuples that share the same values on the specified group-by dimensions, and each group corresponds to an aggregate tuple. In order to answer aggregate query Q in Figure 7, we need to join R(S1) and R(S2). In order to speed up the join, we find all matching vertices of vertex u1 and of vertex u2, where u1 and u2 are the star centers of S1 and S2. We use the gStore system [20] for this match. Based on the star query results R(S1) and R(S2), we partition these matches into different groups, where all tuples in one group share the same values on the group-by dimensions.
The whole process is discussed in Section 5.

4. STAR AGGREGATE QUERIES

We first discuss the aggregate index, T-index (Section 4.1), and then present the query algorithm based on T-index (Section 4.2). We also discuss, in Section 4.3, the maintenance of T-index if there are frequent updates to the RDF graph.

4.1 T-Index

For each entity vertex v in an RDF graph G, all distinct attribute properties (Definition 2.1) adjacent to v are collected to form a transaction. For example, entity vertex 001 has three adjacent attribute properties, which together form the transaction T(001). All transactions are collected to form a transaction database D, as shown in Figure 8a. Each entity vertex v in RDF graph G corresponds to one transaction T(v) in D. In each transaction, properties are ordered by descending frequency in D, where property frequency is defined as follows.

DEFINITION 4.1. The frequency of a property p in a transaction database D is $Freq(p) = |\{T(v) \mid p \in T(v) \wedge T(v) \in D\}|$.

For example, three of the dimensions in D have frequencies 6, 5 and 4, respectively. An attribute that occurs multiple times in an entity vertex (e.g., nationality for Person5) appears once in that transaction and is counted once in the frequency calculation as per Definition 4.1. Thus, in the relevant transactions, these three dimensions appear in this order. This order is important since the construction of paths follows it. Frequency ordering leads to fewer nodes being inserted, since there is a higher probability that more prefixes will be shared among different transactions. It also facilitates query processing, as discussed in Section 4.2. Note that if multiple dimensions have the same frequency, their order is arbitrarily defined (the effect of this is experimentally studied). Furthermore, for ease of presentation, we use the terms entity vertex and transaction interchangeably, as well as attribute property and dimension.

DEFINITION 4.2. Given a transaction T, the length-n prefix of T is the first n dimensions in T.

DEFINITION 4.3.
A T-index is an unordered tree structure defined as follows:

1. It consists of one root labeled as root.
2. Each node N in T-index denotes a dimension.
3. Node Nj is a child of node Ni if and only if there exists at least one transaction T, where the path reaching node Ni is the length-n prefix of T and the path reaching node Nj is the length-(n+1) prefix of T.
4. Each node N in T-index has a vertex list N.L registering the IDs of all transactions Ti, where the path reaching N is a prefix of Ti.

(a) Transaction Database D

    Vertex ID   Entity vertex   Adjacent Attribute Properties
    001         Person1
    002         Person2
    003         Person3
    004         Person4
    005         Person5
    006         Person6
    007         Paper1
    008         Paper2
    009         Paper3
    010         School1
    011         School2
    012         School3

(b) T-index: root N0 with nodes N1 to N12 and Dimension List DL

    Vertex lists: N1.L = {001,002,003,004,005,006}, N2.L = {001,002,003,004,005}, N3.L = {001,002,003,004}, N4.L = {002,004}, N5.L = {005}, N6.L = {006}, N7.L = {006}

    M(N3)                           L
    $100,000   Male                 {002}
    $50,000    Male                 {001}
    $40,000    Male                 {003}
    $45,000    Female               {004}

    M(N4)                           L
    $45,000    Female   Canadian    {004}
    $100,000   Male     American    {002}

    M(N5)                           L
    $50,000    Male     Canadian    {005}
    $50,000    Male     American    {005}

    M(N7)                           L
    $45,000    Canadian             {006}

Figure 8b shows an example of T-index. When inserting T(001) into the T-index, the path N0-N1-N2-N3 is followed, and 001 is registered to Ni.L, where i = 1, 2, 3. Furthermore, since this path is a prefix of four transactions (001, 002, 003 and 004), the vertex list of node N3 is N3.L = {001, 002, 003, 004}. Note that the storage of the vertex lists can be optimized by a variety of encoding techniques. We do not discuss this tangential issue any further.

Besides T-index, there are two associated data structures: the dimension list DL and the materialized aggregate sets M(N). Dimension list DL records all dimensions in transaction database D. When introducing a node into T-index, according to the dimension of the node, we register the node to the corresponding dimension in DL, similar to building an inverted index.
Consequently, the dimensions in DL are ordered by descending frequency.

Each node N has an aggregate set M(N). Let the dimensions along the path from the root to node N be $(l_1, l_2, \ldots, l_{|N|})$. According to N.L, one can find all transactions represented by the portion of the path reaching this node. M(N) is a set of aggregate tuples that group these transactions based on shared values on dimensions $(l_1, l_2, \ldots, l_{|N|})$. Each aggregate tuple t in M(N) has two parts: the dimensions t.d and the vertex list t.L that stores the vertex IDs in this aggregate tuple, as shown in Figure 8b. Consider node N3 in Figure 8b. We partition all transactions represented by the path N0-N1-N2-N3 into four groups based on shared values on the three dimensions along this path. Each group corresponds to one aggregate tuple t. These pre-computed aggregate sets speed up SA queries, as discussed in the next section.

Figure 8: T-index. (a) Transaction Database D. (b) T-Index.

We now discuss the construction of T-index and the computation of aggregate sets, given in Algorithm 1. Initially, a scan of D derives a list of all dimensions, and the dimension list DL is created. We introduce a root node into T-index (Lines 1-2 in Algorithm 1). When we insert a transaction T(v) into T-index, we find a node Ni, where the path reaching Ni is the maximal prefix of T(v) (Line 4). The maximal prefix means that (1) the path reaching node Ni is a length-n prefix of T(v) and (2) there exists no path that is a length-(n+1) prefix of T(v). Let $\{l_1, \ldots, l_{|N_i|}\}$ be the dimensions along the path from the root to node Ni and $\{p_1, \ldots, p_k\}$ be the dimensions of T(v). If $\{l_1, \ldots, l_{|N_i|}\} \subset \{p_1, \ldots, p_k\}$, then we need to add new nodes to extend the path (Lines 5-6). Specifically, we introduce a new node Nj as a child of node Ni to denote the (n+1)-th dimension of T(v). Iteratively, we introduce $|T(v)| - n$ new nodes (i.e., nodes corresponding to dimensions $\{p_1, \ldots, p_k\} \setminus \{l_1, \ldots, l_{|N_i|}\}$) to extend the path from node Ni.
We register these newly introduced nodes into the corresponding dimensions in DL (Line 7). Furthermore, we update the vertex lists of each node along the path (Line 9). We iterate the above processing by inserting all transactions into T-index (Lines 3-10).

Algorithm 1: T-index and Materialized Aggregate Sets
Input: Transaction database D and RDF graph G
Output: T-index and the aggregate sets associated with each node
 1: Scan D to find all dimensions and build dimension list DL.
 2: Introduce a root node into T-index.
 3: for each transaction T(v) in D do
 4:   Find a node Ni in T-index, where the path reaching Ni (i.e., {N0, ..., Ni}) is the maximal prefix of T(v). Let $\{l_1, \ldots, l_{|N_i|}\}$ be the dimensions along path {N0, ..., Ni}.
 5:   if $\{l_1, \ldots, l_{|N_i|}\} \subset \{p_1, \ldots, p_k\}$, where $\{p_1, \ldots, p_k\}$ are the dimensions of T(v), then
 6:     Introduce $|T(v)| - n$ nodes for $\{p_1, \ldots, p_k\} \setminus \{l_1, \ldots, l_{|N_i|}\}$ to extend the path from node Ni.
 7:     Register these newly introduced nodes into the corresponding dimensions in DL.
 8:   end if
 9:   Record the ID of transaction T(v) into vertex lists Ni.L, where i = 0, ..., k.
10: end for
11: for each child Ni of the root do
12:   Call Function PostOrderVisit(Ni) (Algorithm 2)
13: end for
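The insertion loop of Algorithm 1 (Lines 1-10) can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the names `Node` and `build_t_index`, the alphabetical tie-break for equally frequent dimensions, and the toy transactions are assumptions, vertex lists are kept on non-root nodes only, and the materialized aggregate sets of Lines 11-13 are left out.

```python
from collections import Counter


class Node:
    def __init__(self, dim):
        self.dim = dim          # dimension (attribute property) denoted by this node
        self.children = {}      # dimension -> child Node
        self.vertices = set()   # vertex list N.L


def build_t_index(transactions):
    """Insert every transaction along its frequency-ordered path."""
    # Definition 4.1: each dimension is counted once per transaction.
    freq = Counter(d for dims in transactions.values() for d in set(dims))
    root = Node("root")
    dimension_list = {}  # DL: dimension -> nodes labelled with that dimension
    for vertex_id, dims in transactions.items():
        # Descending frequency; ties broken alphabetically (an assumption).
        ordered = sorted(set(dims), key=lambda d: (-freq[d], d))
        node = root
        for dim in ordered:
            child = node.children.get(dim)
            if child is None:  # extend the path with a new node
                child = Node(dim)
                node.children[dim] = child
                dimension_list.setdefault(dim, []).append(child)
            child.vertices.add(vertex_id)  # register the transaction ID
            node = child
    return root, dimension_list


# Hypothetical toy transactions (not the contents of Figure 8a).
transactions = {
    "001": ["salary", "gender", "title"],
    "002": ["salary", "gender", "title", "nationality"],
    "003": ["salary", "gender"],
    "004": ["salary", "nationality"],
    "005": ["salary", "gender", "title"],
}
root, dimension_list = build_t_index(transactions)
salary = root.children["salary"]
print(sorted(salary.vertices))
print(sorted(salary.children["gender"].children["title"].vertices))
```

Because the path of each transaction is fixed by the global frequency order, the resulting tree is the same whichever order the transactions are inserted in; the same dimension may still label several nodes, which is why DL keeps a list of nodes per dimension.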

THEOREM 4.1. The structure of T-index does not depend on the order of inserting transactions into T-index.

PROOF. (Sketch) Consider T-indexes T1 and T2 that are built by two different insertion orders. Since a T-index is an unordered tree, we can prove that T1 is isomorphic to T2 by finding a bijective function F from nodes in T1 to nodes in T2. We define function F as follows: given a node N in T1, assume that N is the n-th dimension in one transaction T(v) in D. The image of N in T2 is defined as the node N' that is also the n-th dimension in transaction T(v) in D.

Algorithm 2: Function PostOrderVisit(N)
 1: for each child node Ni of N do
 2:   Call Function PostOrderVisit(Ni).
 3: end for
 4: for each child Ni of N do
 5:   compute $M'(N_i) = \Pi_{(p_1, p_2, \ldots, p_n)} M(N_i)$ by Algorithm 4
 6: end for
 7: $M(N) = \bigcup_i M'(N_i)$
 8: for each vertex v in $N.L$ but not in $\bigcup_i N_i.L$ do
 9:   $F(v) = \Pi_{(p_1, \ldots, p_n)} v$
10:   if there exists some aggregate tuple t, where t.d = F(v), then
11:     Register the vertex ID of v in vertex list t.L
12:   else
13:     Generate a new aggregate tuple t, where t.d = F(v), and insert the vertex ID of v into t.L
14:     Insert t into M(N)
15:   end if
16: end for

Obviously, for each node N in T-index, we can access all entity vertices in N.L and their attribute properties to build M(N). However, some computation and I/O can be shared when computing M(N) for different nodes. For example, if M(N3) and M(N5) have been materialized, M(N2) can be computed by merging M(N3) with M(N5) rather than accessing the original data.

THEOREM 4.2. Given a non-leaf node N in T-index that has n child nodes Ni (i = 1, ..., n), the following statement holds: for any transaction T, if the path reaching node Ni is a prefix of T, the path reaching node N is also a prefix of T.

PROOF. This follows from (3) of Definition 4.3 (about T-index).

Given a node N with n child nodes Ni (i = 1, ..., n), according to Theorem 4.2, we know that $N.L \supseteq \bigcup_i N_i.L$.
Consequently, its aggregate set M(N) can be computed from the aggregate sets associated with its child nodes. Assume that the properties along the path reaching node N are $(p_1, \ldots, p_n)$. Initially, $M(N) = \emptyset$. For each child node Ni of N, we compute $M'(N_i) = \Pi_{(p_1, p_2, \ldots, p_n)} M(N_i)$ (Lines 4-6 in Algorithm 2). Then, we compute $M(N) = \bigcup_i M'(N_i)$ (Line 7). Furthermore, for each vertex v that is in $N.L$ but not in $\bigcup_i N_i.L$, we need to access the property values of vertex v on dimensions $p_1, \ldots, p_n$ in the RDF graph (Lines 8-16). Specifically, we define function $F(v) = \Pi_{(p_1, \ldots, p_n)} v$, which means projecting the adjacent properties of v over $(p_1, \ldots, p_n)$ (Line 9). We insert F(v) into M(N). If there exists some aggregate tuple t, where t.d = F(v), we register the vertex ID of v in vertex list t.L (Lines 10-11). Otherwise, we generate a new aggregate tuple t, where t.d = F(v), and insert the vertex ID of v into t.L (Lines 12-14).

THEOREM 4.3. Any entity vertex v in RDF graph G is accessed once in computing the aggregate sets of T-index by Algorithm 1.

PROOF. (Sketch) According to the structure of T-index, each entity vertex v is in only one path of T-index. Assume that v corresponds to node N in T-index. The dimension values of v only need to be accessed when computing M(N). The aggregate set of each ancestor of N can be computed from the aggregate sets of its child nodes without accessing the original data.

We illustrate the construction of T-index using an example. First, a scan of D (Figure 8a) derives a list of all dimensions in D and their frequencies, and the dimension list DL is constructed. The root of T-index (N0) is created and labeled as root. Then, we insert all transactions of D into T-index one by one.

1. The scan of the first transaction T(001) leads to the construction of the first branch of the tree, inserting nodes N1, N2 and N3 below the root. Initially, we set N1.L = {001}, N2.L = {001} and N3.L = {001}.
2. T(002) shares the common prefix root-N1-N2-N3 with the existing path, and adds one new node N4 as a child of node N3. It also causes the vertex list of each node along the path to be updated.
3. The above process is iterated until all transactions are inserted into T-index.
4. Finally, for each node Ni in T-index, we build aggregate sets M(Ni) by post-order traversal over T-index. Specifically, we first compute M(N4) by accessing entity vertices 002 and 004 and their dimension values. Then, we compute M(N3) by merging the projection of M(N4) over the three dimensions on the path to N3 and accessing entity vertices 001 and 003. The process is iterated until all aggregate sets are computed. M(N3), M(N4) and M(N5) are given in Figure 8b as examples. Note that Person5 is a citizen of two countries, resulting in two tuples for Person5 in M(N5).

4.2 Star Aggregate Query Algorithm

Given an SA query $S = (u, g = \{p_1, \ldots, p_d\}, m = \{p_{d+1}, \ldots, p_n\})$ (see Definition 3.1), we answer S using T-index by Algorithm 3. Let $P = \{p_1, \ldots, p_d, \ldots, p_n\}$. Given the set of properties P, we find their match $(N_1, \ldots, N_n)$, where each $N_i$ is a node in T-index, all nodes $N_i$ (i = 1, ..., n) are in the same path from the root, and the property associated with $N_i$ equals $p_i$ (Line 1 in Algorithm 3). It is possible to have multiple matches for a given P. For each match $(N_1, \ldots, N_n)$, we use N to denote the farthest node from the root (Line 3 in Algorithm 3). Note that this is not necessarily $N_n$, since node identifiers are arbitrarily assigned to help with presentation. The aggregate set associated with node N is denoted as M(N). We compute M'(N) by projecting M(N) over the group-by dimensions $\{p_1, \ldots, p_d\}$ (i.e., $M'(N) = \Pi_{(p_1, p_2, \ldots, p_d)} M(N)$) using Algorithm 4 (Lines 4-5 in Algorithm 3).

Initially, M'(N) is set to be $\emptyset$ (Line 1 in Algorithm 4). As mentioned earlier, each aggregate tuple $t \in M(N)$ has two parts: t.d and t.L.
The former denotes the values over the set $D$ of dimensions, and the latter denotes the vertex list. For each aggregate tuple $t \in M(N)$, we compute $F(t) = \Pi_{(p_1, \dots, p_d)} t.d$ (Line 3 in Algorithm 4). If there exists an aggregate tuple $t' \in M'(N)$ where $t'.d = F(t)$, we update $t'.l = t'.l \cup t.l$ (Lines 4-5 in Algorithm 4). Otherwise, we insert a new aggregate tuple $t'$ into $M'(N)$, where $t'.d = F(t)$ and $t'.l = t.l$ (Lines 6-7 in Algorithm 4). The $M'(N)$ from all the matches are merged to form the final result of SA query $S$, i.e., $R(S)$ (Lines 6-7 in Algorithm 3).

Algorithm 3 SA Query Algorithm
Require: Input: A Trie-Index and a SA query $S(u, g = \{p_1, \dots, p_d\}, m = \{p_{d+1}, \dots, p_n\})$
Output: An aggregate result set $M$ for SA query $S$.
1: Locate all matches of properties $(p_1, \dots, p_d, \dots, p_n)$ in T-index
2: for each match $m_i$ do
3:   Let $N_i$ denote the node in match $m_i$ that is farthest from the root
4:   Let $M(N_i)$ denote the aggregate set associated with node $N_i$
5:   $M'(N_i) = \Pi_{(p_1, p_2, \dots, p_d)} M(N_i)$ by Algorithm 4
6: end for
7: $M = \bigcup_i M'(N_i)$
8: Return $M$

Algorithm 4 Projection Algorithm
Require: Input: $M(N)$ and dimensions $(p_1, p_2, \dots, p_d)$
Output: $M'(N) = \Pi_{(p_1, p_2, \dots, p_d)} M(N)$
1: $M'(N) = \emptyset$
2: for each aggregate tuple $t \in M(N)$ do
3:   $F(t) = \Pi_{(p_1, \dots, p_d)} t.d$
4:   if there exists an aggregate tuple $t'$ in $M'(N)$, where $t'.d = F(t)$ then
5:     $t'.l = t'.l \cup t.l$
6:   else
7:     insert $t'$ into $M'(N)$, where $t'.d = F(t)$ and $t'.l = t.l$
8:   end if
9: end for
10: Return $M'(N)$

For example, given a SA query $Q_1(u_1, \{ , \}, \{ \})$ in Figure 7, there are two matches $(N_1, N_3, N_4)$ and $(N_1, N_6, N_7)$ in T-index, which can be found by the dimension list and node links. In the first match, $N_4$ is the farthest node from the root. Since $M(N_4)$ is an aggregate set over dimensions ( , , , ), we compute a temporary aggregate set $M'(N_4)$ on dimensions ( , ) by projecting $M(N_4)$ over these dimensions. Similarly, we project $M(N_7)$ over dimensions ( , ) to get $M'(N_7)$. Finally, we obtain $R(S_1)$ by merging $M'(N_4)$ and $M'(N_7)$. Figure 9 illustrates the process.

Figure 9: Answering SA Query
  $M(N_4)$:   45,000 | Female | Canadian | L = {004}
              100,000 | Male | American | L = {001}
  $M'(N_4)$:  Canadian | L = {004}
              American | L = {002}
  $M(N_7)$:   45,000 | Canadian | L = {006}
  $M'(N_7)$:  Canadian | L = {006}
  Result:     Canadian | L = {004,006} | SUM() = $90,000
              American | L = {002} | SUM() = $100,000

4.3 Maintenance of T-index

Updates of the RDF data require efficient maintenance of T-index. Obviously, updates can be modeled as a sequence of triple deletions and triple insertions. As mentioned earlier, all dimensions are ordered by descending frequency in the dimension list DL to improve SA query evaluation performance (also confirmed by experimental results in Section 7).
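The frequency-descending order of DL, and the way a single triple deletion can disturb it, can be illustrated with a minimal Python sketch. The property names (`name`, `gender`, `salary`, `nationality`), the six transactions, and the helper names `dimension_list` and `ordered` are hypothetical stand-ins chosen for illustration; they are not taken from the paper's figures. Ties are broken alphabetically here, which is only one of the arbitrary choices the text allows.

```python
from collections import Counter

def dimension_list(transactions):
    """Return all dimensions sorted by descending frequency (the list DL)."""
    freq = Counter(d for dims in transactions.values() for d in dims)
    # Ties are broken alphabetically here; any fixed tie-break would do.
    return sorted(freq, key=lambda d: (-freq[d], d))

def ordered(dims, dl):
    """Order one transaction's dimensions the way they appear in DL."""
    rank = {d: i for i, d in enumerate(dl)}
    return sorted(dims, key=rank.__getitem__)

# Hypothetical transactions: entity ID -> set of attribute properties.
transactions = {
    "001": {"name", "gender", "salary"},
    "002": {"name", "gender", "salary", "nationality"},
    "003": {"name", "gender", "salary"},
    "004": {"name", "gender", "salary", "nationality"},
    "005": {"name", "gender"},
    "006": {"name", "salary", "nationality"},
}

dl_before = dimension_list(transactions)
path_006 = ordered(transactions["006"], dl_before)

# Deleting one "gender" triple drops its frequency below that of "salary",
# so the two adjacent dimensions swap places in DL.
transactions["002"].discard("gender")
dl_after = dimension_list(transactions)

print(dl_before)
print(path_006)
print(dl_after)
```

In this toy data the deletion changes the frequency of `gender` from 5 to 4, which moves it behind `salary`; an update that leaves the relative order intact would only touch a single path of the index.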
However, RDF data updates may change the order of dimensions in DL, which requires special care during index maintenance. Therefore, we consider updates of T-index in two cases, based on whether or not the order of dimensions in DL changes.

Dimension List DL's order does not change

Consider the insertion of a new triple $(s, p, o)$. If $p$ is a link property, we do not need to update the T-index. Thus, we only consider the case when $p$ is an attribute property, i.e., a dimension. If $s$ does not occur in the existing RDF data, we introduce a new transaction into $D$. Then, we insert the new transaction into one path of T-index following Definition 4.3. Accordingly, we need to update the materialized aggregate sets $M(N_i)$ along the path. The detailed steps are the same as Lines 4-9 in Algorithm 1.

If $s$ is already in the existing RDF graph, assume that $s$'s existing dimensions are $\{p_1, \dots, p_n\}$ and $Freq(p_i) > Freq(p) > Freq(p_{i+1})$ in dimension list DL. Again, in the case of equality, the order is chosen arbitrarily. This means that the newly inserted dimension $p$ should be placed between $p_i$ and $p_{i+1}$. We locate two nodes $N_i$ and $N_n$, where the path reaching node $N_i$ (and $N_n$) has dimensions $(p_1, \dots, p_i)$ (and $(p_1, \dots, p_i, \dots, p_n)$)$^1$. We remove subject $s$ from all materialized sets along the path between nodes $N_{i+1}$ and $N_n$, where $N_{i+1}$ is a child node of $N_i$. Then, we insert dimensions $(p, p_{i+1}, \dots, p_n)$ into T-index from node $N_i$, and update the materialized sets along the path.

Example 3. Insert a triple (Person6, , Male) into RDF triple table $T$ shown in Figure 1.
Figure 10: Example 3. Insert triple: (Person6, , Male).

Although inserting the triple changes the frequency of dimension , it does not lead to changing the order in DL. Since Person6's dimensions are ( , , ), we delete Person6 (i.e., 006) from path $N_0 N_1 N_6 N_7$ in the original T-index. Then, we insert Person6's new dimensions ( , , ) into path $N_0 N_1 N_2 N_3 N_4$ where and obtained from path $N_0 N_1 N_6 N_7$. Path $N_6 N_7$ is deleted, since the updated aggregate sets in $N_6$ and $N_7$ are empty. Figure 10 shows the updated T-index after inserting the triple.

Suppose now that we need to delete a triple $(s, p, o)$, where $p$ is an attribute property as discussed above. Assume that $s$'s existing dimensions are $\{p_1, \dots, p_n\}$ and $p = p_i$, i.e., $p$ and $p_i$ are the same dimension. We locate two nodes $N_i$ and $N_n$, where the path reaching node $N_i$ (and $N_n$) has dimensions $(p_1, \dots, p_i)$ (and $(p_1, \dots, p_i, \dots, p_n)$). We remove subject $s$ from all materialized sets along the path between nodes $N_{i+1}$ and $N_n$. Then, we insert dimensions $(p_{i+1}, \dots, p_n)$ into T-index from node $N_i$, and update the materialized sets along the path. Again, T-index itself does not need to be modified.

1 Although the dimension list is a set, when the order is important we specify it as a list enclosed in ( ).
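The insertion case for an existing subject (a new dimension $p$ is added while DL's order stays the same) can be sketched in Python as follows. This is a simplified illustration rather than the paper's implementation: the `Node` class, the helper names, and the property names are hypothetical, each node keeps only its vertex list `L`, and the materialized aggregate sets that the text also updates are left out.

```python
class Node:
    """A trie node labeled with one dimension and holding a vertex list L."""
    def __init__(self, dim=None, parent=None):
        self.dim = dim
        self.parent = parent
        self.children = {}   # dimension name -> child Node
        self.L = set()       # IDs of entity vertices whose path passes here

def walk(root, dims):
    """Return the nodes on the (existing) path spelled by dims."""
    nodes, node = [], root
    for d in dims:
        node = node.children[d]
        nodes.append(node)
    return nodes

def insert_path(start, dims, vid):
    """Register vid along dims below start, creating nodes as needed."""
    node = start
    for d in dims:
        child = node.children.get(d)
        if child is None:
            child = Node(d, node)
            node.children[d] = child
        child.L.add(vid)
        node = child

def add_dimension(root, dl, old_dims, new_dim, vid):
    """Move vid from its old path to the path that includes new_dim."""
    rank = {d: i for i, d in enumerate(dl)}
    old = sorted(old_dims, key=rank.__getitem__)
    new = sorted(old + [new_dim], key=rank.__getitem__)
    i = new.index(new_dim)               # old[:i] is the shared prefix
    path = walk(root, old)
    # Remove vid from N_{i+1}..N_n, pruning nodes that become empty.
    for node in reversed(path[i:]):
        node.L.discard(vid)
        if not node.L:
            del node.parent.children[node.dim]
    anchor = path[i - 1] if i > 0 else root
    insert_path(anchor, new[i:], vid)    # insert (p, p_{i+1}, ..., p_n)

# Hypothetical dimension list and transactions.
DL = ["name", "gender", "salary", "nationality"]
root = Node()
insert_path(root, ["name", "gender", "salary", "nationality"], "002")
insert_path(root, ["name", "gender", "salary"], "001")
insert_path(root, ["name", "salary", "nationality"], "006")

# Subject 006 gains the dimension "gender"; DL's order is unchanged.
add_dimension(root, DL, ["name", "salary", "nationality"], "gender", "006")
```

After the call, the branch that only vertex 006 used has been pruned and 006 is registered along the shared path, which mirrors the way Example 3 removes one path and extends another.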

Figure 11: Example 4. (a) Phase 1: Delete triple (Person2, , Male). (b) Phase 2: Swap & .

Dimension List DL's order changes

Some triple deletions and insertions that affect dimension frequency will change the order of dimensions in DL. Assume that two dimensions $p_i$ and $p_j$ need to be swapped in DL due to inserting one triple $(s, p, o)$ or deleting one triple $(s, p, o)$. Obviously, $j = i + 1$, i.e., $p_i$ and $p_j$ are adjacent to each other in DL. Updates are handled in two phases. First, we ignore the order change in DL and handle the updates using the method described above for the case in which DL's order does not change. Second, we swap $p_i$ and $p_j$ in DL, and change the structure of T-index and the relevant materialized sets. We focus on the second phase.

There are only three categories of paths that can be affected by swapping $p_i$ and $p_j$: (1) the path has both dimensions $p_i$ and $p_j$, i.e., it has dimensions $(p_1, \dots, p_i, p_j, \dots, p_m)$; (2) the path shares a prefix $(p_1, \dots, p_i)$ with a path in the first category; and (3) the path shares a prefix $(p_1, \dots, p_{i-1})$ with a path in the first category and $p_{i-1}$'s child is $p_j$. We discuss below how to update the three categories of affected paths.

We first locate all paths in the first category, i.e., find all paths that have both dimensions $p_i$ and $p_j$. Consider one such path $H$ with dimensions $(p_1, \dots, p_i, p_j, \dots, p_n)$.
Nodes $N_i$ and $N_j$ have dimensions $p_i$ and $p_j$, respectively. We rename $N_i$ as $N_i'$ and $N_j$ as $N_j'$. Moreover, we set the dimension of $N_i'$ to be $p_j$ and the dimension of $N_j'$ to be $p_i$. We compute the aggregate set $M(N_i')$ by projecting $M(N_j)$ over dimensions $(p_1, \dots, p_{i-1}, p_j)$, i.e., $M(N_i') = \Pi_{(p_1, \dots, p_{i-1}, p_j)} M(N_j)$, and set $M(N_j') = M(N_j)$, since they have the same group-by dimensions.

If node $N_i$ has a branch (i.e., $N_i$ has other child nodes in addition to $N_j$) in the original T-index, which is exactly the case belonging to the second category, we introduce a node $N_i''$ as a child of $N_{i-1}$, where $N_{i-1}$ is the parent of $N_i$ in the original T-index, and set the dimension of $N_i''$ to $p_i$. We move all child nodes of $N_i$ except for $N_j$ to be children of $N_i''$. We compute the aggregate set $M(N_i'') = M(N_i) \setminus \Pi_{(p_1, \dots, p_i)} M(N_j)$. Specifically, for each aggregate tuple $t$ in $M(N_i)$, we check whether there exists an aggregate tuple $t' \in \Pi_{(p_1, \dots, p_i)} M(N_j)$ where $t'.d = t.d$. If such an aggregate tuple exists, we generate a new aggregate tuple $t''$, where $t''.d = t.d$ and $t''.l = t.l \setminus t'.l$, and insert $t''$ into $M(N_i'')$. Otherwise, we insert $t$ into $M(N_i'')$ directly.

If node $N_{i-1}$ has a child $N_m$ whose corresponding dimension is $p_j$, where $N_{i-1}$ is the parent of $N_i$ in the original T-index, which is a case belonging to the third category, we merge nodes $N_i'$ and $N_m$. Specifically, we remove $N_m$ and move $N_m$'s child nodes to be child nodes of $N_i'$. Then, we compute $M(N_i') = M(N_i') \cup M(N_m)$. Specifically, for each aggregate tuple $t$ in $M(N_m)$, we check if there exists an aggregate tuple $t'$ in $M(N_i')$ where $t'.d = t.d$. If so, we set $t'.l = t'.l \cup t.l$. Otherwise, we insert $t$ into $M(N_i')$ directly.

Example 4. Delete a triple (Person2, , Male) from the RDF triple table $T$ shown in Figure 1. This changes the order of and in dimension list DL, since the frequency of is changed to 4.
We first assume that the order does not change, and we delete the triple using the method described above for the case in which DL's order does not change. Figure 11a shows the updated T-index after the first phase. Now, we swap and . Path $H_1 = N_0 N_1 N_2 N_3 N_4$ is a path in the first category, since the dimensions of $N_2$ and $N_3$ are and . Path $H_2 = N_0 N_1 N_2 N_5$ is a path in the second category, since it shares prefix $(N_0 N_1 N_2)$ with $H_1$. Path $H_3 = N_0 N_1 N_6 N_7$ is a path in the third category, since it shares prefix $N_0 N_1$ with $H_1$ and the dimension of $N_1$'s child node is . In the first step of phase 2, we first find the path $N_0 N_1 N_2 N_3$, where the dimensions in $N_2$ and $N_3$ are and , respectively. Then, we rename $N_3$ as $N_3'$ with dimension , and rename $N_2$ as $N_2'$ with dimension . The aggregate sets are $M(N_3') = M(N_3)$ and $M(N_2') = \Pi_{( , )} M(N_3)$. Since transaction 005 does not have dimension , we introduce a new node $N_2''$, and $M(N_2'') = M(N_2) \setminus \Pi_{( , )} M(N_3)$. Finally, we merge node $N_6$ with node $N_2'$, since they share the same prefixes. Figure 11b shows the final updated T-index.

5. GENERAL AGGREGATE QUERIES

As noted in Section 3, we decompose a GA query $Q$ into several SA queries $\{S_i\}$, $i = 1, \dots, n$, where each star center $u_i$ is an entity vertex in $Q$, and a link structure $J$ (Definition 5.1). The result of each $S_i$, $R(S_i)$, is computed using the approach discussed in Section 4. We then join these $R(S_i)$ to compute the result of $Q$, i.e., $R(Q) = \bowtie_i R(S_i)$. In this section, we focus on the last step of computing $R(Q)$.

DEFINITION 5.1. Let GA query $Q$ consist of $n$ SA queries. The link structure $J$ of $Q$ is a subgraph induced by all star centers $u_i$, $i = 1, \dots, n$. Specifically, $J$ is denoted as $J(V = \{u_i\}, E = \{e_j\}, \Sigma = \{P(e_j)\})$, where vertex $u_i$ ($1 \le i \le n$) is a star center, $e_j$ ($1 \le j \le m$) is an edge whose endpoints are both star centers, and $P(e_j)$ is the label (link property) of the edge $e_j$.

Note that $J$ is a connected subgraph, since all entity vertices (in $Q$) are connected together by link properties. For each $R(S_i)$, we can find a vertex list $T_i$ that includes all vertices in $R(S_i)$. Specifically, we get $T_i = \bigcup_{t \in R(S_i)} t.l$, where $t$ is an aggregate tuple in $R(S_i)$. This means that all vertices in $T_i$ are candidate matching vertices of $u_i$. To compute $R(Q)$, we need to find all subgraph matches of $J$ over the RDF graph, where a subgraph match is defined as follows.

DEFINITION 5.2. Given a link structure $J(V = \{u_i\}, E = \{e_j\}, \Sigma = \{P(e_j)\})$ in $Q$ and a subgraph $G'(V' = \{v_i\}, E' = \{e_j'\}, \Sigma' = \{P(e_j')\})$ in RDF graph $G$, where $v_i$ is an entity vertex in $G$, $e_j'$ is an edge whose endpoints are both in $V'$, and $P(e_j')$ is the label (link property) of the edge $e_j'$, $G'$ is called a subgraph match of $J$ in RDF graph $G$ if and only if:
1. $v_i \in T_i$
2. $e_j(u_{i_1}, u_{i_2}) \in E \Rightarrow e_j'(v_{i_1}, v_{i_2}) \in E'$ and $P(e_j) = P(e_j')$

Algorithm 5 General Aggregate (GA) Query Algorithm
Require: Input: A GA query $Q$
Output: Query Result $R(Q)$
1: Each entity vertex $u_i$ (in $Q$), $i = 1, \dots, n$, together with its adjacent attribute properties forms a SA query $S_i$
2: for each SA query $S_i$ do
3:   Call Algorithm 3 to find $R(S_i)$, $i = 1, \dots, n$
4:   $T_i = \bigcup_{t \in R(S_i)} t.l$
5: end for
6: All entity vertices $u_i$ together with link properties between them form the link structure $J$
7: Find all subgraph matches of $J$ over RDF graph
8: $U = \{g_1\}$, where $g_1$ includes all subgraph matches
9: for each entity vertex $u_i$ in $J$, $i = 1, \dots, n$ do
10:   set $U' = \emptyset$
11:   for each group $g$ in $U$ do
12:     for each aggregate tuple $t \in R(S_i)$ do
13:       Select a group $g'$ of matches $M \in g$ such that $M[i] \in t.l$, where $M[i]$ refers to the $i$-th vertex in $M$
14:       Insert group $g'$ into $U'$
15:     end for
16:   end for
17:   Set $U = U'$
18: end for
    Assume that the measure dimension is associated with $u_i$
19: for each group $g$ in $U$ do
20:   Find all matching vertices to $u_i$ for all matches in $g$
21:   Access the measure values of these matching vertices
22:   Compute the aggregate value in measure dimension in this group, and insert it into $R(Q)$
23: end for
24: Report final result $R(Q)$

Consider query $Q$ in Figure 7.
The results of $S_1$ and $S_2$ and the $T_i$ lists are shown in Figure 12(a), while the link structure $J$ is shown in Figure 12(b).

Figure 12: Query Decomposition. (a) SA Query & Answers: $R(S_1)$ = {American: {002}, Canadian: {004,006}}; $R(S_2)$ = {USA: {011}, China: {010,012}}; $T_1$ = {002, 004, 006}; $T_2$ = {010, 011, 012}. (b) Link Structure & Matches $(u_1, u_2)$: (002, 010), (002, 011), (004, 010), (006, 010), (006, 012).

Any subgraph isomorphism algorithm, such as VF2 [4] or ULLMAN [15], can be utilized to find all subgraph matches (as defined in Definition 5.2) of $J$ over the RDF graph. In this work, we utilize our previous system gStore [20] for this purpose, since it is optimized for subgraph isomorphism matches over the RDF graph. For example, we find five subgraph matches of $J$ over RDF graph $G$, where Figure 12(b) shows the flattened representation of these matches.

Then, we need to partition these subgraph matches into one or more groups based on subgraph matches that share specified values in group-by dimensions, and create a new solution set $R(Q)$ that contains one tuple per aggregated group. Specifically, we try to get $U = \{g_j\}$, $j = 1, \dots, m$, where all subgraph matches in group $g_j$ share the same values over group-by dimensions. Obviously, a straightforward method works as follows: for each match $M$, we access the corresponding entity vertices and their group-by dimension values in the RDF graph. Then, we partition these matches into different groups based on subgraph matches that share specified values in group-by dimensions. This method suffers from a large number of random I/Os. Furthermore, partitioning subgraph matches is an expensive task if there are a large number of subgraph matches of $J$ over the RDF graph. In order to improve the performance, we would like to utilize star aggregate query results $R(S_i)$ to partition subgraph matches, which helps reduce I/O accesses.
Furthermore, scanning aggregate sets in T-index (in the SA query algorithm) requires sequential access, which is much faster. More importantly, $R(S_i)$ has partitioned subgraph matches based on group-by dimensions associated with each vertex $u_i$ in query $Q$. Therefore, we propose to utilize $R(S_i)$ to find final partitions.

Initially, we assume that all subgraph matches are in the same group $g_1$, and set $U = \{g_1\}$ (Lines 7-8 in Algorithm 5). Then, we perform a multi-level partitioning over these subgraph matches. At the first level, we consider group-by dimensions in $R(S_1)$. For each group $g \in U$, we partition matches in $g$ into some new groups $g_i'$, $i = 1, \dots, m$, where each new group $g_i'$ has matches that share the same values over group-by dimensions in $R(S_1)$. Obviously, $g = \bigcup_i g_i'$. Specifically, in order to partition matches in group $g$, we sequentially scan $R(S_1)$. For each aggregate tuple $t \in R(S_1)$, we find a new group $g_i'$ of subgraph matches $M$, such that $M[i] \in t.l$ and $M[i]$ refers to the $i$-th vertex in subgraph match $M$ (Lines 11-12). We insert these new groups $g_i'$ into $U'$ (Line 13). We repeat the above process (Lines 11-16) for all groups $g$ in $U$. Then, $U'$ is the first-level partition. Obviously, $\bigcup_{g \in U} \{\text{all matches in group } g\} = \bigcup_{g' \in U'} \{\text{all matches in group } g'\}$. Iteratively, we consider other $R(S_i)$ for other level partitions (Lines 9-18).

Finally, for each aggregate group, we compute the aggregate value in the aggregated dimension, and insert it into the final result $R(Q)$. Assume that the measure dimension is associated with vertex $u_i$ in $Q$. For each group $g$, we find a list of distinct vertices matching $u_i$. We access the measure dimension values of these matching vertices and compute the aggregate value for this group.

Let us recall the decomposition of $Q$ given in Figure 7. We discuss how to partition all matches of $J$ into different groups based on group-by dimensions to find the final result $R(Q)$.
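Before walking through the example by hand, the multi-level partitioning (Lines 8-18 of Algorithm 5) and the final per-group aggregation can be sketched in Python. This is a minimal sketch under stated assumptions: the function name `partition_matches` and the dictionary layout are hypothetical, the SA results and matches are the values listed for Figure 12, and the `measure` values and their attachment to $u_1$ are illustrative only.

```python
def partition_matches(matches, sa_results):
    """Refine one initial group of matches level by level.

    matches:    list of tuples; m[i] is the vertex matched to the (i+1)-th star center
    sa_results: list of R(S_i); each maps a tuple of group-by values to the
                set of vertex IDs (t.l) that share those values
    Returns a dict: concatenated group-by values -> list of matches.
    """
    groups = {(): list(matches)}              # U = {g_1}
    for i, r_si in enumerate(sa_results):     # one partition level per star center
        refined = {}                          # U'
        for key, group in groups.items():
            for values, vertices in r_si.items():   # sequential scan of R(S_i)
                sub = [m for m in group if m[i] in vertices]
                if sub:                       # empty groups are not kept
                    refined[key + values] = sub
        groups = refined                      # U = U'
    return groups

# SA results and subgraph matches as listed for Figure 12.
r_s1 = {("American",): {"002"}, ("Canadian",): {"004", "006"}}
r_s2 = {("USA",): {"011"}, ("China",): {"010", "012"}}
matches = [("002", "010"), ("002", "011"), ("004", "010"),
           ("006", "010"), ("006", "012")]

groups = partition_matches(matches, [r_s1, r_s2])

# Illustrative measure values on the vertices matching u_1 (an assumption).
measure = {"002": 100000, "004": 45000, "006": 45000}
sums = {
    key: sum(measure[v] for v in {m[0] for m in group})  # distinct vertices only
    for key, group in groups.items()
}
print(groups)
print(sums)
```

Taking the set of distinct vertices before summing matters here: vertex 006 takes part in two matches of the same group, but its measure value is counted once.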
Initially, all matches are in the same group $g$, i.e., $g$ = {(002, 010), (002, 011), (004, 007), (006, 008), (006, 009)} and $U_0 = \{g\}$, where (002, 010) is a flattened representation of a match. Based on $R(S_1)$, we can get the first-level partition $U_1$. Specifically, we partition the matches of $g$ into fine-grained groups. According to the first aggregate tuple $t_1$ in $R(S_1)$ (Figure 12(a)), we create a new group $g_1$ = {(002, 010), (002, 011)}, since $002 \in t_1.L$. For the same reason, we create another group $g_2$ = {(004, 007), (006, 008), (006, 009)}. Then, the first-level partition is $U_1 = \{g_1, g_2\}$, shown in Figure 13(b). Then, based on $R(S_2)$, we can perform the second-level partition $U_2$. Specifically, we partition each group $g_i$ ($i = 1, 2$) into more fine-grained groups. Figure 13(c) shows the second-level par-


PathStack : A Holistic Path Join Algorithm for Path Query with Not-predicates on XML Data PathStack : A Holistic Path Join Algorithm for Path Query with Not-predicates on XML Data Enhua Jiao, Tok Wang Ling, Chee-Yong Chan School of Computing, National University of Singapore {jiaoenhu,lingtw,chancy}@comp.nus.edu.sg

More information

P Is Not Equal to NP. ScholarlyCommons. University of Pennsylvania. Jon Freeman University of Pennsylvania. October 1989

P Is Not Equal to NP. ScholarlyCommons. University of Pennsylvania. Jon Freeman University of Pennsylvania. October 1989 University of Pennsylvania ScholarlyCommons Technical Reports (CIS) Department of Computer & Information Science October 1989 P Is Not Equal to NP Jon Freeman University of Pennsylvania Follow this and

More information

Chapter 12: Indexing and Hashing. Basic Concepts

Chapter 12: Indexing and Hashing. Basic Concepts Chapter 12: Indexing and Hashing! Basic Concepts! Ordered Indices! B+-Tree Index Files! B-Tree Index Files! Static Hashing! Dynamic Hashing! Comparison of Ordered Indexing and Hashing! Index Definition

More information

On vertex types of graphs

On vertex types of graphs On vertex types of graphs arxiv:1705.09540v1 [math.co] 26 May 2017 Pu Qiao, Xingzhi Zhan Department of Mathematics, East China Normal University, Shanghai 200241, China Abstract The vertices of a graph

More information

Chapter 3. Algorithms for Query Processing and Optimization

Chapter 3. Algorithms for Query Processing and Optimization Chapter 3 Algorithms for Query Processing and Optimization Chapter Outline 1. Introduction to Query Processing 2. Translating SQL Queries into Relational Algebra 3. Algorithms for External Sorting 4. Algorithms

More information

Definition: A graph G = (V, E) is called a tree if G is connected and acyclic. The following theorem captures many important facts about trees.

Definition: A graph G = (V, E) is called a tree if G is connected and acyclic. The following theorem captures many important facts about trees. Tree 1. Trees and their Properties. Spanning trees 3. Minimum Spanning Trees 4. Applications of Minimum Spanning Trees 5. Minimum Spanning Tree Algorithms 1.1 Properties of Trees: Definition: A graph G

More information

CHAPTER 2. Graphs. 1. Introduction to Graphs and Graph Isomorphism

CHAPTER 2. Graphs. 1. Introduction to Graphs and Graph Isomorphism CHAPTER 2 Graphs 1. Introduction to Graphs and Graph Isomorphism 1.1. The Graph Menagerie. Definition 1.1.1. A simple graph G = (V, E) consists of a set V of vertices and a set E of edges, represented

More information

Chapter 12: Indexing and Hashing

Chapter 12: Indexing and Hashing Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B+-Tree Index Files B-Tree Index Files Static Hashing Dynamic Hashing Comparison of Ordered Indexing and Hashing Index Definition in SQL

More information

Discharging and reducible configurations

Discharging and reducible configurations Discharging and reducible configurations Zdeněk Dvořák March 24, 2018 Suppose we want to show that graphs from some hereditary class G are k- colorable. Clearly, we can restrict our attention to graphs

More information

Database System Concepts

Database System Concepts Chapter 13: Query Processing s Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2008/2009 Slides (fortemente) baseados nos slides oficiais do livro c Silberschatz, Korth

More information

Chapter 2 Graphs. 2.1 Definition of Graphs

Chapter 2 Graphs. 2.1 Definition of Graphs Chapter 2 Graphs Abstract Graphs are discrete structures that consist of vertices and edges connecting some of these vertices. Graphs have many applications in Mathematics, Computer Science, Engineering,

More information

Line Graphs and Circulants

Line Graphs and Circulants Line Graphs and Circulants Jason Brown and Richard Hoshino Department of Mathematics and Statistics Dalhousie University Halifax, Nova Scotia, Canada B3H 3J5 Abstract The line graph of G, denoted L(G),

More information

Discrete mathematics

Discrete mathematics Discrete mathematics Petr Kovář petr.kovar@vsb.cz VŠB Technical University of Ostrava DiM 470-2301/02, Winter term 2018/2019 About this file This file is meant to be a guideline for the lecturer. Many

More information

Elements of Graph Theory

Elements of Graph Theory Elements of Graph Theory Quick review of Chapters 9.1 9.5, 9.7 (studied in Mt1348/2008) = all basic concepts must be known New topics we will mostly skip shortest paths (Chapter 9.6), as that was covered

More information

Section Summary. Introduction to Trees Rooted Trees Trees as Models Properties of Trees

Section Summary. Introduction to Trees Rooted Trees Trees as Models Properties of Trees Chapter 11 Copyright McGraw-Hill Education. All rights reserved. No reproduction or distribution without the prior written consent of McGraw-Hill Education. Chapter Summary Introduction to Trees Applications

More information

Paths, Flowers and Vertex Cover

Paths, Flowers and Vertex Cover Paths, Flowers and Vertex Cover Venkatesh Raman, M.S. Ramanujan, and Saket Saurabh Presenting: Hen Sender 1 Introduction 2 Abstract. It is well known that in a bipartite (and more generally in a Konig)

More information

An undirected graph is a tree if and only of there is a unique simple path between any 2 of its vertices.

An undirected graph is a tree if and only of there is a unique simple path between any 2 of its vertices. Trees Trees form the most widely used subclasses of graphs. In CS, we make extensive use of trees. Trees are useful in organizing and relating data in databases, file systems and other applications. Formal

More information

Data Structure. IBPS SO (IT- Officer) Exam 2017

Data Structure. IBPS SO (IT- Officer) Exam 2017 Data Structure IBPS SO (IT- Officer) Exam 2017 Data Structure: In computer science, a data structure is a way of storing and organizing data in a computer s memory so that it can be used efficiently. Data

More information

Combinatorics Summary Sheet for Exam 1 Material 2019

Combinatorics Summary Sheet for Exam 1 Material 2019 Combinatorics Summary Sheet for Exam 1 Material 2019 1 Graphs Graph An ordered three-tuple (V, E, F ) where V is a set representing the vertices, E is a set representing the edges, and F is a function

More information

This article was originally published in a journal published by Elsevier, and the attached copy is provided by Elsevier for the author s benefit and for the benefit of the author s institution, for non-commercial

More information

Part XII. Mapping XML to Databases. Torsten Grust (WSI) Database-Supported XML Processors Winter 2008/09 321

Part XII. Mapping XML to Databases. Torsten Grust (WSI) Database-Supported XML Processors Winter 2008/09 321 Part XII Mapping XML to Databases Torsten Grust (WSI) Database-Supported XML Processors Winter 2008/09 321 Outline of this part 1 Mapping XML to Databases Introduction 2 Relational Tree Encoding Dead Ends

More information

Time Constrained Continuous Subgraph Search over Streaming Graphs

Time Constrained Continuous Subgraph Search over Streaming Graphs Time Constrained Continuous Subgraph Search over Streaming Graphs Youhuan Li, Lei Zou, M. Tamer Özsu, Dongyan Zhao Peking University, China; University of Waterloo, Canada; {liyouhuan,zoulei,zhaody}@pku.edu.cn,

More information

Graph Algorithms Using Depth First Search

Graph Algorithms Using Depth First Search Graph Algorithms Using Depth First Search Analysis of Algorithms Week 8, Lecture 1 Prepared by John Reif, Ph.D. Distinguished Professor of Computer Science Duke University Graph Algorithms Using Depth

More information

ON VERTEX b-critical TREES. Mostafa Blidia, Noureddine Ikhlef Eschouf, and Frédéric Maffray

ON VERTEX b-critical TREES. Mostafa Blidia, Noureddine Ikhlef Eschouf, and Frédéric Maffray Opuscula Math. 33, no. 1 (2013), 19 28 http://dx.doi.org/10.7494/opmath.2013.33.1.19 Opuscula Mathematica ON VERTEX b-critical TREES Mostafa Blidia, Noureddine Ikhlef Eschouf, and Frédéric Maffray Communicated

More information

The Encoding Complexity of Network Coding

The Encoding Complexity of Network Coding The Encoding Complexity of Network Coding Michael Langberg Alexander Sprintson Jehoshua Bruck California Institute of Technology Email: mikel,spalex,bruck @caltech.edu Abstract In the multicast network

More information

Chapter 13: Query Processing

Chapter 13: Query Processing Chapter 13: Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 13.1 Basic Steps in Query Processing 1. Parsing

More information

Structural and Syntactic Pattern Recognition

Structural and Syntactic Pattern Recognition Structural and Syntactic Pattern Recognition Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2017 CS 551, Fall 2017 c 2017, Selim Aksoy (Bilkent

More information

Characterizing Graphs (3) Characterizing Graphs (1) Characterizing Graphs (2) Characterizing Graphs (4)

Characterizing Graphs (3) Characterizing Graphs (1) Characterizing Graphs (2) Characterizing Graphs (4) S-72.2420/T-79.5203 Basic Concepts 1 S-72.2420/T-79.5203 Basic Concepts 3 Characterizing Graphs (1) Characterizing Graphs (3) Characterizing a class G by a condition P means proving the equivalence G G

More information

Matching Theory. Figure 1: Is this graph bipartite?

Matching Theory. Figure 1: Is this graph bipartite? Matching Theory 1 Introduction A matching M of a graph is a subset of E such that no two edges in M share a vertex; edges which have this property are called independent edges. A matching M is said to

More information

Graph Theory S 1 I 2 I 1 S 2 I 1 I 2

Graph Theory S 1 I 2 I 1 S 2 I 1 I 2 Graph Theory S I I S S I I S Graphs Definition A graph G is a pair consisting of a vertex set V (G), and an edge set E(G) ( ) V (G). x and y are the endpoints of edge e = {x, y}. They are called adjacent

More information

The External Network Problem

The External Network Problem The External Network Problem Jan van den Heuvel and Matthew Johnson CDAM Research Report LSE-CDAM-2004-15 December 2004 Abstract The connectivity of a communications network can often be enhanced if the

More information

Chapter 12: Query Processing

Chapter 12: Query Processing Chapter 12: Query Processing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Overview Chapter 12: Query Processing Measures of Query Cost Selection Operation Sorting Join

More information

CSE 21 Mathematics for Algorithm and System Analysis

CSE 21 Mathematics for Algorithm and System Analysis CSE 21 Mathematics for Algorithm and System Analysis Unit 4: Basic Concepts in Graph Theory Section 3: Trees 1 Review : Decision Tree (DT-Section 1) Root of the decision tree on the left: 1 Leaves of the

More information

Hamilton paths & circuits. Gray codes. Hamilton Circuits. Planar Graphs. Hamilton circuits. 10 Nov 2015

Hamilton paths & circuits. Gray codes. Hamilton Circuits. Planar Graphs. Hamilton circuits. 10 Nov 2015 Hamilton paths & circuits Def. A path in a multigraph is a Hamilton path if it visits each vertex exactly once. Def. A circuit that is a Hamilton path is called a Hamilton circuit. Hamilton circuits Constructing

More information

CS122 Lecture 10 Winter Term,

CS122 Lecture 10 Winter Term, CS122 Lecture 10 Winter Term, 2014-2015 2 Last Time: Plan Cos0ng Last time, introduced ways of approximating plan costs Number of rows each plan node produces Amount of disk IO the plan must perform Database

More information

CS 161 Lecture 11 BFS, Dijkstra s algorithm Jessica Su (some parts copied from CLRS) 1 Review

CS 161 Lecture 11 BFS, Dijkstra s algorithm Jessica Su (some parts copied from CLRS) 1 Review 1 Review 1 Something I did not emphasize enough last time is that during the execution of depth-firstsearch, we construct depth-first-search trees. One graph may have multiple depth-firstsearch trees,

More information

Hash-Based Indexing 165

Hash-Based Indexing 165 Hash-Based Indexing 165 h 1 h 0 h 1 h 0 Next = 0 000 00 64 32 8 16 000 00 64 32 8 16 A 001 01 9 25 41 73 001 01 9 25 41 73 B 010 10 10 18 34 66 010 10 10 18 34 66 C Next = 3 011 11 11 19 D 011 11 11 19

More information

CSE 190D Spring 2017 Final Exam Answers

CSE 190D Spring 2017 Final Exam Answers CSE 190D Spring 2017 Final Exam Answers Q 1. [20pts] For the following questions, clearly circle True or False. 1. The hash join algorithm always has fewer page I/Os compared to the block nested loop join

More information

On the Relationships between Zero Forcing Numbers and Certain Graph Coverings

On the Relationships between Zero Forcing Numbers and Certain Graph Coverings On the Relationships between Zero Forcing Numbers and Certain Graph Coverings Fatemeh Alinaghipour Taklimi, Shaun Fallat 1,, Karen Meagher 2 Department of Mathematics and Statistics, University of Regina,

More information

Query Processing & Optimization

Query Processing & Optimization Query Processing & Optimization 1 Roadmap of This Lecture Overview of query processing Measures of Query Cost Selection Operation Sorting Join Operation Other Operations Evaluation of Expressions Introduction

More information

Graphs That Are Randomly Traceable from a Vertex

Graphs That Are Randomly Traceable from a Vertex Graphs That Are Randomly Traceable from a Vertex Daniel C. Isaksen 27 July 1993 Abstract A graph G is randomly traceable from one of its vertices v if every path in G starting at v can be extended to a

More information

Announcement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17

Announcement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17 Announcement CompSci 516 Database Systems Lecture 10 Query Evaluation and Join Algorithms Project proposal pdf due on sakai by 5 pm, tomorrow, Thursday 09/27 One per group by any member Instructor: Sudeepa

More information

Trees : Part 1. Section 4.1. Theory and Terminology. A Tree? A Tree? Theory and Terminology. Theory and Terminology

Trees : Part 1. Section 4.1. Theory and Terminology. A Tree? A Tree? Theory and Terminology. Theory and Terminology Trees : Part Section. () (2) Preorder, Postorder and Levelorder Traversals Definition: A tree is a connected graph with no cycles Consequences: Between any two vertices, there is exactly one unique path

More information

CMSC424: Database Design. Instructor: Amol Deshpande

CMSC424: Database Design. Instructor: Amol Deshpande CMSC424: Database Design Instructor: Amol Deshpande amol@cs.umd.edu Databases Data Models Conceptual representa1on of the data Data Retrieval How to ask ques1ons of the database How to answer those ques1ons

More information

Dynamic Skyline Queries in Large Graphs

Dynamic Skyline Queries in Large Graphs Dynamic Skyline Queries in Large Graphs Lei Zou, Lei Chen 2, M. Tamer Özsu 3, and Dongyan Zhao,4 Institute of Computer Science and Technology, Peking University, Beijing, China, {zoulei,zdy}@icst.pku.edu.cn

More information

Discrete mathematics , Fall Instructor: prof. János Pach

Discrete mathematics , Fall Instructor: prof. János Pach Discrete mathematics 2016-2017, Fall Instructor: prof. János Pach - covered material - Lecture 1. Counting problems To read: [Lov]: 1.2. Sets, 1.3. Number of subsets, 1.5. Sequences, 1.6. Permutations,

More information

Bipartite Roots of Graphs

Bipartite Roots of Graphs Bipartite Roots of Graphs Lap Chi Lau Department of Computer Science University of Toronto Graph H is a root of graph G if there exists a positive integer k such that x and y are adjacent in G if and only

More information

Lecture 1. 1 Notation

Lecture 1. 1 Notation Lecture 1 (The material on mathematical logic is covered in the textbook starting with Chapter 5; however, for the first few lectures, I will be providing some required background topics and will not be

More information

Good Will Hunting s Problem: Counting Homeomorphically Irreducible Trees

Good Will Hunting s Problem: Counting Homeomorphically Irreducible Trees Good Will Hunting s Problem: Counting Homeomorphically Irreducible Trees Ira M. Gessel Department of Mathematics Brandeis University Brandeis University Combinatorics Seminar September 18, 2018 Good Will

More information

Matching and Planarity

Matching and Planarity Matching and Planarity Po-Shen Loh June 010 1 Warm-up 1. (Bondy 1.5.9.) There are n points in the plane such that every pair of points has distance 1. Show that there are at most n (unordered) pairs of

More information

Math 443/543 Graph Theory Notes

Math 443/543 Graph Theory Notes Math 443/543 Graph Theory Notes David Glickenstein September 3, 2008 1 Introduction We will begin by considering several problems which may be solved using graphs, directed graphs (digraphs), and networks.

More information

Discrete mathematics II. - Graphs

Discrete mathematics II. - Graphs Emil Vatai April 25, 2018 Basic definitions Definition of an undirected graph Definition (Undirected graph) An undirected graph or (just) a graph is a triplet G = (ϕ, E, V ), where V is the set of vertices,

More information

! A relational algebra expression may have many equivalent. ! Cost is generally measured as total elapsed time for

! A relational algebra expression may have many equivalent. ! Cost is generally measured as total elapsed time for Chapter 13: Query Processing Basic Steps in Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 1. Parsing and

More information

Chapter 13: Query Processing Basic Steps in Query Processing

Chapter 13: Query Processing Basic Steps in Query Processing Chapter 13: Query Processing Basic Steps in Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 1. Parsing and

More information

Chapter 12: Query Processing. Chapter 12: Query Processing

Chapter 12: Query Processing. Chapter 12: Query Processing Chapter 12: Query Processing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 12: Query Processing Overview Measures of Query Cost Selection Operation Sorting Join

More information