Answering Aggregate Queries Over Large RDF Graphs


Lei Zou, Peking University
Ruizhe Huang, Peking University
Lei Chen, Hong Kong University of Science and Technology
M. Tamer Özsu, University of Waterloo
Dongyan Zhao, Peking University

Technical Report: TR-DB-ICST-PKU
Institute of Computer Science and Technology, Peking University, Beijing, China

Lei Zou¹, Ruizhe Huang¹, Lei Chen², M. Tamer Özsu³, Dongyan Zhao¹
¹ Peking University, China
² Hong Kong University of Science and Technology, China
³ University of Waterloo, Canada
{zoulei,huangruizhe,zhaody}@pku.edu.cn, leichen@cse.ust.hk, tozsu@cs.uwaterloo.ca

ABSTRACT

In this paper, we develop a graph-based methodology for processing aggregate SPARQL queries. The most recent version of the SPARQL specification (SPARQL 1.1) includes aggregate queries as a core requirement. However, few existing works and SPARQL engines consider aggregate queries. We decompose each query into a set of star aggregate queries whose results are combined to obtain the final aggregate results. We develop index structures and processing algorithms for efficient evaluation of aggregate queries. Extensive experiments on both real and benchmark datasets confirm the efficiency of our method.

1. INTRODUCTION

As a core concept in the Web of Data, RDF (Resource Description Framework) has attracted considerable interest. Generally speaking, an RDF dataset is a collection of triples, each of which is denoted as SPO (subject, property, object). Figure 1 shows a flattened representation of an RDF dataset that we use as a running example. In order to query RDF repositories, the SPARQL query language has been proposed by W3C.

The number of RDF datasets and their volumes have been increasing significantly. For example, in the data.gov project, many open government datasets have been released in RDF format. There are also a number of biological RDF datasets, such as Bio2RDF (bio2rdf.org) and Uniprot RDF (dev.isb-sib.ch/projects/uniprot-rdf). In order to answer SPARQL queries efficiently, a number of query engines (e.g., Jena [17], RDF-3x [11]) and processing algorithms (e.g., [1, 19, 3, 20, 9, 14]) have been developed. However, existing SPARQL query optimization techniques have not so far considered aggregation operators.
W3C has recently proposed aggregate functions for SPARQL that extend the way the solution set is constructed. Aggregate functions within SPARQL provide the ability to partition a solution set into one or more groups based on rows that share specified values, and then to create a new solution set that contains one tuple per aggregated group. Example 1 demonstrates the semantics of SPARQL aggregation.

Example 1. Consider the following aggregate queries over the RDF dataset in Figure 1.

Figure 1: RDF Triple Table T
Subject Property Object Person1 Male Person1 $50,000 Person1 Person1 School1 Person1 authored Paper1 Person2 Person2 Male Person2 $100,000 Person2 worksin School2 Person2 School1 Person2 authored Paper2 Person2 American Person3 Person3 Male Person3 $40,000 Person3 School2 Person3 worksin School1 Person3 authored Paper1 Person4 Female Person4 $45,000 Person4 Canadian Person4 worksin School2 Person4 School1 Person4 Person4 authored Paper3 Person5 Male Person5 $50,000 Person5 American Person5 Canadian Person6 Associate Person6 Canadian Person6 $45,000 Person6 authored Paper2 Person6 School1 Person6 School2 School1 China School1 No.1 University School2 located in USA School2 Top University School3 located in China School3 Rank 1 University Paper Paper1 CIKM Paper1 Mining Algorithm Paper Paper2 VLDB Paper2 Spatial Database Paper Paper3 ICDM Paper3 Frequent Mining

(Q1) What is the total salary for different groups of individuals who have the same title and gender?

    SELECT ?t ?g SUM(?s)
    WHERE { ?m <title> ?t . ?m <> ?g . ?m <salary> ?s . }
    GROUP BY ?t, ?g

(Q2) Group all individuals by their titles, nationalities, and the locations of the schools that they graduated from. Report all groups whose total salaries are higher than $95,000.

    SELECT ?t ?n ?p SUM(?s)
    WHERE { ?m <title> ?t . ?m <nationality> ?n . ?m <salary> ?s . ?m <> ?g . ?g <locatedIn> ?p . }
    GROUP BY ?t, ?n, ?p
    HAVING SUM(?s) > 95000
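The grouping semantics described above (partition the solution set on shared values, then emit one tuple per group, optionally filtered by HAVING) can be illustrated with a minimal Python sketch. The function name `aggregate`, the row layout, and the sample rows are assumptions made for illustration; they are not the data of Figure 1 and not part of any SPARQL engine.

```python
from collections import defaultdict

def aggregate(rows, group_keys, measure, having=None):
    """Partition rows on group_keys, SUM the measure, keep one tuple per group."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[k] for k in group_keys)  # values shared by the group
        totals[key] += row[measure]
    # Apply the optional HAVING predicate to each group's aggregate value.
    return {key: total for key, total in totals.items()
            if having is None or having(total)}

# Hypothetical solution rows (variable bindings); illustrative only.
rows = [
    {"t": "Professor", "n": "American", "s": 100000},
    {"t": "Professor", "n": "Canadian", "s": 45000},
    {"t": "Professor", "n": "Canadian", "s": 45000},
]
result = aggregate(rows, ("t", "n"), "s", having=lambda total: total > 95000)
```

Here the Canadian group sums to 90000 and is dropped by the HAVING predicate, which mirrors how one group is eliminated in a query of the shape of Q2.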

Each aggregate SPARQL query corresponds to one query graph; the query graphs and corresponding aggregate result sets (R(Qi)) of Q1 and Q2 are shown in Figures 2(a) and 2(b), respectively. Note that one group is eliminated by the HAVING statement in R(Q2).

Figure 2: Aggregate Result Tables
(a) Query Q1. SUM(): Male $100,000; Male $90,000; Female $50,000
(b) Query Q2. SUM(): Male China $100,000; Male China $50,000; Male USA $40,000; Female USA $50,000

Obviously, applications can take an original solution set and calculate aggregate values by themselves. However, this leads to expensive data transfer between SPARQL endpoints and applications, which should be avoided. It is preferable to compute aggregates within the RDF engine, since aggregate queries usually result in significantly smaller solution sets. In this paper, we focus on enabling SPARQL engines to calculate aggregates efficiently. This topic has not yet been sufficiently studied.

1.1 Related Work and Possible Solutions

Few SPARQL query engines consider aggregate queries, and to the best of our knowledge only two proposals exist in the literature [10, 12]. We discuss these below.

Using Existing SPARQL Query Engines

Given an aggregate SPARQL query Q, a straightforward method to answer Q is to transform it into a SPARQL query Q′ without aggregation predicates, find the solution set to Q′ with an existing query engine, then partition the solution set into one or more groups based on rows that share specified values, and finally compute the aggregate values for each group [10]. Although it is easy for existing RDF engines to implement aggregate functions this way, the approach is problematic, since it misses opportunities for query optimization. For example, although there are only two aggregate tuples in the final aggregate result set R(Q2), this method needs to find the whole solution set for query Q2 (disregarding aggregate predicates in Q2) and then partition it into three groups.
Then, it computes an aggregate value for each group and filters out the group that does not satisfy the HAVING condition. Obviously, this is not the optimal way to process this query.

Furthermore, it has been pointed out [12] that this method may produce incorrect answers. For example, if we convert Q2 to a traditional SPARQL query Q′2 by removing the aggregate predicate in the SELECT clause and the GROUP BY clause, the solution set R(Q′2) is given in Figure 3(a). Computing aggregation over R(Q′2) does not produce the correct result of Q2: the total salary of the third group is computed as $135,000, but the correct result is $90,000. The reason is that Person6 graduated from two schools that are both located in China. Thus, the salary of Person6 is included in the aggregation twice in Figure 3(b). Simply removing duplicates in R(Q′2) also leads to incorrect answers (the total for the third group becomes $45,000), since Person4 has exactly the same dimension values as Person6.

Figure 3: Problem in Using Existing SPARQL Query Engine
(a) R(Q′2), columns ?t ?n ?p ?s:
    American  USA    $100,000
    American  China  $100,000
    Canadian  China  $45,000
    Canadian  China  $45,000
    Canadian  China  $45,000
(b) Computing aggregation over R(Q′2), columns ?t ?n ?p ?s:
    American  USA    $100,000
    American  China  $100,000
    Canadian  China  $135,000

Seid and Mehrotra [12] study the semantics of group-by and aggregation in RDF graphs and how to extend SPARQL to express grouping and aggregation queries. They do not address the physical implementation or query optimization techniques.

Using Relational Systems

An alternative approach is to dump RDF triples into a relational system and convert the SPARQL query to SQL. Then, we can use aggregate SQL queries over tables [8, 6] to answer aggregate SPARQL queries. Generally speaking, there are three ways to store RDF triples in relational tables:

1) One big triples table: Build a single large three-column (SPO) table (like Figure 1) and convert the SPARQL query to an SQL equivalent, which is run on this table.
In order to speed up query processing, an exhaustive index based on all permutations of the S, P, O columns can be built [11, 16]. Let us consider how to answer Q1 by this method. Assume that there are 10,000 faculty members in table T (as shown in Figure 1). Although there are at most 6 groups ({male, female} × {assistant professor, associate professor, professor}) in the final aggregate result set, we need to join two temporary tables with 10,000 triples. Obviously, it is expensive to do that in an online query. Traditional data warehousing systems materialize aggregate views (i.e., compute data cubes) to speed up online aggregate queries. However, it is impossible to define a data cube over one big triples table, since RDF data may have too many dimensions (i.e., properties). For example, DBPedia data has more than 1,000 dimensions.

2) Vertical partitioning: Vertical partitioning, as proposed in the SW-store system [1], builds a two-column (S-O) table for each property, ordered by column S to allow fast S-S merge joins. However, this solution also has an expensive online join cost for S-O joins.

3) Property table + leftover table: Jena2 [17] proposes the use of property tables to speed up query processing over RDF data. Two types of property tables are proposed. The first type, called a clustered property table, contains clusters of properties that tend to be defined together. The second type, called a property class table, exploits the rdf:type property of subjects to cluster similar sets of subjects together in the same table. Both have a similar structure. The property table approach reduces the number of join steps. Furthermore, for each property table Ti, we can also materialize the data cube defined by all dimensions in Ti. However, if an aggregate query involves more than one property table, this approach cannot be used to speed up query processing, since it requires joins or unions to combine data from several tables.
Furthermore, property tables may include too many NULL values, and they cannot handle multi-valued attributes [1].

Finally, RDF data tend not to be very structured. For example, subjects of the same type need not have the same properties. In Figure 1, Person4 has the property nationality, but Person1 does not. This facilitates pay-as-you-go data integration, but prohibits the application of classical relational approaches to speed up aggregate query processing. For example, materialized views, which are a commonly used optimization approach [8], may not be used easily. Assume that we have a materialized view V1 over dimensions (A, B, C). In this case, given an aggregate query over dimensions (A, B), we can get the solution set by scanning only view V1 instead of the original table. However, we cannot do the same thing in RDF, as Example 2 demonstrates.

Example 2. Considering the RDF dataset in Figure 1, assume we have the following aggregate query:

(Q3) What is the total salary for different groups of individuals who have the same gender, title, and nationality?

    SELECT ?g ?t ?n SUM(?s)
    WHERE { ?m <> ?g . ?m <title> ?t . ?m <salary> ?s . ?m <nationality> ?n }
    GROUP BY ?g, ?t, ?n

Figure 4: Query Q3
(a) GROUP BY Pattern P3
(b) Aggregate Result Set R(Q3). SUM(): Female Canadian $45,000; Male American $100,000

Figure 4(b) shows the aggregate result set R(Q3) for Q3. Although the group-by dimensions in Q1 are a subset of the group-by dimensions in Q3, it is impossible to get the aggregate result set R(Q1) by scanning R(Q3), as would be possible with relational materialized views. The main reason is that SPARQL semantics is based on subgraph matching, which is not well captured in the relational representation. Consider Person1 in the RDF graph, which can match query Q1 but cannot match query Q3, since Person1 does not have the property (i.e., dimension) nationality.

Others

There is an extensive body of work on group-by and aggregation on XML data (e.g., [18, 5, 2]). These works mainly focus on supporting group-by and aggregation at the logical [2] or physical level [18, 5]. A key difference between XML and RDF is that XML is tree-structured, whereas RDF is graph-based. Thus, some aggregation methods, such as the merge algorithm [5], cannot be used for aggregate SPARQL queries.

1.2 Proposed Approach and Contributions

In order to answer aggregate SPARQL queries efficiently, we follow the graph matching approach, where both the RDF data and the SPARQL query are represented as graphs and the result is found by subgraph matching. Specifically, given an aggregate query Q, we first decompose it into several star aggregate queries Si (i = 1, ..., n), where each star aggregate query is formed by one vertex (called the center) and its adjacent properties (i.e., adjacent edges).
For example, queries Q1 and Q3 are star aggregate queries whose graph patterns are shown in Figures 2(a) and 4(a), respectively. The formal definition of a star aggregate query is given in Section 4. We propose T-index to process star aggregate queries efficiently without performing joins. A T-index is a trie, where each node N has a materialized set of tuples M(N). A star aggregate query can be answered by grouping the materialized tables associated with nodes in T-index. T-index and star aggregate query processing are discussed in Sections 4.1 and 4.2, respectively.

Once the results of the star aggregate queries Si (i = 1, ..., n) are obtained, we employ gStore [20], which we previously proposed as a graph matching-based SPARQL query engine, to find all relevant nodes for each star center. Then, based on these relevant nodes, we can find the final result of Q. Experiments that we have conducted (Section 7) demonstrate that our method outperforms existing methods in answering aggregate queries over large RDF data.

2. PROBLEM FORMULATION

In this section, we review the terminology that we use in the paper, and formally define our problem.

Figure 5: RDF Graph

DEFINITION 2.1.
An RDF graph is denoted as G = ⟨V, L_V, E, L_E⟩, where (1) V = V_e ∪ V_l is a collection of vertices that correspond to all subjects and objects in the RDF graph, where V_e is a collection of entity and class vertices that are represented by URIs, and V_l is a collection of literal vertices; (2) L_V is a collection of vertex labels assigned as follows: the label of a vertex v ∈ V_l is its literal value, and the label of a vertex v ∈ V_e is its corresponding URI; (3) E is a collection of directed edges that connect the corresponding subjects and objects; (4) L_E is a collection of edge labels, where the label of an edge (v1, v2) ∈ E is its corresponding property. If v1 ∈ V_e and v2 ∈ V_e, (v1, v2) is called a link property edge. If v1 ∈ V_e and v2 ∈ V_l, (v1, v2) is called an attribute property edge.

Note that we do not distinguish class entity vertices from other entity vertices in this definition, since class entities play the same role as other entity vertices in our solution. In the remainder, the term entity vertex refers to this more general definition unless clarification is needed.

Figure 5 shows the RDF graph corresponding to the data in Figure 1, in which rectangle nodes denote entity vertices, and the others are literal vertices. The number beside an entity vertex is the vertex ID, which is introduced to simplify the description. For example, properties such as title, nationality and salary are attribute properties, whereas worksin and authored are link properties.

DEFINITION 2.2. An aggregate SPARQL query Q consists of three components: 1) The query pattern P is a set of triple statements that form one query graph. 2) Group-by dimensions and measure dimensions are pre-defined object variables in query pattern P. 3) The (optional) HAVING condition specifies the condition(s) that each group must satisfy in the solution set.

Figure 6 demonstrates the three components of the aggregate query Q2, and the corresponding answer set R(Q2) is given in Figure 2(b).
Note that we first assume that all group-by dimensions correspond to attribute property edges (e.g., ?t, ?n and ?p in Figure 6); we relax this assumption in Section 6.
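As a small illustration of Definition 2.1, the following Python sketch splits the edges of an RDF graph into link property edges and attribute property edges. It assumes that the set of entity vertices is supplied explicitly (in real data, entities are identified by URIs); the function name `classify_edges` is hypothetical, and the property name `salary` is taken from the query text rather than from Figure 1.

```python
def classify_edges(triples, entities):
    """Split (subject, property, object) triples by the kind of their object vertex."""
    link, attribute = [], []
    for s, p, o in triples:
        # An edge into an entity vertex is a link property edge;
        # an edge into a literal vertex is an attribute property edge.
        (link if o in entities else attribute).append((s, p, o))
    return link, attribute

# A few illustrative triples in the spirit of the running example.
triples = [
    ("Person2", "worksin", "School2"),
    ("Person1", "authored", "Paper1"),
    ("Person2", "salary", "$100,000"),  # property name assumed
]
entities = {"Person1", "Person2", "School2", "Paper1"}
link, attribute = classify_edges(triples, entities)
```

Under the assumption stated above, only the attribute property edges would be candidates for group-by and measure dimensions.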

    SELECT ?t ?n ?p SUM(?s)                        (measure dimension: ?s)
    WHERE { ?m <title> ?t . ?m <nationality> ?n .
            ?m <salary> ?s . ?m <> ?g .
            ?g <locatedIn> ?p . }                  (query pattern)
    GROUP BY ?t, ?n, ?p                            (grouping dimensions)
    HAVING SUM(?s) > 95000                         (HAVING condition, optional)

Figure 6: Three Components in Aggregate Queries

DEFINITION 2.3. Given an aggregate query Q and its query pattern P, R(P) and R(Q) denote the solution sets to P and Q, respectively. R(P) is a set of tuples that project all matches of P in the RDF graph G onto the group-by dimensions and measure dimensions. R(Q) partitions R(P) into one or more groups based on tuples that share values on the specified group-by dimensions. Each group corresponds to an aggregate tuple, which is formed by the group-by dimension values and the aggregate value of the group. R(Q) is the set of these aggregate tuples.

Figure 7: Overview of Our Solution
Query Q (centers u1, u2) is decomposed into star queries S1 and S2.
R(S1), with vertex list L: Canadian {004,006}; American {002}
R(S2), with vertex list L: USA {011}; China {010,012}
R(Q), with vertex lists L1, L2 and SUM(): American USA {002} {011} 100,000; American China {002} {010} 100,000; Canadian China {004,006} {010,012} 90,000

3. OVERVIEW OF OUR SOLUTION

We illustrate our method using query Q in Figure 7, which is the same as Q2 without the HAVING clause. We decompose Q into two star aggregate queries S1 and S2, defined as follows.

DEFINITION 3.1. A Star Aggregate (SA) query (u, g = {p1, ..., pd}, m = {pd+1, ..., pn}) consists of a central vertex u, a set of group-by dimensions {p1, ..., pd}, and a set of measure dimensions {pd+1, ..., pn}, where {p1, ..., pd, ..., pn} are all the attribute properties adjacent to u. The query pattern P of an SA query is a star, and each match of star S over RDF graph G is a single entity vertex (see Definition 2.1) together with its neighbors.

In Figure 7, S1 is an SA query, which can be denoted as (u1, {title, nationality}, {salary}). Measure dimensions are optional for an SA query, since users may be interested in the count aggregation (the number of matches to SA queries).
For example, S2 has no measure dimension.

In order to answer a star aggregate query efficiently and without joins, we propose T-index, which is a trie (i.e., a prefix tree). T-index is built from transactions, where a transaction consists of an entity vertex in the RDF graph and its adjacent attribute properties; each node of T-index corresponds to a dimension. Figure 8a shows all transactions in the RDF graph of Figure 5, which form the transaction database D. Any path beginning from the root is the prefix (Definition 4.2) of at least one transaction. For example, the path from the root to node N3 is the prefix of four transactions: 001, 002, 003 and 004. Furthermore, each node has a materialized set for these transactions (Figure 8b).

We answer a star aggregate query by accessing the relevant materialized sets in T-index without performing joins. For example, we answer star query S1 by accessing materialized sets M(N4) and M(N7). Since M(N4) is grouped on more dimensions than the group-by dimensions of query S1, we compute a temporary aggregate set M′(N4) by projecting M(N4) onto the group-by dimensions of S1. Similarly, we project M(N7) onto the same dimensions to get M′(N7). Finally, we combine M′(N4) and M′(N7) to answer star query S1. Details of the SA query algorithm are given in Section 4.

Figure 7 shows the result sets R(S1) and R(S2) for the two star aggregate queries. Specifically, R(Si) groups all tuples that share the same values on the specified group-by dimensions, and each group corresponds to an aggregate tuple. In order to answer aggregate query Q in Figure 7, we need to join R(S1) and R(S2). In order to speed up the join, we find all matching vertices of vertex u1 and of vertex u2, where u1 and u2 are the star centers of S1 and S2. We use the gStore system [20] for this match. Based on the star query results R(S1) and R(S2), we partition these matches into different groups, where all tuples in one group share the same values on the group-by dimensions.
The whole process is discussed later in the paper.

4. STAR AGGREGATE QUERIES

We first discuss the aggregate index, T-index (Section 4.1), and then present the query algorithm based on T-index (Section 4.2). We also discuss, in Section 4.3, the maintenance of T-index when there are frequent updates to the RDF graph.

4.1 T-Index

For each entity vertex v in an RDF graph G, all distinct attribute properties (Definition 2.1) adjacent to v are collected to form a transaction. For example, entity vertex 001 (Person1) has three adjacent attribute properties, which together form the transaction T(001). All transactions are collected to form a transaction database D, as shown in Figure 8a. Each entity vertex v in RDF graph G corresponds to one transaction T(v) in D. In each transaction, properties are sorted in descending order of their frequency in D, where property frequency is defined as follows.

DEFINITION 4.1. The frequency of a property p in a transaction database D is Freq(p) = |{T(v) | p ∈ T(v) ∧ T(v) ∈ D}|.

For example, three of the properties in D have frequencies 6, 5 and 4, respectively. An attribute that occurs multiple times in an entity vertex (e.g., nationality for Person5) appears once in that transaction and is counted once in the frequency calculation, as per Definition 4.1. Thus, the property with frequency 6 precedes the property with frequency 5, which in turn precedes the property with frequency 4 in the relevant transactions. This order is important since the construction of paths follows it. Frequency ordering leads to fewer nodes being inserted, since there is a higher probability that prefixes will be shared among different transactions. It also facilitates query processing, as discussed in Section 4.2. Note that if multiple dimensions have the same frequency, their order is arbitrarily defined (the effect of this is experimentally studied). Furthermore, for ease of presentation, we use the terms entity vertex and transaction interchangeably, as well as attribute property and dimension.

DEFINITION 4.2. Given a transaction T, the length-n prefix of T is the first n dimensions in T.

DEFINITION 4.3.
A T-index is an unordered tree structure defined as follows:

1. It consists of one root labeled as root.
2. Each node N in T-index denotes a dimension.
3. Node Nj is a child of node Ni if and only if there exists at least one transaction T such that the path reaching node Ni is the length-n prefix of T and the path reaching node Nj is the length-(n + 1) prefix of T.
4. Each node N in T-index has a vertex list N.L registering the IDs of all transactions Ti for which the path reaching N is a prefix of Ti.

(a) Transaction Database D (columns: Vertex ID, Entity vertex, Adjacent Attribute Properties). Vertex IDs and entity vertices: 001 Person1, 002 Person2, 003 Person3, 004 Person4, 005 Person5, 006 Person6, 007 Paper1, 008 Paper2, 009 Paper3, 010 School1, 011 School2, 012 School3.

(b) T-Index, with nodes root (N0) and N1 to N12 and dimension list DL.
Vertex lists: N1.L = {001,002,003,004,005,006}; N2.L = {001,002,003,004,005}; N3.L = {001,002,003,004}; N4.L = {002,004}.
M(N3), with vertex list L: $100,000 Male {002}; $50,000 Male {001}; $40,000 Male {003}; $45,000 Female {004}
M(N4), with vertex list L: $45,000 Female Canadian {004}; $100,000 Male American {002}
M(N5), with vertex list L: $50,000 Male Canadian {005}; $50,000 Male American {005}
M(N7), with vertex list L: $45,000 Canadian {006}

Figure 8b shows an example of T-index. When inserting T(001) into the T-index, the path N0-N1-N2-N3 is followed, and 001 is registered in Ni.L for i = 1, 2, 3. Furthermore, since this path is a prefix of four transactions (001, 002, 003, 004), the vertex list of node N3 is N3.L = {001, 002, 003, 004}. Note that the storage of the vertex lists can be optimized by a variety of encoding techniques. We do not discuss this tangential issue any further.

Besides T-index, there are two associated data structures: the dimension list DL and the materialized aggregate sets M(N). The dimension list DL records all dimensions in transaction database D. When introducing a node into T-index, we register the node under its dimension in DL, similar to building an inverted index.
The dimensions in DL are sorted in descending order of frequency.

Each node N has an aggregate set M(N). Let the dimensions along the path from the root to node N be (l_1, l_2, ..., l_N). According to N.L, one can find all transactions represented by the portion of the path reaching this node. M(N) is a set of aggregate tuples that group these transactions based on shared values on dimensions (l_1, l_2, ..., l_N). Each aggregate tuple t in M(N) has two parts: the dimension values t.D and the vertex list t.L that stores the vertex IDs in this aggregate tuple, as shown in Figure 8b. Consider node N3 in Figure 8b. We partition all transactions represented by the path N0-N1-N2-N3 into four groups based on shared values on the dimensions along this path. Each group corresponds to one aggregate tuple t. These pre-computed aggregate sets speed up SA queries, as discussed in the next section.

Figure 8: T-index. (a) Transaction Database D; (b) T-Index

We now discuss the construction of T-index and the computation of aggregate sets, given in Algorithm 1. Initially, a scan of D derives a list of all dimensions, and the dimension list DL is created. We introduce a root node into T-index (Lines 1-2 in Algorithm 1). When we insert a transaction T(v) into T-index, we find a node Ni such that the path reaching Ni is the maximal prefix of T(v) (Line 4). The maximal prefix means that (1) the path reaching node Ni is a length-n prefix of T(v) and (2) there exists no path that is a length-(n+1) prefix of T(v). Let {l_1, ..., l_Ni} be the dimensions along the path from the root to node Ni and {p_1, ..., p_k} be the dimensions of T(v). If {l_1, ..., l_Ni} ⊂ {p_1, ..., p_k}, then we need to add new nodes to extend the path (Lines 5-6). Specifically, we introduce a new node Nj as a child of node Ni to denote the (n + 1)-th dimension of T(v). Iteratively, we introduce (|T(v)| − n) new nodes (i.e., nodes corresponding to dimensions {p_1, ..., p_k} \ {l_1, ..., l_Ni}) to extend the path from node Ni.
We register these newly introduced nodes under the corresponding dimensions in DL (Line 7). Furthermore, we update the vertex list of each node along the path (Line 9). We repeat this process until all transactions are inserted into T-index (Lines 3-10).

Algorithm 1 T-index and Materialized Aggregate Sets
Input: Transaction database D and RDF graph G
Output: T-index and the aggregate sets associated with each node
1: Scan D to find all dimensions and build dimension list DL.
2: Introduce a root node into T-index.
3: for each transaction T(v) in D do
4:   Find a node Ni in T-index, where the path reaching Ni (i.e., {N0, ..., Ni}) is the maximal prefix of T(v). Let {l_1, ..., l_Ni} be the dimensions along path {N0, ..., Ni}.
5:   if {l_1, ..., l_Ni} ⊂ {p_1, ..., p_k}, where {p_1, ..., p_k} are the dimensions of T(v), then
6:     Introduce |T(v)| − n nodes for {p_1, ..., p_k} \ {l_1, ..., l_Ni} to extend the path from node Ni.
7:     Register these newly introduced nodes under the corresponding dimensions in DL.
8:   end if
9:   Record the ID of transaction T(v) in vertex lists Ni.L, where i = 0, ..., k.
10: end for
11: for each child Ni of the root do
12:   Call Function PostOrderVisit(Ni) (Algorithm 2)
13: end for
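Lines 1-10 of Algorithm 1 (building the trie, the vertex lists and the dimension list, without the aggregate sets) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class `Node`, the function `build_t_index`, the alphabetical tie-breaking rule, and the abstract dimension names `a` to `d` are all hypothetical.

```python
from collections import Counter

class Node:
    def __init__(self, dim=None):
        self.dim = dim        # dimension (attribute property) denoted by this node
        self.children = {}    # dimension -> child Node
        self.L = set()        # vertex list: IDs of transactions having this prefix

def build_t_index(db):
    """db maps a vertex ID to the set of its distinct attribute properties."""
    # Freq(p): number of transactions that contain p (Definition 4.1).
    freq = Counter(p for props in db.values() for p in props)
    # Frequency-descending order; ties broken by name (an arbitrary choice here).
    order = lambda p: (-freq[p], p)
    root, dl = Node(), {}     # dl is the dimension list: dimension -> nodes
    for vid, props in db.items():
        node = root
        node.L.add(vid)
        for p in sorted(props, key=order):
            if p not in node.children:          # extend the maximal prefix
                node.children[p] = Node(p)
                dl.setdefault(p, []).append(node.children[p])
            node = node.children[p]
            node.L.add(vid)                     # register the ID along the path
    return root, dl

# Hypothetical transaction database with abstract dimension names.
db = {"v1": {"a", "b", "c"}, "v2": {"a", "b", "c", "d"}, "v3": {"a", "b", "c"},
      "v4": {"a", "b", "d"}, "v5": {"a", "b"}, "v6": {"a"}}
root, dl = build_t_index(db)
```

In this toy database the frequencies are a = 6, b = 5, c = 3 and d = 2, so the dimension d ends up on two different paths (a-b-c-d and a-b-d), which is why the dimension list keeps a list of nodes per dimension.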

THEOREM 4.1. The structure of T-index does not depend on the order of inserting transactions into T-index.

PROOF. (Sketch) Consider T-indexes T1 and T2 that are built by two different insertion orders. Since a T-index is an unordered tree, we can prove that T1 is isomorphic to T2 by finding a bijective function F from nodes in T1 to nodes in T2. We define function F as follows: given a node N in T1, assume that N is the n-th dimension in one transaction T(v) in D. The image of N in T2 is defined as node N′, where N′ is also the n-th dimension in transaction T(v) in D.

Algorithm 2 Function: PostOrderVisit(N)
1: for each child node Ni of N do
2:   Call Function PostOrderVisit(Ni).
3: end for
4: for each child Ni of N do
5:   compute M′(Ni) = Π_(p_1, p_2, ..., p_n) M(Ni) by Algorithm 4
6: end for
7: M(N) = ∪_i M′(Ni)
8: for each vertex v in N.L but not in ∪_i Ni.L do
9:   F(v) = Π_(p_1, ..., p_n) v
10:  if there exists some aggregate tuple t, where t.D = F(v), then
11:    Register the vertex ID of v in vertex list t.L
12:  else
13:    Generate a new aggregate tuple t, where t.D = F(v), and insert the vertex ID of v into t.L
14:    Insert t into M(N)
15:  end if
16: end for

Obviously, for each node N in T-index, we can access all entity vertices in N.L and their attribute properties to build M(N). However, some computation and I/O can be shared when computing M(N) for different nodes. For example, if M(N3) and M(N5) have been materialized, M(N2) can be computed by merging M(N3) with M(N5) rather than accessing the original data.

THEOREM 4.2. Given a non-leaf node N in T-index that has n child nodes Ni (i = 1, ..., n), the following statement holds: for any transaction T, if the path reaching node Ni is a prefix of T, the path reaching node N is also a prefix of T.

PROOF. This follows from (3) of Definition 4.3 (about T-index).

Given a node N with n child nodes Ni (i = 1, ..., n), according to Theorem 4.2, we know that N.L ⊇ ∪_i Ni.L.
Consequently, its aggregate set M(N) can be computed from the aggregate sets associated with its child nodes. Assume that the properties along the path reaching node N are (p_1, ..., p_n). Initially, M(N) = ∅. For each child node Ni of N, we compute M′(Ni) = Π_(p_1, p_2, ..., p_n) M(Ni) (Lines 4-6 in Algorithm 2). Then, we compute M(N) = ∪_i M′(Ni) (Line 7). Furthermore, for each vertex v that is in N.L but not in ∪_i Ni.L, we need to access the property values of vertex v on dimensions p_1, ..., p_n in the RDF graph (Lines 8-16). Specifically, we define function F(v) = Π_(p_1, ..., p_n) v, which means projecting the adjacent properties of v over (p_1, ..., p_n) (Line 9). We insert F(v) into M(N). If there exists some aggregate tuple t, where t.D = F(v), we register the vertex ID of v in vertex list t.L (Lines 10-11). Otherwise, we generate a new aggregate tuple t, where t.D = F(v), and insert the vertex ID of v into t.L (Lines 12-14).

THEOREM 4.3. Any entity vertex v in RDF graph G is accessed once in computing the aggregate sets of T-index by Algorithm 1.

PROOF. (Sketch) According to the structure of T-index, each entity vertex v is in only one path of T-index. Assume that v corresponds to node N in T-index. The dimension values of v only need to be accessed when computing M(N). The aggregate sets of the ancestor nodes of N can be computed from their child nodes without accessing the original data.

We illustrate the construction of T-index using an example. First, a scan of D (Figure 8a) derives a list of all dimensions in D and their frequencies, and the dimension list DL is constructed. The root of T-index (N0) is created and labeled as root. Then, we insert all transactions of D into T-index one by one.

1. The scan of the first transaction T(001) leads to the construction of the first branch of the tree, inserting nodes N1, N2 and N3. Initially, we set N1.L = {001}, N2.L = {001} and N3.L = {001}.
2. T(002) shares the common prefix root-N1-N2-N3 with the existing path, and adds one new node N4 as a child of node N3. It also causes the vertex list of each node along the path to be updated.
3. The above process is iterated until all transactions are inserted into T-index.
4. Finally, for each node Ni in T-index, we build the aggregate set M(Ni) by a post-order traversal over T-index. Specifically, we first compute M(N4) by accessing entity vertices 002 and 004 and their dimension values. Then, we compute M(N3) by merging the projection of M(N4) over the dimensions along the path to N3 and accessing entity vertices 001 and 003. The process is iterated until all aggregate sets are computed.

M(N3), M(N4) and M(N5) are given in Figure 8b as examples. Note that Person5 is a citizen of two countries, resulting in two tuples for Person5 in M(N5).

4.2 Star Aggregate Query Algorithm

Given an SA query S = (u, g = {p_1, ..., p_d}, m = {p_d+1, ..., p_n}) (see Definition 3.1), we answer S using T-index by Algorithm 3. Let P = {p_1, ..., p_d, ..., p_n}. Given the set of properties P, we find their match (N1, ..., Nn), where each Ni is a node in T-index, all nodes Ni (i = 1, ..., n) are on the same path from the root, and the property associated with Ni equals p_i (Line 1 in Algorithm 3). It is possible to have multiple matches for a given P. For each match (N1, ..., Nn), we use N to denote the node farthest from the root (Line 3 in Algorithm 3). Note that this is not necessarily Nn, since node identifiers are arbitrarily assigned to help with presentation. The aggregate set associated with node N is denoted as M(N). We compute M′(N) by projecting M(N) over the group-by dimensions {p_1, ..., p_d} (i.e., M′(N) = Π_(p_1, p_2, ..., p_d) M(N)) using Algorithm 4 (Lines 4-5 in Algorithm 3). Initially, M′(N) is set to ∅ (Line 1 in Algorithm 4). As mentioned earlier, each aggregate tuple t ∈ M(N) has two parts: t.D and t.L.
The former denotes the values over the set D of dimensions, and the latter denotes the vertex list. For each aggregate tuple t ∈ M(N), we compute F(t) = π_(p_1,...,p_d) t.d (Line 3 in Algorithm 4). If there exists an aggregate tuple t′ ∈ M′(N) where t′.d = F(t), we update t′.l = t′.l ∪ t.l (Lines 4-5 in Algorithm 4). Otherwise, we insert a new aggregate tuple t′ into M′(N), where t′.d = F(t) and t′.l = t.l (Lines 6-7 in Algorithm 4). The M′(N) from all the matches are merged to form the final result of SA query S, i.e., R(S) (Lines 6-7 in Algorithm 3).

For example, given a SA query Q_1 (u_1, {, }, {}) in Figure 7, there are two matches (N_1, N_3, N_4) and (N_1, N_6, N_7) in T-index, which can be found by the dimension list and node links. In

the first match, N_4 is the farthest node from the root. Since M(N_4) is an aggregate set over dimensions (,,, ), we compute a temporary aggregate set M′(N_4) on dimensions (, ) by projecting M(N_4) over these dimensions. Similarly, we project M(N_7) over dimensions (,) to get M′(N_7). Finally, we obtain R(S_1) by merging M′(N_4) and M′(N_7). Figure 9 illustrates the process.

Algorithm 3 SA Query Algorithm
Require: Input: A Trie-Index and a SA query S (u, g = {p_1, ..., p_d}, m = {p_{d+1}, ..., p_n})
Output: An aggregate result set M for SA query S.
1: Locate all matches of properties (p_1, ..., p_d, ..., p_n) in T-index
2: for each match m_i do
3:   Let N_i denote the node in match m_i that is farthest from the root
4:   Let M(N_i) denote the aggregate set associated with node N_i
5:   M′(N_i) = π_(p_1,p_2,...,p_d) M(N_i) by Algorithm 4
6: end for
7: M = ⋃_i M′(N_i)
8: Return M

Algorithm 4 Projection Algorithm
Require: Input: M(N) and dimensions (p_1, p_2, ..., p_d)
Output: M′(N) = π_(p_1,p_2,...,p_d) M(N)
1: M′(N) = ∅
2: for each aggregate tuple t ∈ M(N) do
3:   F(t) = π_(p_1,...,p_d) t.d
4:   if there exists an aggregate tuple t′ in M′(N), where t′.d = F(t) then
5:     t′.l = t′.l ∪ t.l
6:   else
7:     insert t′ into M′(N), where t′.d = F(t) and t′.l = t.l
8:   end if
9: end for
10: Return M′(N)

Figure 9: Answering SA Query
  M(N_4):  (45,000, Female, Canadian)  L = {004};  (100,000, Male, American)  L = {002}
  M′(N_4): Canadian  L = {004};  American  L = {002}
  M(N_7):  (45,000, Canadian)  L = {006}
  M′(N_7): Canadian  L = {006}
  Result:  Canadian  L = {004,006}  SUM() = $90,000;  American  L = {002}  SUM() = $100,000

4.3 Maintenance of T-index

Updates of the RDF data require efficient maintenance of T-index. Updates can be modeled as a sequence of triple deletions and triple insertions. As mentioned earlier, all dimensions are ordered by descending frequency in dimension list DL to improve SA query evaluation performance (also confirmed by experimental results in Section 7).
However, RDF data updates may change the order of dimensions in DL, which requires special care during index maintenance. Therefore, we consider updates of T-index in two cases, based on whether or not the order of dimensions in DL changes.

Dimension List DL's order does not change

Consider the insertion of a new triple (s, p, o). If p is a link property, we do not need to update the T-index. Thus, we only consider the case when p is an attribute property, i.e., a dimension. If s does not occur in the existing RDF data, we introduce a new transaction into D. Then, we insert the new transaction into one path of T-index following Definition 4.3. Accordingly, we need to update the materialized aggregate sets M(N_i) along the path. The detailed steps are the same as Lines 4-9 in Algorithm 1.

If s is already in the existing RDF graph, assume that s's existing dimensions are {p_1, ..., p_n} and Freq(p_i) > Freq(p) > Freq(p_{i+1}) in dimension list DL. Again, in the case of equality, the order is chosen arbitrarily. This means that the newly inserted dimension p should be placed between p_i and p_{i+1}. We locate two nodes N_i and N_n, where the path reaching node N_i (and N_n) has dimensions (p_1, ..., p_i) (and (p_1, ..., p_i, ..., p_n)¹). We remove subject s from all materialized sets along the path between nodes N_{i+1} and N_n, where N_{i+1} is a child node of N_i. Then, we insert dimensions (p, p_{i+1}, ..., p_n) into T-index from node N_i, and update the materialized sets along the path.

Example 3. Insert a triple (Person6,, Male ) into RDF triple table T shown in Figure 1.
Although inserting the triple changes the frequency of dimension, it does not lead to changing the order in DL. Since Person6's dimensions are (,, ), we delete Person6 (i.e., 006) from path N_0 N_1 N_6 N_7 in the original T-index. Then, we insert Person6's new dimensions (,, ) into path N_0 N_1 N_2 N_3 N_4 where and obtained from path N_0 N_1 N_6 N_7. Path N_6 N_7 is deleted, since the updated aggregate sets in N_6 and N_7 are empty. Figure 10 shows the updated T-index after inserting the triple.

Figure 10: Example 3, insert triple (Person6,, Male )
  M(N_3): ($100,000, Male)  L = {002};  ($50,000, Male)  L = {001};  ($40,000, Male)  L = {003};  ($45,000, Female)  L = {004};  ($45,000, Male)  L = {006}
  M(N_4): ($45,000, Female, Canadian)  L = {004};  ($100,000, Male, American)  L = {002};  ($45,000, Male, Canadian)  L = {006}

Suppose now that we need to delete a triple (s, p, o), where p is an attribute property as discussed above. Assume that s's existing dimensions are {p_1, ..., p_n} and p = p_i, i.e., p and p_i are the same dimension. We locate two nodes N_i and N_n, where the path reaching node N_i (and N_n) has dimensions (p_1, ..., p_i) (and (p_1, ..., p_i, ..., p_n)). We remove subject s from all materialized sets along the path between nodes N_{i+1} and N_n. Then, we insert dimensions (p_{i+1}, ..., p_n) into T-index from node N_i, and update the materialized sets along the path. Again, T-index itself does not need to be modified.

¹ Although the dimension list is a set, when the order is important, we specify it as a list enclosed in ( ).
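The bookkeeping for the case in which DL's order is unchanged can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions, not the authors' implementation: it tracks only the vertex list N.L of each node and omits the materialized aggregate sets, and it removes a subject from its whole old path and re-inserts it, rather than touching only the nodes below N_i as described in the text. The class and method names, and the dimension names d1 to d4, are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    dim: str                                      # dimension labelling this node
    children: dict = field(default_factory=dict)  # dimension -> child Node
    L: set = field(default_factory=set)           # vertex list N.L


class TIndex:
    """Simplified T-index: a trie over dimensions ordered as in DL."""

    def __init__(self, dimension_list):
        # Position of each dimension in DL (assumed frequency-descending).
        self.rank = {d: i for i, d in enumerate(dimension_list)}
        self.root = Node("root")

    def _path(self, dims):
        # A transaction is inserted with its dimensions sorted by DL order.
        return sorted(dims, key=lambda d: self.rank[d])

    def insert(self, subject, dims):
        node = self.root
        for d in self._path(dims):
            node = node.children.setdefault(d, Node(d))
            node.L.add(subject)

    def remove(self, subject, dims):
        node = self.root
        for d in self._path(dims):
            child = node.children.get(d)
            if child is None:
                return
            child.L.discard(subject)
            if not child.L:
                # Every descendant list is a subset of child.L, so the
                # whole subtree is empty and can be pruned.
                del node.children[d]
                return
            node = child

    def add_dimension(self, subject, old_dims, p):
        # Triple insertion (s, p, o) for a subject already in the index.
        self.remove(subject, old_dims)
        self.insert(subject, set(old_dims) | {p})

    def drop_dimension(self, subject, old_dims, p):
        # Triple deletion (s, p, o) where p is one of the subject's dimensions.
        self.remove(subject, old_dims)
        self.insert(subject, set(old_dims) - {p})


# Hypothetical data loosely modelled on Example 3.
idx = TIndex(["d1", "d2", "d3", "d4"])
idx.insert("002", {"d1", "d2", "d3", "d4"})
idx.insert("004", {"d1", "d2", "d3", "d4"})
idx.insert("001", {"d1", "d2", "d3"})
idx.insert("006", {"d1", "d3", "d4"})
idx.add_dimension("006", {"d1", "d3", "d4"}, "d2")
```

After the last call, subject 006 lies on the path d1, d2, d3, d4, and the branch that held only 006 has been pruned, mirroring the deletion of path N_6 N_7 in Example 3.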

Figure 11: Example 4. (a) Phase 1: delete triple (Person2,, Male ); (b) Phase 2: swap.

Dimension List DL's order changes

Some triple deletions and insertions that affect dimension frequency will change the order of dimensions in DL. Assume that two dimensions p_i and p_j need to be swapped in DL due to inserting one triple (s, p, o) or deleting one triple (s, p, o). Obviously, j = i + 1, i.e., p_i and p_j are adjacent to each other in DL. Updates are handled in two phases. First, we ignore the order change in DL and handle the updates using the method described above for the case in which DL's order does not change. Second, we swap p_i and p_j in DL, and change the structure of T-index and the relevant materialized sets. We focus on the second phase.

Only three categories of paths can be affected by swapping p_i and p_j: (1) the path has both dimensions p_i and p_j, i.e., the path has dimensions (p_1, ..., p_i, p_j, ..., p_m); (2) the path shares a prefix (p_1, ..., p_i) with a path in the first category; and (3) the path shares a prefix (p_1, ..., p_{i−1}) with a path in the first category and p_{i−1}'s child is p_j. We discuss below how to update the three categories of affected paths.

We first locate all paths in the first category, i.e., find all paths that have both dimensions p_i and p_j. Consider one such path H with dimensions (p_1, ..., p_i, p_j, ..., p_n).
Nodes N_i and N_j have dimensions p_i and p_j, respectively. We rename N_i as N′_i and N_j as N′_j. Moreover, we set N′_i's corresponding dimension to be p_j and N′_j's corresponding dimension to be p_i. We compute the aggregate set M(N′_i) by projecting M(N_j) over dimensions (p_1, ..., p_{i−1}, p_j), i.e., M(N′_i) = π_(p_1,...,p_{i−1},p_j) M(N_j), and set M(N′_j) = M(N_j), since they have the same group-by dimensions.

If node N_i has a branch (i.e., N_i has other child nodes in addition to N_j) in the original T-index, which is exactly the case belonging to the second category, we introduce a node N″_i as a child of N_{i−1}, where N_{i−1} is the parent of N_i in the original T-index, and set the dimension in N″_i as p_i. We move all child nodes of N_i except for N_j to be children of N″_i. We compute the aggregate set M(N″_i) = M(N_i) \ π_(p_1,...,p_i) M(N_j). Specifically, for each aggregate tuple t in M(N_i), we check whether there exists an aggregate tuple t′ ∈ π_(p_1,...,p_i) M(N_j), where t′.d = t.d. If there exists such an aggregate tuple, we generate a new aggregate tuple t″, where t″.d = t.d and t″.l = t.l \ t′.l, and insert t″ into M(N″_i). Otherwise, we insert t into M(N″_i) directly.

If node N_{i−1} has a child N_m whose corresponding dimension is p_j, where N_{i−1} is the parent of N_i in the original T-index, which is a case belonging to the third category, we merge nodes N′_i and N_m. Specifically, we remove N_m and move N_m's child nodes to be N′_i's child nodes. Then, we compute M(N′_i) = M(N′_i) ∪ M(N_m). Specifically, for each aggregate tuple t in M(N_m), we check whether there exists an aggregate tuple t′ in M(N′_i), where t′.d = t.d. If so, we set t′.l = t′.l ∪ t.l. Otherwise, we insert t into M(N′_i) directly.

Example 4. Delete a triple (Person2,, Male ) from the RDF triple table T shown in Figure 1. This changes the order of and in dimension list DL, since the frequency of is changed to 4.
We first assume that the order does not change, and we delete the triple using the method described above for the case in which DL's order does not change. Figure 11a shows the updated T-index after the first phase. Now, we swap and. Path H_1 = N_0 N_1 N_2 N_3 N_4 is a path in the first category, since N_2 and N_3's corresponding dimensions are and. Path H_2 = N_0 N_1 N_2 N_5 is a path in the second category, since it shares prefix (N_0 N_1 N_2) with H_1. Path H_3 = N_0 N_1 N_6 N_7 is a path in the third category, since it shares prefix N_0 N_1 with H_1 and N_1's child node's dimension is. In the first step of phase 2, we first find the path N_0 N_1 N_2 N_3, where the dimensions in N_2 and N_3 are and, respectively. Then, we rename N_3 as N′_3 with dimension, and rename N_2 as N′_2 with dimension. The aggregate sets are M(N′_3) = M(N_3) and M(N′_2) = π_(,) M(N_3). Since transaction 005 does not have dimension, we introduce a new node N″_2, and M(N″_2) = M(N_2) \ π_(,) M(N_3). Finally, we merge node N_6 with node N′_2, since they share the same prefixes. Figure 11b shows the final updated T-index.

5. GENERAL AGGREGATE QUERIES

As noted in Section 3, we decompose a GA query Q into several SA queries {S_i}, i = 1, ..., n, where each star center u_i is an entity vertex in Q, and a link structure J (Definition 5.1). The result of each S_i, R(S_i), is computed using the approach discussed in Section 4. We then join these R(S_i) to compute the result of Q, i.e., R(Q) = ⋈_i R(S_i). In this section, we focus on the last step of computing R(Q).

DEFINITION 5.1. Let GA query Q consist of n SA queries. The link structure J of Q is a subgraph induced by all star centers u_i, i = 1, ..., n. Specifically, J is denoted as J(V = {u_i}, E = {e_j}, Σ = {P(e_j)}), where vertex u_i (1 ≤ i ≤ n) is a star center, e_j (1 ≤ j ≤ m) is an edge whose endpoints are both star centers, and P(e_j) is the label (link property) of the edge e_j.

Note that J is a connected subgraph, since all entity vertices (in Q) are connected together by link properties. For each R(S_i), we

can find a vertex list T_i that includes all vertices in R(S_i). Specifically, we get T_i = ⋃_{t ∈ R(S_i)} t.l, where t is an aggregate tuple in R(S_i). This means that all vertices in T_i are candidate matching vertices of u_i. To compute R(Q), we need to find all subgraph matches of J over the RDF graph, where a subgraph match is defined as follows.

DEFINITION 5.2. Given a link structure J(V = {u_i}, E = {e_j}, Σ = {P(e_j)}) in Q and a subgraph G′(V′ = {v_i}, E′ = {e′_j}, Σ′ = {P(e′_j)}) in RDF graph G, where v_i is an entity vertex in G, e′_j is an edge whose endpoints are both in V′, and P(e′_j) is the label (link property) of the edge e′_j, G′ is called a subgraph match of J in RDF graph G if and only if:
1. v_i ∈ T_i
2. ∀ e_j(u_{i1}, u_{i2}) ∈ E ⇒ ∃ e′_j(v_{i1}, v_{i2}) ∈ E′ and P(e_j) = P(e′_j)

Algorithm 5 General Aggregate (GA) Query Algorithm
Require: Input: A GA query Q
Output: Query Result R(Q)
1: Each entity vertex u_i (in Q), i = 1, ..., n, together with its adjacent attribute properties forms a SA query S_i
2: for each SA query S_i do
3:   Call Algorithm 3 to find R(S_i), i = 1, ..., n.
4:   T_i = ⋃_{t ∈ R(S_i)} t.l
5: end for
6: All entity vertices u_i together with link properties between them form the link structure J
7: Find all subgraph matches of J over RDF graph
8: U = {g_1}, where g_1 includes all subgraph matches
9: for each entity vertex u_i in J, i = 1, ..., n do
10:   set U′ = ∅
11:   for each group g in U do
12:     for each aggregate tuple t ∈ R(S_i) do
13:       Select a group g′ = {M ∈ g | M[i] ∈ t.l}, where M[i] refers to the i-th vertex in M
14:       Insert group g′ into U′
15:     end for
16:   end for
17:   Set U = U′
18: end for
Assume that the measure dimension is associated with u_i
19: for each group g in U do
20:   Find all matching vertices to u_i for all matches in g
21:   Access the measure values of these matching vertices
22:   Compute the aggregate value in measure dimension in this group, and insert it into R(Q)
23: end for
24: Report final result R(Q)

Consider query Q in Figure 7.
The results of S_1 and S_2 and the T_i lists are shown in Figure 12(a), while the link structure J is shown in Figure 12(b).

Figure 12: Query Decomposition
(a) SA Query & Answers
  R(S_1): American  L = {002};  Canadian  L = {004,006}
  R(S_2): USA  L = {011};  China  L = {010,012}
  T_1 = {002, 004, 006}, T_2 = {010, 011, 012}
(b) Link Structure & Matches
  Matches (u_1, u_2): ({002}, {010}), ({002}, {011}), ({004}, {010}), ({006}, {010}), ({006}, {012})

Any subgraph isomorphism algorithm, such as VF2 [4] or ULLMAN [15], can be utilized to find all subgraph matches (as defined in Definition 5.2) of J over the RDF graph. In this work, we utilize our previous system gstore [20] for this purpose, since it is optimized for subgraph isomorphism matches over the RDF graph. For example, we find five subgraph matches of J over RDF graph G, where Figure 12(b) shows the flattened representation of these matches.

Then, we need to partition these subgraph matches into one or more groups based on subgraph matches that share specified values in group-by dimensions, and create a new solution set R(Q) that contains one tuple per aggregated group. Specifically, we try to get U = {g_j}, j = 1, ..., m, where all subgraph matches in group g_j share the same values over group-by dimensions. A straightforward method works as follows: for each match M, we access the corresponding entity vertices and their group-by dimension values in the RDF graph. Then, we partition these matches into different groups based on subgraph matches that share specified values in group-by dimensions. This method suffers from a large number of random I/Os. Furthermore, partitioning subgraph matches is an expensive task if there are a large number of subgraph matches of J over the RDF graph. In order to improve the performance, we would like to utilize star aggregate query results R(S_i) to partition subgraph matches, which helps reduce I/O accesses.
Furthermore, scanning aggregate sets in T-index (in the SA query algorithm) requires sequential access, which is much faster. More importantly, R(S_i) has partitioned subgraph matches based on group-by dimensions associated with each vertex u_i in query Q. Therefore, we propose to utilize R(S_i) to find the final partitions.

Initially, we assume that all subgraph matches are in the same group g_1, and set U = {g_1} (Lines 7-8 in Algorithm 5). Then, we perform a multi-level partitioning over these subgraph matches. At the first level, we consider group-by dimensions in R(S_1). For each group g ∈ U, we partition matches in g into some new groups g′_i, i = 1, ..., m, where each new group g′_i has matches that share the same values over group-by dimensions in R(S_1). Obviously, g = ⋃_i g′_i. Specifically, in order to partition matches in group g, we sequentially scan R(S_1). For each aggregate tuple t ∈ R(S_1), we find a new group g′_i of subgraph matches M, such that M[i] ∈ t.l and M[i] refers to the i-th vertex in subgraph match M (Lines 11-12). We insert these new groups g′_i into U′ (Line 13). We repeat the above process (Lines 11-16) for all groups g in U. Then, U′ is the first-level partition. Obviously, ⋃_{g ∈ U} {all matches in group g} = ⋃_{g′ ∈ U′} {all matches in group g′}. Iteratively, we consider other R(S_i) for other level partitions (Lines 9-18).

Finally, for each aggregate group, we compute the aggregate value in the aggregated dimension, and insert it into the final result R(Q). Assume that the measure dimension is associated with vertex u_i in Q. For each group g, we find a list of distinct vertices matching u_i. We access the measure dimension values of these matching vertices and compute the aggregate value for this group.

Let us recall the decomposition of Q given in Figure 7. We discuss how to partition all matches of J into different groups based on group-by dimensions to find the final result R(Q).
Initially, all matches are in the same group g, i.e., g = {(002, 010), (002, 011), (004, 007), (006, 008), (006, 009)} and U_0 = {g}, where (002, 010) is a flattened representation of a match. Based on R(S_1), we can get the first-level partition U_1. Specifically, we partition matches of g into fine-grained groups. According to the first aggregate tuple t_1 in R(S_1) (Figure 12(a)), we create a new group g_1 = {(002, 010), (002, 011)}, since 002 ∈ t_1.L. For the same reason, we create another group g_2 = {(004, 007), (006, 008), (006, 009)}. Then, the first-level partition is U_1 = {g_1, g_2}, shown in Figure 13(b). Then, based on R(S_2), we can perform the second-level partition U_2. Specifically, we partition each group g_i (i = 1, 2) into more fine-grained groups. Figure 13(c) shows the second-level par-
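The multi-level partitioning of Lines 8-18 in Algorithm 5 can be sketched in a few lines of Python. This is an illustrative sketch under assumptions rather than the authors' code: the function name `partition_matches` and the representation of each R(S_i) as a list of (group-by values, vertex list) pairs are hypothetical, each match is a tuple whose i-th entry is the vertex matching u_i, and the sample data are the R(S_1), R(S_2) and match lists shown in Figure 12.

```python
def partition_matches(matches, sa_results):
    """Refine one initial group of matches level by level.

    matches    : list of tuples, one vertex per star center u_1..u_n
    sa_results : [R(S_1), ..., R(S_n)], each a list of
                 (group_by_values, vertex_list) aggregate tuples
    Returns a dict mapping the combined group-by values to a group of matches.
    """
    groups = {(): list(matches)}           # U = {g_1}: all matches together
    for i, r_si in enumerate(sa_results):  # one partition level per u_i
        refined = {}                       # U' = empty set
        for key, g in groups.items():
            for values, vertex_list in r_si:
                # g' = {M in g | M[i] in t.l}
                selected = [m for m in g if m[i] in vertex_list]
                if selected:               # empty groups are not kept
                    refined[key + values] = selected
        groups = refined                   # U = U'
    return groups


# Data as shown in Figure 12.
matches = [("002", "010"), ("002", "011"), ("004", "010"),
           ("006", "010"), ("006", "012")]
r_s1 = [(("American",), {"002"}), (("Canadian",), {"004", "006"})]
r_s2 = [(("USA",), {"011"}), (("China",), {"010", "012"})]

groups = partition_matches(matches, [r_s1, r_s2])
for key, group in groups.items():
    print(key, group)
```

Only the vertex lists in R(S_i) are consulted, so no group-by value has to be fetched from the RDF graph while partitioning. The aggregate value of each final group would then be computed from the measure values of the distinct vertices matching the star center that carries the measure dimension.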
