Improved Processing of Path Query on RDF Data Using Suffix Array


Journal of Convergence Information Technology, Volume 4, Number 3, September 2009

Sung Wan Kim* (corresponding author)
Division of Computer, Sahmyook University, Seoul, Korea
swkim@syu.ac.kr
doi: /jcit.vol4.issue3.6

Abstract

RDF is the recommended standard for attaching additional semantic information to resources on the Semantic Web. Matono et al. proposed an indexing and query processing scheme for path-based RDF queries using a suffix array. In this paper, we point out some limitations of that approach. We propose an improved indexing and query processing scheme that reduces the binary search space and the overhead caused by repeated direct pattern matching. Finally, experimental performance evaluations demonstrate that our approach outperforms the previous one.

Keywords: RDF, Indexing, Query Processing, Suffix Array.

1. Introduction

In the Semantic Web, we can associate resources on the Web with metadata describing additional semantic information. RDF was recommended as a standard format for associating these metadata [1]. RDF data is a set of triples of the form <subject, property, object>. The subject and the property are each identified by a resource URI; the object is either a resource URI or a literal value. RDF data can be represented as a directed graph in which subjects and objects are nodes and properties are arcs (Figure 1). Ovals indicate resources and rectangles indicate literals. An arc describes the relationship between a subject and an object.

Several indexing and query processing approaches have been proposed for RDF data. Matono et al. [4] introduced an indexing scheme based on a suffix array structure to process path-based RDF queries. Using experimental evaluations, they showed a performance gain for simple path-based RDF queries. The same approach was described in [5] with the same experimental results.
Matono's approach is significant since it was the first to apply a suffix array to the processing of path-based RDF queries.

Figure 1. An example RDF graph

In this paper, we first point out some issues with Matono's scheme. We then describe some approaches to improve query processing performance. Section 2 describes the research background and related work. The improved approaches and the experimental evaluations are explained in Section 3 and Section 4, respectively. We finally conclude in Section 5.

2. Research Background

2.1. Suffix Array

The suffix array is a widely used data structure for retrieving a specific string pattern from large textual data [6]. The suffix at position i of a given text is the sub-sequence from the i-th position to the end of the text. For the sample text 'abracbra', the suffix at position 5 is 'cbra'. A suffix array is a list of all suffixes of the input text in lexicographical order. Thus, if a pattern repeats in the given text, the suffixes that begin with it all appear consecutively in the suffix array. In practice, the suffix array stores only the beginning positions of the suffixes.

The following steps are used to construct the suffix array for the sample text 'abracbra'. We first assign index points to the given text, as shown in Figure 2. Here we assign an index point to each character.
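The suffix array idea described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the array is built by naively sorting the suffixes, which is adequate for a short sample text but not for large inputs, and the function names `build_suffix_array` and `find_occurrences` are hypothetical helpers introduced here, not taken from the paper.

```python
def build_suffix_array(text):
    """Return the 1-based start positions of all suffixes of text,
    ordered lexicographically by the suffix they denote."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def find_occurrences(text, sa, pattern):
    """Find all start positions of pattern using two binary searches
    over the suffix array: one for each end of the matching range."""
    m = len(pattern)

    def prefix(k):
        # First m characters of the suffix stored at SA position k.
        return text[sa[k] - 1:sa[k] - 1 + m]

    lo, hi = 0, len(sa)
    while lo < hi:  # leftmost suffix whose prefix is >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    hi = len(sa)
    while lo < hi:  # leftmost suffix whose prefix is > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


sample = "abracbra"
sa = build_suffix_array(sample)
print(sa)                                  # [8, 1, 4, 6, 2, 5, 7, 3]
print(find_occurrences(sample, sa, "ra"))  # [3, 7]
```

Because matching suffixes are contiguous in the sorted order, locating the two ends of the range is enough to report every occurrence.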


Text                a  b  r  a  c  b  r  a
Index points (idx)  1  2  3  4  5  6  7  8

Figure 2. Assigning index points

We then extract all suffixes from the given text. The left-hand side of Figure 3 shows the extracted suffixes with their index points. Next, we order the suffixes lexicographically, as shown in the right-hand side of Figure 3. Finally, we enter the sorted index points in the suffix array.

Suffixes   idx    Suffixes (sorted)   idx   lcp
abracbra   1      a                   8     0
bracbra    2      abracbra            1     1
racbra     3      acbra               4     1
acbra      4      bra                 6     0
cbra       5      bracbra             2     3
bra        6      cbra                5     0
ra         7      ra                  7     0
a          8      racbra              3     2

Figure 3. Extracted suffixes and sorted suffixes

Figure 4 shows the suffix array SA of the given text. We can retrieve a specific string pattern by binary search on the suffix array. For example, we can find the string pattern 'ra' in the given text by performing the binary search twice.

SA = [8 1 4 6 2 5 7 3]

Figure 4. Constructed suffix array (SA)

The LCP (longest common prefix) array is an auxiliary data structure. It stores, for each suffix in the suffix array, the length of the longest common prefix shared with the preceding suffix. For example, the LCP value of the suffix 'bracbra' is 3, since its preceding suffix is 'bra'.

2.2. Indexing RDF Data Using Suffix Array

Several RDF query languages have been proposed. The typical RDF query retrieves resources that have a specific relationship R and are reachable from a given resource. The relationship R therefore has to be described in the query, and it can be represented with path expressions. For example, the relationship between resources r_0 and r_n is defined by the path sequence r_0 p_1 r_1 p_2 r_2 ... p_n r_n (n > 0), where (r_n p_{n+1} r_{n+1}) indicates a single triple.

Matono et al. introduced an indexing scheme using a suffix array to efficiently process path-based queries over RDF data [4]. This scheme treats RDF data as a DAG and extracts all paths, in the form of an alternation of node labels and arc labels, from root nodes (nodes with indegree zero) to terminal nodes (nodes with outdegree zero). For example, we can extract the path pattern 'r1.p.r5.n.jp' from Figure 1. Since suffixes are extracted from different paths, we assign an integer pair (pid, idx) as the index point of each suffix. In this paper, we call this pair a suffix label. The pid indicates the path identifier and idx is the index point within the path. Figure 5 shows the assignment table of suffix labels for the suffixes extracted from the four paths of the RDF graph in Figure 1.

pid \ idx   1    2    3    4    5    6    7    8    9
1           r1   p    r2   p    r3   p    r4   n    kr
2           r1   p    r5   p    r4   n    kr
3           r1   p    r5   n    jp
4           r1   p    r5   p    r6   n    cn

Figure 5. Path information table (ptab)

Figure 6. Suffix array index construction

The steps to generate the suffix array are shown in Figure 6. We sort all suffixes in lexicographical order and eliminate duplicate suffixes. We finally obtain the suffix array [ (4,7) (3,5) (1,9) (4,6) (3,4) (1,8) (1,2) (1,4) (1,6) (3,2) (2,2) (4,2) (4,4) (1,1) (3,1) (2,1) (4,1) (1,3) (1,5) (1,7) (3,3) (2,3) (4,3) (4,5) ]. Query processing is handled using this suffix array and the path information table (ptab) shown in Figure 5.

In [4], only a simple path query type was considered. The following query is an example in the format of RDQL, one of the RDF query languages.

Ex. 1) select ?x where (r1 p r5) (r5 p ?x)

This query retrieves all resources reachable from a given path pattern. We call this form of query a forward simple path query. The condition in the where clause of the above query can be represented as the path pattern 'r1.p.r5.p'. The following steps are used in [4] to retrieve the suffix labels of all suffixes that begin with the path 'r1.p.r5.p'.

1) Find a position p in the suffix array SA whose suffix first matches the given path pattern, by performing binary searches over the suffix array with the help of ptab (here p is 16, if array indices begin at 1).
2) Perform pattern matching repeatedly on the suffixes adjacent to position p, to its left and right in SA, to find further matching positions (position 17 is additionally acquired here).
3) Extract the contents of SA at those positions, namely the suffix labels. For this example, (2, 1) and (4, 1) are extracted from SA[16] and SA[17]. Then add the length of the given query pattern (here 4) to the idx value of each suffix label, and obtain the final answer {r4, r6} from ptab using the modified suffix labels.

The following example explains backward query processing, in which we retrieve all resources that precede the path pattern of a given query. All processing steps are the same as for the forward query, except step 3, in which we decrement the idx value of each suffix label by one.

Ex. 2) select ?x where (?x p r4) (r4 n kr)

For the above backward path query, we find the suffix label (1, 6) and finally obtain {r3} as the return value. However, one result is missing ('r5' on the path with pid 2), since the suffix index was generated after eliminating the duplicate suffixes.

We can observe two features of this query processing. First, the whole suffix array index is the retrieval space for every query. As the size of the RDF data grows, the size of the suffix array also grows, and so do the number of retrievals and the processing time of the binary searches. Second, there is overhead from repeatedly performing time-consuming path pattern matching on the suffixes to the left and right of the first matched position in the suffix array. In the next section, we describe new approaches that handle the backward query type without omitting results and that address these two drawbacks of Matono's approach.

3. Proposed Approach

In this section, we describe an improved index organization and query processing approach to improve the performance of Matono's approach. We assume the RDF data to be a DAG. We define a path as an alternation of node labels and arc labels as follows.

Path       ::= (rsclbl '.' proplbl '.')* (rsclbl | literalval | (rsclbl '.' proplbl))
rsclbl     ::= URI Reference (refer to [1])
proplbl    ::= propname (refer to [1])
literalval ::= Constant Values (refer to [1])
propname   ::= URI Reference (refer to [1])

The length d of a path is the number of components that comprise the path pattern. For the path pattern 'r1.p1.r2.p1', the path length d is 4.

3.1. Index Organization

Figure 7 shows the stages of index organization. We first extract all paths from the RDF data and construct the path information table (ptab). For each extracted path, we enumerate its suffixes and assign each suffix a suffix label that consists of a pair of path id (pid) and index point (idx).
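To make the suffix labels concrete, the sketch below rebuilds them from the four example paths and compares a suffix array with and without duplicate elimination. It is a hedged illustration, not the paper's implementation: the helper names (`PTAB`, `suffix_of`, `build_sa`, `backward_answers`) are hypothetical, the contents of path 3 follow the path pattern 'r1.p.r5.n.jp' quoted in the text, keeping the lowest pid for duplicate suffixes is an assumption, and the lookup is a plain linear scan rather than a binary search.

```python
# Path information table of the running example: pid -> path components.
PTAB = {
    1: ["r1", "p", "r2", "p", "r3", "p", "r4", "n", "kr"],
    2: ["r1", "p", "r5", "p", "r4", "n", "kr"],
    3: ["r1", "p", "r5", "n", "jp"],
    4: ["r1", "p", "r5", "p", "r6", "n", "cn"],
}


def suffix_of(label):
    """Return the suffix (tuple of components) named by a (pid, idx) label."""
    pid, idx = label
    return tuple(PTAB[pid][idx - 1:])


def build_sa(eliminate_duplicates):
    """Sort all suffix labels by their suffix; optionally keep only one
    label per distinct suffix (assumed here: the one with the lowest pid)."""
    labels = [(pid, idx)
              for pid in sorted(PTAB)
              for idx in range(1, len(PTAB[pid]) + 1)]
    if eliminate_duplicates:
        first_seen = {}
        for label in labels:
            first_seen.setdefault(suffix_of(label), label)
        labels = list(first_seen.values())
    return sorted(labels, key=lambda lab: (suffix_of(lab), lab))


def backward_answers(sa, pattern):
    """Resources directly preceding pattern (linear scan for brevity)."""
    d = len(pattern)
    hits = [lab for lab in sa if suffix_of(lab)[:d] == tuple(pattern)]
    return {PTAB[pid][idx - 2] for pid, idx in hits if idx > 1}


query = ["p", "r4", "n", "kr"]  # pattern of Ex. 2
print(len(build_sa(True)), len(build_sa(False)))          # 24 28
print(sorted(backward_answers(build_sa(True), query)))    # ['r3']
print(sorted(backward_answers(build_sa(False), query)))   # ['r3', 'r5']
```

With duplicates eliminated, only the label (1, 6) survives for the suffix 'p.r4.n.kr', so the predecessor on path 2 cannot be recovered; keeping all 28 labels restores it.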
The distinction from Matono's approach is that we do not eliminate duplicate suffixes, so that backward path queries can be handled without missing results. We next compute the LCP value for each suffix after sorting the extracted suffixes in lexicographical order. The LCP array is used to reduce the overhead of repeated pattern matching during query processing. The next section explains its usage.

Figure 7. Index generation steps

A characteristic of the suffix array is that suffixes with the same path pattern as a prefix appear consecutively. In Matono's approach, the entire suffix array is always included in the search space. However, for a query pattern that begins with 'r1', it is enough to use only the suffixes beginning with 'r1' as the binary search space; suffixes that do not start with 'r1' need not be searched. If we perform pattern matching only against the suffixes that begin with the same component as the first component of the query pattern, we can reduce the number of binary search steps and pattern matches. Finally, we generate the index using the sorted suffix labels and LCPs.

In this paper, we maintain several suffix arrays, instead of a single one, to reduce the binary search space. The proposed index structure is shown in Figure 8. It consists of two parts. Each suffix array SA_k on the right side of the index includes only the suffix labels of the suffixes that have 'k' as their first component. For instance, SA_r1 holds, in sorted order, the suffix labels [(1, 1) (3, 1) (2, 1) (4, 1)], which are assigned only to suffixes beginning with 'r1'. We also maintain several LCP arrays. Each LCP array LCP_k holds the LCP values for the suffix patterns with 'k' as their first component. The keyword group on the left side of the index includes only the first components of the suffixes. Each keyword connects to its SA_k and LCP_k arrays. The keyword group can be implemented by a list or a B-tree.

Figure 8. Proposed index structure

When the path pattern 'r1.p.r5.p' of example query 1 is processed using the proposed index, only the suffix labels [(1, 1) (3, 1) (2, 1) (4, 1)] stored in SA_r1 are used for the binary search. Since the binary search space is reduced, we can reduce both the number of search steps needed to find the first suffix matching the query pattern and the number of pattern matches.

3.2. Query Processing

In this section, we describe our approach to handling forward and backward path queries. Repeatedly performing pattern matching between the path pattern of a given query and the suffix patterns adjacent, on the left and right, to the first matched position in the suffix array is time consuming. Each such match first extracts a suffix label from the suffix array, then obtains the corresponding suffix pattern as a string from the path information table (ptab) using that label, and finally matches it against the path pattern of the query. Rather than use this approach, we rely on comparisons among the entries of the LCP array, which are integer values.

Figure 9 shows the processing algorithm for the forward path query. The function GetSuffixLabel extracts all suffix labels assigned to suffixes that match the path pattern of a given query. Its initial step is to find the position p of a suffix in SA_i that first matches the path pattern of the query. This initial step is performed in the same manner as in the previous approach. However, in the proposed approach the binary search space is limited to a single SA_i, instead of the entire suffix array. For example query 1, we obtain p = 3 on SA_r1 and include the suffix label (2, 1) in a temporary set.

The second step finds the other matching suffix patterns placed to the left and right of the first matched position p in SA_i. Instead of direct pattern matching, we use the LCP_i array. The value of LCP_i[p] is the length of the longest common prefix of the suffix patterns at SA_i[p] and SA_i[p-1]. For instance, the suffix patterns for SA_r1[4] and SA_r1[3] are 'r1.p.r5.p.r6.n.cn' and 'r1.p.r5.p.r4.n.kr', respectively. Thus, the value of LCP_r1[4] is 4.

Consider finding the adjacent suffix patterns to the left of the first matched position p in SA_i. Assume that d is the length of the path pattern of the given query, and let j start at p. If LCP_i[j] is greater than or equal to d, then the suffix pattern at SA_i[j-1] also matches the path pattern of the query, so we include the suffix label at SA_i[j-1] in the result set. By repeating this procedure with decreasing j, we find all suffix labels of the matching suffixes to the left of position p. Similarly, we find the suffix labels of the matching suffix patterns to the right: we repeatedly include the suffix label at SA_i[j+1] as long as the value of LCP_i[j+1] is greater than or equal to d. In this way we replace time-consuming pattern matching with integer comparisons and reduce query processing time. For example query 1, we obtain the additional suffix label (4, 1) at position 4 of SA_r1.

The function ForwardQueryProcessing obtains the final answer from ptab, using the set of suffix labels returned by the function GetSuffixLabel.
We first modify the returned suffix labels by adding d to the idx value of each suffix label, and then include the content of ptab[pid][idx] in the final answer. For example query 1, the function GetSuffixLabel returns the set of suffix labels {(2, 1) (4, 1)} and we obtain {r4, r6} as the final answer.

We omit the processing algorithm for the backward path query, since it is identical to that of the forward path query except for one step. We first obtain a set of suffix labels by executing the function GetSuffixLabel. We then modify the returned suffix labels by decrementing the idx value of each suffix label by 1, and include the content of ptab[pid][idx] in the final answer.

Function ForwardQueryProcessing(usrQueryPattern)
    // Assume d is the length of the query pattern usrQueryPattern
    // Output: the final result set finalSet
    Call GetSuffixLabel(usrQueryPattern)    // obtaining tempSet
    For each suffix label (pid, idx) in tempSet Do
        Add the content of ptab[pid][idx + d] to finalSet
    End For
End Function

Function GetSuffixLabel(usrQueryPattern)
    // Output: a set of suffix labels tempSet matched with the query pattern usrQueryPattern
    Step 1) Find the position p in SA_i that contains the first suffix matched to usrQueryPattern
            Add the suffix label stored in SA_i[p] to the temporary set tempSet
    Step 2) Find additional suffix labels from the components adjacent to position p in SA_i, using the LCP array
        2-1) Perform left side scan
            tp ← p
            While (d <= LCP_i[tp])
                tp ← tp - 1    // move the scan position to the left
                Add the content of SA_i[tp] to tempSet
            End While
        2-2) Perform right side scan
            tp ← p
            While (tp is not the last position of SA_i and d <= LCP_i[tp+1])
                tp ← tp + 1    // move the scan position to the right
                Add the content of SA_i[tp] to tempSet
            End While
End Function

Figure 9. Query Processing Algorithm for Forward Path Query
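The algorithm of Figure 9 can be tried out on the running example with the following Python sketch. It is an illustration under stated assumptions rather than the author's implementation: the keyword group is modelled as a plain dictionary, the index is built by naive sorting, array positions are 0-based, and all function and variable names (`build_index`, `get_suffix_labels`, `forward_query`, `backward_query`) are hypothetical.

```python
from collections import defaultdict

# Path information table of the running example: pid -> path components.
PTAB = {
    1: ["r1", "p", "r2", "p", "r3", "p", "r4", "n", "kr"],
    2: ["r1", "p", "r5", "p", "r4", "n", "kr"],
    3: ["r1", "p", "r5", "n", "jp"],
    4: ["r1", "p", "r5", "p", "r6", "n", "cn"],
}


def suffix_of(label):
    pid, idx = label
    return tuple(PTAB[pid][idx - 1:])


def lcp_len(a, b):
    """Number of leading components shared by two suffixes."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def build_index():
    """Map each first component k to (SA_k, LCP_k); duplicates are kept."""
    groups = defaultdict(list)
    for pid, path in PTAB.items():
        for idx in range(1, len(path) + 1):
            groups[path[idx - 1]].append((pid, idx))
    index = {}
    for k, labels in groups.items():
        sa = sorted(labels, key=lambda lab: (suffix_of(lab), lab))
        lcp = [0] + [lcp_len(suffix_of(sa[j - 1]), suffix_of(sa[j]))
                     for j in range(1, len(sa))]
        index[k] = (sa, lcp)
    return index


def get_suffix_labels(index, pattern):
    """Labels of all suffixes starting with pattern (0-based positions)."""
    pattern = tuple(pattern)
    d = len(pattern)
    sa, lcp = index.get(pattern[0], ([], []))
    lo, hi, p = 0, len(sa) - 1, None
    while lo <= hi:  # Step 1: binary search inside SA_k only
        mid = (lo + hi) // 2
        prefix = suffix_of(sa[mid])[:d]
        if prefix == pattern:
            p = mid
            break
        if prefix < pattern:
            lo = mid + 1
        else:
            hi = mid - 1
    if p is None:
        return []
    found = [sa[p]]
    tp = p
    while tp > 0 and lcp[tp] >= d:  # Step 2-1: left scan via LCP
        tp -= 1
        found.append(sa[tp])
    tp = p
    while tp + 1 < len(sa) and lcp[tp + 1] >= d:  # Step 2-2: right scan
        tp += 1
        found.append(sa[tp])
    return found


def forward_query(index, pattern):
    d = len(pattern)
    return {PTAB[pid][idx + d - 1]
            for pid, idx in get_suffix_labels(index, pattern)
            if idx + d <= len(PTAB[pid])}


def backward_query(index, pattern):
    return {PTAB[pid][idx - 2]
            for pid, idx in get_suffix_labels(index, pattern) if idx > 1}


index = build_index()
print(index["r1"])  # ([(1, 1), (3, 1), (2, 1), (4, 1)], [0, 2, 3, 4])
print(sorted(forward_query(index, ["r1", "p", "r5", "p"])))   # ['r4', 'r6']
print(sorted(backward_query(index, ["p", "r4", "n", "kr"])))  # ['r3', 'r5']
```

After the single binary search, the left and right scans compare only integers from LCP_k against d, and no suffix strings are fetched from ptab during the scans.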

4. Experimental Evaluation

We evaluate performance in this section. We used a modified FOAF ontology-based RDF data set provided by the FOAF project [7]. After transforming the data into a DAG, we generated two data sets, as shown in Table 1. The number of suffixes was reduced by approximately 40% when we eliminated duplicate suffixes.

Table 1. Experimental Data Set

                                                    Data1     Data2
Data size (KB)                                      2,000     10,000
# of extracted paths                                1,314     6,570
# of extracted suffixes                             14,874    74,370
# of extracted suffixes (duplicates eliminated)     9,293     44,539

Tests were performed on a machine with an Intel Core2 Duo 2.20 GHz CPU, 1 GB of memory, and a 300 GB HDD, running Windows XP Professional. We used MS Visual C and MySQL 5.0 for the implementation.

We implemented three approaches to evaluate the performance of path-based RDF query processing. First, we implemented Matono's approach, described in Section 2. For this, we eliminated the duplicate suffixes and generated a table for the path information and a table for the suffix array. The second approach was similar to the first, but the duplicate suffixes were not removed, so that backward path queries are handled with no missing results. The last approach is our proposed approach. Here, too, we did not eliminate the duplicate suffixes, and we applied the proposed index structure and query processing algorithms. For this, we generated an additional table to store keywords and a table to store LCP values. For all three approaches, we performed the experiments after loading the suffix array into main memory.

The query types used for the performance evaluations are shown in Table 2. We measured the average execution times of the forward and backward path queries, not taking database caching into account. The path length in the table denotes the number of components in the path pattern of a given query. The retrieval target indicates the kind of item returned in the final set.
The position is where the given query is matched in the RDF data graph. If the position is root, for example, the given query is matched on paths that begin from root nodes (indegree of zero) in the RDF graph.

Table 2. Query Types for Test

Query   Direction   Path length   Target     Position
Q1      forward     6             resource   upper
Q2      forward     10            resource   root
Q3      forward     4             resource   mid
Q4      forward     5             property   mid
Q5      forward     5             resource   mid
Q6      forward     5             resource   upper
Q7      backward    4             resource   lower

Table 3 shows the query processing times for the query types in Table 2. The number of binary searches in the table is the number of binary search steps executed until the given query is first matched. The number of L/R accesses is, for the two former approaches, the number of pattern matches executed to the left and right of the first matched position and, for the proposed approach, the number of LCP value comparisons. The number of returns is the number of results obtained by query evaluation, including replicated results. The number of results is the number of returns with the duplicated results excluded. We omitted this field from the table for the forward queries, since the number of results is the same in all three approaches.

For all query types, the number of binary searches is reduced remarkably when we apply the proposed indexing approach. The number of accesses to the adjacent components to the left and right of the first matched position in the suffix array differs between the first approach and the other two, depending on the query type. Most of the duplicate suffixes appear among the suffixes extracted from nodes positioned below the middle of the RDF graph. The query types Q3, Q4, Q5, and Q7 are matched at the middle and/or lower parts of the RDF graph.
Thus, the first approach, which removes the duplicate suffixes, needs fewer accesses to adjacent components for pattern matching than the other two. The number of returns after query processing was also smaller in the first approach. Both the second and the proposed approaches, which do not remove the duplicate suffixes, returned the same number of results.

Table 3. Query Processing Results

Conversely, for the query types Q1, Q2, and Q6, which are matched in the upper parts of the RDF graph, the number of accesses to the adjacent components on the left and right hand sides was measured to be equal, as was the number of returns. Thus, we determined that eliminating duplicate suffixes does not influence query processing performance for a query that is matched to paths beginning from nodes positioned at the root or upper parts of the RDF graph. Accordingly, the query processing times for Q1, Q2, and Q6 are similar in the first and second approaches. The query processing time of the proposed approach is 50% faster than that of the other two approaches. This is due to reducing the binary search space using the proposed index structure and replacing path pattern matching with integer comparisons based on LCP values.

For the query types Q3, Q4, and Q5, the first approach is faster than the second one. Hence, we know that eliminating duplicate suffixes directly influences query processing performance. Compared to the proposed approach, however, the first approach showed slightly faster or similar performance. One of the reasons for this is that the number of pattern matches was reduced by 50% by removing the duplicate suffixes in the first approach. Thus, both the number of returns and the time needed to exclude the repeated returns from the final result were reduced. However, for query processing with duplicate suffixes, the performance of the proposed approach was more than double that of the second one.

Finally, for the backward path query Q7, the first approach performs better than the others, for the same reason as for the query types Q3, Q4, and Q5.
The number of results in the first approach is one, whilst both the second and the proposed approaches return three results. We thus know that these two approaches are more accurate than the first. A higher performance gain was obtained with the proposed approach than with the second one.

5. Conclusion

In this paper, we first introduced the characteristics of the previous indexing and query processing scheme, which uses a suffix array to handle path-based RDF queries. We then proposed two schemes to improve query processing performance: an index structure that reduces the binary search space, and a query evaluation approach that reduces the overhead caused by repeated direct pattern matching. Finally, experimental evaluations demonstrated that the proposed approach improves performance over the previous approach for path-based RDF queries.

6. Acknowledgement

Part of this work was done while the author was a visiting researcher in the Information Systems and Database Group at the University of Waikato, New Zealand.

7. References

[1] W3C, RDF Primer,
[2] W3C, SPARQL Query Language for RDF,
[3] P. Haase, et al., "A Comparison of RDF Query Languages", In the Proc. of the Third International Semantic Web Conference, 2004, pp
[4] A. Matono, et al., "An Indexing Scheme for RDF and RDF Schema based on Suffix Arrays", In the Proc. of the First International Workshop on Semantic Web and Databases (SWDB), Sept. 2003, pp
[5] Baolin Liu and Bo Hu, "Path Queries Based RDF Index", In the Proc. of the First International Conference on Semantics, Knowledge, and Grid (SKG), 2006, pp
[6] William B. Frakes and Richard Baeza-Yates, Information Retrieval: Data Structures and Algorithms, Sigma Press,
[7] The Friend of a Friend (FOAF) project,
