Large-Scale Incremental OWL/RDFS Reasoning


Large-Scale Incremental OWL/RDFS Reasoning over Fuzzy RDF Data

Batselem Jagvaral, Lee Wangon, Hyun-Kyu Park, Myungjoong Jeon, Nam-Gee Lee, and Young-Tack Park
School of Computer Science and Engineering, Soongsil University (SSU), Seoul, South Korea

Abstract

Ontological RDF data are extracted from multiple sources on the web through mapping and alignment for various purposes, but extracting and reasoning about ontologies from different sources causes information ambiguity and uncertainty. A reasonable solution to this problem is to annotate extracted ontology data with truth values that indicate the reliability of the information. However, the recent growth in data has made it difficult to ascertain the credibility of numerous ontologies during OWL/RDFS reasoning. In this paper, we present a distributed and incremental reasoning approach for RDF data with uncertainty. We focused on RDFS and OWL pd* semantics and developed methods for incremental OWL reasoning with uncertainty. We also introduced parallel algorithms that address the scalable reasoning problem. To evaluate the efficiency of the proposed system, we conducted OWL/RDFS reasoning over fuzzy LUBM3000 and achieved a performance three times higher than that achieved with the fastest reasoning system.

Keywords: uncertainty reasoning; incremental reasoning; ontology; OWL/RDFS; distributed computing; Spark

I. INTRODUCTION

In the real world, manually extracting voluminous ontological data, such as RDF data, from the web is impractical [1]. Ontology extraction therefore relies on an automated knowledge extractor that determines plausible relations between concepts from diverse knowledge sources. The problem with such extractors is that they are prone to errors when they are used to extract large numbers of ontologies from multiple sources.
They can, for example, return uncertain results on whether a relation holds because of errors in the extractors themselves or errors in the published sources (e.g., an extractor is 20% uncertain about the date of Obama's birth). Sources may also be equally untrustworthy, which raises the problem of mutually exclusive and conflicting claims made by multiple sources. To represent this uncertainty or trustworthiness, which is sometimes referred to as provenance, an extractor assigns a fuzzy value or a numerical certainty to the extracted relations (triples), thereby extending RDF ontology data with fuzzy logic [2]. The extracted fuzzy RDF data are then deployed to Semantic Web applications for a variety of purposes, such as ontological reasoning and querying.

Over the past decade, ontology reasoning with fuzzy logic and uncertainty has been pursued by a very small research community. Several studies have argued for the need to annotate RDF data with fuzzy or truth values given that this approach improves data credibility [2,3], but relatively little research has addressed the scalable incremental reasoning problem. Meanwhile, much successful work has been done on scalable ontology reasoning with OWL and RDFS semantics by using frameworks such as MapReduce and Hive [2]. These frameworks have been proven capable of handling large-scale reasoning [3,4]. Nevertheless, performing fuzzy reasoning over RDF data with OWL semantics poses challenges to a reasoner. In addition, reasoning systems receive new ontology data continuously and need to perform incremental reasoning over each update. These challenges have prompted us to argue that RDF data arrive incrementally and carry uncertainty, which necessitates uncertainty reasoning without re-inferring previously inferred knowledge [5]. Accordingly, in this paper, we developed a framework for scalable incremental reasoning with OWL and RDFS vocabulary over fuzzy RDF data on the basis of the Spark framework.
Our focus was directed particularly toward parallelizing fuzzy RDFS and OWL pd* rules because the pd* vocabulary is less computationally complex than the OWL Full or DL vocabulary while offering a rich set of complete reasoning rules. Our approach was to harness the full power of a currently trending distributed framework, namely Spark [6], to perform efficient incremental uncertainty ontology reasoning over RDF data that scale up to millions of triples.

The rest of the paper is organized as follows. We provide a brief introduction to the reasoning problem in Sections II and III. We explain our algorithms for solving the problem in Section IV. We discuss experimental results in Section V and conclude the paper in Section VI.

II. RELATED WORK

Most of the methods developed for large-scale ontology reasoning are based on either MapReduce or massively parallel computing [3,4,5]. An example is WebPIE [10], the reasoning system of Urbani et al., which deduces a quantitative ontology on the basis of the distributed processing framework of Hadoop MapReduce. WebPIE supports scalable ontology reasoning for RDFS and OWL Horst semantics. The authors compressed input data by using dictionary encoding, which is intended to reduce the data workload. Despite these advantages, WebPIE's performance in iterative processing diminishes because intermediate reasoning results are written to disk. Previously, [3] presented a fuzzy pd* OWL reasoner and proposed MapReduce to process forward inferencing over large-scale data by using fuzzy pd* semantics (i.e., an extension of OWL Horst semantics with fuzzy vagueness). Cichlid is also an OWL pd* reasoning system, designed for the Spark framework [8]; we compare against it in our experiments (Section V). However, these approaches need to re-compute previously inferred data whenever they receive a new set of data. Urbani et al. also proposed an incremental reasoning system [5] that uses MapReduce, but only for RDF Schema reasoning rules.

(IEEE BigComp 2017, p. 269)

III. BACKGROUND

A. Fuzzy RDF and Fuzzy pd* Reasoning

A fuzzy triple is expressed in the form $(s, p, o)[n]$, where $(s, p, o)$ is a triple and $n$, written between brackets, represents its fuzzy degree. We adopted the notion that underlies fuzzy logic methods in evaluating the degree of uncertainty in OWL axioms, as formalized in [2]. For example, the inference rule for owl:SymmetricProperty can support annotations for fuzzy values as follows:

$$(p, \text{type}, \text{SymmetricProperty})[n],\ (v, p, w)[m] \rightarrow (w, p, v)[n \otimes m]$$

where $n$ and $m$ are the fuzzy values of the triples and $\otimes$ represents the minimum combination function for the triangular norms in fuzzy logic, which corresponds to the logical AND operation. The entire set of fuzzy RDF pd* rules, as described in [2], is listed in Table I.

TABLE I.
FUZZY PD* REASONING RULES

Fuzzy IF-THEN rules
f-rdfp1: $(p, \text{type}, \text{FunctionalProperty})[n],\ (u, p, v)[m],\ (u, p, w)[l] \rightarrow (v, \text{sameAs}, w)[n \otimes m \otimes l]$
f-rdfp2: $(p, \text{type}, \text{InverseFunctionalProperty})[n],\ (u, p, w)[m],\ (v, p, w)[l] \rightarrow (u, \text{sameAs}, v)[n \otimes m \otimes l]$
f-rdfp3: $(p, \text{type}, \text{SymmetricProperty})[n],\ (v, p, w)[m] \rightarrow (w, p, v)[n \otimes m]$
f-rdfp4: $(p, \text{type}, \text{TransitiveProperty})[n],\ (u, p, v)[m],\ (v, p, w)[l] \rightarrow (u, p, w)[n \otimes m \otimes l]$
f-rdfp5: $(v, \text{sameAs}, w)[n] \rightarrow (w, \text{sameAs}, v)[n]$
f-rdfp6: $(u, \text{sameAs}, v)[n],\ (v, \text{sameAs}, w)[m] \rightarrow (u, \text{sameAs}, w)[n \otimes m]$
f-rdfp7ab: $(p, \text{inverseOf}, q)[n],\ (v, p, w)[m] \rightarrow (w, q, v)[n \otimes m]$
f-rdfp11: $(u, p, v)[n],\ (u, \text{sameAs}, u')[m],\ (v, \text{sameAs}, v')[l] \rightarrow (u', p, v')[n \otimes m \otimes l]$
f-rdfp12a: $(v, \text{equivalentClass}, w)[n] \rightarrow (v, \text{subClassOf}, w)[n]$
f-rdfp12b: $(v, \text{subClassOf}, w)[n],\ (w, \text{subClassOf}, v)[m] \rightarrow (v, \text{equivalentClass}, w)[n \otimes m]$
f-rdfp13a: $(v, \text{equivalentProperty}, w)[n] \rightarrow (v, \text{subPropertyOf}, w)[n]$
f-rdfp13b: $(v, \text{subPropertyOf}, w)[n],\ (w, \text{subPropertyOf}, v)[m] \rightarrow (v, \text{equivalentProperty}, w)[n \otimes m]$
f-rdfp14a: $(v, \text{hasValue}, w)[n],\ (v, \text{onProperty}, p)[m],\ (u, p, w)[l] \rightarrow (u, \text{type}, v)[n \otimes m \otimes l]$
f-rdfp14b: $(v, \text{hasValue}, w)[n],\ (v, \text{onProperty}, p)[m],\ (u, \text{type}, v)[l] \rightarrow (u, p, w)[n \otimes m \otimes l]$
f-rdfp15: $(v, \text{someValuesFrom}, w)[n],\ (v, \text{onProperty}, p)[m],\ (u, p, x)[l],\ (x, \text{type}, w)[k] \rightarrow (u, \text{type}, v)[n \otimes m \otimes l \otimes k]$
f-rdfp16: $(v, \text{allValuesFrom}, w)[n],\ (v, \text{onProperty}, p)[m],\ (u, \text{type}, v)[l],\ (u, p, x)[k] \rightarrow (x, \text{type}, w)[n \otimes m \otimes l \otimes k]$

B. Resilient Distributed Dataset

Resilient Distributed Datasets (RDDs), developed for the Spark framework, are fault-tolerant, parallel data sets that are distributed across multiple nodes [6]. They support parallel operations such as map, filter, mapPartitions, and join, and they can be combined with broadcast variables. The transformations are higher-order functions that take functions as input parameters. For example, map transforms a given RDD into a new RDD by applying a user-defined function to each RDD element. For more specific information, we refer the reader to [6].

C. Rule Dependency Graph

To leverage distributed computing, we devise a rule dependency strategy extended from [3] and [7]. Our approach differs from them in its iterative loop, which considers only newly inferred data on each iteration. Rules are categorized into four groups, namely schema, instance, sameAs, and type. Schema rules involve RDF Schema semantics [2], and instance rules are f-rdfp13, f-rdfp4, f-rdfp8ab, and f-rdfp3. Type rules are the RDF Schema domain and range rules and the OWL reasoning rules rdfp15, rdfp16, rdfp14a, and rdfp14b. SameAs rules are f-rdfp1, f-rdfp2, f-rdfp5, and f-rdfp6.

Fig. 1. Rule dependency graph. [Node labels: schema rules, SPO, instance rules, f-rdfp5, type rules, sameAs rules.]

Moreover, in pd* semantics, different rules may generate the same triples. For example, both the f-rdfp1 and f-rdfp2 rules can assert a duplicated triple with different fuzzy values. As a result, the confidence in the truth of the triple increases on the basis of multiple sources of evidence. In Fig. 2, we illustrate this process in the form of a trustworthiness scheme, where s denotes input triples and c denotes a conclusion derived from a specific rule. Accordingly, distributed fuzzy reasoning consists of two steps: the first step is to execute rules and derive new triples, and the second step is to eliminate duplicates with the same conclusion by calculating fuzzy values (i.e., duplicates are removed by fuzzy logic s-norms, which correspond to the logical OR operation). Each step is addressed in the following section.

Fig. 2. Applying fuzzy pd* reasoning. [Input triples s1 to s4; rules f-rdfp1 and f-rdfp2; conclusions c1 and c2.]

IV. METHODOLOGY

The purpose of our methods is to reduce the reasoning cost in a distributed setting while incrementally inferring new triples. In general, it is inefficient to re-compute a large set of inferred data whenever a new set of data arrives. To remove this bottleneck, we introduce an incremental reasoning approach that receives new data and performs reasoning over them without re-inferring previous data. Fig. 3 depicts how the Spark workflow progresses for ontological reasoning.
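As a rough, single-machine illustration of this idea (not the Spark implementation described in this paper), the sketch below applies only the rules f-rdfp3 and f-rdfp4 from Table I. It combines premises with the minimum t-norm and merges duplicate conclusions with the maximum s-norm. The names `merge`, `derive`, and `add_triples` are hypothetical. Representing the schema triples as dictionaries that map a property to its fuzzy degree is likewise an assumption made for brevity.

```python
from typing import Dict, Tuple

Triple = Tuple[str, str, str]
Store = Dict[Triple, float]  # triple -> fuzzy degree in (0, 1]


def merge(store: Store, derived: Store) -> Store:
    """Merge derived triples into the store using the max s-norm.

    Returns only the triples that are new or whose degree increased,
    i.e. the delta that still has to be reasoned over.
    """
    delta: Store = {}
    for triple, degree in derived.items():
        if degree > store.get(triple, 0.0):
            store[triple] = degree
            delta[triple] = degree
    return delta


def derive(store: Store, delta: Store,
           symmetric: Dict[str, float],
           transitive: Dict[str, float]) -> Store:
    """Apply f-rdfp3 and f-rdfp4 with at least one premise taken from delta."""
    out: Store = {}

    def emit(triple: Triple, degree: float) -> None:
        out[triple] = max(degree, out.get(triple, 0.0))

    for (s, p, o), m in delta.items():
        if p in symmetric:  # f-rdfp3, min t-norm
            emit((o, p, s), min(symmetric[p], m))
        if p in transitive:  # f-rdfp4, min t-norm
            n = transitive[p]
            for (s2, p2, o2), l in store.items():
                if p2 != p:
                    continue
                if s2 == o:  # delta triple is the left premise
                    emit((s, p, o2), min(n, m, l))
                if o2 == s:  # delta triple is the right premise
                    emit((s2, p, o), min(n, l, m))
    return out


def add_triples(store: Store, new: Store,
                symmetric: Dict[str, float],
                transitive: Dict[str, float]) -> Store:
    """Add new triples and reason only over what changed, until a fixpoint."""
    delta = merge(store, new)
    while delta:
        delta = merge(store, derive(store, delta, symmetric, transitive))
    return store
```

Calling `add_triples` twice, first with an initial batch and then with an update, should leave the store in the same state as a single call with all triples. Each round joins only the changed triples against the store, so earlier conclusions are not derived again from scratch.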

Each reasoning task consists of two steps: the first step is to get the necessary triples from the triple store and execute a rule, and the second step is to compute fuzzy values for the duplicated triples.

Fig. 3. Reasoning flow in Spark.

A. Distributed OWL Reasoning

OWL reasoning rules that require a single join, such as f-rdfp3, are implemented as in Algorithm 1. The difference between our approach and [3,8] is that we broadcast annotated triples to the local machines over the network and use the mapPartitions transformation to perform the single join operations. Consider a simple domain reasoning rule:

$$(p, \text{rdfs:domain}, c)[n],\ (v, p, w)[m] \rightarrow (v, \text{rdf:type}, c)[n \otimes m]$$

TABLE II.
REASONING DOMAIN KNOWLEDGE

Algorithm 1: RDFS Domain Axiom Reasoning
function REASON-DOMAIN(T, B)
    inputs: T, an RDD set of triples
            B, broadcasted domain relations
    for each partition T_part ⊆ T in parallel do (a)
        D ← GET-LOCAL-DATA(B)    /* key-value dictionary */
        for each t ∈ T_part do
            if D.contains(t.pred) then
                ⟨c, n2⟩ ← D.getByKey(t.pred)
                yield ⟨t.subj, rdf:type, c, t.n1 ⊗ n2⟩
a. The parallel loop is implemented by the mapPartitions transformation in Spark.

In Algorithm 2, we demonstrate how we perform the multi-join operation for OWL reasoning rules such as f-rdfp16 and f-rdfp15. The idea behind our approach is that the schema relations A and O are relatively small, so joining these relations first can greatly reduce the computation cost of the overall reasoning. After joining these relations, we select the instance triples associated with A ⋈ O using the broadcast transformation. Then, after filtering the type and instance triples using A ⋈ O, both the instance relation T and the type relation Y are reduced in size, so that joining these relations has low selectivity and causes less of a communication bottleneck on the cluster.

TABLE III.
ALLVALUES REASONING

Algorithm 2: AllValuesFrom Axiom Reasoning (f-rdfp16)
function REASON-ALLVALUES(T, A)
    inputs: T, an RDD set of instance triples
            Q, an RDD set of type relations
            O, an RDD set of onProperty relations
            A, an RDD set of AllValues relations
    J ← Π_{v,w,p,n1⊗n2} (A(v, w, n1) ⋈ O(v, p, n2))
    B ← BROADCAST(J)
    T_s ← GET-ALLVALUES-FROM-SPO(T, B)
    Q_s ← GET-ALLVALUES-FROM-TYPE(Y, B)
    T ← Π_{u,p,v,n4⊗n5} (T_s(u, x, n4) ⋈ Q_s(u, v, p, n5))
b. The GET-ALLVALUES-FROM-SPO and GET-ALLVALUES-FROM-TYPE functions filter the relations T and Y using the broadcasted values B and are based on the same approach as Algorithm 1.
c. The ⋈ symbol denotes the join transformation in Spark.
d. The Π symbol denotes the map transformation in Spark.

B. Incremental Reasoning

We extended the existing approaches [3,7,8] to handle incremental reasoning. To illustrate our approach, we take the following example:

$$P(X, Y),\ Q(X, Z) \rightarrow R(X, Y)$$

where P and Q conclude the relation R. Suppose that, after R has been inferred, $\Delta Q$ is added; it is then unnecessary to re-derive the R relations. By using only the P and $\Delta Q$ relations, we can derive $\Delta R$ and compute $R \cup \Delta R$. Based on this principle, we developed our incremental algorithm for OWL reasoning, shown in Algorithm 3. It aims to derive new inferences from the $\Delta T$ relations without having to re-compute the previous inferences T, and it iterates until there is no new triple to derive.

TABLE IV.
MAIN ALGORITHM FOR OWL REASONING

Algorithm 3: OWL Reasoning algorithm
    inputs: T, a set of instance triples
            Q, a set of type triples
            S, a set of schema triples
    T ← SCHEMA-REASONING(ΔT, S) ∪ T
    Q1 ← Q
    while Q1 is not ∅ do
        T1 ← T
        while T1 is not ∅ do
            T1 ← INSTANCE-REASONING(S, Q1, T1)
            T ← T ∪ T1
        Q2 ← Q1
        while Q2 is not ∅ do
            Q2 ← TYPE-REASONING(ΔQ2, T)
            Q1 ← Q1 ∪ Q2
        SAMEAS-REASONING(ΔT)

C. Handling Fuzzy Values and Network Shuffling

During the reasoning, fuzzy values are handled as shown in the above algorithms, but after the reasoning, different rules may conclude the same triples with different truth values, as described in Fig. 2. To group the same triples, we can use the reduceByKey transformation, but it requires the RDD to be fully shuffled through the network. On the other hand, when RDDs are pre-partitioned, the values on a single machine are computed locally and only the finalized values are sent from the workers to the driver [7]. To apply this principle to our approach, we partitioned the RDD using the preserved hash partitioner and reduced the same triples locally to compute truth values. This requires all RDDs to be partitioned by the same partitioner in advance, which can be accomplished by partitioning the overall input triples before the reasoning process commences. Also, as defined in [7], a join transformation called on two RDDs that are pre-partitioned with the same partitioner and cached on the same machine is computed locally, with no shuffling across the network. In our multiple-join algorithm, we apply this approach to avoid the high communication cost.

V. EXPERIMENT

To evaluate the efficiency of the proposed reasoner, we conducted OWL reasoning over LUBM [9]. The LUBM dataset consists of a university-domain ontology and is widely used to evaluate ontology reasoning systems. To conduct the experiments, we set up a Hadoop multi-node cluster with five worker nodes and one master node. Each compute node is equipped with a 2.4 GHz CPU with 24 cores and 32 GB of main memory. To evaluate fuzzy reasoning, we assigned arbitrary fuzzy values to the LUBM instance triples. We show the throughput of the reasoner in Fig. 4 and the incremental reasoning scalability in Fig. 5. When we scale the LUBM datasets up to 3000 universities, the throughput remains quite stable. To evaluate the system, we ran WebPIE on the same cluster. With annotated triples, our reasoner performs slower than with standard reasoning, but overall it performs much faster than WebPIE.
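Returning to the duplicate handling of Section IV-C, the following plain-Python sketch simulates why pre-partitioning by a hash of the triple lets each partition resolve its duplicates locally. It does not use the Spark API. The function names `partition_by_key` and `merge_locally` and the number of partitions are assumptions made for illustration, and the maximum is used as the s-norm.

```python
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]
Annotated = Tuple[Triple, float]  # (triple, fuzzy degree)


def partition_by_key(pairs: Iterable[Annotated],
                     num_partitions: int) -> List[List[Annotated]]:
    """Assign every annotated triple to a partition by hashing the triple.

    Equal triples always hash to the same partition, so all duplicates of
    a conclusion end up on the same (simulated) worker.
    """
    parts: List[List[Annotated]] = [[] for _ in range(num_partitions)]
    for triple, degree in pairs:
        parts[hash(triple) % num_partitions].append((triple, degree))
    return parts


def merge_locally(partition: Iterable[Annotated]) -> Dict[Triple, float]:
    """Resolve duplicates inside one partition with the max s-norm."""
    best: Dict[Triple, float] = {}
    for triple, degree in partition:
        best[triple] = max(degree, best.get(triple, 0.0))
    return best


# Conclusions produced by different rules; c1 is derived three times.
derived = [
    (("a", "sameAs", "b"), 0.4),
    (("a", "sameAs", "b"), 0.7),
    (("c", "type", "Student"), 0.9),
    (("a", "sameAs", "b"), 0.5),
]
merged_parts = [merge_locally(p) for p in partition_by_key(derived, 4)]
```

Because every copy of a triple lands in the same partition, the per-partition results have disjoint keys and can simply be concatenated. This is the property that allows the reduction to run without moving data between partitions.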
Fig. 4. Fuzzy reasoning comparison with WebPIE using LUBM datasets. [Axes: reasoning time (min) versus number of universities; series: WebPIE, our approach.]

In Fig. 5, we show the reasoning comparison with Cichlid, a Spark-based reasoner [8]. As we push 1k instance triples to the triple store, our reasoner performs OWL reasoning in relatively few minutes, while Cichlid's runtime increases to two times the current runtime. In addition, this experiment was conducted without computing fuzzy values.

Fig. 5. OWL reasoning comparison with Cichlid. [Axes: reasoning time (min) versus number of universities; series: Cichlid, update operation, our approach.]
e. The update operation adds 1k instance triples to the triple store using our approach.

VI. CONCLUSION

In this paper, we presented a scalable, incremental, and fuzzy OWL pd* semantic reasoning approach for large-scale ontologies that handles uncertainty annotations. We presented methods for calculating fuzzy values and incremental reasoning approaches that avoid re-deriving previously inferred data. To evaluate the efficiency of the proposed reasoner, we conducted OWL reasoning over LUBM3000 and achieved about three times higher throughput than WebPIE reasoning, which employs MapReduce.

ACKNOWLEDGMENT

This work was also supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (NRF-2016R1A2B ), Republic of Korea.

REFERENCES

[1] K. Ahmad and L. Gillam, "Automatic ontology extraction from unstructured texts," in Proc. of the 2005 OTM Confederated International Conference on the Move to Meaningful Internet Systems: CoopIS, DOA, and ODBASE.
[2] C. Liu, G. Qi, H. Wang, and Y. Yu, "Fuzzy Reasoning over RDF Data Using OWL Vocabulary," in Web Intelligence and Intelligent Agent Technology (WI-IAT), 2011 IEEE/WIC/ACM International Conference on, Lyon, 2011.
[3] C. Liu, G. Qi, H. Wang, and Y. Yu, "Large Scale Fuzzy pd* Reasoning using MapReduce," in Proc. of the Semantic Web - ISWC 2011, vol. 7031.
[4] J. Urbani, S. Kotoulas, E. Oren, and F.V. Harmelen, "Scalable Distributed Reasoning using MapReduce," in Proc. of the Semantic Web - ISWC 09, vol. 5823.
[5] J. Urbani, A. Margara, F.V. Harmelen, and H. Bal, "DynamiTE: Parallel Materialization of Dynamic RDF Data," in Proc. of the Semantic Web - ISWC 2013, vol. 8218.
[6] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica, "Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing," in Proc. of the 9th USENIX Conference on Networked Systems Design and Implementation, pp. 2-2.
[7] K. Jemin and P. Young-Tack, "Scalable OWL-Horst ontology reasoning using Spark," in Proc. of BigComp 2015.
[8] R. Gu, S. Wang, F. Wang, C. Yuan, and Y. Huang, "Cichlid: Efficient Large Scale RDFS/OWL Reasoning with Spark," in Parallel and Distributed Processing Symposium (IPDPS), 2015 IEEE International, Hyderabad, 2015.
[9] Y. Guo, Z. Pan, and J. Heflin, "LUBM: A Benchmark for OWL Knowledge Base Systems," Journal of Web Semantics 3(2).
[10] J. Urbani, S. Kotoulas, J. Maassen, F.V. Harmelen, and H. Bal, "OWL Reasoning with WebPIE: Calculating the Closure of 100 Billion Triples," in Proc. of the 7th European Semantic Web Conference, vol. 6088.
