Supervised tracking and correction of inconsistencies in RDF data

Supervise tracking an correction of inconsistencies in RDF ata Youen Péron, Frééric Raimbault, Gilas Ménier, Pierre-François Marteau To cite this version: Youen Péron, Frééric Raimbault, Gilas Ménier, Pierre-François Marteau. Supervise tracking an correction of inconsistencies in RDF ata. 2011. <inria-00593579v2> HAL I: inria-00593579 https://hal.inria.fr/inria-00593579v2 Submitte on 20 May 2011 HAL is a multi-isciplinary open access archive for the eposit an issemination of scientific research ocuments, whether they are publishe or not. The ocuments may come from teaching an research institutions in France or abroa, or from public or private research centers. L archive ouverte pluriisciplinaire HAL, est estinée au épôt et à la iffusion e ocuments scientifiques e niveau recherche, publiés ou non, émanant es établissements enseignement et e recherche français ou étrangers, es laboratoires publics ou privés.

Supervise tracking an correction of inconsistencies in RDF ata Youen Péron, Frééric Raimbault, Gilas Ménier, an Pierre-François Marteau Valoria, Université e Bretagne Su, Université Européene e Bretagne, Vannes, France E-mail: {firstname.lastname}@univ-ubs.fr Abstract. The rapi evelopment of the Linke Data is hampere by the increase of errors in publishe ata, mainly relate to inconsistencies with omain ontologies. This problem alter reliability of application when analysis of inepenent an heterogeneous rf ata is involve. Therefore, it is important to improve the quality of the publishe ata to promote the evelopment of Semantic Web. We introuce a process esigne to etect an ientify specific inconsistencies in publishe ata, an to fix them at an ontological level. We propose a metho to quantitatively evaluate omain an range issues of a property. A iagnosis is then establishe an use to rive a supervise process for the correction (or possibly enhancement) of the ontology. We show the interest of this metho on a case stuy involving bpeia. The results of experiments carrie on very large ata sets generate with rf sp2bench benchmark valiate the applicability an scalability of our metho. 1 Introuction Launche in 2006 by Tim Berners-Lee, the Linke Data 1 movement an its recommenations is playing a leaing role for publishing, sharing, an interconnecting ata on the Semantic Web [3]. Many companies (eg. the BBC an The New York Times), organizations (eg. wikipeia Geonames, FreeBase, UN) an governments (eg. U.S. an U.K.) have aopte the principles of Linke Data to publish their ata using rf 2 stanar an their ontologies 3 using rfs 4 an owl 5 : the size an variety of available ata sets keep growing. April 2011, the Linke Data giant global graph ata has been estimate over 200 large ata sets (more than 1000 triplets each), an totaling 29 billion rf triples an 400 million links 6. Unfortunately, this uncontrolle growth is accompanie by a proliferation of errors in publishe ata associate with ontological inconsistencies. This problem alters application reliability when analysis of inepenent an heterogeneous rf ata is involve. Therefore, it is important to improve the quality of the publishe ata to promote the evelopment of the Semantic Web. Our metho quantitatively evaluates omain an property misuses an helps fix these problems. Most of the research ealing with ata quality in Semantic Web focus either on ontology (class an properties efinition) or raw ata (class an literal instances). Ontology valiation has been extensively stuie in [4, 2, 10, 8]. These works verify the consistency an completeness of an ontology within the assumptions of the open worl an non-unique ientification characterizing the semantic web : The tools are targete towars ontology esigners, irrespectively of the use relate to the ata publication. Surprisingly, valiation of raw ata has been much less stuie, even if it represents the main part of the publishe ata. Some tools have been esigne for syntactic valiation, such as vrp [14] a tool from ics-forthrffsuite 7. This kin of tools checks the ata compliance with the 1 http://www.w3.org/designissues/linkedata.html 2 http://www.w3.org/tr/rf-concepts/ 3 http://www.w3.org/tr/webont-req/#onto-ef 4 http://www.w3.org/tr/rf-schema/ 5 http://www.w3.org/tr/owl-semantics/rf-concepts/ 6 http://www4.wiwiss.fu-berlin.e/loclou/ 7 http://139.91.183.30:9090/rf

RDF syntax an semantic constraints (also calle consistency logic) in ontologies - uner the assumption that the use ontologies are errors free. The origin of the error in raw ata is searche a syntactic level. Error reporting is performe iniviually for each erroneous instance. The Peantic Web s project 8 proposes iagnosis of ontology errors an (most important) recommenation to avoi them. In [6] the initiators of this project perform an analysis of errors foun in a sample set obtaine by crawling: they ientify four symptoms of recurring errors an propose recommenations to manage them. The authors also introuce an online tool, rf:alerts 9 esigne to etect these errors. Our contribution takes part to the Peantic Web main works: we propose to enhance the semantic analysis by aing range an omain verification for properties on non literal ata (unlike [6] where the analysis is strictly restricte to range checking on literal ata). Our analysis is performe on class instances, being subject or object of property. We also propose a way to measure the importance of errors etecte so that the priority of the correction can be evaluate: if a correction has to be one, we propose a iagnosis an an aapte fix. In [13], the authors also aress the problem of ata compliance to ontologies. They propose a generic evaluation metho of ata, base on systematic search of some error patterns: this search is performe by sparql queries applie to the euctive closure of the graph of all inference rules. Their formulation is complex because it requires negations (which is nont trivial for sparql). Since we want our metho to be scalable such to be teste on very large set of rf ata, we propose a one-rule inference (hierarchy) process to spee up the etection of possible errors. We performe our test using the parallel environment haoop an the request language pig. Their metho isn t proviing any assistance to correct the ontology. If necessary, our approach aims also to fill this gap. The rest of the paper is organize as follows: In part 2, we recall some basic concepts of the semantic web an we efine the notations use through out this article. The thir part is the escription of our processing metho. The evaluation an tests are introuce in the fourth section. Then we conclue with a iscussion in the last section. 2 Backgroun an notations In this part, we recall some concepts of the rf ata moel, rfs an owl vocabularies an we efine the notations we use in this paper. 2.1 rf moel Accoring to the rf ata moel, all information is expresse as a triplet (subject, preicate, object). The subject an preicate of a triple are resources ientifie by uris 10. In this article, we use the prefixe uri form to simplifies the examples. A triplet object can be either a ressource or a literal. Let U be the set of resources an L the set or literals. A triplet t is efine as follows : t = (s, p, o) U U (U L) Since a same resource can play the role of a subject or an object into a triplet, a set of triplets can be represente as a graph. We aress the abox/tbox separation problem (see [1]) by istributing the triplets in the terminology ata (tbox) an assertion ata (abox). 2.2 Meta Data Vocabulary Terminology information is escribe using specific rf, rfs or owl vocabulary as efine by W3C. We introuce here in often the terminology we use in this paper : 8 site : http://peantic-web.org 9 http://swse.eri.org/rdfalerts/ 10 uri : Uniform Resource Ientifier

rf:type is a resource that provies an elementary system of belonging to a class. A triplet (r, rf : type, c) translates as r belongs to the class c. Let resources(c) be the set of resources belonging to the class c an let C be the set of classes. The number of resources in a class c is c = resoureces(c). A same resource can belong to several classes. All resources belong to the class owl:thing. rf:property is a class whose instances are use as preicate in a triplet. These resources are calle properties. Let the set of properties be P = resources(rf:property). A property s omains an tbox ranges are enote respectively by rfs:range an rfs:omain. Domain an range can be classes or even set of classes. Let omain(p) be the set of graph resources that belong to the object class of rfs:omain an let range(p) be the set of resources belonging to the object class of rfs:range. By efault, the omain an the tbox range of a property p are owl:thing. owl:objectproperty is the class of properties which range cannot be a literal, but only a class instance (also calle property-object in the text below). rfs:subclassof is a property escribing a subsumption relationship between a class c an a parent class f ( c f ). This is a transitive relationship : Each class can subsume itself : c 1, c 2, c 3 C (c 1 c 2 ) (c 2 c 3 ) = (c 1 c 3 ) c C, c c Let (c) be the set of classes that subsume a class c : (c) = {j C, j c} The generate hierarchy is efine as H. Let assume that, for each graph, the closure of this relationship has alreay been performe (as preprocessing) on the graph so that the following formula is true : ( u U), ( c, f C), (u c) (c f) = (u f) We use the metho of reasoning escribe in articles [15] an [7], which enable the implementation of inference rules on large (istribute) rf ata sets. 2.3 Effective omain an range Let omain e (p) be the set of resource subjects for a property p. Let range e (p) be the set of resource objects for a property p. 10 6 ɛ 10 5 ɛ r 10 4 # errors 10 3 10 2 10 1 10 0 10 0 10 1 10 2 10 3 rank Fig. 1. Number of errors per property p as a function of the rank of p.

We check in paragraph 3 that the effective omain (resp. effective range) is inclue in the omain efine in the tbox (resp. in the range efine in the tbox). Let a omain error be efine as the occurrence of a subject resource of an object-property p that oes not belong to omain(p). Similarity, let a range error be efine as one occurrence of an object resource of an object-property p that oes not belong to range(p). As a preliminary test, we have evaluate the omain an range error ε (p) an ε r (p) in DBpeia for each object-property : 13.9 % of the properties has at least one omain error. These errors are epicte in figure 1: in wich x-axis represents the properties are sorte by escening number of errors an the y-axis represents the number of error per property. Each axis is in a log scale. The shape of the curve reveals that the errors are concentrate on a small number of objectproperties - locate left to the vertical otte line. The fixing of a small number of object-properties can enhance greatly the general consistency of the overall set of ata. The metho escribe above has two rawbacks : (i) it is time an space expensive (especially for large ata sets) because it relies on a ictionary for the classes of resources. (ii) the lack of inication of which instances has cause the error, prevents a etaile tracking of the problem an therefore the esign of a solution. In the following parts of this paper, we escribe another evaluation metho that overcome these problems. 3 Our metho We propose a three-stage processing approach to compute the upper boun of the error number per object-property. Then we show how to use it to iagnose the source of the errors. To illustrate this metho, we will escribe each step of application on the bpeia s objectproperty bpeia:hometown. Its efinition s omain is bpeia:person, but in reality it is use with 21287 resources that o not belong to this class. Figure 2 shows the result of our lines processing metho: in the left part, the omain of efinition of bpeia:hometown appears in otte - its effective omain is hatche shae - lines (note that the range is not epicte here). In the right part, we can see the hierarchy of the classes with the number of resources using the property. Fig. 2. Class bpeia:person is the omain of efinition of bpeia:hometown. The effective omain, hatche shae, is ivie between classes: bpeia:person an bpeia:ban. The values appearing in the class hierarchy are those of the histogram of classes for property. 3.1 Computation of a histogram for a property In the first stage, we calculate the istribution of the resources in the effective omain e (p) an the effective range range e (p) of an object-property p in the set of classes C. Let us efine two

histograms of size C : p for the resources of the omain of p an p r for the resources involve in the range of p : p [c] = ressources(c) omain e (p) p r [c] = ressources(c) range e (p) In the case of object-property bpeia:hometown, the values of the histogram are inicate on the tree of Figure 2 for the relevant classes an are 0 for the other classes. Generally, the maximum coorinate of these histogram (omain an range) is for the class owl:thing because it inclues all of the instances member of the omain an of the effective range. p [owl:thing] = max c C {p [c]} p r [owl:thing] = max c C {p r [c]} In our example (bpeia), the object-property bpeia:hometown is involve by 38042 resources (istribute between people an organizations). 3.2 Searching for the most specific class In the secon stage, using the histogram for an object-property p, we select the most specific class class (p) that inclues the effective omain of p. It is obtaine by fining the lowest class in the hierarchy H among the classes containing the effective omain of the property. By esign, class (p) features both the following relations : (i) p [class (p)] = p [owl:thing] (ii) ( c C)(c class (p)) (p [c] < p [owl:thing]) (class (p) c) (i) ensures that resources(class (p)) inclues omain e (p) an (ii) ensures that class (p) is the most specific class containing omain e (p). Since we assume the assumption that the classes of instances meet a hierarchy : c 1 c 2 = (p [c 1 ] p [c 2 ]) This proves the existence an unicity of class (p). In the above example (see figure 2), owl:thing is the most specific class for bpeia:hometown that inclues the effective omain. In the same way as for the omain, lets efine class r (p) the more specific class that inclues the real range of p : (i) p r [class r (p)] = p r [owl:thing] (ii) ( c C)(c class r (p) (p r [c] < p r [owl:thing]) (class r (p) c) 3.3 Computing the upper boun of the number of errors The last stage, involves the computation of the upper boun of the number of errors ε sup (p) - for the omain of p - an ε sup r (p) for the range of the object-property p. ε sup ε sup r (p) = p [class (p)] omain(p) (p) = p r [class r (p)] range(p) The eviation between the number of errors an the upper boun is relate to the resource (or the effective range) belonging to more than one class : these resources may be counte several times using the hierarchy of classes they belong to. In the figure 2 example, the omain of efinition is the class bpeia:person. The upper boun of error is : ε sup (bpeia:hometown) = p [owl:thing] p [bpeia:person] = 21287 This means that 21287 resources are subjects of bpeia:hometown even if they o not belong to its omain of efinition (here bpeia:person). We show in 4.1 that this upper boun matches perfectly the number of errors compute using the naive counting process.

3.4 Diagnostics an error correction After having etect potential problems, we will now escribe how it is possible to fin an fix errors. In the current example of figure 2, there is for instance a problem with bpeia:hometown. The stuy of histogram class for bpeia:hometown shows that most of the errors comes from the instance of bpeia:ban. These instances use the property bpeia:hometown but o not belong to the efinition omain of bpeia:person. A first iea involves the application of the following assumption : each resource use as a subject of a property belongs to the efinition omain of the class of this property. (3) r omain e (p) = r omain(p) In the case of bpeia:hometown, this woul imply that each music group (instance of bpeia:ban) belongs also to people (instance of bpeia:person). For instance, the resource bpeia:the Beach Boys use the property bpeia:hometown but it is not an instance of bpeia:person. Changing this an letting bpeia:the Beach Boys belong to bpeia:person may introuce errors on search relate to bpeia:person. A secon solution woul rely on the use of two ifferent properties for the music group an for people. This is not acceptable since it woul uplicate properties that have the same meaning - some ontology level problem may then arise. A thir iea is to relax the efinition omain of bpeia:hometown. In the bpeia hierarchy, owl:thing is the only less specific class than bpeia:person. The problem is that bpeia:hometown is not compatible with every bpeia s classes. At least, we think that the best solution in this case, is to a the bpeia:ban class to the efinition omain of bpeia:hometown. In general, we propose a iagnostic on erroneous efinition omain for property using the following scheme : if it seems that most of the instances of effective omain of the property belong to the efinition omain s class then the inference (3) applies else if the property may have ifferent senses for the ifferent resources then as much istinct properties as neee are create (one for each semantic) else if it is possible to fin a class that inclues a part of the effective omain an complies with the semantic of the property an for which the number of errors is acceptable then the efinition omain of this class is wiene else the efinition of the property is enhance by merging it to the set of classes from the effective omain. 4 Evaluation In this section, we analyze the quality of the results obtaine by our approach applie on bpeia (version 3.6), then we valiate the scalability of our metho using the benchmark sp2bench [12]. 4.1 DBpeia analysis bpeia is built upon an ontology 11 of 272 classes, 629 object-properties an 706 litteral-properties. Range an omain of each property are efine by a class or a set of classes. The ata in 843, 000 articles from Wikipeia (English version) are extracte an projecte to a 3.5 million resources graph that complies to the ontology. We performe our analysis on this set of ata - effective omain an range evaluation an etermination of the upper boun number of errors. An online emo that present the results of our system is available 12. 11 http://bpeia.org/downloas36 12 http://cluster-valoria.univ-ubs.fr/swoct/swoct/

Comparison with the exact number of error The table 1 shows a selection of the properties having the most errors an compares the upper boun error ε sup (p) to the exact number of errors ε (p) - see paragraph 2.3 - Comparison of omain ε (p) ε sup (p) bpeia:class 166,004 166,004 bpeia:hometown 21,287 21,287 bpeia:network 12,171 12,171 bpeia:sisterstation 5,044 5,044 bpeia:esigner 3,028 3,028 Comparison of range ε r(p) ε sup r (p) bpeia:city 12,168 12,168 bpeia:associatemusicalartist 11,027 11,027 bpeia:artist 10,166 10,166 bpeia:associateban 6,809 6,809 bpeia:sisterstation 5,464 5,464 Table 1. Comparison of the upper boun error an the excat number of errors on the first top 5 erroneous properties This table shows that the upper boun error ε sup (p) perfectly matches ε (p) : this is not surprising since every resource use in these properties belongs to one an only one class. Case stuy. We iscuss below some cases of errors encountere in bpeia using the iagnostic process presente in paragraph 3.4. bpeia:architect is eicate to the instance of builings (bpeia:builings class) but is in fact use in conjunction with bpeia:place - which is, into the hierarchy, parent to bpeia:builings. It seems acceptable to assume that every place having an architect is a builing. For instance, the Mongomery place in New York is an instance of bpeia:place. Even if this place has an architect, it oes not belong to bpeia:builing. We propose to a the Montgomery place to bpeia:builing. bpeia:class is the property causing the greatest number of errors. The notion of class is use in two ifferent contexts. Accoring to the ontology, this property shoul be use with transportation items. The histogram show that this property is mainly use by sub classes of bpeia:species - which has no relationship with transportation. Species are in no way means of transportation (with exceptions such as horses). In this case, the inference rule (3) see in 3.4 shoul not apply. It seems that bpeia can have ifferent semantics relate to bpeia:species or bpeia:meansoftransportation. We propose to fix this ambiguous property by creating - for each context - new properties in the tbox (an use the correct property in the abox) bpeia:city has the efinition range bpeia:city. Resources from its effective range are mainly location instances (bpeia:place is a parent class of bpeia:city). For instance, the triplet (Cincinnati bengals, bpeia:city, paul brown staium) means that the American football team Cincinnati Bengals is locate in the Paul Brown Staium. It is therefore elicate to assume that all the places use as objects for bpeia:city are cities. The property seems to have one an only one semantic for all resources in the effective range so it oesn t seem necessary to uplicate it. The most specific class that inclues the effective range is owl:thing. This class is much too general to be use as range. The bpeia:place is the best trae-off between the number of errors an the semantic consistency (only 37 resources among 17707 of the effective range of bpeia:city are not instances of the bpeia:place class).

Fig. 3. Execution times require for the computation of the histograms as a function of the number of triplets generate by sp2bench 4.2 Scalability We have use the pig system [9] on a cluster of 16 computers which can process up to 250 tasks in parallel. Since we want to stuy the processing time base on the graph size, we have teste our metho on series of test sets of increasing size generate by sp2bench [12]. sp2bench is mainly esigne to evaluate sparql engines but it is also useful for scalable ata sets generation. We have introuces errors in abox by changing the reference ontology (tbox). The errors are etecte an counte following a two step process : 1. the computation of istribution histograms execute in a ata-parallel way. (see paragraph 3.1) 2. the upper boun error is sequentially compute given the histograms an the tbox. This last step uration is not comparable to the histogram s computation so we ignore it in our results. Figure 3 shows the results. We use a log scale for the two axis. The execution time is multiplie by 3 when the size of ata increases by a factor of 10 7. This results show that our metho scales relatively well acceptably scalable an emonstarte it efficiency on very large rf atasets. 5 Conclusion We have escribe an evaluate a metho that analysis a rf graph an computes upper bouns for the omain/range errors number for a given property. In the case of bpeia, this error matches exactly the number of errors. We explaine how to perform an error iagnosis an propose ways to fix the ontology accoringly. As an example, we applie this process to bpeia an suggest eicate fixing proceures.. The experimentations carrie on large set of ata show promising results an emonstrate the relative scalability an usability of our metho for quantitative stuy of the evolution of Semantic Web Data.

Our approach has nevertheless some limitations: for instance, we assume that the class hierarchy is error free so that it is possible to infer new relations. In some situations, the hierarchy relations shoul be checke or perhaps rebuilt - some systems [11] can compute a new hierarchy base on hierarchy clustering: anyway, this woul be mae uner the assumption of instances correctness. These two problems are closely relate so the scanning of the ata base shoul involve jointly these processes that must be run in intricate ways. We are currently expaning our metho to the processing of other OWL constraints such as range restriction (owl:allvaluefrom), isjonctions (owl:isjoint), carinality (owl:carinality) etc.. The iagnosis coul also be enhance using a semi-supervise process involving the selection of the most relevant example - for instance using page rank like algorithm [5]. References 1. Baaer, F.: The escription logic hanbook: theory, implementation, an applications. Cambrige Univ Pr (2003) 2. Baclawski, K., Matheus, C., Kokar, M., Letkowski, J., Kogut, P.: Towars a symptom ontology for semantic web applications. In: ISWC. p. 650 667 (2004) 3. Bizer, C., Heath, T., Berners-Lee, T.: Linke ata - the story so far. Int. J. Semantic Web Inf. Syst. 5(3), 1 22 (2009) 4. Gomez-Perez, A.: Evaluation of ontologies. International Journal of Intelligent Systems 16(3), 391 409 (2001) 5. Hogan, A., Harth, A., Decker, S.: Reconrank: A scalable ranking metho for semantic web ata with context. In: 2n Workshop on Scalable Semantic Web Knowlege Base Systems. Citeseer (2006) 6. Hogan, A., Harth, A., Passant, A., Decker, S., Polleres, A.: Weaving the peantic web. In: 3r International Workshop on Linke Data on the Web (LDOW2010), in conjunction with 19th International Worl Wie Web Conference, CEUR (2010) 7. Hogan, A., Harth, A., Polleres, A.: Scalable authoritative OWL reasoning for the web. Information Systems 5(2), 49 90 (2009) 8. Kalyanpur, A., Parsia, B., Sirin, E., Henler, J.A.: Debugging unsatisfiable classes in owl ontologies. Journal of Web Semantics 3(4), 268 293 (2005) 9. Olston, C., Ree, B., Srivastava, U., Kumar, R., Tomkins, A.: Pig latin: a not-so-foreign language for ata processing. In: Proceeings of the 2008 ACM SIGMOD international conference on Management of ata. pp. 1099 1110. ACM (2008) 10. Parsia, B., Sirin, E., Kalyanpur, A.: Debugging owl ontologies. In: International Worl Wie Web Conference. pp. 633 640 (2005) 11. Quan, T., Hui, S., Cao, T.: A fuzzy FCA-base approach to conceptual clustering for automatic generation of concept hierarchy on uncertainty ata. In: Proc. of the 2004 Concept Lattices an Their Applications Workshop. pp. 1 12. Citeseer (2004) 12. Schmit, M., Hornung, T., Lausen, G., Pinkel, C.: SPˆ 2Bench: A SPARQL Performance Benchmark. In: Data Engineering, 2009. ICDE 09. IEEE 25th International Conference on. pp. 222 233. IEEE (2009) 13. Tao, J., Ding, L., McGuinness, D.L.: Instance ata evaluation for semantic web-base knowlege management systems. In: Proceeings of the 42th Hawaii International Conference on System Sciences (HICSS-42). pp. 5 8. Big Islan, Hawaii (2009) 14. Tolle, K.: Valiating RDF Parser: A Tool for Parsing an Valiating RDF Metaata an Schemas. Master s thesis, University of Hannover (2000) 15. Urbani, J., Kotoulas, S., Oren, E., van Harmelen, F.: Scalable istribute reasoning using mapreuce. The Semantic Web-ISWC 2009 pp. 634 649 (2009)