Measuring The Degree Of Similarity Between Web Ontologies Based On Semantic Coherence

ABHIK BANERJEE, HAREENDRA MUNIMADUGU, SRINIVASA RAGHAVAN VEDANARAYANAN, LAWRENCE J. MAZLACK
Applied Computational Intelligence Laboratory
University of Cincinnati, Ohio 45220
UNITED STATES
banerjak@mail.uc.edu, munimaha@mail.uc.edu, vedanasn@mail.uc.edu, mazlack@uc.edu

Abstract: - The Internet comprises a variety of websites, which both individually and in clusters generate large amounts of information. To make web pages machine-understandable, we need a formal, explicit specification. This is provided by a Web Ontology. The importance of domain ontologies is widely recognized, particularly in relation to the expected advent of the Semantic Web. For the task of detecting and retrieving relevant ontologies, a means of measuring the similarity between ontologies becomes a necessity on a very large scale. This paper describes a method that recognizes and categorizes different ontologies of the same domain and finds the degree of similarity between them, providing a framework for future research on merging ontologies that relate to a similar concept in a domain.

Key Words: - Ontology, merging, comparison, coherence, semantic web.

1 Introduction
The Internet comprises a variety of websites, which both individually and in clusters generate large amounts of information. Human users need to extract this information effectively and efficiently, ideally by having machines do the work of fetching it. To make web pages machine-understandable, we need formal, explicit specifications. These are provided by a Web Ontology (from here on, Ontology will be used in place of Web Ontology). An ontology is a representation of knowledge as a set of concepts within a particular domain, together with the relationships between those concepts.
It is used mainly to reason about the properties of that domain, and may be used to describe the domain [1]. Ontology is an important emerging discipline with great potential to improve information management [2] [9]. The importance of ontologies pertaining to a particular domain is widely recognized, particularly in relation to the growth of the Semantic Web. For the task of detecting and retrieving relevant ontologies, a means of measuring the similarity between ontologies becomes a necessity on a very large scale [3].

The motivation includes cases such as the following. If a person wants to find the community with which he or she will be most comfortable communicating, identifying the similarity between ontologies (communities) can be of great benefit [3] [4]; a major application is categorizing communities in social networks such as Facebook, Orkut, and Twitter [5] [10]. In ontology engineering, a major activity on the Web, it is helpful to find ontologies that are similar so that they can easily be used in tandem with one another [3]. For example, when creating an ontology for astrophysics analysis, it would be better to find both astronomy and physics ontologies that can be used in conjunction with each other [3]. In emerging areas such as semantic search engines, where context-based search is preferred and ontologies are returned in response to a query, it would be valuable to introduce a module that finds ontologies similar to one supplied as a key [3] [6]. Distances can also be used to sort the responses to such a query, which in turn leads to ranking ontologies by their proximity [3] [7].

ISSN: 1792-4251 584 ISBN: 978-960-474-213-4

The purpose of this paper is to find a similarity measure between two ontologies and to provide a framework for future research on merging ontologies that relate to a similar domain. The objective of this research is to recognize and categorize similar ontologies and subsequently produce a common consolidated ontology framework for a particular domain. Our central hypothesis is that WordNet can be used to find the semantic coherence between two nodes in two different ontologies. We use lexical similarity, semantic similarity, and tree transformation costs to assign a value to the degree of similarity between two ontologies. The hypothesis is based on the fact that WordNet lists the meanings of a particular word along with its synonyms.

2 Problem Formulation
The degree of similarity has been well studied in several diverse areas, such as structured text databases, experimental biology, compiler optimization, and image analysis [8]. The problem addressed in this paper is to find the degree of similarity between two ontologies from the same domain. The long-term goals of this research are to recognize and categorize similar ontologies, detect their domain, and consequently provide a larger framework for data integration, so that we can perform better analysis and data mining on global data and achieve coherent, highly specific, and trustworthy results for user queries.
In the earlier studies of tree similarity [8], the edit cost (or edit distance) from one tree to another is used to measure the degree of similarity of two trees. Nevertheless, such approaches concentrate mainly on discovering matches from a structural or geometric perspective, without considering the conceptual semantics of the tree nodes within the knowledge framework [8]. A nearly comprehensive study of the similarity between ontologies was carried out by Maedche and Staab [3]. In their research, an ontology had a tree-like structure that was used to model concepts in the form of a taxonomy [8]. They developed a method to measure the similarity between ontologies based on the ideas of lexicon, reference functions, and semantic cotopy [11]. The scheme was built on the hypothesis that different ontologies use the same terms for concepts, while the relative positions of those terms within each ontology may vary according to the application or the user's priorities [8] [12]. When the ontologies do not share terms, the taxonomic overlap cannot be fully computed and evaluation at the lexical level becomes almost impossible [8]. The structural characteristics of trees, which are crucial to the discovery of similarity, were not taken into account in that research.

3 Problem Resolution
Starting with a single domain of ontologies is expected to reduce complexity while retaining the main issue of this research. The tourism domain will be used to test our hypothesis. Before testing, we need to define our data and to generate the different ontological trees and represent them in a particular format. We have chosen a group of undergraduate students from the industrial engineering department. We will give the students a screening test consisting of questions such as the number of miles traveled, the number of countries visited, the mode of transportation, and so forth.
Initially, we plan to use only those students who have traveled more than 1000 miles, have visited at least 3 and at most 5 countries, and reached these countries on commercial flights. We will take a minimum of 3 groups of 25 students each, selected by the above criteria, to design individual ontologies for the tourism domain. Forms of knowledge representation include semantic nets, frames, rules, etc.
We decided to represent our ontological trees in the form of frames. We chose the Protégé tool to develop the ontological frames. The undergraduate students selected for the experimental setup will be given several hours of training on Protégé and on how to use its interfaces to create frames. Fig 1 shows one of the interfaces that will be used to develop these frames for knowledge representation.

Fig 1. Interface for designing frames in Protégé

After the ontological trees are developed, we will compare two ontologies at a time by running them through our system, which comprises the lexical, semantic, and tree transformation phases. To test the robustness of our system, the tests and subjects will be varied by choosing different domains in a similar manner.

3.1 Testing
To test our hypothesis we define a three-step approach comprising:
Lexical comparison level
Semantic comparison level
Calculating the transformation cost
The three testing procedures are run iteratively for all the nodes in both ontological trees, and the numerical values obtained for all the nodes that constitute the two trees are compared with the predefined threshold values.

3.1.1 Lexical Comparison Level
At this level, we compare nodes that have lexically similar names. To find the lexical similarity, we use the edit distance: the minimum number of operations (insertions, deletions, and substitutions) needed to transform one string into another. The similarity measure is calculated from this edit distance by the following equation [3]:

Lexical Similarity Measure: LSM = (s - c) / s    (1)

where s is the length of the shorter string and c is the number of changes required to transform one string into the other (the edit distance).

Fig 2. Two ontologies A and B depicting university structure [8]

The nodes that have similar names are taken into consideration from the corresponding ontologies. For example, in Fig 2, for nodes in ontology A:
University A has its best match with University B, LSM = (12 - 1) / 12 ≈ 0.92
Employee has its best match with Employee in ontology B, LSM = 1
College has no good match with any node; consequently, LSM = 0

3.1.2 Semantic Comparison Level
At the semantic level, we use WordNet to find the concepts behind the words that were found similar, or not completely similar, at the lexical level. WordNet is a suitable option for finding the semantic similarity between nodes. Although WordNet resembles a thesaurus in many ways, one difference is that, apart from grouping words into concepts, WordNet also records the relationships between words and concepts. WordNet uses the concept of synsets as well as the concept of synonyms. Hypernyms and hyponyms can also be treated as synsets. Hyponyms and hypernyms can be understood through a simple example: we say that a car is a kind of vehicle. Here car is a hyponym and vehicle is a hypernym. Synsets are defined as sets of words that have a similar meaning and can be substituted for each other in a sentence without changing the meaning of the proposition. So we can use the words car and vehicle interchangeably in a sentence: "The car is traveling at a speed of 100 miles per hour" can also be expressed as "The vehicle is traveling at a speed of 100 miles per hour."

We propose that, besides comparing the lexical similarity of two similar nodes in two different ontological tree structures, we also use WordNet to determine whether the meanings of the two nodes match. We will write a function that compares two words to see whether they form a synset pair, taking into consideration the corresponding hypernyms and hyponyms.
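The two node-level comparisons can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, clamping LSM at 0 is an assumption made so that poor matches score 0 as in the College example, and the small hand-written word tables merely stand in for a real WordNet lookup.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def lexical_similarity(a: str, b: str) -> float:
    """LSM = (s - c) / s from equation (1); clamped at 0 (assumption)."""
    s = min(len(a), len(b))  # length of the shorter string
    if s == 0:
        return 0.0
    c = edit_distance(a, b)
    return max(0.0, (s - c) / s)


# Hypothetical stand-in for WordNet: hand-written toy data, not real WordNet content.
TOY_SYNSETS = [{"college", "school"}, {"car", "auto"}]
TOY_HYPERNYMS = {"car": "vehicle"}  # hyponym -> hypernym


def semantic_coherence(a: str, b: str, hypernym_score: float = 1.0) -> float:
    """Return a value in [0, 1]; 1 means the two words are fully coherent.

    hypernym_score is an assumed parameter: the text treats hypernym/hyponym
    pairs as synsets (score 1), but a lower value can be passed instead.
    """
    a, b = a.lower(), b.lower()
    if a == b or any(a in syn and b in syn for syn in TOY_SYNSETS):
        return 1.0
    if TOY_HYPERNYMS.get(a) == b or TOY_HYPERNYMS.get(b) == a:
        return hypernym_score
    return 0.0
```

With these definitions, lexical_similarity("University A", "University B") evaluates to 11/12, matching the worked example, and semantic_coherence("College", "School") returns 1 under the toy data.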
The function will check whether a pair of words (nodes/concepts) from the two ontological trees forms a pair of synsets by looking them up in the WordNet dictionary. It will return a numerical value between 0 and 1, where a value of 1 implies complete semantic coherence between the two words. In our example from Fig 2, WordNet would ideally report College and School as semantically coherent, so College can be re-labeled to School in Ontology A.

3.1.3 Calculating the Transformation Cost
We define the operations of inserting a node, deleting a node, moving a node, and re-labeling a node as transformations that, when applied to one ontology, make it more closely resemble the other ontology under consideration. We assign a transformation cost to each of these operations and compute the net transformation cost. The cost of insertion or deletion is given by the expression [8]:

C_i/d = [(h - d) + 1 + D] / V    (2)

where h is the height of the tree (number of levels), d is the depth of the given node in the tree, V is the number of nodes in the tree before insertion/deletion, and D is the number of descendants of that node after it is inserted.

Deletion and insertion are exemplified by Fig 3, which portrays a conversion from Ontology A to Ontology B. The costs for deletion and insertion are:

Deletion of Professor node at level 4: C_d = [(4 - 4) + 1 + 0] / 5 = 0.2    (3)
Insertion of Professor node at level 3: C_i = [(3 - 3) + 1 + 0] / 5 = 0.2    (4)

The cost of moving a node is given by the expression [8]:

C_m = [0.5 * (D + I) * (V - 2)] / 2    (5)

where, in this equation, D is the cost of deletion (not the descendant count of equation (2)), I is the cost of insertion, and V is the number of nodes in the tree before insertion/deletion. From equations (3) and (4), we have:

C_m = [0.5 * (0.2 + 0.2) * (5 - 2)] / 2 = 0.3
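The cost expressions (2) and (5) can be checked with a short Python sketch. The function and parameter names are hypothetical, and the argument values are simply those used in equations (3) and (4).

```python
def insert_delete_cost(h: int, d: int, descendants: int, v: int) -> float:
    """Equation (2): cost of inserting or deleting one node.

    h: height of the tree (number of levels)
    d: depth of the node in the tree
    descendants: number of descendants of the node (D in equation (2))
    v: number of nodes in the tree before the operation
    """
    return ((h - d) + 1 + descendants) / v


def move_cost(deletion_cost: float, insertion_cost: float, v: int) -> float:
    """Equation (5): cost of moving a node, from its deletion and insertion costs."""
    return (0.5 * (deletion_cost + insertion_cost) * (v - 2)) / 2


# Values taken from equations (3) and (4) for the Professor node.
c_d = insert_delete_cost(h=4, d=4, descendants=0, v=5)  # deletion at level 4
c_i = insert_delete_cost(h=3, d=3, descendants=0, v=5)  # insertion at level 3
c_m = move_cost(c_d, c_i, v=5)

print(f"C_d = {c_d:.2f}, C_i = {c_i:.2f}, C_m = {c_m:.2f}")
```

Keeping the two formulas as separate functions mirrors the text, where the move cost is derived from a deletion followed by an insertion.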
Fig 3. Deletion and insertion of node Professor [8]

The re-labeling operation is useful when the labels of nodes do not match between two concepts or ontologies. The cost of re-labeling depends on the semantic similarity between the two given concepts, denoted by s. The cost of re-labeling is assigned the value returned by the function defined at the semantic comparison level. The structure of the two ontologies after re-labeling is illustrated by Fig 4.

Fig 4. Ontology B transformed to ontology A

3.2 Threshold Setting
After comparing the two ontological trees we obtain three numerical values, one for each of our three steps. The values from the lexical, semantic, and transformation cost steps are each compared with a predefined threshold value. For the lexical and semantic comparisons, if the value is greater than the threshold we take the result to be true. For the transformation cost, if the value is less than the threshold we take the result of the test to be true. Choosing the threshold values is a difficult and important question. We take a range of threshold values for each of the three tests and repeat our experiments with different inputs from the same domain as well as inputs from various domains. With each iteration, we decide whether the threshold value for each test needs to be increased or decreased. After setting the threshold values for each of the tests, we run our experiment on various input ontology sets for various domains.

4 Conclusion
The results of the three tests are a sufficient measure for deciding whether two ontological trees or ontological structures are similar. The lexical and semantic comparisons of the nodes indicate whether the two ontologies define the same concepts, and the transformation cost gives the cost required to transform one ontology into the other.
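The threshold comparison of Section 3.2 can be sketched in Python as follows. This is only an illustration under stated assumptions: the class and function names are hypothetical, the threshold values are placeholders rather than tuned results, and the paper does not specify how the three outcomes are combined, so the sketch reports them separately.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    lexical: float         # lexical test passes when the score is above this
    semantic: float        # semantic test passes when the score is above this
    transformation: float  # cost test passes when the cost is below this


def run_tests(lexical: float, semantic: float,
              transformation_cost: float, t: Thresholds) -> dict:
    """Compare each of the three values with its threshold (Section 3.2)."""
    return {
        "lexical": lexical > t.lexical,
        "semantic": semantic > t.semantic,
        "transformation": transformation_cost < t.transformation,
    }


# Placeholder thresholds (assumed values, to be adjusted over repeated experiments).
thresholds = Thresholds(lexical=0.8, semantic=0.8, transformation=0.5)
results = run_tests(lexical=11 / 12, semantic=1.0,
                    transformation_cost=0.3, t=thresholds)
print(results)
```

Returning the three outcomes individually makes it easy to see which threshold would need to be raised or lowered on the next iteration.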
Future goals of the research include merging two ontologies once we conclude that they are similar. A merged ontology for each domain is expected to be more efficient, more maintainable, and less error-prone. It will help remove ambiguity among the various web ontologies available. The future benefits of a merged ontology include finding common communities in various social networking sites, semantic search over the wide span of data spread across the Internet, and consolidation of ontologies across related research areas such as astrophysics, geophysics, and many others.
References:-
[1] http://www.answers.com/topic/ontologycomputer-science
[2] Chen, E.; Wu, G. 2005. An Ontology Learning Method Enhanced by Frame Semantics. In ISM 2005: Proceedings of the Seventh IEEE International Symposium on Multimedia, 374-382.
[3] Maedche, A.; Staab, S. 2002. Measuring Similarity Between Ontologies. In EKAW '02: Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management. Ontologies and the Semantic Web, Springer-Verlag, London, UK, 251-263.
[4] Jung, J.; Euzenat, J. 2007. Towards Semantic Social Networks. In Proc. 4th European Semantic Web Conference, Innsbruck (AT), volume 4519 of Lecture Notes in Computer Science, 267-280.
[5] Jung, J.; Zimmermann, A.; Euzenat, J. 2007. Concept-Based Query Transformation Based On Semantic Centrality In Semantic Peer-To-Peer Environment. In Proc. Advances in Data and Web Management, Joint 9th Asia-Pacific Web Conference (APWeb) and 8th International Conference on Web-Age Information Management (WAIM), Huang Shan (CN), volume 4505 of Lecture Notes in Computer Science, 622-629.
[6] d'Aquin, M.; Baldassarre, C.; Gridinoc, L.; Angeletou, S.; Sabou, M.; Motta, E. 2007. Watson: A Gateway For Next Generation Semantic Web Applications. In Proc. Poster session of the International Semantic Web Conference (ISWC), Busan.
[7] Alani, H.; Brewster, C. 2005. Ontology Ranking Based On The Analysis Of Concept Structures. In Proc. 3rd International Conference on Knowledge Capture (K-Cap), Banff, 51-58.
[8] Xue, Y.; Wang, C.; Ghenniwa, H.H.; Shen, W. 2009. A Tree Similarity Measuring Method And Its Application to Ontology Comparison. J.UCS 15, 1766-1781.
[9] Gracia, J.; Lopez, V.; d'Aquin, M.; Sabou, M.; Motta, E.; Mena, E. 2007. Solving Semantic Ambiguity To Improve Semantic Web Based Ontology Matching. In Proc. 2nd ISWC Ontology Matching Workshop (OM), Busan, 1-12.
[10] Khelif, K.; Gandon, F.L.; Corby, O.; Dieng-Kuntz, R. 2008. Using The Intension Of Classes And Properties Definition In Ontologies For Word Sense Disambiguation. In Knowledge Engineering: Practice and Patterns, 16th International Conference, EKAW 2008, Acitrezza, Italy, September 29 - October 2, 2008, 188-197.
[11] Ermolayev, V.; Keberle, N.; Matzke, W. 2008. An Upper Level Ontology Model for Engineering Design Performance Domain. In Proc. 27th International Conference on Conceptual Modeling (ER 2008), Barcelona, Spain, October 20-24.
[12] Liu, B.; Zhang, H.; Yang, X. 2008. GY-RTI: An Integrated Distributed Simulation Environment. In Proc. IEEE International Conference on Networking, Sensing and Control (ICNSC 2008), 232-235.