The Design and Implementation of Optimization Approaches for Large Scale Ontology Alignment in SAMBO


Linköping University
Department of Computer Science
Master thesis, 30 ECTS
Datateknik
2017
LiTH-IDA/ERASMUS-A--17/001--SE

The Design and Implementation of Optimization Approaches for Large Scale Ontology Alignment in SAMBO

Huanyu Li

Examiner: Patrick Lambrix
Supervisor: Kristian Sandahl, Valentina Ivanova

Linköpings universitet, SE Linköping

Copyright

The publishers will keep this document online on the Internet, or its possible replacement, for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page.

© Huanyu Li

Abstract

The current World Wide Web provides a convenient way for people to acquire information, but it cannot manipulate semantics. In other words, people can access data on web pages efficiently, but computer programs cannot reuse and share that data effectively. Tim Berners-Lee, the inventor of the World Wide Web, together with James Hendler and Ora Lassila, proposed the idea of the Semantic Web as an evolution of the existing Web. Knowledge representation for the Semantic Web has developed from the Extensible Markup Language (XML) and the Resource Description Framework (RDF) to ontologies. Many researchers use ontologies to express concepts, relations and relevant semantics in specific domains. However, different researchers may understand the same knowledge differently, which leads to inconsistent information in the same or similar ontologies.

SAMBO is an ontology alignment system that was designed and implemented by ADIT at Linköping University. Shortly after its implementation, SAMBO could accomplish most ontology alignment tasks. Nevertheless, as the scale of ontologies grew rapidly, SAMBO could not handle large scale ontology alignment. The primary task of this thesis is to optimize the existing SAMBO system so that it can align large scale ontologies.

The principal parts of this thesis are as follows. First, we analyze the current top ontology alignment systems, AML and LogMap, which are capable of aligning large scale ontologies. This analysis aims to identify the design features of high-quality systems. Then, we analyze the existing SAMBO system to find out which aspects need to be optimized. The result is that SAMBO should be improved in its data structures, database design and parallel matching. We therefore propose a design of optimization approaches and give their implementation. Finally, we evaluate the new system with large scale ontologies and obtain the desired results.

Keywords: Semantic Web; Ontologies; Ontology Alignment; Large Scale Ontology; SAMBO; Optimization

Acknowledgments

It has been a wonderful time completing my thesis in an academic direction rather than as a company internship implementing software. I went through the periods of research, analysis, design, implementation and evaluation. In each stage I felt I would get lost, but I found strength along the way because of the support from the people I would like to thank.

First and foremost, I would like to express my gratitude to my internship supervisor Professor Patrick Lambrix for his help during my thesis time. His guidance and advice helped me improve throughout the thesis work. Thanks for your patience and help! I really appreciate that.

Second, I am grateful to another supervisor from LiU, Professor Kristian Sandahl. He guided me through the thesis process at LiU and gave many important comments on my thesis. Without his help, I would not have gained a complete and detailed understanding of thesis writing. Thank you so much!

Then, I thank my Chinese supervisor, Professor Ting He. He gave me a lot of guidance about my thesis and reviewed it carefully and seriously to offer advice.

I am also grateful to Valentina Ivanova. She helped me set up the experimental environment at the beginning, and she also gave me advice during the design.

To all my friends in the HIT-LiU class, I am grateful for all the support and advice as well as the enjoyable moments spent with you during this year in Sweden. I am also thankful to Kai Chu for his suggestions in the opposition seminar.

I also give my sincere thanks to the people from LiU and HIT, as well as the teachers who work for this exchange program. This was the kind of chance that changes a life, and I trust it changed mine. I couldn't have imagined how wonderful and unforgettable my life would become from the moment I took part in this exchange program and the moment I set foot in Sweden. Thanks to your work, I have this amazing and unforgettable memory of studying abroad.

Thanks to HIT and the School of Software. After 6 years of study from bachelor to master, I am extremely proud of being a HITer.

Finally, I would like to thank my family for their encouragement and unequivocal support, which made me hold out until today. Thank you!

Huanyu Li
August, 2016
Linköping, Sweden

Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables

1 Introduction
    Motivation
    Aim
    The Status of Related Research
    Outline of The Thesis
2 Background
    Ontology Alignment System-SAMBO
    Ontology Alignment System-AML
    Ontology Alignment System-LogMap
    Comparison of Three Ontology Alignment Systems
3 Requirements Analysis
    The Goal of Optimization Approaches
    Detailed Analysis on SAMBO
    Requirements of Optimization Approaches
4 Overview Design
    Architecture of SAMBO
    The Structure of Optimization Approaches
    The Functional Modules
    Key Techniques
5 Detailed Design and Implementation
    Detailed Design of Key Modules
    The Environment of Implementation
    Key Interfaces of System
6 Testing and Evaluation
    System Testing Scheme
    Functional Testing
    Non-Functional Testing
    System Evaluation
7 Discussion
    Result Reflection
    Method Reflection
    Sustainability
8 Conclusion and Future Work
    Conclusion
    Future Work
Bibliography

List of Figures

1.1 Example of General Format of OWL file Scenario of Ontology Alignment Division of Matching Result Framework in old SAMBO Framework in new session based SAMBO Configuration for Combining and Filtering Mapping Suggestion Main data structures in new session based SAMBO Architecture of AgreementMaker Schema of ontology loading module in AML Schema of ontology matching module in AML Configuration in AML Validation in AML Class Diagram in AML Main processes in LogMap Class Diagram in LogMap Aspects to be analyzed Database Design in SAMBO External Racer Reasoner Executable File Racer Reasoner Configuration Jena API Usage in SAMBO Overview Architecture Structure of Optimization Approaches Class Diagram after Optimization E-R Diagram of Database after Optimization Class Diagram for Database Access Class Diagram for Indexing Optimization Detailed Class Diagram Functional Procedure Optimization Sequence Diagram of Loading Ontologies Sequence Diagram of Computation Flow Chart of Handling Suggestions Save alignment result into RDF Detailed Class Diagram of Parallel Matching Module Flow Chart of Parallel Matching Flow Chart of computing similarity Flow Chart of Loading an Ontology Flow Chart of Accessing Classes in Ontology 5.12 Flow Chart of Accessing Concept's Description Flow Chart of build relationships Main Page Login Page Ontology Type Upload Ontologies Configuration for Properties Matching Properties Matching Concepts Matching Start Configuration for Concept Matching Suggestions Align Manually Finish Matching Result Display

List of Tables

1.1 Summary results of top systems in OAEI 2014 and 2015 for largebio Data sets of largebio in OAEI 2014 and 2015 Class Description of new session based SAMBO Main String Metrics in new session based SAMBO Class Description of AML Main String Metrics in AML Main String Strategies in AML Class Description of LogMap Main String Strategies in LogMap Main String Metrics in LogMap Components in the matching strategies of SAMBO Description of sub modules Description for Classes Attributes Description for Class MOntology Attributes Description for Class MClass Attributes Description for Class Lexicon Attributes Description for Class OntManager Attributes Description for Class MergerManager Attributes Description for Class Pair Attributes Description for Class Task Attributes Description for Class SimValueConstructor Description for key tables in database during matching Description for table mappable_ontologies Description for table mappable_concepts Description for table simvalue_ed Description for table simvalue_ng Description for View simvalue_view Popular Reasoners with OWL interfaces Essential functions to be used in OWL API Description of the interfaces Operations description for MOntology Operations description for MClass Operations description for Lexicon Operations description for OntManager Operations description for MergerManager Operations description for Pair Operations description for SimValueConstructor Operations description for URITable 6.1 Testing Environment Test Case and Result of login Test Case and Result of LoadFileservlet Test Case and Result of Mainservlet Test Case and Result of Classservlet Test cases for Non-Functional Testing Time for loading ontologies Time for matching with N-gram matcher

1 Introduction

1.1 Motivation

This project comes from ADIT at IDA, Linköping University. I became interested in the topic of ontology alignment after finishing the course Advanced Data Models and Databases. After discussion with my supervisor, the project title was defined as "The design and implementation of optimization approaches for large scale ontologies in SAMBO".

During the past several decades, as the scale of data and applications has grown, information resources on the Internet have increased dramatically, enriching people's comprehension of domain knowledge. This growth also challenges traditional Internet techniques: how can desired data or information be located and accessed among thousands of scattered and disordered stores [1]? Meanwhile, current Web techniques do not reflect the increasing complexity of the Web [2]. Tim Berners-Lee, James Hendler and Ora Lassila proposed the idea of the Semantic Web in 1998 as an extension of the existing Web. The original intention of the Semantic Web is to exchange and reuse data across different or similar information systems [1]. However, the heterogeneity of information, especially of the semantic elements from different information systems, poses a new challenge [1]. To solve this problem and take semantic representation into account, ontologies were suggested as a way to represent information that is both simple and abstract [1].

During the past decade, with the development of ontologies in many areas, especially biomedical information systems [3], many ontology alignment systems have been built to match overlapping information. Overlap arises because a specific domain may have different but similar ontologies, and because the people who use ontologies may come from diverse areas with their own perspectives [4].
Although different ontology alignment systems have their own frameworks for identifying overlapping information, matching and filtering are usually necessary processes in these frameworks [5]. User involvement, however, has become a new challenge in ontology alignment [4]. The Ontology Alignment Evaluation Initiative (OAEI) is a yearly assessment event that evaluates the advantages and disadvantages of alignment systems and compares the performance of techniques [6]. In that context, user involvement, particularly validation by domain experts, has been recognized as a vital component of the alignment process [7]. The authors in [8] proposed requirements divisions that promote user involvement in large scale ontology alignment. This thesis is based on the SAMBO system, in which a session-based user involvement approach has been implemented.

1.2 Aim

SAMBO, developed at Linköping University, is an ontology alignment system that concentrates on ontologies in OWL, a web ontology language [9]. SAMBO implements a framework whose chief components are preprocessing, matching, combination and filtering [10]. In 2009 and earlier, the matching strategies in SAMBO achieved good results according to F-measure at OAEI. In 2010, another matching system, AgreementMaker, implemented a strategy that gave better results under the same evaluation method, and in 2014 the enhanced version of AgreementMaker, called AML, improved the matching result again [11]. The existing framework in SAMBO achieves good results on small ontologies. For large scale ontologies, however, it still has restrictions, such as the efficiency of its matching strategies [9], compared with AML.

Consequently, with the aim of improving the performance of the existing SAMBO system, I carry out this project, titled «The design and implementation of optimization approaches for large scale ontologies in SAMBO», as my master thesis. We achieve the optimization by redesigning the basic data structures, improving the performance of the matching strategies, and implementing new strategies and approaches in the alignment process.

SAMBO exists in different versions that serve specific functions, such as the earlier version in [9] and the session-based version in [4] and [12]. A simple version of SAMBO has been integrated into the ontology debugging and completion system RepOSE [13].

1.3 The Status of Related Research

With the development of matching strategies and matching tools, it has been recognized that large scale matching evaluation is a severe challenge [14].
Some large ontologies used for assessment involve about 1 million entities, which can prevent some ontology alignment systems, such as SAMBO, from producing satisfactory results. In addition, the authors in [7] proposed that elements such as data set characteristics, OAEI data sets, evaluation measures and evaluation processes are significant factors in evaluation design. We therefore conclude that a well-designed basic data structure for the ontology can improve the alignment process. We also assume that matching strategies with improved algorithm design can increase the performance of an ontology alignment system dealing with large scale ontologies.

Semantic Web, Ontology and Ontology Alignment

As discussed by the W3C, putting machine-understandable data on the Web is becoming a high priority for many communities. The Web can reach its full potential only if it becomes a place where data can be not only shared but also processed by automated tools as well as by people. Future programs should be able to share and process data even when they have been designed totally independently. The Semantic Web is a vision of having data on the Web defined and linked in such a way that machines can use it not just for display, but for automation, integration and reuse of data across various applications. People expect the Semantic Web to integrate data across pages and carry out sophisticated jobs for users [15].

An ontology includes the basic terms and relations that make up the vocabulary of a topic area, as well as the rules for combining terms and relations to define extensions of the vocabulary. Ontologies can be used for communication between people and organizations, enabling knowledge reuse and sharing. They can also serve as a basis for interoperability between systems, as a repository of information, and as a query model for information sources. Ontologies satisfy the Semantic Web's need for a representation of data that machines can process. There are several languages for building ontologies, such as OWL. The general format is exemplified in figure 1.1, which shows an ontology instance including class information, labels, subClassOf and other relationships. The data structure we need to optimize should integrate these elements of the ontology and store an index of the lexical information.

An ontology is made up of the following components [9]: concepts, which represent a set or a class of entities; relations, which describe the characteristics of concepts; and instances, which stand for actual entities as individuals. Axioms represent the knowledge of a specific domain in terms of concepts and relations. A case of an ontology definition is given in figure 1.1. The ontology in figure 1.1 includes concept information typed by owl:Class, with an identifier typed by rdf:ID. The information typed by rdfs:label and the relation typed by rdfs:subClassOf are essential in the alignment.

Figure 1.1: Example of General format of OWL file [16]

As more and more people with their own understanding of specific domains become involved in building ontologies, different ontologies in the same or similar domains contain a large amount of overlapping information. The purpose of ontology alignment is to detect the correspondences between concepts or relations from different ontologies. Figure 1.2 shows a case of different ontologies with overlapping information such as equivalent classes, concepts and relationships. In figure 1.2, the concepts B-cell activation and T-cell activation in Gene Ontology are equal to B Cell Activation and T Cell Activation in Signal Ontology respectively.
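The elements the text singles out (owl:Class, rdf:ID, rdfs:label and rdfs:subClassOf) can be read from an RDF/XML file with a few lines of code. The following is a minimal sketch, not SAMBO's loader: the two-class ontology in `SAMPLE_OWL`, the function name `read_classes` and the shape of the returned dictionaries are assumptions made for illustration. It also builds a small label index, in the spirit of the lexical indexing mentioned above.

```python
import xml.etree.ElementTree as ET

# Standard namespace URIs for RDF, RDFS and OWL.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
OWL = "http://www.w3.org/2002/07/owl#"

# Hypothetical two-class ontology, invented for this illustration only.
SAMPLE_OWL = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Class rdf:ID="cell_activation">
    <rdfs:label>cell activation</rdfs:label>
  </owl:Class>
  <owl:Class rdf:ID="b_cell_activation">
    <rdfs:label>B-cell activation</rdfs:label>
    <rdfs:subClassOf rdf:resource="#cell_activation"/>
  </owl:Class>
</rdf:RDF>
"""


def read_classes(owl_text):
    """Return (classes, label_index) for the named classes in owl_text.

    classes maps a class ID to its labels and parent IDs.
    label_index maps a lower-cased label to the IDs that carry it.
    """
    root = ET.fromstring(owl_text.strip())
    classes = {}
    label_index = {}
    for node in root.findall(f"{{{OWL}}}Class"):
        class_id = node.get(f"{{{RDF}}}ID")
        if class_id is None:
            continue  # skip anonymous classes in this sketch
        labels = [l.text for l in node.findall(f"{{{RDFS}}}label") if l.text]
        parents = []
        for parent in node.findall(f"{{{RDFS}}}subClassOf"):
            resource = parent.get(f"{{{RDF}}}resource")
            if resource is not None:
                parents.append(resource.lstrip("#"))
        classes[class_id] = {"labels": labels, "parents": parents}
        for label in labels:
            label_index.setdefault(label.lower(), []).append(class_id)
    return classes, label_index


classes, label_index = read_classes(SAMPLE_OWL)
print(classes["b_cell_activation"])
print(label_index["b-cell activation"])
```

Keeping the label index separate from the class records means a matcher can look up a name without walking every class, which is the kind of access pattern large ontologies call for.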

Figure 1.2: Scenario of Ontology Alignment [17]

Evaluation of Ontology Alignment Systems Methods

OAEI, the Ontology Alignment Evaluation Initiative, is a yearly assessment event with published tests and results. The data set used for the evaluation is a significant factor in OAEI. It should consist of well-designed ontologies that include meaningful overlapping information. There are some basic data sets, such as the OAEI systematic benchmark suite, which includes 51 different ontologies [18], large scale ontology sets (UMLS, FMA), directory sets and thesauri [18]. As of 2015, the large biomedical ontologies consist of FMA (Foundational Model of Anatomy), NCI (National Cancer Institute Thesaurus) and SNOMED (Systematized Nomenclature of Medicine), which are semantically rich and contain tens of thousands of entities [6].

To assess an ontology alignment tested on such a data set, there are some basic indexes from information retrieval, such as precision, recall, fallout, missing and F-measure. The general matching result is given in figure 1.3. Set A represents all the alignment results produced by the system, while set R is the relevant overlapping information in the source and target ontologies. As a result, we can define set A-R as false positives, the intersection of A and R as true positives, set R-A as false negatives and D as true negatives. This is a category of compliance measures.

The calculation of precision is defined in equation 1.1. Precision is the ratio of the number of relevant results found to the total number of results found.

\[ P(A, R) = \frac{|R \cap A|}{|A|} \tag{1.1} \]

The calculation of recall is defined in equation 1.2. Recall is the ratio of the number of relevant results found to the total number of relevant results.

\[ R(A, R) = \frac{|R \cap A|}{|R|} \tag{1.2} \]
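As a worked illustration of these compliance measures, the sketch below computes precision and recall from a found alignment A and a reference alignment R, both represented as sets of (source, target) pairs, together with their harmonic mean. The function name `compliance` and the example mappings are assumptions made for illustration; the zero-division guards are a convention chosen here, not something the source prescribes.

```python
def compliance(found, reference):
    """Return (precision, recall, f1) for two sets of mappings."""
    found, reference = set(found), set(reference)
    true_positives = len(found & reference)          # |R ∩ A|
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical alignment A (3 mappings) and reference R (4 mappings).
A = {("a", "x"), ("b", "y"), ("c", "z")}
R = {("a", "x"), ("b", "y"), ("d", "w"), ("e", "v")}

p, r, f1 = compliance(A, R)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

With two of three found mappings correct and two of four reference mappings found, precision is 2/3, recall is 1/2, and their harmonic mean is 4/7.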

Figure 1.3: Division of Matching Result

The calculation of the F1-measure is defined in equation 1.3. The F1-measure is a special case of the F-measure that weights precision and recall equally: F1 is the harmonic mean of precision and recall.

\[ F_1 = \frac{2 \cdot P(A, R) \cdot R(A, R)}{P(A, R) + R(A, R)} \tag{1.3} \]

The performance measures in the evaluation depend on the environment. Some criteria are listed below [18].

Speed. Different ontology alignment systems should be assessed in the same environment, with the same processor and the same memory consumption. If there is user involvement in the alignment system, only the processing time of the matching algorithms is measured.

Network. Some alignment systems use the network to complete the matching, so the measurement is limited by bandwidth and throughput.

Memory. This measures the extra memory required by the ontology management system.

The user involvement measures generally include the criteria below [8, 18, 19].

Level of user input effort. This is not easy to measure, since user interactions are random and have many dependencies, such as the user's comprehension of the specific domain.

Oracle-based measures. Some ontology alignment systems require access to an oracle in their matching strategies. As a result, the evaluation of oracle queries can also serve as an assessment of the matching algorithms.

The evaluation of the Large BioMed track, which is representative of large scale ontologies, stresses the performance of matching systems as well as the creation of an error-free reference alignment and mapping repair systems [20]. Furthermore, alignment coherence is an evaluated measure together with precision, recall, F-measure and run time [20]. This measure assesses the unsatisfiable classes found when reasoning with the input ontologies. In conclusion, the evaluation of alignment systems must consider availability and the ability to produce a correct alignment in a desirable time [18]. Meanwhile, OAEI, as a reference assessment exercise for ontology alignment, provides benchmark suites on which alignment systems can be trained [20].

OAEI

Since the ability to deal with large scale ontologies is the goal of optimizing the existing SAMBO system, we take into account the OAEI results of the top systems of the recent 3 years in the Large BioMed Track, also called largebio. The summary result is shown in table 1.1. The principal indexes are time, precision, recall, F-measure and incoherence.

Table 1.1: Summary results of top systems in OAEI 2014 and 2015 for largebio [20, 21]
Columns: Time (s), Precision, Recall, F-measure, Incoherence
Systems: AML, LogMap, LogMap-Bio, XMap (incoherence 15.8%), LogMap-C, LogMapLite (incoherence 33.9%), XMAP-BK, RSDLWB

Largebio comprises three categories of problems, FMA-NCI matching, FMA-SNOMED matching and SNOMED-NCI matching, from OAEI 2014 [21] and 2015 [20]. The specific matching tasks of OAEI 2014 and 2015 are listed in table 1.2.

Matching problem | Task | Source input | Target input
FMA-NCI matching | FMA-NCI small fragments | 3696 (5%) | 6488 (10%)
FMA-NCI matching | FMA-NCI whole ontologies | … | …
FMA-SNOMED matching | FMA-SNOMED small fragments | 10157 (13%) | 13412 (5%)
FMA-SNOMED matching | FMA whole ontology with SNOMED large fragment | … | … (40%)
SNOMED-NCI matching | SNOMED-NCI small fragments | 51128 (17%) | 23958 (36%)
SNOMED-NCI matching | NCI whole ontology with SNOMED large fragment | … (40%) | …

Table 1.2: Data sets of largebio in OAEI 2014 and 2015 [20, 21]

1.4 Outline of The Thesis

Main Content

Our primary task in this project is to strengthen the performance of the existing SAMBO system, including optimizing the basic data structures and the matching algorithms. In detail, the optimization of the basic data structures consists of implementing an indexing mechanism. The optimization of the matching algorithms consists of modifications that take the optimized data structures into account.

The thesis addresses the following research question: How to optimize SAMBO for large scale ontologies? To answer the research question we pursue three specific objectives:

To optimize the data structure and database design for better data storage and access;
To optimize the business logic to save time;
To optimize the code for future extension.

Study Method

The first step of this thesis is a pre-study of relevant topics such as the Semantic Web, ontology alignment and OAEI. The purpose is to acquire the background knowledge needed to understand the existing SAMBO system comprehensively. The second step is to analyze top alignment systems such as AML and LogMap to understand why they perform better, chiefly with regard to time. Analyzing the data structures they implement and their behavior at run time is essential to understanding why they are top alignment systems for large scale ontologies. We then compare the existing SAMBO system with AML and LogMap, identify the aspects of SAMBO to be improved, present a design of optimized approaches and implement them. The next step is to test the new system and analyze the results with respect to the non-functional requirements. In the end, a discussion and a conclusion are presented. In the discussion part, we discuss the method and the result and reflect on the system in a wider context. In the conclusion part, we summarize what we achieved for the system and present several directions for future research and work.

This thesis work is supervised by Kristian Sandahl <kristian.sandahl@liu.se> and Patrick Lambrix <patrick.lambrix@liu.se> from Linköping University. Ting He <xuantinghe@hit.edu.cn> also provides supervision during the process. In addition, Ph.D. student Valentina Ivanova <valentina.ivanova@liu.se> from Linköping University provides help related to the experimental environment and some other aspects.

Constraints and Limitations

Because it builds on the existing SAMBO system, this thesis work has some constraints and limitations. For the final test and evaluation of the optimized system, we apply the test cases that were used in the evaluation of the existing session-based SAMBO, together with fragments of large scale ontologies. We do not make big changes to the existing strategies, since the primary task now is to improve the system for large scale ontologies. We focus on optimizing the data structures to provide an indexing mechanism, which is a significant requirement for large scale ontology alignment. SAMBO has different versions, each with its specific functions. Our focus is the new SAMBO system with the session mechanism.

Organization of Thesis

This thesis starts with an introduction to the background and purpose of the project, as well as the status of relevant research. Chapter 2 presents the background of the top ontology alignment systems and SAMBO. Chapter 3 describes the requirements of this project. Chapter 4 gives the overview design, including the architecture and the division into functional modules, together with a description of the key techniques. The detailed design and the key development and test steps are presented, with flow charts among other means, in chapter 5 and chapter 6. Chapter 7 discusses the evaluation of the result. Finally, Chapter 8 concludes whether the result is as expected.

2 Background

2.1 Ontology Alignment System-SAMBO

Overview Design and Implementation

SAMBO is an ontology alignment system that concentrates on ontologies represented in OWL, a web ontology language [9]. Its general framework is based on computing similarities between terms. The existing matching strategies are linguistic, instance-based and structure-based, strategies that use auxiliary information, and combinations of these. The terminological algorithms include NGram and Edit-Distance [9]. The structural analysis relies on the is-a or part-of hierarchies of the concepts to be matched [9]. The basic framework is shown in figure 2.1. The alignment algorithm consists of several matchers that operate on the source and target ontologies, using auxiliary information such as instance corpora, general dictionaries and domain thesauri. The following steps combine and filter the matching results to generate mapping suggestions, which can guide the next match.

Figure 2.1: Framework in old SAMBO [9]
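The combining and filtering steps can be pictured with a small sketch. The text only states that matcher results are combined and then filtered into mapping suggestions; the normalized weighted sum and the single-threshold filter used below are assumptions chosen as one common way to do this, and the names `combine`, `filter_suggestions` and the matcher keys are hypothetical.

```python
def combine(similarities, weights):
    """Merge per-matcher scores into one score per concept pair.

    similarities: {matcher_name: {(source_id, target_id): value in [0, 1]}}
    weights:      {matcher_name: non-negative weight}
    A pair missing from a matcher contributes 0 for that matcher.
    """
    total_weight = sum(weights[name] for name in similarities)
    combined = {}
    for name, scores in similarities.items():
        for pair, value in scores.items():
            combined[pair] = combined.get(pair, 0.0) + weights[name] * value
    return {pair: value / total_weight for pair, value in combined.items()}


def filter_suggestions(combined, threshold):
    """Keep pairs whose combined similarity reaches the threshold."""
    return sorted(pair for pair, value in combined.items() if value >= threshold)


# Hypothetical scores from two matchers for two candidate pairs.
similarities = {
    "ngram": {("a", "x"): 0.9, ("b", "y"): 0.4},
    "edit_distance": {("a", "x"): 0.7, ("b", "y"): 0.2},
}
weights = {"ngram": 1.0, "edit_distance": 1.0}

combined = combine(similarities, weights)
print(filter_suggestions(combined, threshold=0.6))
```

With equal weights the pair (a, x) scores 0.8 and passes the 0.6 threshold, while (b, y) scores 0.3 and is dropped.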


Moreover, to take user involvement in ontology alignment systems into account, the authors in [12] proposed a new session-based framework with validation sessions and recommendation sessions that permit users to interrupt the alignment process. This technique aims to address alignment problems that result from a lack of user support, such as missing background knowledge and personal judgment. These are essential for making the alignment system more effective and efficient. The authors implemented the session-based approach by extending SAMBO. The new framework is shown in figure 2.2.

Figure 2.2: Framework in new session based SAMBO [12]

Figure 2.3 shows the choice of matchers. The user can also define the weight and threshold. This is the essential configuration for alignment. When the user clicks the start button, SAMBO starts the matching process.

Figure 2.3: Configuration for Combining and Filtering [12]

Figure 2.4 shows a mapping suggestion after the matching. The user can define the relationship as equivalent, sub or super, and add a comment or a name for the suggestion.

Main Data Structures and Matching Strategies

The primary data structures of the new session-based SAMBO are shown in figure 2.5 and described in table 2.1. MClass has an attribute of type OntClass, the Jena API's model of an ontology class. Jena implements the RDF graph, which is the recommendation for the Semantic Web [22]. All the lexical information and restrictions are included in the class. Table 2.2 lists the main string metrics in the new session-based SAMBO. SAMBO uses the widely used Edit-Distance, NGram and Porter Stemming methods. Some of the matching strategies of SAMBO depend chiefly on combinations of these string metrics and external resources such as WordNet and UMLS.
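The two string metrics named above can be sketched as follows. This is an illustrative sketch, not SAMBO's code: turning the edit distance into a similarity by dividing by the longer string's length, and comparing n-gram sets with the Dice coefficient, are assumptions made here, since the source does not state which normalization or set similarity SAMBO applies.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and replacements (Levenshtein)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # replacement or match
        previous = current
    return previous[-1]


def edit_similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def ngrams(text, n=2):
    """Set of character n-grams of text (2-grams by default)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_similarity(a, b, n=2):
    """Assumed set similarity: Dice coefficient over the n-gram sets."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a or not grams_b:
        return 1.0 if a == b else 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


print(edit_distance("kitten", "sitting"))
print(edit_similarity("b-cell activation", "b cell activation"))
print(ngram_similarity("cell", "cells"))
```

Both functions compare one pair of strings at a time, so applying them to every pair of concepts grows with the product of the two ontology sizes. That cost is what makes indexing attractive for large ontologies.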

Figure 2.4: Mapping Suggestion [12]

Figure 2.5: Main data structures in new session based SAMBO

Class Name | Description
MElement | The abstraction of class and property in an ontology.
MClass | Describes a concept in the ontology. OntClass is the Jena API interface for operating on concepts.
MProperty | Describes a property in the ontology. OntProperty is the Jena API interface for operating on properties.
MOntology | An object that stores the classes and properties of an ontology, defined as OrderedMap in the Java API. OntModel is the model of an ontology in the Jena API.

Table 2.1: Class Description of new session based SAMBO

Main String Metrics | Description
Edit-Distance | Quantifies how different two strings are by computing the minimum number of transform operations from one string to the other. The operations are insertions, deletions and replacements.
NGram | Converts each string into a set of n-grams (2-grams in SAMBO), then applies another similarity measure to the sets.
Porter Stemming | A linguistic method that eliminates grammatical differences such as verb tense and plurals.

Table 2.2: Main String Metrics in new session based SAMBO

2.2 Ontology Alignment System-AML

Overview Design and Implementation

AML, short for AgreementMakerLight, is an open source ontology alignment system. Its first version is AgreementMaker. The framework of AgreementMaker is flexible and extensible because of its level-based matchers [23]. The first level uses the characteristics of concepts in the ontologies, such as comments and instances, to compute similarities. The second level calculates the similarities of the structural properties and thereby obtains the relationships from the structural computation. The third level combines and filters the results of the first two levels. The architecture is shown in figure 2.6.

Figure 2.6: Architecture of AgreementMaker [23]



The authors in [11] took large scale ontologies into account and proposed a new framework called AgreementMakerLight. It primarily divides the system into a loading module and a matching module [11]. The schema of the loading module is shown in figure 2.7. The ontology loading module includes data structures such as Lexicon and RelationshipMap that represent the principal elements of ontology objects, as well as a data structure called alignment that stores the mapping results.

Figure 2.7: Schema of ontology loading module in AML [11]

The schema of the ontology matching module is shown in figure 2.8. The matchers in the schema are the matching algorithms, divided into primary matchers and secondary matchers according to their efficiency [11]. The primary matchers are those that can match large scale ontologies, because they rely on hash map cross-searches that ensure O(n) time [11, 23]. The selectors are similar to the filters in SAMBO: they keep the accepted mappings according to thresholds.

Figure 2.8: Schema of ontology matching module in AML [11]

Figure 2.9 gives an example of how different matchers are used in AML. Default weight settings exist in the code, and AML also has some filter settings. When the user clicks on the match button, AML starts matching.
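The hash map cross-search behind the primary matchers can be illustrated with a minimal sketch. It is not AML's code: the class name `HashLexicalMatcher`, the `normalize` rule and the integer class ids are assumptions made for illustration. The point is only that indexing one ontology's names once and probing with the other's names takes time linear in the number of names, instead of comparing every pair.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: exact-name matching between two ontologies via a hash map.
class HashLexicalMatcher {

    // Assumed normalization: ignore case, surrounding spaces and underscores.
    static String normalize(String name) {
        return name.trim().toLowerCase().replace('_', ' ');
    }

    // Returns pairs {sourceId, targetId} whose normalized names are equal.
    static List<int[]> match(Map<Integer, String> source, Map<Integer, String> target) {
        // Index the source names once: normalized name -> ids of classes with that name.
        Map<String, List<Integer>> index = new HashMap<>();
        for (Map.Entry<Integer, String> e : source.entrySet()) {
            index.computeIfAbsent(normalize(e.getValue()), k -> new ArrayList<>())
                 .add(e.getKey());
        }
        // Probe the index with each target name; a probe is O(1) on average.
        List<int[]> mappings = new ArrayList<>();
        for (Map.Entry<Integer, String> e : target.entrySet()) {
            List<Integer> hits = index.get(normalize(e.getValue()));
            if (hits != null) {
                for (int sourceId : hits) {
                    mappings.add(new int[] {sourceId, e.getKey()});
                }
            }
        }
        return mappings;
    }

    public static void main(String[] args) {
        Map<Integer, String> source = new HashMap<>();
        source.put(1, "Heart");
        source.put(2, "Left_Lung");
        Map<Integer, String> target = new HashMap<>();
        target.put(10, "heart");
        target.put(11, "left lung");
        target.put(12, "Kidney");
        for (int[] m : match(source, target)) {
            System.out.println(m[0] + " -> " + m[1]);
        }
    }
}
```

A secondary matcher, by contrast, would typically compare candidate pairs with a string metric, which is why it is applied to fewer pairs.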



Figure 2.9: Configuration in AML [24]

Figure 2.10 shows the alignment suggestions after matching. The user can mark each suggestion as correct or incorrect. The suggestions shown are those that have been matched and have passed the filters and the similarity threshold. The similarity of each suggestion is displayed, and the suggestions can be sorted by similarity.

Figure 2.10: Validation in AML [24]

2.2.2 Main Data Structures and Matching Strategies

One characteristic of AML is that it separates the lexical information of an ontology into a dictionary. For example, the class Lexicon is the storage for the local names of the classes in an ontology, while class WordLexicon stores the relevant computing parameters. The overview class diagram of AML is shown in figure 2.11.

Figure 2.11: Class Diagram in AML [24].

The primary data structures are described in table 2.3. They are class Ontology, which is the instance of an ontology; Lexicon, which holds the lexical data of the concepts in the ontology; class Provenance, which is a record in Lexicon; and WordLexicon, which holds the words that occur in concept names.

Class Name | Description
Ontology | Stores classes and Lexicon.
Lexicon | Stores lexical information of classes, properties and maps in the ontology. It has the map of class names and Provenances.
Provenance | The names of classes, such as those typed by rdfs:label, local names and synonyms, with different default weights.
WordLexicon | Stores information such as word weight and name weight.
Table 2.3: Class Description of AML

Table 2.4 lists the primary string metrics in AML: ISub, Jaro-Winkler and Edit-Distance. AML matches strings not only with these individual metrics but also with a combined algorithm built on them.

Main String Metrics | Description
ISub | ISub treats the similarity of two strings as a function of both their commonality and their difference. ISub also considers the longest common prefix, as in Winkler.
JW (Jaro-Winkler) | JW is a variant of Jaro, which considers the number and order of common characters; Winkler adds the longest common prefix, giving higher similarity to strings with a long common prefix.
Edit-Distance | Quantifies how different two strings are by computing the minimum number of transform operations needed to turn one string into the other.
Table 2.4: Main String Metrics in AML

Table 2.5 indicates the chief matching strategies in AML: the lexical matcher, the word matcher and the string matcher. All matching strategies in AML depend on computing string similarity, and AML defines the evidence content, a function of frequency [23], of each class, each name of each class and each word in a name.

Matching Strategies | Description
Lexical Matcher | Computes the similarity of two classes from their names, using the name weight.
Word Matcher | Computes the similarity of two classes from the words in their names.
String Matcher | Computes similarity with popular algorithms such as NGram and Edit-Distance.
Table 2.5: Main String Strategies in AML

2.3 Ontology Alignment System-LogMap

Overview Design and Implementation

LogMap is another ontology alignment system, developed by the University of Oxford, that can deal with large scale ontologies. LogMap is considered a highly scalable system with reasoning and diagnosis abilities [25, 26]. The principal processes of LogMap are shown in figure 2.12. The lexical indexation is responsible for indexing the classes in the ontology [26]. The structural indexation represents the extended class hierarchy of the input ontologies in a form that is easy to use in later computation [26]. The indexation is an advantage of LogMap compared with SAMBO.

Figure 2.12: Main processes in LogMap [26].

The repair step is also a feature of LogMap: reasoning algorithms detect unsatisfiable classes that arise from the mapping results together with the input ontologies. These undesirable results are then repaired by a greedy algorithm [26]. The authors later proposed LogMap2, an updated version of LogMap whose new feature is user interaction during the alignment process [25].

Main Data Structures and Matching Strategies

The overview class diagram of LogMap is shown in figure 2.13 and described in table 2.6. One characteristic of LogMap is that it implements an inverted file index using multiple maps. The key in a HashMap is a set of strings representing the entries, while the value can be either a class id or a set of class ids.
As figure 2.13 shows, taking class OntologyProcessing as an example, the HashMap structure is used extensively to implement the inverted file index, which is the basic indexing mechanism in LogMap. As table 2.6 indicates, the data structures in LogMap are all defined as indexes, such as EntityIndex, ClassIndex and PropertyIndex. IndexManager is a control class that manages the indexes, while class OntologyProcessing is active during the whole alignment process.
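The inverted file index described above, with a set of strings as key and a set of class ids as value, can be sketched as follows. This is an illustrative sketch and not LogMap's implementation: the class `InvertedLabelIndex`, the word-splitting rule and the method names are assumptions.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of an inverted index from label entries to class ids.
class InvertedLabelIndex {

    // Key: the set of words in a label. Value: ids of all classes sharing that entry.
    private final Map<Set<String>, Set<Integer>> index = new HashMap<>();

    // Assumed entry rule: lower-case the label and split it on spaces and underscores.
    static Set<String> entry(String label) {
        return new TreeSet<>(Arrays.asList(label.trim().toLowerCase().split("[\\s_]+")));
    }

    void add(int classId, String label) {
        index.computeIfAbsent(entry(label), k -> new HashSet<>()).add(classId);
    }

    // Returns the ids of the classes whose label has the same word set.
    Set<Integer> lookup(String label) {
        return index.getOrDefault(entry(label), new HashSet<>());
    }

    // Entries present in both indexes, i.e. the lexical intersection of two ontologies.
    Set<Set<String>> sharedEntries(InvertedLabelIndex other) {
        Set<Set<String>> shared = new HashSet<>(index.keySet());
        shared.retainAll(other.index.keySet());
        return shared;
    }

    public static void main(String[] args) {
        InvertedLabelIndex source = new InvertedLabelIndex();
        source.add(1, "Left Lung");
        source.add(2, "Heart");
        InvertedLabelIndex target = new InvertedLabelIndex();
        target.add(7, "lung_left");
        System.out.println(source.sharedEntries(target));
    }
}
```

Because the key is a set, word order is ignored in this sketch, so "Left Lung" and "lung_left" fall under the same entry.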

Figure 2.13: Class Diagram in LogMap [27]

Class Name | Description
EntityIndex | An abstraction of ClassIndex and PropertyIndex.
ClassIndex | The description of a class in the ontology, including relevant information such as its labels, roots and direct sub and super classes.
PropertyIndex | The description of a property in the ontology, including the domain and range of data properties and object properties.
IndexManager | Primarily stores the maps of classes and properties, and an OWLDataFactory object from OWL API that is used to create the ontology model.
OntologyProcessing | As the matching strategies in LogMap depend on computing the lexical intersections of the source and target ontologies, OntologyProcessing stores the different maps from lexical information to classes.
Table 2.6: Class Description of LogMap

Table 2.7 indicates the primary string metric in LogMap, which is only the ISub method. As claimed in [26], LogMap emphasizes anchor mappings in its design and implementation. To generate the anchor mappings, LogMap uses the inverted file index to analyze the source and target ontologies and constructs an initial data set of candidate anchors from the intersections of source and target. Finally, LogMap uses the string metric ISub [26] to generate the final anchor mappings.

Main String Metrics | Description
ISub | ISub treats the similarity of two strings as a function of both their commonality and their difference. ISub also considers the longest common prefix, as in Winkler.
Table 2.7: Main String Strategies in LogMap

2.4 Comparison of Three Ontology Alignment Systems

Based on the studies of SAMBO, AML and LogMap, we compared them with respect to data structures, string metrics, matching strategies and the APIs they call in their implementations. Table 2.8 lists the result of the comparison, which is one input for deciding how to enhance SAMBO. In terms of string metrics, Edit-Distance, NGram and QGram are based on string distance. QGram is the sum of absolute differences between the NGram vectors of two strings [28, 29]. The ISub method considers both the commonality and the difference between two strings [30]. For implementation, OWL API supports the OWL 2 standard, in which ontologies are built from entities, such as classes and properties identified by IRIs, expressions, and axioms, which are statements asserted to be true in the domain being described [31]. IRI is short for Internationalized Resource Identifier; unlike a URI, an IRI may contain Unicode characters.

Aspect | SAMBO | AML | LogMap
Data Structures | OrderedMap MClass | Multihashmap and separated lexical data | Multihashmap and separated lexical data
String Metrics | Edit-Distance, NGram, Porter Stemming | Edit-Distance, ISub, QGram, JW | ISub
Matching Strategies based on String Metrics | Combination of String Metrics | Combination of String Metrics | ISub together with inverted file index
Implementation | Jena API | OWL API | OWL API
Reasoning | Racer | ELK | ELK, HermiT, Pellet and MORe
Table 2.8: Main String Metrics in LogMap
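The QGram definition given in this section, the sum of absolute differences between the n-gram vectors of two strings, can be sketched as below. The class and method names are hypothetical, and the sketch assumes no padding characters are added at the string boundaries, which some implementations do.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the q-gram distance between two strings.
class QGramDistance {

    // Counts every substring of length q, giving the n-gram vector of s.
    static Map<String, Integer> profile(String s, int q) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + q <= s.length(); i++) {
            counts.merge(s.substring(i, i + q), 1, Integer::sum);
        }
        return counts;
    }

    // Sums |count_a(g) - count_b(g)| over every q-gram g seen in either string.
    static int distance(String a, String b, int q) {
        Map<String, Integer> pa = profile(a, q);
        Map<String, Integer> pb = profile(b, q);
        Set<String> grams = new HashSet<>(pa.keySet());
        grams.addAll(pb.keySet());
        int sum = 0;
        for (String g : grams) {
            sum += Math.abs(pa.getOrDefault(g, 0) - pb.getOrDefault(g, 0));
        }
        return sum;
    }

    public static void main(String[] args) {
        // "heart" and "hearts" differ only in the 2-gram "ts".
        System.out.println(distance("heart", "hearts", 2));
    }
}
```

A distance of 0 means the two strings have identical n-gram vectors; larger values indicate more q-grams that occur in one string but not the other.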

3 Requirements Analysis

This chapter presents the requirements for the optimization approaches in SAMBO. It is organized as follows. In Section 3.1 we define the goal of the optimization approaches. In Section 3.2 we analyze SAMBO in detail to determine which aspects need to be improved. Finally, in Section 3.3 we present the requirements.

3.1 The Goal of Optimization Approaches

The concept of ontology is tied to the semantic web, whose goal is to let machines locate information automatically through syntax and semantics. Ontologies are defined to represent such information in different domains. Aligning ontologies and obtaining their overlapping information helps build an effective and efficient semantic web. SAMBO already implements some approaches to align ontologies. Nevertheless, it needs to be improved to align large scale ontologies. We expect the optimized system to handle large scale ontology alignment with better scalability.

3.2 Detailed Analysis on SAMBO

Given the goal of the optimization approaches, the chief non-functional requirements are the time and the ability to handle large scale ontologies that include thousands of concepts. To fulfill the goal, we first analyze the existing SAMBO system to find the aspects in which it can be enhanced, then redesign the parts to be upgraded, and finally implement them. Accordingly, we divide the analysis into two parts. The first part concerns the business logic of the existing system, that is, its design; the second part concerns the code, that is, its implementation. The key points of each part are shown in figure 3.1. We thus separate the analysis into a business logic level and a code level.
In the business logic part, we analyze the data structures, the local database and the algorithms, while in the code analysis we focus on the runtime behavior and the defects of the existing code.


Figure 3.1: Aspects to be analyzed

3.2.1 Data Structures and Data Storage

Data Structures. The primary data structures of SAMBO are given in figure 2.5. Class MOntology has two attributes, classes and properties, both of type OrderedMap. A potential problem is that SAMBO does not store lexical information such as labels separately, so there is no index for these kinds of data. The existing system only has maps for classes and properties.

Data Storage. Since the existing SAMBO is designed and implemented as a session-enabled framework, local database storage is essential. We analyze the logical design of the database. Figure 3.2 shows the database design in the existing SAMBO system. Its two tables are chiefly used during the matching stage. In table savesimvalues, each record represents a mapping between concepts of the source and target ontologies under the different types of matchers. A column is added to this table during matching. Field ontologies in table savesimvalues is a string that consists of the names of the source and target ontologies, while concept1 and concept2 are the names of the concepts. As figure 3.2 shows, table savesimvalues and table resultsforcombination have primary keys defined on types such as varchar(500). If an alignment result already exists in the local database and the system is used for a new computation in a later iteration, the query for the existing result increases the time consumption.

Algorithms Analysis

Matching Strategies.


Figure 3.2: Database Design in SAMBO

As figure 2.3 indicates, there are primarily six text based matching strategies in SAMBO. Yet the descriptions of the matchers shown to users may cause ambiguity. The chief computing components of each strategy are listed in table 3.1. The names of the matching strategies are taken from the interface of SAMBO.

Matching Strategy in SAMBO | Computing Components
Edit-Distance | EditDistance Algorithm.
NGram | NGram Algorithm.
WL | Combination of EditDistance and NGram.
WN | Combination of WordNet, EditDistance and NGram.
TermBasic | Combination of EditDistance and NGram.
TermWN | Combination of WordNet, EditDistance and NGram.
Table 3.1: Components in the matching strategies of SAMBO

As table 3.1 shows, TermBasic is the same as WL and TermWN is the same as WN. This duplication is a weakness in both the user interface and the functional flows of the program.

Reasoning. SAMBO uses the Racer reasoner for its reasoning function. This small Racer application was designed in 2004, while reasoners have improved considerably in interface and performance over the last decade. The existing SAMBO system calls an external Racer executable (exe) file and sets its parameters in the configuration files. Figure 3.3 shows the external executable file and figure 3.4 shows the parameter configuration.

Figure 3.3: External Racer Reasoner Executable File

Figure 3.4: Racer Reasoner Configuration

Figure 3.3 shows the version of the Racer reasoner used in SAMBO. Its updated version is 2.0, released in 2014 [17]. More importantly, the updated version has an OWL API interface [32, 33]. The use of the reasoner is therefore an aspect to be optimized.

Dynamic Code Analysis

In the dynamic code analysis, we tested the existing SAMBO system with two ontologies, one from FMA and one from NCI. SAMBO cannot respond in a short time when it loads such large scale ontologies. The problem comes from SAMBO using Jena API to load and analyze the ontology files. Jena API is old, whereas the popular API for analyzing ontologies is OWL API, whose reference implementation has been verified to provide an efficient in-memory storage solution [34]. Figure 3.5 shows the Jena API calls in SAMBO during the loading of ontologies. The method createOntologyModel returns the ontology as a Model in memory.

Figure 3.5: Jena API Usage in SAMBO

We then tested SAMBO and AML with oaei2014_fma_whole_ontology.owl containing classes and 54 properties and oaei2014_nci_whole_ontology.owl containing classes and 190 properties. AML loads the first one in 10 seconds and the second one in 11 seconds, while SAMBO takes more than 400 seconds. AML uses OWL API to load ontologies, so this result indicates that OWL API loads these ontologies much faster than Jena API as used in SAMBO. More importantly, OWL API implements new interfaces for the OWL 2 standard and tracks the standard closely. This should make the system easy to extend in the future if the application needs to represent more information.

Static Code Analysis

In this part, we went through the existing code in SAMBO and found an imported package called com.objectspace.jgl, from which SAMBO uses the data structures OrderedMap and Array. This package is a generic collection library for Java.
JGL is an early collection library for Java designed to be consistent with the C++ Standard Template Library [35]; its newest version is JGL 3.1. Since it is old and no longer updated, it is better to use the current Java collections framework. OrderedMap and Array can be replaced by HashMap and ArrayList in package java.util.
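The replacement of the JGL containers by java.util collections can be sketched as follows. The sketch rests on an assumption about JGL that the text does not state: that OrderedMap iterates its keys in sorted order, as the STL map does. Under that assumption, TreeMap preserves the iteration order, whereas HashMap gives faster lookups without any ordering guarantee. The class name and the example.org URIs are placeholders.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of java.util stand-ins for JGL's OrderedMap and Array.
class CollectionsMigration {

    // Returns a sorted map when key order matters, otherwise a hash map.
    static Map<String, String> newClassMap(boolean sorted) {
        if (sorted) {
            return new TreeMap<>();
        }
        return new HashMap<>();
    }

    public static void main(String[] args) {
        Map<String, String> classes = newClassMap(true);
        classes.put("Lung", "http://example.org/onto#Lung");
        classes.put("Heart", "http://example.org/onto#Heart");

        // ArrayList plays the role of a growable array of class names.
        List<String> names = new ArrayList<>(classes.keySet());
        System.out.println(names);
    }
}
```

If no code depends on the iteration order of the map, HashMap is the simpler choice and matches the replacement named in the text.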

3.3 Requirements of Optimization Approaches

Functional Requirements

After the detailed analysis of the SAMBO system, we define functional requirements covering the following aspects.

Data structure optimization. In the existing SAMBO system, four basic data structures, MClass, MElement, MOntology and MProperty, represent the ontologies. These data structures are not well integrated and provide no indexing function. Meanwhile, the matching algorithms and other functions use the four data structures separately. For instance, the matching algorithms that calculate the similarity of two elements use only the MElement data structure.

Database design optimization. As our analysis showed, the database design in some ways does not satisfy the standards of database design. It should be redesigned to improve storage and query performance. For example, it dynamically creates the column that represents a matcher's result, which is not suitable for data maintenance and management.

Indexing. We need well-designed data structures, such as HashMap in Java or generic types, to store the basic data of the ontologies as well as some mapping relationships, and such indexing functions need to be implemented. More importantly, as the URI is the unique identifier of a resource, it is better to use the URI as the index for both classes and properties in ontologies.

Reasoning. As reasoning techniques have developed, many reasoners now provide programming interfaces. It is therefore preferable to invoke the reasoning function in SAMBO through API calls rather than through an external reasoner accessed over the network. Since Racer has also been updated in recent years, it supports interfaces that allow a better and more reasonable use than the one implemented in the existing SAMBO.

Use of OWL API.
The first version of SAMBO was implemented at a time when Jena API was developed by HP Labs. With the development of the semantic web, the functions of the API have become stronger and many new features have been added. Apache took over the maintenance of Jena API, and many function calls and names in it have changed. The use of Jena API in SAMBO is therefore old and can cause ambiguity as long as it is not updated. It is required to move the system to OWL API.

Non-Functional Requirements

The non-functional requirements follow from the evaluation carried out by OAEI.

Time. This is the principal non-functional requirement. The time that the optimized SAMBO system needs must not be longer than that of the existing SAMBO system. The final goal is a time that reaches or is close to that of current top systems such as AML and LogMap. Naturally, this time does not include the time spent on user involvement.

The ability to deal with large scale ontologies. As a basis of OAEI, UMLS is a comprehensive and popular effort that integrates medical thesauri and ontologies such as FMA, SNOMED CT and NCI.

Extendibility. Extendibility is also a key non-functional requirement of the optimized system. The design and implementation of the optimization approaches should be easy to extend.


4 Overview Design

In this chapter, we present the overview design of the optimization approaches in SAMBO. The chapter is organized as follows. First, we present the architecture of SAMBO. In Section 4.2 we summarize the structure of the optimization approaches, followed by the designs of the functional modules in Section 4.3. Finally, in Section 4.4 we discuss the alternatives for key techniques.

4.1 Architecture of SAMBO

The system architecture is designed in three levels: the data layer, the logic layer and the presentation layer. The architecture is given in figure 4.1. It is derived from the functional requirements identified in the requirements analysis.

Figure 4.1: Overview Architecture

In the data access layer, we build new data structures chiefly based on HashMap in Java as well as generic types. Generic types ensure that our data structures achieve strong extendibility. The basic element of a generic type may be a HashMap or a HashSet in Java.

In the presentation layer, since the existing session-based SAMBO to be enhanced is based on Java Servlet, the technical route of this level follows the process of Java Servlet development. The fundamental steps are first creating a Java class that inherits from javax.servlet.http.HttpServlet, then describing the servlet in the web.xml file by giving the servlet name and servlet class, and finally assigning a URL to the servlet.

4.2 The Structure of Optimization Approaches

Based on the requirements analysis, the system can be divided into modules such as optimization of data structures, optimization of matching algorithms and optimization for OWL files. The structure is given in figure 4.2. We separate the optimization approaches into two overall aspects: business logic optimization and code optimization. In detail, business logic optimization covers data structure optimization, database optimization, indexing optimization and reasoning optimization, while code optimization is about using OWL API to update the system and eliminating defects.

Figure 4.2: Structure of Optimization Approaches

Table 4.1 presents the details of each module.

Sub modules | Description
Data structure optimization | Some data from the ontology, such as the lexical data of a class, need to be stored separately, as do the maps of classes and properties.
Database design optimization | Employs foreign key constraints to implement the data associations.
Indexing optimization | Uses the URI to identify classes, properties and other kinds of resources.
Functional procedures optimization | Optimizes some program flows and sequences of the system, primarily in the matching process.
Parallel matching optimization | Calculates the similarities of concepts from the input ontologies in parallel.
OWL API callings | Employs OWL API to analyze ontology files, including loading ontologies, accessing entities such as classes and properties, and obtaining literal data such as class labels and property ranges.
Defects elimination | Eliminates the defects or bugs in the existing system.
Table 4.1: Description of sub modules

4.3 The Functional Modules

Business Logic Optimization

Data Structures Optimization

An optimized data structure design is given in figure 4.3 and the classes are described in table 4.2. The new design emphasizes storing the essential and useful data of an ontology individually. The reason is that alignment is achieved chiefly by matching the lexical information of two concepts. The existing system does not store these data separately, which forces it to load the ontology object each time. The new design separates the lexicon from the ontology object, which is composed of maps of the entities in the ontology.

Figure 4.3: Class Diagram after Optimization

Class Name | Description
MOntology | Stores maps of classes, data properties and object properties.
MClass | Corresponds to a class in the ontology, including the class URI and name.
Lexicon | Each Lexicon object is a literal record for an entity (class or property) in the ontology.
MProperty | An abstraction of data property and object property.
MDataproperty | A data property, consisting of range and domain.
MObjectproperty | An object property, consisting of range and domain.
OntManager | Manages the ontologies involved in an alignment and matching process. It holds the source and target ontologies to be aligned as MOntology objects, and it contains the calls that load ontologies through OWL API.
MergerManager | Manages the matching process. It includes an OntManager object.
Pair | Contains the URIs of two concepts retrieved from database storage. It is used as an intermediate when transforming a database result set into suggestions.
Task | Contains the indexes of the source and target concepts and other information used in the computation.
SimValueConstructor | A functional class that contains the similarity computation functions.
Table 4.2: Description for Classes

The attributes of class MOntology are listed in table 4.3. These attributes differ from the existing attributes in SAMBO. Class MOntology is the entity of an ontology, with attributes such as the maps of classes, data properties and object properties, the lexicons of concepts and other essential interfaces for accessing the ontology entity.

Attribute Name | Description
Map<Integer, MClass> classes | A map of the classes in the ontology, keyed by an index defined from the class URI.
Map<Integer, String> classlocalname | A map from class index to local name.
Map<Integer, MDataproperty> dataproperties | A map of the data properties, keyed by an index defined from the property URI.
Map<Integer, String> datapropertylocalname | A map from data property index to local name.
Map<Integer, MObjectproperty> objectproperties | A map of the object properties, keyed by an index defined from the property URI.
Map<Integer, String> objectpropertylocalname | A map from object property index to local name.
Map<Integer, Set<Lexicon>> classlexicons | A map from class index to its set of lexical data. The index is defined from the class URI, while Set<Lexicon> is a set of lexical descriptions typed, for example, as rdfs:label.
Map<Integer, Set<Lexicon>> dplexicons | A map from data property index to its set of lexicons.
Map<Integer, Set<Lexicon>> oplexicons | A map from object property index to its set of lexicons.
OWLOntologyManager manager | Primarily used for loading ontologies.
OWLDataFactory factory | Primarily used for accessing annotation data such as the labels and synonyms of classes and properties.
Table 4.3: Attributes Description for Class MOntology

The attributes of class MClass are listed in table 4.4. Class MClass is the entity of a class in the ontology and includes basic information such as the URI, name and local name. Relational information such as super, sub and equivalent relationships is also contained in MClass.
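The separation of lexical data in table 4.3 can be illustrated with a reduced sketch covering only classes. It is not the thesis implementation: the text says the index is "defined from" the URI without giving the rule, so the sequential numbering used here is an assumption, as are the class name `OntologyIndexSketch`, the `LexicalType` values and the rule that the local name is the URI fragment after `#`.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

// Hypothetical sketch of an ontology object that keeps lexical data in its own maps.
class OntologyIndexSketch {

    // Kinds of lexical data; the concrete values are assumptions.
    enum LexicalType { LABEL, SYNONYM, LOCAL_NAME }

    // One lexical record of a concept: its kind, language tag and text.
    static final class Lexicon {
        final LexicalType type;
        final String language;
        final String name;

        Lexicon(LexicalType type, String language, String name) {
            this.type = type;
            this.language = language;
            this.name = name;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Lexicon)) return false;
            Lexicon other = (Lexicon) o;
            return type == other.type
                    && language.equals(other.language)
                    && name.equals(other.name);
        }

        @Override
        public int hashCode() {
            return Objects.hash(type, language, name);
        }
    }

    private final Map<String, Integer> uriToIndex = new HashMap<>();
    private final Map<Integer, String> classLocalName = new HashMap<>();
    private final Map<Integer, Set<Lexicon>> classLexicons = new HashMap<>();

    // Registers a class URI and returns its index; a known URI keeps its index.
    int addClass(String uri) {
        Integer known = uriToIndex.get(uri);
        if (known != null) return known;
        int index = uriToIndex.size(); // assumed rule: sequential numbering
        uriToIndex.put(uri, index);
        classLocalName.put(index, uri.substring(uri.lastIndexOf('#') + 1));
        classLexicons.put(index, new HashSet<>());
        return index;
    }

    // Attaches a lexical record to an already registered class.
    void addLexicon(int index, LexicalType type, String language, String name) {
        classLexicons.get(index).add(new Lexicon(type, language, name));
    }

    String localNameOf(int index) {
        return classLocalName.get(index);
    }

    Set<Lexicon> lexiconsOf(int index) {
        return classLexicons.getOrDefault(index, Collections.emptySet());
    }
}
```

With this layout a matcher can read the lexicons of a concept by index without touching the ontology model, which is the motivation given for the redesign.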

Attribute Name | Description
String uri | URI of the class in the ontology.
String name | Name of the class in the ontology, usually typed for example as rdf:ID.
String label | The local name of the class.
Map<Integer, MClass> supers | A map of super classes.
Map<Integer, MClass> subs | A map of sub classes.
Map<Integer, MClass> equi | A map of equivalent classes.
Map<Integer, MClass> haspart | A map of classes with the relationship haspart.
Map<Integer, MClass> part of | A map of classes with the relationship partof.
Table 4.4: Attributes Description for Class MClass

The attributes of class Lexicon are listed in table 4.5. Lexicon is a new data structure that does not exist in the existing SAMBO. Class Lexicon is a component of each concept in an ontology. It contains basic information such as language type and content. Each concept in an ontology may have several lexicons.

Attribute Name | Description
LexicalType type | Distinguishes the different kinds of lexical data. For instance, a class may contain a lexical description typed as rdfs:label.
String language | The language of the record, such as en for English.
String name | The description, which is the key data used in matching for the class.
Table 4.5: Attributes Description for Class Lexicon

The attributes of class OntManager are listed in table 4.6. Class OntManager is the manager of the source and target ontologies. It is used before the matching process to obtain the source ontology instance and the target ontology instance, as well as their concept index maps.

Attribute Name | Description
MOntology source | One of the ontologies to align.
MOntology target | The other ontology to align.
Table 4.6: Attributes Description for Class OntManager

The attributes of class MergerManager are listed in table 4.7.

Attribute Name | Description
OntManager ont | Manages the ontologies.
Int map_ontologies_id | An incremental integer in the database that identifies the pair of ontologies.
Set<Integer> matcher_list | A set that stores the matchers configured by the user.
Map<Integer, Pair> suggestedpairs | A map of suggested concept pairs.
Map<Integer, Task> task_list | A list containing all concept pairs to be matched.
MapOntologyGenerateQuery motable | Database access instance related to table mappable_ontologies.
MapConceptGenerateQuery mctable | Database access instance related to table mappable_concepts.
Table 4.7: Attributes Description for Class MergerManager
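Since task_list holds all concept pairs to be matched and each pair can be scored independently, the parallel matching named in table 4.1 could look like the following sketch. It is illustrative only: the reduced `Task` class, the Dice coefficient over 2-grams used as the similarity, and the use of Java parallel streams are assumptions, not the thesis implementation.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: scoring all matching tasks in parallel.
class ParallelMatchingSketch {

    // Reduced stand-in for class Task: one concept pair and its similarity.
    static final class Task {
        final String sourceName;
        final String targetName;
        double value; // filled in by computeAll

        Task(String sourceName, String targetName) {
            this.sourceName = sourceName;
            this.targetName = targetName;
        }
    }

    // Collects the distinct 2-grams of a lower-cased string.
    static Set<String> bigrams(String s) {
        Set<String> grams = new HashSet<>();
        String t = s.toLowerCase();
        for (int i = 0; i + 2 <= t.length(); i++) {
            grams.add(t.substring(i, i + 2));
        }
        return grams;
    }

    // Assumed similarity: Dice coefficient over the two 2-gram sets.
    static double similarity(String a, String b) {
        Set<String> ga = bigrams(a);
        Set<String> gb = bigrams(b);
        if (ga.isEmpty() && gb.isEmpty()) {
            return a.equalsIgnoreCase(b) ? 1.0 : 0.0;
        }
        Set<String> common = new HashSet<>(ga);
        common.retainAll(gb);
        return 2.0 * common.size() / (ga.size() + gb.size());
    }

    // Each task is written by exactly one worker, so no locking is needed.
    static void computeAll(Map<Integer, Task> taskList) {
        taskList.values().parallelStream()
                .forEach(t -> t.value = similarity(t.sourceName, t.targetName));
    }

    public static void main(String[] args) {
        Map<Integer, Task> tasks = new LinkedHashMap<>();
        tasks.put(0, new Task("Heart", "heart"));
        tasks.put(1, new Task("Heart", "Lung"));
        computeAll(tasks);
        tasks.forEach((id, t) -> System.out.println(id + ": " + t.value));
    }
}
```

The design choice here is that tasks share no mutable state, so the work can be split across threads without synchronization; a thread pool with explicit batches would be an alternative.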

Class MergerManager is a control class during the whole alignment process. It includes attributes related to database access, the management of matching suggestions, the management of computing tasks and so on.

Table 4.8 presents the attributes of class Pair. This class is used for intermediate control. For example, when the system obtains similarities from the database, it stores the data in a list of Pair objects.

Attribute Name | Description
String source_uri | The URI of the source concept.
String target_uri | The URI of the target concept.
String comment | The comment on the matched concept pair.
Double similarity | The similarity of the source and target concepts.
Table 4.8: Attributes Description for Class Pair

Table 4.9 presents the attributes of class Task. This class is used in the similarity computation process. As class MergerManager shows, it holds a map of Task objects that includes all pairs to be matched.

Attribute Name | Description
Int source_id | The index of the source concept.
Int target_id | The index of the target concept.
Set<Lexicon> source_lexicon | The lexicon of the source concept.
Set<Lexicon> target_lexicon | The lexicon of the target concept.
Double value | The similarity of the two concepts from the source and target ontologies.
Table 4.9: Attributes Description for Class Task

Table 4.10 presents the attributes of class SimValueConstructor. This class is used in the similarity computation process. Attributes source_content and target_content represent the concept indexes.

Attribute Name | Description
MOntology sourceontology | The instance of the source ontology.
MOntology targetontology | The instance of the target ontology.
Set<Integer> source_content | A set of concept indexes from the source ontology.
Set<Integer> target_content | A set of concept indexes from the target ontology.
Table 4.10: Attributes Description for Class SimValueConstructor

Database Design Optimization

The new database design is given in figure 4.4.
The database redesign aims to satisfy the standards of database design so that data associations can be implemented. The four tables shown in figure 4.4 are chiefly involved in the matching process; they store the computed results that are used by the session mechanism. The idea of the database design optimization is to split the single table that stores the similarities of all matchers into several tables, one per matcher. For example, matcher EditDistance has an individual table called simvalue_ed to store its similarities. The identifiers from mappable_concepts and mappable_ontologies are used to obtain a unique result.

Figure 4.4: E-R Diagram of Database after Optimization

The class diagram related to the database is given in figure 4.5. First, there is a class called SqlGenerateQuery, which is the base class inherited by MapOntologyGenerateQuery, MapConceptGenerateQuery and SimilarityGenerateQuery. Each of these query generation classes contains the SQL statement construction and the interfaces for querying data in the database. Class DatabaseAccess is a base class inherited by MapOntologyDBAccess, MapConceptDBAccess and SimilarityDBAccess, which are the interfaces at the data level of the SAMBO framework.

Figure 4.5: Class Diagram for Database Access

The description for each table is listed in table 4.11.

Table Name | Description
mappable_ontologies | Stores the pairs of ontologies that have been matched before.
mappable_concepts | Stores the specific mappings between two concepts; the two ontologies they come from are identified by a foreign key.
simvalue_ed | Stores the results of the Edit-Distance matcher.
simvalue_ng | Stores the results of the NGram matcher.
simvalue_ed_ng | Stores the results of the combination of Edit-Distance and NGram with the default weight configuration.
simvalue_ed_ng_wn | Stores the results of the combination of Edit-Distance, NGram and WordNet with the default weight configuration.
simvalue_view | A view combining the matcher results for concepts.

Table 4.11: Description for key tables in database during matching
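The one-table-per-matcher idea can be illustrated with a small sketch of how a query generation class might pick the table for a matcher and build parameterized SQL. This is only an illustration under stated assumptions: the class name SimilarityQuerySketch and its methods are hypothetical and are not SAMBO's actual SimilarityGenerateQuery; the columns id and similarity are the ones used by the per-matcher tables in this section; the upsert uses MySQL's ON DUPLICATE KEY UPDATE clause.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not SAMBO's actual SimilarityGenerateQuery class.
final class SimilarityQuerySketch {

    // Matcher name -> per-matcher table name, following Table 4.11.
    private static final Map<String, String> TABLES = new LinkedHashMap<String, String>();
    static {
        TABLES.put("EditDistance", "simvalue_ed");
        TABLES.put("NGram", "simvalue_ng");
    }

    /** Returns the table that stores the similarities of the given matcher. */
    static String tableFor(String matcher) {
        String table = TABLES.get(matcher);
        if (table == null) {
            throw new IllegalArgumentException("Unknown matcher: " + matcher);
        }
        return table;
    }

    /** Parameterized lookup of a stored similarity by mappable concept id. */
    static String selectSimilarity(String matcher) {
        return "SELECT similarity FROM " + tableFor(matcher) + " WHERE id = ?";
    }

    /** Parameterized insert that falls back to an update when the id already exists. */
    static String upsertSimilarity(String matcher) {
        return "INSERT INTO " + tableFor(matcher) + " (id, similarity) VALUES (?, ?) "
                + "ON DUPLICATE KEY UPDATE similarity = VALUES(similarity)";
    }

    public static void main(String[] args) {
        System.out.println(selectSimilarity("EditDistance"));
        System.out.println(upsertSimilarity("NGram"));
    }
}
```

Because the table name comes from a fixed map rather than from user input, only the id and similarity values need to be bound as statement parameters.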

The fields and data types of table mappable_ontologies are listed in table 4.12. The table has an id and the names of the two ontologies.

Field | Data Type | Description
moid | INT(11) | Primary key, incremental and not null.
source_ontology_name | Varchar(100) | The source ontology name. Not null.
target_ontology_name | Varchar(100) | The target ontology name. Not null.

Table 4.12: Description for table mappable_ontologies

The fields and data types of table mappable_concepts are listed in table 4.13. This table primarily stores the concept pairs to be matched in the two ontologies, with a flag showing whether they are classes or properties. It also has an id and a foreign key that comes from table mappable_ontologies.

Field | Data Type | Description
mcid | INT(11) | Primary key, incremental and not null.
moid | INT(11) | Foreign key from table mappable_ontologies.
source_concept_name | Varchar(200) | The source ontology's concept name. Not null.
target_concept_name | Varchar(200) | The target ontology's concept name. Not null.
type | INT | A flag to distinguish whether the concepts are classes or properties.

Table 4.13: Description for table mappable_concepts

The fields and data types of table simvalue_ed are listed in table 4.14. This table is used for matcher Edit-Distance.

Field | Data Type | Description
id | INT(11) | Primary key, not null. Also a foreign key from mappable_concepts.
similarity | FLOAT | Matching result of the concepts.

Table 4.14: Description for table simvalue_ed

The fields and data types of table simvalue_ng are listed in table 4.15. This table is used for matcher NGram.

Field | Data Type | Description
id | INT(11) | Primary key, not null. Also a foreign key from mappable_concepts.
similarity | FLOAT | Matching result of the concepts.

Table 4.15: Description for table simvalue_ng

Table 4.16 presents the details of view simvalue_view, which is used during the combination process in alignment.

Field | Data Type | Description
id | INT(11) | Id of the mappable concepts.
simvalue_ed | FLOAT | Matching result from the Edit-Distance matcher.
simvalue_ng | FLOAT | Matching result from the NGram matcher.
type | INT | A flag to identify whether the concepts are classes or properties.

Table 4.16: Description for View simvalue_view

Indexing Function

A lexical index requires locating the labels of the classes in each ontology as well as their lexical variations.

The English names of ontology classes, as well as alternative names such as synonyms, are usually stored in OWL in label annotations. It is better to split each label of each class in the input ontologies into components. Besides, we can employ an external lexicon to find synonyms. The class diagram for URITable and the relevant classes is given in figure 4.6. URITable has a map from index to URI as well as a map from URI to index. Its operations include adding a URI and functions returning the URI and the index.

Figure 4.6: Class Diagram for Indexing Optimization

Reasoning Function

As the requirement analysis shows, the use of the Racer reasoner can degrade the performance of the system, because it is an old version and it is called as an external executable over the network. Since the Racer reasoner has interfaces in OWL API, it is better to update the functions related to reasoning to use OWL API. Table 4.17 presents some popular reasoners that provide OWL interfaces.

Reasoner name | Package
Racer | de.uulm.ecs.ai.owl.inference.racer.tcp.racertcpreasonerfactory
HermiT | org.semanticweb.hermit.reasoner
Pellet | com.clarkparsia.pellet.owlapiv3.pelletreasonerfactory
FacT++ | uk.ac.manchester.cs.factplusplus.owlapiv3.factplusplusreasoner
ELK | org.semanticweb.elk.owlapi.elkreasoner
MORe | org.semanticweb.more.morereasoner

Table 4.17: Popular Reasoners with OWL interfaces

Code Optimization

Since SAMBO uses the Jena 2 API, which is neither widely used nor efficient for analyzing ontologies, it is better to apply OWL API for loading ontologies, creating the reasoner and so on. When we use OWL API to analyze an ontology, there are several essential steps: creating the ontology object, creating the reasoner, getting the top node of the ontology, getting the sub-level classes and getting the annotations. Table 4.18 presents the principal functions of some OWL API interfaces such as OWLOntologyManager and OWLOntology, while table 4.19 presents what the interfaces in OWL API represent.

Interface in OWL API | Main Functions
OWLOntologyManager | loadOntologyFromOntologyDocument(source) // load ontology from source; getOWLDataFactory() // return the data factory that can create classes, properties etc.
OWLDataFactory | getOWLAnnotationProperty() // one of the necessary annotation properties is rdfs:label
OWLOntology | getClassesInSignature() // return a set of OWLClass
OWLClass | getIRI(); getSubClasses() // get sub classes; getSuperClasses() // get super classes
OWLAnnotation | getValue(); getAnnotations()
OWLLiteral | getLiteral() // get lexical value of literals; getLang() // return language tag such as en

Table 4.18: Essential functions to be used in OWL API

Interface | Description
OWLOntologyManager | Responsible for creating, loading and accessing ontologies. It manages a set of ontologies and the mappings between ontologies and their documents.
OWLDataFactory | Responsible for creating entities, class expressions and axioms.
OWLOntology | This interface has a set of OWLAxioms and a set of OWLAnnotations.
OWLClass | This is the class in an ontology. This interface has methods such as getSubClasses and getSuperClasses.
OWLAnnotation | Responsible for acquiring annotations such as the label. In OWL API, there are three different kinds of properties, typed as rdfs:comment, owl:deprecated and rdfs:label.
OWLLiteral | Literals are data values such as strings or integers. This interface is responsible for getting the literal and its language type.

Table 4.19: Description of the interfaces

As table 4.18 and table 4.19 show, OWL API is generally divided into two parts: the ontology manager (OWLOntologyManager) and the data factory (OWLDataFactory). The manager is the basis for accessing the data factory, while the factory has functions to access ontology entities, class entities, property entities and their literal information.

4.4 Key Techniques

Indexing Data Structure

One of the key techniques is to choose a proper index mechanism to upgrade the basic data structure used during the alignment process. Indexing the lexical and structural components of ontologies is essential, since the indexes support quick access to the classes as well as to the structurally related components in the ontology. One of the indexing mechanisms we consider is the hash index. A hash index locates data quickly. Its disadvantages are that it cannot answer queries for similar (rather than exactly equal) values and that it needs a large amount of space. However, since running time is the primary requirement, the single-step lookup of a hash index remains attractive.

Solution 1 Hashtable. Hashtable extends Dictionary<K, V> according to the Java source code [36]. K represents the key while V represents the value. To employ this data structure in our system, we can define K as an integer or a URI, which can be viewed as the unique identification of a concept or a property. The advantage of Hashtable is that it provides quick access during the matching process. In addition, its get and put methods are synchronized, which makes it thread-safe [37].

Solution 2 HashMap. Unlike Hashtable, HashMap extends AbstractMap<K, V> according to the Java source code [38]. Both HashMap and Hashtable implement Map<K, V>. Since Dictionary<K, V> is an obsolete class, HashMap is preferable in performance and functionality. However, unlike Hashtable, HashMap is not thread-safe.

Solution 3 ConcurrentHashMap. ConcurrentHashMap is regarded as a thread-safe version of HashMap [39]. It offers the same functions as HashMap while implementing a more reliable thread-safety mechanism than Hashtable.

Our final choice is solution 2, HashMap. The reason is that it performs better than Hashtable. Since we do not operate on the HashMap from parallel threads directly, its lack of thread safety is not a problem.

Parallel Matching

Another key technique is the optimization of matching strategies, such as applying the divide-and-conquer idea to large scale ontology alignment. The core technique is to analyze the time and space cost of the existing matching algorithms.
We use techniques of algorithm analysis and design to obtain the time and space cost of the existing matching algorithms. These matching algorithms may not be able to align large scale ontologies. So, after analyzing them, we consider how to optimize them.

Solution 1 ExecutorService from the Executor framework in Java 7. ExecutorService is a parallel mechanism of the Executor framework, which employs executors to manage thread objects [40, 41]. It includes a set of functions to manage the thread pool, such as controlling the number of threads. It also offers features such as Callable/Future, which is similar to Runnable but supports returning results from the executor.

Solution 2 Fork/Join framework in Java 7. The Fork/Join framework is an implementation of the divide and conquer strategy. It divides a huge task into sub tasks and runs the sub tasks in parallel. It has a basic fork function to divide tasks and a join function to combine results [41, 42]. A work-stealing algorithm supports the parallel mechanism in the Fork/Join framework: if one thread finishes all sub tasks in its queue, it steals sub tasks from another thread's queue. The Fork/Join framework is also an extension of Executor. It needs a threshold to stop forking. The only drawback of the Fork/Join framework is that it produces a large amount of garbage because of thread switches during forking [43].

Our final choice is the ExecutorService framework.
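To make the role of the threshold in Solution 2 concrete, the following minimal sketch shows the fork, join and threshold mechanics of the Fork/Join alternative that was considered but not chosen. It is an illustration only: the class names, the threshold value and the summation workload are hypothetical stand-ins and are not taken from SAMBO.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical sketch of the Fork/Join alternative; not part of SAMBO.
final class ForkJoinSketch {

    /** Sums the values in scores[from, to) by recursive splitting. */
    static final class SumTask extends RecursiveTask<Double> {
        private static final int THRESHOLD = 1000; // assumed cut-off that ends the forking
        private final double[] scores;
        private final int from;
        private final int to;

        SumTask(double[] scores, int from, int to) {
            this.scores = scores;
            this.from = from;
            this.to = to;
        }

        @Override
        protected Double compute() {
            if (to - from <= THRESHOLD) {
                // Small enough: solve sequentially instead of forking further.
                double sum = 0.0;
                for (int i = from; i < to; i++) {
                    sum += scores[i];
                }
                return sum;
            }
            int mid = (from + to) >>> 1;
            SumTask left = new SumTask(scores, from, mid);
            SumTask right = new SumTask(scores, mid, to);
            left.fork();                          // run the left half asynchronously
            double rightResult = right.compute(); // work on the right half in this thread
            return left.join() + rightResult;     // combine the partial results
        }
    }

    static double parallelSum(double[] scores) {
        return new ForkJoinPool().invoke(new SumTask(scores, 0, scores.length));
    }

    public static void main(String[] args) {
        double[] scores = new double[5000];
        java.util.Arrays.fill(scores, 1.0);
        System.out.println(parallelSum(scores));
    }
}
```

Each split creates new task objects, which is one way to see why a poorly chosen threshold can produce many short-lived objects.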

5 Detailed Design and Implementation

In this chapter, we present the detailed design of the optimization approaches and the process of implementation. This chapter is organized as follows. First, in Section 5.1, we present the detailed design of the key modules, including the design of data structures, functions, algorithm flows and so on. Then, in Section 5.2, we introduce the implementation environment. Finally, in Section 5.3, we display the key interfaces of the system.

5.1 Detailed Design of Key Modules

Data Structures Optimization

The detailed class diagram including the operations is given in figure 5.1. Four classes, MergerManager, OntManager, MOntology and Lexicon, are added or modified in the optimization. This is the chief modification of the old system. We add new data structures to support the index mechanism.

Figure 5.1: Detailed Class Diagram

The operations description for class MOntology is listed in table 5.1. Class MOntology is the representation of an ontology. It contains basic functions such as the set and get methods of the attributes. It also includes the functions that construct the classes, properties and relationships in the ontology. In particular, functions buildclasses and buildproperties contain the initial generation of some indexing data structures.

Operation Name | Description
MOntology() | Constructor Method.
addclasses(MClass class, int index) | Adds a mapping of index and MClass to the storage after the index of a class has been generated.
adddataproperty(MDataproperty dp, int index) | Adds a mapping of index and MDataproperty to the storage after the index of a data property has been generated.
addobjectproperty(MObjectproperty dp, int index) | Adds a mapping of index and MObjectproperty to the storage after the index of an object property has been generated.
getclass(int index) | Returns an instance of MClass for the given class index.
getlexicon(int index) | Returns a set of lexicon data for the given class index.
addlexicon(String language, String name) | Adds the mapping of a class index and its set of lexicon data after a class in the set of OWLClass has been processed.
loadmontology(URL uri) | Loads an ontology from an OWLOntology object with the URL parameter.
loadmontology(String path) | Loads an ontology from an OWLOntology object with the path parameter.
checkuri(URI uri) | Checks whether the URI parameter is a local file or not.
buildclasses(OWLOntology o) | Reads classes from the OWLOntology object and builds the indexes and the storage for classes.
buildproperties(OWLOntology o) | Reads properties from the OWLOntology object and builds the indexes and the storage for properties.
buildrelationships(OWLOntology o) | Obtains the sub and super relationships in the ontology.
checkresource(String uri) | Checks whether the resource identified by the URI is a class, a data property or an object property.

Table 5.1: Operations description for MOntology

The operations description for class MClass is listed in table 5.2. Class MClass represents a class in an ontology. It has the basic set and get functions of the attributes. It also maintains the sub, super, equivalent, partof and haspart relationships.

Operation Name | Description
MClass(Integer id, String uri, String class_label) | Constructor Method.
setalignname(String align_name) | Sets the align name related to the object.
setalignclass(MClass aclass) | Sets the object's aligned MClass.
addsuper(MClass superclass) | Adds a super class.
addsub(MClass subclass) | Adds a sub class.
addequivalent(MClass equclass) | Adds an equivalent class.
addpartof(MClass pclass) | Adds an MClass with the partof relation.
addhaspart(MClass hpclass) | Adds an MClass with the haspart relation.

Table 5.2: Operations description for MClass

The operations description for class Lexicon is listed in table 5.3. Class Lexicon represents one piece of lexical description of a class or property in an ontology. Its operations are primarily get functions.

Operation Name | Description
Lexicon(String type, String language, String name) | Constructor Method.
getlanguage() | Returns the type of language.
getname() | Returns the description in the lexicon record.

Table 5.3: Operations description for Lexicon

The operations description for class OntManager is listed in table 5.4. Class OntManager is designed as a manager of the source and target ontologies. It provides access to the source and target ontology objects.

Operation Name | Description
OntManager() | Constructor Method.
loadontology(String spath, String tpath) | Loads the source and target ontologies according to the file paths given as String.
loadontology(URL spath, URL tpath) | Loads the source and target ontologies according to the file paths given as URL.
getontology(String ontology) | Returns the source or target ontology according to the parameter ontology, whose value is source or target.

Table 5.4: Operations description for OntManager

The operations description for class MergerManager is listed in table 5.5. Class MergerManager plays an important role throughout the whole alignment process. It contains functions for loading ontologies, generating the task list, saving results and so on.

Operation Name | Description
MergerManager() | Constructor Method.
getontmanager() | Returns the instance of class OntManager.
loadontologies(String uri1, String uri2) | Loads ontologies with string parameters representing the URIs of the ontologies.
loadontologies(URL url1, URL url2) | Loads ontologies with URL parameters representing the paths of the ontologies.
generatetasklist(int step) | Generates the list of concepts to be matched; parameter step indicates whether properties or classes are aligned.
gettasklist() | Returns the task list.
getconcepturi(int id, int ontology) | Returns the URI of a concept; parameter id is the index in the ontology and parameter ontology indicates which ontology.
matchforkjoin(Map tasklist, Matcher m) | The function responsible for parallel matching.

Table 5.5: Operations description for MergerManager

The operations description for class Pair is listed in table 5.6.

Operation Name | Description
Pair(String sourceuri, String targeturi) | Constructor Method.
setsuri(String uri) | Sets the source URI.
setturi(String uri) | Sets the target URI.

Table 5.6: Operations description for Pair

The operations description for class SimValueConstructor is listed in table 5.7. Class SimValueConstructor is chiefly responsible for computing similarities. In particular, function calculate_simvalue(int step) contains the logic for accessing the database and computing similarities.

Operation Name | Description
calculatesimvalue(int step) | Computes the similarities of classes or properties; parameter step indicates whether properties or classes are aligned.
getpairlist() | Returns the matched concepts from the database.
calculatesimvalueparallel(int step) | Computes the similarities in parallel.

Table 5.7: Operations description for SimValueConstructor

Indexing Optimization

The class diagram of URITable is given in figure 4.6. The attribute size in URITable is essential for maintaining the capacity of URITable and is used to define the indexes of URIs. The operations description for URITable is listed in table 5.8. Each ontology has its own URITable, which ensures that the index management within an ontology does not overlap. The attribute size is the current number of URI and index mappings, so it can be used to define the index of the concept or property that is currently being added. In this way, URITable provides unique identifiers within an ontology. The URITable is constructed while the classes and properties are built during the loading of the ontologies.

Operation Name | Description
adduri(String uri) | Called after a class or property has been obtained from an ontology. Adds the URI to URITable. The index is equal to the current size of URITable.
getindex(String uri) | Called when we want to obtain the index of a resource for which we only have the URI.
getname(String uri) | Acquires the name that is part of a class or property URI.

Table 5.8: Operations description for URITable

Functional Procedure Optimization

The new business logic diagram is given in figure 5.2. The biggest modification is that we generate the TaskList before the matching process begins. That is, after the ontologies are loaded and before the matching process starts, the system generates the task list. In the task list, each task contains the indexes of the concepts from the ontologies to be matched.
So we can use the index to access the lexicon and related data in an easy and convenient way. This also makes it easy to apply parallel matching in the system. After the task list has been generated, the matching process begins. First, the system queries the local database to check whether similarity results already exist for the concepts in the task list. Then, it inserts a new record if there is no query result and updates the record if the local database already stores one. The details are shown in figure 5.2. For large scale ontologies, our focus is the optimization of parallel matching. Using the ExecutorService framework reduces the time consumed by matching large scale ontologies. To align large scale ontologies, it is better to separate them into several blocks and then to apply parallel matching block by block.

Figure 5.2: Functional Procedure Optimization

Figure 5.3 displays the sequence of loading ontologies. This sequence diagram involves the user, a boundary class called LoadFileServlet, the control classes MergerManager and OntManager, and an entity class called MOntology. First, after the user clicks upload ontologies, a message is sent to the load file servlet, which is activated. Next, it activates the MergerManager object and the OntManager object step by step. Then, the instance of class MOntology calls its internal functions to build the classes, properties and relationships in the ontologies. MergerManager gets back an OntManager instance holding the ontology instances. Finally, the load file servlet obtains the instance of MergerManager, which completes the loading process.

Figure 5.3: Sequence Diagram of Loading Ontologies
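The index-based task list described in this section can be sketched as follows. This is a minimal illustration under assumptions, not SAMBO's implementation: the class and method names (UriTable, Task, generateTaskList, splitIntoBlocks) are hypothetical, the task list is assumed to be the full cross product of the two ontologies' indexes, and the block size is an arbitrary parameter. The one property taken from the text is that a newly added URI receives the current table size as its index.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of index assignment and task list generation.
final class TaskListSketch {

    /** Two-way URI/index table; a new URI gets the current size as its index. */
    static final class UriTable {
        private final Map<String, Integer> uriToIndex = new HashMap<String, Integer>();
        private final Map<Integer, String> indexToUri = new HashMap<Integer, String>();

        int addUri(String uri) {
            Integer known = uriToIndex.get(uri);
            if (known != null) {
                return known; // already indexed, keep the existing index
            }
            int index = uriToIndex.size();
            uriToIndex.put(uri, index);
            indexToUri.put(index, uri);
            return index;
        }

        /** Returns the index of the URI, or -1 if the URI is unknown. */
        int getIndex(String uri) {
            Integer index = uriToIndex.get(uri);
            return index == null ? -1 : index;
        }

        String getUri(int index) {
            return indexToUri.get(index);
        }

        int size() {
            return uriToIndex.size();
        }
    }

    /** One matching task: the indexes of a source and a target concept. */
    static final class Task {
        final int sourceId;
        final int targetId;

        Task(int sourceId, int targetId) {
            this.sourceId = sourceId;
            this.targetId = targetId;
        }
    }

    /** Builds one task per (source, target) index pair before matching starts. */
    static List<Task> generateTaskList(UriTable source, UriTable target) {
        List<Task> tasks = new ArrayList<Task>();
        for (int s = 0; s < source.size(); s++) {
            for (int t = 0; t < target.size(); t++) {
                tasks.add(new Task(s, t));
            }
        }
        return tasks;
    }

    /** Splits the task list into consecutive blocks of at most blockSize tasks. */
    static List<List<Task>> splitIntoBlocks(List<Task> tasks, int blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        List<List<Task>> blocks = new ArrayList<List<Task>>();
        for (int from = 0; from < tasks.size(); from += blockSize) {
            int to = Math.min(from + blockSize, tasks.size());
            blocks.add(new ArrayList<Task>(tasks.subList(from, to)));
        }
        return blocks;
    }

    public static void main(String[] args) {
        UriTable source = new UriTable();
        UriTable target = new UriTable();
        source.addUri("http://example.org/source#Heart"); // example URIs are made up
        source.addUri("http://example.org/source#Lung");
        target.addUri("http://example.org/target#heart");
        List<Task> tasks = generateTaskList(source, target);
        System.out.println(tasks.size() + " tasks in "
                + splitIntoBlocks(tasks, 1).size() + " blocks");
    }
}
```

Because each task carries only two integers, the blocks can be handed to worker threads cheaply, and each worker can look up lexicons by index.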

Figure 5.4 displays the sequence of computation. First, after the user clicks start computation, a message is sent to the main servlet, which is activated. Second, the main servlet generates a MergerManager object, which internally generates the task list. Then, it queries the local database to check whether results already exist. Next, it computes the similarities and updates the database. Finally, it generates the combined and filtered suggestions.

Figure 5.4: Sequence Diagram of Computation

After the computation, the system generates a vector of suggestions and displays it in the front end. The user can set a suggestion as an equivalent, sub or super relationship, or reject the suggestion. The difference between equivalent classes and sub or super classes is that the equivalent relationship is unique: each class in the source ontology can have only one equivalent class in the target ontology. So if the user accepts one suggestion as equivalent from a list of related suggestions, the other entries must be removed from the suggestion list. The process of handling suggestions is shown in figure 5.5. After the user has handled all the suggestions, the system generates an alignment OWL file including the equivalent, sub and super relationships. In detail, the alignment file includes the matching comments and data information. We also represent the relationships between properties in the alignment file. The process for properties is similar to the process of saving class relationships into RDF, which is shown in figure 5.6.

Figure 5.5: Flow Chart of Handling Suggestions

Figure 5.6: Save alignment result into RDF

Parallel Matching

As presented in the section on key techniques, we use the ExecutorService framework to implement parallel matching. Class Mappingtask implements the interface Callable<Pair>, which is similar to the interface Runnable used for multi-threading in Java, and its overridden function call() is similar to run(). This function calls a function to calculate similarities. Class Mappingtask contains the URIs of the source class and the target class in the ontologies, which are used to access the lexicons and identifiers in the ontology. Figure 5.7 describes the details of the relevant data structures. The process is shown in figure 5.8. The function to compute similarity called in Mappingtask is shown in figure 5.9.

Figure 5.7: Detailed Class Diagram of Parallel Matching Module
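A task class that implements Callable<Pair> and is run through ExecutorService can be sketched as below. This is a simplified illustration, not the thesis code: the fields of MappingTask and Pair, the helper matchAll and the pool size are assumptions, the labels are passed in directly instead of being looked up through the ontology, and a normalised edit-distance similarity stands in for SAMBO's matchers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of Callable-based parallel matching.
final class ParallelMatchingSketch {

    /** Result of matching one concept pair. */
    static final class Pair {
        final String sourceUri;
        final String targetUri;
        final double similarity;

        Pair(String sourceUri, String targetUri, double similarity) {
            this.sourceUri = sourceUri;
            this.targetUri = targetUri;
            this.similarity = similarity;
        }
    }

    /** One unit of work; call() plays the role that run() plays for Runnable. */
    static final class MappingTask implements Callable<Pair> {
        private final String sourceUri;
        private final String targetUri;
        private final String sourceLabel;
        private final String targetLabel;

        MappingTask(String sourceUri, String sourceLabel, String targetUri, String targetLabel) {
            this.sourceUri = sourceUri;
            this.sourceLabel = sourceLabel;
            this.targetUri = targetUri;
            this.targetLabel = targetLabel;
        }

        @Override
        public Pair call() {
            return new Pair(sourceUri, targetUri, similarity(sourceLabel, targetLabel));
        }
    }

    /** Levenshtein distance computed with two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Similarity in [0, 1]: 1 minus the distance divided by the longer length. */
    static double similarity(String a, String b) {
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 1.0 : 1.0 - (double) editDistance(a, b) / max;
    }

    /** Runs all tasks on a fixed-size pool and collects the results in task order. */
    static List<Pair> matchAll(List<MappingTask> tasks, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Pair> results = new ArrayList<Pair>();
            for (Future<Pair> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Matching was interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A matching task failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Since invokeAll returns the futures in the same order as the submitted tasks, the results can be written back to the database without re-sorting.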

Figure 5.8: Flow Chart of Parallel Matching

Figure 5.9: Flow Chart of computing similarity

OWL API Callings

The flow to load ontologies.

Loading ontologies depends on an instance of OWLOntologyManager created by the function createOWLOntologyManager() of OWLManager. With the file object built from the file path, we can call the function loadOntologyFromOntologyDocument() to obtain an instance of OWLOntology. The flow to load an ontology is given in figure 5.10.

Figure 5.10: Flow Chart of Loading an Ontology

The flow to obtain classes in an ontology.

After the ontology is loaded, we have an instance of OWLOntology, from which we can get a set of OWLClass by calling the method getClassesInSignature(true). We then scan the elements in the OWLClass set to determine whether they are top nodes in the ontology, and we maintain the URITable and the map of name and index in class MOntology only for those that are not top nodes. The flow to obtain classes in an ontology is given in figure 5.11.

The flow to gain concept description labels of a class.

While analyzing the ontology, we get an instance of OWLClass that can be used to read the class description. The various descriptions of classes or properties are defined as annotations in OWL API, as OWLAnnotation. Therefore, we can use the method getAnnotations of OWLClass to access a set of OWLAnnotations. Furthermore, we need to check whether the value of each annotation is an OWLLiteral, which is what we want to use in the matching. Finally, after acquiring the descriptions, we add them to the lexicon storage. The flow to access the concept description of a class is given in figure 5.12.

The flow to generate relationships in the ontology.

Building the relationships in the ontology focuses on subclassof and superclassof. First, we take a class as the child class and obtain the set of all its super classes through OWL API. Then, we pick the super classes whose expression type is OWL_CLASS, OBJECT_ALL_VALUES_FROM or OBJECT_SOME_VALUES_FROM. Finally, we store the subclassof and superclassof relationships in the data structure. The process is shown in figure 5.13.
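The filtering and storage step for relationships can be sketched without the OWL API dependency as follows. Everything here is a stand-in under stated assumptions: the enum ExprType only mirrors the names of the expression types mentioned above and is not the OWL API type, the class RelationshipSketch and its methods are hypothetical, and the sketch assumes that each accepted super class expression has already been resolved to a single class index (for restrictions, for instance, the filler class), which the text does not specify.

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, dependency-free sketch of storing sub/super relationships by index.
final class RelationshipSketch {

    /** Stand-in for the expression types of super class expressions. */
    enum ExprType { OWL_CLASS, OBJECT_ALL_VALUES_FROM, OBJECT_SOME_VALUES_FROM, OBJECT_UNION_OF }

    /** A super class expression already resolved to a class index (assumption). */
    static final class SuperExpr {
        final int classIndex;
        final ExprType type;

        SuperExpr(int classIndex, ExprType type) {
            this.classIndex = classIndex;
            this.type = type;
        }
    }

    private static final Set<ExprType> ACCEPTED = EnumSet.of(
            ExprType.OWL_CLASS, ExprType.OBJECT_ALL_VALUES_FROM, ExprType.OBJECT_SOME_VALUES_FROM);

    private final Map<Integer, Set<Integer>> supers = new HashMap<Integer, Set<Integer>>();
    private final Map<Integer, Set<Integer>> subs = new HashMap<Integer, Set<Integer>>();

    /** Records child -> super and super -> child links for the accepted expression types. */
    void addSuperExpressions(int childIndex, List<SuperExpr> expressions) {
        for (SuperExpr expr : expressions) {
            if (!ACCEPTED.contains(expr.type)) {
                continue; // skip expression types that are not considered
            }
            link(supers, childIndex, expr.classIndex);
            link(subs, expr.classIndex, childIndex);
        }
    }

    private static void link(Map<Integer, Set<Integer>> map, int from, int to) {
        Set<Integer> targets = map.get(from);
        if (targets == null) {
            targets = new HashSet<Integer>();
            map.put(from, targets);
        }
        targets.add(to);
    }

    Set<Integer> supersOf(int classIndex) {
        Set<Integer> result = supers.get(classIndex);
        return result == null ? Collections.<Integer>emptySet() : result;
    }

    Set<Integer> subsOf(int classIndex) {
        Set<Integer> result = subs.get(classIndex);
        return result == null ? Collections.<Integer>emptySet() : result;
    }
}
```

Storing both directions means that a sub class lookup and a super class lookup each cost a single hash access.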

Figure 5.11: Flow Chart of Accessing Classes in Ontology

Figure 5.12: Flow Chart of Accessing Concept's Description

Figure 5.13: Flow Chart of build relationships

5.2 The Environment of Implementation

The development environment is Windows 7 Enterprise. The development tool is NetBeans IDE. The existing SAMBO system is a Java web application that needs a Tomcat server. During this project, we also need to run other alignment systems such as AML and LogMap. These systems have hardware requirements. The running environment requires Java 1.7 or later and Apache Tomcat. Regarding the local database, MySQL 5.6 or later is required.

5.3 Key Interfaces of System

In this section, we present the key interfaces for running the SAMBO system, including ontology loading, matching, recommendation and result presentation.

Figure 5.14 shows the main page of the SAMBO system. If the user clicks login, the system goes to the login page shown in figure 5.15. The user can register with his or her own email and will then receive a password sent to that email address.

Figure 5.14: Main Page

Figure 5.15: Login Page

As figure 5.16 shows, the user can choose how the ontologies are uploaded: from a web address, from the local disk or from the server of the project.

Figure 5.16: Ontology Type

Figure 5.17 shows the interface for choosing the source and target ontologies. After the user clicks the upload button, the files are uploaded. The interface has options for the large scale ontologies setting and the database storage setting.

Figure 5.17: Upload Ontologies

Figure 5.18 shows the configuration for properties matching. The default matching method is linguistic. The user can set the threshold. After the user clicks the start button, SAMBO starts matching the properties.

Figure 5.18: Configuration for Properties Matching

Figure 5.19 shows the matching of properties. The user can define a suggestion as an equivalent, sub or super relationship.

Figure 5.19: Properties Matching

Figure 5.20 shows the step before concept matching. If the user clicks the finalize button, SAMBO starts the process of concept matching. As figure 5.21 shows, the user can configure the details of the matching process. For example, the user can choose different kinds of matchers or their combination, and the filtering methods, including weight-based and maximum-based. The user can also set the threshold.
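As a rough illustration of what the weight-based and maximum-based options and the threshold could compute, consider the sketch below. The formulas are assumptions made for illustration, since this section does not define them: the weight-based variant is taken to be a weighted average of the matcher similarities, the maximum-based variant takes the largest similarity, and a pair is kept as a suggestion when the combined value reaches the threshold. The class and method names are hypothetical.

```java
// Hypothetical sketch of combining matcher results and applying a threshold.
final class CombinationSketch {

    /** Weighted average of the matcher similarities (assumed weight-based variant). */
    static double weighted(double[] similarities, double[] weights) {
        if (similarities.length != weights.length) {
            throw new IllegalArgumentException("One weight per matcher is required");
        }
        double sum = 0.0;
        double weightSum = 0.0;
        for (int i = 0; i < similarities.length; i++) {
            sum += weights[i] * similarities[i];
            weightSum += weights[i];
        }
        if (weightSum == 0.0) {
            throw new IllegalArgumentException("Weights must not sum to zero");
        }
        return sum / weightSum;
    }

    /** Largest matcher similarity (assumed maximum-based variant). */
    static double maximum(double[] similarities) {
        if (similarities.length == 0) {
            throw new IllegalArgumentException("At least one similarity is required");
        }
        double max = similarities[0];
        for (double value : similarities) {
            max = Math.max(max, value);
        }
        return max;
    }

    /** A pair becomes a suggestion when its combined value reaches the threshold. */
    static boolean isSuggestion(double combined, double threshold) {
        return combined >= threshold;
    }

    public static void main(String[] args) {
        double[] sims = {0.5, 1.0};  // e.g. results of two matchers for one pair
        double[] weights = {1.0, 1.0};
        System.out.println(isSuggestion(weighted(sims, weights), 0.6));
        System.out.println(isSuggestion(maximum(sims), 0.6));
    }
}
```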

Figure 5.20: Concepts Matching Start

Figure 5.21: Configuration for Concept Matching

After the configuration is completed, the system begins the matching process. This process takes some time; it is the most time-consuming step in the system. Then, the system presents the matching suggestions one by one, based on the user's configuration. As figure 5.22 shows, the link remaining suggestions leads to a page displaying the remaining suggestions, and the link history shows the suggestions the user has already handled. The user decides the validity of each suggestion based on his or her background knowledge in the field.

Figure 5.22: Suggestions

In figure 5.21, there is a button called align manually. If the user clicks this button, SAMBO goes to the interface shown in figure 5.23. The source and target ontologies are presented as trees in this interface. The user can choose concepts from both sides and define the relationships between the selected concepts.

Figure 5.23: Align Manually

In figure 5.21, there is a button called finish. If the user clicks the finish button, SAMBO goes to the interface shown in figure 5.24. The link called "The Alignment in OWL File" presents the final alignment result, as shown in figure 5.25.

Figure 5.24: Finish Matching
