Manually vs semiautomatic domain specific ontology building


Faculty of Arts and Philosophy
Master's Degree Programme in Business and Public Communication
Thesis in Computer Science for Electronic Commerce

Manually vs semiautomatic domain specific ontology building

Advisor: Prof. Ernesto D'Avanzo
Co-advisor: Prof. Tsvi Kuflik
Candidate: Antonio Lieto, matr.
Academic year

Acknowledgements

There are many people I have to thank for the realization of this work. First of all I want to thank my advisor, Prof. Ernesto D'Avanzo. He has always been available to me during this year of thesis research, giving me crucial advice for the development of this work (regarding both the theoretical and the experimental part). Another special person I wish to thank is Prof. Tsvi Kuflik. His comments and considerations about my work were very important for its improvement. So, if there are errors or mistakes, I am the only one to blame: their guidance was excellent. I also want to thank Dr. Brenda Schaffer for her suggestions on the manual ontology building; Professors Roberto Cordeschi and Annibale Elia, for their constant interest in the whole project, which has involved other Master's students of the University of Salerno as well; and Prof. Marcello Frixione for his interesting seminar on the evolution of semantic networks. Finally (last but not least) I have to thank my family, which has always been with me in happy and difficult moments. This work is dedicated to my grandmother Maria who, unfortunately, is no longer among us.

Contents

Chapter 1. The problem and the research question  p. 5

Chapter 2. Methods and Tools for the semi-automatic or automatic ontology generation
2.1 Different approaches to the ontology generation
Methods and techniques for the ontology generation from text
Methods and techniques for the ontology generation from dictionaries
Methods and techniques for the ontology generation from a knowledge base
Methods and techniques for the ontology generation from semi-structured schema
Methods and techniques for the ontology generation from relational schemata
Research projects and tools for the ontology generation  p. 25

Chapter 3. The Energy Domain Case Study
3.1 Energy Domain Modelling Process
3.1.1 The modelling approach
3.1.2 Information Sources and Tool
3.1.3 Definition of Domain Concepts
3.1.4 Horizontal Links
3.1.5 Ontology Population
3.1.6 Logical Expressions  p. 48

3.1.7 Racer Pro
Manual approach: Energy Ontology Description
Energy Domain
Energy Sources
The case of Hydrogen: Renewable or not Renewable?
Country
Energy Security Class
Infrastructures
Environmental consequences
Energy Use
A Semi-automatic approach for ontology building
Semi-Automatic Ontology Generation  p. 71

Chapter 4. Evaluation and Experiments
4.1 Pilot Study Experiments
Precision and Recall for a quantitative evaluation
New Experiments  p. 79

Chapter 5. Discussions and Conclusions
5.1 About the methodology
Proposal: a linguistically motivated keyphrases extraction system for OntoGen
Conclusions  p. 93

Chapter 1. The problem and the research question

Ontologies have gained a lot of attention in recent years as tools for knowledge representation. However, in information and computer science there are many definitions of what an ontology is. Gruber (1993), for example, defines an ontology as "an explicit specification of a conceptualization" (where a conceptualization is "an abstract, simplified view of the world that we wish to represent for some purpose"), pointing out the relative simplicity of the represented knowledge in comparison with the complexity of the knowledge itself. Tim Berners-Lee (2001) gives a more concrete definition, describing an ontology as "a document or file that formally defines the relations among terms", underlining in this way the importance of the (formally defined) relational aspect between the elements composing the ontology. In the simplest terms, an ontology can be defined as a formal knowledge representation system (KRS) composed of three main elements: classes (or concepts, or topics), instances (individuals which belong to a class) and properties (which link classes and instances, allowing information about the represented world to be inserted into the ontology). Obtaining a structured representation of information through ontologies is one of the main objectives on the way to realizing the so-called Semantic Web 1 (T.B. Lee et al., 2001). In the context of the Semantic Web, in fact, ontologies are expected to play an important role in helping automated processes to access information. In

1 According to Tim Berners-Lee (1999), the Semantic Web is an extension of the current web in which information is given a well-defined meaning. According to his vision, this should enable machines to understand the semantics of web resources and, therefore, to behave more intelligently in their search activities.

particular, ontologies are expected to provide structured vocabularies that explicate the relationships between different terms, allowing intelligent agents (and humans) to interpret their meaning. Another important aspect of the role of ontologies is linked to the issue of information overload. One of the problems of the current World Wide Web, in fact, is that a large part of the information returned to users as the output of an explicit query is irrelevant. Through the implementation of ontologies within dedicated information systems (e.g. search engines) this problem can be reduced or, in the future, solved completely (at least in theory), because the architecture of ontologies (which is usually hierarchical) should be able to define the unique path that a query must follow to arrive at the web resources containing the desired information. Since 2006, RDF, RDF Schema and OWL have generally been considered the standard Semantic Web languages. In particular OWL, the language used for the concrete manual ontology-building case study (introduced in the following pages), is the most expressive of the three. This means that it increases the number (and the quality) of the inferences that software agents are able to draw. Ontology building is a very complicated activity for several reasons. First, because it requires time-consuming work by experts. Second, because the classification task is not as simple as it seems. Finally, because the incredible speed at which knowledge develops in the real world constrains ontology engineers to continuously

update and enrich the generated ontologies with new concepts, terms and lexicon. In this way an ontology often becomes a never-ending work that constantly requires manual effort and resources to be built and maintained. In recent years, tools and methods have been developed to try to solve, automatically or semi-automatically, the problems related to manual ontology building (an overview of such approaches, tools and techniques is presented in the second chapter). The research question of this work is the following: is it possible, with the current tools and methods, to substitute (fully or even partially) for human activity in a complex task such as ontology building? We will try to answer this question through experimental results obtained on a concrete case study in which a manually built domain-specific ontology has been compared with a semi-automatically built one. The objective of this work is to present a concrete case study regarding the evaluation of the manual approach to ontology building compared with the semi-automatic one. This thesis has been developed in three phases: in the first, a manual energy-domain ontology was created; then a part of this ontology was semi-automatically generated using the software OntoGen; finally, the two approaches to ontology building were compared through a quantitative evaluation based on precision and recall measures. This work is structured as follows: chapter 2 examines some methods and tools used for semi-automatic or automatic ontology generation, chapter 3 is dedicated to the description of the energy-domain case study, and chapter 4 to the experiments and the evaluation phase. The last chapter is dedicated to the discussion of the results and to the conclusions.
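The three elements above (classes, instances, properties) can be illustrated with a toy data structure. This is only a sketch of the general idea: the energy-domain names used here are hypothetical and do not reproduce the thesis' actual OWL ontology.

```python
# Minimal sketch of an ontology as classes, instances and properties.
# The energy-domain names are illustrative, not the thesis' actual OWL model.
ontology = {
    "classes": {"EnergySource": None, "RenewableSource": "EnergySource"},  # child -> parent
    "instances": {"solar_power": "RenewableSource"},
    "properties": [("Country", "uses", "EnergySource")],
}

def is_a(onto, cls, ancestor):
    """Walk the subclass hierarchy upward to test class subsumption."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = onto["classes"].get(cls)
    return False

# An instance inherits membership from its class's ancestors.
print(is_a(ontology, ontology["instances"]["solar_power"], "EnergySource"))  # True
```

Even this caricature shows why inference matters: the fact that solar power is an energy source is not stated anywhere, but follows from the hierarchy.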

Chapter 2: Methods and Tools for the semi-automatic or automatic ontology generation

Manual ontology building is a time-consuming activity that requires a lot of effort for domain knowledge acquisition and domain knowledge modeling. To overcome these problems many methods have been developed, including systems and tools that, using text mining and machine learning techniques, generate ontologies automatically or semi-automatically. The research field that studies these issues is usually called ontology generation, ontology extraction or ontology learning (Maedche et al., 2001). It studies the methods and techniques used to:

construct an ontology ex novo, automatically or semi-automatically;
enrich or adapt an existing ontology using different sources.

The ontology learning process is useful for different reasons: first of all to accelerate the process of knowledge acquisition, second to reduce the time needed to update an existing ontology, and finally to accelerate the whole process of ontology building. Buitelaar (2005) proposes an incremental stratification of this process (see figure 2.1):

Figure 2.1. Stratification of the ontology learning process in (Buitelaar 2005)

Starting from the lowest level, each phase can be considered as input to the next. The phases identified are the following:

1. Extraction of relevant terms and their synonyms from a textual corpus for a target domain;
2. Concept identification (the third layer of the scale proposed by Buitelaar);
3. Derivation of a hierarchy of the concepts previously identified;
4. Identification of non-taxonomic relations between the concepts;
5. Adjustment of the ontology with new instances, concepts and properties (ontology population);
6. Discovery of new rules and axiomatic relations between concepts and properties.

Most approaches iterate these steps, both to integrate user feedback into the process of ontology generation and to re-use the newly acquired knowledge as a knowledge base for future iterations.
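The lowest layer of this stratification (term extraction) is often approximated with simple frequency statistics over a corpus. The following is an illustrative sketch under that assumption, not one of the systems surveyed below:

```python
import re
from collections import Counter

def extract_terms(corpus, stopwords, top_n=5):
    """Rank candidate domain terms by raw frequency (layer 1: term extraction)."""
    tokens = re.findall(r"[a-z]+", corpus.lower())
    counts = Counter(t for t in tokens if t not in stopwords)
    return [term for term, _ in counts.most_common(top_n)]

corpus = "Solar energy and wind energy are renewable energy sources."
print(extract_terms(corpus, {"and", "are", "the"}))
# ['energy', 'solar', 'wind', 'renewable', 'sources']
```

Real systems replace raw frequency with measures such as TF-IDF and add linguistic filters, but the output, a ranked list of candidate terms, feeds the next layers in the same way.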

Different approaches for the ontology generation

Maedche and Staab (2001) proposed a classification of the systems used for automatic or semi-automatic ontology building, based on the type of input the systems consider to initiate the process of ontology generation. The authors distinguished between ontology generation from text, from dictionaries, from knowledge bases, from semi-structured schemata and from relational schemata. The most widely used approaches to ontology extraction from text, as reported in Perez (2004), are the following:

Pattern-based extraction (Hearst 1992): this approach usually uses heuristic methods that examine the text with distinctive lexico-syntactic patterns. A relation is recognized and extracted if a sequence of words within the text matches a pattern. The basic idea of this approach is very simple: to define a regular pattern that captures the expressions present in the text and maps the results of the matching into a semantic structure, such as a taxonomy of relations among concepts.

Association rules: initially defined to extract information from databases in the data mining field (Agrawal et al., 1993), association rules were used in (Maedche, 2001) to discover non-taxonomic relations between concepts, using a concept hierarchy as knowledge base.

Conceptual clustering: in this approach the concepts are grouped according to their semantic similarity to build hierarchies. The semantic similarity can be calculated with different methods, for example according to the distributional approach: the smaller the distance between the linguistic distributions of two words, the more similar the concepts (see Faure et al. 2000).
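A pattern-based extractor in the spirit of Hearst (1992) can be sketched with a single regular expression for the "X such as Y" pattern. This is a toy illustration of the idea, not Hearst's actual implementation, which matched full noun phrases rather than single words:

```python
import re

# Toy "NP such as NP" matcher: yields (hyponym, hypernym) pairs.
PATTERN = re.compile(r"(\w+)\s+such\s+as\s+(\w+)")

def extract_hyponyms(text):
    """Return (hyponym, hypernym) pairs matched by the 'such as' pattern."""
    return [(hypo, hyper) for hyper, hypo in PATTERN.findall(text)]

text = "Poets such as Shakespeare wrote sonnets; fuels such as coal pollute."
print(extract_hyponyms(text))
# [('Shakespeare', 'Poets'), ('coal', 'fuels')]
```

Each extracted pair is a candidate hyponymy relation that can then be mapped into a taxonomy, as described above.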

Ontology pruning: the aim of this approach is to build a domain ontology based on various sources (Kietz et al., 2000). It includes the following steps: first, a generic core ontology is used as the basic infrastructure for a domain-specific ontology. Then, a dictionary of important domain terms is used for domain concept acquisition, and these concepts are classified into the generic core ontology. Finally, domain-specific and general text corpora are used to remove non-domain-specific concepts, following the heuristic that domain-specific concepts should be more frequent in a domain-specific corpus than in a generic one.

Concept learning: with this approach a given taxonomy is incrementally enriched by acquiring new concepts from textual documents (see Hahn et al 2000).

The above are the main approaches for ontology generation from text. However, as mentioned earlier, the Maedche and Staab classification identifies four other classes of sources used as input for ontology extraction. Among these, ontology generation from dictionaries is based on the use of machine-readable dictionaries to extract relevant concepts and the relations among them; the methods and tools used for this task are based on linguistic and semantic analysis, and WordNet is usually the dictionary used. Ontology learning from a knowledge base aims at generating an ontology using an existing knowledge base as source. Ontology extraction from semi-structured data, instead, has as its objective the extraction of an ontology from sources with a pre-defined structure, such as XML Schemas. Finally, ontology learning from relational schemata aims to learn an ontology by extracting relevant concepts and relations from knowledge stored in databases. The following pages examine the different methods, tools and techniques available

in the literature for automatic or semi-automatic ontology generation starting from different data sources.

2.1 Methods and techniques for ontology generation from text

Since the late fifties many approaches have been proposed to extract terms, concepts and relationships from text, and since the mid nineties these efforts have gained new momentum thanks to the availability of more sophisticated statistical and NLP techniques. These techniques are now used in many approaches to ontology generation (Soergel et al, 2005). Reinberger's method (Reinberger et al, 2004) supports ontology extraction from text. Its aim is to create an initial skeleton of the ontology, to be then refined by analysts. It is based on a three-step process. First, a textual domain corpus is parsed with a shallow parser; then noun phrases and their relations with other verb phrases or noun phrases are listed. In the third step, clustering techniques group terms that share similar relations into the same classes, and the ontology is created. Khan and Luo's method (Khan and Luo, 2002) aims to build a domain ontology starting from text documents, using clustering techniques and the WordNet ontology. The hierarchy is created by grouping documents similar in content (the documents are provided by the user) within the same cluster, and then arranging the clusters into a hierarchy using an algorithm called SOTA. After building the hierarchy of clusters, a concept (or topic) is assigned to each cluster with a bottom-up concept assignment mechanism (the assignment starts from the leaf nodes of the hierarchy). Then the assigned topic is associated with the appropriate concept in WordNet. Finally,

the concepts of the internal nodes are assigned using the descendant nodes and their hypernyms in WordNet. Another method for building ontologies, proposed by Bachimont (2002), takes into account linguistic techniques that come from differential semantics. In this method the construction of the ontology is carried out in a three-step process. First, there is a semantic normalization, where the user chooses the relevant terms of a domain and normalizes their meaning, expressing the similarities and differences of each notion with respect to its neighbours. These terms are then placed into a hierarchy (and the user has to justify her/his decisions). In the second step there is a knowledge formalization phase in which, using the taxonomy obtained in the first step, the various terms are disambiguated so that a domain expert can carry out a formalization of the knowledge. The third step consists of the operationalization: the created taxonomy is transcribed into a specific knowledge representation language. Nobecourt (Nobecourt, 2000) presents an approach to building domain ontologies from text using text mining techniques and a corpus. This method is based on two activities: modelling and representation. The modelling activity is based on the extraction of relevant domain terms (the "conceptual primitives") from a corpus. After this operation, domain experts look for relevant domain terms in the list and identify the main sub-domains of the ontology. These terms are modelled as concepts and constitute the first skeleton of the ontology. Subsequently these concepts are described in natural language, constituting in this way a new source of documents (a new corpus) from which a new list of primitives can be extracted, in an iterative process that gradually refines the skeleton of the ontology. The representation activity consists of the translation of the modelling schemata into an

implementation language. This method is technologically supported by the platform TERMINAE (Biebow et al, 1999), which will be discussed later. Kietz et al. (2000) proposed a generic method to discover a domain ontology from given heterogeneous sources using natural language analysis techniques. It is a semi-automatic approach to ontology building, in the sense that the user takes an active part in the process. The authors propose to extract ontologies starting from a core ontology (e.g. SENSUS, WordNet, etc.) and enriching it with new domain-specific concepts. The user has to specify which documents should be used to refine the core ontology. New concepts are identified by applying NL techniques to the suggested documents. Then the resulting enriched core ontology is pruned and focused on a specific domain through the removal, with several statistical approaches, of general concepts. Finally, relations between concepts are learnt by applying learning methods, and are added to the resulting ontology. This process is cyclic, because the resulting ontology can be refined by applying the method iteratively. Aussenac-Gilles et al. (2000) suggest a method that allows creating a domain model using NLP tools and linguistic techniques for the analysis of corpora. This method uses text as its starting point, but may also use other existing ontologies or terminological resources to build the ontology. It performs ontology learning at three levels: the linguistic level, the normalization level and the formal level. The first consists of the extraction of terms and lexical resources from text. These elements are then clustered and converted into concepts and semantic relations at the normalization level. Finally, concepts and relations are formalized by means of a formal language. The process is composed of four phases. The first two are referred to

as the linguistic level and include the corpus constitution (the authors point out the importance of a domain expert's aid in selecting the domain-specific corpus) and the linguistic study, which focuses on the selection of adequate linguistic tools for the analysis of the text. This phase yields domain terms, lexical relations and a set of synonyms. The third phase is the normalization. The result of this phase is a conceptual model expressed in the form of a semantic network. It is divided into two sub-phases: a linguistic phase and a conceptual one. During the linguistic phase the ontology engineer chooses the terms and the lexical relations (e.g. hyponyms) that have to be modelled; then s/he adds a natural language definition for these terms, considering the senses that they have in the text and defining, for each sense, some identifying labels. If there are several meanings, the most relevant for the domain are kept. During the conceptual phase, concepts and semantic relations are defined in a normalized form using the labels of concepts and relations. The last phase is the formalization. It includes ontology validation and implementation. The evaluation of the knowledge learnt is made by the user and by a domain expert. Once the ontology has been evaluated it can be implemented (following this approach, for example, an ontology about fiberglass manufacturing was built and implemented for a private company; see Aussenac-Gilles et al. 2003). Hearst's (1992) method aims to automatically acquire lexical relations of hyponymy from a corpus in order to build a general-domain thesaurus, using WordNet to verify its performance. The process exploits a set of predefined, easily recognizable lexico-syntactic patterns. The method aims at discovering such patterns, and the author suggests that other lexical relations will be acquirable in this way. All of them will

be used to build the thesaurus. The method proposes the following five-step procedure to automatically discover new patterns:

1. Decide on a lexical relation of interest.
2. Gather a list of terms for which this relation is known to hold (this step can be done automatically using the method itself).
3. Find documents in the corpus where these expressions occur syntactically near one another, and record the whole environment (the environment is the linguistic space where the expressions appear: in the simplest case, "Poets such as Shakespeare" is the linguistic environment of the pattern "such as", used to indicate a hyponymy relation between poet and Shakespeare; in more complicated cases the environment can be represented by sentences or full periods).
4. Find commonalities among these environments and the initial hypothesis.
5. Once a new pattern has been identified, use it to gather more instances of the target relation and restart the process from step 2.

To validate this acquisition method, the author proposed comparing it with the information found in WordNet: for example, if two terms of the thesaurus that are in a hyponymy relation are also linked hierarchically in WordNet, then the thesaurus is verified. Alfonseca and Manandhar (2002) proposed a method based on the distributional semantic hypothesis (Harris, 1971), which states that the meaning of a word is highly correlated to the linguistic context in which it appears. From this point of view, the context of a certain concept can be encoded as a vector of context words containing the words that co-occur with that concept and their frequencies (topic

signature). These topic signatures are then clustered, and a distance measure such as TF-IDF or chi-square is used to separate the different word senses. The obtained signatures can be compared with the topic signatures of an existing ontology (e.g. WordNet), identifying in this manner the hypernym candidates. To do that, a top-down classification algorithm is used.

Methods and techniques for the Ontology generation from dictionaries

Jannink and Wiederhold's approach

This approach (Jannink and Wiederhold, 1999) aims to convert dictionary data into a graph structure to support the generation of a domain or task ontology. It uses an algebraic extraction technique to generate the graph structure and to create the thesaurus entries for all the words defined in the graph. According to its purpose, only headwords and definitions having many-to-many relations are considered. The result is a directed graph with two properties: each headword and its definition are grouped in a node, and each word in a definition node is an arc to the node having that word as its headword. The basic hypothesis of this approach is that the structural relationships between terms are relevant to their meaning. They are extracted in a three-step process in which a statistical approach and the PageRank algorithm are used to produce, as output, a set of terms related by the strength of association of the arcs that connect them.
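The distributional idea behind topic signatures can be sketched as context-word vectors compared with cosine similarity. This is an illustrative toy under simplified assumptions (fixed window, raw counts, no TF-IDF weighting), not Alfonseca and Manandhar's actual system:

```python
import math
from collections import Counter

def signature(text, target, window=2):
    """Count the words co-occurring with `target` within a +/- `window` span."""
    words = text.lower().split()
    sig = Counter()
    for i, w in enumerate(words):
        if w == target:
            for j in range(max(0, i - window), min(len(words), i + window + 1)):
                if j != i:
                    sig[words[j]] += 1
    return sig

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

corpus = "coal plants burn coal while solar plants harvest solar light"
print(cosine(signature(corpus, "coal"), signature(corpus, "solar")))
```

Words that keep similar company ("coal" and "solar" both co-occur with "plants") get a non-zero similarity, which is exactly what clustering over signatures exploits.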

Rigau's method

This method (Rigau et al, 1998) consists of extracting lexical ontologies from dictionaries. Its main goal is to semi-automatically develop versions of WordNet for the Spanish language. It is based on two procedures: analysis of dictionary definitions and word sense disambiguation of genus words. In a first phase, each definition in a monolingual dictionary is analyzed to find a hypernym of the word being defined (also called the genus word). Then a word sense disambiguation procedure is applied to the genus word to discover which of its meanings is used. This method was developed as part of the EuroWordNet project, which aimed at developing lexical ontologies for several European languages.

Methods and techniques for the Ontology generation from a knowledge base

Suryanto and Compton's approach

This approach (Suryanto and Compton 2001) aims at generating an ontology from a knowledge base of rules. The authors propose an algorithm to extract a taxonomy of classes, where a class is a set of different rule paths that arrive at the same conclusion, and a rule path for node n consists of all the conditions from all predecessor rules plus the new conditions of the particular rule of node n. The approach takes the initial trees and creates a set of classes, trying to discover relations among them. Three types of relations are considered: subsumption, mutual exclusivity and similarity. The central idea of this approach is to group all the rules in each class and calculate a quantitative measure for each relation between each

couple of classes. This quantitative measure provides the confidence with which the relation can be said to exist (for example, class A subsumes class B, with a certain confidence, if class A only exists when class B exists, but not the other way around). With the set of classes and relations created, the class taxonomy is built up. The whole process is evaluated by an expert.

Methods and techniques for the Ontology generation from semi-structured schemata

Papatheodorou and colleagues' method

This method (Papatheodorou, 2002) aims to build taxonomies from domain repositories written in XML or RDF, using a data mining approach called cluster mining. The cluster mining approach first tries to group similar metadata into clusters and then, processing these clusters, extracts a controlled vocabulary used to build the taxonomy (Perez 2004). The steps proposed by this method are:

1. Data collection and pre-processing, where the main objective is to select the appropriate keywords from the metadata files. This makes it possible to discover similarities between the documents; for that purpose, words such as articles or prepositions are dropped.
2. Pattern discovery: in this step the cluster mining approach is used to discover and build clusters of similar documents and to extract keywords representative of the documents' content.

3. Pattern post-processing and evaluation: the keywords extracted in the previous step are examined and scored with a statistical approach, and the best keywords (the most representative of the content of the clusters) are selected. These keywords provide the vocabulary necessary to form the concepts of the taxonomy.

Deitel and colleagues' approach

Deitel et al. (2001) present an approach for learning ontologies from RDF annotations of web sources. It focuses on learning new domain concepts from the whole RDF graph, enriching in this way the ontology to which the RDF annotations belong. To extract the description of a resource from the graph, this approach follows a criterion called the description of length n of a resource, which is the largest connected subgraph in the whole RDF graph containing all possible paths of length smaller than or equal to n starting from or ending at the considered resource. The proposed steps for building a hierarchy based on resource descriptions are the following:

1. Extract resource descriptions of length one and repeat the process, incrementing the length until the maximum path in the graph is covered.
2. Extraction of resource descriptions of length one from the whole RDF graph (these descriptions form a set of RDF triples: resources, properties and values).
3. Iterative generalization of all possible pairs of triples: the generalization of two triples is the most specific triple subsuming them.

4. Construction of the intensions 5 of length one. The triples sharing a same extension are grouped together.
5. Build the generalization hierarchy based on the inclusion relations between the node extensions.
6. Repeat the process, incrementing the length of the resource descriptions to be extracted.

2.4 Methods and techniques for Ontology generation from relational schemata

2.4.1 Kashyap's method

Kashyap (1999) uses database schemas to build an initial ontology that is then refined through a collection of queries made by users. The process is interactive, because an expert is involved in deciding which classes and properties are important for the domain ontology, and iterative, because it is repeated as many times as necessary. The process has two phases. In the first, the database schemas are analyzed in detail; at the end of this process a new database schema is created and, through reverse engineering techniques, its content is mapped into an ontology. In the second phase, the ontology built from the database schemas is refined by means of user queries that allow adding or deleting attributes, creating new entities, etc.

5 An intension may include redundant triples, one being more general than another. It is cleaned up by deleting the triples subsuming another one (Perez, 2004).
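The core of a reverse-engineering phase like the one just described can be caricatured as mapping tables to classes and columns to properties. The schema below is hypothetical, and the real method also analyzes keys, user queries and expert feedback; this sketch only shows the table-to-class intuition:

```python
# Hypothetical relational schema: table name -> column names.
schema = {
    "country": ["name", "gdp"],
    "power_plant": ["name", "capacity_mw", "country_name"],
}

def schema_to_ontology(schema):
    """Map each table to a class and each column to a datatype property."""
    return {
        table.title().replace("_", ""): [f"has_{col}" for col in cols]
        for table, cols in schema.items()
    }

print(schema_to_ontology(schema))
# {'Country': ['has_name', 'has_gdp'], 'PowerPlant': ['has_name', 'has_capacity_mw', 'has_country_name']}
```

The interesting work happens after this naive mapping: foreign keys such as `country_name` should become object properties between classes rather than plain attributes, which is precisely what expert refinement addresses.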

2.4.2 Rubin and colleagues' approach

Rubin et al. (2002) aim to automate the process of creating instances and their values using data extracted from external relational sources. This method uses an XML Schema as the interface between the ontology and the data sources, and it automates the updating of the links between the ontology and the data acquisition when the ontology changes. The approach needs the following components: an ontology (with domain classes and the relations among them), an XML Schema (the interface between the ontology and the data acquisition) and an XML translator (to convert external incoming data to XML). The method is based on a four-step process:

1. An ontology model of a domain must be created (e.g. with Protegé).
2. An XML Schema must be generated from the ontology (once the ontology is built and the constraints on the properties are made explicit, the XML Schema is sufficiently determined and can be written directly from the ontology).
3. The data acquired from the external resources must be put into an XML document using the syntax specified in the XML Schema.
4. Ontology updating and change propagation.

2.4.3 Stojanovic and colleagues' approach

Stojanovic et al (2002) try to build light ontologies from conceptual database schemas using a mapping process. The ontology generation follows a five-step process:

1. The information from a relational schema is captured through reverse engineering. This process tries to preserve as much information as possible from the database schema.
2. The information obtained is analyzed by applying a set of mapping rules in order to build ontological entities. These rules specify how elements are migrated from the database into the ontology, and they are applied in the following order: concept creation, then inheritance and relation creation, so as to incrementally create the ontology.
3. The ontology is created through the application of the rules mentioned in the previous step.
4. The ontology is evaluated and refined.
5. The ontological instances are created on the basis of the tuples of the relational database.

2.5 Research Projects and tools for the ontology generation

Automatic and semi-automatic generation of ontologies from document corpora or from other types of data collections is nowadays one of the research challenges of the Semantic Web. Many research projects have been developed and many prototypes and tools have been created for this task. The following section overviews the major tools.

Mo'K Workbench

Mo'K Workbench (Bisson, 2000) is a tool that semi-automatically creates ontologies from a textual corpus using different conceptual clustering techniques. It does not

need prior semantic knowledge (e.g. an existing ontology): applying NLP techniques, it extracts sets of triples from the document corpus, each formed by a verb, a word, and the syntactic role of that word within the sentence. Mo'K then counts the occurrences of each triple, removing from the list the triples with too many or too few occurrences. Finally, it computes the semantic distances between the triples to form conceptual clusters.

Text-To-Onto

This system (Maedche and Volz, 2000; Maedche and Staab, 2004) integrates the KAON environment, an open-source ontology management infrastructure, with a tool suite for building ontologies from an initial core ontology. It combines knowledge acquisition and machine learning techniques to discover conceptual structures. Terms are extracted according to occurrence frequency and distribution criteria. Semantic and hierarchical relations are extracted with association rules or linguistic patterns, and the relations are weighted according to support and confidence criteria. The result of Text-To-Onto is a domain ontology; the whole process is supervised by an ontologist.

TextStorm and Clouds

This system (Pereira, 1998) has been developed for the semi-automatic construction of a semantic network from text relevant to a target domain. It is composed of two modules that perform complementary activities: TextStorm (Oliveira et al. 2001) and Clouds (Pereira et al. 2000). TextStorm is an NLP tool that extracts binary predicates from text using syntactic and discourse knowledge. The predicates on

which the tool focuses are those that relate two concepts in a sentence. The process works as follows: text from the target domain is provided to the system and tagged using WordNet to find all the parts of speech to which each word may belong. The text is then parsed with an augmented grammar, which yields a lexical classification of the words during parsing. Finally, TextStorm produces a list of extracted terms that becomes the input for the Clouds tool. Clouds is responsible for constructing the semantic network interactively. Using the list of binary predicates extracted from the text, Clouds builds a hierarchical tree of concepts, learning some particulars of the domain with two techniques: a best-current-hypothesis algorithm (to learn the categories of the arguments of each relation) and an Inductive Logic Programming algorithm (to learn the recurrent context of each relation).

OntoLT

OntoLT is a Protégé plugin developed by Buitelaar et al. (2004) that automatically extracts concepts (classes in Protégé) and relations (properties in Protégé) from a collection of linguistically annotated texts. To do so, OntoLT uses rules that map the linguistic entities in the text to classes and properties in Protégé. Using the tool requires a collection of texts automatically annotated with linguistic information in XML format. The annotation includes part-of-speech tags, morphological analysis, and sentence and predicate-argument analysis, and it allows the automatic extraction of linguistic entities that can be used to build an ontology of concepts, subconcepts and

relations of a specific domain. The mapping rules are defined through XPath expressions (used to extract the requested elements or attributes from an XML document). A mapping rule defines how linguistic entities in a corpus (a collection of XML documents) are mapped to Protégé classes and slots. The rules are implemented through XPath preconditions: if all the preconditions are satisfied, a set of linguistic units is generated and one or more operators are activated to define how each unit is mapped into the corresponding Protégé class or slot. OntoLT also includes statistical analysis functions for constructing syntactic rules that identify the linguistic units relevant for the target domain: a rating is computed for each linguistic entity, and the entities more specific to the domain corpus receive the higher ratings.

CORPORUM OntoBuilder

CORPORUM OntoBuilder extracts ontologies and taxonomies from natural-language texts, using a number of linguistic techniques that drive the analysis and the information extraction. It extracts information from structured and unstructured documents using OntoWrapper (Engels, 2002) and OntoExtract (Engels, 2001): OntoWrapper extracts information from texts, while OntoExtract obtains taxonomies from natural-language texts in RDF format and can also refine existing concept taxonomies.

JATKE

JATKE (available as open source) is a Protégé plug-in that offers a unified platform for ontology construction. The

platform has a high degree of flexibility: it allows a variety of approaches to be combined, and even new ones to be created, by assembling pre-existing modules into a set-up personalized for each scenario. JATKE is composed of three different modules (Information/Source, Evidence, Proposal) and its main characteristics are as follows:

- It lets the end user define the mix of learning algorithms that best fits the domain or the desired objective, allowing a personalized setup of the different modules.
- Its modular design makes it possible to develop new modules rapidly, reusing the existing ones.
- It supports a semi-automatic ontology building process, generating proposals for modifications of an existing ontology; the user decides whether to accept each system proposal or not (each one carries a certain degree of confidence).

Communication between the different modules is possible thanks to an internal ontology that represents the internal domain of the system, in which every type of data or command is represented as an instance. Within the system's ontology there is a Proposal button (the last button in figure 2.2) that activates a learning process. The proposals are divided into three categories:

- Class: a proposal concerning a class (a concept) that can be created, deleted, renamed or moved;
- Slot: a proposal concerning a relation or a property;
- Instance: a proposal concerning an instance.
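The accept/reject workflow described above can be sketched in a few lines of Python. This is a toy illustration, not JATKE's actual API: the class names, the `accept` callback, and the confidence threshold are all invented for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Proposal:
    kind: str          # "class", "slot" or "instance"
    action: str        # e.g. "create", "delete", "rename", "move"
    target: str        # name of the affected entity
    confidence: float  # degree of confidence attached to the proposal

@dataclass
class Ontology:
    classes: set = field(default_factory=set)

def review(ontology, proposals, accept):
    """Apply the proposals the user accepts; return the rejected ones."""
    rejected = []
    for p in sorted(proposals, key=lambda p: -p.confidence):
        if accept(p):
            # Only class creation is handled in this toy sketch.
            if p.kind == "class" and p.action == "create":
                ontology.classes.add(p.target)
        else:
            rejected.append(p)
    return rejected

# The `accept` callback stands in for the human decision; here it simply
# accepts proposals whose confidence exceeds a threshold.
onto = Ontology(classes={"EnergySource"})
proposals = [Proposal("class", "create", "Biofuel", 0.9),
             Proposal("class", "create", "Pizza", 0.2)]
rejected = review(onto, proposals, accept=lambda p: p.confidence > 0.5)
```

In this sketch the high-confidence proposal is applied to the ontology, while the low-confidence one ends up in the rejected list; in JATKE the decision is of course taken by the user, not by a threshold.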

All the proposals are analyzed by the user, who accepts or rejects them. If the user accepts a proposal, the adjustment is communicated to all the modules participating in the process and the ontology is updated accordingly.

Figure 2.2. The slot proposal in JATKE

OntoGen

OntoGen (Fortuna et al. 2005) is a semi-automatic, data-driven topic ontology editor which integrates machine learning and text mining algorithms. OntoGen's main features are the automatic extraction of keywords from the documents given as input to the system (the extracted keywords are candidate concepts of the ontology) and the generation of concept suggestions. This system will be discussed in detail in the next chapter.

OntoBuilder

This tool (Modica et al. 2001) helps users in the ontology building process, using semi-structured data coded in XML or HTML as sources. It has been designed to work like a web browser (see figure 2.3). Once the URL of a web page is given and the page has been downloaded by the system, an ontology can be extracted from that web source. The system is composed of three main modules: the user

interaction module, the observer module and the ontology modelling module. The ontology building process has two phases: a training phase, in which an initial domain ontology is built using the data provided by the user (e.g. the user suggests browsing websites that contain relevant domain information), and an adaptation phase, in which the obtained ontology is gradually refined along the way (for each new site a candidate ontology is extracted and merged with the existing one).

Figure 2.3. The browser interface of OntoBuilder

OntoLearn

OntoLearn (Velardi 2004) is an ontology learning system based on NLP and machine learning techniques, organized in three phases. The first phase is term extraction: relevant terms (both n-grams and bigrams such as "credit card") are extracted with statistical NLP techniques. Domain-specific and generic corpora are used to prune the terminology that is not specific to the domain. The domain documents are given as input and the system, after parsing, extracts a list of syntactically plausible terms (e.g. Adj+N). Two entropy-based measures are used to evaluate the importance of each term: Domain Consensus and Domain Relevance. The first selects only the terms that occur consistently across the document corpus; the second selects the terms belonging to the domain of interest, and is computed against a reference set of terms from different domains. Finally, the extracted terms are filtered using a lexical cohesion measure that quantifies the degree of association

of all the words in a string of terms. The second phase is the semantic interpretation of terms, based on the principle of compositional interpretation: the meaning of a compound term such as "business plan" is derived by associating each component term with the correct identifier in an existing ontology (e.g. WordNet). In this phase a word sense disambiguation task is carried out with an algorithm called SSI (Structural Semantic Interconnections), which is based on syntactic pattern matching. SSI produces a semantic graph which includes the selected senses and the semantic interconnections between them. The third phase is the extension and refinement of the initial ontology. Once the terms have been semantically interpreted, they are organized in sub-trees and inserted at the appropriate nodes of the initial ontology. Some nodes of the initial ontology are then pruned to create a domain-specific view of it. Finally, the new ontology is converted into OWL format.

WeDaX

WeDaX (Snoussi 2002) was developed by a group of researchers at the University of Montreal (Canada). Its aim is to extract information from web pages using an ontology to model the data to be extracted. More specifically, the web page is converted to XML format and then mapped to an ontological model (the model definition and the mapping are done manually by the user through a graphical interface). An automatic process then extracts the information, and the result is an XML document containing a set of standardized data on which queries can be executed.

WebKB

WebKB (Craven 2000) has been developed at Carnegie Mellon University (USA); its objective is to automatically create a machine-understandable knowledge base by extracting information from web documents. In particular, the system extracts the information it needs for knowledge base creation from an initial set of web pages and then searches new web sites in order to automatically populate the knowledge base with new assertions. The tool needs two inputs: an ontology that specifies classes, and examples of web pages representing the classes or the instances of that ontology, used to map the ontology to the Web.

SOBA (SmartWeb Ontology-Based Annotation)

SOBA is a component of the SmartWeb system developed at the University of Karlsruhe (Germany). It automatically populates a knowledge base with information extracted from web pages about football. It is composed of a web crawler, components for linguistic annotation, and a final module that transforms the linguistic annotations into an ontological representation (see Cimiano 2006).

RelExt

RelExt is a tool for extracting relationships from a collection of texts used for ontology building. Domain ontologies rarely model verbs as relations between concepts; the basic assumption of this system is that verbs evidently act as connecting elements between concepts. RelExt automatically identifies relevant triples (a pair of concepts connected by a relation) among the concepts of an existing ontology. The system works by extracting relevant verbs and their arguments from a collection of domain

specific documents and calculating, through a combination of statistical and linguistic techniques, the relevant relations between concepts.

DODDLE

This system (Yamaguchi 1999) aims to construct domain ontologies, in particular hierarchically structured sets of domain terms without concept definitions, by reusing a machine-readable dictionary (MRD) and adjusting it to specific domains. Since DODDLE only generates hierarchically structured sets of domain terms, it supports the user in concept categorization and concept name suggestion. The tool deals with concept drift (the senses of a concept change depending on the application domain). For this purpose two strategies are followed, match result analysis and trimmed result analysis, both of which try to identify which parts of the initial ontology may stay and which should be moved. Analyzing the concept drift between an MRD and a domain ontology involves two main activities: the first builds an initial model from the MRD by extracting information about the relevant terms of a given domain; the second manages concept drift by adjusting the initial model to the specific domain.

SVETLAN

SVETLAN (Chalendar and Grau, 2000) is a domain-independent tool that clusters the words appearing in a text; its goal is to build a hierarchy of concepts. Its learning method is based on a distributional approach: nouns playing the same syntactic role in sentences with the same verb are grouped together in the same class. The learning process follows three steps: syntactic analysis, aggregation and filtering. In the first step the tool retrieves sentences from

the original text (it accepts only French natural-language texts) in order to find the verbs inside each sentence, since the basic assumption is that verbs allow nouns to be categorized. The output of this step is a list of triplets of verbs, nouns and the syntactic relations between them. The aggregation step creates clusters of nouns with similar meanings (using conceptual clustering techniques), and the filtering step weighs the nouns inside the classes (the clusters created) and removes the nouns that are not relevant within each group.

TERMINAE

This system (Biebow et al. 1999) integrates linguistic and knowledge engineering tools. The linguistic tool supports the definition of terminological forms from the analysis of term occurrences in a corpus: the ontologists analyze the uses of the terms in the corpus to define their meanings. The knowledge engineering tool, instead, helps represent terminological forms as concepts. TERMINAE builds concepts from the study of the corresponding terms in a corpus: first, using an extractor tool, it establishes a list of candidate terms which are proposed to the ontologists (who select a set of terms); then the ontologists conceptualize the terms and analyze their uses in the corpus to define all their meanings; finally, the ontologists give a natural-language definition for each meaning and translate it into an implementation language.

The development of this kind of systems, and their alignment with manually built gold-standard ontologies, represents one of the main challenges for the near future of the Semantic Web. To achieve this objective, the evaluation of the systems and of their results plays a crucial role. In the next chapter will be

presented the results of an evaluation comparing a semi-automatically generated domain ontology with a manually built one.
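As a self-contained illustration of how such an alignment with a gold standard can be scored (this is not the evaluation procedure used in this thesis, and the concept labels below are invented), the overlap between a generated concept set and a gold-standard one can be measured with lexical precision, recall and F1:

```python
def lexical_overlap(generated, gold):
    """Score a generated concept set against a gold standard by label overlap."""
    generated = {c.lower() for c in generated}
    gold = {c.lower() for c in gold}
    hits = generated & gold
    precision = len(hits) / len(generated) if generated else 0.0
    recall = len(hits) / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = {"energy security", "oil", "natural gas", "renewables"}
generated = {"oil", "natural gas", "pipeline"}
precision, recall, f1 = lexical_overlap(generated, gold)
```

Here precision is 2/3 (two of the three generated concepts appear in the gold standard) and recall is 1/2 (two of the four gold concepts were found).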

Chapter 3: The Energy Domain Case Study

In this chapter, a concrete comparison between manual and semi-automatic ontology construction is presented. The main objective is a comparative evaluation of the two approaches. The questions we will try to answer are: which approach is more useful in the ontology building task? Are the approaches alternatives, or can they be integrated? What are the pros and cons of each? And, finally, is it realistic, in the near future, to expect a complete automation or, at least, a semi-automation of the whole ontology building process? To answer these questions, a set of experiments was performed in which a manually built ontology is compared with a semi-automatically generated one. The ontology in question is a domain ontology for energy security: it represents our case study and allows us to draw some conclusions about the two approaches. We chose to focus our attention and research efforts on the energy domain because of the importance that energy issues have nowadays gained in the global institutional and economic agenda. The construction (by both the manual and the semi-automatic approach) and the evaluation of a dedicated ontology for this field thus represent, at the same time, a big challenge for ontology engineers and a great opportunity for the institutional and economic decision makers of the energy domain. In fact, the implementation of such an ontology in a dedicated information

system may improve both the precision and recall 6 of search results in information retrieval tasks. This improvement may help decision makers in the energy field get the relevant information they need for making their decisions (in this view, the ontology would be used to create a sort of dedicated ontology-based decision support system for the energy domain). In the following paragraphs we will explain the methodology used for modelling the energy domain (paragraph 3.1). Then a more detailed description of the manually built energy ontology will be presented (paragraph 3.2). OntoGen, the tool used for the semi-automatic ontology generation, will be presented next (paragraph 3.3), followed by a presentation of the semi-automatic ontology. Finally, the experimental results of the manually built ontology will be compared with the results obtained by the semi-automatic ontology building process (paragraph 3.4).

3.1 Energy Domain Modelling Process

3.1.1 The modelling approach

Poesio (2005) states that there are at least two different research traditions in the domain modelling literature. One school of thought supports the need for more rigorous logical and philosophical foundations for domain modelling formalisms. Its aim is both to establish a Tarskian semantics for the formalisms used in domain ontologies (leading to description logics) and to have cleaner

6 Precision and recall are two standard measures in the Information Retrieval field. In short, they represent, respectively, how relevant the documents retrieved by a system in response to a query are (precision) and how many of all the relevant documents a system is able to retrieve (recall). These two measures will be presented in more detail in the following pages.

domain ontologies (where the expression "clean ontology" stands for an ontology with a clear semantics, based on sound philosophical and scientific principles). The supporters of this line of research argue that ontologies built according to formalist principles are very beneficial for many NLP (Natural Language Processing) applications such as, for example, information extraction from a database using natural language or, vice versa, information extraction from texts and its addition to a database. The second school of thought (which Poesio calls "cognitive"), instead, argues that the best way to identify epistemological primitives is to study concept formation and learning in humans, and that, accordingly, the best approach to the construction of domain ontologies is to use machine learning techniques to automatically extract ontologies from language corpora. This approach has its philosophical foundations in the work of the late Wittgenstein (Philosophical Investigations, 1953) and especially in his implicit critique of the classic Aristotelian theory of concepts 7, made explicit from a psychological perspective by the work of Eleanor Rosch (Rosch, 1973), through the introduction of the notions of language games and family resemblance. These two notions indirectly dealt a mortal blow to the Aristotelian theory, showing that humans, and human memory, are not able to classify language and concepts as the theory stated, also because of the complexity intrinsic to language itself. The failure of this theory, and in general of the purely formal-logic approach as the key to explaining the way humans think and categorize (Thagard, 1998), has been used as an argument against the formalist approach to domain modelling (the first one mentioned above). Regarding the specific case of energy modelling,

7 According to this theory, a concept can be defined exclusively by a finite set of necessary and sufficient conditions.

it is not easy to classify our work within one of these two categories (formalism vs empiricism), because both a bottom-up strategy and a top-down one were used for building the energy ontology. To be clearer: a mainly bottom-up ontology building strategy 8 was used. It was bottom-up because the acquisition of domain knowledge, and then the process of ontology creation, started from corpora of textual documents and not from abstract conceptual knowledge or a-priori conjectures made by the ontologists about the domain to be represented (in this sense we could say that we privileged an empirical, language-based point of view). On the other hand, a top-down approach was applied through the guidance of an energy domain expert (Dr. Brenda Schaffer of the University of Haifa). This help has been crucial for our work and allowed us to adjust the ontology and the research directions in a more meaningful manner, letting us focus better on specific relevant aspects. For example, after Dr. Schaffer's reviews we mainly focused our attention on the class energy security (which is only one of the 54 classes of the ontology) and on its implications regarding, for example, geopolitical, economic and environmental problems (we will see the class description and its ontological implications in the next paragraph). To summarize: for the manual energy ontology building we mainly used a corpus-based strategy, but at the same time we were guided by a domain expert, hence also applying a top-down strategy. So our approach can perhaps be considered a "middle up-down" one. In the

8 The work of manual ontology building was shared with other Master's students of the University of Salerno, who also worked, with different objectives and perspectives, on the same project.
To facilitate the circulation of information about the common issues of the project, we created a wiki, which proved to be an excellent instrument for knowledge sharing. To know more about our different thesis projects, see

following paragraphs, the whole process of manual energy ontology building will be described step by step.

3.1.2 Information Sources and Tools

The work has been based on information extracted and inferred from a database of about 200 documents about energy. The documents were recovered from the web, mainly (but not only) from the online resources of the most important energy agencies and associations in the world. The strategy used for document recovery was the following. In a first phase, the documents selected were those returned by search-engine queries about energy policies, energy sources and energy use. In this first phase we did not yet have specific knowledge of the energy domain, so the search queries were not precise; the first 30 documents of the database were recovered following this strategy. After reading them we started to acquire initial knowledge of the domain, which allowed us to better evaluate the relevance of the documents found by the search queries. The query strategy therefore changed and became more precise: in this second phase we built queries using the first classes inserted into the ontology (e.g. "energy security" + "resources affordability" or "energy security" + "reliability of supply") in order to retrieve more relevant and specific documents. Yet another strategy was then followed: we browsed a list of websites considered authoritative sources on energy issues and did a sort of human crawling work, in the sense that,

starting from the first pages of the selected sources (and following all the links presented there), we recovered further documents and information. The full list of the sources used is the following:

- the EIA (U.S. Energy Information Administration), the USA's premier source of unbiased energy data, analysis and forecasting;
- the IEA (International Energy Agency), which acts as energy policy advisor to 26 member countries;
- ENI (Ente Nazionale Idrocarburi), one of the most important integrated energy companies in the world, operating in the sectors of oil, gas, power generation, etc.;
- the OECD (Organisation for Economic Co-operation and Development), one of the world's largest providers of comparable statistics and economic and social data;
- the Oil & Gas Journal, which delivers the latest international oil and gas news;
- the FERC (Federal Energy Regulatory Commission, USA only), which regulates and oversees energy industries in the economic, environmental, and safety interests of the American public;
- the NREL (National Renewable Energy Laboratory, USA), America's primary laboratory for renewable energy and energy efficiency research and development (R&D);
- British Petroleum, another important global energy company;
- and, finally, the World Bank, one of the most important institutions interested in energy security and its impact on the global economy.

The reason why we privileged these sources and not others is strictly linked to the problem of trusting the information provided on the web. In our opinion, the information sources selected by the ontology engineers to discover information and to acquire specific domain knowledge about the field to be represented are a sort of foundational bricks of the ontology infrastructure, and so they have a

very important role. Accordingly, we made this kind of selection because we considered these sources authoritative in the energy field and, therefore, trustworthy to a good degree. Based on the domain information extracted from the selected documents, we started to create the manual ontology using the Protégé software editor. We chose this software for the following reasons:

- It has an extensible knowledge model. The internal representational primitives in Protégé can be redefined declaratively. Protégé's primitives, the elements of its knowledge model, provide classes, instances of these classes, slots representing attributes of classes and instances, and facets expressing additional information about slots.
- It has a customizable user interface. The standard Protégé user interface components for displaying and acquiring data can be replaced with new components that best fit particular types of ontologies (e.g., for OWL).
- It allows importing ontologies in different formats. Several plug-ins are available for importing ontologies into Protégé in different formats, including XML, RDF, and OWL.
- It supports data entry. Protégé provides facilities whereby the system can automatically generate data entry forms for acquiring instances of the concepts defined by the source ontology.
- It offers many ontology authoring and management tools. The PROMPT tools are Protégé plug-ins that allow developers to merge ontologies, track changes in ontologies over time, and create views of ontologies. The Protégé internal

knowledge representation can be translated into the various representations used in the different ontologies. Protégé has different back-end storage mechanisms, including relational databases, XML, and flat files.
- It presents an extensible architecture that enables integration with other applications. Protégé can be connected directly to external programs in order to use its ontologies in intelligent applications, such as reasoning and classification services.
- It makes a Java Application Programming Interface (API) available. System developers can use the Protégé API to access and programmatically manipulate Protégé ontologies. Protégé can be run as a stand-alone application or through a Protégé client in communication with a remote server.

3.1.3 Definition of Domain Concepts

An explicit presentation of the subdivision of the different steps followed for the manual ontology generation is not easy, because different operations took place in parallel (carried out by one or more persons involved in the project). In general, however, the process of modelling the energy ontology was the following. In a first phase, after initial document reading and initial domain knowledge acquisition, a first hierarchy of concept classes was constructed (the first version of the manual ontology consisted of 16 classes, 11 properties, inverse functions included, and 40 instances; see table 3.1). The process followed for the definition of the domain concepts was based on the manual extraction of the main keywords present in the texts read and on the inferences we made from them (e.g. for each keyword we considered its hyponyms, hypernyms, meronyms, and, finally, the potential related

concepts that it was possible to extract). Then, after creating this infrastructure of ontology concepts, we started to insert more detailed knowledge into the ontology, defining instances and different types of relationships (properties in Protégé) in order to put more meaningful information into the ontology.

Table 3.1: Classes, Properties and Instances of the first version of the manual ontology

Classes: Energy Domain, Risks, Solutions, Energy Security, Reliability to Supply, Friendliness to Environment, Energy Sources, Primary and Secondary Sources, Nuclear, Nuclear Weapon Proliferation, Nuclear Energy Proliferation, Country, Infrastructure, Natural Gas, Peat, Renewable and Non Renewable Sources.

Instances: Alcohol Fuel, Biofuel, Ethanol, Coke, Diesel Fuel, Gasoline, Jet Fuel, Biomass, Corn, Geothermal, Photovoltaic, Solid Waste, Solar, Waste, Water, Wind, Wood, Anthracite, Coal, Bituminous Coal, Lignite, Oil, Propane, Belarus, USA, China, Russia, India, Canada, Brazil, Venezuela, Nigeria, Saudi Arabia, Norway, Mexico, UK, Uzbekistan, Algeria, Kuwait.

Properties: is a type of, includes, is mostly exported by, exports, is one of the major producers of, is mostly produced by, is used to, is made into, causes, is caused by, is useful in case of.

3.1.4 Horizontal Links

After defining the properties and relations, we started to create horizontal links between different classes, subclasses and instances. Horizontal links are properties or relations which link individuals or classes that are not in hierarchical

relationships. This kind of relation is very important because it allows the ontology to represent information of a higher level than the subsumption ("is-a") and similarity information usually provided by knowledge representation systems. An example of the usual kind of information is the following: "Margherita is a type of pizza". Through horizontal links, instead, it is possible to convey richer information, such as: "In Naples there is the restaurant X, in front of the place Y, that prepares a fantastic pizza Margherita". The improvement that horizontal links bring to the quality of the information these systems can provide therefore seems evident. To create the horizontal links we used logical properties with one or more arguments. An example of a logical property with one argument is "Mario is a funny guy", where Mario is the argument and being a funny guy is the property. An example of a logical property with two arguments (properties of this kind are usually called relations in the logic literature) is "Saudi Arabia is one of the major exporters of oil", where Saudi Arabia and oil are the arguments and being one of the major exporters of is the asserted relation. Properties of this kind allowed us to create relations within the ontology integrating up to three classes (a concrete example will be shown in the next paragraph with the ontology description). Table 3.2 shows the full statistics of the horizontal links present within the ontology: most of them link two classes and are thus based on properties with two arguments.
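The difference between one-argument properties and two-argument relations can be pictured with plain predicate-style facts. This is a toy sketch reusing the running examples from the text; the fact store and the query function are invented for illustration, not part of the ontology itself.

```python
# Facts stored as tuples: a one-argument property is a pair, a two-argument
# relation is a triple. Predicate and entity names echo the text's examples.
facts = {
    ("funny_guy", "Mario"),                       # property: Mario is a funny guy
    ("major_exporter_of", "SaudiArabia", "Oil"),  # relation between two entities
    ("major_exporter_of", "Russia", "NaturalGas"),
}

def holds(predicate, *args):
    """Check whether a property/relation holds for the given arguments."""
    return (predicate, *args) in facts

def exporters_of(resource):
    """All X such that major_exporter_of(X, resource): a horizontal-link query."""
    return sorted(x for (p, x, *rest) in facts
                  if p == "major_exporter_of" and rest == [resource])
```

The same fact base answers both kinds of questions: `holds("funny_guy", "Mario")` checks a one-argument property, while `exporters_of("Oil")` traverses the two-argument relation.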

Table 3.2: Horizontal links data.

Horizontal links between 2 classes
Horizontal links involving up to 3 classes

Figure 3.1 shows an example of the high degree of interconnection created within the ontology using horizontal links. An example easily seen in the figure (on the right side) is the link between the class Solutions and the class Friendliness to Environment. It shows that there is one instance of the class Solutions (we recognize it as an instance because the colour of the link is pink) that is linked to the issue of friendliness to environment (probably in a positive sense, meaning that one of the solutions to the energy problems goes in the direction of friendliness to the environment).

Figure 3.1: Horizontal links within the ontology.

3.1.5 Ontology Population

Once the ontology structure was defined, the ontology population phase started. We launched crawlers on the web in order to retrieve other meaningful documents about the energy domain, and we used automatic lexical acquisition systems in order to update the ontology with new concepts and new lexicon. The first strategy, with the crawlers, allowed us to recover relevant documents not previously retrieved with the search queries. The second strategy was carried out with an automatic keyphrase extraction system called LAKE (D'Avanzo et al. 2004), which will be described in detail in the next chapter. This system was used to extract relevant keywords from the documents given as input (the input documents were those retrieved with the crawler strategy); the extracted keywords were then used to recover new lexicon and new concepts, which were manually inserted into the ontology. This work was mainly done for a specific class of the ontology: the energy security class.

3.1.6 Logical Expressions

In order to insert different types of information into the ontology, we used logical expressions. To create these expressions we used the quantifiers of first-order logic, as introduced by Frege (1879), so as to introduce more and better inferential mechanisms into the system. The introduction of first-order descriptive logical expressions was possible because of the choice of building the energy ontology in OWL DL. This language, in fact, allows enriching the ontology with formal descriptive expressions. OWL DL can be viewed as an expressive Description Logic, and an OWL DL ontology is the equivalent of a Description Logic knowledge base. We used two kinds of expressions to specify, in a formal manner, the information provided to the system: expressions with the Existential and the Universal quantifier.

1. (∃) Expressions with the Existential quantifier, which specify, for a set of individuals, the existence of a (at least one) relationship along a given property to an individual that is a member of a specific class.

2. (∀) Expressions with the Universal quantifier, which constrain the relationships along a given property to individuals that are members of a specific class (in other terms, it is used to state that all the individuals of the class X have a certain property Y).

A simple example of a double translation of a natural-language statement about the energy domain, first into formal logic and then into an OWL DL construction, is the following:

Verbal proposition: "Some fossil fuels cause some environmental consequences or some risks for the energy domain."

First-order predicate logic: ∃x (Fx ∧ (Cex ∨ Crx)), where F stands for "is a fossil fuel", Ce for "causes environmental consequences" and Cr for "causes risks".

Protégé OWL DL construction: Fossil Fuels cause some (Environmental Consequences or Risks); see figure 3.2 for an example.
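The same statement can also be written as a description-logic subsumption axiom, which is how the Protégé restriction above reads to a DL reasoner (the class and property names here are illustrative renderings, not necessarily the exact identifiers used in the ontology):

```latex
\mathit{FossilFuel} \sqsubseteq
  \exists\,\mathit{cause}.\,(\mathit{EnvironmentalConsequence} \sqcup \mathit{Risk})
```

The existential restriction on the right-hand side corresponds to Protégé's "some" construct, and the disjunction to its "or".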

Figure 3.2: A logical expression inserted into the Energy Ontology.

Another reason that guided our choice of the OWL DL language is the fact that many reasoner implementations exist for it, and that was important for evaluating the consistency of the ontology. The OWL Plugin of Protégé, in fact, provides direct access to DL reasoners such as RacerPro (which will be discussed in the next paragraph). The current user interface supports two types of DL reasoning: consistency checking and classification (subsumption). Consistency checking (i.e., the test of whether a class could have instances) can be invoked either for all classes with a single mouse click, or for selected classes only. Inconsistent classes are marked with a red-bordered icon. Classification (i.e., inferring a new subsumption tree from the asserted definitions) can be invoked with the classify button on a one-shot basis. When the classify button is pressed, the system determines the OWL species, because some reasoners are unable to handle ontologies in OWL Full. If the ontology is in OWL Full (e.g., because metaclasses are used), the system attempts to convert the ontology temporarily into OWL DL. The OWL Plugin supports editing some features of OWL Full (e.g., assigning ranges to annotation properties, and creating metaclasses); these are easily detected and can be removed before the data are sent to the classifier. Once the ontology has been converted into OWL DL, a full consistency check is performed, because inconsistent classes cannot be classified correctly. Finally, the classification results are stored until the next invocation of the classifier, and can be browsed separately. Classification can be invoked either for the whole ontology, or for selected sub-trees only. In the latter case, the transitive closure of all accessible classes is sent to the classifier. This may return an incomplete classification because it does not take incoming edges into account, but in many cases it provides a reasonable approximation without having to process the whole ontology. OWL files store only the subsumptions that have been asserted by the user. However, experience has shown that, in order to edit and correct their ontologies, users need to distinguish between what they have asserted and what the classifier has inferred. Many users may find it more natural to navigate the inferred hierarchy, because it displays the semantically correct position of all the classes. The OWL Plugin addresses this need by displaying both hierarchies and by making available extensive information on the inferences made during classification. After classification, the OWL Plugin displays the inferred classification hierarchy beside the original asserted hierarchy. The classes that have changed their superclasses are highlighted in blue, and moving the mouse over them explains the changes.

Furthermore, a complete list of all the changes suggested by the classifier is shown in the upper right area, similar to a list of compiler messages. A click on an entry navigates to the affected class. Also, the conditions widget can be switched between asserted and inferred conditions. All this allows users to analyze the changes quickly. The correctness of the manual ontology has been evaluated using the first of the tasks mentioned above: consistency checking.

RacerPro

RacerPro stands for Renamed ABox and Concept Expression Reasoner Professional. Its origins are in the area of description logics. Since description logics provide the foundation of the international efforts to standardize ontology languages in the context of the Semantic Web, RacerPro is also used as a system for managing Semantic Web ontologies based on OWL, and it can be used as a reasoning engine for ontology editors such as Protégé. An important aspect of this system is its ability to process OWL documents. The following services are provided for OWL ontologies and RDF data descriptions:

- checking the consistency of an OWL ontology and a set of data descriptions;
- finding implicit subclass relationships induced by the declarations in the ontology;
- finding synonyms for resources (either classes or instance names);
- an OWL-QL query processing system, available as an open-source project for RacerPro, since extensional information from OWL documents (OWL instances and their interrelationships) needs to be queried by client applications;
- an HTTP client for retrieving imported resources from the web (multiple resources can be imported into one ontology);
- incremental query answering for information retrieval tasks (retrieving the next n results of a query).

In addition, RacerPro supports the adaptive use of computational resources: answers which require few computational resources are delivered first, and user applications can decide whether computing all the answers is worth the effort.

In order to evaluate the manually built ontology, we first connected this tool to the OWL ontology (the process is shown in figure 3.3) and then ran it to check the ontology's consistency.

Figure 3.3: Ontology connection with the RacerPro reasoner.

The ontology evaluation was carried out throughout the whole ontology building process, up to the last update of the ontology. This allowed us to immediately correct the logical inconsistencies found along the way. During the last evaluation no logical inconsistencies or taxonomy errors were found, which means that the manually built ontology has an internal logical coherence.

3.2 Manual approach: Energy Ontology Description

This paragraph describes the classes of the manual energy ontology. Some of them proved to be more meaningful than others because of their high degree of interconnection with other classes and instances. The full data of the manually built ontology are given in table 3.3.

Table 3.3: Energy Ontology data.

Documents
Classes
Instances
Properties (inverse functions included)

3.2.1 Energy Domain

For the energy ontology we adopted a super-class named Energy_Domain composed of 6 subclasses: Country, EnergySecurity, EnergySources, Infrastructures, Market and EnvironmentalConsequences (see figures 3.4 and 3.5 below).

Figure 3.4: Energy Domain subclasses in the Classes view tab of Protégé.

Figure 3.5: A visualization of the Energy Domain subclasses with the Ontoviz plugin.

At the same level as the Energy Domain super-class, two transverse classes, Risks and Solutions, have been inserted (figure 3.6). This choice was made because these two categories appeared in many of the classes of the ontology, so it was not possible to clearly decide in which classes to include them and in which not. With this strategy, instead, it was possible to create many horizontal relations between these classes and the other classes of the ontology. For the concept Risks we considered as subclasses different types of issues, such as economic problems (e.g. market instabilities, market cartels, growth in energy prices, etc.), geopolitical problems (e.g. natural gas, oil and nuclear disputes) and technical problems. For Solutions, we mainly considered as subclasses the government actions and policies used to avoid energy problems, and we divided this class into 3 subclasses: long-term, medium-term and short-term policies.

Figure 3.6: Energy Ontology superclasses (OWL Thing with the subclasses Energy Domain, Risks and Solutions).

3.2.2 Energy Sources

We divided this class into Primary, Secondary and Nuclear sources, following the most common distinction made in the literature on these issues. Primary sources do not need to undergo any conversion or transformation process in order to be used. Secondary energy sources, instead, have to be transformed from one form into another before use. Each of these two classes has further subclasses given by the distinction between renewable and non-renewable sources. For the Nuclear class, the subclasses Nuclear Energy Sources and Nuclear Weapons Proliferation have been identified. This last class allowed us to create horizontal links with the class Geopolitical Problems, which is a subclass of the superclass Risks.

Figure 3.7: Energy Sources classification (Primary, Secondary and Nuclear sources; Primary and Secondary sources are each divided into Renewable and Non-renewable, while Nuclear is divided into Nuclear Energy Sources and Nuclear Weapons Proliferation).

3.2.3 The Case of Hydrogen: Renewable or Non-renewable?

One of the main problems encountered during the ontology building task was the difficulty of classifying some entities as belonging to one class or another. A meaningful example of this type of classification problem is what we have called "the case of hydrogen". The problem was the following: the different sources we consulted classified hydrogen in different ways, because hydrogen can be both renewable and non-renewable, depending on the source from which it is extracted. So one concept of the ontology was a member, at the same time, of one class and of its opposite, and of course that is not allowed. In order to overcome this problem we created different hydrogen entities: one was hydrogen from renewable sources (solar, wind, etc.), and the other was hydrogen from non-renewable sources (hydrocarbons, etc.). Figure 3.8 shows the way in which we solved the problem.
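In description-logic terms the problem and its solution can be sketched as follows (the class names are illustrative): since Renewable and Non-Renewable are disjoint, a single Hydrogen class cannot be subsumed by both, whereas splitting it yields two separately consistent classes:

```latex
\mathit{Renewable} \sqcap \mathit{NonRenewable} \sqsubseteq \bot \\
\mathit{Hydrogen_{renewable}} \sqsubseteq \mathit{Renewable}, \qquad
\mathit{Hydrogen_{non\text{-}renewable}} \sqsubseteq \mathit{NonRenewable}
```

A reasoner such as RacerPro would flag the single-class version as inconsistent during consistency checking, which is exactly the kind of error the evaluation described above is meant to catch.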

Figure 3.8: The classification of hydrogen (Hydrogen (water, solar) under Renewable Sources; Hydrogen (hydrocarbon) under Non-Renewable Sources).

3.2.4 Country

The Country class has 2 subclasses, covering OPEC members and non-OPEC members. It is an important class because it allows horizontal relationships to be created with the Energy Sources class or with the economic and environmental policy aspects. This class can be modified and modelled in different ways according to different classification criteria. For example, if we want to focus on the geopolitical problems (which, of course, involve countries) related to the natural gas energy source, a different classification can be proposed by a domain expert and the number of subclasses can easily be enlarged.

3.2.5 Energy Security Class

Energy Security is the class of the ontology on which we mainly focused during the last part of the manual ontology building, following the suggestions of the energy domain expert Dr. Brenda Schaffer. This issue has risen to the top of the agenda among policy makers, international organizations and business because in the last decade there has been a sustained growth in the demand for energy that raises serious concerns about the long-term availability of reliable and affordable supplies.

Energy security has broad economic, political, and societal consequences. A lack of energy security can exacerbate geopolitical tensions and impede development. Because of the interconnection of this concept with many other aspects (geopolitical and economic problems, technical aspects, etc.), we spent considerable time and effort on the classification of this class, since there is no clear definition of it. It is a sort of multifaceted concept because it concerns different issues. So we created for this concept an infrastructure able to link it with other concepts within the ontology; the relations created allowed this concept to be connected with up to 3 other classes. The main direct subclasses of Energy Security are presented in figure 3.9.

Figure 3.9: Energy Security direct subclasses (Reliability of Supply, Scarce Resources Dependency, Resources Affordability, Friendliness to Environment).

Figure 3.10 exemplifies the way we created horizontal links with the other classes of the ontology. For example, we linked the concept Energy Security with the concept Geopolitical Problems, which is a sub-concept of the superclass Risks. Moreover, by creating these links, we were able to connect three classes of the ontology: Energy Security (Reliability of Supply is its subclass), Country and Geopolitical Problems. The created infrastructure, indeed, allowed us to provide information of this kind: "Cut off in supply for USA" (which is an energy security subclass) "is caused by" the "USA vs Venezuela dispute" (which is a subclass of Geopolitical Problems), and this fact "involves" USA and Venezuela (which are instances of the class Country). In the figure, the arrows represent the horizontal links (the position of the classes in the figure does not show their real position within the ontology: e.g. Risks and Energy Security are not at the same level). USA and Venezuela are two instances and are shown in blue; the classes are in yellow. The property which links "Cut off in supply for USA" and the subclass "Venezuela vs USA dispute" is "is caused by" (there is also the inverse property, "cause"), and it is represented by the black arrow in the figure. The property which links the instances USA and Venezuela with the classes in the figure (red arrows) is "involve" (with its inverse, "is involved in"), and an example of the information provided by these links is the following: "Venezuela and USA are involved in a dispute between them that caused a cut off in supply for the USA".

Figure 3.10: Ontology connection between Energy Security and Geopolitical Problems.

Another example of a horizontal connection is shown in figure 3.11. The link is between an Energy Security subclass (Scarce Resources Dependency) and Economic Problems such as market cartels, growth of energy prices, and market instabilities (Economic Problems is one of the subclasses of the superclass Risks). The horizontal links are represented in the figure by the bidirectional arrows, which indicate the property "cause" (and its inverse, "is caused by"). So, in this case, the created links allowed us to provide, for example, this kind of information within the ontology: "Scarce resources dependency causes growth of energy prices", and its inverse, "Growth of energy prices is caused by scarce resources dependency". The same holds for the other classes linked in the figure.

Figure 3.11: Energy Security and Economic Problems.

3.2.6 Infrastructures

This class represents the infrastructures used for energy transformation, extraction and transportation. For this class, 10 instances have been identified: Oil Tanker, Pipelines, Turbine, Drilling Equipment, Barge, Heliostats, Methane Pipelines, Ship, Train, Refinery.

3.2.7 Environmental Consequences

Environmental consequences are a very important issue in the energy domain. Within the energy ontology we created this class, identifying 14 instances related to this issue: climate change, pollution, deforestation, desertification, global warming, higher global temperatures, bird flight patterns, CO2 emissions, damage to views, flooding, droughts, increased rains, greenhouse gas emissions, impacts on weather, urbanization. Many of these instances are highly correlated with one another. For example, climate change refers to any significant change in measures of climate (such as temperature, precipitation, or wind) lasting for an extended period (decades or longer), and it may result from:

1. natural factors, such as changes in the sun's intensity or slow changes in the Earth's orbit around the sun;

2. natural processes within the climate system (e.g. changes in ocean circulation);

3. human activities that change the atmosphere's composition (e.g. through burning fossil fuels) and the land surface (e.g. deforestation, reforestation, urbanization, desertification, etc.).

This last possibility makes it possible to link the instances desertification, deforestation, urbanization and climate change through information such as: "Deforestation is one of the causes of climate change". Even this information is provided by means of a horizontal link between instances belonging to the same class (the link is horizontal in this case too, because the instances are all at the same level of the hierarchy).

3.2.8 Energy Use

For the class Energy Use we followed the major distinction made in the literature on this matter, identifying 5 subclasses, one for each sector in which energy is consumed: Commercial, Electric, Industrial, Transportation, Residential. For these subclasses we created some horizontal links concerning the energy sources mainly utilized for the different uses (linking, in this way, the Energy Sources concepts with the Energy Use subclasses mentioned above). For example, residential use mainly consumes non-renewable sources, and the information provided through the horizontal link is the following: "Residential Use mainly makes use of Non-Renewable Sources" (the property identified for this link is "mainly use" and its inverse function is "mainly used").

3.3 A Semi-automatic approach for ontology building

There are pros and cons to manual ontology construction. The pros are that human expertise and background knowledge of a specific domain direct the definition of the concept taxonomy and the creation of relevant properties; moreover, it enables the insertion of meaningful information into the ontological system. On the other hand, one of the major disadvantages of manual ontology construction is the effort required from the ontology designers. Indeed, for an ontology engineer, constructing a complex knowledge representation system such as an ontology manually, without any help, means spending a lot of time in: reading documents, extracting the most relevant keywords from each document, inserting the keywords into the ontology in a proper manner (without creating confusion between class-keywords and instance-keywords), inferring relations between terms (represented as classes or as instances), etc. Furthermore, manual ontology building implies that the ontology builder is a specialist in ontology construction or, at least, a domain expert with ontology engineering skills who is familiar with the relevant ontology editors (e.g. Protégé, OntoStudio, etc.). The question is: how many domain specialists, in the various domains, also know something about ontologies?

In order to overcome these problems, different systems able to build ontologies in an automatic or semi-automatic way have been developed in recent years. They may start from dictionaries, knowledge bases, semi-structured schemata, relational schemata or unstructured textual documents (see Chapter 2 for a rapid overview). One of these, briefly described in the previous chapter, is OntoGen. It has been developed by Blaz Fortuna, Marko Grobelnik and Dunja Mladenic of the Jožef Stefan Institute, Ljubljana (Fortuna et al. 2007). It is a semi-automatic, data-driven topic ontology(9) editor. It integrates machine learning and text mining algorithms to reduce both the time spent by the user on ontology building and the complexity of the task itself (Fortuna et al. 2006). It is semi-automatic because it performs some tasks automatically (e.g. suggesting concepts and concept names, assigning instances to concepts, etc.) while, at the same time, keeping the user in full control of the system: he/she can accept, adjust (even manually) or reject the modifications that are made automatically. It is data-driven because the system is guided by the data provided by the user during the initial phase of the ontology construction. The data provided by the user is a corpus of documents given as input to the system in order to generate the ontology; for that reason, the documents reflect the domain knowledge for which the user is building the ontology.

The documents are represented in a bag-of-words (BOW) representation, in which each document is encoded as a vector of term frequencies. The terms are weighted with the TFxIDF measure, and the similarity between two documents is calculated with the cosine similarity measure, which is defined by the cosine of the angle between the two bag-of-words vectors. TFxIDF is a quite common linguistic frequency measure: it combines the frequency of a term within a document (TF) with the frequency of that term in the corpus (hence integrating the relative importance of the term within a document with how well the term represents the document in the corpus).

OntoGen's work is based on the automatic extraction of keywords from the documents given as input by the user for the ontology generation. There are two keyword extraction methods applied by the system. The first, shown under the label "keywords" in the system's interface (see figure 3.12), uses centroid vectors. The centroid is the sum of all the vectors of the documents inside the topic, and the keywords selected are those with the highest weights in the centroid vector.

Figure 3.12: Keyword extraction with centroid vectors.

The set of keywords extracted with this method is composed of the most descriptive words of the concept's documents (or instances). The second method, shown under the label "SVM keywords" (see figure 3.13), uses a Support Vector Machine classifier. This method is used to extract keywords describing a selected concept. The classifier is trained as follows: consider a topic A to be described with keywords. All the documents from the concepts that have A as a subtopic are marked as negative, while the documents under A are marked as positive. The linear SVM classifier is then trained on these documents and used to classify the centroid of topic A. With this method, the set of extracted keywords is composed of the most distinctive words of the selected concept with regard to its sibling concepts in the hierarchy (Fortuna et al. 2005).

(9) A topic ontology is a set of topics (or concepts) connected by different types of relations. Each topic includes a set of related documents.
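The representation and the centroid method can be sketched in a few lines (this is an illustration of the idea, not OntoGen's actual code; the toy corpus is invented, and the SVM variant is omitted):

```python
import math
from collections import Counter

# TF-IDF bag-of-words vectors, cosine similarity between two documents,
# and centroid-based keyword selection for a topic.

docs = [
    "oil supply and energy security",
    "natural gas supply disputes",
    "solar and wind renewable energy",
]

def tfidf_vectors(texts):
    """Encode each text as a dict mapping term -> TF * IDF weight."""
    tokenized = [t.split() for t in texts]
    n = len(tokenized)
    df = Counter(w for toks in tokenized for w in set(toks))
    return [
        {w: tf * math.log(n / df[w]) for w, tf in Counter(toks).items()}
        for toks in tokenized
    ]

def cosine(u, v):
    """Cosine of the angle between two bag-of-words vectors."""
    dot = sum(wt * v.get(w, 0.0) for w, wt in u.items())
    nu = math.sqrt(sum(wt * wt for wt in u.values()))
    nv = math.sqrt(sum(wt * wt for wt in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid_keywords(vectors, k):
    """Sum a topic's vectors and return the k highest-weighted terms."""
    centroid = Counter()
    for vec in vectors:
        centroid.update(vec)
    return [w for w, _ in centroid.most_common(k)]

vecs = tfidf_vectors(docs)
```

Note how words that occur in many documents (such as "energy" here) get a low IDF and therefore drop out of the top centroid keywords, which is exactly why the centroid method favours the most descriptive words of a topic.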

Figure 3.13: Keyword extraction with the Support Vector Machine classifier.

The keywords produced by one of the above methods are used to suggest possible topics, subtopics or topic names during the ontology building. Suggestion generation is one of the main features of OntoGen, and it can be provided in two different ways: with unsupervised or with supervised methods. In the unsupervised approach, the system provides concept and sub-concept suggestions using keywords extracted from the documents of the selected topic with Latent Semantic Indexing (LSI) or the k-means algorithm(10). The main advantage of the unsupervised method is that it requires very little input from the user (only the number of clusters must be indicated). With the supervised method, instead, the concept suggestion is driven by the user, in the sense that this method assumes the user has an initial idea of what a concept or sub-concept of the ontology should be. Having this background knowledge, he/she can enter a query and use it to train the system on relevant documents, as explained below (see figure 3.14).

Figure 3.14: An example of a user query with the supervised method for suggestion generation in OntoGen.

Once the user enters his/her query, in the form of keywords or keyphrases, the system starts asking questions about whether a particular document belongs to the selected concept, and the user can select the Yes or No buttons as answers (see figure 3.15).

(10) The algorithm is chosen by the user.
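The unsupervised step can be illustrated with a toy k-means clustering (again a sketch, not OntoGen's code: plain 2-D points stand in for bag-of-words vectors to keep it short, and the deterministic seeding replaces the random initialization of real k-means):

```python
# Cluster document vectors with k-means and treat each resulting cluster
# as a candidate sub-concept; the user supplies only k, as in OntoGen.

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(pts):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(xs) / len(pts) for xs in zip(*pts))

def kmeans(points, k, iters=10):
    # Simple deterministic seeding: pick k points spread across the list.
    centers = [points[i * len(points) // k] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: dist2(p, centers[i]))].append(p)
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return clusters

# Two well-separated "topics" in the data.
points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
suggested = kmeans(points, 2)  # two candidate sub-concepts
```

Each cluster would then be named via its top centroid keywords and offered to the user as a suggested sub-concept.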

Figure 3.15: The training phase in OntoGen, using the supervised method for concept suggestion.

In other words, after the user's query, the system starts an active learning process based on the feedback provided by the user during a training phase. In this phase the user provides information to the system by answering, with positive or negative feedback (the Yes and No buttons), the question "Does this document belong to the concept?". The system, for its part, uses a machine learning technique for the semi-automatic acquisition of the user's knowledge: it refines the suggested concept after each reply of the user and, once the user is satisfied with the suggestions, the concept is constructed and added to the ontology as a sub-concept of a selected concept.
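The query-then-feedback loop can be sketched as follows (hedged: OntoGen trains an SVM on the answers, while here a simple keyword-overlap score stands in for it, and the `oracle` function plays the user pressing the Yes/No buttons; the documents are invented):

```python
def jaccard(a, b):
    """Word-overlap similarity between two strings."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def refine_concept(query, documents, oracle, rounds=3):
    """Grow a concept seeded by `query`; `oracle(doc)` returns True/False."""
    profile = query            # the concept's current textual profile
    members = []
    pool = list(documents)
    for _ in range(min(rounds, len(pool))):
        # Ask about the document most similar to the current profile.
        candidate = max(pool, key=lambda d: jaccard(profile, d))
        pool.remove(candidate)
        if oracle(candidate):              # the user answered Yes
            members.append(candidate)
            profile += " " + candidate     # positive feedback refines the profile
    return members

docs = [
    "energy security of supply",
    "pasta recipe with tomato",
    "energy price growth",
]
# A simulated user who accepts anything mentioning "energy".
concept = refine_concept("energy security", docs, lambda d: "energy" in d)
```

Each Yes answer enlarges the concept's profile, so later questions target documents closer to what the user has already confirmed; this is the refinement-per-reply behaviour described above.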

3.3.1 Semi-automatic Ontology Generation

OntoGen has been used for the semi-automatic generation of a part of the energy domain ontology. The part chosen is the energy security class, which is a sort of micro-ontology within the general energy domain (it consists of 14 classes in total). The semi-automatic ontology for the pilot study experiments was created from 25 selected domain documents about energy security. The system was able to build an ontology of 10 concepts. A screenshot of the generated concepts is presented in figure 3.16, while figure 3.17 shows the semi-automatically generated ontology exported in OWL format within Protégé.

Figure 3.16: A screenshot of the semi-automatically generated concepts.

Figure 3.17: The semi-automatically generated ontology exported in OWL.


More information

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES Mu. Annalakshmi Research Scholar, Department of Computer Science, Alagappa University, Karaikudi. annalakshmi_mu@yahoo.co.in Dr. A.

More information

A MODEL-DRIVEN APPROACH OF ONTOLOGICAL COMPONENTS FOR ON- LINE SEMANTIC WEB INFORMATION RETRIEVAL

A MODEL-DRIVEN APPROACH OF ONTOLOGICAL COMPONENTS FOR ON- LINE SEMANTIC WEB INFORMATION RETRIEVAL Journal of Web Engineering, Vol. 6, No.4 (2007) 303-329 Rinton Press A MODEL-DRIVEN APPROACH OF ONTOLOGICAL COMPONENTS FOR ON- LINE SEMANTIC WEB INFORMATION RETRIEVAL HAJER BAAZAOUI ZGHAL 1, MARIE-AUDE

More information

Natural Language Processing with PoolParty

Natural Language Processing with PoolParty Natural Language Processing with PoolParty Table of Content Introduction to PoolParty 2 Resolving Language Problems 4 Key Features 5 Entity Extraction and Term Extraction 5 Shadow Concepts 6 Word Sense

More information

Semantic Web Company. PoolParty - Server. PoolParty - Technical White Paper.

Semantic Web Company. PoolParty - Server. PoolParty - Technical White Paper. Semantic Web Company PoolParty - Server PoolParty - Technical White Paper http://www.poolparty.biz Table of Contents Introduction... 3 PoolParty Technical Overview... 3 PoolParty Components Overview...

More information

Ontology Extraction from Heterogeneous Documents

Ontology Extraction from Heterogeneous Documents Vol.3, Issue.2, March-April. 2013 pp-985-989 ISSN: 2249-6645 Ontology Extraction from Heterogeneous Documents Kirankumar Kataraki, 1 Sumana M 2 1 IV sem M.Tech/ Department of Information Science & Engg

More information

A hybrid method to categorize HTML documents

A hybrid method to categorize HTML documents Data Mining VI 331 A hybrid method to categorize HTML documents M. Khordad, M. Shamsfard & F. Kazemeyni Electrical & Computer Engineering Department, Shahid Beheshti University, Iran Abstract In this paper

More information

Knowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A.

Knowledge Retrieval. Franz J. Kurfess. Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A. Knowledge Retrieval Franz J. Kurfess Computer Science Department California Polytechnic State University San Luis Obispo, CA, U.S.A. 1 Acknowledgements This lecture series has been sponsored by the European

More information

INTERCONNECTING AND MANAGING MULTILINGUAL LEXICAL LINKED DATA. Ernesto William De Luca

INTERCONNECTING AND MANAGING MULTILINGUAL LEXICAL LINKED DATA. Ernesto William De Luca INTERCONNECTING AND MANAGING MULTILINGUAL LEXICAL LINKED DATA Ernesto William De Luca Overview 2 Motivation EuroWordNet RDF/OWL EuroWordNet RDF/OWL LexiRes Tool Conclusions Overview 3 Motivation EuroWordNet

More information

Chapter 27 Introduction to Information Retrieval and Web Search

Chapter 27 Introduction to Information Retrieval and Web Search Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval

More information

Opus: University of Bath Online Publication Store

Opus: University of Bath Online Publication Store Patel, M. (2004) Semantic Interoperability in Digital Library Systems. In: WP5 Forum Workshop: Semantic Interoperability in Digital Library Systems, DELOS Network of Excellence in Digital Libraries, 2004-09-16-2004-09-16,

More information

SEMANTIC WEB POWERED PORTAL INFRASTRUCTURE

SEMANTIC WEB POWERED PORTAL INFRASTRUCTURE SEMANTIC WEB POWERED PORTAL INFRASTRUCTURE YING DING 1 Digital Enterprise Research Institute Leopold-Franzens Universität Innsbruck Austria DIETER FENSEL Digital Enterprise Research Institute National

More information

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 1 Department of Electronics & Comp. Sc, RTMNU, Nagpur, India 2 Department of Computer Science, Hislop College, Nagpur,

More information

Text Mining for Software Engineering

Text Mining for Software Engineering Text Mining for Software Engineering Faculty of Informatics Institute for Program Structures and Data Organization (IPD) Universität Karlsruhe (TH), Germany Department of Computer Science and Software

More information

Schema Quality Improving Tasks in the Schema Integration Process

Schema Quality Improving Tasks in the Schema Integration Process 468 Schema Quality Improving Tasks in the Schema Integration Process Peter Bellström Information Systems Karlstad University Karlstad, Sweden e-mail: peter.bellstrom@kau.se Christian Kop Institute for

More information

Extracting knowledge from Ontology using Jena for Semantic Web

Extracting knowledge from Ontology using Jena for Semantic Web Extracting knowledge from Ontology using Jena for Semantic Web Ayesha Ameen I.T Department Deccan College of Engineering and Technology Hyderabad A.P, India ameenayesha@gmail.com Khaleel Ur Rahman Khan

More information

Domain-specific Concept-based Information Retrieval System

Domain-specific Concept-based Information Retrieval System Domain-specific Concept-based Information Retrieval System L. Shen 1, Y. K. Lim 1, H. T. Loh 2 1 Design Technology Institute Ltd, National University of Singapore, Singapore 2 Department of Mechanical

More information

Domain Independent Knowledge Base Population From Structured and Unstructured Data Sources

Domain Independent Knowledge Base Population From Structured and Unstructured Data Sources Domain Independent Knowledge Base Population From Structured and Unstructured Data Sources Michelle Gregory, Liam McGrath, Eric Bell, Kelly O Hara, and Kelly Domico Pacific Northwest National Laboratory

More information

3 Classifications of ontology matching techniques

3 Classifications of ontology matching techniques 3 Classifications of ontology matching techniques Having defined what the matching problem is, we attempt at classifying the techniques that can be used for solving this problem. The major contributions

More information

Semi-automatic creation of domain ontologies with centroid based crawlers. Carel Fenijn

Semi-automatic creation of domain ontologies with centroid based crawlers. Carel Fenijn Semi-automatic creation of domain ontologies with centroid based crawlers Carel Fenijn Graduate Thesis Doctoraal Linguistics Utrecht University, December 2007 Contents...................................

More information

Semantic-Based Web Mining Under the Framework of Agent

Semantic-Based Web Mining Under the Framework of Agent Semantic-Based Web Mining Under the Framework of Agent Usha Venna K Syama Sundara Rao Abstract To make automatic service discovery possible, we need to add semantics to the Web service. A semantic-based

More information

structure of the presentation Frame Semantics knowledge-representation in larger-scale structures the concept of frame

structure of the presentation Frame Semantics knowledge-representation in larger-scale structures the concept of frame structure of the presentation Frame Semantics semantic characterisation of situations or states of affairs 1. introduction (partially taken from a presentation of Markus Egg): i. what is a frame supposed

More information

Ontology Matching with CIDER: Evaluation Report for the OAEI 2008

Ontology Matching with CIDER: Evaluation Report for the OAEI 2008 Ontology Matching with CIDER: Evaluation Report for the OAEI 2008 Jorge Gracia, Eduardo Mena IIS Department, University of Zaragoza, Spain {jogracia,emena}@unizar.es Abstract. Ontology matching, the task

More information

Ontology Population and Enrichment: State of the Art

Ontology Population and Enrichment: State of the Art Ontology Population and Enrichment: State of the Art Georgios Petasis, Vangelis Karkaletsis, Georgios Paliouras, Anastasia Krithara, and Elias Zavitsanos Institute of Informatics and Telecommunications,

More information

Conceptual document indexing using a large scale semantic dictionary providing a concept hierarchy

Conceptual document indexing using a large scale semantic dictionary providing a concept hierarchy Conceptual document indexing using a large scale semantic dictionary providing a concept hierarchy Martin Rajman, Pierre Andrews, María del Mar Pérez Almenta, and Florian Seydoux Artificial Intelligence

More information

A Tool for Storing OWL Using Database Technology

A Tool for Storing OWL Using Database Technology A Tool for Storing OWL Using Database Technology Maria del Mar Roldan-Garcia and Jose F. Aldana-Montes University of Malaga, Computer Languages and Computing Science Department Malaga 29071, Spain, (mmar,jfam)@lcc.uma.es,

More information

THE KLUWER INTERNATIONAL SERIES IN ENGINEERING AND COMPUTER SCIENCE

THE KLUWER INTERNATIONAL SERIES IN ENGINEERING AND COMPUTER SCIENCE THE KLUWER INTERNATIONAL SERIES IN ENGINEERING AND COMPUTER SCIENCE ONTOLOGY LEARNING FOR THE SEMANTIC WEB ONTOLOGY LEARNING FOR THE SEMANTIC WEB by Alexander Maedche University of Karlsruhe, Germany SPRINGER

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 4, Jul-Aug 2015

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 4, Jul-Aug 2015 RESEARCH ARTICLE OPEN ACCESS Multi-Lingual Ontology Server (MOS) For Discovering Web Services Abdelrahman Abbas Ibrahim [1], Dr. Nael Salman [2] Department of Software Engineering [1] Sudan University

More information

3 Background technologies 3.1 OntoGen The two main characteristics of the OntoGen system [1,2,6] are the following.

3 Background technologies 3.1 OntoGen The two main characteristics of the OntoGen system [1,2,6] are the following. ADVANCING TOPIC ONTOLOGY LEARNING THROUGH TERM EXTRACTION Blaž Fortuna (1), Nada Lavrač (1, 2), Paola Velardi (3) (1) Jožef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia (2) University of Nova

More information

A DOMAIN INDEPENDENT APPROACH FOR ONTOLOGY SEMANTIC ENRICHMENT

A DOMAIN INDEPENDENT APPROACH FOR ONTOLOGY SEMANTIC ENRICHMENT A DOMAIN INDEPENDENT APPROACH FOR ONTOLOGY SEMANTIC ENRICHMENT ABSTRACT Tahar Guerram and Nacima Mellal Departement of Mathematics and Computer Science, University Larbi Ben M hidi of Oum El Bouaghi -

More information

is easing the creation of new ontologies by promoting the reuse of existing ones and automating, as much as possible, the entire ontology

is easing the creation of new ontologies by promoting the reuse of existing ones and automating, as much as possible, the entire ontology Preface The idea of improving software quality through reuse is not new. After all, if software works and is needed, just reuse it. What is new and evolving is the idea of relative validation through testing

More information

<is web> Information Systems & Semantic Web University of Koblenz Landau, Germany

<is web> Information Systems & Semantic Web University of Koblenz Landau, Germany Information Systems & University of Koblenz Landau, Germany Semantic Search examples: Swoogle and Watson Steffen Staad credit: Tim Finin (swoogle), Mathieu d Aquin (watson) and their groups 2009-07-17

More information

Semantic Web and Electronic Information Resources Danica Radovanović

Semantic Web and Electronic Information Resources Danica Radovanović D.Radovanovic: Semantic Web and Electronic Information Resources 1, Infotheca journal 4(2003)2, p. 157-163 UDC 004.738.5:004.451.53:004.22 Semantic Web and Electronic Information Resources Danica Radovanović

More information

The Dictionary Parsing Project: Steps Toward a Lexicographer s Workstation

The Dictionary Parsing Project: Steps Toward a Lexicographer s Workstation The Dictionary Parsing Project: Steps Toward a Lexicographer s Workstation Ken Litkowski ken@clres.com http://www.clres.com http://www.clres.com/dppdemo/index.html Dictionary Parsing Project Purpose: to

More information

An Improving for Ranking Ontologies Based on the Structure and Semantics

An Improving for Ranking Ontologies Based on the Structure and Semantics An Improving for Ranking Ontologies Based on the Structure and Semantics S.Anusuya, K.Muthukumaran K.S.R College of Engineering Abstract Ontology specifies the concepts of a domain and their semantic relationships.

More information

Clustering for Ontology Evolution

Clustering for Ontology Evolution Clustering for Ontology Evolution George Tsatsaronis, Reetta Pitkänen, and Michalis Vazirgiannis Department of Informatics, Athens University of Economics and Business, 76, Patission street, Athens 104-34,

More information

Helmi Ben Hmida Hannover University, Germany

Helmi Ben Hmida Hannover University, Germany Helmi Ben Hmida Hannover University, Germany 1 Summarizing the Problem: Computers don t understand Meaning My mouse is broken. I need a new one 2 The Semantic Web Vision the idea of having data on the

More information

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. 6 What is Web Mining? p. 6 Summary of Chapters p. 8 How

More information

SYSTEMS FOR NON STRUCTURED INFORMATION MANAGEMENT

SYSTEMS FOR NON STRUCTURED INFORMATION MANAGEMENT SYSTEMS FOR NON STRUCTURED INFORMATION MANAGEMENT Prof. Dipartimento di Elettronica e Informazione Politecnico di Milano INFORMATION SEARCH AND RETRIEVAL Inf. retrieval 1 PRESENTATION SCHEMA GOALS AND

More information

SKOS. COMP62342 Sean Bechhofer

SKOS. COMP62342 Sean Bechhofer SKOS COMP62342 Sean Bechhofer sean.bechhofer@manchester.ac.uk Ontologies Metadata Resources marked-up with descriptions of their content. No good unless everyone speaks the same language; Terminologies

More information

An Evaluation of Geo-Ontology Representation Languages for Supporting Web Retrieval of Geographical Information

An Evaluation of Geo-Ontology Representation Languages for Supporting Web Retrieval of Geographical Information An Evaluation of Geo-Ontology Representation Languages for Supporting Web Retrieval of Geographical Information P. Smart, A.I. Abdelmoty and C.B. Jones School of Computer Science, Cardiff University, Cardiff,

More information

The Semantic Web & Ontologies

The Semantic Web & Ontologies The Semantic Web & Ontologies Kwenton Bellette The semantic web is an extension of the current web that will allow users to find, share and combine information more easily (Berners-Lee, 2001, p.34) This

More information

Semantic Web. Ontology Engineering and Evaluation. Morteza Amini. Sharif University of Technology Fall 93-94

Semantic Web. Ontology Engineering and Evaluation. Morteza Amini. Sharif University of Technology Fall 93-94 ه عا ی Semantic Web Ontology Engineering and Evaluation Morteza Amini Sharif University of Technology Fall 93-94 Outline Ontology Engineering Class and Class Hierarchy Ontology Evaluation 2 Outline Ontology

More information

Ontological Engineering: Methodologies

Ontological Engineering: Methodologies : Methodologies Asunción Gómez-Pérez Mariano Fernández-López Oscar Corcho Mari Carmen Suarez de Figueroa Baonza {asun, mfernandez, ocorcho,mcsuarez}@fi.upm.es Grupo de Ontologías Laboratorio de Inteligencia

More information

Information Retrieval (IR) through Semantic Web (SW): An Overview

Information Retrieval (IR) through Semantic Web (SW): An Overview Information Retrieval (IR) through Semantic Web (SW): An Overview Gagandeep Singh 1, Vishal Jain 2 1 B.Tech (CSE) VI Sem, GuruTegh Bahadur Institute of Technology, GGS Indraprastha University, Delhi 2

More information

SIG-SWO-A OWL. Semantic Web

SIG-SWO-A OWL. Semantic Web ì î SIG-SWO-A201-02 OWL ƒp Semantic Web Ý Ý ÝÛ Ú Û ÌÍÍÛ ì Web 90-: ñå Tom Gruber ~ (Ontolingua) ì (KIF) Generic Ontology CYC, WordNet, EDR PSM Task Ontology 95-97: XML as arbitrary structures 97-98: RDF

More information

Ontologies SKOS. COMP62342 Sean Bechhofer

Ontologies SKOS. COMP62342 Sean Bechhofer Ontologies SKOS COMP62342 Sean Bechhofer sean.bechhofer@manchester.ac.uk Metadata Resources marked-up with descriptions of their content. No good unless everyone speaks the same language; Terminologies

More information

CACAO PROJECT AT THE 2009 TASK

CACAO PROJECT AT THE 2009 TASK CACAO PROJECT AT THE TEL@CLEF 2009 TASK Alessio Bosca, Luca Dini Celi s.r.l. - 10131 Torino - C. Moncalieri, 21 alessio.bosca, dini@celi.it Abstract This paper presents the participation of the CACAO prototype

More information

A Linguistic Approach for Semantic Web Service Discovery

A Linguistic Approach for Semantic Web Service Discovery A Linguistic Approach for Semantic Web Service Discovery Jordy Sangers 307370js jordysangers@hotmail.com Bachelor Thesis Economics and Informatics Erasmus School of Economics Erasmus University Rotterdam

More information

clarin:el an infrastructure for documenting, sharing and processing language data

clarin:el an infrastructure for documenting, sharing and processing language data clarin:el an infrastructure for documenting, sharing and processing language data Stelios Piperidis, Penny Labropoulou, Maria Gavrilidou (Athena RC / ILSP) the problem 19/9/2015 ICGL12, FU-Berlin 2 use

More information

A framework for retrieving conceptual knowledge from Web pages

A framework for retrieving conceptual knowledge from Web pages A framework for retrieving conceptual knowledge from Web pages Nacéra Bennacer, Lobna Karoui Ecole Supérieure d Electricité (Supélec), Plateau de Moulon 3 rue Joliot Curie, 91192 Gif-sur-Yvette cedex,

More information

Data-Mining Algorithms with Semantic Knowledge

Data-Mining Algorithms with Semantic Knowledge Data-Mining Algorithms with Semantic Knowledge Ontology-based information extraction Carlos Vicient Monllaó Universitat Rovira i Virgili December, 14th 2010. Poznan A Project funded by the Ministerio de

More information

Ontology Research Group Overview

Ontology Research Group Overview Ontology Research Group Overview ORG Dr. Valerie Cross Sriram Ramakrishnan Ramanathan Somasundaram En Yu Yi Sun Miami University OCWIC 2007 February 17, Deer Creek Resort OCWIC 2007 1 Outline Motivation

More information

Collaborative Ontology Construction using Template-based Wiki for Semantic Web Applications

Collaborative Ontology Construction using Template-based Wiki for Semantic Web Applications 2009 International Conference on Computer Engineering and Technology Collaborative Ontology Construction using Template-based Wiki for Semantic Web Applications Sung-Kooc Lim Information and Communications

More information

RPI INSIDE DEEPQA INTRODUCTION QUESTION ANALYSIS 11/26/2013. Watson is. IBM Watson. Inside Watson RPI WATSON RPI WATSON ??? ??? ???

RPI INSIDE DEEPQA INTRODUCTION QUESTION ANALYSIS 11/26/2013. Watson is. IBM Watson. Inside Watson RPI WATSON RPI WATSON ??? ??? ??? @ INSIDE DEEPQA Managing complex unstructured data with UIMA Simon Ellis INTRODUCTION 22 nd November, 2013 WAT SON TECHNOLOGIES AND OPEN ARCHIT ECT URE QUEST ION ANSWERING PROFESSOR JIM HENDLER S IMON

More information

MSc Advanced Computer Science School of Computer Science The University of Manchester

MSc Advanced Computer Science School of Computer Science The University of Manchester PROGRESS REPORT Ontology-Based Technical Document Retrieval System Ruvin Yusubov Supervisor: Professor Ulrike Sattler MSc Advanced Computer Science School of Computer Science The University of Manchester

More information

Chapter 8: Enhanced ER Model

Chapter 8: Enhanced ER Model Chapter 8: Enhanced ER Model Subclasses, Superclasses, and Inheritance Specialization and Generalization Constraints and Characteristics of Specialization and Generalization Hierarchies Modeling of UNION

More information

Information Retrieval CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Information Retrieval CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science Information Retrieval CS 6900 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Information Retrieval Information Retrieval (IR) is finding material of an unstructured

More information

Web Semantic Annotation Using Data-Extraction Ontologies

Web Semantic Annotation Using Data-Extraction Ontologies Web Semantic Annotation Using Data-Extraction Ontologies A Dissertation Proposal Presented to the Department of Computer Science Brigham Young University In Partial Fulfillment of the Requirements for

More information

LexOnt: A Semi-Automatic Ontology Creation Tool for Programmable Web

LexOnt: A Semi-Automatic Ontology Creation Tool for Programmable Web AAAI Technical Report SS-12-04 Intelligent Web Services Meet Social Computing LexOnt: A Semi-Automatic Ontology Creation Tool for Programmable Web Knarig Arabshian Bell Labs, Alcatel-Lucent Murray Hill,

More information

STS Infrastructural considerations. Christian Chiarcos

STS Infrastructural considerations. Christian Chiarcos STS Infrastructural considerations Christian Chiarcos chiarcos@uni-potsdam.de Infrastructure Requirements Candidates standoff-based architecture (Stede et al. 2006, 2010) UiMA (Ferrucci and Lally 2004)

More information

Collaborative editing of knowledge resources for cross-lingual text mining

Collaborative editing of knowledge resources for cross-lingual text mining UNIVERSITÀ DI PISA Scuola di Dottorato in Ingegneria Leonardo da Vinci Corso di Dottorato di Ricerca in INGEGNERIA DELL INFORMAZIONE Tesi di Dottorato di Ricerca Collaborative editing of knowledge resources

More information

Information Extraction Techniques in Terrorism Surveillance

Information Extraction Techniques in Terrorism Surveillance Information Extraction Techniques in Terrorism Surveillance Roman Tekhov Abstract. The article gives a brief overview of what information extraction is and how it might be used for the purposes of counter-terrorism

More information

Lecture Telecooperation. D. Fensel Leopold-Franzens- Universität Innsbruck

Lecture Telecooperation. D. Fensel Leopold-Franzens- Universität Innsbruck Lecture Telecooperation D. Fensel Leopold-Franzens- Universität Innsbruck First Lecture: Introduction: Semantic Web & Ontology Introduction Semantic Web and Ontology Part I Introduction into the subject

More information

A Framework for Ontology Life Cycle Management

A Framework for Ontology Life Cycle Management A Framework for Ontology Life Cycle Management Perakath Benjamin, Nitin Kumar, Ronald Fernandes, and Biyan Li Knowledge Based Systems, Inc., College Station, TX, USA Abstract - This paper describes a method

More information

A Lightweight Approach to Semantic Tagging

A Lightweight Approach to Semantic Tagging A Lightweight Approach to Semantic Tagging Nadzeya Kiyavitskaya, Nicola Zeni, Luisa Mich, John Mylopoulus Department of Information and Communication Technologies, University of Trento Via Sommarive 14,

More information

CHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES

CHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES 188 CHAPTER 6 PROPOSED HYBRID MEDICAL IMAGE RETRIEVAL SYSTEM USING SEMANTIC AND VISUAL FEATURES 6.1 INTRODUCTION Image representation schemes designed for image retrieval systems are categorized into two

More information

Ontology Learning and Reasoning Dealing with Uncertainty and Inconsistency

Ontology Learning and Reasoning Dealing with Uncertainty and Inconsistency Ontology Learning and Reasoning Dealing with Uncertainty and Inconsistency Peter Haase, Johanna Völker Institute AIFB, University of Karlsruhe, Germany {pha,jvo}@aifb.uni-karlsruhe.de Abstract. Ontology

More information

Information Retrieval and Knowledge Organisation

Information Retrieval and Knowledge Organisation Information Retrieval and Knowledge Organisation Knut Hinkelmann Content Information Retrieval Indexing (string search and computer-linguistic aproach) Classical Information Retrieval: Boolean, vector

More information

Proposal for Implementing Linked Open Data on Libraries Catalogue

Proposal for Implementing Linked Open Data on Libraries Catalogue Submitted on: 16.07.2018 Proposal for Implementing Linked Open Data on Libraries Catalogue Esraa Elsayed Abdelaziz Computer Science, Arab Academy for Science and Technology, Alexandria, Egypt. E-mail address:

More information

SEMANTIC MATCHING APPROACHES

SEMANTIC MATCHING APPROACHES CHAPTER 4 SEMANTIC MATCHING APPROACHES 4.1 INTRODUCTION Semantic matching is a technique used in computer science to identify information which is semantically related. In order to broaden recall, a matching

More information

Knowledge and Ontological Engineering: Directions for the Semantic Web

Knowledge and Ontological Engineering: Directions for the Semantic Web Knowledge and Ontological Engineering: Directions for the Semantic Web Dana Vaughn and David J. Russomanno Department of Electrical and Computer Engineering The University of Memphis Memphis, TN 38152

More information

Knowledge Representations. How else can we represent knowledge in addition to formal logic?

Knowledge Representations. How else can we represent knowledge in addition to formal logic? Knowledge Representations How else can we represent knowledge in addition to formal logic? 1 Common Knowledge Representations Formal Logic Production Rules Semantic Nets Schemata and Frames 2 Production

More information

Semantic Clickstream Mining

Semantic Clickstream Mining Semantic Clickstream Mining Mehrdad Jalali 1, and Norwati Mustapha 2 1 Department of Software Engineering, Mashhad Branch, Islamic Azad University, Mashhad, Iran 2 Department of Computer Science, Universiti

More information

Semantic Web Test

Semantic Web Test Semantic Web Test 24.01.2017 Group 1 No. A B C D 1 X X X 2 X X 3 X X 4 X X 5 X X 6 X X X X 7 X X 8 X X 9 X X X 10 X X X 11 X 12 X X X 13 X X 14 X X 15 X X 16 X X 17 X 18 X X 19 X 20 X X 1. Which statements

More information

Ontology Development. Qing He

Ontology Development. Qing He A tutorial report for SENG 609.22 Agent Based Software Engineering Course Instructor: Dr. Behrouz H. Far Ontology Development Qing He 1 Why develop an ontology? In recent years the development of ontologies

More information

An Approach To Web Content Mining

An Approach To Web Content Mining An Approach To Web Content Mining Nita Patil, Chhaya Das, Shreya Patanakar, Kshitija Pol Department of Computer Engg. Datta Meghe College of Engineering, Airoli, Navi Mumbai Abstract-With the research

More information

The HMatch 2.0 Suite for Ontology Matchmaking

The HMatch 2.0 Suite for Ontology Matchmaking The HMatch 2.0 Suite for Ontology Matchmaking S. Castano, A. Ferrara, D. Lorusso, and S. Montanelli Università degli Studi di Milano DICo - Via Comelico, 39, 20135 Milano - Italy {castano,ferrara,lorusso,montanelli}@dico.unimi.it

More information

Ontology Research and Development Part 1 A Review of Ontology Generation

Ontology Research and Development Part 1 A Review of Ontology Generation Ontology Research and Development Part 1 A Review of Ontology Generation Ying Ding Division of Mathematics and Computer Science Vrije Universiteit, Amsterdam (ying@cs.vu.nl) Schubert Foo Division of Information

More information

Ontology Construction from Text: Challenges and Trends

Ontology Construction from Text: Challenges and Trends Ontology Construction from Text: Challenges and Trends Abeer Al-Arfaj Department of Computer Science King Saud University Riyadh, Saudi Arabia AbdulMalik Al-Salman Department of Computer Science King Saud

More information

CHAPTER 9 DESIGN ENGINEERING. Overview

CHAPTER 9 DESIGN ENGINEERING. Overview CHAPTER 9 DESIGN ENGINEERING Overview A software design is a meaningful engineering representation of some software product that is to be built. Designers must strive to acquire a repertoire of alternative

More information

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and

More information

Prof. Ahmet Süerdem Istanbul Bilgi University London School of Economics

Prof. Ahmet Süerdem Istanbul Bilgi University London School of Economics Prof. Ahmet Süerdem Istanbul Bilgi University London School of Economics Media Intelligence Business intelligence (BI) Uses data mining techniques and tools for the transformation of raw data into meaningful

More information

Semi-Automatic Conceptual Data Modeling Using Entity and Relationship Instance Repositories

Semi-Automatic Conceptual Data Modeling Using Entity and Relationship Instance Repositories Semi-Automatic Conceptual Data Modeling Using Entity and Relationship Instance Repositories Ornsiri Thonggoom, Il-Yeol Song, Yuan An The ischool at Drexel Philadelphia, PA USA Outline Long Term Research

More information

Semantic Annotation, Search and Analysis

Semantic Annotation, Search and Analysis Semantic Annotation, Search and Analysis Borislav Popov, Ontotext Ontology A machine readable conceptual model a common vocabulary for sharing information machine-interpretable definitions of concepts in

More information