Semi-automatic creation of domain ontologies with centroid based crawlers. Carel Fenijn


Semi-automatic creation of domain ontologies with centroid based crawlers
Carel Fenijn
Graduate Thesis Doctoraal Linguistics
Utrecht University, December 2007

Contents

1 Introduction
    The World Wide Web
    The Semantic Web
    From World Wide Web to Semantic Web
2 Ontology Engineering
    Ontology Definitions
    Types of Ontologies
    Classification of Ontologies
    Ontology Languages
    Ontology Design
3 Ontology Learning
    Ontology Learning Techniques
    Ontology Editors and Engineering Tools
    Ontology Learning Approaches
    Assessment of Ontology Learning Approaches
4 Information Retrieval: Focused Crawling
    Definition Focused Crawling
    Focused Crawling Techniques
    Focused Crawling Approaches
    Assessment of Focused Crawling Approaches
5 OntoSpider
    The Ontology Engineering Component of OntoSpider
    The IR Component of OntoSpider
    Assessment of OntoSpider
Conclusion and Further Research
    Some Notes on Methodology
    Future Research
Bibliography

List of Figures

1.1 Layered Stack of the Semantic Web, from http://www.w3.org/
2.1 An Ontology Scale by Lassila and McGuinness, 2001
2.2 An Ontology Scale by Daconta et al., 2003
Opening screen of Protege with the OntoLT plug-in marked in red
Tabs of OntoLT
Mapping Rule for Head Nouns and Modifiers
Above Rule in an older version of OntoLT
Simplified Possible View of OntoSpider
OntoSpider with OntoLT as the Ontology Learning Component
IR Component of OntoSpider
Rich output of OntoSpider

Abstract

Various approaches exist for the semi-automatic creation of ontologies from text. This thesis shows how centroid based focused crawlers can be used for this purpose, specifically with domain ontologies in specialized fields such as linguistics as the target. The overall approach that is proposed, called OntoSpider, combines techniques from Information Retrieval and Ontology Engineering. It is highly modular: the highly specialized output corpus of the Information Retrieval component is the input to the Ontology Engineering component, which creates the ontologies. With this approach, domain ontologies can be created for subjects like natural language morphology. Some systems that could form the Ontology Engineering part of the approach are discussed. The study examines whether the use of centroid based focused crawlers can help in the semi-automatic creation of ontologies from text. More specifically, two types of focused crawlers are compared: a general purpose centroid based focused crawler and a literature crawler.


Acknowledgements

I would like to thank Dr. Paola Monachesi at Utrecht University, who supervised this thesis with good advice and much patience, as it was often difficult to make progress on this research alongside a full-time job. Thanks also to my former boss at Demon, Jim Segrave, for allowing me to shift work hours in order to follow some courses that were relevant for this research. An interesting course in Information Retrieval by Maarten de Rijke and Valentin Jijkoun at the University of Amsterdam set me off in the direction of this research, and material from that course was used.


Chapter 1
Introduction

1.1 The World Wide Web

Information on the current Web, the World Wide Web (WWW), is stored in a decentralized way and may be available in many formats. Traditionally it is mostly embedded in the relatively loose HTML format, and increasingly in the more rigid XHTML, and it is often expressed in a natural language. Metadata is scarce and mostly takes the form of keywords and DTDs. If DTDs are available, they are often so generic that no specific semantic data can be derived from them. For agents like web crawlers, it is often difficult to extract the right information from the WWW, because agents do not understand natural languages and web pages are mostly created for human consumption only. Finding specific information on the WWW is often a very time-consuming endeavour with limited results.

Because of the overload of information on the World Wide Web, and the noise that accompanies it, much research is devoted to Information Retrieval from the World Wide Web. For search engines like Google and Yahoo!, software agents called web crawlers or spiders crawl the Web to gather and index information and make it available. Typically, such search engines try to cover a very large part of the Web, and general purpose crawlers may be used for this. In order to keep the task manageable, efficient algorithms have been developed, like the PageRank algorithm [55].

One type of crawler that has been developed is the focused crawler or topic-oriented crawler [17]. Focused crawling approaches try to deal with the enormous mass of information on the Web more efficiently than general purpose crawlers do, and offer ways to extract very specific on-topic data from it by crawling the Web selectively. This saves network traffic and processor

time, as only smaller subsets of the Web are crawled. Apart from using fewer resources, focused crawlers may also yield better results for specific domains than general purpose ones. A centroid based focused crawler is a special type of focused crawler that makes use of a centroid, which is a representation of highly on-topic information. Focused crawlers in general, and centroid based ones in particular, will be described in chapter 4.

1.2 The Semantic Web

The Semantic Web as presented in e.g. Berners-Lee et al. ([7], 2001) is a vision of Tim Berners-Lee, the inventor of the current World Wide Web (WWW) and director of the World Wide Web Consortium (W3C). In Berners-Lee's vision, the future web, the Semantic Web, will also contain metadata. This metadata will enable agents to extract much more specific information from web pages and to act intelligently on this information using logical inferences. Apart from the addition of metadata, other problems of the current Web will be addressed in the Semantic Web. One such problem is that of trust and reliability of data. In a layered model of the design of the Semantic Web, trust is even at the highest layer of the stack (see Figure 1.1), based on proof and cryptography. If intelligent agents can make logical inferences based on explicit formal metadata, they can also account for these inferences, which can be requested and verified by humans if necessary. The interpretation of metadata will be based on information that is present in ontologies. As on the current World Wide Web, information on the Semantic Web will be stored in a decentralized way. Ontologies are central to the implementation of the Semantic Web. They contain domain knowledge, that is, specific data regarding a certain subject field, in a very structured way.
Semantic Web Agents will be able to interpret information that is found in Web pages using these ontologies, as they give the agents precise information on those pages. In addition, such agents will be able to communicate with each other on the basis of ontologies, as the ontologies provide a shared understanding of a given domain. The development and maintenance of ontologies is part of Ontology Engineering. For the Semantic Web to work, a very large number of ontologies will have to be available. In the literature, this is referred to as the bootstrapping problem of the Semantic Web, which is a kind of chicken and egg problem: ontologies are a necessary prerequisite for a working Semantic Web, but as long as there is no Semantic Web yet, many people are not interested in producing ontologies. One way of overcoming this problem may be to (semi-)automatically create many ontologies from existing resources, like knowledge bases and the World Wide Web. As Ontology Engineering is such a central part of this study, it will be treated in more detail in a separate chapter.

Figure 1.1: Layered Stack of the Semantic Web, from http://www.w3.org/

1.3 From World Wide Web to Semantic Web

The (semi-)automatic creation of ontologies has received considerable attention in recent years. The main reason for this is that manual creation of ontologies is tedious and costly work. On the one hand, we know that very many ontologies will need to be made in the bootstrapping phase and later phases of the Semantic Web, and much research has been carried out in this direction. On the other hand, in the Information Retrieval community, much work has been done on improving approaches and algorithms for efficient and effective IR on the current World Wide Web. One of the approaches in this field is that of focused crawlers, and here a centroid based approach is one of the options. We will investigate how these two research areas can be combined. For domains that have already been covered by ontologies, the creation of such ontologies from scratch might be less useful, unless this is for the purpose of evaluating an Ontology Engineering approach. Enriching such existing ontologies makes more sense. The approach proposed here may be particularly useful

in case no such ontologies exist yet and repositories of specialized research papers on a subject are available. Even once the Semantic Web is in a very advanced stage, there may still be highly specialized subjects on which no ontologies are available at all, and for which data on the World Wide Web can be of use.

The reason why little other research combines the two fields of Information Retrieval and Ontology Engineering may be that the semi-automatic creation of ontologies is by itself already a very difficult research field. It has many problems that still need to be solved, like NLP problems and knowledge engineering problems, and most researchers concentrate on just that, or even on subproblems of it. For many specific purposes, it often suffices to select the set of relevant documents on which ontologies are based in a different way, for instance with clustering techniques. Also, the field of Information Retrieval has its own issues that need to be resolved, both for IR on the Web and on the Semantic Web.

Yet, a motivation for the combination of the two fields will be presented here. The World Wide Web, with all the shortcomings it has compared to the Semantic Web, does contain a huge wealth of information. As mentioned, focused crawlers can extract highly relevant and specialized information from this World Wide Web. The centroid that is used by centroid based focused crawlers and the set of downloaded pages should contain very specialized data. Usually the downloaded pages will be in a rough format, like free text embedded in HTML. As was mentioned above, ontologies also contain very specific data. Clearly, the raw specialized data that is gathered by focused crawlers is a far cry from the richly structured data that is contained in ontologies. Still, it is interesting to study the hypothesis that a combination of a focused crawler approach and an approach of semi-automatic creation of ontologies from text can be fruitful.
A study which seems to confirm this hypothesis is Ehrig ([31], 2002). His approach, which will be described in some more detail below, also combines a focused crawler with ontology creation. However, it uses ontological metadata for the enhancement of focused crawls, instead of simple vector based centroids. The study at hand will do the opposite: examine how focused crawlers may help in the semi-automatic creation of ontologies from text. Another, more recent, study that combines focused crawling with ontology learning is Su et al. ([67], 2004). They also use ontologies for improving focused crawls. As a side effect, the ontologies that are used are enriched automatically. Some more detail on their approach will also follow in chapter 3.

Note that the definition of Ontology that is adopted here does not mention the Semantic Web at all. Even though ontologies play a crucial part in the emergence of the Semantic Web, the use of ontologies is more universal than that. In research projects, company intranets, etcetera, ontologies may play an important role as well, as part of Knowledge Management.

In general, one of the first stages in Ontology Engineering processes, like the manual construction of ontologies, is the enumeration of terms that will be part of the ontology. Noy et

al. ([52], 2001) describe this as step 3, after determining the domain and scope of the ontology (step 1) and considering reusing existing ontologies (step 2). It may be interesting to see whether the set of terms in the centroid of the focused crawler might be a good starting point for this third step in manual ontology creation as well.

Now that a motivation for combining a focused crawler with the semi-automatic creation of ontologies from text has been presented, the central areas of this approach, Ontology Engineering and centroid based focused crawling, will be described in more detail in the following chapters. Chapter 2 presents the field of Ontology Engineering and describes its main concepts. Chapter 3 is on Ontology Learning. This chapter mainly presents concepts, techniques and approaches that are specific to the (semi-)automatic creation of ontologies. In the last chapter, chapter 5, the approach itself, OntoSpider, will be presented. This approach uses centroid based focused crawlers to semi-automatically create domain ontologies based on data that is available on the World Wide Web. More specifically, the results of a General Purpose Focused Crawler will be compared with those of a Literature Crawler, from an Ontology Engineering point of view. For this purpose, hypotheses will be proposed.
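
The notion of a centroid introduced in this chapter can be made concrete with a small sketch. It assumes, purely for illustration, that the centroid is the average of the term-frequency vectors of a few known on-topic documents and that a candidate page is scored by cosine similarity against it. The function names, the seed texts and the use of raw term frequencies are assumptions of this sketch, not a description of how OntoSpider is implemented.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Turn raw text into a bag-of-words term-frequency vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def build_centroid(documents):
    """Average the term vectors of known on-topic documents."""
    total = Counter()
    for doc in documents:
        total.update(term_vector(doc))
    return {term: count / len(documents) for term, count in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical seed documents for the domain of morphology.
seed_documents = [
    "inflection and derivation are morphological processes",
    "a morpheme is the smallest unit of morphology",
]
centroid = build_centroid(seed_documents)

on_topic = "derivation and inflection in natural language morphology"
off_topic = "cheap flights and hotel bookings for your holiday"

score_on = cosine(centroid, term_vector(on_topic))
score_off = cosine(centroid, term_vector(off_topic))
```

A crawler built along these lines would follow links only from pages whose score exceeds some threshold. Note that the off-topic page still receives a non-zero score here, because it shares the word "and" with the seed documents; this is one reason why stopword removal and term weighting are commonly applied before a centroid is built.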


Chapter 2
Ontology Engineering

The field of Ontology Engineering studies the theory and practice of how ontologies are designed and created. An overview of a recent state of the art in Ontology Engineering can be found in Gómez-Pérez et al. ([34], 2004). Much of this chapter is based on information from their book.

2.1 Ontology Definitions

Traditionally, Ontology is a branch of Philosophy that studies the being of things: their essence, existence, properties, nature, classification, etcetera. Ancient Greek philosophers like Parmenides and Aristotle made important contributions to this discipline, and philosophers have continued to study it up to modern times. More recently, the term Ontology has been adopted within a Knowledge Engineering setting. One very frequently cited definition is that of Gruber ([36], 1993): "An ontology is an explicit specification of a conceptualization." Clearly, this definition is rather vague, and other researchers have proposed definitions that are based on Gruber's but are more precise. Studer et al. (1998), as cited in [34], define an ontology as "a formal explicit specification of a shared conceptualization". Conceptualization refers to an abstract model of some phenomenon in the world, obtained by identifying the relevant concepts of that phenomenon. Explicit means that the types of concepts used, and the constraints on their use, are explicitly defined. Formal refers to the fact that the ontology should be machine-readable and processable. Shared reflects the notion that an ontology captures consensual knowledge, that is, knowledge that is not private to some individual but accepted by a group. For the purpose of this study, this definition will be adopted. Ontology specifications are formulated in

ontology languages, and various of these have been developed in recent years.

2.2 Types of Ontologies

In the literature, various types of ontologies have been proposed. Gómez-Pérez et al. ([34], 2004) mention Top-Level Ontologies, which mainly deal with universal abstract categories; General or Common Ontologies, which contain common sense information; Knowledge Representation Ontologies, for which a KR paradigm is characteristic; Task Ontologies, which are focused on a task or activity; Method Ontologies, which center around some method; and Application Ontologies, which are made for a specific application. The type of ontology that this study is concerned with is the Domain Ontology. Characteristic of a Domain Ontology is that a specific domain, like a scientific discipline or a specific business, is its subject, and that it therefore typically uses a more specialized vocabulary.

2.3 Classification of Ontologies

It is very common to distinguish between lightweight and heavyweight ontologies. Scales or hierarchies have been proposed that place ontologies between lightweight and heavyweight ones, which also reflects the expressiveness of the formalisms that are used for these ontologies. Such ontology scales can help in understanding the differences, commonalities and relationships between e.g. semantic networks, thesauri, taxonomies, catalogs, ontologies, relational databases, UML, logics and the Object Oriented paradigm. One classification is that of Lassila and McGuinness ([41], 2001). In their paper, they argue that the RDF formalism can be seen as a frame-based formalism, and that frame-based representation is a suitable paradigm for ontology creation.
They point out the connection between frame-based systems, object oriented programming and description logics, and argue that even catalogs, glossaries and controlled vocabularies could be seen as potential ontology specifications. The classification, which they present in the paper as An Ontology Spectrum, ranges from such catalogs and glossaries on one end to systems with general logical constraints on the other end. There is a clear line between systems with informal is-a relations and those with formal is-a relations. While explaining the characteristics of taxonomies and thesauri and the difference between taxonomies and ontologies, and as a basis for their definition of ontologies, Daconta et al. ([23], 2003) propose an Ontology Spectrum with weak semantics on

Figure 2.1: An Ontology Scale by Lassila and McGuinness, 2001
Figure 2.2: An Ontology Scale by Daconta et al., 2003

the lower end of the scale, and strong semantics at the higher end. The scale ranges from the weakest, the Relational Model, via Taxonomy, Schema, ER, Thesaurus, Extended ER, XTM, RDF/S, Conceptual Model, UML, DAML+OIL, Description Logic, Local Domain Theory and First Order Logic, to Modal Logic, which is at the high end of the spectrum. They go to great lengths in describing what taxonomies and ontologies are, and hold that the main difference between them is that taxonomies do not have a rigorous logic on which machines can base inferences, whereas ontologies do.

2.4 Ontology Languages

Ontology languages are the formal languages in which ontologies are defined. In this section, only some of the important characteristics of what are currently the main ontology languages will be described, in a general and informal way. For formal specifications, there is ample literature available.

2.4.1 XML/XML Schema

XML (Extensible Markup Language) is a formal language that conforms to the SGML specifications. It can be seen as a subset of SGML that is simpler and more practical to use than SGML. Because XML, and XHTML which was derived from it, are more rigidly defined than HTML, it is easier to process XML and XHTML automatically and consistently than HTML. One of the reasons for the W3C to develop XML was to deal with the shortcomings of HTML. In HTML, the representation of data and its presentation are mixed and messy; in XML they are strictly separated, enabling clear and unambiguous data representation with well-defined syntactic means. However, the use of XML is far broader than just applications on the World Wide Web. At the time of writing, XML is the most common standard in use for Business to Business (B2B) information interchange.
XML Schema and its formal language, XML Schema Definition (XSD), allow one to create data models and to specify data types and criteria by which XML documents are valid or not. Thus XML documents can be syntactically correct according to the XML specifications, but invalid given a specific XML Schema specification. An older schema language that was in common use for HTML and XML is DTD. DTDs are increasingly making way for XML Schema, but for historic reasons they are still in wide use. Unlike DTDs, XML Schema itself conforms to the XML specifications.
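
The distinction between syntactically correct (well-formed) and valid XML can be sketched as follows. The well-formedness check uses the parser from Python's standard library. As that library does not validate against XML Schema, validity is imitated here by a hand-written rule; the `person` document type and its constraints are hypothetical and merely stand in for what an XSD would express declaratively.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as XML at all."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def is_valid_person(xml_text):
    """Hand-written stand-in for schema validation (hypothetical rule):
    a <person> must contain exactly one <name> and an integer <age>."""
    root = ET.fromstring(xml_text)
    if root.tag != "person" or len(root.findall("name")) != 1:
        return False
    age = root.findtext("age", default="")
    return age.strip().isdigit()


good = "<person><name>John</name><age>42</age></person>"
invalid = "<person><name>John</name><age>unknown</age></person>"
broken = "<person><name>John</age></person>"  # mismatched closing tag
```

In this sketch, `good` and `invalid` are both well-formed, but only `good` satisfies the hypothetical data model, while `broken` is rejected by the parser before any validity question arises.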

In and of themselves, XML and XML Schema do not suffice as ontology languages, for only the correctness and validity of the syntax of XML documents can be determined, not the semantics of these documents. For example, the XML markup <dictator>john</dictator> and <gardiner>john</gardiner> do not mean anything different to an XML parser, even though humans who choose or read these tags will most likely assign a certain meaning to them. However, fully-fledged ontology languages that are capable of expressing complex meaning have been formulated entirely in XML and XML Schema, which is the reason for mentioning XML here.

2.4.2 RDF(S)

One of the many formal languages that have been constructed in accordance with the XML specifications is the Resource Description Framework (RDF) [3]. It was developed by the W3C to provide a solid formal basis for ontology languages, expressing meaning with RDF triples. An RDF triple is a set of three identifiers, or resources, one of which intuitively functions as a subject, one as an object, and one as a predicate or relation between subject and object, much like meaning can be represented in many natural languages and in First Order Predicate Logic (FOL). A triple like (a, R, b) could be represented in FOL with the two-place predicate R like so: Rab or R(a,b). The identifiers in RDF triples are often URIs; for the subject and the predicate this is always the case. The object can be either a URI or a literal. The URIs, which are often URLs on the Web, ensure explicitness and precision of data representation. For example, thousands of different entities called John can all have their own URL disambiguating them. Even though RDF's data model of triples is simple, it is very expressive. Many RDF triples can combine into complicated webs of knowledge that are equivalent to semantic nets.
Even though predicates with more than two places in FOL cannot be represented with a single RDF triple, they can be represented indirectly with multiple RDF triples. Also, reification is part of RDF, so it is possible to make statements about RDF statements in this data model. Furthermore, RDF containers are part of the RDF data model, with groups of resources like bags (unordered sets) and sequences (ordered sets). RDF has been extensively documented by the W3C and all specifications are openly available, in the RDF Concepts and Abstract Syntax document, the RDF Semantics document ([57]) and other documents. Also, a document like the RDF Primer ([56]) makes the technology accessible to the public. RDF does not necessarily have to be represented in XML. Shorthand notations like N3 exist, and tuple notations like (subject, predicate, object) and <subject> <predicate> <object> are in use, as well as graphical representations with directed labeled graphs. Although RDF graphs are easy for humans to read, it is more efficient to serialize the data in
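
The triple data model described above can be illustrated without any RDF library, by storing triples as plain Python tuples. The sketch below is only meant to show how URIs disambiguate two entities that share the name John; all URIs under `http://example.org/` and the helper `match` are hypothetical and not part of any RDF specification or toolkit.

```python
# Each triple is (subject, predicate, object). Subjects and predicates are
# URIs; an object is either a URI or a literal. All URIs are hypothetical.
EX = "http://example.org/"

triples = {
    (EX + "people/john-1", EX + "terms/name", "John"),
    (EX + "people/john-2", EX + "terms/name", "John"),
    (EX + "people/john-1", EX + "terms/occupation", EX + "jobs/gardener"),
    (EX + "people/john-2", EX + "terms/occupation", EX + "jobs/dictator"),
}


def match(graph, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every argument that is not None."""
    return sorted(
        (s, p, o)
        for (s, p, o) in graph
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    )


# Two different resources carry the literal name "John" ...
johns = match(triples, predicate=EX + "terms/name", obj="John")
# ... but only one of them is linked to the gardener resource.
gardeners = match(
    triples, predicate=EX + "terms/occupation", obj=EX + "jobs/gardener"
)
```

Because the occupation is itself a URI rather than a tag name or a string, a program can follow it to further triples, which is how sets of triples grow into the webs of knowledge mentioned above.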

XML format so that it is easy for computer programs to process it. Unlike XML Schema with respect to XML, RDF Schema is not a schema language in which valid RDF representations are defined. RDF Schema was built on top of RDF and can be seen as a limited, lightweight ontology language: in it, the class and subclass relations are defined in a formal way, and RDF vocabularies can be formulated in which classes and properties are distinguished.

2.4.3 OWL

The constraints that RDFS imposes on RDF are quite limited. Other ontology languages were developed that impose more, and more precise, constraints and allow the formulation of heavyweight ontologies. One such language is the Web Ontology Language (OWL), which exists in three types: OWL Lite, OWL DL and OWL Full. Like RDF, OWL is well documented with extensive open documentation, like the OWL Web Ontology Language Reference ([54]). Historically it descends from an earlier ontology language, DAML+OIL, which like OWL itself was based on RDF. In OWL Lite, relatively lightweight ontologies like taxonomies can be formulated; in OWL DL, which is more expressive, more heavyweight ones; and in OWL Full, which is the most expressive of the three, any ontology that the RDF formalism allows for. The choice of the type of OWL can depend on the purpose of a project. If one only needs to produce a taxonomy, OWL Lite can be the evident choice, also for reasons of decidability and efficiency. A quick overview of the OWL specifications can be found in the OWL Web Ontology Language Overview ([53]).

2.5 Ontology Design

Various very specific methodologies for ontology design have been proposed in the literature. They are applicable both to manual and to (semi-)automatic Ontology Design.
The main methodologies for defining a classification of classes or concepts in ontologies are the Top-Down one, which goes from general to specific, and the Bottom-Up one, which goes in the opposite direction, from specific to more general classes or concepts. The Top-Down methodology departs from general concepts and proceeds to specific ones. According to Uschold and Gruninger ([71], 1996), the amount of detail of the ontology is better controlled with this methodology than with the Bottom-Up methodology. A disadvantage of this methodology is, however, that it can become arbitrary which high-level concepts will get a place in the ontology when this

methodology is followed. In the Top-Down approach, the high-level concepts do not follow from the lower level concepts themselves. Therefore, the ontology could become less stable and the process may require more effort and re-work.

The Bottom-Up methodology goes from detailed and specific concepts to more general ones. Uschold and Gruninger ([71], 1996) maintain that this approach may also result in more effort and re-work, but for different reasons. The level of detail in the ontology may become very high in this approach, which may increase the chance of inconsistencies and may make commonalities between related concepts less transparent.

In the Middle-Out approach, one starts with the main concepts in the middle, i.e. those which are neither very high-level nor maximally specific. Uschold and Gruninger ([71], 1996) hold that this approach strikes a balance in the level of detail of the resulting ontology. High-level and low-level concepts then follow naturally from the main concepts from which one departs.

An approach that Uschold and Gruninger ([71], 1996) do not mention is that of Mixture Ontology design. Here, one could start with both high-level concepts and concepts at the lowest level, which have the most detail, thus mixing the Top-Down and Bottom-Up approaches. This methodology can be expected to suffer from the drawbacks of both. All in all, the Middle-Out Ontology Design strategy seems to be the most promising.

Many approaches that construct ontologies in some cyclic, iterative way will thereby include the possibility of enriching existing ontologies. Approaches may also focus on the enrichment of existing ontologies as a goal in itself. Apart from enriching existing ontologies, there are also strategies that reuse existing ontologies to create totally new ontologies. Strategies that merge two or more existing ontologies into one ontology exist as well.


Chapter 3
Ontology Learning

Ontology Learning is the acquisition of knowledge for the (semi-)automatic creation of ontologies. Very often Ontology Learning is from text, but it can also be from other sources, like databases. Because of the interdisciplinary nature of the subject, very many Ontology Learning approaches exist, and many methods and techniques are used in this field. Buitelaar et al. ([11], 2003) argue that, in spite of this multidisciplinary nature, Ontology Learning is a new and challenging area in its own right.

In this chapter, some existing surveys will first be treated. Then some specific approaches will be described that are similar or otherwise related to the OntoSpider approach presented in chapter 5. Finally, some general aspects of various approaches will be mentioned, like commonalities in system designs, convergence or divergence of NLP approaches, choice of AI technologies, etcetera. This study will mainly focus on ontology learning from text. It is not an exhaustive survey. The reason for examining various approaches in addition to using existing surveys was to get a better grasp of the subject matter and to avoid reinventing the wheel.

Roughly, two types of related work can be distinguished: work that is very similar to the total approach, i.e. that involves both a focused crawler and (semi-)automatic creation of ontologies from text, and work that is only similar to part of it, i.e. work that only involves the use of focused crawlers, mainly for scientific data gathering purposes, or that only involves the (semi-)automatic creation of ontologies.

23 16 CHAPTER 3. ONTOLOGY LEARNING 3.1 Ontology Learning Techniques Because of the multidisciplinary nature of the field, many existing techniques from fields like Artificial Intelligence and Information Retrieval are used for Ontology Learning. This section presents some techniques that may be used by various Ontology Learning approaches. Many of these techniques are related to text processing and analysis. There is often a choice of algorithms that can be used for the implementation of the techniques that are being described here. Certain techniques are very general in nature, and might as well have been presented in the chapter on Information Retrieval, chapter 4. Web Mining is a research area in which information is extracted from the World Wide Web. For this extraction, among other things Text Mining may be used, here the information extraction is specifically from texts in natural languages like English and French. Further techniques that are used include Chunk Parsing, POS Tagging and Semantic Tagging. Stopping or stopword removal is a standard technique in NLP. The most frequent words in corpora, the stopwords, will occur in practically any document, hence they are not significant for most IR purposes, and are removed at a very early stage. Like stopping, stemming is very standard in NLP and IR. Stemming reduces the various forms of a word that may be the result of morphological processes like inflection and derivation, to a single stem or root. Some often used stemmers are the Porter stemmer, which has modules for various languages, and the Lovins stemmer. Often, from a morphological point of view, the results of stemmers are quite crude, but from a pragmatic point of view they are still very effective. Part-Of-Speech tagging or POS-tagging is also a very common technique in NLP and IR. One of the most famous one is the Brill POS-tagger, another one is the Monty POS-tagger. 
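
Stopping and stemming as described above can be sketched in a few lines. The stopword list is a tiny illustrative sample, and the stemmer is a deliberately crude suffix stripper written for this example; it is not the Porter or the Lovins algorithm, and the suffix list and minimum stem length are assumptions of the sketch.

```python
import re

# A tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "are", "that"}

# Suffixes are tried in this order, longer ones first. This is a crude
# suffix stripper, not the Porter or Lovins algorithm.
SUFFIXES = ("ations", "ation", "ings", "ing", "ies", "ed", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stopwords(tokens):
    """Drop every token that occurs in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def crude_stem(token):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Apply stopping first, then stemming."""
    return [crude_stem(t) for t in remove_stopwords(tokenize(text))]


terms = preprocess("The parsers are parsing the derived forms of a word")
```

With these rules, "parsers" and "parsing" are reduced to two different stems ("parser" and "pars"), which illustrates the point made above: such output is morphologically crude, and a real stemmer applies considerably more rules to conflate related word forms.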
Chunk Parsing is a shallow technique by which natural language sentences are parsed into chunks. Very roughly, these chunks correspond to syntactic phrases. Often, chunk parsing takes place after a POS-tagging phase, and the technique is widely used in IR. Approaches that have a chunk parser include SMES and SymOntos. The latter uses the CHAOS chunk parser. A specific application of chunk parsing is cascaded chunk parsing. In this approach, the output of one round of chunk parsing can be input to a next round, in which new chunks can be parsed; thus multiple consecutive rounds of chunk parsing can take place.

Semantic Tagging, or Semantic Annotation, is the enrichment of natural language texts like corpora with semantic tags. Often a semantically tagged text comes closer to an ontology, as it could be input to concept extraction modules or otherwise be part of approaches. As part of Semantic Tagging, various types of resolution may take place, like Synonymy, Hyponymy, Hyperonymy and Meronymy Resolution.

Dill et al. ([28], 2003) maintain that automated large-scale semantic tagging of ambiguous content can bootstrap and accelerate the creation of the Semantic Web. Their approach consists of SemTag, a Semantic Tagger that works through three stages: one for spotting, with tokenizing and label extraction, one for learning, and finally one for the actual semantic tagging itself. The other half of the approach, Seeker, will be described in another section. In practice, most of the semantic taggers that exist today only produce shallow results. If the ontologies produced by systems that use them are not to be shallow, shallow semantic taggers can be combined with other techniques.

SMES is presented by Maedche and Staab ([42], 2000; [43], 2000; [44], 2001) as part of the Text-To-Onto approach. In [19] a Concept Extractor was developed for the Ontolo approach.

Clustering is an IR technique in which documents are grouped together in so-called clusters. This technique can be used for classification purposes, or as a preparatory step for further analysis of the documents.

Various approaches use simple pattern matching. Perl or sed regexes, for example, can be very powerful. Pattern matching is especially useful as an addition to other techniques, or as part of other techniques like POS-tagging.

Maedche and Staab emphasize the difference between taxonomic and non-taxonomic relation extraction from text. Much of the work that precedes theirs consists of very shallow approaches, which only succeed in taxonomic relation extraction. What is necessary according to these researchers is non-taxonomic relation extraction from text.
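The simple pattern matching mentioned above can be illustrated with a single regular expression that picks up candidate taxonomic (is-a) relations. The sketch below is an assumption made for illustration only: the "such as" pattern, the function name `extract_is_a` and the example sentences are not taken from any of the systems discussed here, and a pattern this shallow will also produce false positives on real text.

```python
import re

# Illustrative pattern: "<hypernym> such as <hyponym>[, <hyponym>] [and|or <hyponym>]".
# Only single-word terms are handled; this is a sketch, not a full term extractor.
PATTERN = re.compile(r"(\w+) such as ((?:\w+(?:, | and | or )?)+)")


def extract_is_a(sentence):
    """Return (hyponym, hypernym) candidate pairs found in the sentence."""
    pairs = []
    for match in PATTERN.finditer(sentence):
        hypernym = match.group(1)
        # Split the enumeration on the same separators the pattern allows.
        hyponyms = re.split(r", | and | or ", match.group(2))
        pairs.extend((h, hypernym) for h in hyponyms if h)
    return pairs
```

A pattern of this kind only ever yields taxonomic candidates, which is in line with the observation above that shallow approaches tend to stop at the taxonomic level.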
3.2 Ontology Editors and Engineering Tools

Clearly, if the creation of ontologies from text is not done fully automatically but semi-automatically, an ontology engineer will have to correct, refine or expand the ontologies. For this purpose, Ontology Editors and Engineering Tools can be used. They can be considered part of overall semi-automatic approaches.

Da Silva et al. ([60], 2004) present a survey on econstruction, Ontology Engineering/Design tools, and Ontology Exploitation software tools. The focus of the study is on software tools, and the following Ontology Design tools are described: LexiCon, OilED, Protégé2000, OntoEdit, LinkFactory, e-cognos and e-coser, TERMINAE, Text-To-Onto and OntoLearn. In the conclusion on Ontology Design tools, the authors state that Protégé is the most recommended software tool for various reasons, including OWL-compliance, the fact that it is freeware, and the fact that it has a good base of developers around the world who support it. The Ontology Exploitation that is evaluated in the survey is outside the scope of this study.

OntoEdit is an ontology engineering environment that is presented by Maedche and Staab ([42], 2000; [43], 2000; [44], 2001) as part of the Text-To-Onto approach. Only a limited version of OntoEdit is free of charge. The tool runs on Windows and Linux platforms.

Protégé is a very popular Ontology Engineering Tool. It is an Open Source Java tool that can be used for editing domain ontologies or knowledge bases in a user-friendly way with a GUI. The tool comes with a clear tutorial and good documentation, and is used by a large community. It is scalable, platform-independent and easy to extend with plug-ins. Furthermore, it supports data in various formats, like RDF and OWL. Many specialized ontologies in various domains have been developed with Protégé. Linguistics-related ontologies include GOLD, an ontology for descriptive linguistics, and GUM, a general task- and domain-independent, linguistically motivated ontology.

3.3 Ontology Learning Approaches

Surveys of Ontology Learning Approaches

Various surveys of ontology-learning approaches exist. Some of these are briefly presented here in chronological order.

Maedche and Staab ([44, p.76-78], 2001) include a brief survey of ontology-learning approaches in the presentation of their own ontology-learning framework, which includes Text-To-Onto, SMES and OntoEdit. The survey covers the following domains: free text, dictionary, knowledge base, semi-structured and relational schemata. The methods mentioned for free text, the subject matter of OntoSpider, are clustering, inductive logic programming, association rules, frequency-based, pattern-matching and classification methods.
No extensive evaluation is made of the various approaches; they are presented in a table with references to the corresponding literature.

Ying Ding and Schubert Foo ([30], 2002) present a review of ontology generation, in which Infosleuth, SKC, AIFB approaches like SMES, OntoEditor, and Text-To-Onto, ECAI 2000 (SVETLAN, Mo'K, SYLEX, ASIUM), Inductive Logic Programming (WOLFIE), DELOS, OntoWeb, DODDLE, and some more approaches are described. Before presenting these approaches, some general notes on ontology creation are given. An important conclusion they draw is that the complexity of relation extraction is the main impediment to ontology learning and its application, and that learning ontologies from text is still largely a theoretical enterprise, which is not yet advanced enough for real applications.

A more extensive and recent survey of existing approaches can be found in Gomez et al. ([33], 2003). Many researchers have contributed to this survey, and it is very systematic. The following domains are covered: text, machine-readable dictionaries, knowledge bases, structured data, semi-structured data and unstructured data. For ontology learning from text, both methods and tools are described. The methods are usually named after one of the authors of the papers. The tools that are described are Caméléon, CORPORUM-Ontobuilder, DOE, KEA, LTG Text Processing Workbench, Mo'K Workbench, the OntoLearn Tool, Prométhé, SOAT, SubWordNet Engineering Process Tool, SVETLAN, TFIDF-based term classification system, TERMINAE, Text-To-Onto, TextStorm and Clouds, Welkin and WOLFIE. The authors do not pretend to present a complete survey, but do claim that the main approaches have been covered. The systematic presentation of the methods and approaches gives a very clear overview. One thing that may strike the reader because of this is that in many cases certain aspects of approaches are not disclosed in papers at all, which is indicated in the text with "information not available in papers". Approaches with semi-automatic creation of ontologies incorporate various modules.

Descriptions of Ontology Learning Approaches

TERMINAE is presented in various papers, like Biebow et al. ([8], 1999), as a methodology and a tool for building ontologies from text or from scratch. Much attention is given to linguistics, and formality and traceability are requirements. Lexter is used for the extraction of terms from text.
The approach, which focuses on technical text, evolved over time; one version uses Syntex and Caméléon as NLP tools for the subsequent linguistic analyses. The knowledge engineer is expected to have expertise in the subject area of the ontology and to have a good idea of how the resulting ontology will be applied; intuitive GUIs can be used to construct and adapt ontologies. The role of the knowledge expert is crucial in this approach. After normalization, the domain knowledge is formalized in a description logic with rather limited expressive power. Subsequent work by the authors included work on other systems, like Géditerm, which also implemented part of the tasks of their methodology.

Text-To-Onto is presented by Maedche and Staab ([42], 2000; [43], 2000; [44], 2001) as an architecture and a system for semi-automatic creation of ontologies from text. It was used in the On-To-Knowledge project. The authors stress that most of the approaches prior to the year 2000 only got to the taxonomic level and no further, and that non-taxonomic conceptual relations are an important goal in ontology engineering. This view corresponds with the classification of ontologies by Lassila and McGuinness (2001). Maedche and Staab use a balanced cooperative modeling paradigm as proposed by Morik (1993), which includes the use of Text Mining. An NLP module, SMES, is used for shallow text processing, with some extensions for heuristic correlations in order to attain a high recall of relevant linguistic dependency relations. SMES has access to a lexical database with German words. Dependency relations form the main output of SMES. Concept and relation extraction are performed by the learning module, the algorithm of which is based on "Mining Generalized Association Rules" by Ramakrishnan Srikant and Rakesh Agrawal (1995). The ontology engineer is presented with pairs of concepts, which can be included in the ontology as non-taxonomic relations. For this purpose, OntoEdit is used. Furthermore, the ontology engineer can prune the resulting ontology and decide whether or not it is necessary to iterate the ontology learning cycle. The authors stress that this is just one of various possible strategies.

OntoLearn, presented in Missikoff et al. ([48], 2002), is a system that can automatically extract concepts from text to form semantic nets and specialized domain ontologies from corpora. It uses WordNet and large domain corpora. Projects that used OntoLearn include Harmonise, which produced a large ontology on tourism. Other applications involved ontologies in the fields of Economy and Computer Networks. OntoLearn incorporates three main algorithms: one for terminology extraction, one for semantic disambiguation, and one for semantic annotation and the creation of ontologies.
A special algorithm, SSI (Structural Semantic Interconnections), was designed for semantic interpretation, which is also based on the principle of compositionality of meaning. The relevance of concepts that are extracted from a corpus is determined by comparing their frequencies of occurrence with those in a generic corpus, which functions as a contrast corpus. For the purpose of evaluating resulting ontologies, glosses were added to the OntoLearn system. These will be described in a later section.

SymOntos is described by Missikoff et al. ([47], 2001). It is an approach that uses Web Mining for ontology creation and enrichment. The ontological data that is created is not very rich; it is roughly at the taxonomic level, but it can be used for the creation or enrichment of ontologies.

Ontolo, as presented in Chetrit ([19], 2004), is a tool for facilitating Ontology Construction from texts; in fact, that is literally the title of the thesis. The user manually inserts articles from the PubMed database. After POS-tagging, stemming and concept extraction, rudimentary ontologies are created by the Ontology Construction tool.
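The contrast-corpus comparison described above for OntoLearn can be sketched as a ratio of relative frequencies. This is a minimal illustration under stated assumptions, not OntoLearn's actual relevance formula: the add-one smoothing, the function names and the toy token lists are all hypothetical.

```python
from collections import Counter


def relative_freq(counts, term, vocab_size):
    """Relative frequency with add-one smoothing (a choice made for this sketch)."""
    total = sum(counts.values())
    return (counts[term] + 1) / (total + vocab_size)


def domain_relevance(domain_tokens, contrast_tokens):
    """Score each domain term by how much more frequent it is in the
    domain corpus than in the generic contrast corpus."""
    domain = Counter(domain_tokens)
    contrast = Counter(contrast_tokens)
    vocab_size = len(set(domain) | set(contrast))
    return {
        term: relative_freq(domain, term, vocab_size)
        / relative_freq(contrast, term, vocab_size)
        for term in domain
    }


# Toy corpora, made up for illustration.
scores = domain_relevance(
    ["morpheme", "affix", "morpheme", "word"],
    ["word", "house", "word", "car"],
)
```

Terms with a score well above 1 are more typical of the domain corpus than of the contrast corpus and are therefore better candidates for domain concepts; terms that are common in both corpora score around or below 1.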

The system Asium is presented in Faure et al. ([32], 1998). It is a system that automatically acquires semantic knowledge and ontologies from text with Machine Learning techniques. Another system, Sylex, is used for syntactic parsing; after this parsing and post-processing, syntactic frames of clauses are produced. Along with these, an ontology of concepts can be formed. The clustering of words can be done in a hierarchical or in a pyramidal way. Pyramids of clusters are richer than simple hierarchies because multiple parents are possible. The relevance of concepts that can be derived from clusters is determined with a similarity measure that determines how close clusters are to each other. The user interactively validates learned clusters. For this purpose, a GUI is part of the system.

GATE is presented in Cunningham et al. ([22], 2002) and Bontcheva et al. ([9], 2004) as a framework and a graphical development environment for Language Engineering. It was implemented in Java, is freely available, modular, Open Source and well documented in articles and with an extensive user guide. GATE can be seen as a general-purpose and flexible tool for NLP. Here, only GATE v2 will be described. The authors distinguish the following GATE resources: language resources (LRs), processing resources (PRs) and visual resources (VRs). The language resources with declarative data are strictly separated from the processing resources and the visual resources, which enables the users, e.g. linguists or programmers, to concentrate on their field of expertise in their work on GATE. All resources together are called CREOLE, a Collection of REusable Objects for Language Engineering. The GATE resources can be accessed both via a GUI and via the GATE API. The API makes it easier to automate certain tasks.
GATE can deal with various data formats, which are converted into a GATE-specific XML format before they are further processed. Processing resources that are available in GATE include tokenizers and POS-taggers. An important part of GATE v2 is JAPE, an engine for regular expressions that is based on finite state technology. Although the use of finite state technology does not guarantee efficient processing, most tasks that are performed with the JAPE engine are generally efficient. Other modules that are available as plug-ins to GATE include an implementation of the Google API and a web crawler. Various resources that can process ontologies are available for GATE v2. The OntoGazetteer is an interface that enables one to view ontologies. With the OntoGazetteer Editor, the class hierarchies of RDF or RDF(S) ontologies can be edited. Protégé has been integrated with GATE.

OntoLT is a very likely candidate for the Ontology Engineering component of OntoSpider. For this reason, it will be described in more detail here. OntoLT is a plug-in for Protégé that requires Sun's Java Runtime Environment (JRE). A first beta version of the plug-in was made available to the public in November. The current version is 2.0 and it works with version 3.x of Protégé. The following description is based on [10], [12], [13], [14], [15], [16] and [62], and on evaluations that were done with an earlier version on a machine running FreeBSD 5.x. Most screenshots were taken from the latest version. In Protégé, the OntoLT plug-in is represented by a tab.

Figure 3.1: Opening screen of Protégé with the OntoLT plug-in marked in red

OntoLT takes an XML-annotated corpus as input. The format of this XML annotation is proprietary and is called MM. This MM format encodes morphological, syntactic and semantic information. A software package that can produce the necessary XML annotation automatically is SCHUG/WebSchug. SCHUG, which stands for Shallow and Chunk based Unification Grammar, was introduced by Declerck et al. ([24], 2002) and later in other work like Declerck et al. ([26], 2003). SCHUG maps XML with linguistic information onto feature structures, on which unification can work, activating rules that can work on the linguistic data. The technique of Cascaded Chunk Processing is used at this point to perform various kinds of linguistic processing. The output of SCHUG is again XML-encoded data, enriched with more linguistic annotations. SCHUG is able to process various natural languages, like German and Spanish, which is demonstrated in Declerck et al. ([24], 2002), and also, for example, Central and Eastern European languages, as is demonstrated in Declerck et al. ([26], 2003). Another application outside OntoLT that uses SCHUG is the MUMIS project, which performs Information Extraction on multimedia resources in the field of soccer. MUMIS is described in various papers, like Declerck et al. ([25], 2002).

A corpus consists of one or more documents that are marked in XML with <document> tags. Every document is represented by a separate file on disk and can consist of one or more sentences, indicated by <sentence> tags. Sentences consist of clauses, phrases and text, indicated with tags of the same name. Text contains <token> tags, from which the original sentences can be reconstructed. A simplified abstract example of the XML structure follows; the dots are an informal representation of partial information:

    <?xml version= 1.0 encoding= ISO ?>
    <document name=./example.xml date= >
      <sentence id= 1 stype= decl corresp= >
        <clauses> </clauses>
        <phrases> </phrases>
        <text>
          <token> </token>
        </text>
      </sentence>
      <sentence id=... >... </sentence>
    </document>

A sample XML-annotated English corpus that was supplied by SCHUG is included in the OntoLT package. Manual XML annotation is tedious, and the OntoLT plug-in is meant for semi-automatic use anyway, so the only realistic alternatives to using SCHUG are writing an alternative semi-automatic XML annotator or adapting an existing one to deal with this specific XML format. For the purpose of this study, WebSchug was chosen. Much is done by the module that produces the input XML for OntoLT. It takes care of POS-tagging, (other) morphological analysis, syntactic analysis and lexical semantic tagging, and provides XML markup for all this.

When the OntoLT tab is clicked, tabs for Operators, Mappings, Conditions and Corpora will be visible (Figure 3.2). In the Corpora tab, new corpora can be imported. For this purpose, multiple XML-annotated files can be selected and together given a corpus name. Clicking on the binoculars in the Candidate View tab and selecting the corpus then extracts candidate classes, slots and instances. The name of the extraction is derived from the time at which it took place. Extracted candidates can be inspected by clicking on key icons. The user can choose with which candidates the resulting ontology should be enriched. At the time of writing, OntoLT only allows for the extension of ontologies, not for creating smaller ontologies from existing ones.

Figure 3.2: Tabs of OntoLT

The extraction takes place based on XPath expressions. These can be found under the XPaths tab. If an XPath expression matches, a mapping rule is activated, based on which candidates may be extracted. Both the XPath expressions and the mapping rules can be adjusted or added to by the user. For the XPath expressions, a precondition language is available, comprising the predicates containspath, HasValue, HasConcept, AND, OR, NOT and EQUAL, and the function ID. The beta version of OntoLT 1.0 includes two mapping rules which consist of large conjunctions of conditions (Figure 3.3).
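How candidates can be selected from such an annotated document with XPath expressions may be sketched as follows. The sketch uses the limited XPath subset of Python's standard library and is not OntoLT's precondition language. The element names follow the structure described above, but the `pos` attribute, its values and the token contents are hypothetical and are not taken from the actual MM format.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: the element names follow the structure
# described in the text; the "pos" attribute and the token contents are
# made up for illustration and are NOT part of the real MM format.
SAMPLE = """<document name="./example.xml">
  <sentence id="1" stype="decl">
    <clauses/>
    <phrases/>
    <text>
      <token pos="DET">The</token>
      <token pos="N">stemmer</token>
      <token pos="V">reduces</token>
      <token pos="N">words</token>
    </text>
  </sentence>
</document>"""


def sentence_text(sentence):
    """Reconstruct the original sentence from its token elements."""
    return " ".join(tok.text for tok in sentence.findall("./text/token"))


def candidate_nouns(root):
    """Select noun tokens with an XPath predicate on the assumed 'pos' attribute."""
    return [tok.text for tok in root.findall(".//sentence/text/token[@pos='N']")]


root = ET.fromstring(SAMPLE)
```

In this toy setting, `candidate_nouns(root)` plays the role of an XPath condition that fires on noun tokens, and the returned strings stand in for the candidate classes that a mapping rule would then propose to the user.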

Figure 3.3: Mapping Rule for Head Nouns and Modifiers

Figure 3.4: Above Rule in an older version of OntoLT
