Using WordNet to Disambiguate Word Senses


Using WordNet to Disambiguate Word Senses

by Ying Liu
Electrical and Computer Engineering

Acknowledgements

I would like to first thank Prof. Peter Scheuermann, without whose constant guidance, support and encouragement this work would not have been possible. I would also like to thank Bin Chen, who gladly discussed various issues related to my work with me. This work is the result of many insightful discussions with Prof. Scheuermann, who inspired me throughout and guided me whenever required. I would also like to thank the members of the Database System group for their friendship and help: Shayan Zaidi, Mehmet Sayal, Olga Shumsky, and Chris Fernandes. Further, I would like to thank Dr. Ellen M. Voorhees for her suggestions. Finally, I would like to thank my parents, Zongli Liu and Huifang Xu, who have guided me all through my life, for all the love, encouragement and virtues I received while growing up.

Contents

Introduction
    Motivation
    Contribution
    Organization
Background Knowledge
    WordNet
    Part-of-Speech Taggers
    Stemming
    Stopwords
Work related to Word Sense Disambiguation
    Survey of Approaches to Word Sense Disambiguation
        Knowledge Based
        Corpus Based
            Disambiguated Corpora
            Raw Corpora
        Hybrid Approaches
Using Hood Algorithm to Disambiguate Word Senses
    Converting WordNet into Relational Database
    Algorithm
        Hood Construction
        Word Sense Disambiguation
Experiments
    Part-of-Speech Tagged Brown Corpus
    Flow of Experiment
    Quality of Results
    Result Analysis
Conclusion and Future Work
    Conclusion
    Future Work and Application
References
Appendix A: Definition of Tables

CHAPTER 1

Introduction

Text retrieval, also known as document or information retrieval, is concerned with locating natural language documents whose contents satisfy a user's information need. Unfortunately, there are billions of documents on the Internet today, many of which don't have abstracts or even titles. Therefore, there is considerable interest in developing techniques that automatically index full-text documents and provide access to heterogeneous collections of full-text documents.

1.1 Motivation

Search engines are great tools that help users find desired documents. Whenever a user submits a set of query keywords, documents that contain some or all of the keywords are returned. However, these search engines are not good enough to answer all queries. For instance, most web users have experienced the problem that, when a large number of web pages is returned, one has to go through many unrelated pages to identify the useful ones. Sometimes not only is the number of documents returned large, but the categories identified by the search engine are also irrelevant. Let's look at an example. If a computer hardware engineer wants to search for documents related to "board", Yahoo returns the following categories:

Figure 1-1: Yahoo! Category Matches (1-8 of 2394)

Only the first 8 of the 2394 matches are listed. The results are organized in a hierarchical structure, i.e., in the first row, Recreation is the top category, Games is the second-level category, and Board Games is the category or web page that contains the keyword "board". Only 2 categories are related to circuit_board. Obviously, it is a big burden for the computer hardware engineer to sift out the documents he is really interested in from such a large number of categories; remember that within each category there may be numerous web sites. Although the user is only interested in the circuit_board meaning of "board", the search engine returns all the documents that contain "board". To explain why search engines sometimes cannot generate satisfactory categories, we review how their hierarchical classification structures are generated. Most hierarchical categories or classes employed by search engines were either manually

set up or automatically constructed by data clustering algorithms. Since the class hierarchies generated by clustering algorithms lack semantic information, they are likely to perform poorly when the number of query terms is small or a query term has more than one meaning. Although manually constructed class hierarchies normally have higher accuracy, they also suffer from a number of problems. First, manually constructed classes are not concept oriented; that is, more than one class can have a given keyword as its name or label. For example, there are multiple classes named "board" in Figure 1-1. Consequently, users have to explore a huge number of categories in order to identify the desired pages. Secondly, since the hierarchies are often maintained by a group of people, over time the update procedure is prone to conflicting classification criteria. To overcome the disadvantages of manually constructed classes, an algorithm is proposed that constructs a hierarchical classification model based on keywords and their relationships drawn from thesauri. Specifically, each class corresponds to one concept, since in human memory different keywords are used to represent different objects, ideas, or activities. The topics of the documents in each class are similar. The hierarchical structure is maintained via IS-A or PART-OF relationships between classes; e.g., class "homer" is PART-OF class "baseball", hence class "baseball" is a super class of "homer". The advantages of such a novel class hierarchy can be summarized as follows:

1. Each class name corresponds to one word (actually a concept, or keyword sense), which is suitable for keyword-based queries.

2. The relationships between classes are semantically defined by the thesauri; therefore, the hierarchy is much more stable than traditional hierarchical classes.

With this thesaurus-based hierarchy, documents are then mapped to the class hierarchy. During the mapping, a threshold min_sim [28] is employed to determine whether a document and a class are similar to each other or not. After documents are mapped to the class hierarchy, class representative vectors [28] are adjusted to reflect the topics of the documents. Next, documents are re-mapped using the adjusted class representative vectors. This re-mapping iteration may repeat a number of times. Then classes that contain too few documents are removed by a hierarchy refinement procedure [28]. The resulting class hierarchy and the document mapping form the final hierarchical classification. Fortunately, WordNet, an electronic dictionary developed at Princeton University, is a concept-based dictionary whose lexical relations are IS-A and PART-OF. It is used as the frame for this proposed hierarchical classification. Assuming that the class hierarchy is already constructed, what we need to do is classify documents into their appropriate classes. Polysemy, defined as a single word form having more than one meaning, causes false classification. For example, if we failed to tell which meaning of "board" is used in a given situation, we would probably map that document to a wrong class. Synonymy, defined as multiple words having the same meaning, causes true conceptual mappings to be missed. Therefore, the critical step of classification is to recognize synonyms and detect the uses of different meanings of each word in each document. For example, if we failed to recognize that "notebook" and "laptop" mean the

same thing, all those documents that use "notebook" in place of "laptop" would be left out of the class "laptop". The issue is how to automatically detect polysems and synonyms. In principle, polysems and synonyms can be handled by assigning the different senses of a word to different concept identifiers and assigning the same concept identifier to synonyms. In practice, this requires procedures that are not only able to recognize synonyms but can also detect uses of different senses of a word.

1.2 Contribution

In this report, we implemented the disambiguation algorithm introduced by Ellen M. Voorhees in her paper "Using WordNet to Disambiguate Word Senses for Text Retrieval" [5]. This algorithm is intended to automatically detect and resolve the senses of the polysemous nouns occurring in the texts of documents and queries. Each word processed by this technique in any document is mapped to a unique concept, the meaning intended in that context. However, she did not apply this idea to any text document. We applied the algorithm to a set of documents, the Brown Corpus, one of the most widely used document collections in a variety of fields. Finally, we tested the effectiveness of this automatic disambiguation algorithm by comparing its output with the manual disambiguation results offered by Princeton University. Our experiments verified Dr. Voorhees' conclusion in her paper [5] that this algorithm is not sufficient to reliably select the correct sense of a noun from the set of senses in WordNet.

1.3 Organization

The remainder of the thesis is organized as follows. Chapter 2 gives some background on text retrieval and WordNet. Chapter 3 explores the work done in the area of sense disambiguation. Chapter 4 explains the algorithm introduced by Dr. Voorhees in detail: the first section explains the hood construction part of the algorithm, and the second section explains the word sense disambiguation part. Chapter 5 presents our experimental results, together with a qualitative analysis of the algorithm. Chapter 6 draws conclusions from this work. Finally, we comment on the future work that can be explored in this area and its potential applications.

CHAPTER 2

Background Knowledge

In our work, we apply our algorithm to the Brown Corpus. We downloaded the part-of-speech tagged Brown Corpus from the University of Pennsylvania. First, we remove all the tags and all non-noun words, since most of the semantics is carried by nouns [2]. Secondly, we convert each word to its stem with Porter's algorithm. Thirdly, we remove all words that are not in WordNet, as well as stopwords. Thus, after these three processing steps every document in the corpus is represented only by its valid noun words. Finally, our algorithm is run and the results are analyzed. To help the reader understand our work, this chapter gives some background knowledge involved in it.

2.1 WordNet

WordNet is a manually constructed lexical system developed by George Miller and his colleagues at the Cognitive Science Laboratory at Princeton University [12]. Originating from a project whose goal was to produce a dictionary that could be searched conceptually instead of only alphabetically, WordNet evolved into a system that reflects current psycholinguistic theories about how humans organize their lexical memories.

In WordNet, the basic building block is a synset, consisting of all the words that express a given concept. Synsets, into which senses are manually classified, denote synonym sets. Within each synset the senses, although from different words, denote the same meaning. For example, "board" has several senses, and so does "plank". Each of the two words has a sense meaning "a stout length of sawn timber; made in a wide variety of sizes and used for many purposes". The synset corresponding to this sense is composed of "board" and "plank": these senses of "plank" and "board" are synonymous and form one synset. Because all synonymous senses are grouped into one synset and all different senses of the same word are separated into different synsets, there are no synonymous or polysemous synsets. Hence, every synset represents a lexicalized concept. There are four main divisions in WordNet, one each for nouns, verbs, adjectives and adverbs. Within a division, synsets are organized by the lexical relations defined on them. For nouns, the only division of WordNet used in my work, the lexical relations include IS-A and PART-OF relations. For example, Figure 2-1 shows the hierarchy relating the eight different senses of the noun "board". The synsets with the heavy border are the actual senses of "board", and the remaining synsets are either ancestors or descendents of one of the senses. The synsets {group, grouping} and {entity, thing} in Figure 2-1 are examples of heads of the hierarchies. Other heads include {act, human_action, human_activity}, {abstraction}, {possession} and {psychological_feature}.
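WordNet's synsets can also be browsed programmatically. A minimal sketch using NLTK's WordNet reader follows (this is not the access path used in this work, which reads the flat files directly, and the exact synsets printed depend on the WordNet version NLTK ships):

from nltk.corpus import wordnet as wn  # requires a one-time nltk.download('wordnet')

# List every noun synset containing the word "board", with its
# synonyms, gloss and direct IS-A parents.
for syn in wn.synsets('board', pos=wn.NOUN):
    print(syn.name(), '->', syn.lemma_names())
    print('  gloss:  ', syn.definition())
    print('  parents:', [p.name() for p in syn.hypernyms()])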

[Figure 2-1: The IS-A hierarchy for eight different senses of the noun "board". The hierarchy is headed by {group, grouping} and {entity, thing} and includes, among others, the synsets {control_panel, display_panel, panel, board}, {circuit, closed_circuit}, {circuit_board, circuit_card, board, card}, {bulletin_board, notice_board}, {governing_board, directorate, board_of_directors}, {committee, commission} and {board, plank, deal}.]

WordNet 1.6 (2000), the version of WordNet used in this work, contains words and senses in the noun division. Because synsets contain only strict synonyms, the majority of synsets are quite small. Similarly, the average number of senses per word is close to one. This seems to suggest that polysemy and synonymy occur too infrequently to be a problem, but these averages are misleading. The more frequently a word is used, the more polysemous it tends to be [13]. The more common words also tend to appear in the larger synsets. Thus, it is precisely those nouns that actually get used in documents that are most likely to have many senses and synonyms.

2.2 Part-of-Speech Taggers

Many corpora are, in addition to structural and bibliographic information, annotated with linguistic knowledge. The most basic and common form this annotation takes is marking up the words in the corpus with their part-of-speech tags. This adds value to the corpus because, for example, searches can be performed not only on the word forms as strings but also on whether they belong to a certain linguistic category. Such tags are typically taken to be atomic labels attached to words, denoting the part of speech of the word together with shallow morphosyntactic information; e.g., they specify the word as a proper singular noun, or a plural comparative adjective. For English and other Western European languages, for which most such annotated corpora have been produced, the tagset size ranges from about forty to several hundred distinct categories [8]. For example, since "happy" is an adjective, it is tagged with "JJ", the representation

of adjectives, as shown below; so are "one-of-a-kind" and "run-of-the-mill". Every word in every document is tagged in this way.

happy/JJ one-of-a-kind/JJ run-of-the-mill/JJ

2.3 Stemming

Stemming is a technique for reducing words to their grammatical roots. A stem is the portion of a word which is left after the removal of its affixes (i.e., prefixes and suffixes). A typical example of a stem is the word "connect", which is the stem of the variants "connected", "connecting", "connection", and "connections". Stems are thought to be useful because they reduce variants of the same root word to a common concept. Furthermore, stemming reduces the size of the indexing structure because the number of distinct words is reduced. The best-known algorithm for stemming is Porter's algorithm [9], introduced by M. F. Porter. The program is given an explicit list of suffixes and, with each suffix, the criterion under which it may be removed from a word to leave a valid stem. The main merits of the program are that it is small, fast and reasonably simple, while its success rate is reasonably good. It is quite realistic to apply it to every word in a large file of continuous text.
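As an illustration, here is a minimal sketch using the Porter stemmer implementation that ships with NLTK (this work used its own implementation of Porter's algorithm):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ['connected', 'connecting', 'connection', 'connections']:
    print(word, '->', stemmer.stem(word))
# All four variants reduce to the common stem 'connect'.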

2.4 Stopwords

Words which are too frequent among the documents are not good discriminators. In fact, a word which occurs in 80% of the documents in a collection is useless for purposes of retrieval or classification. Such words are frequently referred to as stopwords and should be filtered out. Articles, prepositions and conjunctions, such as "an", "against", and "and", are candidates for a list of stopwords. Removal of stopwords not only improves the accuracy of retrieval or classification, but also reduces the size of the document.
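A minimal sketch of the stopword-removal step follows (the tiny stopword list here is a stand-in for the full list used in the experiments):

# A stand-in stopword list; the experiments used a full stopword list.
STOPWORDS = {'a', 'an', 'and', 'against', 'of', 'the', 'in', 'is'}

def remove_stopwords(words):
    """Filter out stopwords before indexing or classification."""
    return [w for w in words if w.lower() not in STOPWORDS]

print(remove_stopwords(['the', 'board', 'against', 'the', 'wall']))
# ['board', 'wall']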

CHAPTER 3

Work related to Word Sense Disambiguation

One of the first problems encountered by any natural language processing system is that of lexical ambiguity, be it syntactic or semantic. The resolution of a word's syntactic ambiguity has largely been solved in language processing by part-of-speech taggers, which predict the syntactic category of words in text with high levels of accuracy (for example [14]). The problem of resolving semantic ambiguity is generally known as word sense disambiguation and has proved to be more difficult than syntactic disambiguation. The problem is that words often have more than one meaning, sometimes fairly similar and sometimes completely different. The meaning of a word in a particular usage can only be determined by examining its context. This is, in general, a trivial task for the human language processing system. However, the task has proved difficult for computers, and some have believed it would never be solved. Nevertheless, there have been several advances in word sense disambiguation, and we are now at a stage where lexical ambiguity in text can be resolved with a reasonable degree of accuracy.

3.1 Survey of Approaches to Word Sense Disambiguation

It is useful to distinguish some different approaches to the word sense disambiguation problem. In general, we can categorize all approaches into one of three general strategies: knowledge based, corpus based and hybrid. We shall now look at each of these three strategies in turn.

Knowledge Based

Under this approach, disambiguation is carried out using information from an explicit lexicon or knowledge base. The lexicon may be a machine readable dictionary or thesaurus, or it may be hand-crafted. This is one of the most popular approaches to word sense disambiguation, and work has been done using existing lexical knowledge sources such as WordNet [16,17,18,19,5], LDOCE [20,21], and Roget's International Thesaurus [22]. The information in these resources has been used in several ways; for example, Wilks and Stevenson [23], Harley and Glennon [24] and McRoy [25] all use large lexicons (generally machine readable dictionaries) and the information associated with the senses (such as part-of-speech tags, topical guides and selectional preferences) to indicate the correct sense. The word sense disambiguation algorithm implemented in our work, introduced by Voorhees [5], takes advantage of WordNet and part-of-speech tags. Another approach is to treat the text as an unordered bag of words, where similarity measures are calculated by

looking at the semantic similarity (as measured from the knowledge source) between all the words in the window, regardless of their positions, as was used by Yarowsky [22].

Corpus Based

This approach attempts to disambiguate words using information gained by training on some corpus, rather than taking it directly from an explicit knowledge source. The training can be carried out on either a disambiguated or a raw corpus, where a disambiguated corpus is one in which the semantics of each polysemous lexical item is marked, and a raw corpus is one without such marking.

Disambiguated Corpora

This set of techniques requires a training corpus which has already been disambiguated. In general, a machine learning algorithm of some kind is applied to certain features extracted from the corpus and used to form a representation of each of the senses. This representation can then be applied to new instances in order to disambiguate them. Different researchers have made use of different sets of features; for example, [15] used local collocates such as the first noun to the left and right, the second word to the left/right, and so on. The general problem with these methods is their reliance on disambiguated corpora, which are expensive and difficult to obtain. This has meant that many of these algorithms have been tested on very small numbers of different words.

Raw Corpora

It is often difficult to obtain appropriate lexical resources, and we have already noted the difficulty of obtaining disambiguated text for supervised disambiguation. This lack of resources has led several researchers to explore the use of raw corpora to perform unsupervised disambiguation. It should be noted that unsupervised disambiguation cannot actually label specific terms as referring to a specific concept: that would require more information than is available. What unsupervised disambiguation can achieve is word sense discrimination, which clusters the instances of a word into distinct categories without giving those categories labels from a lexicon (such as WordNet synsets).

Hybrid Approaches

These approaches cannot properly be classified as either knowledge based or corpus based, but use parts of both. A good example is Luk's system [26], which uses the textual definitions of senses from a machine readable dictionary to identify relations between senses. He then uses a corpus to calculate mutual information scores between these related senses in order to discover the most useful information. This allowed Luk to produce a system which used the information in lexical resources as a way of reducing the amount of text needed in the training corpus.

CHAPTER 4

Using Hood Algorithm to Disambiguate Word Senses

In this chapter we present our implementation of the algorithm of Voorhees [5], with the help of WordNet. It is based on the idea that a set of words occurring together in context will determine appropriate senses for one another, despite each individual word being multiply ambiguous. A common example of this effect [27] is the set of nouns "base", "bat", "glove" and "hit". While most of these words have several senses, when taken together the intent is clearly the game of baseball. To exploit this idea automatically, a set of categories representing the different senses of words needs to be defined. Once such categories are defined, the number of words in the text that have senses belonging to a given category is counted. The senses that correspond to the categories with the largest counts are selected as the intended senses of the ambiguous words. Obviously, the category definitions are a critical component.

4.1 Converting WordNet into Relational Database

WordNet, the dictionary system by the Cognitive Science Laboratory of Princeton University, is stored in flat files, not in a database. In order to make the implementation easy and obtain good performance, WordNet is converted into a relational database. The four relations created for WordNet are shown in Tables 4-1 to 4-4 (for the detailed definitions of the tables, refer to Appendix A). Each of the four definitions is in third normal form, and each of the relations synsets, words, synset_word and synset_relations contains only

distinct records.

1. synsets(synset_id, category, hierarchy, meaning)

synset_id: a unique decimal integer which represents a synset in WordNet.
category: a one-character code indicating the synset type. For example, "n" indicates a noun.
hierarchy: the hierarchy the synset belongs to. In WordNet, the hierarchies for nouns range from 3 to 28.
meaning: the definition (gloss) for the synset.

This table contains the basic information of each synset in WordNet; however, only the attribute synset_id is used in our work. Example tuples are shown in Table 4-1: the first column is the synset_id for a synset in WordNet; "n" means that the synset is in the noun division of WordNet (WordNet also has a verb division); 28 means the synset belongs to hierarchy 28; meaning is the gloss for the synset.

Table 4-1: Relation Definition for synsets and Tuples

synset_id | category | hierarchy | meaning
…         | n        | 28        | a period of the year marked by special events or activities in some field
…         | n        | 14        | a committee having supervisory powers

2. synset_relations(synset_id1, synset_id2, rel_str)

synset_id1: a child synset of synset_id2. This table does not store any relationship other than parent-child.
synset_id2: a parent synset of synset_id1.

rel_str: the actual symbol used in WordNet to describe the relationship.

~: synset_id1 is a hypernym of synset_id2 (synset_id1 is a superordinate <generic> of synset_id2 <specific>)
@: synset_id1 is a hyponym of synset_id2 (synset_id1 is a subordinate <specific> of synset_id2 <generic>)
%: synset_id1 is a holonym of synset_id2 (synset_id2 is part of synset_id1)
#: synset_id1 is a meronym of synset_id2 (synset_id1 is part of synset_id2)
#p: synset_id1 is part of synset_id2
#m: synset_id1 is a member of synset_id2
#s: synset_id1 is the stuff that synset_id2 is made from
=: synset_id1 has an attribute synset_id2 (synset_id2 is an adjective)
!: synset_id1 and synset_id2 are antonyms (not stored in this table)

Although there are many kinds of relationship, we can treat all of them as child-parent relationships. Each synset can have multiple parents or multiple children. Each pair of child and parent is a tuple in this table. Example tuples are shown in Table 4-2, where two synsets are recorded as children of the same parent synset; the rel_str values indicate the kind of relationship. This relation is used frequently in our work: we depend on the parent-child relationship to find the ancestors of a given synset.

Table 4-2: Relation Definition for synset_relations and Tuples

synset_id1 | synset_id2 | rel_str
…          | …          | #p
…          | …          | #m

3. words(word_id, word)

word_id: a unique decimal integer for each meaning of each word in WordNet.

word: the word one of whose meanings is numbered word_id.

Each word in WordNet may have multiple meanings; for every meaning we assign a unique identification number, word_id. Example tuples are shown in Table 4-3: the first row holds the word_id for one of the 3 meanings of "season"; "board" has 9 meanings, one of which appears in each of the other two rows under its own word_id. This relation is also used frequently in our work: we depend on it to find all the synsets a given word belongs to.

Table 4-3: Relation Definition for words and Tuples

word_id | word
…       | season
…       | board
…       | board

4. synset_word(synset_id, word_id)

synset_id: defined in relation synsets.
word_id: defined in relation words.

This table is the connection between synsets and words. One synset may consist of more than one word_id, while each word_id is assigned to only one synset. This guarantees that all different meanings of the same word are separated into different synsets; in other words, there are no synonymous or polysemous synsets. Hence, every synset represents a lexicalized concept. Example tuples are shown in Table 4-4: the word "board" is a member of a given synset because one of its meanings (one word_id) is very close in meaning to that synset.

Table 4-4: Relation Definition for synset_word and Tuples

synset_id | word_id
…         | …
…         | …

All the major information of the noun division of WordNet is stored in two files: noun.dat and noun.idx. The data format in noun.dat is as follows:

synset_id hierarchy category w_cnt word lex_id [word lex_id...] p_cnt [pointer_symbol synset_id pos source/target] gloss

NOTE:
w_cnt: number of words in the synset.
lex_id: a one-digit hexadecimal integer that uniquely identifies a meaning of this word. It usually starts with 0.
p_cnt: number of pointers from this synset to other synsets.
pointer_symbol: refer to the definition of relation synset_relations.
pos: syntactic category, "n" for noun.
source/target: a value of 0000 means that pointer_symbol represents a semantic relation between the current (source) synset and the target synset.

For example, in the record below for "season": the first field is a synset_id; 28 is the hierarchy; "n" means noun; 01 indicates there is only one word in this synset; "season" is the word in this synset; 2 indicates that this meaning is the second meaning of the word "season"; 015 indicates that the synset has 15 pointers to other synsets; of the 15 target synsets, one is its parent synset (pointed to via "@") and the other 14 are its child synsets (pointed to via "~"); the last part is the definition with example sentences.

… 28 n 01 season 2 015 @ … n 0000 ~ … n 0000 [13 more "~" pointers] | a period of the year marked by special events or activities in some field; "he celebrated his 10th season with the ballet company" or "she always looked forward to the avocado season"
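As a concrete rendering of the four relations above, here is a minimal sketch of the schema in SQLite (the thesis does not name its database engine, and the column types are assumptions):

import sqlite3

conn = sqlite3.connect('wordnet.db')
conn.executescript('''
CREATE TABLE synsets (
    synset_id INTEGER PRIMARY KEY,  -- unique id of the synset
    category  TEXT,                 -- one-character synset type, 'n' for noun
    hierarchy INTEGER,              -- noun hierarchies range from 3 to 28
    meaning   TEXT                  -- the gloss
);
CREATE TABLE synset_relations (
    synset_id1 INTEGER,             -- child synset
    synset_id2 INTEGER,             -- parent synset
    rel_str    TEXT                 -- WordNet pointer symbol, e.g. '@', '#p'
);
CREATE TABLE words (
    word_id INTEGER PRIMARY KEY,    -- one id per meaning of each word
    word    TEXT
);
CREATE TABLE synset_word (
    synset_id INTEGER REFERENCES synsets(synset_id),
    word_id   INTEGER UNIQUE REFERENCES words(word_id)  -- one synset per word_id
);
''')
conn.commit()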

The data format in noun.idx, in turn, is as follows:

word pos poly_cnt p_cnt [pointer_symbol...] sense_cnt tagsense_cnt synset_id [synset_id...]

NOTE:
pos: syntactic category, "n" for noun.
poly_cnt: number of different meanings (polysemy) the current word has in WordNet. This is the same value as sense_cnt, but is retained for historical reasons.
p_cnt: number of different types of pointers the current word has in all synsets containing it.
pointer_symbol: refer to the definition of relation synset_relations.
sense_cnt: number of different meanings the current word has in WordNet.
tagsense_cnt: number of meanings of the current word that are ranked according to their frequency of occurrence in semantic concordance texts.
synset_id: each synset_id in the list corresponds to a different meaning of the current word in WordNet.

For example, in the record below, "seat" is a word; "n" means noun; 6 indicates that "seat" has 6 senses; 5 means that "seat" has 5 different types of pointers (@, ~, #m, #p, %p) in all the synsets containing it; again, 6 tells that "seat" is in 6 synsets; finally, the synsets containing "seat" are listed one by one.

seat n 6 5 @ ~ #m #p %p 6 … … … … … … …

Pseudo-code 1 shows the steps to convert the data in the two flat files into the relational database. The for loop from line 1 to line 8 extracts data from noun.dat to construct table synsets. The second for loop, from line 9 to line 21, extracts data from noun.dat again to construct table synset_relations; its inner loop generates a separate tuple for every pair of child and parent. That means that if a synset has multiple pointers to other synsets,

there are multiple tuples for it, representing the multiple-children or multiple-parents relationships. Then the code from line 22 to line 32 extracts data from noun.idx to construct tables words and synset_word; its inner loop generates a separate tuple for every sense in words and synset_word.

Pseudo-code 1: build_wordnet()
1  for each line in noun.dat
2      synset_id <- retrieve synset_id
3      hierarchy <- retrieve hierarchy
4      category <- retrieve category
5      skip the next items until gloss
6      meaning <- retrieve gloss
7      insert tuple (synset_id, hierarchy, category, meaning) into table synsets
8  end
9  for each line in noun.dat
10     synset_id <- retrieve synset_id
11     skip the next items until p_cnt
12     num_pointers <- retrieve p_cnt
13     for each pointer
14         relationstr <- retrieve pointer_symbol
15         relationsynset_id <- retrieve synset_id
16         if (synset_id is the parent of relationsynset_id)
17             insert tuple (relationsynset_id, synset_id, relationstr) into table synset_relations
18         else
19             insert tuple (synset_id, relationsynset_id, relationstr) into table synset_relations

20     end
21 end
22 for each line in noun.idx
23     word <- retrieve word
24     skip the next items until sense_cnt
25     numsenses <- retrieve sense_cnt
26     for each sense
27         generate a unique id word_id for this sense
28         insert tuple (word_id, word) into table words
29         synset_id <- retrieve synset_id
30         insert tuple (synset_id, word_id) into table synset_word
31     end
32 end

4.2 Algorithm

4.2.1 Hood Construction

Using each separate hierarchy as a category is well defined but too coarse grained. For example, in Figure 2-1 seven of the eight senses of "board" are in the {entity, thing} hierarchy. Similarly, using individual synsets is well defined but too fine grained. Therefore, this algorithm defines an appropriate middle-level category, the hood. To define the hood of a given synset s, consider the set of synsets and the hyponymy links in WordNet as the set of vertices and directed edges of a graph. Then the hood of s is the largest connected subgraph that contains s, contains only descendents of an ancestor of s, and contains no synset that has a descendent that includes another

instance of a member of s as a member. A hood is represented by the synset that is the root of the hood. In other words, as shown in Figure 4-1, assume that synset s consists of k words w(1), w(2), ..., w(k), and that p(1), p(2), ..., p(n) are n ancestors of s, where p(m) is a father of p(m-1). Suppose p(m) (for some m in 1..n) has a descendent synset which also includes some w(j) (j in 1..k) as a member, and p(m) is the closest ancestor of s with this property. Then p(m-1) is one of the root(s) of the hood(s) of s, as shown in Case 1. If m is 1, s itself is the root, as shown in Case 2. If no such m is found, the root r of the WordNet hierarchy containing s is the root of the hood of s, as shown in Case 3. If s itself has a descendent synset that includes some w(j) (j in 1..k) as a member, there is no hood in WordNet for s, as shown in Case 4. Because some synsets have more than one parent, synsets can have more than one hood. A synset has no hood if the same word is a member of both the synset and one of its descendents. For example, in Figure 2-1 the hood of the committee sense of "board" is rooted at the synset {group, grouping} (and thus the hood for that sense is the entire hierarchy in which it occurs) because no other synset in this hierarchy contains "board" (Case 3); the hood for the circuit_board sense of "board" is rooted at {circuit, closed_circuit} because the synset {electrical_device} has a descendent synset {control_panel, display_panel, panel, board} containing "board" (Case 1); and the hood for the panel sense of "board" is rooted at the synset itself because its direct parent synset {electrical_device} has a descendent synset {circuit_board, circuit_card, board, card} containing "board" (Case 2). Pseudo-code 2 shows the steps to find the root(s) of the hood(s) for a given synset. The input for this procedure is a given synset_id, s. The output is the synset_id(s) of the

root(s) of the hood(s) for s. The code from line 1 to line 10 gets all the synsets that have at least one member word in common with s and saves them in a hashtable, synset_id_hashtable. From line 11 to line 22, we get all the ancestors of every synset in synset_id_hashtable and keep them in another hashtable, all_ancestors_hashtable.

[Figure 4-1: Root of Hood(s) of Synset s, illustrating the four cases: the hood root is p(m-1) (Case 1), s itself (Case 2), or the hierarchy root r (Case 3); in Case 4, s has no hood.]

From line 23 to line 43, we find the ancestors of s one

by one, from the closest to the farthest. Whenever an ancestor a is in all_ancestors_hashtable (in other words, a has a descendent that includes another instance of a member of s as a member), its child on the path from s to a is a root of a hood of s. In our work, we apply the find_hood_root(s) procedure to all the synsets in WordNet. The output is stored in hood_root.txt for further computation.

Pseudo-code 2: find_hood_root(s)
1  word_id_set <- π_word_id(σ_synset_id=s(synset_word))
2  for each word_id in word_id_set
3      word_set <- π_word(σ_word_id=word_id(words))
4      for each word in word_set
5          all_word_id_set <- π_word_id(σ_word=word(words))
6      end
7  end
8  for each word_id in all_word_id_set
9      synset_id_hashtable <- π_synset_id(σ_word_id=word_id(synset_word))
10 end
11 for each synset_id in synset_id_hashtable except s
12     current_id_hashtable <- synset_id
13     while (current_id_hashtable is not empty)
14         for each synset_id in current_id_hashtable
15             parent_id_hashtable <- π_synset_id2(σ_synset_id1=synset_id(synset_relations))
16         end
17         clear current_id_hashtable
18         copy parent_id_hashtable to current_id_hashtable
19         copy parent_id_hashtable to all_ancestors_hashtable

20         clear parent_id_hashtable
21     end
22 end
23 if (s is in all_ancestors_hashtable)
24     s has no hood in WordNet
25 else
26     current_id_hashtable <- s
27     while (current_id_hashtable is not empty)
28         for each current_synset_id in current_id_hashtable
29             parent_id_hashtable <- π_synset_id2(σ_synset_id1=current_synset_id(synset_relations))
30             for each parent_synset_id in parent_id_hashtable
31                 if (parent_synset_id is in all_ancestors_hashtable)
32                     root_found <- true
33                     root_set <- current_synset_id
34                     remove parent_synset_id from parent_id_hashtable
35                     break
36                 end
37             end
38             clear current_id_hashtable
39             copy parent_id_hashtable to current_id_hashtable
40             clear parent_id_hashtable
41         end
42     if (root_found is false)
43         root_set <- root of this entire hierarchy in WordNet
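For illustration, here is a compact in-memory sketch of the same search (an interpretive re-implementation under simplifying assumptions: plain dictionaries stand in for the relational tables, and all names are invented for this example):

from collections import deque

def find_hood_roots(s, words_of, parents_of, synsets_with):
    """Sketch of Pseudo-code 2: return the hood-root synset ids for synset s.

    words_of:     synset_id -> set of member words
    parents_of:   synset_id -> set of parent synset_ids
    synsets_with: word -> set of synset_ids containing that word
    """
    # Lines 1-10: every other synset sharing a member word with s.
    sharers = {t for w in words_of[s] for t in synsets_with[w]} - {s}

    # Lines 11-22: all ancestors of those synsets.
    blocked, queue = set(), deque(sharers)
    while queue:
        for p in parents_of.get(queue.popleft(), ()):
            if p not in blocked:
                blocked.add(p)
                queue.append(p)

    # Lines 23-24 (Case 4): some descendent of s repeats a member word.
    if s in blocked:
        return []

    # Lines 25-43: walk upward from s; a synset whose parent is blocked is a
    # hood root (Cases 1-2); one with no parents is a hierarchy root (Case 3).
    roots, queue, seen = [], deque([s]), {s}
    while queue:
        c = queue.popleft()
        parents = parents_of.get(c, set())
        if not parents:
            roots.append(c)
        for p in parents:
            if p in blocked:
                roots.append(c)
            elif p not in seen:
                seen.add(p)
                queue.append(p)
    return roots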

[Figure 4-2: The IS-A hierarchy for eight different senses of the noun "board" (the fragment of Figure 2-1 relevant to the example below).]

Let's take the synset {circuit_board, circuit_card, board, card} as an example (Figure 4-2; refer to Figure 2-1). All 9 synsets for "board" are stored in synset_id_hashtable, as well as the synsets for "circuit_board", "circuit_card" and "card". all_ancestors_hashtable then contains the ancestors of all those synsets, but none of the synsets on the path from {circuit_board, circuit_card, board, card} up to and including {circuit, closed_circuit} is in it, because each of these is only an ancestor of {circuit_board, circuit_card, board, card} and not an ancestor of any other synset containing "circuit_board", "circuit_card", "board" or "card". When we follow the parent-child relationship to find the ancestors of {circuit_board, circuit_card, board, card}, we finally stop at {electrical_device}, because it is the parent of the synset {control_panel, display_panel, panel, board}, which contains "board". Therefore, the root of the hood for {circuit_board, circuit_card, board, card} is the synset {circuit, closed_circuit}.

4.2.2 Word Sense Disambiguation

After the hoods for each synset in WordNet are constructed, they can be used to select the sense of an ambiguous word in a given text document. The senses of the nouns in a text document of a given collection are selected by the following two-stage process. A marking procedure that visits synsets and maintains a count of the number of times each synset is visited is fundamental to both stages. Given a word, the procedure finds all instances of the word in (the noun portion of) WordNet. For each identified synset, the procedure follows the IS-A links up to the root of the hierarchy, incrementing a counter at each synset it visits. In the first stage, the marking procedure is called once for each occurrence of a content word (i.e., a word that is not a stopword) in all of the documents in the collection. The number of times the procedure was called and found the word in

WordNet is also maintained. This produces a set of global counts (relative to this particular collection) at each synset. In the second stage, the marking procedure is called once for each occurrence of a content word in an individual text (document or query). Again, the number of times the procedure was called and found the word in WordNet for the individual text is maintained. This produces a set of local counts at the synsets. Given the local and global counts, a sense for a particular ambiguous word contained within the text that generated the local counts is selected as follows:

    difference = (# local visits / # local calls) - (# global visits / # global calls)

The difference is computed at the root of the hood for each sense of the word. If a sense does not have a hood, or if the local count at its hood root is less than two, its difference is set to zero. If a sense has multiple hoods, its difference is set to the largest difference over the set of hoods. The sense corresponding to the hood root with the largest positive difference is selected as the sense of the word in the text. If no sense has a positive difference, no WordNet sense is chosen for the word. Pseudo-code 3 shows the steps to disambiguate the sense of every word in a document.

Pseudo-code 3: disambiguation()
global_counts()
For each document in the document collection
    local_counts(document)
    Load words in this document into word_in_doc_hashtable

    Remove stopwords from word_in_doc_hashtable
    Remove words that are not in the WordNet noun division
    For each word in word_in_doc_hashtable
        difference(word)
    end
end

Pseudo-code for global_counts()
For each word in the document collection
    if (word is not a stopword and word is in the WordNet noun division)
        marking(word)
        #_of_global_calls is incremented by 1
end

Pseudo-code for local_counts(document)
For each word in this document
    if (word is not a stopword and word is in the WordNet noun division)
        marking(word)
        #_of_local_calls is incremented by 1
end

Pseudo-code for marking(word)
Find all the synset(s) that contain the word and save them in synset_id_hashtable
For each synset in synset_id_hashtable
    Find all its ancestors and save them in ancestors_hashtable

    For each synset in ancestors_hashtable
        Increment its counter by 1
    end
end

Pseudo-code for difference(word)
Find all the synset(s) that contain this word and save them in synset_id_hashtable
For each synset in synset_id_hashtable
    Find the root(s) of the hood(s) of this synset
    if this synset has no hood at all
        max_diff = 0
    else
        For each root
            Calculate the diff with the formula described above
            Compare diff with max_diff and keep the max_diff
        end
end
The true sense of this word as used in this document is the synset whose hood root has the max_diff

The idea behind this disambiguation procedure is to select senses from the areas of the WordNet hierarchies in which document-induced (local) activity is greater than the expected (global) activity. The hood construct is designed to provide a point of comparison that is broad enough to encompass markings from several different words yet narrow enough to distinguish among senses.
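To make the selection rule concrete, here is a small sketch of the final step (the function and parameter names are invented for this example, and the visit counts are assumed to come from the marking procedure above):

def select_sense(senses, hood_roots, local, global_, local_calls, global_calls):
    """Pick the sense whose hood root has the largest positive difference.

    senses:         candidate synset ids for the ambiguous word
    hood_roots:     synset_id -> list of hood-root synset ids (empty: no hood)
    local, global_: synset_id -> visit counts at that synset
    """
    best_sense, best_diff = None, 0.0
    for sense in senses:
        diff = 0.0
        for root in hood_roots.get(sense, []):
            if local.get(root, 0) < 2:   # local count below two: treat as zero
                continue
            d = local[root] / local_calls - global_.get(root, 0) / global_calls
            diff = max(diff, d)          # multiple hoods: keep the largest
        if diff > best_diff:
            best_sense, best_diff = sense, diff
    return best_sense                    # None if no sense has a positive difference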

CHAPTER 5

Experiments

In this chapter I describe the experiment that verifies the effectiveness of the hood algorithm for word sense disambiguation. The experiment is performed on the part-of-speech tagged Brown Corpus. The flow of the experiment is described in detail, and I report the results and analyze their quality.

5.1 Part-of-Speech Tagged Brown Corpus

The Brown Corpus consists of 1,014,312 words of running text of edited English prose printed in the United States during the calendar year 1961. So far as it has been possible to determine, the writers were native speakers of American English. The Corpus is divided into 500 samples of about 2,000 words each. Each sample begins at the beginning of a sentence, but not necessarily of a paragraph or other larger division, and each ends at the first sentence ending after 2,000 words. The samples represent a wide range of styles and varieties of prose; they were chosen for their representative quality rather than for any subjectively determined excellence. A corpus is intended to be "a collection of naturally occurring language text, chosen to characterize a state or variety of a language" (Sinclair, 1991). As such, very few of the so-called corpora used in current natural language processing and speech recognition work deserve the name. For English, the only true corpus that is widely available is the Brown Corpus. It has been extensively used for natural language processing work.

A sentence in natural language text is usually composed of nouns, pronouns, articles, verbs, adjectives, adverbs and connectives. While the words in each grammatical class are used with a particular purpose, it can be argued that most of the semantics is carried by nouns. Thus, nouns can be extracted through the systematic elimination of verbs, adjectives, adverbs, connectives, articles and pronouns. Therefore, in this experiment we make use of the part-of-speech tagged Brown Corpus provided by the Treebank Project of the Computer and Information Science Department, University of Pennsylvania. This document set consists of 479 tagged documents. Each word in every document is tagged with its linguistic category.

5.2 Flow of Experiment

Figure 5-1 shows the steps of my experiment. First of all, I convert WordNet from the flat files (noun.dat and noun.idx) to a relational database: tables are created and all the data contained in noun.dat and noun.idx are loaded into them (see Pseudo-code 1). Then, for each synset in WordNet, the root(s) of the hood(s) are found and saved in hood_root.txt. On the other hand, for each part-of-speech tagged document in the Brown Corpus, such as a01, first all the tags and non-nouns in a01 are removed and the result is saved in a01_noun. Second, a01_noun is processed by the stemming algorithm; after this step, all the words remaining in a01_noun_stem are stems of the words in a01. Finally, a01_noun_stem is processed by Dr. Voorhees' disambiguation algorithm. The

final result is saved in disambiguation_result_a01, where each word is mapped to a unique synset that represents the sense in which the word is used in this context.

[Figure 5-1: Steps of Experiment. One branch converts the WordNet files (noun.dat, noun.idx) into the relational database (see Pseudo-code 1) and finds the root(s) of the hood(s) for each synset, saved in hood_root.txt (see Pseudo-code 2). The other branch takes a tagged document (e.g. a01.txt), removes tags and non-nouns (a01_noun.txt), applies the stemming algorithm (a01_noun_stem.txt), and then disambiguates each word (see Pseudo-code 3), producing a result file in which each word is mapped to a unique synset.]
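The tag-and-non-noun removal step of Figure 5-1 amounts to keeping only the word/TAG tokens whose tag marks a noun; a minimal sketch (the tag set shown is the Penn Treebank noun subset, an assumption about this distribution):

NOUN_TAGS = {'NN', 'NNS', 'NNP', 'NNPS'}  # Penn Treebank noun tags (assumed)

def nouns_from_tagged_text(text):
    """Keep the words of 'word/TAG' tokens whose tag marks a noun."""
    nouns = []
    for token in text.split():
        word, sep, tag = token.rpartition('/')
        if sep and tag.upper() in NOUN_TAGS:
            nouns.append(word.lower())
    return nouns

print(nouns_from_tagged_text('The/DT board/NN approved/VBD the/DT plans/NNS ./.'))
# ['board', 'plans']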

5.3 Quality of Results

The results shown in Table 5-1 are for 50 documents randomly chosen from the Brown Corpus and processed as shown in Figure 5-1. Since WordNet provides semantically tagged Brown Corpus files, I compare my results with the manually identified results:

    Hit Rate = (# of words assigned the same synset as the manually identified one) / (# of words in the stemmed file)

Table 5-1: Hit Rate of Experiment for Voorhees' Algorithm

Hit Rate                     | <15% | 15%-20% | 20%-25% | 25%-30% | 30%-35% | >40%
# of docs with this hit rate | …    | …       | …       | …       | …       | …

From this table we can see that the hit rate is not as high as expected: none is higher than 40%, and most are between 15% and 35%. This means that Dr. Voorhees' disambiguation algorithm is not an effective way to automatically disambiguate word senses.

5.4 Result Analysis

So far we can say that the algorithm does not work well for disambiguating word senses. The reasons are as follows:

1. Although most of the semantics is carried by nouns, verbs, adjectives and adverbs are important factors that can help determine the appropriate sense of an ambiguous word.


Information Retrieval. Information Retrieval and Web Search Information Retrieval and Web Search Introduction to IR models and methods Information Retrieval The indexing and retrieval of textual documents. Searching for pages on the World Wide Web is the most recent

More information

Supervised Models for Coreference Resolution [Rahman & Ng, EMNLP09] Running Example. Mention Pair Model. Mention Pair Example

Supervised Models for Coreference Resolution [Rahman & Ng, EMNLP09] Running Example. Mention Pair Model. Mention Pair Example Supervised Models for Coreference Resolution [Rahman & Ng, EMNLP09] Many machine learning models for coreference resolution have been created, using not only different feature sets but also fundamentally

More information

Lecture 14: Annotation

Lecture 14: Annotation Lecture 14: Annotation Nathan Schneider (with material from Henry Thompson, Alex Lascarides) ENLP 23 October 2016 1/14 Annotation Why gold 6= perfect Quality Control 2/14 Factors in Annotation Suppose

More information

Query Difficulty Prediction for Contextual Image Retrieval

Query Difficulty Prediction for Contextual Image Retrieval Query Difficulty Prediction for Contextual Image Retrieval Xing Xing 1, Yi Zhang 1, and Mei Han 2 1 School of Engineering, UC Santa Cruz, Santa Cruz, CA 95064 2 Google Inc., Mountain View, CA 94043 Abstract.

More information

AN EFFECTIVE INFORMATION RETRIEVAL FOR AMBIGUOUS QUERY

AN EFFECTIVE INFORMATION RETRIEVAL FOR AMBIGUOUS QUERY Asian Journal Of Computer Science And Information Technology 2: 3 (2012) 26 30. Contents lists available at www.innovativejournal.in Asian Journal of Computer Science and Information Technology Journal

More information

CHAPTER-26 Mining Text Databases

CHAPTER-26 Mining Text Databases CHAPTER-26 Mining Text Databases 26.1 Introduction 26.2 Text Data Analysis and Information Retrieval 26.3 Basle Measures for Text Retrieval 26.4 Keyword-Based and Similarity-Based Retrieval 26.5 Other

More information

Question Answering Approach Using a WordNet-based Answer Type Taxonomy

Question Answering Approach Using a WordNet-based Answer Type Taxonomy Question Answering Approach Using a WordNet-based Answer Type Taxonomy Seung-Hoon Na, In-Su Kang, Sang-Yool Lee, Jong-Hyeok Lee Department of Computer Science and Engineering, Electrical and Computer Engineering

More information

Domain-specific Concept-based Information Retrieval System

Domain-specific Concept-based Information Retrieval System Domain-specific Concept-based Information Retrieval System L. Shen 1, Y. K. Lim 1, H. T. Loh 2 1 Design Technology Institute Ltd, National University of Singapore, Singapore 2 Department of Mechanical

More information

A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition

A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition Ana Zelaia, Olatz Arregi and Basilio Sierra Computer Science Faculty University of the Basque Country ana.zelaia@ehu.es

More information

A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition

A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition Ana Zelaia, Olatz Arregi and Basilio Sierra Computer Science Faculty University of the Basque Country ana.zelaia@ehu.es

More information

Sense-based Information Retrieval System by using Jaccard Coefficient Based WSD Algorithm

Sense-based Information Retrieval System by using Jaccard Coefficient Based WSD Algorithm ISBN 978-93-84468-0-0 Proceedings of 015 International Conference on Future Computational Technologies (ICFCT'015 Singapore, March 9-30, 015, pp. 197-03 Sense-based Information Retrieval System by using

More information

Volume 2, Issue 6, June 2014 International Journal of Advance Research in Computer Science and Management Studies

Volume 2, Issue 6, June 2014 International Journal of Advance Research in Computer Science and Management Studies Volume 2, Issue 6, June 2014 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online at: www.ijarcsms.com Internet

More information

Department of Electronic Engineering FINAL YEAR PROJECT REPORT

Department of Electronic Engineering FINAL YEAR PROJECT REPORT Department of Electronic Engineering FINAL YEAR PROJECT REPORT BEngCE-2007/08-HCS-HCS-03-BECE Natural Language Understanding for Query in Web Search 1 Student Name: Sit Wing Sum Student ID: Supervisor:

More information

Punjabi WordNet Relations and Categorization of Synsets

Punjabi WordNet Relations and Categorization of Synsets Punjabi WordNet Relations and Categorization of Synsets Rupinderdeep Kaur Computer Science Engineering Department, Thapar University, rupinderdeep@thapar.edu Suman Preet Department of Linguistics and Punjabi

More information

MEASUREMENT OF SEMANTIC SIMILARITY BETWEEN WORDS: A SURVEY

MEASUREMENT OF SEMANTIC SIMILARITY BETWEEN WORDS: A SURVEY MEASUREMENT OF SEMANTIC SIMILARITY BETWEEN WORDS: A SURVEY Ankush Maind 1, Prof. Anil Deorankar 2 and Dr. Prashant Chatur 3 1 M.Tech. Scholar, Department of Computer Science and Engineering, Government

More information

Boolean Queries. Keywords combined with Boolean operators:

Boolean Queries. Keywords combined with Boolean operators: Query Languages 1 Boolean Queries Keywords combined with Boolean operators: OR: (e 1 OR e 2 ) AND: (e 1 AND e 2 ) BUT: (e 1 BUT e 2 ) Satisfy e 1 but not e 2 Negation only allowed using BUT to allow efficient

More information

Conceptual document indexing using a large scale semantic dictionary providing a concept hierarchy

Conceptual document indexing using a large scale semantic dictionary providing a concept hierarchy Conceptual document indexing using a large scale semantic dictionary providing a concept hierarchy Martin Rajman, Pierre Andrews, María del Mar Pérez Almenta, and Florian Seydoux Artificial Intelligence

More information

ResPubliQA 2010

ResPubliQA 2010 SZTAKI @ ResPubliQA 2010 David Mark Nemeskey Computer and Automation Research Institute, Hungarian Academy of Sciences, Budapest, Hungary (SZTAKI) Abstract. This paper summarizes the results of our first

More information

Optimal Query. Assume that the relevant set of documents C r. 1 N C r d j. d j. Where N is the total number of documents.

Optimal Query. Assume that the relevant set of documents C r. 1 N C r d j. d j. Where N is the total number of documents. Optimal Query Assume that the relevant set of documents C r are known. Then the best query is: q opt 1 C r d j C r d j 1 N C r d j C r d j Where N is the total number of documents. Note that even this

More information

Privacy and Security in Online Social Networks Department of Computer Science and Engineering Indian Institute of Technology, Madras

Privacy and Security in Online Social Networks Department of Computer Science and Engineering Indian Institute of Technology, Madras Privacy and Security in Online Social Networks Department of Computer Science and Engineering Indian Institute of Technology, Madras Lecture - 25 Tutorial 5: Analyzing text using Python NLTK Hi everyone,

More information

Hidden Markov Models. Natural Language Processing: Jordan Boyd-Graber. University of Colorado Boulder LECTURE 20. Adapted from material by Ray Mooney

Hidden Markov Models. Natural Language Processing: Jordan Boyd-Graber. University of Colorado Boulder LECTURE 20. Adapted from material by Ray Mooney Hidden Markov Models Natural Language Processing: Jordan Boyd-Graber University of Colorado Boulder LECTURE 20 Adapted from material by Ray Mooney Natural Language Processing: Jordan Boyd-Graber Boulder

More information

Information Retrieval and Web Search

Information Retrieval and Web Search Information Retrieval and Web Search Introduction to IR models and methods Rada Mihalcea (Some of the slides in this slide set come from IR courses taught at UT Austin and Stanford) Information Retrieval

More information

How to.. What is the point of it?

How to.. What is the point of it? Program's name: Linguistic Toolbox 3.0 α-version Short name: LIT Authors: ViatcheslavYatsko, Mikhail Starikov Platform: Windows System requirements: 1 GB free disk space, 512 RAM,.Net Farmework Supported

More information

Correlation to Georgia Quality Core Curriculum

Correlation to Georgia Quality Core Curriculum 1. Strand: Oral Communication Topic: Listening/Speaking Standard: Adapts or changes oral language to fit the situation by following the rules of conversation with peers and adults. 2. Standard: Listens

More information

CHAPTER 3 INFORMATION RETRIEVAL BASED ON QUERY EXPANSION AND LATENT SEMANTIC INDEXING

CHAPTER 3 INFORMATION RETRIEVAL BASED ON QUERY EXPANSION AND LATENT SEMANTIC INDEXING 43 CHAPTER 3 INFORMATION RETRIEVAL BASED ON QUERY EXPANSION AND LATENT SEMANTIC INDEXING 3.1 INTRODUCTION This chapter emphasizes the Information Retrieval based on Query Expansion (QE) and Latent Semantic

More information

Noida institute of engineering and technology,greater noida

Noida institute of engineering and technology,greater noida Impact Of Word Sense Ambiguity For English Language In Web IR Prachi Gupta 1, Dr.AnuragAwasthi 2, RiteshRastogi 3 1,2,3 Department of computer Science and engineering, Noida institute of engineering and

More information

Semantic text features from small world graphs

Semantic text features from small world graphs Semantic text features from small world graphs Jurij Leskovec 1 and John Shawe-Taylor 2 1 Carnegie Mellon University, USA. Jozef Stefan Institute, Slovenia. jure@cs.cmu.edu 2 University of Southampton,UK

More information

Evaluating a Conceptual Indexing Method by Utilizing WordNet

Evaluating a Conceptual Indexing Method by Utilizing WordNet Evaluating a Conceptual Indexing Method by Utilizing WordNet Mustapha Baziz, Mohand Boughanem, Nathalie Aussenac-Gilles IRIT/SIG Campus Univ. Toulouse III 118 Route de Narbonne F-31062 Toulouse Cedex 4

More information

Assignment 4 CSE 517: Natural Language Processing

Assignment 4 CSE 517: Natural Language Processing Assignment 4 CSE 517: Natural Language Processing University of Washington Winter 2016 Due: March 2, 2016, 1:30 pm 1 HMMs and PCFGs Here s the definition of a PCFG given in class on 2/17: A finite set

More information

String Vector based KNN for Text Categorization

String Vector based KNN for Text Categorization 458 String Vector based KNN for Text Categorization Taeho Jo Department of Computer and Information Communication Engineering Hongik University Sejong, South Korea tjo018@hongik.ac.kr Abstract This research

More information

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and

More information

SAACO: Semantic Analysis based Ant Colony Optimization Algorithm for Efficient Text Document Clustering

SAACO: Semantic Analysis based Ant Colony Optimization Algorithm for Efficient Text Document Clustering SAACO: Semantic Analysis based Ant Colony Optimization Algorithm for Efficient Text Document Clustering 1 G. Loshma, 2 Nagaratna P Hedge 1 Jawaharlal Nehru Technological University, Hyderabad 2 Vasavi

More information

Review on Text Mining

Review on Text Mining Review on Text Mining Aarushi Rai #1, Aarush Gupta *2, Jabanjalin Hilda J. #3 #1 School of Computer Science and Engineering, VIT University, Tamil Nadu - India #2 School of Computer Science and Engineering,

More information

A. The following is a tentative list of parts of speech we will use to match an existing parser:

A. The following is a tentative list of parts of speech we will use to match an existing parser: API Functions available under technology owned by ACI A. The following is a tentative list of parts of speech we will use to match an existing parser: adjective adverb interjection noun verb auxiliary

More information

Joint Entity Resolution

Joint Entity Resolution Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute

More information

Chapter S:II. II. Search Space Representation

Chapter S:II. II. Search Space Representation Chapter S:II II. Search Space Representation Systematic Search Encoding of Problems State-Space Representation Problem-Reduction Representation Choosing a Representation S:II-1 Search Space Representation

More information

Automatic Detection of Section Membership for SAS Conference Paper Abstract Submissions: A Case Study

Automatic Detection of Section Membership for SAS Conference Paper Abstract Submissions: A Case Study 1746-2014 Automatic Detection of Section Membership for SAS Conference Paper Abstract Submissions: A Case Study Dr. Goutam Chakraborty, Professor, Department of Marketing, Spears School of Business, Oklahoma

More information

A hybrid method to categorize HTML documents

A hybrid method to categorize HTML documents Data Mining VI 331 A hybrid method to categorize HTML documents M. Khordad, M. Shamsfard & F. Kazemeyni Electrical & Computer Engineering Department, Shahid Beheshti University, Iran Abstract In this paper

More information

Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information

Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information Satoshi Sekine Computer Science Department New York University sekine@cs.nyu.edu Kapil Dalwani Computer Science Department

More information

Castanet: Using WordNet to Build Facet Hierarchies. Emilia Stoica and Marti Hearst School of Information, Berkeley

Castanet: Using WordNet to Build Facet Hierarchies. Emilia Stoica and Marti Hearst School of Information, Berkeley Castanet: Using WordNet to Build Facet Hierarchies Emilia Stoica and Marti Hearst School of Information, Berkeley Motivation Want to assign labels from multiple hierarchies Motivation Hot and Sweet Chicken:

More information

Ranking in a Domain Specific Search Engine

Ranking in a Domain Specific Search Engine Ranking in a Domain Specific Search Engine CS6998-03 - NLP for the Web Spring 2008, Final Report Sara Stolbach, ss3067 [at] columbia.edu Abstract A search engine that runs over all domains must give equal

More information

Identifying and Ranking Possible Semantic and Common Usage Categories of Search Engine Queries

Identifying and Ranking Possible Semantic and Common Usage Categories of Search Engine Queries Identifying and Ranking Possible Semantic and Common Usage Categories of Search Engine Queries Reza Taghizadeh Hemayati 1, Weiyi Meng 1, Clement Yu 2 1 Department of Computer Science, Binghamton university,

More information

Annotation by category - ELAN and ISO DCR

Annotation by category - ELAN and ISO DCR Annotation by category - ELAN and ISO DCR Han Sloetjes, Peter Wittenburg Max Planck Institute for Psycholinguistics P.O. Box 310, 6500 AH Nijmegen, The Netherlands E-mail: Han.Sloetjes@mpi.nl, Peter.Wittenburg@mpi.nl

More information

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Abstract We intend to show that leveraging semantic features can improve precision and recall of query results in information

More information

structure of the presentation Frame Semantics knowledge-representation in larger-scale structures the concept of frame

structure of the presentation Frame Semantics knowledge-representation in larger-scale structures the concept of frame structure of the presentation Frame Semantics semantic characterisation of situations or states of affairs 1. introduction (partially taken from a presentation of Markus Egg): i. what is a frame supposed

More information

CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES

CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES 70 CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES 3.1 INTRODUCTION In medical science, effective tools are essential to categorize and systematically

More information

Parallel Concordancing and Translation. Michael Barlow

Parallel Concordancing and Translation. Michael Barlow [Translating and the Computer 26, November 2004 [London: Aslib, 2004] Parallel Concordancing and Translation Michael Barlow Dept. of Applied Language Studies and Linguistics University of Auckland Auckland,

More information

CHAPTER 2: DATA MODELS

CHAPTER 2: DATA MODELS CHAPTER 2: DATA MODELS 1. A data model is usually graphical. PTS: 1 DIF: Difficulty: Easy REF: p.36 2. An implementation-ready data model needn't necessarily contain enforceable rules to guarantee the

More information

Instructor: Stefan Savev

Instructor: Stefan Savev LECTURE 2 What is indexing? Indexing is the process of extracting features (such as word counts) from the documents (in other words: preprocessing the documents). The process ends with putting the information

More information

Semantic Web. Ontology Alignment. Morteza Amini. Sharif University of Technology Fall 95-96

Semantic Web. Ontology Alignment. Morteza Amini. Sharif University of Technology Fall 95-96 ه عا ی Semantic Web Ontology Alignment Morteza Amini Sharif University of Technology Fall 95-96 Outline The Problem of Ontologies Ontology Heterogeneity Ontology Alignment Overall Process Similarity (Matching)

More information

SYSTEMS FOR NON STRUCTURED INFORMATION MANAGEMENT

SYSTEMS FOR NON STRUCTURED INFORMATION MANAGEMENT SYSTEMS FOR NON STRUCTURED INFORMATION MANAGEMENT Prof. Dipartimento di Elettronica e Informazione Politecnico di Milano INFORMATION SEARCH AND RETRIEVAL Inf. retrieval 1 PRESENTATION SCHEMA GOALS AND

More information

Ontology Based Search Engine

Ontology Based Search Engine Ontology Based Search Engine K.Suriya Prakash / P.Saravana kumar Lecturer / HOD / Assistant Professor Hindustan Institute of Engineering Technology Polytechnic College, Padappai, Chennai, TamilNadu, India

More information

Text Data Pre-processing and Dimensionality Reduction Techniques for Document Clustering

Text Data Pre-processing and Dimensionality Reduction Techniques for Document Clustering Text Data Pre-processing and Dimensionality Reduction Techniques for Document Clustering A. Anil Kumar Dept of CSE Sri Sivani College of Engineering Srikakulam, India S.Chandrasekhar Dept of CSE Sri Sivani

More information

2. An implementation-ready data model needn't necessarily contain enforceable rules to guarantee the integrity of the data.

2. An implementation-ready data model needn't necessarily contain enforceable rules to guarantee the integrity of the data. Test bank for Database Systems Design Implementation and Management 11th Edition by Carlos Coronel,Steven Morris Link full download test bank: http://testbankcollection.com/download/test-bank-for-database-systemsdesign-implementation-and-management-11th-edition-by-coronelmorris/

More information

Encoding Words into String Vectors for Word Categorization

Encoding Words into String Vectors for Word Categorization Int'l Conf. Artificial Intelligence ICAI'16 271 Encoding Words into String Vectors for Word Categorization Taeho Jo Department of Computer and Information Communication Engineering, Hongik University,

More information

Natural Language Processing

Natural Language Processing Natural Language Processing Machine Learning Potsdam, 26 April 2012 Saeedeh Momtazi Information Systems Group Introduction 2 Machine Learning Field of study that gives computers the ability to learn without

More information

Introduction to Lexical Analysis

Introduction to Lexical Analysis Introduction to Lexical Analysis Outline Informal sketch of lexical analysis Identifies tokens in input string Issues in lexical analysis Lookahead Ambiguities Specifying lexers Regular expressions Examples

More information

EDMS. Architecture and Concepts

EDMS. Architecture and Concepts EDMS Engineering Data Management System Architecture and Concepts Hannu Peltonen Helsinki University of Technology Department of Computer Science Laboratory of Information Processing Science Abstract

More information

INF FALL NATURAL LANGUAGE PROCESSING. Jan Tore Lønning, Lecture 4, 10.9

INF FALL NATURAL LANGUAGE PROCESSING. Jan Tore Lønning, Lecture 4, 10.9 1 INF5830 2015 FALL NATURAL LANGUAGE PROCESSING Jan Tore Lønning, Lecture 4, 10.9 2 Working with texts From bits to meaningful units Today: 3 Reading in texts Character encodings and Unicode Word tokenization

More information

Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi.

Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi. Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi Lecture 18 Tries Today we are going to be talking about another data

More information

Predicting Messaging Response Time in a Long Distance Relationship

Predicting Messaging Response Time in a Long Distance Relationship Predicting Messaging Response Time in a Long Distance Relationship Meng-Chen Shieh m3shieh@ucsd.edu I. Introduction The key to any successful relationship is communication, especially during times when

More information

Removing Belady s Anomaly from Caches with Prefetch Data

Removing Belady s Anomaly from Caches with Prefetch Data Removing Belady s Anomaly from Caches with Prefetch Data Elizabeth Varki University of New Hampshire varki@cs.unh.edu Abstract Belady s anomaly occurs when a small cache gets more hits than a larger cache,

More information

Putting ontologies to work in NLP

Putting ontologies to work in NLP Putting ontologies to work in NLP The lemon model and its future John P. McCrae National University of Ireland, Galway Introduction In natural language processing we are doing three main things Understanding

More information

Improving Suffix Tree Clustering Algorithm for Web Documents

Improving Suffix Tree Clustering Algorithm for Web Documents International Conference on Logistics Engineering, Management and Computer Science (LEMCS 2015) Improving Suffix Tree Clustering Algorithm for Web Documents Yan Zhuang Computer Center East China Normal

More information