Using WordNet to Disambiguate Word Senses


Using WordNet to Disambiguate Word Senses

by Ying Liu
Electrical and Computer Engineering

Acknowledgements

I would like to first thank Prof. Peter Scheuermann, without whose constant guidance, support and encouragement this work would not have been possible. I would also like to thank Bin Chen, who gladly discussed various issues related to my work with me. This work is the result of many insightful discussions with Prof. Scheuermann, who inspired me throughout and guided me whenever required. I would also like to thank the members of the Database System group for their friendship and help: Shayan Zaidi, Mehmet Sayal, Olga Shumsky, and Chris Fernandes. Further, I would like to thank Dr. Ellen M. Voorhees for her suggestions. Finally, I would like to thank my parents, Zongli Liu and Huifang Xu, who have guided me all through my life, for all the love, encouragement and virtues I received while growing up.

Contents

Introduction
    Motivation
    Contribution
    Organization
Background Knowledge
    WordNet
    Part-of-Speech Taggers
    Stemming
    Stopwords
Work related to Word Sense Disambiguation
    Survey of Approaches to Word Sense Disambiguation
        Knowledge Based
        Corpus Based
            Disambiguated Corpora
            Raw Corpora
        Hybrid Approaches
Using Hood Algorithm to Disambiguate Word Senses
    Converting WordNet into Relational Database
    Algorithm
        Hood Construction
        Word Sense Disambiguation
Experiments
    Part-of-Speech Tagged Brown Corpus
    Flow of Experiment
    Quality of Results
    Result Analysis
Conclusion and Future Work
    Conclusion
    Future Work and Application
References
Appendix A: Definition of Tables

CHAPTER 1

Introduction

Text retrieval, also known as document or information retrieval, is concerned with locating natural language documents whose contents satisfy a user's information need. Unfortunately, there are billions of documents on the Internet today, many of which don't have abstracts or even titles. Therefore, there is considerable interest in developing techniques that automatically index full-text documents and provide access to heterogeneous collections of full-text documents.

1.1 Motivation

Search engines are great tools that help users find desired documents. Whenever a user submits a set of query keywords, documents that contain some or all of the keywords are returned. However, these search engines are not good enough to answer all queries. For instance, most web users have experienced the problem that, when a large number of web pages is returned, one has to go through many unrelated pages to identify the useful ones. Sometimes not only is the number of documents returned large, but the categories identified by the search engine are also irrelevant. Let's look at an example. If a computer hardware engineer wants to search for documents related to "board", Yahoo returns the following categories:

Figure 1-1: Yahoo! Category Matches (1-8 of 2394)

Only the first 8 of the 2394 matches are listed. The results are organized in a hierarchical structure, i.e., in the first row, Recreation is the top category, Games is the second-level category, and Board Games is the category or web page that contains the keyword "board". Only 2 categories are related to circuit_board. Obviously, it is a big burden for the computer hardware engineer to sift out the documents he is really interested in from such a large number of categories; remember that within each category there may be numerous web sites. Although the user is only interested in the circuit_board meaning of "board", the search engine returns all the documents that contain "board". To explain why search engines sometimes cannot generate satisfactory categories, we review how their hierarchical classification structures are generated. Most hierarchical categories or classes employed by search engines were either manually

set up or automatically constructed by data clustering algorithms. Since the class hierarchies generated by clustering algorithms lack semantic information, they are likely to perform poorly when the number of query terms is small or a query term has more than one meaning. Although manually constructed class hierarchies normally have higher accuracy, they also suffer from a number of problems. First, manually constructed classes are not concept oriented; that is, more than one class can have a given keyword as its name or label. For example, there are multiple classes named "board" in Figure 1-1. Consequently, users have to explore a huge number of categories in order to identify the desired pages. Secondly, since the hierarchies are often maintained by a group of people, over time the update procedure is prone to conflicting classification criteria. To overcome the disadvantages of manually constructed classes, an algorithm is proposed that constructs a hierarchical classification model based on keywords and their relationships drawn from thesauri. Specifically, each class corresponds to one concept, since in human memory different keywords are used to represent different objects, ideas, or activities. The topics of the documents in each class are similar. The hierarchical structure is maintained via IS-A or PART-OF relationships between classes; e.g., class "homer" is PART-OF class "baseball", hence class "baseball" is a super class of "homer". The advantages of such a novel class hierarchy can be summarized as follows:

1. Each class name corresponds to one word (actually a concept, or keyword sense), which is suitable for keyword-based queries.

2. The relationships between classes are semantically defined by the thesauri; therefore, the hierarchy is much more stable than traditional hierarchical classes.

With this thesaurus-based hierarchy, documents are then mapped to the class hierarchy. During the mapping, a threshold min_sim [28] is employed to determine whether a document and a class are similar to each other or not. After documents are mapped to the class hierarchy, class representative vectors [28] are adjusted to reflect the topics of the documents. Next, documents are re-mapped using the adjusted class representative vectors. This re-mapping iteration may repeat a number of times. Then classes that contain too few documents are removed by a hierarchy refinement procedure [28]. The resulting class hierarchy and the document mapping form the final hierarchical classification. Fortunately, WordNet, an electronic dictionary developed at Princeton University, is a concept-based dictionary whose lexical relations are IS-A and PART-OF. It is used as the frame for this proposed hierarchical classification. Assuming that the class hierarchy is already constructed, what we need to do is classify documents into their appropriate classes. Polysemy, defined as a single word form having more than one meaning, causes false classification. For example, if we failed to tell which meaning of "board" is used in a given situation, we would probably map that document to a wrong class. Synonymy, defined as multiple words having the same meaning, causes true conceptual mappings to be missed. Therefore, the critical step of classification is to recognize synonyms and detect the uses of different meanings of each word in each document. For example, if we failed to recognize that "notebook" and "laptop" mean the

same thing, all those documents that use "notebook" in place of "laptop" would be left out of the class "laptop". The issue is how to automatically detect polysems and synonyms. In principle, polysems and synonyms can be handled by assigning the different senses of a word to different concept identifiers and assigning the same concept identifier to synonyms. In practice, this requires procedures that are not only able to recognize synonyms but can also detect uses of different senses of a word.

1.2 Contribution

In this report, we implemented the disambiguation algorithm introduced by Ellen M. Voorhees in her paper "Using WordNet to Disambiguate Word Senses for Text Retrieval" [5]. This algorithm is intended to automatically detect and resolve the senses of the polysemous nouns occurring in the texts of documents and queries. Each word processed by this technique in any document is mapped to a unique concept, the meaning intended in that context. However, she did not apply this idea to any text document. We applied the algorithm to a set of documents, the Brown Corpus, one of the most widely used document collections in a variety of fields. Finally, we tested the effectiveness of this automatic disambiguation algorithm by comparing its output with the manual disambiguation results offered by Princeton University. Our experiments verified Dr. Voorhees' conclusion in her paper [5] that this algorithm is not sufficient to reliably select the correct sense of a noun from the set of senses in WordNet.

1.3 Organization

The remainder of the thesis is organized as follows. Chapter 2 gives some background on text retrieval and WordNet. Chapter 3 explores the work done in the area of sense disambiguation. Chapter 4 explains the algorithm introduced by Dr. Voorhees in detail: the first section explains the hood construction part of the algorithm, and the second section explains the word sense disambiguation part. Chapter 5 presents our experimental results, together with a qualitative analysis of the algorithm. Chapter 6 draws conclusions from this work. Finally, we comment on the future work that can be explored in this area and its potential applications.

CHAPTER 2

Background Knowledge

In our work, we apply our algorithm to the Brown Corpus. We downloaded the part-of-speech tagged Brown Corpus from the University of Pennsylvania. First, we remove all the tags and all non-noun words, since most of the semantics is carried by nouns [2]. Secondly, we convert each word to its stem with Porter's algorithm. Thirdly, we remove all words that are not in WordNet, as well as stopwords. Thus, after these three processing steps every document in the corpus is represented only by its valid noun words. Finally, our algorithm is run and the results are analyzed. To help the reader understand our work, this chapter gives some background knowledge involved in it.

2.1 WordNet

WordNet is a manually constructed lexical system developed by George Miller and his colleagues at the Cognitive Science Laboratory at Princeton University [12]. Originating from a project whose goal was to produce a dictionary that could be searched conceptually instead of only alphabetically, WordNet evolved into a system that reflects current psycholinguistic theories about how humans organize their lexical memories.

In WordNet, the basic building block is a synset, consisting of all the words that express a given concept. Synsets, into which senses are manually classified, denote synonym sets. Within each synset the senses, although from different words, denote the same meaning. For example, "board" has several senses, and so does "plank". Each of the two words has a sense meaning "a stout length of sawn timber; made in a wide variety of sizes and used for many purposes". The synset corresponding to this sense is composed of "board" and "plank": these senses of "plank" and "board" are synonymous and form one synset. Because all synonymous senses are grouped into one synset and all different senses of the same word are separated into different synsets, there are no synonymous or polysemous synsets. Hence, every synset represents a lexicalized concept. There are four main divisions in WordNet, one each for nouns, verbs, adjectives and adverbs. Within a division, synsets are organized by the lexical relations defined on them. For nouns, the only division of WordNet used in my work, the lexical relations include IS-A and PART-OF relations. For example, Figure 2-1 shows the hierarchy relating the eight different senses of the noun "board". The synsets with the heavy border are the actual senses of "board", and the remaining synsets are either ancestors or descendents of one of the senses. The synsets {group, grouping} and {entity, thing} in Figure 2-1 are examples of heads of the hierarchies. Other heads include {act, human_action, human_activity}, {abstraction}, {possession} and {psychological_feature}.
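WordNet's synsets can also be browsed programmatically. A minimal sketch using NLTK's WordNet reader follows (this is not the access path used in this work, which reads the flat files directly, and the exact synsets printed depend on the WordNet version NLTK ships):

from nltk.corpus import wordnet as wn  # requires a one-time nltk.download('wordnet')

# List every noun synset containing the word "board", with its
# synonyms, gloss and direct IS-A parents.
for syn in wn.synsets('board', pos=wn.NOUN):
    print(syn.name(), '->', syn.lemma_names())
    print('  gloss:  ', syn.definition())
    print('  parents:', [p.name() for p in syn.hypernyms()])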

[Figure 2-1: The IS-A hierarchy for eight different senses of the noun "board". The hierarchy is headed by {group, grouping} and {entity, thing} and includes, among others, the synsets {control_panel, display_panel, panel, board}, {circuit, closed_circuit}, {circuit_board, circuit_card, board, card}, {bulletin_board, notice_board}, {governing_board, directorate, board_of_directors}, {committee, commission} and {board, plank, deal}.]

WordNet 1.6 (2000), the version of WordNet used in this work, contains words and senses in the noun division. Because synsets contain only strict synonyms, the majority of synsets are quite small. Similarly, the average number of senses per word is close to one. This seems to suggest that polysemy and synonymy occur too infrequently to be a problem, but these averages are misleading. The more frequently a word is used, the more polysemous it tends to be [13]. The more common words also tend to appear in the larger synsets. Thus, it is precisely those nouns that actually get used in documents that are most likely to have many senses and synonyms.

2.2 Part-of-Speech Taggers

Many corpora are, in addition to structural and bibliographic information, annotated with linguistic knowledge. The most basic and common form this annotation takes is marking up the words in the corpus with their part-of-speech tags. This adds value to the corpus because, for example, searches can be performed not only on the word forms as strings but also on whether they belong to a certain linguistic category. Such tags are typically taken to be atomic labels attached to words, denoting the part of speech of the word together with shallow morphosyntactic information; e.g., they specify the word as a proper singular noun, or a plural comparative adjective. For English and other Western European languages, for which most such annotated corpora have been produced, the tagset size ranges from about forty to several hundred distinct categories [8]. For example, since "happy" is an adjective, it is tagged with "JJ", the representation

of adjectives, as shown below; so are "one-of-a-kind" and "run-of-the-mill". Every word in every document is tagged in this way.

happy/JJ one-of-a-kind/JJ run-of-the-mill/JJ

2.3 Stemming

Stemming is a technique for reducing words to their grammatical roots. A stem is the portion of a word which is left after the removal of its affixes (i.e., prefixes and suffixes). A typical example of a stem is the word "connect", which is the stem of the variants "connected", "connecting", "connection", and "connections". Stems are thought to be useful because they reduce variants of the same root word to a common concept. Furthermore, stemming reduces the size of the indexing structure because the number of distinct words is reduced. The best-known algorithm for stemming is Porter's algorithm [9], introduced by M. F. Porter. The program is given an explicit list of suffixes and, with each suffix, the criterion under which it may be removed from a word to leave a valid stem. The main merits of the program are that it is small, fast and reasonably simple, while its success rate is reasonably good. It is quite realistic to apply it to every word in a large file of continuous text.
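As an illustration, here is a minimal sketch using the Porter stemmer implementation that ships with NLTK (this work used its own implementation of Porter's algorithm):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ['connected', 'connecting', 'connection', 'connections']:
    print(word, '->', stemmer.stem(word))
# All four variants reduce to the common stem 'connect'.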

2.4 Stopwords

Words which are too frequent among the documents are not good discriminators. In fact, a word which occurs in 80% of the documents in a collection is useless for purposes of retrieval or classification. Such words are frequently referred to as stopwords and should be filtered out. Articles, prepositions and conjunctions, such as "an", "against", and "and", are candidates for a list of stopwords. Removal of stopwords not only improves the accuracy of retrieval or classification, but also reduces the size of the document.
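A minimal sketch of the stopword-removal step follows (the tiny stopword list here is a stand-in for the full list used in the experiments):

# A stand-in stopword list; the experiments used a full stopword list.
STOPWORDS = {'a', 'an', 'and', 'against', 'of', 'the', 'in', 'is'}

def remove_stopwords(words):
    """Filter out stopwords before indexing or classification."""
    return [w for w in words if w.lower() not in STOPWORDS]

print(remove_stopwords(['the', 'board', 'against', 'the', 'wall']))
# ['board', 'wall']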

CHAPTER 3

Work related to Word Sense Disambiguation

One of the first problems encountered by any natural language processing system is that of lexical ambiguity, be it syntactic or semantic. The resolution of a word's syntactic ambiguity has largely been solved in language processing by part-of-speech taggers, which predict the syntactic category of words in text with high levels of accuracy (for example [14]). The problem of resolving semantic ambiguity is generally known as word sense disambiguation and has proved to be more difficult than syntactic disambiguation. The problem is that words often have more than one meaning, sometimes fairly similar and sometimes completely different. The meaning of a word in a particular usage can only be determined by examining its context. This is, in general, a trivial task for the human language processing system. However, the task has proved difficult for computers, and some have believed it would never be solved. Nevertheless, there have been several advances in word sense disambiguation, and we are now at a stage where lexical ambiguity in text can be resolved with a reasonable degree of accuracy.

3.1 Survey of Approaches to Word Sense Disambiguation

It is useful to distinguish some different approaches to the word sense disambiguation problem. In general, we can categorize all approaches into one of three general strategies: knowledge based, corpus based and hybrid. We shall now look at each of these three strategies in turn.

Knowledge Based

Under this approach, disambiguation is carried out using information from an explicit lexicon or knowledge base. The lexicon may be a machine readable dictionary or thesaurus, or it may be hand-crafted. This is one of the most popular approaches to word sense disambiguation, and work has been done using existing lexical knowledge sources such as WordNet [16,17,18,19,5], LDOCE [20,21], and Roget's International Thesaurus [22]. The information in these resources has been used in several ways; for example, Wilks and Stevenson [23], Harley and Glennon [24] and McRoy [25] all use large lexicons (generally machine readable dictionaries) and the information associated with the senses (such as part-of-speech tags, topical guides and selectional preferences) to indicate the correct sense. The word sense disambiguation algorithm implemented in our work, introduced by Voorhees [5], takes advantage of WordNet and part-of-speech tags. Another approach is to treat the text as an unordered bag of words, where similarity measures are calculated by

looking at the semantic similarity (as measured from the knowledge source) between all the words in the window, regardless of their positions, as was used by Yarowsky [22].

Corpus Based

This approach attempts to disambiguate words using information gained by training on some corpus, rather than taking it directly from an explicit knowledge source. The training can be carried out on either a disambiguated or a raw corpus, where a disambiguated corpus is one in which the semantics of each polysemous lexical item is marked, and a raw corpus is one without such marking.

Disambiguated Corpora

This set of techniques requires a training corpus which has already been disambiguated. In general, a machine learning algorithm of some kind is applied to certain features extracted from the corpus and used to form a representation of each of the senses. This representation can then be applied to new instances in order to disambiguate them. Different researchers have made use of different sets of features; for example, [15] used local collocates such as the first noun to the left and right, the second word to the left/right, and so on. The general problem with these methods is their reliance on disambiguated corpora, which are expensive and difficult to obtain. This has meant that many of these algorithms have been tested on very small numbers of different words.

Raw Corpora

It is often difficult to obtain appropriate lexical resources, and we have already noted the difficulty of obtaining disambiguated text for supervised disambiguation. This lack of resources has led several researchers to explore the use of raw corpora to perform unsupervised disambiguation. It should be noted that unsupervised disambiguation cannot actually label specific terms as referring to a specific concept: that would require more information than is available. What unsupervised disambiguation can achieve is word sense discrimination, which clusters the instances of a word into distinct categories without giving those categories labels from a lexicon (such as WordNet synsets).

Hybrid Approaches

These approaches cannot properly be classified as either knowledge based or corpus based, but use parts of both. A good example is Luk's system [26], which uses the textual definitions of senses from a machine readable dictionary to identify relations between senses. He then uses a corpus to calculate mutual information scores between these related senses in order to discover the most useful information. This allowed Luk to produce a system which used the information in lexical resources as a way of reducing the amount of text needed in the training corpus.

CHAPTER 4

Using Hood Algorithm to Disambiguate Word Senses

In this chapter we present our implementation of the algorithm of Voorhees [5], with the help of WordNet. It is based on the idea that a set of words occurring together in context will determine appropriate senses for one another, despite each individual word being multiply ambiguous. A common example of this effect [27] is the set of nouns "base", "bat", "glove" and "hit". While most of these words have several senses, when taken together the intent is clearly the game of baseball. To exploit this idea automatically, a set of categories representing the different senses of words needs to be defined. Once such categories are defined, the number of words in the text that have senses belonging to a given category is counted. The senses that correspond to the categories with the largest counts are selected as the intended senses of the ambiguous words. Obviously, the category definitions are a critical component.

4.1 Converting WordNet into Relational Database

WordNet, the dictionary system by the Cognitive Science Laboratory of Princeton University, is stored in flat files, not in a database. In order to make the implementation easy and obtain good performance, WordNet is converted into a relational database. The four relations created for WordNet are shown in Tables 4-1 to 4-4 (for the detailed definitions of the tables, refer to Appendix A). Each of the four definitions is in third normal form, and each of the relations synsets, words, synset_word and synset_relations contains only

distinct records.

1. synsets(synset_id, category, hierarchy, meaning)

synset_id: a unique decimal integer which represents a synset in WordNet.
category: a one-character code indicating the synset type. For example, "n" indicates a noun.
hierarchy: the hierarchy the synset belongs to. In WordNet, the hierarchies for nouns range from 3 to 28.
meaning: the definition (gloss) for the synset.

This table contains the basic information of each synset in WordNet; however, only the attribute synset_id is used in our work. Example tuples are shown in Table 4-1: the first column is the synset_id for a synset in WordNet; "n" means that the synset is in the noun division of WordNet (WordNet also has a verb division); 28 means the synset belongs to hierarchy 28; meaning is the gloss for the synset.

Table 4-1: Relation Definition for synsets and Tuples

synset_id | category | hierarchy | meaning
…         | n        | 28        | a period of the year marked by special events or activities in some field
…         | n        | 14        | a committee having supervisory powers

2. synset_relations(synset_id1, synset_id2, rel_str)

synset_id1: a child synset of synset_id2. This table does not store any relationship other than parent-child.
synset_id2: a parent synset of synset_id1.

rel_str: the actual symbol used in WordNet to describe the relationship.

~: synset_id1 is a hypernym of synset_id2 (synset_id1 is a superordinate <generic> of synset_id2 <specific>)
@: synset_id1 is a hyponym of synset_id2 (synset_id1 is a subordinate <specific> of synset_id2 <generic>)
%: synset_id1 is a holonym of synset_id2 (synset_id2 is part of synset_id1)
#: synset_id1 is a meronym of synset_id2 (synset_id1 is part of synset_id2)
#p: synset_id1 is part of synset_id2
#m: synset_id1 is a member of synset_id2
#s: synset_id1 is the stuff that synset_id2 is made from
=: synset_id1 has an attribute synset_id2 (synset_id2 is an adjective)
!: synset_id1 and synset_id2 are antonyms (not stored in this table)

Although there are many kinds of relationship, we can treat all of them as child-parent relationships. Each synset can have multiple parents or multiple children. Each pair of child and parent is a tuple in this table. Example tuples are shown in Table 4-2, where two synsets are recorded as children of the same parent synset; the rel_str values indicate the kind of relationship. This relation is used frequently in our work: we depend on the parent-child relationship to find the ancestors of a given synset.

Table 4-2: Relation Definition for synset_relations and Tuples

synset_id1 | synset_id2 | rel_str
…          | …          | #p
…          | …          | #m

3. words(word_id, word)

word_id: a unique decimal integer for each meaning of each word in WordNet.

word: the word one of whose meanings is numbered word_id.

Each word in WordNet may have multiple meanings; for every meaning we assign a unique identification number, word_id. Example tuples are shown in Table 4-3: the first row holds the word_id for one of the 3 meanings of "season"; "board" has 9 meanings, one of which appears in each of the other two rows under its own word_id. This relation is also used frequently in our work: we depend on it to find all the synsets a given word belongs to.

Table 4-3: Relation Definition for words and Tuples

word_id | word
…       | season
…       | board
…       | board

4. synset_word(synset_id, word_id)

synset_id: defined in relation synsets.
word_id: defined in relation words.

This table is the connection between synsets and words. One synset may consist of more than one word_id, while each word_id is assigned to only one synset. This guarantees that all different meanings of the same word are separated into different synsets; in other words, there are no synonymous or polysemous synsets. Hence, every synset represents a lexicalized concept. Example tuples are shown in Table 4-4: the word "board" is a member of a given synset because one of its meanings (one word_id) is very close in meaning to that synset.

Table 4-4: Relation Definition for synset_word and Tuples

synset_id | word_id
…         | …
…         | …

All the major information of the noun division of WordNet is stored in two files: noun.dat and noun.idx. The data format in noun.dat is as follows:

synset_id hierarchy category w_cnt word lex_id [word lex_id...] p_cnt [pointer_symbol synset_id pos source/target] gloss

NOTE:
w_cnt: number of words in the synset.
lex_id: a one-digit hexadecimal integer that uniquely identifies a meaning of this word. It usually starts with 0.
p_cnt: number of pointers from this synset to other synsets.
pointer_symbol: refer to the definition of relation synset_relations.
pos: syntactic category, "n" for noun.
source/target: a value of 0000 means that pointer_symbol represents a semantic relation between the current (source) synset and the target synset.

For example, in the record below for "season": the first field is a synset_id; 28 is the hierarchy; "n" means noun; 01 indicates there is only one word in this synset; "season" is the word in this synset; 2 indicates that this meaning is the second meaning of the word "season"; 015 indicates that the synset has 15 pointers to other synsets; of the 15 target synsets, one is its parent synset (pointed to via "@") and the other 14 are its child synsets (pointed to via "~"); the last part is the definition with example sentences.

… 28 n 01 season 2 015 @ … n 0000 ~ … n 0000 [13 more "~" pointers] | a period of the year marked by special events or activities in some field; "he celebrated his 10th season with the ballet company" or "she always looked forward to the avocado season"
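As a concrete rendering of the four relations above, here is a minimal sketch of the schema in SQLite (the thesis does not name its database engine, and the column types are assumptions):

import sqlite3

conn = sqlite3.connect('wordnet.db')
conn.executescript('''
CREATE TABLE synsets (
    synset_id INTEGER PRIMARY KEY,  -- unique id of the synset
    category  TEXT,                 -- one-character synset type, 'n' for noun
    hierarchy INTEGER,              -- noun hierarchies range from 3 to 28
    meaning   TEXT                  -- the gloss
);
CREATE TABLE synset_relations (
    synset_id1 INTEGER,             -- child synset
    synset_id2 INTEGER,             -- parent synset
    rel_str    TEXT                 -- WordNet pointer symbol, e.g. '@', '#p'
);
CREATE TABLE words (
    word_id INTEGER PRIMARY KEY,    -- one id per meaning of each word
    word    TEXT
);
CREATE TABLE synset_word (
    synset_id INTEGER REFERENCES synsets(synset_id),
    word_id   INTEGER UNIQUE REFERENCES words(word_id)  -- one synset per word_id
);
''')
conn.commit()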

The data format in noun.idx, in turn, is as follows:

word pos poly_cnt p_cnt [pointer_symbol...] sense_cnt tagsense_cnt synset_id [synset_id...]

NOTE:
pos: syntactic category, "n" for noun.
poly_cnt: number of different meanings (polysemy) the current word has in WordNet. This is the same value as sense_cnt, but is retained for historical reasons.
p_cnt: number of different types of pointers the current word has in all synsets containing it.
pointer_symbol: refer to the definition of relation synset_relations.
sense_cnt: number of different meanings the current word has in WordNet.
tagsense_cnt: number of meanings of the current word that are ranked according to their frequency of occurrence in semantic concordance texts.
synset_id: each synset_id in the list corresponds to a different meaning of the current word in WordNet.

For example, in the record below, "seat" is a word; "n" means noun; 6 indicates that "seat" has 6 senses; 5 means that "seat" has 5 different types of pointers (@, ~, #m, #p, %p) in all the synsets containing it; again, 6 tells that "seat" is in 6 synsets; finally, the synsets containing "seat" are listed one by one.

seat n 6 5 @ ~ #m #p %p 6 … … … … … … …

Pseudo-code 1 shows the steps to convert the data in the two flat files into the relational database. The for loop from line 1 to line 8 extracts data from noun.dat to construct table synsets. The second for loop, from line 9 to line 21, extracts data from noun.dat again to construct table synset_relations; its inner loop generates a separate tuple for every pair of child and parent. That means that if a synset has multiple pointers to other synsets,

there are multiple tuples for it, representing the multiple-children or multiple-parents relationships. Then the code from line 22 to line 32 extracts data from noun.idx to construct tables words and synset_word; its inner loop generates a separate tuple for every sense in words and synset_word.

Pseudo-code 1: build_wordnet()
1  for each line in noun.dat
2      synset_id <- retrieve synset_id
3      hierarchy <- retrieve hierarchy
4      category <- retrieve category
5      skip the next items until gloss
6      meaning <- retrieve gloss
7      insert tuple (synset_id, hierarchy, category, meaning) into table synsets
8  end
9  for each line in noun.dat
10     synset_id <- retrieve synset_id
11     skip the next items until p_cnt
12     num_pointers <- retrieve p_cnt
13     for each pointer
14         relationstr <- retrieve pointer_symbol
15         relationsynset_id <- retrieve synset_id
16         if (synset_id is the parent of relationsynset_id)
17             insert tuple (relationsynset_id, synset_id, relationstr) into table synset_relations
18         else
19             insert tuple (synset_id, relationsynset_id, relationstr) into table synset_relations

20     end
21 end
22 for each line in noun.idx
23     word <- retrieve word
24     skip the next items until sense_cnt
25     numsenses <- retrieve sense_cnt
26     for each sense
27         generate a unique id word_id for this sense
28         insert tuple (word_id, word) into table words
29         synset_id <- retrieve synset_id
30         insert tuple (synset_id, word_id) into table synset_word
31     end
32 end

4.2 Algorithm

4.2.1 Hood Construction

Using each separate hierarchy as a category is well defined but too coarse grained. For example, in Figure 2-1 seven of the eight senses of "board" are in the {entity, thing} hierarchy. Similarly, using individual synsets is well defined but too fine grained. Therefore, this algorithm defines an appropriate middle-level category, the hood. To define the hood of a given synset s, consider the set of synsets and the hyponymy links in WordNet as the set of vertices and directed edges of a graph. Then the hood of s is the largest connected subgraph that contains s, contains only descendents of an ancestor of s, and contains no synset that has a descendent that includes another

instance of a member of s as a member. A hood is represented by the synset that is the root of the hood. In other words, as shown in Figure 4-1, assume that synset s consists of k words w(1), w(2), ..., w(k), and that p(1), p(2), ..., p(n) are n ancestors of s, where p(m) is a father of p(m-1). Suppose p(m) (for some m in 1..n) has a descendent synset which also includes some w(j) (j in 1..k) as a member, and p(m) is the closest ancestor of s with this property. Then p(m-1) is one of the root(s) of the hood(s) of s, as shown in Case 1. If m is 1, s itself is the root, as shown in Case 2. If no such m is found, the root r of the WordNet hierarchy containing s is the root of the hood of s, as shown in Case 3. If s itself has a descendent synset that includes some w(j) (j in 1..k) as a member, there is no hood in WordNet for s, as shown in Case 4. Because some synsets have more than one parent, synsets can have more than one hood. A synset has no hood if the same word is a member of both the synset and one of its descendents. For example, in Figure 2-1 the hood of the committee sense of "board" is rooted at the synset {group, grouping} (and thus the hood for that sense is the entire hierarchy in which it occurs) because no other synset in this hierarchy contains "board" (Case 3); the hood for the circuit_board sense of "board" is rooted at {circuit, closed_circuit} because the synset {electrical_device} has a descendent synset {control_panel, display_panel, panel, board} containing "board" (Case 1); and the hood for the panel sense of "board" is rooted at the synset itself because its direct parent synset {electrical_device} has a descendent synset {circuit_board, circuit_card, board, card} containing "board" (Case 2). Pseudo-code 2 shows the steps to find the root(s) of the hood(s) for a given synset. The input for this procedure is a given synset_id, s. The output is the synset_id(s) of the

root(s) of the hood(s) for s. The code from line 1 to line 10 gets all the synsets that have at least one member word in common with s and saves them in a hashtable, synset_id_hashtable. From line 11 to line 22, we get all the ancestors of every synset in synset_id_hashtable and keep them in another hashtable, all_ancestors_hashtable.

[Figure 4-1: Root of Hood(s) of Synset s, illustrating the four cases: the hood root is p(m-1) (Case 1), s itself (Case 2), or the hierarchy root r (Case 3); in Case 4, s has no hood.]

From line 23 to line 43, we find the ancestors of s one

by one, from the closest to the farthest. Whenever an ancestor a is in all_ancestors_hashtable (in other words, a has a descendent that includes another instance of a member of s as a member), its child on the path from s to a is a root of a hood of s. In our work, we apply the find_hood_root(s) procedure to all the synsets in WordNet. The output is stored in hood_root.txt for further computation.

Pseudo-code 2: find_hood_root(s)
1  word_id_set <- π_word_id(σ_synset_id=s(synset_word))
2  for each word_id in word_id_set
3      word_set <- π_word(σ_word_id=word_id(words))
4      for each word in word_set
5          all_word_id_set <- π_word_id(σ_word=word(words))
6      end
7  end
8  for each word_id in all_word_id_set
9      synset_id_hashtable <- π_synset_id(σ_word_id=word_id(synset_word))
10 end
11 for each synset_id in synset_id_hashtable except s
12     current_id_hashtable <- synset_id
13     while (current_id_hashtable is not empty)
14         for each synset_id in current_id_hashtable
15             parent_id_hashtable <- π_synset_id2(σ_synset_id1=synset_id(synset_relations))
16         end
17         clear current_id_hashtable
18         copy parent_id_hashtable to current_id_hashtable
19         copy parent_id_hashtable to all_ancestors_hashtable

20         clear parent_id_hashtable
21     end
22 end
23 if (s is in all_ancestors_hashtable)
24     s has no hood in WordNet
25 else
26     current_id_hashtable <- s
27     while (current_id_hashtable is not empty)
28         for each current_synset_id in current_id_hashtable
29             parent_id_hashtable <- π_synset_id2(σ_synset_id1=current_synset_id(synset_relations))
30             for each parent_synset_id in parent_id_hashtable
31                 if (parent_synset_id is in all_ancestors_hashtable)
32                     root_found <- true
33                     root_set <- current_synset_id
34                     remove parent_synset_id from parent_id_hashtable
35                     break
36                 end
37             end
38             clear current_id_hashtable
39             copy parent_id_hashtable to current_id_hashtable
40             clear parent_id_hashtable
41         end
42     if (root_found is false)
43         root_set <- root of this entire hierarchy in WordNet
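For illustration, here is a compact in-memory sketch of the same search (an interpretive re-implementation under simplifying assumptions: plain dictionaries stand in for the relational tables, and all names are invented for this example):

from collections import deque

def find_hood_roots(s, words_of, parents_of, synsets_with):
    """Sketch of Pseudo-code 2: return the hood-root synset ids for synset s.

    words_of:     synset_id -> set of member words
    parents_of:   synset_id -> set of parent synset_ids
    synsets_with: word -> set of synset_ids containing that word
    """
    # Lines 1-10: every other synset sharing a member word with s.
    sharers = {t for w in words_of[s] for t in synsets_with[w]} - {s}

    # Lines 11-22: all ancestors of those synsets.
    blocked, queue = set(), deque(sharers)
    while queue:
        for p in parents_of.get(queue.popleft(), ()):
            if p not in blocked:
                blocked.add(p)
                queue.append(p)

    # Lines 23-24 (Case 4): some descendent of s repeats a member word.
    if s in blocked:
        return []

    # Lines 25-43: walk upward from s; a synset whose parent is blocked is a
    # hood root (Cases 1-2); one with no parents is a hierarchy root (Case 3).
    roots, queue, seen = [], deque([s]), {s}
    while queue:
        c = queue.popleft()
        parents = parents_of.get(c, set())
        if not parents:
            roots.append(c)
        for p in parents:
            if p in blocked:
                roots.append(c)
            elif p not in seen:
                seen.add(p)
                queue.append(p)
    return roots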

[Figure 4-2: The IS-A hierarchy for eight different senses of the noun "board" (the fragment of Figure 2-1 relevant to the example below).]

Let's take the synset {circuit_board, circuit_card, board, card} as an example (Figure 4-2; refer to Figure 2-1). All 9 synsets for "board" are stored in synset_id_hashtable, as well as the synsets for "circuit_board", "circuit_card" and "card". all_ancestors_hashtable then contains the ancestors of all those synsets, but none of the synsets on the path from {circuit_board, circuit_card, board, card} up to and including {circuit, closed_circuit} is in it, because each of these is only an ancestor of {circuit_board, circuit_card, board, card} and not an ancestor of any other synset containing "circuit_board", "circuit_card", "board" or "card". When we follow the parent-child relationship to find the ancestors of {circuit_board, circuit_card, board, card}, we finally stop at {electrical_device}, because it is the parent of the synset {control_panel, display_panel, panel, board}, which contains "board". Therefore, the root of the hood for {circuit_board, circuit_card, board, card} is the synset {circuit, closed_circuit}.

4.2.2 Word Sense Disambiguation

After the hoods for each synset in WordNet are constructed, they can be used to select the sense of an ambiguous word in a given text document. The senses of the nouns in a text document of a given collection are selected by the following two-stage process. A marking procedure that visits synsets and maintains a count of the number of times each synset is visited is fundamental to both stages. Given a word, the procedure finds all instances of the word in (the noun portion of) WordNet. For each identified synset, the procedure follows the IS-A links up to the root of the hierarchy, incrementing a counter at each synset it visits. In the first stage, the marking procedure is called once for each occurrence of a content word (i.e., a word that is not a stopword) in all of the documents in the collection. The number of times the procedure was called and found the word in

WordNet is also maintained. This produces a set of global counts (relative to this particular collection) at each synset. In the second stage, the marking procedure is called once for each occurrence of a content word in an individual text (document or query). Again, the number of times the procedure was called and found the word in WordNet for the individual text is maintained. This produces a set of local counts at the synsets. Given the local and global counts, a sense for a particular ambiguous word contained within the text that generated the local counts is selected as follows:

    difference = (# local visits / # local calls) - (# global visits / # global calls)

The difference is computed at the root of the hood for each sense of the word. If a sense does not have a hood, or if the local count at its hood root is less than two, its difference is set to zero. If a sense has multiple hoods, its difference is set to the largest difference over the set of hoods. The sense corresponding to the hood root with the largest positive difference is selected as the sense of the word in the text. If no sense has a positive difference, no WordNet sense is chosen for the word. Pseudo-code 3 shows the steps to disambiguate the sense of every word in a document.

Pseudo-code 3: disambiguation()
global_counts()
For each document in the document collection
    local_counts(document)
    Load words in this document into word_in_doc_hashtable

    Remove stopwords from word_in_doc_hashtable
    Remove words that are not in the WordNet noun division
    For each word in word_in_doc_hashtable
        difference(word)
    end
end

Pseudo-code for global_counts()
For each word in the document collection
    if (word is not a stopword and word is in the WordNet noun division)
        marking(word)
        #_of_global_calls is incremented by 1
end

Pseudo-code for local_counts(document)
For each word in this document
    if (word is not a stopword and word is in the WordNet noun division)
        marking(word)
        #_of_local_calls is incremented by 1
end

Pseudo-code for marking(word)
Find all the synset(s) that contain the word and save them in synset_id_hashtable
For each synset in synset_id_hashtable
    Find all its ancestors and save them in ancestors_hashtable

    For each synset in ancestors_hashtable
        Increment its counter by 1
    end
end

Pseudo-code for difference(word)
Find all the synset(s) that contain this word and save them in synset_id_hashtable
For each synset in synset_id_hashtable
    Find the root(s) of the hood(s) of this synset
    if this synset has no hood at all
        max_diff = 0
    else
        For each root
            Calculate the diff with the formula described above
            Compare diff with max_diff and keep the max_diff
        end
end
The true sense of this word as used in this document is the synset whose hood root has the max_diff

The idea behind this disambiguation procedure is to select senses from the areas of the WordNet hierarchies in which document-induced (local) activity is greater than the expected (global) activity. The hood construct is designed to provide a point of comparison that is broad enough to encompass markings from several different words yet narrow enough to distinguish among senses.
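To make the selection rule concrete, here is a small sketch of the final step (the function and parameter names are invented for this example, and the visit counts are assumed to come from the marking procedure above):

def select_sense(senses, hood_roots, local, global_, local_calls, global_calls):
    """Pick the sense whose hood root has the largest positive difference.

    senses:         candidate synset ids for the ambiguous word
    hood_roots:     synset_id -> list of hood-root synset ids (empty: no hood)
    local, global_: synset_id -> visit counts at that synset
    """
    best_sense, best_diff = None, 0.0
    for sense in senses:
        diff = 0.0
        for root in hood_roots.get(sense, []):
            if local.get(root, 0) < 2:   # local count below two: treat as zero
                continue
            d = local[root] / local_calls - global_.get(root, 0) / global_calls
            diff = max(diff, d)          # multiple hoods: keep the largest
        if diff > best_diff:
            best_sense, best_diff = sense, diff
    return best_sense                    # None if no sense has a positive difference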

CHAPTER 5

Experiments

In this chapter I describe the experiment that verifies the effectiveness of the hood algorithm for word sense disambiguation. The experiment is performed on the part-of-speech tagged Brown Corpus. The flow of the experiment is described in detail, and I report the results and analyze their quality.

5.1 Part-of-Speech Tagged Brown Corpus

The Brown Corpus consists of 1,014,312 words of running text of edited English prose printed in the United States during the calendar year 1961. So far as it has been possible to determine, the writers were native speakers of American English. The Corpus is divided into 500 samples of about 2,000 words each. Each sample begins at the beginning of a sentence, but not necessarily of a paragraph or other larger division, and each ends at the first sentence ending after 2,000 words. The samples represent a wide range of styles and varieties of prose; they were chosen for their representative quality rather than for any subjectively determined excellence. A corpus is intended to be "a collection of naturally occurring language text, chosen to characterize a state or variety of a language" (Sinclair, 1991). As such, very few of the so-called corpora used in current natural language processing and speech recognition work deserve the name. For English, the only true corpus that is widely available is the Brown Corpus. It has been extensively used for natural language processing work.

A sentence in natural language text is usually composed of nouns, pronouns, articles, verbs, adjectives, adverbs and connectives. While the words in each grammatical class are used with a particular purpose, it can be argued that most of the semantics is carried by nouns. Thus, nouns can be extracted through the systematic elimination of verbs, adjectives, adverbs, connectives, articles and pronouns. Therefore, in this experiment we make use of the part-of-speech tagged Brown Corpus provided by the Treebank Project of the Computer and Information Science Department, University of Pennsylvania. This document set consists of 479 tagged documents. Each word in every document is tagged with its linguistic category.

5.2 Flow of Experiment

Figure 5-1 shows the steps of my experiment. First of all, I convert WordNet from the flat files (noun.dat and noun.idx) to a relational database: tables are created and all the data contained in noun.dat and noun.idx are loaded into them (see Pseudo-code 1). Then, for each synset in WordNet, the root(s) of the hood(s) are found and saved in hood_root.txt. On the other hand, for each part-of-speech tagged document in the Brown Corpus, such as a01, first all the tags and non-nouns in a01 are removed and the result is saved in a01_noun. Second, a01_noun is processed by the stemming algorithm; after this step, all the words remaining in a01_noun_stem are stems of the words in a01. Finally, a01_noun_stem is processed by Dr. Voorhees' disambiguation algorithm. The

final result is saved in disambiguation_result_a01, where each word is mapped to a unique synset that represents the sense in which the word is used in this context.

[Figure 5-1: Steps of Experiment. One branch converts the WordNet files (noun.dat, noun.idx) into the relational database (see Pseudo-code 1) and finds the root(s) of the hood(s) for each synset, saved in hood_root.txt (see Pseudo-code 2). The other branch takes a tagged document (e.g. a01.txt), removes tags and non-nouns (a01_noun.txt), applies the stemming algorithm (a01_noun_stem.txt), and then disambiguates each word (see Pseudo-code 3), producing a result file in which each word is mapped to a unique synset.]
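The tag-and-non-noun removal step of Figure 5-1 amounts to keeping only the word/TAG tokens whose tag marks a noun; a minimal sketch (the tag set shown is the Penn Treebank noun subset, an assumption about this distribution):

NOUN_TAGS = {'NN', 'NNS', 'NNP', 'NNPS'}  # Penn Treebank noun tags (assumed)

def nouns_from_tagged_text(text):
    """Keep the words of 'word/TAG' tokens whose tag marks a noun."""
    nouns = []
    for token in text.split():
        word, sep, tag = token.rpartition('/')
        if sep and tag.upper() in NOUN_TAGS:
            nouns.append(word.lower())
    return nouns

print(nouns_from_tagged_text('The/DT board/NN approved/VBD the/DT plans/NNS ./.'))
# ['board', 'plans']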

5.3 Quality of Results

The results shown in Table 5-1 are for 50 documents randomly chosen from the Brown Corpus and processed as shown in Figure 5-1. Since WordNet provides semantically tagged Brown Corpus files, I compare my results with the manually identified results:

    Hit Rate = (# of words assigned the same synset as the manually identified one) / (# of words in the stemmed file)

Table 5-1: Hit Rate of Experiment for Voorhees' Algorithm

Hit Rate                     | <15% | 15%-20% | 20%-25% | 25%-30% | 30%-35% | >40%
# of docs with this hit rate | …    | …       | …       | …       | …       | …

From this table we can see that the hit rate is not as high as expected: none is higher than 40%, and most are between 15% and 35%. This means that Dr. Voorhees' disambiguation algorithm is not an effective way to automatically disambiguate word senses.

5.4 Result Analysis

So far we can say that the algorithm does not work well for disambiguating word senses. The reasons are as follows:

1. Although most of the semantics is carried by nouns, verbs, adjectives and adverbs are important factors that can help determine the appropriate sense of an ambiguous word.


Information Retrieval. Information Retrieval and Web Search Information Retrieval and Web Search Introduction to IR models and methods Information Retrieval The indexing and retrieval of textual documents. Searching for pages on the World Wide Web is the most recent

More information

Supervised Models for Coreference Resolution [Rahman & Ng, EMNLP09] Running Example. Mention Pair Model. Mention Pair Example

Supervised Models for Coreference Resolution [Rahman & Ng, EMNLP09] Running Example. Mention Pair Model. Mention Pair Example Supervised Models for Coreference Resolution [Rahman & Ng, EMNLP09] Many machine learning models for coreference resolution have been created, using not only different feature sets but also fundamentally

More information

Lecture 14: Annotation

Lecture 14: Annotation Lecture 14: Annotation Nathan Schneider (with material from Henry Thompson, Alex Lascarides) ENLP 23 October 2016 1/14 Annotation Why gold 6= perfect Quality Control 2/14 Factors in Annotation Suppose

More information

Query Difficulty Prediction for Contextual Image Retrieval

Query Difficulty Prediction for Contextual Image Retrieval Query Difficulty Prediction for Contextual Image Retrieval Xing Xing 1, Yi Zhang 1, and Mei Han 2 1 School of Engineering, UC Santa Cruz, Santa Cruz, CA 95064 2 Google Inc., Mountain View, CA 94043 Abstract.

More information

AN EFFECTIVE INFORMATION RETRIEVAL FOR AMBIGUOUS QUERY

AN EFFECTIVE INFORMATION RETRIEVAL FOR AMBIGUOUS QUERY Asian Journal Of Computer Science And Information Technology 2: 3 (2012) 26 30. Contents lists available at www.innovativejournal.in Asian Journal of Computer Science and Information Technology Journal

More information

CHAPTER-26 Mining Text Databases

CHAPTER-26 Mining Text Databases CHAPTER-26 Mining Text Databases 26.1 Introduction 26.2 Text Data Analysis and Information Retrieval 26.3 Basle Measures for Text Retrieval 26.4 Keyword-Based and Similarity-Based Retrieval 26.5 Other

More information

Question Answering Approach Using a WordNet-based Answer Type Taxonomy

Question Answering Approach Using a WordNet-based Answer Type Taxonomy Question Answering Approach Using a WordNet-based Answer Type Taxonomy Seung-Hoon Na, In-Su Kang, Sang-Yool Lee, Jong-Hyeok Lee Department of Computer Science and Engineering, Electrical and Computer Engineering

More information

Domain-specific Concept-based Information Retrieval System

Domain-specific Concept-based Information Retrieval System Domain-specific Concept-based Information Retrieval System L. Shen 1, Y. K. Lim 1, H. T. Loh 2 1 Design Technology Institute Ltd, National University of Singapore, Singapore 2 Department of Mechanical

More information

A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition

A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition Ana Zelaia, Olatz Arregi and Basilio Sierra Computer Science Faculty University of the Basque Country ana.zelaia@ehu.es

More information

A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition

A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition Ana Zelaia, Olatz Arregi and Basilio Sierra Computer Science Faculty University of the Basque Country ana.zelaia@ehu.es

More information

Sense-based Information Retrieval System by using Jaccard Coefficient Based WSD Algorithm

Sense-based Information Retrieval System by using Jaccard Coefficient Based WSD Algorithm ISBN 978-93-84468-0-0 Proceedings of 015 International Conference on Future Computational Technologies (ICFCT'015 Singapore, March 9-30, 015, pp. 197-03 Sense-based Information Retrieval System by using

More information

Volume 2, Issue 6, June 2014 International Journal of Advance Research in Computer Science and Management Studies

Volume 2, Issue 6, June 2014 International Journal of Advance Research in Computer Science and Management Studies Volume 2, Issue 6, June 2014 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online at: www.ijarcsms.com Internet

More information

Department of Electronic Engineering FINAL YEAR PROJECT REPORT

Department of Electronic Engineering FINAL YEAR PROJECT REPORT Department of Electronic Engineering FINAL YEAR PROJECT REPORT BEngCE-2007/08-HCS-HCS-03-BECE Natural Language Understanding for Query in Web Search 1 Student Name: Sit Wing Sum Student ID: Supervisor:

More information

Punjabi WordNet Relations and Categorization of Synsets

Punjabi WordNet Relations and Categorization of Synsets Punjabi WordNet Relations and Categorization of Synsets Rupinderdeep Kaur Computer Science Engineering Department, Thapar University, rupinderdeep@thapar.edu Suman Preet Department of Linguistics and Punjabi

More information

MEASUREMENT OF SEMANTIC SIMILARITY BETWEEN WORDS: A SURVEY

MEASUREMENT OF SEMANTIC SIMILARITY BETWEEN WORDS: A SURVEY MEASUREMENT OF SEMANTIC SIMILARITY BETWEEN WORDS: A SURVEY Ankush Maind 1, Prof. Anil Deorankar 2 and Dr. Prashant Chatur 3 1 M.Tech. Scholar, Department of Computer Science and Engineering, Government

More information

Boolean Queries. Keywords combined with Boolean operators:

Boolean Queries. Keywords combined with Boolean operators: Query Languages 1 Boolean Queries Keywords combined with Boolean operators: OR: (e 1 OR e 2 ) AND: (e 1 AND e 2 ) BUT: (e 1 BUT e 2 ) Satisfy e 1 but not e 2 Negation only allowed using BUT to allow efficient

More information

Conceptual document indexing using a large scale semantic dictionary providing a concept hierarchy

Conceptual document indexing using a large scale semantic dictionary providing a concept hierarchy Conceptual document indexing using a large scale semantic dictionary providing a concept hierarchy Martin Rajman, Pierre Andrews, María del Mar Pérez Almenta, and Florian Seydoux Artificial Intelligence

More information

ResPubliQA 2010

ResPubliQA 2010 SZTAKI @ ResPubliQA 2010 David Mark Nemeskey Computer and Automation Research Institute, Hungarian Academy of Sciences, Budapest, Hungary (SZTAKI) Abstract. This paper summarizes the results of our first

More information

Optimal Query. Assume that the relevant set of documents C r. 1 N C r d j. d j. Where N is the total number of documents.

Optimal Query. Assume that the relevant set of documents C r. 1 N C r d j. d j. Where N is the total number of documents. Optimal Query Assume that the relevant set of documents C r are known. Then the best query is: q opt 1 C r d j C r d j 1 N C r d j C r d j Where N is the total number of documents. Note that even this

More information

Privacy and Security in Online Social Networks Department of Computer Science and Engineering Indian Institute of Technology, Madras

Privacy and Security in Online Social Networks Department of Computer Science and Engineering Indian Institute of Technology, Madras Privacy and Security in Online Social Networks Department of Computer Science and Engineering Indian Institute of Technology, Madras Lecture - 25 Tutorial 5: Analyzing text using Python NLTK Hi everyone,

More information

Hidden Markov Models. Natural Language Processing: Jordan Boyd-Graber. University of Colorado Boulder LECTURE 20. Adapted from material by Ray Mooney

Hidden Markov Models. Natural Language Processing: Jordan Boyd-Graber. University of Colorado Boulder LECTURE 20. Adapted from material by Ray Mooney Hidden Markov Models Natural Language Processing: Jordan Boyd-Graber University of Colorado Boulder LECTURE 20 Adapted from material by Ray Mooney Natural Language Processing: Jordan Boyd-Graber Boulder

More information

Information Retrieval and Web Search

Information Retrieval and Web Search Information Retrieval and Web Search Introduction to IR models and methods Rada Mihalcea (Some of the slides in this slide set come from IR courses taught at UT Austin and Stanford) Information Retrieval

More information

How to.. What is the point of it?

How to.. What is the point of it? Program's name: Linguistic Toolbox 3.0 α-version Short name: LIT Authors: ViatcheslavYatsko, Mikhail Starikov Platform: Windows System requirements: 1 GB free disk space, 512 RAM,.Net Farmework Supported

More information

Correlation to Georgia Quality Core Curriculum

Correlation to Georgia Quality Core Curriculum 1. Strand: Oral Communication Topic: Listening/Speaking Standard: Adapts or changes oral language to fit the situation by following the rules of conversation with peers and adults. 2. Standard: Listens

More information

CHAPTER 3 INFORMATION RETRIEVAL BASED ON QUERY EXPANSION AND LATENT SEMANTIC INDEXING

CHAPTER 3 INFORMATION RETRIEVAL BASED ON QUERY EXPANSION AND LATENT SEMANTIC INDEXING 43 CHAPTER 3 INFORMATION RETRIEVAL BASED ON QUERY EXPANSION AND LATENT SEMANTIC INDEXING 3.1 INTRODUCTION This chapter emphasizes the Information Retrieval based on Query Expansion (QE) and Latent Semantic

More information

Noida institute of engineering and technology,greater noida

Noida institute of engineering and technology,greater noida Impact Of Word Sense Ambiguity For English Language In Web IR Prachi Gupta 1, Dr.AnuragAwasthi 2, RiteshRastogi 3 1,2,3 Department of computer Science and engineering, Noida institute of engineering and

More information

Semantic text features from small world graphs

Semantic text features from small world graphs Semantic text features from small world graphs Jurij Leskovec 1 and John Shawe-Taylor 2 1 Carnegie Mellon University, USA. Jozef Stefan Institute, Slovenia. jure@cs.cmu.edu 2 University of Southampton,UK

More information

Evaluating a Conceptual Indexing Method by Utilizing WordNet

Evaluating a Conceptual Indexing Method by Utilizing WordNet Evaluating a Conceptual Indexing Method by Utilizing WordNet Mustapha Baziz, Mohand Boughanem, Nathalie Aussenac-Gilles IRIT/SIG Campus Univ. Toulouse III 118 Route de Narbonne F-31062 Toulouse Cedex 4

More information

Assignment 4 CSE 517: Natural Language Processing

Assignment 4 CSE 517: Natural Language Processing Assignment 4 CSE 517: Natural Language Processing University of Washington Winter 2016 Due: March 2, 2016, 1:30 pm 1 HMMs and PCFGs Here s the definition of a PCFG given in class on 2/17: A finite set

More information

String Vector based KNN for Text Categorization

String Vector based KNN for Text Categorization 458 String Vector based KNN for Text Categorization Taeho Jo Department of Computer and Information Communication Engineering Hongik University Sejong, South Korea tjo018@hongik.ac.kr Abstract This research

More information

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and

More information

SAACO: Semantic Analysis based Ant Colony Optimization Algorithm for Efficient Text Document Clustering

SAACO: Semantic Analysis based Ant Colony Optimization Algorithm for Efficient Text Document Clustering SAACO: Semantic Analysis based Ant Colony Optimization Algorithm for Efficient Text Document Clustering 1 G. Loshma, 2 Nagaratna P Hedge 1 Jawaharlal Nehru Technological University, Hyderabad 2 Vasavi

More information

Review on Text Mining

Review on Text Mining Review on Text Mining Aarushi Rai #1, Aarush Gupta *2, Jabanjalin Hilda J. #3 #1 School of Computer Science and Engineering, VIT University, Tamil Nadu - India #2 School of Computer Science and Engineering,

More information

A. The following is a tentative list of parts of speech we will use to match an existing parser:

A. The following is a tentative list of parts of speech we will use to match an existing parser: API Functions available under technology owned by ACI A. The following is a tentative list of parts of speech we will use to match an existing parser: adjective adverb interjection noun verb auxiliary

More information

Joint Entity Resolution

Joint Entity Resolution Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute

More information

Chapter S:II. II. Search Space Representation

Chapter S:II. II. Search Space Representation Chapter S:II II. Search Space Representation Systematic Search Encoding of Problems State-Space Representation Problem-Reduction Representation Choosing a Representation S:II-1 Search Space Representation

More information

Automatic Detection of Section Membership for SAS Conference Paper Abstract Submissions: A Case Study

Automatic Detection of Section Membership for SAS Conference Paper Abstract Submissions: A Case Study 1746-2014 Automatic Detection of Section Membership for SAS Conference Paper Abstract Submissions: A Case Study Dr. Goutam Chakraborty, Professor, Department of Marketing, Spears School of Business, Oklahoma

More information

A hybrid method to categorize HTML documents

A hybrid method to categorize HTML documents Data Mining VI 331 A hybrid method to categorize HTML documents M. Khordad, M. Shamsfard & F. Kazemeyni Electrical & Computer Engineering Department, Shahid Beheshti University, Iran Abstract In this paper

More information

Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information

Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information Satoshi Sekine Computer Science Department New York University sekine@cs.nyu.edu Kapil Dalwani Computer Science Department

More information

Castanet: Using WordNet to Build Facet Hierarchies. Emilia Stoica and Marti Hearst School of Information, Berkeley

Castanet: Using WordNet to Build Facet Hierarchies. Emilia Stoica and Marti Hearst School of Information, Berkeley Castanet: Using WordNet to Build Facet Hierarchies Emilia Stoica and Marti Hearst School of Information, Berkeley Motivation Want to assign labels from multiple hierarchies Motivation Hot and Sweet Chicken:

More information

Ranking in a Domain Specific Search Engine

Ranking in a Domain Specific Search Engine Ranking in a Domain Specific Search Engine CS6998-03 - NLP for the Web Spring 2008, Final Report Sara Stolbach, ss3067 [at] columbia.edu Abstract A search engine that runs over all domains must give equal

More information

Identifying and Ranking Possible Semantic and Common Usage Categories of Search Engine Queries

Identifying and Ranking Possible Semantic and Common Usage Categories of Search Engine Queries Identifying and Ranking Possible Semantic and Common Usage Categories of Search Engine Queries Reza Taghizadeh Hemayati 1, Weiyi Meng 1, Clement Yu 2 1 Department of Computer Science, Binghamton university,

More information

Annotation by category - ELAN and ISO DCR

Annotation by category - ELAN and ISO DCR Annotation by category - ELAN and ISO DCR Han Sloetjes, Peter Wittenburg Max Planck Institute for Psycholinguistics P.O. Box 310, 6500 AH Nijmegen, The Netherlands E-mail: Han.Sloetjes@mpi.nl, Peter.Wittenburg@mpi.nl

More information

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Abstract We intend to show that leveraging semantic features can improve precision and recall of query results in information

More information

structure of the presentation Frame Semantics knowledge-representation in larger-scale structures the concept of frame

structure of the presentation Frame Semantics knowledge-representation in larger-scale structures the concept of frame structure of the presentation Frame Semantics semantic characterisation of situations or states of affairs 1. introduction (partially taken from a presentation of Markus Egg): i. what is a frame supposed

More information

CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES

CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES 70 CHAPTER 3 A FAST K-MODES CLUSTERING ALGORITHM TO WAREHOUSE VERY LARGE HETEROGENEOUS MEDICAL DATABASES 3.1 INTRODUCTION In medical science, effective tools are essential to categorize and systematically

More information

Parallel Concordancing and Translation. Michael Barlow

Parallel Concordancing and Translation. Michael Barlow [Translating and the Computer 26, November 2004 [London: Aslib, 2004] Parallel Concordancing and Translation Michael Barlow Dept. of Applied Language Studies and Linguistics University of Auckland Auckland,

More information

CHAPTER 2: DATA MODELS

CHAPTER 2: DATA MODELS CHAPTER 2: DATA MODELS 1. A data model is usually graphical. PTS: 1 DIF: Difficulty: Easy REF: p.36 2. An implementation-ready data model needn't necessarily contain enforceable rules to guarantee the

More information

Instructor: Stefan Savev

Instructor: Stefan Savev LECTURE 2 What is indexing? Indexing is the process of extracting features (such as word counts) from the documents (in other words: preprocessing the documents). The process ends with putting the information

More information

Semantic Web. Ontology Alignment. Morteza Amini. Sharif University of Technology Fall 95-96

Semantic Web. Ontology Alignment. Morteza Amini. Sharif University of Technology Fall 95-96 ه عا ی Semantic Web Ontology Alignment Morteza Amini Sharif University of Technology Fall 95-96 Outline The Problem of Ontologies Ontology Heterogeneity Ontology Alignment Overall Process Similarity (Matching)

More information

SYSTEMS FOR NON STRUCTURED INFORMATION MANAGEMENT

SYSTEMS FOR NON STRUCTURED INFORMATION MANAGEMENT SYSTEMS FOR NON STRUCTURED INFORMATION MANAGEMENT Prof. Dipartimento di Elettronica e Informazione Politecnico di Milano INFORMATION SEARCH AND RETRIEVAL Inf. retrieval 1 PRESENTATION SCHEMA GOALS AND

More information

Ontology Based Search Engine

Ontology Based Search Engine Ontology Based Search Engine K.Suriya Prakash / P.Saravana kumar Lecturer / HOD / Assistant Professor Hindustan Institute of Engineering Technology Polytechnic College, Padappai, Chennai, TamilNadu, India

More information

Text Data Pre-processing and Dimensionality Reduction Techniques for Document Clustering

Text Data Pre-processing and Dimensionality Reduction Techniques for Document Clustering Text Data Pre-processing and Dimensionality Reduction Techniques for Document Clustering A. Anil Kumar Dept of CSE Sri Sivani College of Engineering Srikakulam, India S.Chandrasekhar Dept of CSE Sri Sivani

More information

2. An implementation-ready data model needn't necessarily contain enforceable rules to guarantee the integrity of the data.

2. An implementation-ready data model needn't necessarily contain enforceable rules to guarantee the integrity of the data. Test bank for Database Systems Design Implementation and Management 11th Edition by Carlos Coronel,Steven Morris Link full download test bank: http://testbankcollection.com/download/test-bank-for-database-systemsdesign-implementation-and-management-11th-edition-by-coronelmorris/

More information

Encoding Words into String Vectors for Word Categorization

Encoding Words into String Vectors for Word Categorization Int'l Conf. Artificial Intelligence ICAI'16 271 Encoding Words into String Vectors for Word Categorization Taeho Jo Department of Computer and Information Communication Engineering, Hongik University,

More information

Natural Language Processing

Natural Language Processing Natural Language Processing Machine Learning Potsdam, 26 April 2012 Saeedeh Momtazi Information Systems Group Introduction 2 Machine Learning Field of study that gives computers the ability to learn without

More information

Introduction to Lexical Analysis

Introduction to Lexical Analysis Introduction to Lexical Analysis Outline Informal sketch of lexical analysis Identifies tokens in input string Issues in lexical analysis Lookahead Ambiguities Specifying lexers Regular expressions Examples

More information

EDMS. Architecture and Concepts

EDMS. Architecture and Concepts EDMS Engineering Data Management System Architecture and Concepts Hannu Peltonen Helsinki University of Technology Department of Computer Science Laboratory of Information Processing Science Abstract

More information

INF FALL NATURAL LANGUAGE PROCESSING. Jan Tore Lønning, Lecture 4, 10.9

INF FALL NATURAL LANGUAGE PROCESSING. Jan Tore Lønning, Lecture 4, 10.9 1 INF5830 2015 FALL NATURAL LANGUAGE PROCESSING Jan Tore Lønning, Lecture 4, 10.9 2 Working with texts From bits to meaningful units Today: 3 Reading in texts Character encodings and Unicode Word tokenization

More information

Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi.

Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi. Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi Lecture 18 Tries Today we are going to be talking about another data

More information

Predicting Messaging Response Time in a Long Distance Relationship

Predicting Messaging Response Time in a Long Distance Relationship Predicting Messaging Response Time in a Long Distance Relationship Meng-Chen Shieh m3shieh@ucsd.edu I. Introduction The key to any successful relationship is communication, especially during times when

More information

Removing Belady s Anomaly from Caches with Prefetch Data

Removing Belady s Anomaly from Caches with Prefetch Data Removing Belady s Anomaly from Caches with Prefetch Data Elizabeth Varki University of New Hampshire varki@cs.unh.edu Abstract Belady s anomaly occurs when a small cache gets more hits than a larger cache,

More information

Putting ontologies to work in NLP

Putting ontologies to work in NLP Putting ontologies to work in NLP The lemon model and its future John P. McCrae National University of Ireland, Galway Introduction In natural language processing we are doing three main things Understanding

More information

Improving Suffix Tree Clustering Algorithm for Web Documents

Improving Suffix Tree Clustering Algorithm for Web Documents International Conference on Logistics Engineering, Management and Computer Science (LEMCS 2015) Improving Suffix Tree Clustering Algorithm for Web Documents Yan Zhuang Computer Center East China Normal

More information