ISSN: ISO 9001:2008 Certified International Journal of Engineering Science and Innovative Technology (IJESIT) Volume 3, Issue 3, May 2014


Literature Survey on Knowledge Based Information Retrieval on Web

Samiksha Chakule, Ashwini Borse, Shalaka Jadhav, Dr. Mrs. Y.V. Haribhakta

Abstract - It has been observed that search engines perform poorly on ambiguous queries: the user has to browse through a large amount of information on the web to find results of interest, and web page references mapped to different meanings are mixed together in the result list. This increases the burden on the search engine and degrades its performance. Here we propose an approach that uses Wikipedia as a knowledge base for information retrieval on the web. Our objective is to resolve ambiguity in a query using the semantic knowledge of Wikipedia. Our module finds supporting words for a given query containing ambiguous terms. Wikipedia provides a large, structured knowledge base for computing semantic relatedness between texts, and it is a large source of information that is updated continuously. The focus is therefore on exploiting its knowledge to disambiguate the query dynamically by fetching supporting words from Wikipedia.

Index Terms - Information Retrieval, Knowledge Base, Search Engines, Web, Wikipedia.

I. INTRODUCTION

In the present world, the Internet is the biggest source of information, and surfing it has become indispensable for people to keep up with happenings around the world. Using a suitable resource and effective techniques for extracting information that matches the user's query is therefore very important. Search engines such as Google and Yahoo are information retrieval applications on the web. Google uses the PageRank algorithm to rank websites in its search results [12]. It works by counting the number and quality of links to a page to determine a rough estimate of how important the website is (source: Wikipedia).
The assumption behind this is that more important websites are likely to receive more links from other websites. However, search engines perform poorly on ambiguous queries. For example, the input word 'JAVA' can refer to 'Java Coffee', 'Java Programming Language' or 'Java Island'. As a result, web page references mapped to different meanings are mixed together in the result list, which increases the burden on the search engine and decreases its performance. A solution is to associate the appropriate sense with each token in the query. Various approaches have been proposed in which dictionaries [24], thesauri and ontologies are used as sense repositories that define the possible senses of words. WordNet has also been used as a sense repository; most of the work on relatedness and similarity measures has been developed using WordNet (Fellbaum, 1998). WordNet represents a well structured taxonomy organized in a meaningful way, but questions arise about its coverage, since it does not include information about named entities. Further advances in NLP [5] depend crucially on the availability of large amounts of world and domain knowledge. Among the available knowledge bases, Wikipedia provides a large amount of information about named entities and contains continuously updated knowledge for processing current information [11]. The strength of Wikipedia lies in its larger coverage compared to WordNet [26]. Every entry on Wikipedia is an article that defines or describes a single entity or concept, and a page is uniquely identified by its title. Every entity page is associated with one or more categories, each of which can have subcategories. Another resource is the disambiguation page, which is created for ambiguous names. Redirect pages are written to handle misspellings and capitalization variants, and act as alternate names for an entity.
A redirect page has no content of its own but sends the user to another page. Thus the overall knowledge of Wikipedia lies in its disambiguation pages, redirect pages, hyperlinks and categories [4]. Here an approach is proposed to overcome the well known knowledge acquisition bottleneck by deriving a knowledge resource from a very large, collaboratively created encyclopedia, namely Wikipedia, which at present contains over 44 million articles (source: Wikipedia). In this paper we discuss using Wikipedia as a knowledge base to determine the semantic relatedness of query tokens and thereby disambiguate ambiguous tokens in the query, leading to refined search results.

The rest of the paper is organized as follows. Section II discusses related work on disambiguation using knowledge bases such as Wikipedia and WordNet to obtain search results more relevant to user needs. Section III compares the various approaches, which are then analyzed in Section IV. The proposed method for query disambiguation is described in Section V, and the paper is concluded in Section VI.

II. RELATED WORK

Various approaches have been used for word sense disambiguation and thereby for obtaining more relevant search results. Mihalcea showed that Wikipedia can be used for word sense disambiguation [18]. Researchers have also used Wikipedia for question answering and for the disambiguation of named entities [4], with promising results. In CSAW, named entities are annotated and mapped to entities in the Wikipedia catalog [25], thus disambiguating them explicitly. In another case, a general collective disambiguation approach was proposed based on the assumption that coherent documents generally refer to entities from related documents or the same domain [15]. Word senses can also be described through structured features: the SSI (Structural Semantic Interconnections) algorithm has been used for the disambiguation of noun terms in WordNet and achieved high precision [20]. In yet another approach, an inverted index table is created on an annotated corpus to search for ambiguous terms quickly [10]; this improved retrieval speed for polysemous terms by a factor of 3 to 6. The following approaches are discussed in more detail.

A. Efficient Information Retrieval Using Dynamic Page Rank Algorithm

A Dynamic Page Rank algorithm has been proposed which uses WSD and acts as a layer on top of the search engine [1]. In WSD, all senses of all query words are first identified, and the appropriate sense is then assigned to each occurrence of a word in its textual context.
The query is enhanced through tokenization, stemming, stop word removal and sense disambiguation. The enhanced query is passed to the search engine, which returns results in the form of web pages. A dynamic page rank is then calculated for each web page by matching it against the enhanced query, and the results are rearranged from higher to lower dynamic page rank values, so that the user receives more meaningful content at the top of the search results. Mean Reciprocal Rank and Mean Average Precision were used to compare the efficiency of Google's PageRank algorithm and the Dynamic Page Rank algorithm. Mean Reciprocal Rank is the average of the reciprocal ranks of the results over a sample of queries; precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. A measure that uses both precision and recall is the average precision. Experimental results showed that this approach gives more efficient results than Google's existing PageRank algorithm and helps resolve ambiguity through query enhancement. However, since WSD is used, sufficient training time is required for each term.

B. Clustering of Word Senses for an Ambiguous Query

An approach has been proposed to increase the effectiveness of a search engine by forming clusters of word senses for an ambiguous query [24]. Cluster formation is based on the association concept of data mining and uses the vector space model of Gensim (a Python library) and The Free Dictionary. Web pages are extracted for the query and preprocessed by removing stop words, stemming with the Porter stemmer, and extracting only the nouns from the corpus. Each processed web page is saved as a document, and a document vector is formed for each retrieved page using Gensim. A community vector is then formed for the queried word using The Free Dictionary.
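The evaluation metrics mentioned above can be sketched in a few lines. The ranked result lists and relevance judgments below are invented for illustration; only the metric definitions come from the text.

```python
# Sketch of the evaluation metrics used above: Mean Reciprocal Rank (MRR)
# and (Mean) Average Precision. The queries and documents are hypothetical.

def reciprocal_rank(ranked, relevant):
    """1/rank of the first relevant result, 0 if none is retrieved."""
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / i
    return 0.0

def average_precision(ranked, relevant):
    """Mean of precision@k at each rank k where a relevant document appears."""
    hits, precisions = 0, []
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(relevant) if relevant else 0.0

# Two hypothetical queries: a ranked result list and its relevant set each.
runs = [
    (["d3", "d1", "d7"], {"d1", "d7"}),   # first relevant result at rank 2
    (["d2", "d5", "d9"], {"d2"}),         # first relevant result at rank 1
]
mrr = sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
mean_ap = sum(average_precision(r, rel) for r, rel in runs) / len(runs)
print(round(mrr, 3), round(mean_ap, 3))  # → 0.75 0.792
```

Averaging the reciprocal ranks gives the MRR; averaging the per-query average precisions gives the Mean Average Precision used to compare the two ranking algorithms.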
Cosine similarity is computed between the document vectors and the community vector, and finally the web pages are clustered based on this similarity. Experimental results showed that this approach is an effective way to form clusters for an ambiguous query, and that the user's intention behind an ambiguous query can be identified to a significant extent. However, it cannot support multiword queries.

C. Polysemy Handling in Description Logic Ontologies

An ontology is a set of related concepts in a domain and plays an important role in eliminating conceptual confusion and enabling knowledge reuse on the semantic web. If the lexical representation of an ontology contains polysemous terms (terms with two or more meanings), ambiguity may arise during its application and management. An approach has been proposed to handle polysemy in description logic ontologies by automatically disambiguating the terms in the ontology [7]. A term is disambiguated using its surrounding ontology symbols and its nearby terms in documents annotated with the ontology, with senses taken from WordNet. Because a sense has a unique meaning, it is used to replace the term as the lexical representation of concepts and properties. The right sense is assigned to a polysemous term by maximizing its relatedness with neighbouring terms, where the Extended Gloss Overlap measure is used to compute the relatedness between terms or senses. Experimental results showed that this method is effective and can achieve high precision. Only concept symbols and property symbols in the ontology are processed; individual symbols are not, since they usually have only one meaning, and terms that are not in WordNet are omitted.

D. Semantic Relatedness Computation Using Wikipedia

Wikipedia provides a large knowledge base for computing semantic relatedness between terms in a structured fashion. Simone Paolo Ponzetto and Michael Strube investigated the use of Wikipedia for computing semantic relatedness between terms and its application to a real-world NLP task, coreference resolution [26]. Semantic relatedness indicates how strongly two concepts are related, using all the relations between them. For a given pair of input words, the pair of articles whose titles contain the query words is obtained, and semantic relatedness between the query words is computed from the word based similarity of the articles and the distance between the articles' categories in the Wikipedia category tree. Path based measures, information content based measures and text overlap based measures are used to compute semantic relatedness. Results showed that semantic relatedness computed using the Wikipedia category network consistently correlates better with human judgments than a simple baseline based on Google counts, and that Wikipedia outperforms WordNet when applied to the largest available dataset designed for that purpose.
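The path based measures mentioned above reward category pairs that are close in the category tree. A minimal sketch, over an invented two-branch fragment of a category tree (not real Wikipedia data), with relatedness taken as 1/(1 + path length):

```python
from collections import deque

# Hypothetical fragment of a category tree: parent -> children.
edges = {
    "Beverages": ["Coffee"],
    "Coffee": ["Java Coffee"],
    "Computing": ["Programming languages"],
    "Programming languages": ["Java (programming language)"],
}

def undirected(graph):
    """Build an undirected adjacency map so paths can go up or down the tree."""
    adj = {}
    for parent, children in graph.items():
        for child in children:
            adj.setdefault(parent, set()).add(child)
            adj.setdefault(child, set()).add(parent)
    return adj

def path_length(adj, a, b):
    """Breadth-first search for the shortest path between two categories."""
    queue, seen = deque([(a, 0)]), {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # no path: the categories are unrelated in this fragment

adj = undirected(edges)
d = path_length(adj, "Java Coffee", "Coffee")  # adjacent nodes: distance 1
relatedness = 1 / (1 + d)                      # one simple path-based score
print(d, relatedness)  # → 1 0.5
```

In this toy fragment the two branches are disconnected, so categories across branches get no path at all; real measures normalize by the depth of the category tree, which this sketch omits.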
Structured knowledge sources thus yield more accurate relatedness computation [26]. Though the results do not fully meet user needs, they are better than plain Google search.

E. Semantic Relatedness Computation Using Wikipedia and WordNet

Computing semantic relatedness between texts requires a large and structured knowledge base. Here two knowledge sources, Wikipedia and WordNet, are combined to compute semantic relatedness between texts [17]. The proposed approach preprocesses the input texts and extracts sentences; concepts and n-grams are then extracted from the sentences, semantic relatedness between n-grams is computed with the help of WordNet and Wikipedia (WLM [19]), and a semantic matrix for the sentences, called the Enriched Concepts Matrix, is built. This matrix is used to compute semantic relatedness between sentences with the vector space model, and finally relatedness between the texts is computed from the relations between their sentences. The contribution of this method is that it can compute semantic relatedness between almost any pair of concepts and texts. It achieved a high correlation coefficient of 0.74, outperforming the other existing approaches and showing that integrating two knowledge bases can give more accurate results.

F. Computing Semantic Relatedness Using Wikipedia Links

Here the Wikipedia hyperlink structure is used to define relatedness between terms, instead of the category network or the text on the page [19]. The central component of this approach is the link: a manually defined connection between two manually disambiguated concepts. Anchor text is used to identify candidate articles for terms. Instead of term counts weighted by the probability of occurrence of each term, link counts weighted by the probability of occurrence of each link are used to measure the angle between the link vectors of the articles of interest. Similarity between terms is then measured as the similarity between their representative articles.
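The link-vector comparison just described can be sketched as follows. The article names and link sets are invented; the weighting here is a simple IDF-like inverse of how often a link target occurs across the toy corpus, standing in for the probability-weighted link counts of the actual measure.

```python
import math

# Hypothetical articles represented by the links they contain.
articles = {
    "Java (programming language)": {"Sun Microsystems", "Bytecode", "Compiler"},
    "Python (programming language)": {"Bytecode", "Compiler", "Guido van Rossum"},
    "Java (island)": {"Indonesia", "Jakarta"},
}

def link_weight(target):
    """Links that appear in fewer articles carry more weight (IDF-style)."""
    occurrences = sum(target in links for links in articles.values())
    return math.log(len(articles) / occurrences)

def link_vector(title):
    """Weighted vector of the links found on an article."""
    return {t: link_weight(t) for t in articles[title]}

def cosine(u, v):
    """Cosine of the angle between two sparse link vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = lambda w: math.sqrt(sum(x * x for x in w.values()))
    return dot / (norm(u) * norm(v)) if u and v else 0.0

sim_close = cosine(link_vector("Java (programming language)"),
                   link_vector("Python (programming language)"))
sim_far = cosine(link_vector("Java (programming language)"),
                 link_vector("Java (island)"))
print(sim_close > sim_far)  # the two programming articles share weighted links
```

Articles sharing many rare links end up with a small angle between their vectors, which is exactly why this measure needs only the link graph and not the article text.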
The advantage of this approach is that it requires far less data and fewer resources, since only the links on Wikipedia pages are considered. Experimental results showed that the Wikipedia link based measure consistently outperforms WikiRelate and previous approaches across all datasets. However, the measure cannot reach the efficacy of the ESA (Explicit Semantic Analysis) measure of Gabrilovich and Markovitch [23], [19].

G. Clustering of Web Pages in Search Results for Ambiguous Query

The approach proposed in this paper applies query reformulation strategies such as word substitution, spelling correction, URL stripping and stemming to the results of web search engines, to help users obtain more relevant results for their query. The method uses a number of semantic annotation techniques with knowledge bases such as WordNet and Wikipedia to obtain the different senses of each query term [13]. A two step approach is used: first, information is retrieved using the query; second, clustering is performed on the result set, with K-means [3] and hierarchical clustering as the common clustering algorithms. Each cluster is assigned a label with a particular sense, which defines a category; this label is recovered from the cluster's feature vector. Finally, the performance of the system is evaluated using precision, recall and F-measure. The main advantage of clustering web search results is that it enables users to easily browse groups of web page references with the same meaning. One limitation is that the number of search results returned by the search engine is fairly small, and the document information provided by the search engine is inadequate because only URLs are provided.

H. Automatic Word Sense Disambiguation Using Wikipedia

A large number of concepts on Wikipedia are explicitly connected to their corresponding articles through piped links, which can be regarded as sense annotations, especially for ambiguous terms. For example, in "In 1834, Alan was admitted to the [[bar (law)|bar]] at the age of twenty-three, and entered private practice in Boston" [18], 'bar (law)' is a sense annotation that disambiguates the ambiguous term 'bar'. These hyperlinks are used to generate a sense tagged corpus, which is built in three steps. First, all Wikipedia paragraphs containing an occurrence of the given ambiguous term in the form of a piped link are extracted, considering only occurrences spelled in lower case. Next, all the labels for the ambiguous term are collected by extracting the leftmost component of each piped link. Finally, these labels are manually mapped to WordNet senses and the sense tagged corpus is created. The task of the word sense disambiguation system is then to automatically learn, from these sense annotated examples, a disambiguation model that can predict the correct sense of a new, previously unseen occurrence of the word.
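The label-collection step above, extracting the leftmost component of a piped link when the surface form matches the lower-case ambiguous word, can be sketched with a regular expression over standard [[target|surface]] wiki markup. The paragraph text follows the example quoted above.

```python
import re

# Example paragraph containing a piped link, as in the "bar (law)" example.
paragraph = ("In 1834, Alan was admitted to the [[bar (law)|bar]] at the age "
             "of twenty-three, and entered private practice in Boston.")

def sense_annotations(text, ambiguous_word):
    """Return link targets (the leftmost piped-link component) whose surface
    form is the given lower-case ambiguous word."""
    senses = []
    for target, surface in re.findall(r"\[\[([^\]|]+)\|([^\]]+)\]\]", text):
        if surface == ambiguous_word and surface.islower():
            senses.append(target)
    return senses

print(sense_annotations(paragraph, "bar"))  # → ['bar (law)']
```

Collecting these targets over many paragraphs yields the label set that is then manually mapped to WordNet senses; the lower-case check mirrors the restriction described in the text.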
This approach led to accurate sense classifiers with an average relative error rate reduction of 44% compared to the most frequent sense baseline, and 30% compared to the Lesk-corpus baseline. One exception was that no accuracy improvement was observed for words for which very few hyperlinks could be collected from Wikipedia [18].

III. COMPARISON BETWEEN VARIOUS APPROACHES

The approaches discussed in the above section have their own advantages and limitations, which are summarized in Table 1.

TABLE 1. COMPARISON BETWEEN VARIOUS APPROACHES

1. Dynamic page rank algorithm. Advantages: gives more efficient results than Google's existing PageRank algorithm. Disadvantages: as WSD is used, sufficient training time is required for each word.

2. Efficient information retrieval for ambiguous query (clustering of word senses). Advantages: the user's intention behind an ambiguous query can be identified to a significant extent, and the approach is an effective way to form clusters for an ambiguous query. Disadvantages: it does not support multiword queries.

3. Handling polysemy in description logic ontologies. Advantages: an effective method for polysemy handling in a semi-automatic process. Disadvantages: only the concepts present in the ontology are disambiguated, and terms absent from WordNet are omitted.

4. Computing semantic relatedness using Wikipedia. Advantages: the semantic relatedness measures computed with this method are useful in the NLP task of coreference resolution. Disadvantages: using Wikipedia alone yields slightly worse performance in a coreference resolution system than WordNet.

5. Computing semantic relatedness using Wikipedia and WordNet as knowledge bases. Advantages: achieved a high correlation coefficient of 0.74, outperforming existing state-of-the-art approaches, including ESA. Disadvantages: more computational time is required due to the use of n-grams.

6. Computing semantic relatedness using Wikipedia links. Advantages: requires far less data and fewer resources, since only the hyperlinks on Wikipedia pages are used; the Wikipedia link based measure consistently outperforms WikiRelate. Disadvantages: the measure is not as good as the ESA measure.

7. Clustering of web pages in search results for ambiguous query. Advantages: enables users to easily browse groups of web page references with the same meaning. Disadvantages: the number of results returned by the search engine is fairly small, and the document information is inadequate since only URLs are provided.

8. Using Wikipedia for automatic word sense disambiguation. Advantages: relative error reduction of 30-44% by using Wikipedia as a source of sense annotations. Disadvantages: no accuracy improvement is obtained for ambiguous words for which very few labels could be collected from Wikipedia.

IV. ANALYSIS OF VARIOUS APPROACHES

As Table 1 shows, various approaches have been proposed to deal with ambiguous queries and thereby obtain relevant search results. Queries of shorter length are more prone to ambiguity [13], so increasing the query length can help in disambiguating ambiguous terms. WordNet and Wikipedia have been used separately as sense repositories, or combined to obtain a larger one. It has been observed that Wikipedia can give better results for word sense disambiguation than WordNet, because it has large coverage of named entities and is well structured [26]. The Wikipedia link structure can help in disambiguating concepts, and strong links between Wikipedia articles help in finding relations between concepts [9]. In order to disambiguate query tokens dynamically on the web, our proposed method therefore uses the online encyclopedia Wikipedia as its knowledge base. Other methods for word sense disambiguation using Wikipedia have used Wikipedia dumps, which can be downloaded at wikimedia.org, to access the knowledge available in Wikipedia.
Since sense disambiguation should be done dynamically on the web, the proposed approach instead uses the Zend Rest Client component to access Wikipedia data through the MediaWiki API [2].

A. Motivation for the Proposed Approach

From the various approaches we found that an enormous amount of knowledge is available in Wikipedia in the form of disambiguation pages, redirect pages, hyperlinks and categories. Semantic knowledge rests with the Wikipedia categories and strongly reflects human judgement [22]. Redirect pages provide alternate names for entities, disambiguation pages exist for ambiguous entities and can therefore help in disambiguating terms, and the hyperlink structure helps in finding the relatedness between two concepts. Wikipedia is thus an appealing source for disambiguating ambiguous query tokens dynamically on the web. In the majority of existing approaches, word sense disambiguation has been done desktop based, using WordNet or Wikipedia dumps, but dumps can hardly fulfil the requirement of an up-to-date knowledge source. Behind this large knowledge source there exists the MediaWiki API, which allows its content to be accessed online with the help of the Zend Rest Client component. This continuously updated knowledge base, along with its larger coverage compared to WordNet, motivated us to use Wikipedia online for word sense disambiguation.

B. Objectives of the Proposed Approach

By studying the advantages and disadvantages of the approaches above, the objectives of the proposed system are derived as follows:

- Disambiguate ambiguous query tokens with the help of the semantic knowledge of Wikipedia.
- Design an algorithm that supports multiword queries.
- Retrieve the possible senses available in Wikipedia for an ambiguous term.
- Enable the user to get more refined search results.
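The kind of MediaWiki API request involved can be sketched as follows. The original work issues such requests from PHP via the Zend Rest Client; here the same request is shown as a plain URL, built but not sent, using the API's standard query parameters.

```python
from urllib.parse import urlencode

# Standard MediaWiki API endpoint for English Wikipedia.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"

def disambiguation_links_url(title):
    """Build a URL asking the API for the links on a page, e.g. the sense
    hyperlinks listed on a disambiguation page."""
    params = {
        "action": "query",    # standard read module
        "titles": title,      # page to inspect
        "prop": "links",      # return the links on the page
        "pllimit": "max",     # as many links per request as allowed
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)

url = disambiguation_links_url("Java (disambiguation)")
print(url)
```

Sending this URL with any HTTP client returns the disambiguation page's hyperlinks as JSON, which is exactly the per-token sense list the proposed approach needs at query time instead of a downloaded dump.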

V. PROPOSED APPROACH TO AMBIGUITY RESOLUTION

From the analysis of the above approaches it can be said that Wikipedia is the most suitable sense repository on the web, because of its continuously updated information and its large number of named entities. Mihalcea and Csomai worked on two algorithms in which the feature vector for an ambiguous word consists of the three words to its left and right along with their parts of speech [18]. This concept is carried over into the proposed approach in the form of a feature set. The input is the query words, and the expected output is search results more refined than Google's. Our objective is to find the appropriate sense of the words in the query using Wikipedia; here a sense is a feature set for a word, containing supporting words that give more information about the query word. The proposed approach consists of three modules: the Wikipedia module, the Google module and the Refined Final module.

A. Wikipedia Module

Generally, pages are written on Wikipedia for all possible senses of a word, with the page title containing the sense in parentheses, e.g. Bank (geography), where 'geography' is a sense of the ambiguous word 'Bank'. All the possible senses of a word are shown as hyperlinks on its disambiguation page, and the existence of a disambiguation page for an entity indicates that it is ambiguous according to Wikipedia. Based on the occurrence of a disambiguation page for a token, it is decided whether multiple pages are possible for that query token. For a single word query, user intervention is used: the user is shown a pop-up menu containing the senses available in Wikipedia for the query token, where each sense is a hyperlink on the disambiguation page, e.g. Bank (geography), Bank (surname). The user selects the relevant sense from the pop-up menu, and the feature set is obtained from the Wikipedia page for the selected sense. For a multiword query, if a disambiguation page exists for a query token, the hyperlinks on the disambiguation page are followed.
Multiple senses are obtained by extracting a feature set from each page, so multiple feature sets are now available for the query token. The same procedure is repeated for the other tokens, and relatedness between the feature sets of different query tokens is then computed using text overlap. The final output of this module is the feature set from the Wikipedia pages for the words in the query. The Zend Rest Client component is used to access Wikipedia data through the MediaWiki API [2].

B. Google Module

In order to refine Google's search results, it is necessary to check what Google has retrieved for the same query words. For this the Google Ajax API is used. Since the relevancy of Google search results decreases with increasing page number, only the top 15 search results are considered, and a feature set is retrieved for each of them.

C. Refined Final Module

Here the similarity between the two feature sets, the one from the Wikipedia module and the one from the Google module, is computed using text overlap. The page rank is decided based on this similarity, and the refined search results are shown to the user. The problem with this approach is that, since only text overlap between feature sets is considered, tokens are disambiguated at the syntactic level rather than using the semantic knowledge derived from Wikipedia. For this reason the approach is modified, as discussed below.

D. Modified Approach

The basic semantic knowledge in Wikipedia that involves human judgement rests in the Wikipedia categories [22]; hence Wikipedia categories are used in this modified approach to compute the semantic relatedness between tokens. Disambiguation pages are normally written on Wikipedia to avoid conflicts when two or more topics could have the same natural page title, and they are aids in searching.
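The text-overlap similarity used to compare the Wikipedia feature set against each Google-result feature set can be sketched as follows. The paper does not fix a formula, so Jaccard overlap of token sets is used here as one simple realisation, and the feature sets themselves are invented.

```python
# Minimal sketch of text overlap between two feature sets, assuming Jaccard
# overlap (intersection over union) as the similarity measure.
def overlap(feature_set_a, feature_set_b):
    a, b = set(feature_set_a), set(feature_set_b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical feature sets for the query token 'Java' in its island sense.
wiki_features = {"island", "indonesia", "jakarta", "volcano"}
page_features = {"indonesia", "jakarta", "travel", "beach"}
print(round(overlap(wiki_features, page_features), 2))  # → 0.33
```

Ranking the top search results by this score is what the Refined Final module does; the modified approach then replaces this purely lexical comparison with category-based semantic matching.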
As in the approach discussed above, the disambiguation pages available on Wikipedia for ambiguous entities are used to disambiguate the query tokens. Three different cases are considered for a query containing two tokens.

Case 1: A disambiguation page is found for both query tokens. This implies that both tokens are ambiguous according to Wikipedia. In this case a separate list of senses is maintained for each disambiguation page. In the previous approach the sense of a word was its feature set; here the definition has evolved to the hyperlinks present on the disambiguation page, e.g. Bank (geography), Bank (surname), where 'geography' and 'surname' are senses of the word 'Bank'. The intersection of both lists is computed. If the intersection set is not empty, the sense found common to both lists is taken. If the intersection set is empty, the redirects for each hyperlink on the first token's disambiguation page are followed and the second token is traced in those redirects; redirects are considered because they represent alternate names for entities. If the token can be traced in a redirect, the sense corresponding to that redirect's hyperlink on the disambiguation page is taken. If the second token cannot be traced in the redirects of the first token's disambiguation page, the redirects for the hyperlinks on the second token's disambiguation page are considered and the procedure is repeated for the first token. If the tokens still cannot be traced in redirects, Wikipedia page categories are considered: each hyperlink on the disambiguation pages is followed, one by one, and the intersection of the categories of the corresponding pages is computed. Categories are considered here to find the semantic relatedness between the tokens. If a category match is obtained, the tokens can be said to be semantically related, and supporting words are extracted from the pages corresponding to the hyperlinks for which the category match was obtained.

Case 2: A disambiguation page is found for only one token. This implies that only one token is ambiguous according to Wikipedia. In this case each link on the disambiguation page is followed and its redirects are obtained.
The second token is traced in those redirects. If it can be found, the sense corresponding to that redirect's hyperlink on the disambiguation page is taken. If it cannot be traced, a list of hyperlinks is maintained for each link on the disambiguation page, and another list of hyperlinks is maintained for the Wikipedia page obtained for the second token; the intersection of the two lists is then computed. If the intersection set is not empty, the words found common to both lists are taken as the sense. If it is empty, the links on the disambiguation page are followed one by one and the intersection of the page categories is computed. If a category match is obtained, the tokens can be said to be semantically related, and supporting words are extracted from the Wikipedia pages corresponding to the hyperlinks for which the category match was obtained.

Case 3: No disambiguation page is found for either token. This implies that neither token is ambiguous according to Wikipedia. Since links among pages connect articles that are semantically related and likely in the same context [9], in this case the intersection of the hyperlinks on both pages is computed and the most important common words are retrieved. It is observed that ambiguity in queries is due to short query length, which is on average 2.33 terms on a popular search engine [13]. For this reason supporting words for the query words are found through the above module; these are referred to as the feature set. The feature set obtained in all three cases is then used to get refined search results from Google, since the greater the length of the input query (i.e. the more related words it contains), the more precise the results.

VI. CONCLUSION

By studying the various approaches, our own approach has been designed.
Since the availability of a large knowledge base helps in getting search results more relevant to user needs, we have emphasized exploiting the knowledge available in Wikipedia. Results are awaited for queries containing two tokens. If the experiment proves successful, the approach can in future be extended to more than two ambiguous query tokens.

REFERENCES
[1] Rekha Jain, Sulochana Nathawat, Rupal Bhargava, G.N. Purohit. Efficient information retrieval for ambiguous words.
[2] Hook into Wikipedia information using PHP and the MediaWiki API.
[3] Khaled Alsabti, Sanjay Ranka, and Vineet Singh. An efficient k-means clustering algorithm. In Proceedings of the IPPS/SPDP Workshop on High Performance Data Mining.

[4] Razvan Bunescu and Marius Pasca. Using encyclopedic knowledge for named entity disambiguation. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL-06).
[5] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch.
[6] Silviu Cucerzan. Large-scale named entity disambiguation based on Wikipedia data. In Proceedings of EMNLP-CoNLL 2007.
[7] Jun Fang, Lei Guo, and Ning Yang. Handling polysemy in description logic ontologies. In Fuzzy Systems and Knowledge Discovery (FSKD), 2010 Seventh International Conference on.
[8] Christiane Fellbaum. WordNet: An Electronic Lexical Database. Bradford Books.
[9] Angela Fogarolli. In ICSC.
[10] Miao Hai and Zhang Yang-sen. Construction of polysemy table and search engine based on inverted index. In Fuzzy Systems and Knowledge Discovery (FSKD), International Conference on.
[11] Todd Holloway, Miran Bozicevic, and Katy Börner. Analyzing and visualizing the semantic coverage of Wikipedia and its authors.
[12] Diana Inkpen. Information retrieval on the internet.
[13] Andreas Kanavos, Evangelos Theodoridis, and Athanasios K. Tsakalidis. Extracting knowledge from web search engine results. In ICTAI.
[14] Robert Krovetz. Homonymy and polysemy in information retrieval. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics.
[15] Sayali Kulkarni, Amit Singh, Ganesh Ramakrishnan, and Soumen Chakrabarti. Collective annotation of Wikipedia entities in web text. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
[16] Chenliang Li, Aixin Sun, and Anwitaman Datta. A generalized method for word sense disambiguation based on Wikipedia. In Proceedings of the 33rd European Conference on Advances in Information Retrieval.
[17] R. Malekzadeh, J. Bagherzadeh, and A. Noroozi. A hybrid method based on WordNet and Wikipedia for computing semantic relatedness between texts. In Artificial Intelligence and Signal Processing (AISP), CSI International Symposium on.
[18] Rada Mihalcea. Using Wikipedia for automatic word sense disambiguation. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference.
[19] David Milne and Ian H. Witten. An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. In Proceedings of AAAI 2008.
[20] Roberto Navigli and Paola Velardi. Structural semantic interconnections: A knowledge-based approach to word sense disambiguation.
[21] Simone Paolo Ponzetto and Michael Strube. Knowledge derived from Wikipedia for computing semantic relatedness.
[22] Priya Radhakrishnan and Vasudeva Varma. Extracting semantic knowledge from Wikipedia category names. In Proceedings of the 2013 Workshop on Automated Knowledge Base Construction.
[23] Kira Radinsky, Eugene Agichtein, Evgeniy Gabrilovich, and Shaul Markovitch. A word at a time: Computing word relatedness using temporal semantic analysis. In Proceedings of the 20th International Conference on World Wide Web.
[24] R. K. Roul and S. K. Sahay. An effective information retrieval for ambiguous query.
[25] Amit Singh, Sayali Kulkarni, Somnath Banerjee, Ganesh Ramakrishnan, and Soumen Chakrabarti. Curating and searching the annotated web. In SIGKDD Conference, system demonstration.
[26] Michael Strube and Simone Paolo Ponzetto. WikiRelate! Computing semantic relatedness using Wikipedia. In Proceedings of the 21st National Conference on Artificial Intelligence - Volume 2.

AUTHOR BIOGRAPHIES
Samiksha V. Chakule is pursuing a B.Tech degree in Information Technology at the College of Engineering, Pune (COEP). Her current research interests include web mining, information retrieval, databases, and data mining.
Ashwini S. Borse is pursuing a B.Tech degree in Information Technology at the College of Engineering, Pune (COEP). Her current research interests include information retrieval, databases, web mining, and data mining.
Shalaka S. Jadhav is pursuing a B.Tech degree in Information Technology at the College of Engineering, Pune (COEP). Her current research interests include information retrieval, databases, web mining, and data mining.
Dr. Mrs. Y. V. Haribhakta is a professor in the Department of Computer Science and IT, COEP. She has 15 years of teaching experience and holds M.E. and Ph.D. degrees in computer engineering. Her research interests include text mining and NLP. She is currently a member of ISTE and AMIEE.


More information

Domain-Specific Semantic Relatedness From Wikipedia: Can A Course Be Transferred?

Domain-Specific Semantic Relatedness From Wikipedia: Can A Course Be Transferred? Domain-Specific Semantic Relatedness From Wikipedia: Can A Course Be Transferred? Beibei Yang University of Massachusetts Lowell Lowell, MA 01854 byang1@cs.uml.edu Jesse M. Heines University of Massachusetts

More information

Query Session Detection as a Cascade

Query Session Detection as a Cascade Query Session Detection as a Cascade Extended Abstract Matthias Hagen, Benno Stein, and Tino Rüb Bauhaus-Universität Weimar .@uni-weimar.de Abstract We propose a cascading method

More information

University of Amsterdam at INEX 2010: Ad hoc and Book Tracks

University of Amsterdam at INEX 2010: Ad hoc and Book Tracks University of Amsterdam at INEX 2010: Ad hoc and Book Tracks Jaap Kamps 1,2 and Marijn Koolen 1 1 Archives and Information Studies, Faculty of Humanities, University of Amsterdam 2 ISLA, Faculty of Science,

More information

A Session-based Ontology Alignment Approach for Aligning Large Ontologies

A Session-based Ontology Alignment Approach for Aligning Large Ontologies Undefined 1 (2009) 1 5 1 IOS Press A Session-based Ontology Alignment Approach for Aligning Large Ontologies Editor(s): Name Surname, University, Country Solicited review(s): Name Surname, University,

More information

Disambiguating Search by Leveraging a Social Context Based on the Stream of User s Activity

Disambiguating Search by Leveraging a Social Context Based on the Stream of User s Activity Disambiguating Search by Leveraging a Social Context Based on the Stream of User s Activity Tomáš Kramár, Michal Barla and Mária Bieliková Faculty of Informatics and Information Technology Slovak University

More information

Improving Suffix Tree Clustering Algorithm for Web Documents

Improving Suffix Tree Clustering Algorithm for Web Documents International Conference on Logistics Engineering, Management and Computer Science (LEMCS 2015) Improving Suffix Tree Clustering Algorithm for Web Documents Yan Zhuang Computer Center East China Normal

More information

MEASURING SEMANTIC SIMILARITY BETWEEN WORDS AND IMPROVING WORD SIMILARITY BY AUGUMENTING PMI

MEASURING SEMANTIC SIMILARITY BETWEEN WORDS AND IMPROVING WORD SIMILARITY BY AUGUMENTING PMI MEASURING SEMANTIC SIMILARITY BETWEEN WORDS AND IMPROVING WORD SIMILARITY BY AUGUMENTING PMI 1 KAMATCHI.M, 2 SUNDARAM.N 1 M.E, CSE, MahaBarathi Engineering College Chinnasalem-606201, 2 Assistant Professor,

More information

Information Retrieval

Information Retrieval Multimedia Computing: Algorithms, Systems, and Applications: Information Retrieval and Search Engine By Dr. Yu Cao Department of Computer Science The University of Massachusetts Lowell Lowell, MA 01854,

More information

Personalized Terms Derivative

Personalized Terms Derivative 2016 International Conference on Information Technology Personalized Terms Derivative Semi-Supervised Word Root Finder Nitin Kumar Bangalore, India jhanit@gmail.com Abhishek Pradhan Bangalore, India abhishek.pradhan2008@gmail.com

More information

Tulip: Lightweight Entity Recognition and Disambiguation Using Wikipedia-Based Topic Centroids. Marek Lipczak Arash Koushkestani Evangelos Milios

Tulip: Lightweight Entity Recognition and Disambiguation Using Wikipedia-Based Topic Centroids. Marek Lipczak Arash Koushkestani Evangelos Milios Tulip: Lightweight Entity Recognition and Disambiguation Using Wikipedia-Based Topic Centroids Marek Lipczak Arash Koushkestani Evangelos Milios Problem definition The goal of Entity Recognition and Disambiguation

More information

Text Mining Research: A Survey

Text Mining Research: A Survey Text Mining Research: A Survey R.Janani 1, Dr. S.Vijayarani 2 PhD Research Scholar, Dept. of Computer Science, School of Computer Science and Engineering, Bharathiar University, Coimbatore, India 1 Assistant

More information

Langforia: Language Pipelines for Annotating Large Collections of Documents

Langforia: Language Pipelines for Annotating Large Collections of Documents Langforia: Language Pipelines for Annotating Large Collections of Documents Marcus Klang Lund University Department of Computer Science Lund, Sweden Marcus.Klang@cs.lth.se Pierre Nugues Lund University

More information