ISSN: ISO 9001:2008 Certified International Journal of Engineering Science and Innovative Technology (IJESIT) Volume 3, Issue 3, May 2014


Literature Survey on Knowledge Based Information Retrieval on Web

Samiksha Chakule, Ashwini Borse, Shalaka Jadhav, Dr. Mrs. Y.V. Haribhakta

Abstract - It has been observed that search engines perform poorly on ambiguous queries: the user has to browse through a large amount of information on the web to find results of interest, and web page references mapped to different meanings are mixed together in the result list. This increases the burden on the search engine and degrades its performance. Here we propose an approach that uses Wikipedia as a knowledge base for information retrieval on the web. Our objective is to resolve ambiguity in a query using the semantic knowledge of Wikipedia. Our module finds supporting words for a given query containing ambiguous terms. Wikipedia provides a large, structured knowledge base for computing semantic relatedness between texts, and it is a large source of information that is updated continuously. The focus is therefore on exploiting its knowledge to disambiguate the query dynamically by fetching supporting words from Wikipedia.

Index Terms - Information Retrieval, Knowledge Base, Search Engines, Web, Wikipedia.

I. INTRODUCTION

In the present world, the Internet is the biggest source of information, and surfing it has become indispensable for people to keep up with happenings around the world. Using a suitable resource and effective techniques for extracting information that matches the user's query is therefore very important. Search engines such as Google and Yahoo are information retrieval applications on the web. Google uses the PageRank algorithm to rank websites in its search results [12]. It works by counting the number and quality of links to a page to determine a rough estimate of how important the website is (source: Wikipedia).
The assumption behind this is that more important websites are likely to receive more links from other websites. However, search engines perform poorly on ambiguous queries. For example, the input word 'JAVA' can refer to 'Java Coffee', 'Java Programming Language' or 'Java Island'. As a result, web page references mapped to different meanings are mixed together in the result list, which increases the burden on the search engine and decreases its performance. A solution is to associate the appropriate sense with each token in the query. Various approaches have been proposed in which dictionaries [24], thesauri and ontologies are used as sense repositories that define the possible senses of words. WordNet has also been used as a sense repository; most of the work on relatedness and similarity measures has been developed using WordNet (Fellbaum, 1998). WordNet represents a well structured taxonomy organized in a meaningful way, but questions arise about its coverage, since it does not include information about named entities. Further advances in NLP [5] depend crucially on the availability of large amounts of world and domain knowledge. Among the available knowledge bases, Wikipedia provides a large amount of information about named entities and contains continuously updated knowledge for processing current information [11]. The strength of Wikipedia lies in its larger coverage compared to WordNet [26]. Every entry on Wikipedia is an article that defines or describes a single entity or concept, and a page is uniquely identified by its title. Every entity page is associated with one or more categories, each of which can have subcategories. Another resource is the disambiguation page, which is created for ambiguous names. Redirect pages are written to handle misspellings and capitalization variants, and act as alternate names for an entity.
A redirect page has no content of its own but sends the user to another page. Thus the overall knowledge of Wikipedia lies in its disambiguation pages, redirect pages, hyperlinks and categories [4]. Here an approach is proposed to overcome the well known knowledge acquisition bottleneck by deriving a knowledge resource from a very large, collaboratively created encyclopedia, namely Wikipedia, which at present contains over 44 million articles (source: Wikipedia). In this paper we discuss using Wikipedia as a knowledge base to determine the semantic relatedness of query tokens and thereby disambiguate ambiguous tokens in the query, leading to refined search results.

The rest of the paper is organized as follows. Section II discusses related work on disambiguation using knowledge bases such as Wikipedia and WordNet to obtain search results more relevant to user needs. Section III compares the various approaches, which are then analyzed in Section IV. The proposed method for query disambiguation is described in Section V, and the paper is concluded in Section VI.

II. RELATED WORK

Various approaches have been used for word sense disambiguation and thereby for obtaining more relevant search results. Mihalcea showed that Wikipedia can be used for word sense disambiguation [18]. Researchers have also used Wikipedia for question answering and for the disambiguation of named entities [4], with promising results. In CSAW, named entities are annotated and mapped to entities in the Wikipedia catalog [25], thus disambiguating them explicitly. In another case, a general collective disambiguation approach was proposed based on the assumption that coherent documents generally refer to entities from related documents or the same domain [15]. Word senses can also be described through structured features: the SSI (Structural Semantic Interconnections) algorithm has been used for the disambiguation of noun terms in WordNet and achieved high precision [20]. In yet another approach, an inverted index table is created on an annotated corpus to search for ambiguous terms quickly [10]; this improved retrieval speed for polysemous terms by a factor of 3 to 6. The following approaches are discussed in more detail.

A. Efficient Information Retrieval Using Dynamic Page Rank Algorithm

A Dynamic Page Rank algorithm has been proposed which uses WSD and acts as a layer on top of the search engine [1]. In WSD, all senses of all query words are first identified, and the appropriate sense is then assigned to each occurrence of a word in its textual context.
The query is enhanced through tokenization, stemming, stop word removal and sense disambiguation. The enhanced query is passed to the search engine, which returns results in the form of web pages. A dynamic page rank is then calculated for each web page by matching it against the enhanced query, and the results are rearranged from higher to lower dynamic page rank values, so that the user receives more meaningful content at the top of the search results. Mean Reciprocal Rank and Mean Average Precision were used to compare the efficiency of Google's PageRank algorithm and the Dynamic Page Rank algorithm. Mean Reciprocal Rank is the average of the reciprocal ranks of the results over a sample of queries; precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. A measure that uses both precision and recall is the average precision. Experimental results showed that this approach gives more efficient results than Google's existing PageRank algorithm and helps resolve ambiguity through query enhancement. However, since WSD is used, sufficient training time is required for each term.

B. Clustering of Word Senses for an Ambiguous Query

An approach has been proposed to increase the effectiveness of a search engine by forming clusters of word senses for an ambiguous query [24]. Cluster formation is based on the association concept of data mining and uses the vector space model of Gensim (a Python library) and The Free Dictionary. Web pages are extracted for the query and preprocessed by removing stop words, stemming with the Porter stemmer, and extracting only the nouns from the corpus. Each processed web page is saved as a document, and a document vector is formed for each retrieved page using Gensim. A community vector is then formed for the queried word using The Free Dictionary.
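The evaluation metrics mentioned above can be sketched in a few lines. The ranked result lists and relevance judgments below are invented for illustration; only the metric definitions come from the text.

```python
# Sketch of the evaluation metrics used above: Mean Reciprocal Rank (MRR)
# and (Mean) Average Precision. The queries and documents are hypothetical.

def reciprocal_rank(ranked, relevant):
    """1/rank of the first relevant result, 0 if none is retrieved."""
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / i
    return 0.0

def average_precision(ranked, relevant):
    """Mean of precision@k at each rank k where a relevant document appears."""
    hits, precisions = 0, []
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(relevant) if relevant else 0.0

# Two hypothetical queries: a ranked result list and its relevant set each.
runs = [
    (["d3", "d1", "d7"], {"d1", "d7"}),   # first relevant result at rank 2
    (["d2", "d5", "d9"], {"d2"}),         # first relevant result at rank 1
]
mrr = sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
mean_ap = sum(average_precision(r, rel) for r, rel in runs) / len(runs)
print(round(mrr, 3), round(mean_ap, 3))  # → 0.75 0.792
```

Averaging the reciprocal ranks gives the MRR; averaging the per-query average precisions gives the Mean Average Precision used to compare the two ranking algorithms.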
Cosine similarity is computed between the document vectors and the community vector, and finally the web pages are clustered based on this similarity. Experimental results showed that this approach is an effective way to form clusters for an ambiguous query, and that the user's intention behind an ambiguous query can be identified to a significant extent. However, it cannot support multiword queries.

C. Polysemy Handling in Description Logic Ontologies

An ontology is a set of related concepts in a domain and plays an important role in eliminating conceptual confusion and enabling knowledge reuse on the semantic web. If the lexical representation of an ontology contains polysemous terms (terms with two or more meanings), ambiguity may arise during its application and management. An approach has been proposed to handle polysemy in description logic ontologies by automatically disambiguating the terms in the ontology [7]. A term is disambiguated using its surrounding ontology symbols and its nearby terms in documents annotated with the ontology, with senses taken from WordNet. Because a sense has a unique meaning, it is used to replace the term as the lexical representation of concepts and properties. The right sense is assigned to a polysemous term by maximizing its relatedness with neighbouring terms, where the Extended Gloss Overlap measure is used to compute the relatedness between terms or senses. Experimental results showed that this method is effective and can achieve high precision. Only concept symbols and property symbols in the ontology are processed; individual symbols are not, since they usually have only one meaning, and terms that are not in WordNet are omitted.

D. Semantic Relatedness Computation Using Wikipedia

Wikipedia provides a large knowledge base for computing semantic relatedness between terms in a structured fashion. Simone Paolo Ponzetto and Michael Strube investigated the use of Wikipedia for computing semantic relatedness between terms and its application to a real-world NLP task, coreference resolution [26]. Semantic relatedness indicates how strongly two concepts are related, using all the relations between them. For a given pair of input words, the pair of articles whose titles contain the query words is obtained, and semantic relatedness between the query words is computed from the word based similarity of the articles and the distance between the articles' categories in the Wikipedia category tree. Path based measures, information content based measures and text overlap based measures are used to compute semantic relatedness. Results showed that semantic relatedness computed using the Wikipedia category network consistently correlates better with human judgments than a simple baseline based on Google counts, and that Wikipedia outperforms WordNet when applied to the largest available dataset designed for that purpose.
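The path based measures mentioned above reward category pairs that are close in the category tree. A minimal sketch, over an invented two-branch fragment of a category tree (not real Wikipedia data), with relatedness taken as 1/(1 + path length):

```python
from collections import deque

# Hypothetical fragment of a category tree: parent -> children.
edges = {
    "Beverages": ["Coffee"],
    "Coffee": ["Java Coffee"],
    "Computing": ["Programming languages"],
    "Programming languages": ["Java (programming language)"],
}

def undirected(graph):
    """Build an undirected adjacency map so paths can go up or down the tree."""
    adj = {}
    for parent, children in graph.items():
        for child in children:
            adj.setdefault(parent, set()).add(child)
            adj.setdefault(child, set()).add(parent)
    return adj

def path_length(adj, a, b):
    """Breadth-first search for the shortest path between two categories."""
    queue, seen = deque([(a, 0)]), {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # no path: the categories are unrelated in this fragment

adj = undirected(edges)
d = path_length(adj, "Java Coffee", "Coffee")  # adjacent nodes: distance 1
relatedness = 1 / (1 + d)                      # one simple path-based score
print(d, relatedness)  # → 1 0.5
```

In this toy fragment the two branches are disconnected, so categories across branches get no path at all; real measures normalize by the depth of the category tree, which this sketch omits.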
Structured knowledge sources thus yield more accurate relatedness computation [26]. Though the results do not fully meet user needs, they are better than plain Google search.

E. Semantic Relatedness Computation Using Wikipedia and WordNet

Computing semantic relatedness between texts requires a large and structured knowledge base. Here two knowledge sources, Wikipedia and WordNet, are combined to compute semantic relatedness between texts [17]. The proposed approach preprocesses the input texts and extracts sentences; concepts and n-grams are then extracted from the sentences, semantic relatedness between n-grams is computed with the help of WordNet and Wikipedia (WLM [19]), and a semantic matrix for the sentences, called the Enriched Concepts Matrix, is built. This matrix is used to compute semantic relatedness between sentences with the vector space model, and finally relatedness between the texts is computed from the relations between their sentences. The contribution of this method is that it can compute semantic relatedness between almost any pair of concepts and texts. It achieved a high correlation coefficient of 0.74, outperforming the other existing approaches and showing that integrating two knowledge bases can give more accurate results.

F. Computing Semantic Relatedness Using Wikipedia Links

Here the Wikipedia hyperlink structure is used to define relatedness between terms, instead of the category network or the text on the page [19]. The central component of this approach is the link: a manually defined connection between two manually disambiguated concepts. Anchor text is used to identify candidate articles for terms. Instead of term counts weighted by the probability of occurrence of each term, link counts weighted by the probability of occurrence of each link are used to measure the angle between the link vectors of the articles of interest. Similarity between terms is then measured as the similarity between their representative articles.
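The link-vector comparison just described can be sketched as follows. The article names and link sets are invented; the weighting here is a simple IDF-like inverse of how often a link target occurs across the toy corpus, standing in for the probability-weighted link counts of the actual measure.

```python
import math

# Hypothetical articles represented by the links they contain.
articles = {
    "Java (programming language)": {"Sun Microsystems", "Bytecode", "Compiler"},
    "Python (programming language)": {"Bytecode", "Compiler", "Guido van Rossum"},
    "Java (island)": {"Indonesia", "Jakarta"},
}

def link_weight(target):
    """Links that appear in fewer articles carry more weight (IDF-style)."""
    occurrences = sum(target in links for links in articles.values())
    return math.log(len(articles) / occurrences)

def link_vector(title):
    """Weighted vector of the links found on an article."""
    return {t: link_weight(t) for t in articles[title]}

def cosine(u, v):
    """Cosine of the angle between two sparse link vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = lambda w: math.sqrt(sum(x * x for x in w.values()))
    return dot / (norm(u) * norm(v)) if u and v else 0.0

sim_close = cosine(link_vector("Java (programming language)"),
                   link_vector("Python (programming language)"))
sim_far = cosine(link_vector("Java (programming language)"),
                 link_vector("Java (island)"))
print(sim_close > sim_far)  # the two programming articles share weighted links
```

Articles sharing many rare links end up with a small angle between their vectors, which is exactly why this measure needs only the link graph and not the article text.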
The advantage of this approach is that it requires far less data and fewer resources, since only the links on Wikipedia pages are considered. Experimental results showed that the Wikipedia link based measure consistently outperforms WikiRelate and previous approaches across all datasets. However, the measure cannot reach the efficacy of the ESA (Explicit Semantic Analysis) measure of Gabrilovich and Markovitch [23], [19].

G. Clustering of Web Pages in Search Results for Ambiguous Query

The approach proposed in this paper applies query reformulation strategies such as word substitution, spelling correction, URL stripping and stemming to the results of web search engines, to help users obtain more relevant results for their query. The method uses a number of semantic annotation techniques with knowledge bases such as WordNet and Wikipedia to obtain the different senses of each query term [13]. A two step approach is used: first, information is retrieved using the query; second, clustering is performed on the result set, with K-means [3] and hierarchical clustering as the common clustering algorithms. Each cluster is assigned a label with a particular sense, which defines a category; this label is recovered from the cluster's feature vector. Finally, the performance of the system is evaluated using precision, recall and F-measure. The main advantage of clustering web search results is that it enables users to easily browse groups of web page references with the same meaning. One limitation is that the number of search results returned by the search engine is fairly small, and the document information provided by the search engine is inadequate because only URLs are provided.

H. Automatic Word Sense Disambiguation Using Wikipedia

A large number of concepts on Wikipedia are explicitly connected to their corresponding articles through piped links, which can be regarded as sense annotations, especially for ambiguous terms. For example, in "In 1834, Alan was admitted to the [[bar (law)|bar]] at the age of twenty-three, and entered private practice in Boston" [18], 'bar (law)' is a sense annotation that disambiguates the ambiguous term 'bar'. These hyperlinks are used to generate a sense tagged corpus, which is built in three steps. First, all Wikipedia paragraphs containing an occurrence of the given ambiguous term in the form of a piped link are extracted, considering only occurrences spelled in lower case. Next, all the labels for the ambiguous term are collected by extracting the leftmost component of each piped link. Finally, these labels are manually mapped to WordNet senses and the sense tagged corpus is created. The task of the word sense disambiguation system is then to automatically learn, from these sense annotated examples, a disambiguation model that can predict the correct sense of a new, previously unseen occurrence of the word.
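The label-collection step above, extracting the leftmost component of a piped link when the surface form matches the lower-case ambiguous word, can be sketched with a regular expression over standard [[target|surface]] wiki markup. The paragraph text follows the example quoted above.

```python
import re

# Example paragraph containing a piped link, as in the "bar (law)" example.
paragraph = ("In 1834, Alan was admitted to the [[bar (law)|bar]] at the age "
             "of twenty-three, and entered private practice in Boston.")

def sense_annotations(text, ambiguous_word):
    """Return link targets (the leftmost piped-link component) whose surface
    form is the given lower-case ambiguous word."""
    senses = []
    for target, surface in re.findall(r"\[\[([^\]|]+)\|([^\]]+)\]\]", text):
        if surface == ambiguous_word and surface.islower():
            senses.append(target)
    return senses

print(sense_annotations(paragraph, "bar"))  # → ['bar (law)']
```

Collecting these targets over many paragraphs yields the label set that is then manually mapped to WordNet senses; the lower-case check mirrors the restriction described in the text.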
This approach led to accurate sense classifiers with an average relative error rate reduction of 44% compared to the most frequent sense baseline, and 30% compared to the Lesk-corpus baseline. One exception was that no accuracy improvement was observed for words for which very few hyperlinks could be collected from Wikipedia [18].

III. COMPARISON BETWEEN VARIOUS APPROACHES

The approaches discussed in the above section have their own advantages and limitations, which are summarized in Table 1.

TABLE 1. COMPARISON BETWEEN VARIOUS APPROACHES

1. Dynamic page rank algorithm. Advantages: gives more efficient results than Google's existing PageRank algorithm. Disadvantages: as WSD is used, sufficient training time is required for each word.

2. Efficient information retrieval for ambiguous query (clustering of word senses). Advantages: the user's intention behind an ambiguous query can be identified to a significant extent, and the approach is an effective way to form clusters for an ambiguous query. Disadvantages: it does not support multiword queries.

3. Handling polysemy in description logic ontologies. Advantages: an effective method for polysemy handling in a semi-automatic process. Disadvantages: only the concepts present in the ontology are disambiguated, and terms absent from WordNet are omitted.

4. Computing semantic relatedness using Wikipedia. Advantages: the semantic relatedness measures computed with this method are useful in the NLP task of coreference resolution. Disadvantages: using Wikipedia alone yields slightly worse performance in a coreference resolution system than WordNet.

5. Computing semantic relatedness using Wikipedia and WordNet as knowledge bases. Advantages: achieved a high correlation coefficient of 0.74, outperforming existing state-of-the-art approaches, including ESA. Disadvantages: more computational time is required due to the use of n-grams.

6. Computing semantic relatedness using Wikipedia links. Advantages: requires far less data and fewer resources, since only the hyperlinks on Wikipedia pages are used; the Wikipedia link based measure consistently outperforms WikiRelate. Disadvantages: the measure is not as good as the ESA measure.

7. Clustering of web pages in search results for ambiguous query. Advantages: enables users to easily browse groups of web page references with the same meaning. Disadvantages: the number of results returned by the search engine is fairly small, and the document information is inadequate since only URLs are provided.

8. Using Wikipedia for automatic word sense disambiguation. Advantages: relative error reduction of 30-44% by using Wikipedia as a source of sense annotations. Disadvantages: no accuracy improvement is obtained for ambiguous words for which very few labels could be collected from Wikipedia.

IV. ANALYSIS OF VARIOUS APPROACHES

As Table 1 shows, various approaches have been proposed to deal with ambiguous queries and thereby obtain relevant search results. Queries of shorter length are more prone to ambiguity [13], so increasing the query length can help in disambiguating ambiguous terms. WordNet and Wikipedia have been used separately as sense repositories, or combined to obtain a larger one. It has been observed that Wikipedia can give better results for word sense disambiguation than WordNet, because it has large coverage of named entities and is well structured [26]. The Wikipedia link structure can help in disambiguating concepts, and strong links between Wikipedia articles help in finding relations between concepts [9]. In order to disambiguate query tokens dynamically on the web, our proposed method therefore uses the online encyclopedia Wikipedia as its knowledge base. Other methods for word sense disambiguation using Wikipedia have used Wikipedia dumps, which can be downloaded at wikimedia.org, to access the knowledge available in Wikipedia.
Since sense disambiguation should be done dynamically on the web, the proposed approach instead uses the Zend Rest Client component to access Wikipedia data through the MediaWiki API [2].

A. Motivation for the Proposed Approach

From the various approaches we found that an enormous amount of knowledge is available in Wikipedia in the form of disambiguation pages, redirect pages, hyperlinks and categories. Semantic knowledge rests with the Wikipedia categories and strongly reflects human judgement [22]. Redirect pages provide alternate names for entities, disambiguation pages exist for ambiguous entities and can therefore help in disambiguating terms, and the hyperlink structure helps in finding the relatedness between two concepts. Wikipedia is thus an appealing source for disambiguating ambiguous query tokens dynamically on the web. In the majority of existing approaches, word sense disambiguation has been done desktop based, using WordNet or Wikipedia dumps, but dumps can hardly fulfil the requirement of an up-to-date knowledge source. Behind this large knowledge source there exists the MediaWiki API, which allows its content to be accessed online with the help of the Zend Rest Client component. This continuously updated knowledge base, along with its larger coverage compared to WordNet, motivated us to use Wikipedia online for word sense disambiguation.

B. Objectives of the Proposed Approach

By studying the advantages and disadvantages of the approaches above, the objectives of the proposed system are derived as follows:

- Disambiguate ambiguous query tokens with the help of the semantic knowledge of Wikipedia.
- Design an algorithm that supports multiword queries.
- Retrieve the possible senses available in Wikipedia for an ambiguous term.
- Enable the user to get more refined search results.
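The kind of MediaWiki API request involved can be sketched as follows. The original work issues such requests from PHP via the Zend Rest Client; here the same request is shown as a plain URL, built but not sent, using the API's standard query parameters.

```python
from urllib.parse import urlencode

# Standard MediaWiki API endpoint for English Wikipedia.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"

def disambiguation_links_url(title):
    """Build a URL asking the API for the links on a page, e.g. the sense
    hyperlinks listed on a disambiguation page."""
    params = {
        "action": "query",    # standard read module
        "titles": title,      # page to inspect
        "prop": "links",      # return the links on the page
        "pllimit": "max",     # as many links per request as allowed
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)

url = disambiguation_links_url("Java (disambiguation)")
print(url)
```

Sending this URL with any HTTP client returns the disambiguation page's hyperlinks as JSON, which is exactly the per-token sense list the proposed approach needs at query time instead of a downloaded dump.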

V. PROPOSED APPROACH TO AMBIGUITY RESOLUTION

From the analysis of the above approaches it can be said that Wikipedia is the most suitable sense repository on the web, because of its continuously updated information and its large number of named entities. Mihalcea and Csomai worked on two algorithms in which the feature vector for an ambiguous word consists of the three words to its left and right along with their parts of speech [18]. This concept is carried over into the proposed approach in the form of a feature set. The input is the query words, and the expected output is search results more refined than Google's. Our objective is to find the appropriate sense of the words in the query using Wikipedia; here a sense is a feature set for a word, containing supporting words that give more information about the query word. The proposed approach consists of three modules: the Wikipedia module, the Google module and the Refined Final module.

A. Wikipedia Module

Generally, pages are written on Wikipedia for all possible senses of a word, with the page title containing the sense in parentheses, e.g. Bank (geography), where 'geography' is a sense of the ambiguous word 'Bank'. All the possible senses of a word are shown as hyperlinks on its disambiguation page, and the existence of a disambiguation page for an entity indicates that it is ambiguous according to Wikipedia. Based on the occurrence of a disambiguation page for a token, it is decided whether multiple pages are possible for that query token. For a single word query, user intervention is used: the user is shown a pop-up menu containing the senses available in Wikipedia for the query token, where each sense is a hyperlink on the disambiguation page, e.g. Bank (geography), Bank (surname). The user selects the relevant sense from the pop-up menu, and the feature set is obtained from the Wikipedia page for the selected sense. For a multiword query, if a disambiguation page exists for a query token, the hyperlinks on the disambiguation page are followed.
Multiple senses are obtained by extracting a feature set from each page, so multiple feature sets are now available for the query token. The same procedure is repeated for the other tokens, and relatedness between the feature sets of different query tokens is then computed using text overlap. The final output of this module is the feature set from the Wikipedia pages for the words in the query. The Zend Rest Client component is used to access Wikipedia data through the MediaWiki API [2].

B. Google Module

In order to refine Google's search results, it is necessary to check what Google has retrieved for the same query words. For this the Google Ajax API is used. Since the relevancy of Google search results decreases with increasing page number, only the top 15 search results are considered, and a feature set is retrieved for each of them.

C. Refined Final Module

Here the similarity between the two feature sets, the one from the Wikipedia module and the one from the Google module, is computed using text overlap. The page rank is decided based on this similarity, and the refined search results are shown to the user. The problem with this approach is that, since only text overlap between feature sets is considered, tokens are disambiguated at the syntactic level rather than using the semantic knowledge derived from Wikipedia. For this reason the approach is modified, as discussed below.

D. Modified Approach

The basic semantic knowledge in Wikipedia that involves human judgement rests in the Wikipedia categories [22]; hence Wikipedia categories are used in this modified approach to compute the semantic relatedness between tokens. Disambiguation pages are normally written on Wikipedia to avoid conflicts when two or more topics could have the same natural page title, and they are aids in searching.
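The text-overlap similarity used to compare the Wikipedia feature set against each Google-result feature set can be sketched as follows. The paper does not fix a formula, so Jaccard overlap of token sets is used here as one simple realisation, and the feature sets themselves are invented.

```python
# Minimal sketch of text overlap between two feature sets, assuming Jaccard
# overlap (intersection over union) as the similarity measure.
def overlap(feature_set_a, feature_set_b):
    a, b = set(feature_set_a), set(feature_set_b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical feature sets for the query token 'Java' in its island sense.
wiki_features = {"island", "indonesia", "jakarta", "volcano"}
page_features = {"indonesia", "jakarta", "travel", "beach"}
print(round(overlap(wiki_features, page_features), 2))  # → 0.33
```

Ranking the top search results by this score is what the Refined Final module does; the modified approach then replaces this purely lexical comparison with category-based semantic matching.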
As in the approach discussed above, the disambiguation pages available on Wikipedia for ambiguous entities are used to disambiguate the query tokens. Three different cases are considered for a query containing two tokens.

Case 1: A disambiguation page is found for both query tokens. This implies that both tokens are ambiguous according to Wikipedia. In this case a separate list of senses is maintained for each disambiguation page. In the previous approach the sense of a word was its feature set; here the definition has evolved to the hyperlinks present on the disambiguation page, e.g. Bank (geography), Bank (surname), where 'geography' and 'surname' are senses of the word 'Bank'. The intersection of both lists is computed. If the intersection set is not empty, the sense found common to both lists is taken. If the intersection set is empty, the redirects for each hyperlink on the first token's disambiguation page are followed and the second token is traced in those redirects; redirects are considered because they represent alternate names for entities. If the token can be traced in a redirect, the sense corresponding to that redirect's hyperlink on the disambiguation page is taken. If the second token cannot be traced in the redirects of the first token's disambiguation page, the redirects for the hyperlinks on the second token's disambiguation page are considered and the procedure is repeated for the first token. If the tokens still cannot be traced in redirects, Wikipedia page categories are considered: each hyperlink on the disambiguation pages is followed, one by one, and the intersection of the categories of the corresponding pages is computed. Categories are considered here to find the semantic relatedness between the tokens. If a category match is obtained, the tokens can be said to be semantically related, and supporting words are extracted from the pages corresponding to the hyperlinks for which the category match was obtained.

Case 2: A disambiguation page is found for only one token. This implies that only one token is ambiguous according to Wikipedia. In this case each link on the disambiguation page is followed and its redirects are obtained.
The second token is traced in those redirects. If it can be found, the sense corresponding to that redirect's hyperlink on the disambiguation page is taken. If it cannot be traced, a list of hyperlinks is maintained for each link on the disambiguation page, and another list of hyperlinks is maintained for the Wikipedia page obtained for the second token; the intersection of the two lists is then computed. If the intersection set is not empty, the words found common to both lists are taken as the sense. If it is empty, the links on the disambiguation page are followed one by one and the intersection of the page categories is computed. If a category match is obtained, the tokens can be said to be semantically related, and supporting words are extracted from the Wikipedia pages corresponding to the hyperlinks for which the category match was obtained.

Case 3: No disambiguation page is found for either token. This implies that neither token is ambiguous according to Wikipedia. Since links among pages connect articles that are semantically related and likely in the same context [9], in this case the intersection of the hyperlinks on both pages is computed and the most important common words are retrieved. It is observed that ambiguity in queries is due to short query length, which is on average 2.33 terms on a popular search engine [13]. For this reason supporting words for the query words are found through the above module; these are referred to as the feature set. The feature set obtained in all three cases is then used to get refined search results from Google, since the greater the length of the input query (i.e. the more related words it contains), the more precise the results.

VI. CONCLUSION

By studying the various approaches, our own approach has been designed.
Since the availability of a large knowledge base helps in getting search results more relevant to user needs, we have emphasized exploiting the knowledge available in Wikipedia. Results are awaited for queries containing two tokens. If the experiment proves successful, the approach can in future be extended to more than two ambiguous query tokens.

REFERENCES
[1] Rekha Jain, Sulochana Nathawat, Rupal Bhargava, G.N. Purohit. Efficient information retrieval for ambiguous words.
[2] Hook into Wikipedia information using PHP and the MediaWiki API.
[3] Khaled Alsabti, Sanjay Ranka, and Vineet Singh. An efficient k-means clustering algorithm. In Proceedings of the IPPS/SPDP Workshop on High Performance Data Mining.

[4] Razvan Bunescu and Marius Pasca. Using encyclopedic knowledge for named entity disambiguation. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL-06).
[5] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch.
[6] Silviu Cucerzan. Large-scale named entity disambiguation based on Wikipedia data. In Proceedings of EMNLP-CoNLL 2007.
[7] Jun Fang, Lei Guo, and Ning Yang. Handling polysemy in description logic ontologies. In Fuzzy Systems and Knowledge Discovery (FSKD), 2010 Seventh International Conference on.
[8] Christiane Fellbaum. WordNet: An Electronic Lexical Database. Bradford Books.
[9] Angela Fogarolli. In ICSC.
[10] Miao Hai and Zhang Yang-sen. Construction of polysemy table and search engine based on inverted index. In Fuzzy Systems and Knowledge Discovery (FSKD), International Conference on.
[11] Todd Holloway, Miran Bozicevic, and Katy Börner. Analyzing and visualizing the semantic coverage of Wikipedia and its authors.
[12] Diana Inkpen. Information retrieval on the internet.
[13] Andreas Kanavos, Evangelos Theodoridis, and Athanasios K. Tsakalidis. Extracting knowledge from web search engine results. In ICTAI.
[14] Robert Krovetz. Homonymy and polysemy in information retrieval. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics.
[15] Sayali Kulkarni, Amit Singh, Ganesh Ramakrishnan, and Soumen Chakrabarti. Collective annotation of Wikipedia entities in web text. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
[16] Chenliang Li, Aixin Sun, and Anwitaman Datta. A generalized method for word sense disambiguation based on Wikipedia. In Proceedings of the 33rd European Conference on Advances in Information Retrieval.
[17] R. Malekzadeh, J. Bagherzadeh, and A. Noroozi. A hybrid method based on WordNet and Wikipedia for computing semantic relatedness between texts. In Artificial Intelligence and Signal Processing (AISP), CSI International Symposium on.
[18] Rada Mihalcea. Using Wikipedia for automatic word sense disambiguation. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference.
[19] David Milne and Ian H. Witten. An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. In Proceedings of AAAI 2008.
[20] Roberto Navigli and Paola Velardi. Structural semantic interconnections: A knowledge-based approach to word sense disambiguation.
[21] Simone Paolo Ponzetto and Michael Strube. Knowledge derived from Wikipedia for computing semantic relatedness.
[22] Priya Radhakrishnan and Vasudeva Varma. Extracting semantic knowledge from Wikipedia category names. In Proceedings of the 2013 Workshop on Automated Knowledge Base Construction.
[23] Kira Radinsky, Eugene Agichtein, Evgeniy Gabrilovich, and Shaul Markovitch. A word at a time: Computing word relatedness using temporal semantic analysis. In Proceedings of the 20th International Conference on World Wide Web.
[24] R. K. Roul and S. K. Sahay. An effective information retrieval for ambiguous query.
[25] Amit Singh, Sayali Kulkarni, Somnath Banerjee, Ganesh Ramakrishnan, and Soumen Chakrabarti. Curating and searching the annotated web. In SIGKDD Conference, system demonstration.
[26] Michael Strube and Simone Paolo Ponzetto. WikiRelate! Computing semantic relatedness using Wikipedia. In Proceedings of the 21st National Conference on Artificial Intelligence - Volume 2.

AUTHOR BIOGRAPHIES
Samiksha V. Chakule is pursuing a B.Tech degree in Information Technology at the College of Engineering, Pune (COEP). Her current research interests include web mining, information retrieval, databases, and data mining.
Ashwini S. Borse is pursuing a B.Tech degree in Information Technology at the College of Engineering, Pune (COEP). Her current research interests include information retrieval, databases, web mining, and data mining.
Shalaka S. Jadhav is pursuing a B.Tech degree in Information Technology at the College of Engineering, Pune (COEP). Her current research interests include information retrieval, databases, web mining, and data mining.
Dr. Mrs. Y. V. Haribhakta is a professor in the Department of Computer Science and IT, COEP. She has 15 years of teaching experience and holds M.E. and Ph.D. degrees in computer engineering. Her research interests include text mining and NLP. She is currently a member of ISTE and AMIEE.


More information

Domain-Specific Semantic Relatedness From Wikipedia: Can A Course Be Transferred?

Domain-Specific Semantic Relatedness From Wikipedia: Can A Course Be Transferred? Domain-Specific Semantic Relatedness From Wikipedia: Can A Course Be Transferred? Beibei Yang University of Massachusetts Lowell Lowell, MA 01854 byang1@cs.uml.edu Jesse M. Heines University of Massachusetts

More information

Query Session Detection as a Cascade

Query Session Detection as a Cascade Query Session Detection as a Cascade Extended Abstract Matthias Hagen, Benno Stein, and Tino Rüb Bauhaus-Universität Weimar .@uni-weimar.de Abstract We propose a cascading method

More information

University of Amsterdam at INEX 2010: Ad hoc and Book Tracks

University of Amsterdam at INEX 2010: Ad hoc and Book Tracks University of Amsterdam at INEX 2010: Ad hoc and Book Tracks Jaap Kamps 1,2 and Marijn Koolen 1 1 Archives and Information Studies, Faculty of Humanities, University of Amsterdam 2 ISLA, Faculty of Science,

More information

A Session-based Ontology Alignment Approach for Aligning Large Ontologies

A Session-based Ontology Alignment Approach for Aligning Large Ontologies Undefined 1 (2009) 1 5 1 IOS Press A Session-based Ontology Alignment Approach for Aligning Large Ontologies Editor(s): Name Surname, University, Country Solicited review(s): Name Surname, University,

More information

Disambiguating Search by Leveraging a Social Context Based on the Stream of User s Activity

Disambiguating Search by Leveraging a Social Context Based on the Stream of User s Activity Disambiguating Search by Leveraging a Social Context Based on the Stream of User s Activity Tomáš Kramár, Michal Barla and Mária Bieliková Faculty of Informatics and Information Technology Slovak University

More information

Improving Suffix Tree Clustering Algorithm for Web Documents

Improving Suffix Tree Clustering Algorithm for Web Documents International Conference on Logistics Engineering, Management and Computer Science (LEMCS 2015) Improving Suffix Tree Clustering Algorithm for Web Documents Yan Zhuang Computer Center East China Normal

More information

MEASURING SEMANTIC SIMILARITY BETWEEN WORDS AND IMPROVING WORD SIMILARITY BY AUGUMENTING PMI

MEASURING SEMANTIC SIMILARITY BETWEEN WORDS AND IMPROVING WORD SIMILARITY BY AUGUMENTING PMI MEASURING SEMANTIC SIMILARITY BETWEEN WORDS AND IMPROVING WORD SIMILARITY BY AUGUMENTING PMI 1 KAMATCHI.M, 2 SUNDARAM.N 1 M.E, CSE, MahaBarathi Engineering College Chinnasalem-606201, 2 Assistant Professor,

More information

Information Retrieval

Information Retrieval Multimedia Computing: Algorithms, Systems, and Applications: Information Retrieval and Search Engine By Dr. Yu Cao Department of Computer Science The University of Massachusetts Lowell Lowell, MA 01854,

More information

Personalized Terms Derivative

Personalized Terms Derivative 2016 International Conference on Information Technology Personalized Terms Derivative Semi-Supervised Word Root Finder Nitin Kumar Bangalore, India jhanit@gmail.com Abhishek Pradhan Bangalore, India abhishek.pradhan2008@gmail.com

More information

Tulip: Lightweight Entity Recognition and Disambiguation Using Wikipedia-Based Topic Centroids. Marek Lipczak Arash Koushkestani Evangelos Milios

Tulip: Lightweight Entity Recognition and Disambiguation Using Wikipedia-Based Topic Centroids. Marek Lipczak Arash Koushkestani Evangelos Milios Tulip: Lightweight Entity Recognition and Disambiguation Using Wikipedia-Based Topic Centroids Marek Lipczak Arash Koushkestani Evangelos Milios Problem definition The goal of Entity Recognition and Disambiguation

More information

Text Mining Research: A Survey

Text Mining Research: A Survey Text Mining Research: A Survey R.Janani 1, Dr. S.Vijayarani 2 PhD Research Scholar, Dept. of Computer Science, School of Computer Science and Engineering, Bharathiar University, Coimbatore, India 1 Assistant

More information

Langforia: Language Pipelines for Annotating Large Collections of Documents

Langforia: Language Pipelines for Annotating Large Collections of Documents Langforia: Language Pipelines for Annotating Large Collections of Documents Marcus Klang Lund University Department of Computer Science Lund, Sweden Marcus.Klang@cs.lth.se Pierre Nugues Lund University

More information