Sense-based Information Retrieval System by using Jaccard Coefficient Based WSD Algorithm

Similar documents
Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India

Web Information Retrieval using WordNet

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES

Making Sense Out of the Web

COMP90042 LECTURE 3 LEXICAL SEMANTICS COPYRIGHT 2018, THE UNIVERSITY OF MELBOURNE

WordNet-based User Profiles for Semantic Personalization

CS 6320 Natural Language Processing

Random Walks for Knowledge-Based Word Sense Disambiguation. Qiuyu Li

A Comprehensive Analysis of using Semantic Information in Text Categorization

Domain-specific Concept-based Information Retrieval System

A Linguistic Approach for Semantic Web Service Discovery

Knowledge-based Word Sense Disambiguation using Topic Models Devendra Singh Chaplot

Enhanced Retrieval of Web Pages using Improved Page Rank Algorithm

Chapter 6: Information Retrieval and Web Search. An introduction

Information Retrieval. (M&S Ch 15)

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most

TEXT PREPROCESSING FOR TEXT MINING USING SIDE INFORMATION

Reading group on Ontologies and NLP:

Influence of Word Normalization on Text Classification

String Vector based KNN for Text Categorization

Cross-Lingual Word Sense Disambiguation

Question Answering Approach Using a WordNet-based Answer Type Taxonomy

WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY

Text Mining: A Burgeoning technology for knowledge extraction

Query Difficulty Prediction for Contextual Image Retrieval

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2

Improving Retrieval Experience Exploiting Semantic Representation of Documents

AN EFFECTIVE INFORMATION RETRIEVAL FOR AMBIGUOUS QUERY

What is this Song About?: Identification of Keywords in Bollywood Lyrics

Information Retrieval and Web Search

Exploiting Internal and External Semantics for the Clustering of Short Texts Using World Knowledge

BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network

REDUNDANCY REMOVAL IN WEB SEARCH RESULTS USING RECURSIVE DUPLICATION CHECK ALGORITHM. Pudukkottai, Tamil Nadu, India

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

[Gidhane* et al., 5(7): July, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116

Semantic-Based Information Retrieval for Java Learning Management System

MEASURING SEMANTIC SIMILARITY BETWEEN WORDS AND IMPROVING WORD SIMILARITY BY AUGUMENTING PMI

Text Mining. Munawar, PhD. Text Mining - Munawar, PhD

Web Page Similarity Searching Based on Web Content

An Efficient Approach for Requirement Traceability Integrated With Software Repository

Ontology Based Search Engine

Information Retrieval. Information Retrieval and Web Search

Correlation Based Feature Selection with Irrelevant Feature Removal

Effect of log-based Query Term Expansion on Retrieval Effectiveness in Patent Searching

Query Languages. Berlin Chen Reference: 1. Modern Information Retrieval, chapter 4

1.

Measuring Semantic Similarity between Words Using Page Counts and Snippets

Archive of SID. A Semantic Ontology-based Document Organizer to Cluster elearning Documents

Introduction to Text Mining. Hongning Wang

CIRGDISCO at RepLab2012 Filtering Task: A Two-Pass Approach for Company Name Disambiguation in Tweets

Encoding Words into String Vectors for Word Categorization

SAACO: Semantic Analysis based Ant Colony Optimization Algorithm for Efficient Text Document Clustering

QUERIES SNIPPET EXPANSION FOR EFFICIENT IMAGES RETRIEVAL

Automatic Construction of WordNets by Using Machine Translation and Language Modeling

QUERY BASED TEXT SUMMARIZATION

Noida institute of engineering and technology,greater noida

Information Retrieval CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Document Retrieval using Predication Similarity

Lecture 4: Unsupervised Word-sense Disambiguation

ISSN: [Sugumar * et al., 7(4): April, 2018] Impact Factor: 5.164

Text Data Pre-processing and Dimensionality Reduction Techniques for Document Clustering

Information Extraction of Important International Conference Dates using Rules and Regular Expressions

METEOR-S Web service Annotation Framework with Machine Learning Classification

Improvement of Web Search Results using Genetic Algorithm on Word Sense Disambiguation

BayesTH-MCRDR Algorithm for Automatic Classification of Web Document

KEYWORD EXTRACTION FROM DESKTOP USING TEXT MINING TECHNIQUES

Query Phrase Expansion using Wikipedia for Patent Class Search

Ubiquitous Computing and Communication Journal (ISSN )

CHAPTER 5 SEARCH ENGINE USING SEMANTIC CONCEPTS

Ontology-Based Web Query Classification for Research Paper Searching

Word Disambiguation in Web Search

MRD-based Word Sense Disambiguation: Extensions and Applications

CHAPTER 3 INFORMATION RETRIEVAL BASED ON QUERY EXPANSION AND LATENT SEMANTIC INDEXING

The Dictionary Parsing Project: Steps Toward a Lexicographer s Workstation

Hybrid Approach for Query Expansion using Query Log

Efficient Discovery of Semantic Web Services

INTERNATIONAL JOURNAL OF SCIENTIFIC & ENGINEERING RESEARCH, VOLUME 6, ISSUE 5, MAY-2015 ISSN

Classification and Comparative Analysis of Passage Retrieval Methods

INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET) CONTEXT SENSITIVE TEXT SUMMARIZATION USING HIERARCHICAL CLUSTERING ALGORITHM

Using the Multilingual Central Repository for Graph-Based Word Sense Disambiguation

Towards the Automatic Creation of a Wordnet from a Term-based Lexical Network

A Semantic Model for Concept Based Clustering

International Journal of Advanced Research in Computer Science and Software Engineering

MEASUREMENT OF SEMANTIC SIMILARITY BETWEEN WORDS: A SURVEY

Information Retrieval

Keywords Data alignment, Data annotation, Web database, Search Result Record

Automatic Text Classification System

Optimal Query. Assume that the relevant set of documents C r. 1 N C r d j. d j. Where N is the total number of documents.

Wordnet Based Document Clustering

Semantic Indexing of Technical Documentation

Improving Classifier Performance by Imputing Missing Values using Discretization Method

MIRACLE at ImageCLEFmed 2008: Evaluating Strategies for Automatic Topic Expansion

An Efficient Approach for Requirement Traceability Integrated With Software Repository

Deep Web Content Mining

WEB STRUCTURE MINING USING PAGERANK, IMPROVED PAGERANK AN OVERVIEW

Keyword Extraction by KNN considering Similarity among Features

Research Article. August 2017

Information Retrieval and Web Search

An Approach To Web Content Mining

Extensible Dynamic Form Approach for Supplier Discovery

Transcription:

ISBN 978-93-84468-0-0 Proceedings of 015 International Conference on Future Computational Technologies (ICFCT'015 Singapore, March 9-30, 015, pp. 197-03 Sense-based Information Retrieval System by using Jaccard Coefficient Based WSD Algorithm Su Mu Tyar 1, and Myo Min Than 1, Department of Information Technology, Yangon Technological University, Myanmar 1 09sumutyar@gmail.com, dr.myominthan79@gmail.com Abstract: In many natural language processing (NLP applications such as machine translation, content analysis, and information retrieval (IR, word sense disambiguation (WSD is an essential technique. In spite of having long history relationship between WSD and retrieval system, there are still challenges in intelligible retrieval and ambiguous query words. To overcome theses challenges, new Jaccard coefficient based WSD algorithm is proposed with the help of WordNet and Corpus lexical resources for automatically identifying the correct meaning of an ambiguous word. In this study, the glosses of Synonyms, Hypernyms and Hyponyms synsets of WordNet and Corpus that encoded senses of each word are considered in queries disambiguation and senses indexing. Keywords: Word Sense Disambiguation, Information Retrieval, Jaccard coefficient, WordNet, Corpus 1. Introduction Due to the proliferation of information on the web, the problem of finding relevant documents has become more visible. The method how to easily retrieve the information and knowledge is needed. In this situation, information retrieval (IR methods are concerned with the process about the representation, storage, searching and finding of information which is relevant to the user query. The ambiguity in the user query has long been recognized as having a detrimental effect on the performance of text based information retrieval (IR system. A word can have many different meanings, or senses. For example, bank in English can either mean a financial institution, or a sloping raised land. The task of word sense disambiguation (WSD is to assign the correct sense to such ambiguous words based on the surrounding context. The word sense disambiguation algorithm is needed for semantic indexing to get the correct sense of the indexed words. Semantic indexing of the document changes from the keyword-based approach to the sensebased approach for effective retrieval. So, the proposed system is implemented as the sense-based information retrieval system by using Jaccard coefficient based WSD algorithm. The sense-based information retrieval system eliminates either the possibility of retrieving information that is obtained due to the presence of polysemes of the keywords or the irrelevant information that is retrieved because of non provision of the correct sense of the word in the searching process. The proposed sense-based information retrieval system has been semantically performed over the words to increase the precision of the IR system. To support the semantic search, this system uses WordNet version 3.0 and Corpus as the lexical resources. For information technology (IT domain, this system is proposed to retrieve user query relevance technology information. The rest of the paper is organized as follows: section presents related works. Background theory about the sense-based information retrieval (IR is presented in section 3. The proposed system design, explanation and experimental results are described in section 4. Finally, conclusion is described in section 5. http://dx.doi.org/10.17758/ur.u031558 197

. Related Work D. Subarani [1] presented the concept-based information retrieval from Tamil text documents. Semantics has been introduced at various linguistic levels, word level, sentence level and document content extraction level and at various stage of information retrieval such as query and document representation, and indexing, to improve the information retrieval from text documents. Domain ontology that has been created with knowledge based, and word sense disambiguation are used to support semantic search in Tamil document repositories. P. O. Michael, S. Christopher and T. John [] demonstrated the relative performance of an IR system using WSD compared to a baseline retrieval technique such as the vector space model. This disambiguation system was trained and evaluated using Semcor 1.6 which is distributed with WordNet. Y. Liu, P. Scheuermann and X. Zhu [3] proposed a text classification method based on word sense disambiguation. This algorithm is applied to Brown Corpus. The sense-based text classification algorithm is an automatic technique to disambiguate word senses and then classify text documents. If this automatic technique can be applied in real applications, the classification of e-documents must be accelerated dramatically. It must be a great contribution to the management system of Web pages and digital libraries, etc. According to literature and concepts pointed out from the previous works, this work is intended to provide a sense-based information retrieval system by using Jaccard coefficient based word sense disambiguation (WSD algorithm. 3. Sense-based Information Retrieval (IR Information retrieval (IR system is able to accept a user query, understand from the user query what the user requires, search a database for relevant documents, retrieve the documents to the user, and rank the documents according to their relevance [4]. Before the documents in a collection are used for retrieval, some pre-processing tasks are usually performed. For traditional text documents (no HTML tags, the tasks are stopword removal, stemming, and handling of digits, hyphens, punctuation, and cases of letters [5]. The sense-based IR system also retrieves the user query relevant information. But, this IR system must consider the synonyms of the query words as a part of the IR query. Relevant synonyms of the query words in the given context contribute the useful information to the query. These relevant synonyms can be identified with the help of proposed Jaccard coefficient based WSD algorithm. 3.1. Word Sense Disambiguation Word sense is one of the meanings of a word. Words are having different meanings based on the context of the word usage in a sentence. Word sense disambiguation (WSD is used to find the correct meaning of the sense or the word. WSD is usually performed on one or more texts although in principle bags of words, i.e., collections of naturally occurring words might be employed [9]. WSD can be viewed as a classification task: word senses are the classes, and an automatic classification method is used to assign each occurrence of a word to one or more classes based on the evidence from the context and from external knowledge sources such as Thesauri, Ontology, Machine readable dictionaries (MRD and WordNet. Among them, this system is used WordNet within WSD for finding semantically related words [7, 8]. Word sense disambiguation process is essential and useful for many applications. These applications are machine translation, speech processing, text processing, content and thematic analysis, grammatical analysis, and information retrieval and hypertext navigation [6]. 3.1.1. WordNet and Corpus WordNet and corpus are the external knowledge sources. These are also used as the source of the synsets. The basic relationship between words in the external knowledge source is the synonym relation called synset. Words in the same synset are synonymous in a particular sense. Word sense is the meaning a word can take depending how it is used [10]. For instance, one of the synsets of bank is {depository financial institution, bank, banking concern, banking company} and its gloss is (a financial institution that accepts deposits and channels the money into lending activities; "he cashed a check at the bank" [3]. http://dx.doi.org/10.17758/ur.u031558 198

3.1.. Jaccard Coefficient-based WSD Algorithm Jaccard coefficient-based word sense disambiguation (WSD algorithm is shown in Figure 1. Algorithm: Generate correct sense of the ambiguous word. w : the word in the sentence aw : the ambiguous word s : sense of the word Preprocessing: Segment the input sentence Remove stopwords from this sentence Building context vector: For all w i in the sentence Look up in the WordNet. If w i has only one s then Let w i be disambiguous word. else retrieve all s i of aw i from WordNet and English corpus. For all s i of aw i - build context vector using gloss of s i from WordNet and English corpus; - remove stopword in each context vector; EndFor EndFor Disambiguation : For all s i of aw i - calculate similarity between the input vector A and each context vector B n A i B i sim(a, B i 1 n (Ai i 1 n (Bi i 1 n Ai Bi i 1 If s = max score (s i return the correct sense s of the ambiguous word aw; EndFor Fig. 1: Jaccard Coefficient-based WSD Algorithm 3.. Similarity Measure Method To measure the similarity between the document vector d j and the sense-based query vector q, the similarity measure method is as follows: v ws ij ws iq sim(d i 1 j,q v (ws i 1 ij v (ws i 1 iq v ws ij ws iq i 1 (1 3..1. Sense Weighting Scheme in Document To calculate the weight of sense within document, SF (sense frequency and IDF (inverse document frequency are used. The sense frequency within document is as follows: sfij fij max{f 1j,f j,...,f v j } ( where, f ij is the raw frequency count of sense s i in document d j. sf ij is the normalize sense frequency of sense s i in document d j. The inverse document frequency is as follows: idf i log N dfi (3 http://dx.doi.org/10.17758/ur.u031558 199

where, df i is number of document in which sense s i appears at least once. N is the total number of document in the system. idf i is the inverse document frequency of sense s i. The weight of the sense (ws ij within document is as follows: ws ij sfij idfi (4 3... Sense Weighting Scheme in Query The weight of the sense within query is as follows: ws iq 0.5 0.5sfiq max{sf1q,sf q,...,sf log v q } where, ws iq is the weight of the sense s i in query q. sf iq is the raw frequency count of sense s i in query q. N df i (5 4. Proposed System Design Start Input Query Jaccard Coefficientbased WSD Process Remove Stopwords Search Ambiguous Word English Corpus WordNet Search Optimal Sense for each Ambiguous Word Produce Optimal Sense Retrieval Process Search Relevant Documents by using Similarity Method Produce Relevant Documents Document Collection Rank Documents Display User Query Relevant Document Results End Fig. : Proposed Sense-based Information Retrieval System Design http://dx.doi.org/10.17758/ur.u031558 00

The proposed sense-based information retrieval (IR system design is shown in Figure. In this system, there are two main processes: word disambiguation and information retrieval process. Jaccard coefficient method is applied in disambiguation the ambiguous words. In the information retrieval process, similarity measure method for retrieving the relevant documents in the database. First of all, user input query is accepted to remove the stopwords as a pre-processing step. And then, the ambiguous word, the word that has more than one meaning, is chosen with the help of English corpus and WordNet. After that, the optimal sense for each ambiguous word is determined by applying Jaccard coefficient-based word sense disambiguation (WSD algorithm. In the optimal sense searching process, WordNet and English corpus are used as the external knowledge resources. After getting the optimal sense for each ambiguous word, documents which have the user require information from the document database collection. For relevant documents searching process, disambiguous user query is used. Finally, the most relevant documents are retrieved to the user according to the ranking process. 4.1. Explanation of the Proposed System The aim of this study is to improve the performance of Information Retrieval (IR system. Three documents are considered in the document collection as a sample. These three documents are shown in Figure 3. Document1 In the learning process, learners are able to acquire knowledge. Learning methods include traditional learning, e-learning, blended learning, mobile learning and personalized learning. Document Knowledge acquisition involves the acquisition of knowledge from human experts, books, documents, sensors, or computer files. Document3 Knowledge acquisition is the process used to define the rules and ontologies required for a knowledge-based system. Fig. 3: Sample Documents Firstly, the user query is accepted and determined whether each query word is ambiguous or disambiguous word. If the user query is the learning process for the sample system. The correct meaning of the ambiguous word is determined with the help of the user query input vector. The correct sense of the word is manipulated by calculating the similarity between the user query input vector and each context vector that used the gloss of each sense from the WordNet and English Corpus. Sample ambiguous words and their senses are shown in Table 1. TABLE I: Sample Ambiguous Words and Their Senses from WordNet and English Corpus Ambiguous Word No: of senses Sense 1 Sense Sense 3 Sense 4 Sense 5 learning 8 acquisition eruditeness acquire study memorize information 5 info knowledge accusation data entropy process 9 procedure operation summons outgrowth, appendage treat The correct sense is regarded depends on the ranks of similarity value between each context vector and input vector. The sense that has the highest similarity result is assumed the correct meaning of the ambiguous word, disambiguous query as a final step of WSD process. Both kind of user query are shown in Figure 4. User Input Query: the learning process Disambiguated User Query: learning acquisition process operation Fig. 4: User Query Although the disambiguated user query is used in sense-based IR system, the keyword-based IR system applies the simple user query in this study to compare the efficiency. According to the results, the sense-based IR is much more efficient than the keyword-based IR. Retrieval results are shown in Table. http://dx.doi.org/10.17758/ur.u031558 01

TABLE II: Retrieval Results of the Sample documents ID Retrieval Results by using Disambiguated User Query Retrieval Results by using User Input Query 1 Document 1 (Relevant Document 1 (Relevant Document (Relevant Document (Irrelevant 3 Document 3 (Relevant Document 3 (Relevant 4.. Performance Analysis To access the accuracy or correctness of the proposed system, precision method is used. Precision: the percentage of retrieved documents that is relevant to the query. It can be defined as follows: {relevantdocuments} {retrieveddocuments} precision (6 {retrieveddocuments} For the above sample, the experimental results of the proposed system are shown in Figure 5. In spite of the same accuracy result for document 1, higher accuracy results are gained by using the disambiguous query in information retrieval process. Accuracy Results of the Proposed System 100 Accuracy Result (% 80 60 40 0 Retrieval Results using Disambiguous Query Retrieval Results using Ambiguous Query 0 Document 1 Document Document 3 5. Conclusion Fig. 5: Experimental Results of the Proposed System The proposed sense-based information retrieval system is developed based on the semantic oriented technology. Sense-based IR system used disambiguous query word is much more efficient than most other IR systems used user direct query. The glosses of Synonyms, Hypernyms and Hyponyms synset of WordNet and Corpus external knowledge resources are used in determining the disambiguous query word. To search the optimal sense of ambiguous query word, this study proposed Jaccard coefficient based WSD algorithm with the help of both external knowledge resources. Working flow of correct sense searching is the calculating similarity value between the user query input vector and each context vector that used the gloss of each sense from the WordNet and English Corpus. This study is useful not only to develop the intelligence information retrieval system but also to understand WSD system for applying any other application area of natural language processing field. 6. References [1] D. Subarani, Concept Based Information Retrieval from Text Documents, Dept. of Computer Sciences, SLN College of Sciences, Tirupathi, India, IOSR Journal of Computer Engineering (IOSRJCE, PP 38-38, July-Aug, 01. http://dx.doi.org/10.17758/ur.u031558 0

[] P. O. Michael, S. Christopher and T. John, Word Sense Disambiguation in Information Retrieval Revisited, The University of Sunderland, Informatics Centre, Canada, 003. [3] Y. Liu, P. Scheuermann and X. Zhu, Using WordNet to Disambiguate Word Senses for Text Classification, International Conference on Computational Science, Springer-Verlag Berlin Heidelberg, pp. 780-788, 007. http://dx.doi.org/10.1007/978-3-540-7588-6_17 [4] D. Glockner, Fuzzy Information Retrieval, University Hagen, Department of Computer Science, 005. [5] B. Liu, Web Data Mining, Department of Computer Science, University of Illinois at Chicago, USA, Springer- Verlag Berlin Heidelberg, 007. [6] I. Nancy and V.Jean, Word Sense Disambiguation: The State of the Art, Department of Computer Science, Vassar College, 1998. [7] R. Guzman-Cabrera, P. Rosso and M. Montes-y-Gomez, Semi-supervised Word Sense Disambiguation Using the Web as Corpus, Universidad de Guanajuato, Mexico, 009. http://dx.doi.org/10.1007/978-3-64-0038-0_1 [8] R. Navigli, Word Sense Disambiguation: A Survey, ACM Computing Surveys, Vol. 41, No., Article 10, Italy, February, 009. [9] S. Viswanadha Raju, J. Sreedhar and P. Pavan Kumar, Word Sense Disambiguation: An Empirical Survey, International Journal of Soft Computing and Engineering (IJSCE, Volume-, Issue-, May, 01. [10] A. Bui Muhammad and A. Tambuwal Yusuf, Query Expansion: Is It Necessary In Textual Case-Based Reasoning?, Nigerian Journal of Basic and Applied Science (NJBAS, 011. http://dx.doi.org/10.17758/ur.u031558 03