Measuring Semantic Similarity between Words Using Page Counts and Snippets

Manasa.Ch
Computer Science & Engineering, SR Engineering College, Warangal, Andhra Pradesh, India
Email: chandupatla.manasa@gmail.com

V.Ramana
Assistant Professor, CSE, SR Engineering College, Warangal, Andhra Pradesh, India
Email: naikramana@gmail.com

S.P. Ananda Raj
Sr. Assistant Professor, CSE, SR Engineering College, Warangal, Andhra Pradesh, India
Email: anandsofttech@gmail.com

Abstract

Web mining involves activities such as document clustering and community mining performed on the web. Such tasks require measuring the semantic similarity between words, which makes web mining activities easier to perform in many applications. However, accurately measuring the semantic similarity between any two words is a difficult task. In this paper a new approach is proposed to measure the similarity between words. The approach is based on text snippets and page counts, both taken from the results of a search engine such as Google. Lexical patterns are extracted from text snippets, word co-occurrence measures are defined using page counts, and the results of the two are combined. Moreover, we propose pattern extraction and pattern clustering algorithms in order to find the various relationships between any two given words. Support Vector Machines, a data mining technique, are used to optimize the results. The empirical results indicate that the proposed techniques yield results that are comparable with human ratings and that support accuracy in web mining activities.

Key Words - Text snippets, word count, semantic similarity, web mining, lexical patterns

1. INTRODUCTION

Web mining has gained popularity because a huge amount of information is being made available over the web, and automated processing of such information is the need of the hour. The applications of web mining include entity disambiguation, relation detection and community extraction.
Information retrieval and natural language processing are two important aspects involved in all web mining applications. A lexical dictionary such as WordNet is widely used for natural language processing. However, it is a general purpose lexical ontology. As part of web mining, documents have to be compared and analyzed programmatically. This is a tedious task because the meanings of words change across domains and over time. The problem with lexical dictionaries is that they lack diverse information about words in various contexts. For instance, the word apple is related to computer science because there is a company named Apple which has been instrumental in bringing out many computer hardware and software technologies. However, this sense is ignored in some lexical dictionaries, which treat apple only as a fruit. As new words are created and new meanings become associated with existing words, lexical dictionaries prove inadequate when words have new meanings and relationships with other words that are not yet reflected in them.

To overcome the drawbacks mentioned above, we propose a method that automatically finds the semantic similarity between words or entities based on the page counts and text snippets retrieved from web search engines such as Google. A page count is an estimate of the number of pages that contain the query words. A snippet is a piece of text extracted by the web search engine based on the query term given. The following is the text snippet obtained from the Google search engine with the query string apple.

Apple Inc. (NASDAQ: AAPL; formerly Apple Computer, Inc.) is an American multinational corporation that designs and sells consumer electronics, computer...

Fig. 1: Text snippet given by the Google search engine for the search word apple

Similarity measures have been associated with text snippets for query expansion [4], personal name disambiguation [9], and community mining [17]. The text snippets and page counts are automatically obtained from search engines and used in web mining. However, they have the following drawbacks.

1. Page count analysis ignores the position of a word in a page.
2. The page count of a polysemous word (a word with multiple senses) might contain a combination of all its senses.
3. Because of the large number of documents in the result set, only the snippets for the top ranking results for a query can be processed.

We propose a method that overcomes the problems mentioned above. We use both snippets and page counts, and propose lexical pattern extraction and pattern clustering algorithms to accurately measure the semantic similarity between words. The main contributions of this paper are lexical pattern extraction to identify relations between words, and the use of an SVM to integrate a machine learning approach that optimizes the results.

2. RELATED WORK

In [15] a taxonomy of words is used to calculate the similarity between two words by finding the length of the shortest path connecting them. The information content concept is used by Resnik [9], who introduced a similarity measure between two concepts. The similarity between two words is taken as the maximum of the similarity between any concepts that the words belong to. Information content and structural semantic information are combined by Li et al. [3] to obtain a similarity measure. This technique showed very high accuracy when used with the Charles [11] benchmark data set.
Lin [8] defined similarity as the information that is common to both concepts, while Cilibrasi and Vitanyi [12] proposed a distance metric defined using page counts retrieved from web search engines. Snippets were used in [4] to measure the semantic similarity between two given words, with each snippet represented as a TF-IDF weighted term vector. A double checking model based on snippets returned by a web search engine was developed by Chen et al. [4]. The concept of measuring semantic similarity is used in various web mining applications such as word sense disambiguation [6], language modeling [13], synonym extraction [5], and thesauri extraction [4].

3. PROPOSED METHOD

The proposed method, which finds the similarity between two words A and B, returns a value between 0.0 and 1.0. The value 0.0 indicates that there is no similarity between the words, while 1.0 indicates absolute similarity. The method makes use of page counts and text snippets retrieved by a search engine such as Google. For instance, the words gem and jewel are given to Google, and the resulting page counts and text snippets are used by our method to find the similarity between the words. The proposed method is outlined in Fig. 2.

As illustrated in Fig. 2, two words such as gem and jewel are given as input to the search engine. The search engine returns page counts and text snippets, which are extracted and given as input to our proposed techniques. Page counts are given to word co-occurrence measures such as WebJaccard, WebOverlap, WebDice, and WebPMI, and the results of these measures are given to the SVM. The text snippets, on the other hand, are given to the proposed algorithms, which generate pattern clusters that are in turn given to the SVM. The SVM thus has two inputs: the word co-occurrence measures and the pattern clusters. The SVM is trained with these, and finally the semantic similarity is calculated for the two given words such as gem and jewel.
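To make the page count measures named above concrete, the following is a minimal sketch, assuming the commonly used definitions of the Jaccard, Overlap, Dice and PMI coefficients computed over page counts. The function names, the noise threshold c, the index size n and the page counts in the usage lines are hypothetical illustrations and are not values taken from this paper.

```python
import math


def web_jaccard(h_a, h_b, h_ab, c=5):
    """Jaccard coefficient from page counts H(A), H(B) and H(A AND B)."""
    if h_ab <= c:  # assumed threshold: treat very rare co-occurrence as noise
        return 0.0
    return h_ab / (h_a + h_b - h_ab)


def web_overlap(h_a, h_b, h_ab, c=5):
    """Overlap (Simpson) coefficient from page counts."""
    if h_ab <= c:
        return 0.0
    return h_ab / min(h_a, h_b)


def web_dice(h_a, h_b, h_ab, c=5):
    """Dice coefficient from page counts."""
    if h_ab <= c:
        return 0.0
    return 2 * h_ab / (h_a + h_b)


def web_pmi(h_a, h_b, h_ab, n=10**10, c=5):
    """Pointwise mutual information; n is an assumed number of indexed pages."""
    if h_ab <= c:
        return 0.0
    return math.log2((h_ab / n) / ((h_a / n) * (h_b / n)))


# Hypothetical page counts for two words and for their conjunctive query.
h_gem, h_jewel, h_both = 100, 200, 50
print(web_jaccard(h_gem, h_jewel, h_both))
print(web_overlap(h_gem, h_jewel, h_both))
print(web_dice(h_gem, h_jewel, h_both))
print(web_pmi(h_gem, h_jewel, h_both, n=1000))
```

Unlike the other three measures, PMI is not bounded to the range 0 to 1, so in this sketch it would need rescaling before being compared directly with them.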

Fig. 2: Outline of proposed method

3.1 Page Count Based Co-Occurrence Measures

For two given words A and B, page counts are returned by the search engine when these words are given as input. Four well-known word co-occurrence measures, namely Jaccard, Overlap (Simpson), Dice, and pointwise mutual information (PMI), are used in the proposed design to find the similarity between words.

3.2 Lexical Pattern Extraction

To overcome the drawbacks of using text snippets directly, we propose a lexical pattern extraction algorithm based on text snippets. The algorithm is meant for finding the semantic relations that exist between the given words. This technique has been used in various natural language processing tasks such as extracting hypernyms [1], [7], question answering [10], meronym extraction [14], and paraphrase extraction. Lexical patterns are subsequences that satisfy the following criteria.

1. A subsequence must contain exactly one occurrence of each of A and B.
2. The maximum length of a subsequence is L words.
3. A subsequence may skip one or more words. However, the number of words skipped consecutively should be less than g.
4. Only negation contractions in a context are expanded.

3.3 Lexical Pattern Clustering

The extracted lexical patterns are clustered based on their similarity to a given cluster. Each cluster contains patterns that express similar semantic relations. Algorithm 1 returns such clusters. The clusters are sorted in ascending order, which means that the most useful clusters are at the top.

Algorithm 1: Steps for pattern clustering

4. TRAINING WITH SVM

A two-class SVM is trained with both synonymous and non-synonymous word pairs generated from WordNet. The word pairs are extracted for 3000 words. The total number of words in the training data is 6000. Lexical patterns are then extracted subject to a specified threshold. The lexical patterns thus extracted are clustered and given to the SVM.
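The four criteria for lexical patterns listed in Section 3.2 can be illustrated with a short sketch. It is not the authors' implementation: the function name, the placeholders X and Y for the two words, the default values chosen for L and g, and the simple rule used to expand negation contractions are all assumptions made for illustration.

```python
import re
from itertools import combinations


def extract_patterns(snippet, word_a, word_b, max_len=5, max_skip=2):
    """Return the set of lexical patterns found in one snippet.

    max_len plays the role of L and max_skip the role of g in the criteria;
    both default values are illustrative assumptions.
    """
    text = snippet.lower()
    # Criterion 4: expand negation contractions (naive rule: "isn't" -> "is not").
    text = re.sub(r"n't\b", " not", text)
    tokens = re.findall(r"[a-z]+", text)
    # Replace the two query words with the placeholders X and Y.
    tokens = ["X" if t == word_a.lower() else "Y" if t == word_b.lower() else t
              for t in tokens]

    patterns = set()
    for size in range(2, max_len + 1):  # criterion 2: at most L words
        for idx in combinations(range(len(tokens)), size):
            # Criterion 3: fewer than g words skipped between neighbours.
            if any(j - i - 1 >= max_skip for i, j in zip(idx, idx[1:])):
                continue
            words = [tokens[i] for i in idx]
            # Criterion 1: exactly one occurrence of each of A and B.
            if words.count("X") == 1 and words.count("Y") == 1:
                patterns.add(" ".join(words))
    return patterns


for pattern in sorted(extract_patterns("A gem is a jewel", "gem", "jewel")):
    print(pattern)
```

Enumerating every index combination is only practical because snippets are short. A real system would likely need a more economical way of generating subsequences and a more careful contraction rule.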
The SVM acts upon both the results of the word co-occurrence measures and the pattern clusters in order to calculate the semantic similarity between two given words.

5. EXPERIMENTAL RESULTS

The experimental results include the semantic similarities between two given words, computed using the SVM together with the page counts and text snippets retrieved from search engines for the given words.

ISSN: 2249-5789

Fig. 3: Home page

We have to enter a keyword in the given text box to search in the search engine. For example, google and opera are the words searched for in Fig. 4 and Fig. 6.

Fig. 4: Entering the word google to search

When we click on the search button, the page counts and snippets are displayed as the result. For example, the page counts and snippets for google and opera are shown in Fig. 5 and Fig. 7.

Fig. 5: Page counts and snippets retrieved for the word google

Fig. 6: Entering the word opera to search

Fig. 7: Page counts and snippets retrieved for the word opera

We have to enter two words to measure the semantic similarity between them. The measurement ranges from 0 to 1. For the words google and opera the semantic similarity is 0.8.

Fig. 8: Semantic similarity between google and opera as 0.8

Semantic similarity can be measured in this way for various word pairs. The result is close to 1 when the words are semantically close and close to 0 when they are not. The output is shown in the form of graphs and tables as follows.

Table 1: Semantic similarities for various word pairs

Graph 1: Semantic similarities for various word pairs

6. CONCLUSION

We used the results of a web search engine for two words and proposed a semantic similarity measure based on the page counts and text snippets returned by a web search engine such as Google. The aim of this paper is to measure the semantic similarity between any two given words with the utmost accuracy. To achieve this, techniques such as pattern extraction and pattern clustering are introduced. These algorithms help in finding the various relationships between words. The SVM was trained with the relationships identified between the given words. The experiments were carried out with synonymous and non-synonymous word pairs collected from WordNet synsets. The experimental results show that the proposed method performs better than the existing approaches employed to measure semantic similarity between words.

7. REFERENCES

[1] C. Buckley, G. Salton, J. Allan, and A. Singhal. Automatic query expansion using SMART: TREC 3. In Proc. of 3rd Text REtrieval Conference, pages 69-80, 1994.
[2] D. Bollegala, Y. Matsuo, and M. Ishizuka. Disambiguating personal names on the web using automatically extracted key phrases. In Proc. of the 17th European Conference on Artificial Intelligence, pages 553-557, 2006.
[3] D. R. Cutting, J. O. Pedersen, D. Karger, and J. W. Tukey. Scatter/gather: A cluster-based approach to browsing large document collections. In Proceedings SIGIR '92, pages 318-329, 1992.
[4] D. Lin. An information-theoretic definition of similarity. In Proc. of the 15th ICML, pages 296-304, 1998.
[5] D. Lin. Automatic retrieval and clustering of similar words. In Proc. of the 17th COLING, pages 768-774, 1998.
Table 7: Entity Disambiguation Results (WWW 2007 / Track: Semantic Web, Session: Similarity and Extraction)

                        Jaguar                          Java
Method          Precision  Recall  F          Precision  Recall  F
WebJaccard      0.5613     0.541   0.5288     0.5738     0.5564  0.5243
WebOverlap      0.6463     0.6314  0.6201     0.6228     0.5895  0.56
WebDice         0.5613     0.541   0.5288     0.5738     0.5564  0.5243
WebPMI          0.5607     0.478   0.5026     0.7747     0.595   0.6468
Sahami [36]     0.6061     0.6337  0.6019     0.751      0.4793  0.5761
CODC [6]        0.5312     0.6159  0.5452     0.7744     0.5895  0.6358
Proposed        0.6892     0.7144  0.672      0.8198     0.6446  0.691

[6] F. Keller and M. Lapata. Using the web to obtain frequencies for unseen bigrams. Computational Linguistics, 29(3):459-484, 2003.
[7] G. Miller and W. Charles. Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1):1-28, 1998.
[8] H. Han, H. Zha, and C. L. Giles. Name disambiguation in author citations using a k-way spectral clustering method. In Proceedings of the International Conference on Digital Libraries, 2005.
[9] J. Curran. Ensemble methods for automatic thesaurus extraction. In Proc. of EMNLP, 2002.
[10] J. Mori, Y. Matsuo, and M. Ishizuka. Extracting keyphrases to represent relations in social networks from web. In Proc. of 20th IJCAI, 2007.
[11] M. Fleischman and E. Hovy. Multi-document person name resolution. In Proceedings of 42nd Annual Meeting of the Association for Computational Linguistics (ACL), Reference Resolution Workshop, 2004.

[12] M. Hearst. Automatic acquisition of hyponyms from large text corpora. In Proc. of 14th COLING, pages 539-545, 1992.
[13] M. Lapata and F. Keller. Web-based models for natural language processing. ACM Transactions on Speech and Language Processing, 2(1):1-31, 2005.
[14] M. Mitra, A. Singhal, and C. Buckley. Improving automatic query expansion. In Proc. of 21st Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pages 206-214, 1998.
[15] P. Cimiano, S. Handschuh, and S. Staab. Towards the self-annotating web. In Proc. of 13th WWW, 2004.
[16] R. Bekkerman and A. McCallum. Disambiguating web appearances of people in a social network. In Proceedings of the World Wide Web Conference (WWW), pages 463-470, 2005.
[17] Z. Bar-Yossef and M. Gurevich. Random sampling from a search engine's index. In Proceedings of 15th International World Wide Web Conference, 2006.

8. ABOUT THE AUTHORS

S.P. Anandaraj received the B.E (CSE) degree from Madras University, Chennai in 2004 and the M.Tech (CSE) degree with Gold Medal from Dr. MGR Educational and Research Institute, University in 2007 (Distinction with Honors). He is now pursuing a Ph.D at St. Peter's University, Chennai. He has 8 years of teaching experience. His areas of interest are information security and sensor networks. He has published papers in international journals, international conferences and national conferences, and has attended nearly 15 national workshops, FDPs and seminars. He is a member of ISTE, CSI, IEEE, IACSIT and IAENG.

Manasa.Ch received the M.C.A degree from Kamala Institute of Technology and Science, Huzurabad, Karimnagar, A.P, India. She is currently pursuing an M.Tech in Computer Science and Engineering at SR Engineering College, Warangal, India. Her research interests include knowledge and data engineering. She has participated in the ISTE approved National Conference on Mobile Communications and Data Engineering at VITS, Karimnagar, A.P., and in the Women Student Congress at NIT, Warangal, organized by the IEEE WIE student branch.

V.Ramana received the B.Tech (CSE) degree from JNTU, Hyderabad in 2006 and the M.Tech (AI) degree from the University of Hyderabad in 2010. He has 2 years of teaching experience. His areas of interest are artificial intelligence and machine learning. He has published papers in international journals, international conferences and national conferences, and has attended national workshops, FDPs and seminars. He is a member of CSI.