Measuring Semantic Similarity between Words Using Page Counts and Snippets

Manasa.Ch, Computer Science & Engineering, SR Engineering College, Warangal, Andhra Pradesh, India. Email: chandupatla.manasa@gmail.com
V.Ramana, Assistant Professor, CSE, SR Engineering College, Warangal, Andhra Pradesh, India. Email: naikramana@gmail.com
S.P. Ananda Raj, Sr. Assistant Professor, CSE, SR Engineering College, Warangal, Andhra Pradesh, India. Email: anandsofttech@gmail.com

Abstract

Web mining involves activities such as document clustering and community mining performed on the web. Such tasks require measuring the semantic similarity between words, which in turn makes many web mining applications easier to build. However, accurately measuring the semantic similarity between any two words is a difficult task. In this paper a new approach is proposed to measure the similarity between words. The approach is based on text snippets and page counts, both taken from the results of a web search engine such as Google. To achieve this, lexical patterns are extracted from text snippets, word co-occurrence measures are defined using page counts, and the results of the two are combined. Moreover, we propose pattern extraction and pattern clustering algorithms in order to find the various relationships between any two given words. Support Vector Machines (SVMs), a data mining technique, are used to integrate the results. The empirical results reveal that the proposed techniques produce results that compare well with human ratings and improve accuracy in web mining activities.

Key Words - Text snippets, page counts, semantic similarity, web mining, lexical patterns

1. INTRODUCTION

Web mining has gained popularity as a huge amount of information is being made available on the web, and automated processing of such information is the need of the hour. The applications of web mining include entity disambiguation, relation detection, and community extraction.
Information retrieval and natural language processing are two important aspects of all web mining applications. A lexical dictionary such as WordNet is widely used for natural language processing; however, it is a general-purpose lexical ontology. As part of web mining, documents must be compared and analyzed programmatically. This is a tedious task because the meanings of words change across domains and over time. The problem with lexical dictionaries is that they do not carry diverse information about words in their various contexts. For instance, the word apple is related to computer science, as there is a company named Apple that has been instrumental in bringing many computer hardware and software technologies to market. However, this sense is ignored in some lexical dictionaries, which consider apple only as a fruit. As new words are coined and new meanings become associated with existing words, lexical dictionaries prove inadequate when those new senses and relationships have not yet been recorded. To overcome the drawbacks mentioned above, we propose a method that automatically finds the semantic similarity between words or entities based on the page counts and text snippets retrieved from a web search engine such as Google. A page count is an estimate of the number of pages that contain the query words. A snippet is a piece of text extracted by the web search engine based on
the given query term. The following is the text snippet returned by the Google search engine for the query string apple.

Apple Inc. (NASDAQ: AAPL; formerly Apple Computer, Inc.) is an American multinational corporation that designs and sells consumer electronics, computer...

Fig. 1: Text snippet returned by the Google search engine for the search word apple

Similarity measures based on text snippets have been used for query expansion [4], personal name disambiguation [9], and community mining [17]. Text snippets and page counts can be obtained automatically from search engines and used in web mining. However, they have the following drawbacks:
1. Page count analysis ignores the position of a word within a page.
2. The page count of a polysemous word (a word with multiple senses) may combine the counts for all of its senses.
3. Because of the large number of documents in a result set, only the snippets of the top-ranking results for a query can be processed.

We propose a method that overcomes the problems mentioned above. We use both snippets and page counts, and propose lexical pattern extraction and pattern clustering algorithms to accurately measure the semantic similarity between words. The main contributions of this paper are the extraction of lexical patterns to identify the relations between words and the use of an SVM to integrate the evidence in a machine learning framework and optimize the results.

2. RELATED WORK

In [15], a taxonomy of words is used to calculate the similarity between two words by finding the length of the shortest path connecting them. The concept of information content was used by Resnik [9], who introduced a similarity measure between two concepts; the maximum similarity between any pair of concepts that the words belong to is taken as the similarity between the words. Information content and structural semantic information are combined by Li et al. [3] to obtain a similarity measure. Very high accuracy was reported for this technique on the Miller-Charles [11] benchmark data set.
Lin [8] defined similarity as the information common to both concepts, while Cilibrasi and Vitanyi [12] proposed a distance metric defined using page counts retrieved from web search engines. Snippets were used by [4] to measure the semantic similarity between two given words; each snippet was represented as a TF-IDF weighted term vector. A double-checking model based on snippets returned by a web search engine was developed by Chen et al. [4]. The concept of measuring semantic similarity is used in various web mining applications such as word sense disambiguation [6], language modeling [13], synonym extraction [5], and thesaurus extraction [4].

3. PROPOSED METHOD

The proposed method finds the similarity between two words A and B as a value between 0.0 and 1.0. The value 0.0 indicates that there is no similarity between the words, while 1.0 indicates absolute similarity. The proposed method makes use of the page counts and text snippets returned by a search engine such as Google. For instance, the words gem and jewel are given to Google, and the resulting page counts and text snippets are used by our method to compute the similarity between the words. The proposed method is visualized in Fig. 2. As illustrated in Fig. 2, two words such as gem and jewel are given as input to the search engine, which returns page counts and text snippets. These are extracted and given as input to our proposed techniques. The page counts are fed to word co-occurrence measures, namely WebJaccard, WebOverlap, WebDice, and WebPMI, and the results of these measures are given to the SVM. The text snippets are given to the proposed algorithms, which generate pattern clusters that are in turn given to the SVM. The SVM thus receives two inputs: the word co-occurrence measures and the pattern clusters.
The SVM is trained with these, and finally an accurate semantic similarity is calculated for the given two words such as gem and jewel.
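The page-count-based co-occurrence measures named above (WebJaccard, WebOverlap, WebDice, WebPMI) can be sketched as follows. This is a minimal illustration, not the exact implementation: the counts for the queries "P", "Q", and "P AND Q" are assumed to come from the search engine, N is an assumed index size, and the cutoff c for discarding unreliable low co-occurrence counts is an assumption of this sketch.

```python
from math import log2

N = 10 ** 10  # assumed number of pages indexed by the search engine


def web_jaccard(p, q, pq, c=5):
    """p, q: page counts for each word; pq: page count for the conjunction.
    Counts at or below the small threshold c are treated as noise."""
    return 0.0 if pq <= c else pq / (p + q - pq)


def web_overlap(p, q, pq, c=5):
    return 0.0 if pq <= c else pq / min(p, q)


def web_dice(p, q, pq, c=5):
    return 0.0 if pq <= c else 2 * pq / (p + q)


def web_pmi(p, q, pq, c=5):
    # Pointwise mutual information over page-count probabilities.
    if pq <= c:
        return 0.0
    return log2((pq / N) / ((p / N) * (q / N)))
```

For example, page counts of 1000 for each word and 500 for their conjunction give a WebDice score of 0.5 and a WebJaccard score of 1/3.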
Fig. 2: Outline of the proposed method

3.1 Page Count Based Co-Occurrence Measures

For two given words A and B, page counts are returned by the search engine when the words are given as queries. Four well-known word co-occurrence measures, Jaccard, Overlap (Simpson), Dice, and pointwise mutual information (PMI), are used in the proposed design in order to find the similarity between the words.

3.2 Lexical Pattern Extraction

To overcome the drawbacks of using text snippets directly, we propose a lexical pattern extraction algorithm based on text snippets. The algorithm finds the semantic relations that exist between the given words. This technique has been used in various natural language processing tasks such as extracting hypernyms [1], [7], question answering [10], meronym extraction [14], and paraphrase extraction. Lexical patterns are the patterns that satisfy the following criteria:
1. A subsequence must contain exactly one occurrence of each of A and B.
2. The maximum length of a subsequence is L words.
3. One or more words can be skipped in a subsequence; however, the number of consecutively skipped words must be less than g.
4. Only negation contractions in a context are expanded.

3.3 Lexical Pattern Clustering

The extracted lexical patterns are clustered based on their similarity to each cluster. Each cluster contains patterns that express a similar semantic relation. Algorithm 1 (steps for pattern clustering) returns such clusters. The clusters are sorted so that the most useful clusters appear at the top.

4. TRAINING WITH SVM

A two-class SVM is trained with both synonymous and non-synonymous word pairs generated from WordNet. Word pairs are extracted for 3000 words, so the total number of words in the training data is 6000. Lexical patterns are then extracted subject to a specified threshold. The extracted lexical patterns are clustered and given to the SVM.
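The lexical pattern extraction step of Section 3.2 can be sketched as follows. This is a simplified illustration under stated assumptions: the two query words are replaced by placeholders X and Y, only contiguous subsequences connecting the placeholders are collected (the skipped-word allowance g of criterion 3 is omitted), and the tokenization is naive.

```python
import re


def extract_patterns(snippet, a, b, max_len=5):
    """Collect word sequences that connect the two query words a and b
    in a snippet, with the words replaced by placeholders X and Y."""
    tokens = re.findall(r"\w+", snippet.lower())
    tokens = ["X" if t == a else "Y" if t == b else t for t in tokens]
    patterns = []
    for i, t in enumerate(tokens):
        if t in ("X", "Y"):
            # scan ahead up to max_len tokens for the other placeholder
            for j in range(i + 1, min(i + 1 + max_len, len(tokens))):
                if tokens[j] in ("X", "Y") and tokens[j] != t:
                    patterns.append(" ".join(tokens[i:j + 1]))
                    break
    return patterns
```

For the snippet "a gem is a jewel" and the pair (gem, jewel), this sketch yields the pattern "X is a Y", which expresses the relation between the words.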
The SVM acts upon both the results of the word co-occurrence measures and the pattern clusters in order to calculate the semantic similarity between the two given words.

5. EXPERIMENTAL RESULTS

The experimental results include the semantic similarities between two given words computed by the SVM from the page counts and text snippets retrieved from the search engine for those words.
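The integration step described above can be sketched as assembling one feature vector per word pair: the four co-occurrence values followed by, for each pattern cluster, the number of extracted patterns that fall in it. The function name and shapes below are our illustration, not the paper's exact design, and the actual two-class SVM would be trained with a standard machine learning library.

```python
def feature_vector(cooc, patterns, clusters):
    """cooc: the four co-occurrence values [WebJaccard, WebOverlap, WebDice, WebPMI];
    patterns: lexical patterns extracted from snippets for the word pair;
    clusters: pattern clusters, each given as a set of patterns."""
    # one feature per cluster: how many extracted patterns it contains
    hits = [sum(1 for p in patterns if p in cluster) for cluster in clusters]
    return list(cooc) + hits

# A trained two-class SVM then scores this vector; synonymous pairs drawn
# from WordNet serve as positive examples and the decision value is mapped
# to a similarity score in [0, 1].
```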
ISSN: 2249-5789

Fig. 3: Home page

We enter a keyword in the given text box to search with the search engine. For example, google and opera are the words searched in Fig. 4 and Fig. 6.

Fig. 4 shows the entry of the word google for searching
Fig. 6 shows the entry of the word opera for searching

When the search button is clicked, the page counts and snippets are displayed as the result. For example, the page counts and snippets for google and opera are shown in Fig. 5 and Fig. 7.

Fig. 5 shows the page counts and snippets retrieved for the word google
Fig. 7 shows the page counts and snippets retrieved for the word opera

We then enter two words to measure the semantic similarity between them; the measurement ranges from 0 to 1. For the given words google and opera, the semantic similarity is 0.8.

Fig. 8 shows the semantic similarity between google and opera as 0.8

We can measure the semantic similarity between various word pairs. The result is close to 1 when the words are semantically close and close to 0 when they are not. The output is shown in the form of graphs and tables as follows.
Table 1 shows the semantic similarities for various word pairs

Graph 1 shows the semantic similarities for various word pairs

6. CONCLUSION

We used the results of a web search engine for two words and proposed a semantic similarity measure based on the page counts and text snippets returned by a web search engine such as Google. The aim of this paper is to measure the semantic similarity between any two given words with the utmost accuracy. To achieve this, techniques such as pattern extraction and pattern clustering are introduced. These algorithms help in finding the various relationships between words. An SVM was trained with the relationships identified between the given words. The experiments were made with synonymous and non-synonymous word pairs collected from WordNet synsets. The experimental results show that the proposed method is far better than the existing approaches employed to measure semantic similarity between words.

7. REFERENCES

[1] C. Buckley, G. Salton, J. Allan, and A. Singhal. Automatic query expansion using SMART: TREC 3. In Proc. of 3rd Text REtrieval Conference, pages 69-80, 1994.
[2] D. Bollegala, Y. Matsuo, and M. Ishizuka. Disambiguating personal names on the web using automatically extracted key phrases. In Proc. of the 17th European Conference on Artificial Intelligence, pages 553-557, 2006.
[3] D. R. Cutting, J. O. Pedersen, D. Karger, and J. W. Tukey. Scatter/gather: A cluster-based approach to browsing large document collections. In Proceedings of SIGIR '92, pages 318-329, 1992.
[4] D. Lin. An information-theoretic definition of similarity. In Proc. of the 15th ICML, pages 296-304, 1998.
[5] D. Lin. Automatic retrieval and clustering of similar words. In Proc. of the 17th COLING, pages 768-774, 1998.
[6] F. Keller and M. Lapata. Using the web to obtain frequencies for unseen bigrams. Computational Linguistics, 29(3):459-484, 2003.
[7] G. Miller and W. Charles. Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1):1-28, 1991.
[8] H. Han, H. Zha, and C. L. Giles. Name disambiguation in author citations using a k-way spectral clustering method. In Proceedings of the International Conference on Digital Libraries, 2005.
[9] J. Curran. Ensemble methods for automatic thesaurus extraction. In Proc. of EMNLP, 2002.
[10] J. Mori, Y. Matsuo, and M. Ishizuka. Extracting keyphrases to represent relations in social networks from the web. In Proc. of 20th IJCAI, 2007.
[11] M. Fleischman and E. Hovy. Multi-document person name resolution. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL), Reference Resolution Workshop, 2004.
[12] M. Hearst. Automatic acquisition of hyponyms from large text corpora. In Proc. of 14th COLING, pages 539-545, 1992.
[13] M. Lapata and F. Keller. Web-based models for natural language processing. ACM Transactions on Speech and Language Processing, 2(1):1-31, 2005.
[14] M. Mitra, A. Singhal, and C. Buckley. Improving automatic query expansion. In Proc. of 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 206-214, 1998.
[15] P. Cimiano, S. Handschuh, and S. Staab. Towards the self-annotating web. In Proc. of 13th WWW, 2004.
[16] R. Bekkerman and A. McCallum. Disambiguating web appearances of people in a social network. In Proceedings of the World Wide Web Conference (WWW), pages 463-470, 2005.
[17] Z. Bar-Yossef and M. Gurevich. Random sampling from a search engine's index. In Proceedings of the 15th International World Wide Web Conference, 2006.

8. ABOUT THE AUTHORS

S.P. Anandaraj received the B.E. (CSE) degree from Madras University, Chennai, in 2004 and the M.Tech (CSE) degree with a Gold Medal from the Dr. MGR Educational and Research Institute University in 2007 (Distinction with Honors). He is currently pursuing a Ph.D. at St. Peter's University, Chennai. He has 8 years of teaching experience. His areas of interest are information security and sensor networks. He has published papers in international journals, international conferences, and national conferences, and has attended nearly 15 national workshops/FDPs/seminars. He is a member of ISTE, CSI, IEEE, IACSIT, and IAENG.

Manasa.Ch received the M.C.A. degree from Kamala Institute of Technology and Science, Huzurabad, Karimnagar, A.P., India. She is currently pursuing the M.Tech in Computer Science and Engineering at SR Engineering College, Warangal, India. Her research interests include knowledge and data engineering. She has participated in an ISTE-approved national conference on Mobile Communications and Data Engineering at VITS, Karimnagar, A.P.,
and participated in the Women Student Congress at NIT, Warangal, organized by the IEEE WIE student branch.

V.Ramana received the B.Tech (CSE) degree from JNTU, Hyderabad, in 2006 and the M.Tech (AI) degree from the University of Hyderabad in 2010. He has 2 years of teaching experience. His areas of interest are artificial intelligence and machine learning. He has published papers in international journals, international conferences, and national conferences, and has attended national workshops/FDPs/seminars. He is a member of CSI.