Volume 118, No. 23 (2018), 467-475
ISSN: 1314-3395 (on-line version)
url: http://acadpubl.eu/hub
ijpam.eu

Impact of Term Weighting Schemes on Document Clustering: A Review

G. Hannah Grace and Kalyani Desikan
Department of Mathematics, VIT University, Chennai
hannahgrace.g@vit.ac.in, kalyanidesikan@vit.ac.in

Abstract

Term weighting schemes are used to identify the importance of terms in a document collection and to assign weights to them accordingly. Document clustering uses these term weights to decide whether documents are similar. In this article, we apply different term weighting schemes to a document corpus and study their impact on document clustering. First, the given document corpus (DC) is pre-processed using tokenization, stopword removal and stemming, and then converted to its term-document matrix. We have worked with six term weighting schemes, viz. TF, TFIDF, MI, ATC, Okapi and TFICF, and obtained clustering solutions using the k-means clustering algorithm. In this review paper, we compare the clustering solutions obtained, based on two well-known cluster quality measures: entropy and purity.

Key Words and Phrases: Document Clustering; Term Weighting Schemes; TF; TFIDF

1 Introduction

Document clustering is the automatic grouping of text documents into clusters such that documents within a cluster have high similarity to one another but are dissimilar to documents in other clusters. Document clustering is a method to find useful features from the document corpus. It is used to understand the structure and content of unknown text sets as well as to give new perspectives on familiar ones [1].

Our paper is organized as follows. Section 2 describes the representation of the term-document matrix after preprocessing. Section 3 discusses the six term weighting schemes, TF, TFIDF, MI, ATC, Okapi and TFICF, and their formulae. Section 4 outlines the experimental results and the analysis based on cluster quality measures. Section 5 concludes the paper. In this review paper we have analyzed the various weighting schemes, obtained the clustering solutions and validated the results with cluster quality measures such as entropy and purity.

2 Document Preprocessing and Representation of the Term-Document Matrix

Given a document corpus DC, it is pre-processed using techniques such as removing blank spaces, numbers, web links, punctuation marks and stop words, followed by stemming [2]. The vector space model [3] is one of the best-known models used to represent documents. The main task is to find an appropriate encoding of the feature vector; this provides an efficient quantitative representation of each document [4]. After reducing the sparseness, we formulate the term-document matrix. Let $T = \{t_1, t_2, \ldots, t_p\}$, where the $t_i$ are the terms obtained after preprocessing. We then formulate the term-document matrix $X$ as in Figure 1, i.e., $X = [x_{ij}]_{N \times p}$ is the term-document matrix and each document $d_i$ is a tuple in the $p$-dimensional space $\mathbb{R}^p$. Each $x_{ij}$ normally represents the frequency of term $t_j$ in document $d_i$; apart from term frequency, other weighting schemes can also be used. The objective of this paper is to analyse the impact of six term weighting schemes on document clustering.
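To make Section 2 concrete, the following is a minimal sketch (not the authors' code) of the preprocessing pipeline and the construction of the term-document matrix, assuming NLTK's Porter stemmer and scikit-learn's CountVectorizer are available; the toy corpus is a hypothetical placeholder.

```python
# Sketch of the preprocessing pipeline of Section 2: lower-casing,
# removal of web links / numbers / punctuation, stopword removal,
# Porter stemming, and construction of the N x p term-document matrix.
import re

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

stemmer = PorterStemmer()

def tokenize(text):
    text = re.sub(r"http\S+", " ", text.lower())   # drop web links
    text = re.sub(r"[^a-z\s]", " ", text)          # drop numbers and punctuation
    # remove stop words first, then stem the surviving tokens
    return [stemmer.stem(t) for t in text.split() if t not in ENGLISH_STOP_WORDS]

# hypothetical toy corpus standing in for the document corpus DC
corpus = [
    "Clustering groups similar documents together.",
    "Term weighting schemes assign weights to terms in documents.",
]

vectorizer = CountVectorizer(tokenizer=tokenize, token_pattern=None, lowercase=False)
X = vectorizer.fit_transform(corpus)   # sparse N x p count matrix of x_ij values
print(vectorizer.get_feature_names_out())
```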

3 Term Weighting Schemes

Term weighting schemes are important for text-based applications and hence have been well studied. Term weighting [5] is a numerical representation of a collection of $N$ documents described by $p$ terms from a document corpus DC. As noted above, $X = [x_{ij}]_{N \times p}$ is a sparse matrix, called the term-document matrix, whose rows correspond to documents and whose columns correspond to the stemmed terms appearing in the documents (Figure 1). Three different factors determine a term weighting scheme [6]: a local factor, a global factor and a normalization factor.

Figure 1: Term Document Matrix

3.1 TF weighting scheme

Term frequency (TF) is the simplest measure for weighting each term in a document: it measures how frequently a term occurs in a document. In this weighting scheme, the term-document matrix comprises rows corresponding to the documents in the collection and columns corresponding to the terms; the elements of the matrix show which document contains which term and how many times it appears in each document.

3.2 TFIDF weighting scheme

TFIDF, the term frequency - inverse document frequency weighting scheme, is a commonly used scheme. It reflects how important a word is to a document in a collection or corpus: the TFIDF value increases proportionally to the number of times a word appears in the document and is offset by the frequency of the word across the corpus. The basic form of TFIDF is given by

$$w_{ij} = x_{ij} \log \frac{N}{n_j}$$

where $w_{ij}$ is the weight of term $j$ in document $i$; $x_{ij}$ is the number of occurrences of term $j$ in document $i$ (TF); $N$ is the total number of documents in the document collection; and $n_j$ is the number of documents in which term $j$ occurs at least once. $n_j$ is often referred to as the document frequency (DF) of term $j$, and, naturally, $\log \frac{N}{n_j}$ is called the inverse document frequency (IDF) of term $j$. IDF is the most frequently used collection frequency factor; it is incorporated because of its simplicity and robustness, although it has its own drawbacks. The IDF weight is based on the intuition that words that do not occur frequently in a collection tend to be more informative than words that appear frequently across many documents.

3.3 MI weighting scheme

The mutual information (MI) method is a supervised method that works by computing the mutual information between a given term and a document class. This provides a measure of how much information the term can tell about a given class of documents. Low mutual information suggests that the term has low discrimination power and hence should be removed [8]. The weighting scheme is given by

$$w_{ij} = \log \frac{N \, x_{ij}}{\left(\sum_{i=1}^{N} x_{ij}\right)\left(\sum_{j=1}^{p} x_{ij}\right)}$$

where $N$ is the number of documents in the document collection, $p$ is the number of terms and $x_{ij}$ is the term-document frequency.
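As an illustration, the TFIDF and MI weights can be computed from a raw count matrix with NumPy. This is a sketch under the formulas as reconstructed above, using a hypothetical toy matrix X; it is not the authors' implementation.

```python
# Sketch of the TFIDF and MI weightings of Sections 3.2 and 3.3.
import numpy as np

X = np.array([[2, 0, 1],
              [0, 3, 1],
              [1, 1, 0]], dtype=float)   # hypothetical counts x_ij (N x p)
N, p = X.shape

# TFIDF: w_ij = x_ij * log(N / n_j), with n_j the document frequency of term j
n_j = (X > 0).sum(axis=0)
w_tfidf = X * np.log(N / n_j)

# MI: w_ij = log(N * x_ij / (row_sum_i * col_sum_j)), taken as 0 when x_ij = 0
row = X.sum(axis=1, keepdims=True)       # total count in document i
col = X.sum(axis=0, keepdims=True)       # total count of term j
with np.errstate(divide="ignore"):       # log(0) -> -inf where x_ij = 0
    w_mi = np.log(N * X / (row * col))
w_mi[X == 0] = 0.0                       # zero out weights of absent terms
```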

3.4 Okapi weighting scheme

Okapi [7] is one of the best-known term weighting schemes. It originated from the probabilistic relevance model rather than the vector space model, but it has a lot in common with the TFIDF weighting scheme, which is based on the vector space model: both use term frequency, inverse document frequency and normalisation. The Okapi term weighting scheme is given as follows:

$$w_{ij} = \frac{x_{ij}}{0.5 + 1.5 \,\frac{dl}{avg\_dl} + x_{ij}} \; \log \frac{N - n_j + 0.5}{n_j + 0.5}$$

where $x_{ij}$ is the term-document frequency, $N$ is the total number of documents in the collection, $n_j$ is the document frequency, i.e., the number of documents in which term $j$ occurs, $dl$ is the document length, i.e., the number of terms in a particular document, and $avg\_dl$ is the average document length for the given document collection.

3.5 ATC weighting scheme

ATC [6, 7] stands for the augmented TFIDF weighting scheme. It is given by:

$$w_{ij} = \frac{\left(0.5 + 0.5 \,\frac{x_{ij}}{\max x_{ij}}\right) \log \frac{N}{n_j}}{\sqrt{\sum_{i=1}^{N} \left[\left(0.5 + 0.5 \,\frac{x_{ij}}{\max x_{ij}}\right) \log \frac{N}{n_j}\right]^{2}}}$$

where $x_{ij}$ is the term frequency, $\max x_{ij}$ is the maximum term frequency in each document, $N$ is the total number of documents in the collection, and $n_j$ is the number of documents in which term $j$ occurs.

3.6 TFICF weighting scheme

TFICF [7] refers to term frequency - inverse corpus frequency. This scheme produces document clusters of comparable quality to those generated by the other weighting schemes. The TFICF weighting scheme is given by

$$w_{ij} = \log(1 + x_{ij}) \, \log \frac{N + 1}{n_j + 1}$$

where $N$ is the total number of documents in the corpus, $n_j$ is the number of documents in the corpus in which term $j$ occurs (after stopword removal and Porter stemming), and $x_{ij}$ is the term frequency.

Among these weighting schemes, TFIDF is the most popular general-purpose term weighting scheme, followed by Okapi.
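Continuing the same illustrative sketch (again not the authors' code, and redefining the toy quantities X, N and n_j for self-containment), the Okapi, ATC and TFICF weights of Sections 3.4-3.6 can be written as:

```python
# Sketch of the Okapi, ATC and TFICF weightings of Sections 3.4-3.6,
# on the same hypothetical toy count matrix as before.
import numpy as np

X = np.array([[2, 0, 1],
              [0, 3, 1],
              [1, 1, 0]], dtype=float)   # hypothetical counts x_ij
N, p = X.shape
n_j = (X > 0).sum(axis=0)                # document frequency of term j
dl = X.sum(axis=1, keepdims=True)        # document length (tokens per document)
avg_dl = dl.mean()                       # average document length

# Okapi: length-normalised TF times a probabilistic IDF
w_okapi = (X / (0.5 + 1.5 * dl / avg_dl + X)) \
          * np.log((N - n_j + 0.5) / (n_j + 0.5))

# ATC: augmented TF times IDF, normalised column-wise as in Section 3.5
aug = (0.5 + 0.5 * X / X.max(axis=1, keepdims=True)) * np.log(N / n_j)
w_atc = aug / np.sqrt((aug ** 2).sum(axis=0, keepdims=True))

# TFICF: damped TF times a smoothed inverse corpus frequency
w_tficf = np.log1p(X) * np.log((N + 1) / (n_j + 1))
```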

4 Experimental Results

Our interest in this paper lies in studying the impact of the six term weighting schemes. We use the weighting schemes described in Section 3 to cluster the documents by applying the k-means algorithm. For our experimental analysis, we consider a corpus containing 6 documents as well as the benchmark data set classic. The classic data set contains 7095 documents classified into 4 classes.

4.1 Comparison of Cluster Quality Measures for 6 Documents

We compared the clustering solutions obtained for the different weighting schemes using the cluster quality measures entropy and purity. These are standard measures that help us ascertain cluster quality: the entropy of a good clustering should be low and its purity should be high. The entropy and purity values of the clustering solutions obtained for the six term weighting schemes are given below.

          TF      TFIDF   MI      ATC     Okapi   TFICF
Entropy   0.6667  1.0986  0.8675  1.0114  1.0986  1.0986
Purity    0.6667  0.6667  0.6667  0.6667  0.6667  0.6667

Table 1: Cluster quality for 6 documents

From Table 1, we see that the purity values are the same for all term weighting schemes. Comparing the clustering solutions based on entropy, the clustering solution obtained using the TF weighting scheme is the best: it has the lowest entropy value.
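For reference, below is a minimal sketch of how the entropy and purity of a clustering solution can be computed from true class labels and k-means cluster assignments; it assumes natural logarithms, which appears consistent with the 1.0986 ≈ ln 3 entries in Table 1, and the variable names are hypothetical.

```python
# Sketch of the entropy and purity cluster quality measures, given
# true class labels y_true and cluster assignments y_pred.
import numpy as np

def entropy_purity(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    total_entropy, total_purity = 0.0, 0.0
    for c in np.unique(y_pred):
        members = y_true[y_pred == c]          # true labels inside cluster c
        _, counts = np.unique(members, return_counts=True)
        probs = counts / counts.sum()          # class distribution in cluster c
        total_entropy += (len(members) / n) * -(probs * np.log(probs)).sum()
        total_purity += counts.max() / n       # majority-class share of cluster c
    return total_entropy, total_purity

# Hypothetical usage with k-means on a weighted matrix such as w_tfidf:
# from sklearn.cluster import KMeans
# labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(w_tfidf)
# print(entropy_purity(true_classes, labels))
```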

4.2 Comparison of Cluster Quality for the classic Data Set

In this section we compare the cluster quality of the clustering solutions for the classic data set under the six different weighting schemes. The k-means algorithm is applied to cluster the 7095 documents in this data set into 4 clusters.

          TF      TFIDF    MI      ATC     Okapi    TFICF
Entropy   0.7388  0.65965  0.9086  0.8051  0.47635  0.4432
Purity    0.4954  0.61240  0.4516  0.6214  0.7113   0.7620

Table 2: Cluster quality for the classic data set

From Table 2, we see that TFICF has the highest purity value and the lowest entropy value, which makes it clear that for the classic data set the TFICF weighting scheme gives the best clustering solution.

5 Conclusion

The impact of term weighting schemes on information retrieval models has been well studied, but not much work has been done on the impact of term weighting schemes on text clustering. A good term weighting scheme yields an informative term-document matrix representation of the documents; conversely, a bad term weighting scheme provides only a small amount of content information about them. From our analysis we notice that we cannot single out a superior term weighting scheme for the small document corpus that we considered, but using a large data set we are able to identify which one is better. Also, the existing term weighting schemes do not distinguish informative words from uninformative ones, which is crucial to the performance of document clustering. The role of term weighting schemes in text clustering needs to be further explored for different sizes and types of document collections.

References

[1] G. Hannah Grace and Kalyani Desikan. Experimental estimation of number of clusters based on cluster quality. arXiv preprint arXiv:1503.03168 (2015).

[2] Kadhim, Ammar Ismael, Yu-N. Cheah, and Nurul Hashimah Ahamed. Text document preprocessing and dimension reduction techniques for text document clustering. Artificial Intelligence with Applications in Engineering and Technology (ICAIET), 2014 4th International Conference on. IEEE, 2014.

[3] Salton, Gerard, Anita Wong, and Chung-Shu Yang. A vector space model for automatic indexing. Communications of the ACM 18.11 (1975): 613-620.

[4] Murty, M. Ramakrishna, J. V. R. Murthy, and Prasad Reddy PVGD. Text document classification based on least square support vector machines with singular value decomposition. International Journal of Computer Applications 27.7 (2011).

[5] Ab Samat, Nordianah, et al. Term weighting schemes experiment based on SVD for Malay text retrieval. IJCSNS International Journal of Computer Science and Network Security 8 (2008): 357-361.

[6] Chisholm, Erica, and Tamara G. Kolda. New term weighting formulas for the vector space method in information retrieval. Computer Science and Mathematics Division, Oak Ridge National Laboratory (1999).

[7] Jin, Rong, Christos Faloutsos, and Alex G. Hauptmann. Meta-scoring: automatically evaluating term weighting schemes in IR without precision-recall. Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2001.

[8] Vijayarani, S., J. Ilamathi, and Nithya. Preprocessing techniques for text mining - an overview. International Journal of Computer Science and Communication Networks, ISSN: 2249-5789, 5: 7-16.
