Text Documents Clustering Using K-Means Algorithm

Mrs. Sanjivani Tushar Deokar, Assistant Professor
sanjivanideokar@gmail.com

Abstract: With the advancement of technology and reduced storage costs, individuals and organizations are tending towards the use of electronic media for storing textual information and documents. It is time consuming for readers to retrieve relevant information from an unstructured document collection. It is easier and less time consuming to find documents in a large collection when the collection is ordered or classified by group or category, but the problem of finding the best such grouping remains. This paper discusses our implementation of the K-Means clustering algorithm for clustering unstructured text documents, beginning with the representation of unstructured text and reaching the resulting set of clusters. Based on the analysis of the resulting clusters for a sample set of documents, we also propose a document representation technique that can further improve the clustering result.

Index terms: K-Means, document vector, residual sum of squares, TF-IDF.

1. Introduction

Clustering can be considered the most important unsupervised learning problem; like every other problem of this kind, it deals with finding structure in a collection of unlabeled data. A loose definition of clustering is the process of organizing objects into groups whose members are similar in some way. A cluster is therefore a collection of objects which are coherent internally, but clearly dissimilar to the objects belonging to other clusters.

2. Classification

Clustering algorithms may be classified as listed below:

Flat clustering: Creates a set of clusters without any explicit structure that would relate clusters to each other; it is also called exclusive clustering.
Hierarchical clustering: Creates a hierarchy of clusters.
Hard clustering: Assigns each document/object to exactly one cluster.
Soft clustering: Distributes each document/object over all clusters.

Algorithms:

Agglomerative clustering (hierarchical clustering)
K-Means (flat clustering, hard clustering)
EM algorithm (flat clustering, soft clustering)

Hierarchical Agglomerative Clustering (HAC) and the K-Means algorithm have been applied to text clustering in a straightforward way. Typically they use normalized, TF-IDF-weighted vectors and cosine similarity. Here, the K-Means algorithm operates on a set of points in an n-dimensional vector space for text clustering.

3. K-Means Algorithm

The K-Means clustering algorithm is known to be efficient for clustering large data sets. Developed by MacQueen, it is one of the simplest and best-known unsupervised learning algorithms that solve the well-known clustering problem. The K-Means algorithm aims to partition a set of objects, based on their attributes/features, into k clusters, where k is a predefined or user-defined constant. The main idea is to define k centroids, one for each cluster. The centroid of a cluster is formed in such a way that it is closely related (in terms of a similarity function; similarity can be measured using methods such as cosine similarity, Euclidean distance, or extended Jaccard) to all objects in that cluster.
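For orientation, the following is a minimal end-to-end sketch of clustering text documents with K-Means on normalized TF-IDF vectors. It assumes the scikit-learn library and uses invented sample documents and an invented value of k; it is not the implementation described in this paper, only an illustration of the general approach.

    # Illustrative sketch only (assumes scikit-learn is installed).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    docs = [
        "java programming and compilers",
        "pascal programming language basics",
        "relational databases and sql queries",
        "database indexing and query optimization",
    ]

    # Build normalized TF-IDF document vectors (one row per document).
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(docs)

    # Partition the documents into k = 2 clusters.
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
    labels = kmeans.fit_predict(X)

    for doc, label in zip(docs, labels):
        print(label, doc)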

3.1 Basic K-Means Algorithm

1. Choose k, the number of clusters to be determined.
2. Choose k objects randomly as the initial cluster centers.
3. Repeat
   3.1 Assign each object to its closest cluster.
   3.2 Compute the new clusters, i.e. calculate the mean points (centroids).
4. Until
   4.1 The cluster centers do not change location any more, OR
   4.2 No object changes its cluster.

The first step of K-Means is to select K randomly chosen documents, the seeds, as the initial cluster centers. The algorithm then moves the cluster centers around in the vector space in order to minimize RSS. This is done iteratively by repeating two steps until a stopping criterion is met: reassigning documents to the cluster with the closest centroid, and recomputing each centroid based on the current members of its cluster.

3.2 Residual Sum of Squares

RSS is the objective function in K-Means and our goal is to minimize it. RSS measures how well the centroids represent the members of their clusters: it is the squared distance of each vector from its centroid, summed over all vectors,

RSS_k = Σ_{x ∈ ω_k} |x − μ(ω_k)|²,   RSS = Σ_{k=1}^{K} RSS_k,

where μ(ω_k) denotes the centroid (mean vector) of cluster ω_k. Because the number of documents N is fixed, minimizing RSS is equivalent to minimizing the average squared distance.

Each document is represented as a vector such that the words (also called features) are the dimensions of the vector and the frequency of a word in the document is the magnitude of the corresponding component, i.e. the vector is of the form

<(t1, f1), (t2, f2), (t3, f3), ..., (tn, fn)>

where t1, t2, ..., tn are the terms/words (the dimensions of the vector) and f1, f2, ..., fn are the corresponding frequencies (the magnitudes of the vector components). The algorithm for creating a document vector is given below:

Input: Token Stream (TS), all the tokens in the document collection
Output: HS, a hash table of tokens with their respective frequencies
Initialize: Hash Table (HS) := empty hash table
for each Token in Token Stream (TS) do
    if Hash Table (HS) contains Token then
        Frequency := value of Token in HS
        increment Frequency by 1
    else
        Frequency := 1
    endif
    Store Frequency as value of Token in Hash Table (HS)
endfor
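The basic algorithm of Section 3.1 and the RSS objective of Section 3.2 can be sketched in a few lines of Python. This is a minimal illustrative implementation, not the paper's code; the function and variable names are invented, and it assumes documents have already been converted to fixed-length numeric vectors as described above.

    import numpy as np

    def kmeans(X, k, max_iter=100, seed=0):
        # X: (n_docs, n_terms) array of document vectors; k: number of clusters.
        rng = np.random.default_rng(seed)
        # Steps 1-2: pick k documents at random as the initial centroids (seeds).
        centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
        labels = np.zeros(len(X), dtype=int)
        for it in range(max_iter):
            # Step 3.1: assign each document to its closest centroid (Euclidean distance).
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            new_labels = dists.argmin(axis=1)
            # Step 4.2: stop when no document changes its cluster.
            if it > 0 and np.array_equal(new_labels, labels):
                break
            labels = new_labels
            # Step 3.2: recompute each centroid as the mean of its current members.
            for j in range(k):
                members = X[labels == j]
                if len(members) > 0:
                    centroids[j] = members.mean(axis=0)
        # RSS: squared distance of each vector from its centroid, summed over all vectors.
        rss = ((X - centroids[labels]) ** 2).sum()
        return labels, centroids, rss

    # Example: four tiny two-term "documents".
    X = np.array([[5.0, 1.0], [4.0, 0.0], [0.0, 6.0], [1.0, 5.0]])
    labels, centroids, rss = kmeans(X, k=2)
    print(labels, rss)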

3.3 Termination Condition

We can apply one of the following termination conditions:

A fixed number of iterations I has been completed. This condition limits the runtime of the clustering algorithm, but in some cases the quality of the clustering will be poor because of an insufficient number of iterations.

Assignment of documents to clusters (the partitioning function) does not change between iterations. Except for cases with a bad local minimum, this produces a good clustering, but the runtime may be unacceptably long.

Centroids do not change between iterations. This is equivalent to the assignment of documents not changing.

Terminate when RSS falls below a threshold. This criterion ensures that the clustering is of a desired quality after termination. In practice, we need to combine it with a bound on the number of iterations to guarantee termination.

Terminate when the decrease in RSS falls below a threshold θ. For small θ, this indicates that we are close to convergence. Again, we need to combine it with a bound on the number of iterations to prevent very long runtimes.

We now show that K-Means converges by proving that RSS monotonically decreases in each iteration. We will use "decrease" in the sense of "decrease or does not change" in this section. First, RSS decreases in the reassignment step: each vector is assigned to the closest centroid, so the distance it contributes to RSS decreases. Second, RSS decreases in the recomputation step, because the new centroid is the vector v for which RSS_k reaches its minimum:

RSS_k(v) = Σ_{x ∈ ω_k} |v − x|² = Σ_{x ∈ ω_k} Σ_m (v_m − x_m)²

where x_m and v_m are the m-th components of their respective vectors. Setting the partial derivative ∂RSS_k(v)/∂v_m = Σ_{x ∈ ω_k} 2(v_m − x_m) to zero, we get:

v_m = (1 / |ω_k|) Σ_{x ∈ ω_k} x_m

which is the component-wise definition of the centroid. Thus, we minimize RSS_k when the old centroid is replaced with the new centroid. RSS, the sum of the RSS_k, must then also decrease during recomputation. Because there is only a finite set of possible clusterings, a monotonically decreasing algorithm will eventually arrive at a (local) minimum. Take care, however, to break ties consistently, for example by assigning a document to the cluster with the lowest index if there are several equidistant centroids; otherwise, the algorithm can cycle forever in a loop of clusterings that have the same cost. Although this proves the convergence of K-Means, there is unfortunately no guarantee that a global minimum of the objective function will be reached.

3.4 Bad Choice of Initial Seed

In the K-Means algorithm there is unfortunately no guarantee that a global minimum of the objective function will be reached. This is a particular problem if a document set contains many outliers, i.e. documents that are far from any other documents and therefore do not fit well into any cluster. Frequently, if an outlier is chosen as an initial seed, then no other vector is assigned to it during subsequent iterations. Thus, we end up with a singleton cluster (a cluster with only one document) even though there is probably a clustering with lower RSS. Effective heuristics for seed selection include: excluding outliers from the seed set; trying out multiple starting points and choosing the clustering with the lowest cost; and obtaining seeds from another method such as hierarchical clustering.
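One of the heuristics above, trying multiple starting points and keeping the clustering with the lowest cost, is directly supported by scikit-learn's KMeans via its n_init parameter; the inertia_ attribute of the fitted model is the RSS of the best run. The following sketch is an assumption about tooling and toy data, not part of this paper's implementation.

    import numpy as np
    from sklearn.cluster import KMeans

    # Toy data; the last point acts as an outlier that would make a bad seed.
    X = np.array([[5.0, 1.0], [4.0, 0.0], [0.0, 6.0], [1.0, 5.0], [50.0, 50.0]])

    # Run K-Means from 20 different random seed sets and keep the run with the lowest RSS.
    km = KMeans(n_clusters=2, init="random", n_init=20, random_state=0).fit(X)
    print("labels:", km.labels_)
    print("RSS of best run:", km.inertia_)

    # Alternatively, init="k-means++" spreads the initial seeds out, which reduces
    # the chance that a single bad seed dominates the result.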

3.5 TF-IDF

TF-IDF stands for term frequency-inverse document frequency. It is a numerical statistic which reflects how important a word is to a document in a collection or corpus, and it is the most common weighting method used to describe documents in the vector space model, particularly for IR problems.

The number of times a term occurs in a document is called its term frequency. We can calculate the term frequency for a word as the ratio of the number of times the word occurs in the document to the total number of words in the document. The inverse document frequency is a measure of whether the term is common or rare across all documents. It is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient. Multiplying together these two metrics gives the tf-idf, placing importance on terms frequent in the document and rare in the corpus. The tf-idf of term t in document d is calculated as:

tf-idf(t, d) = tf(t, d) × idf(t) = tf(t, d) × log(N / df(t))

For example, consider three document vectors (after removing stop-words and performing stemming):

Doc1 < (computer, 60), (JAVA, 30) >
Doc2 < (computer, 55), (PASCAL, 20) >
Doc3 < (graphic, 24), (Database, 99) >

Here the total number of documents is N = 3, the term frequency of 'computer' is tf_computer, and the number of documents in which the term 'computer' occurs is df_computer. The vectors shown above indicate that the term 'computer' is less important than other terms (such as JAVA, which appears in only one document out of three) for identifying groups or clusters, because it appears in a larger number of documents (two out of three in this case), making it a less distinguishing feature for clustering. Whatever the actual frequency of a term may be, some weight must be assigned to each term depending on its importance in the given set of documents. The method used in our implementation is the tf-idf formulation, in which the frequency of term i, tf(i), is multiplied by a factor calculated using the inverse document frequency idf(i) defined above.

CLUSTERING RESULT

An input sample of 24 documents [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34] was provided to the K-Means algorithm. With the initial value of k = 24, the algorithm was run for three different scenarios:

(a) When the document vectors were formed on the basis of the features (words) of the document.
(b) When the document vectors were formed on the basis of the sub-category of the features.
(c) When the document vectors were formed on the basis of the parent category of the features.

The result of the K-Means clustering algorithm for each case is given below:

Cluster name          Documents
Text Mining           [12, 13, 14, 16, 17, 18, 28]
Databases             [11, 25, 26, 27]
Operating Systems     [23, 32]
Mobile Computing      [22, 24]
Microprocessors       [33, 34]
Programming           [30, 31]
Data Structures       [29]
Business Computing    [20, 21]
World Wide Web        [15]
Data Transfer         [19]
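To make the TF-IDF example above concrete, the following small sketch (illustrative only; it treats the listed term counts as the complete documents and uses invented variable names) computes the tf-idf weights for the three example documents and shows that 'computer' receives a lower weight than terms occurring in a single document.

    import math

    # Term frequencies taken from the example document vectors above.
    docs = {
        "Doc1": {"computer": 60, "JAVA": 30},
        "Doc2": {"computer": 55, "PASCAL": 20},
        "Doc3": {"graphic": 24, "Database": 99},
    }
    N = len(docs)  # total number of documents, N = 3

    # df(t): number of documents containing term t.
    df = {}
    for terms in docs.values():
        for t in terms:
            df[t] = df.get(t, 0) + 1

    # tf(t, d) as a ratio of occurrences to total words; tf-idf(t, d) = tf(t, d) * log(N / df(t)).
    for name, terms in docs.items():
        total = sum(terms.values())
        for t, f in terms.items():
            tfidf = (f / total) * math.log(N / df[t])
            print(f"{name} {t}: tf-idf = {tfidf:.3f}")

    # 'computer' occurs in 2 of 3 documents, so idf = log(3/2) ≈ 0.405,
    # lower than idf = log(3/1) ≈ 1.099 for single-document terms such as JAVA.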

4. CONCLUSION

In this paper I have discussed the concept of document clustering and presented the implementation technique of the K-Means clustering algorithm. I have compared three different ways of representing a document and suggested how an organized domain dictionary can be used to achieve better similarity results for the documents. The implementation discussed in this paper used the residual sum of squares and TF-IDF methods. The proposed representation could further improve the similarity measure of documents, which would ultimately provide better clusters for a given set of documents.

5. References

[1] A.K. Jain, M.N. Murty and P.J. Flynn, "Data Clustering: A Review", ACM Computing Surveys, Vol. 31, No. 3, 1999.
[2] V. Faber, "Clustering and the Continuous k-Means Algorithm", Los Alamos Science, November 22, 1994.
[3] A.K. Jain, "Data Clustering: 50 Years Beyond K-Means", in 19th International Conference on Pattern Recognition, Tampa, FL, 2008.
[4] D. Harman and E. Voorhees, editors, Proceedings of the Fifth Text Retrieval Conference (TREC-5), National Institute of Standards and Technology, Department of Commerce, 1996.
[5] C.D. Manning, P. Raghavan and H. Schütze, Introduction to Information Retrieval, Cambridge University Press, Cambridge, England, 2008.
[6] W. Wu, H. Xiong and S. Shekhar, Clustering and Information Retrieval, 2004.