A Study on K-Means Clustering in Text Mining Using Python

International Journal of Computer Systems (ISSN: 2394-1065), Volume 03, Issue 08, August 2016
Available at http://www.ijcsonline.com/

Dr. (Ms.) Ananthi Sheshasayee (1), Ms. G. Thailambal (2)
(1) Head and Associate Professor, Quaid-e-Milleth College for Women, Chennai, India
(2) Research Scholar, SCSVMV University, Kancheepuram, India

Abstract

According to statistics, India has 195,248,950 Internet users, the second largest number in the world. The total number of websites rose to 672,985,183 in 2013. Text mining is an emerging research area because the amount of information on the web grows every day. Users do not know how the documents displayed are linked to the query they gave. Sometimes the documents are relevant, and many times they are irrelevant to the query typed by the user. Whether the results are appropriate or inappropriate depends on the clustering algorithm applied. Obtaining a proper results page from these websites is possible only through clustering. Clustering is a fundamental process in many disciplines, and cluster analysis groups collections of similar patterns based on similarity factors. This paper discusses the tasks of text mining algorithms and clustering techniques. Several types of clustering algorithm are described, and the K-Means clustering algorithm is presented in detail along with its strengths and limitations. The paper also covers the computation measures used to identify similar objects for clustering, and gives detailed information about the applications of clustering and the tools used for clustering in different applications. Related work on the K-Means clustering algorithm in text mining and in other applications is presented, with the conclusion that K-Means can be combined with other algorithms to get efficient results.

Keywords: Text Mining, Clustering Algorithm, K-Means Clustering, Python.
I. INTRODUCTION

Text mining retrieves information, in the form of patterns, from unstructured textual data in web repositories. It is a variation on the field of data mining, which tries to find interesting patterns in large databases. Text mining, also known as Intelligent Text Analysis, Text Data Mining, or Knowledge Discovery in Text (KDT), refers generally to the process of extracting interesting and non-trivial information and knowledge from unstructured text [8]. Typically, only a small fraction of the many available documents is relevant to a given user. Without knowing what the documents contain, it is difficult to formulate effective queries for analyzing and extracting useful information from the data. Users need tools to compare documents, rank their importance and relevance, or find patterns and trends across multiple documents. Text mining has therefore become an increasingly popular and essential theme in data mining [9].

II. TASKS OF TEXT MINING ALGORITHMS [7]

A. Text Categorization
Assigning documents to pre-defined categories. Many statistical approaches have been applied, such as regression models and Support Vector Machines.

B. Text Clustering
Finding groups of similar data objects based on a similarity function. The methods applied are categorized as hierarchical and partitioning.

C. Concept Mining
Discovering concepts by combining the categorization and clustering approaches to find concepts and their relations in text collections.

D. Information Retrieval
Retrieving information from a collection of information resources according to the user's query.

E. Information Extraction
Automatically extracting structured information from unstructured or semi-structured documents.

III. CLUSTERING TECHNIQUES

Clustering is the grouping of similar data sets with the same content. Examples include grouping identical text messages in e-mail, or the same content from different books.
Text clustering algorithms are classified into many types, namely distance-based algorithms, frequent sequence algorithms, feature selection and extraction algorithms, and density-based algorithms. A clustering algorithm discovers groups in a set of documents such that documents within a group are more similar to each other than to documents in other groups [2].

Fig. 1. Documents Before and After Clustering (panels: Scattered Documents, Clustering Tasks, Clustered Documents)

The following conditions help to increase the effectiveness of clustering [1].

A. Similarity Measure: Only similar documents should be grouped together, but similarity is hard to define.
B. Dimension Reduction: The size of the data needs to be reduced, by removing irrelevant words from the text collection, to make the operations more efficient.
C. Cluster Labels: Different clusters need appropriate, distinct names so that they can be identified clearly.
D. Number of Clusters: The number of clusters has to be decided in advance, which is difficult when little information is available.
E. Overlapping of Clusters: The algorithm should allow clusters to overlap, since some documents cover several topics.
F. Scalability: The algorithm should be usable irrespective of the size of the data.
G. Flexibility: The algorithm should cope with different attributes, clusters, and so on.

The clustering hypothesis is formulated as follows: given a suitable clustering of a collection, if the user is interested in a document d, the user is likely to be interested in the other members of the cluster that contains d.

The parameters used by clustering algorithms are [3]:
(i) The number of clusters desired.
(ii) A minimum and maximum size for each cluster.
(iii) The control of overlap between clusters.
(iv) An arbitrarily chosen objective function to be optimized.
(v) A threshold value of the matching function, below which an object will not be included in the cluster.

H. Distance Computation
Most cluster analysis methods measure similarity between objects by computing the distance between each pair. The properties of distance are:
(i) Distance is never negative.
(ii) The distance from a point to itself is zero.
(iii) The distance from x to y is always the same as the distance from y to x.
(iv) The distance from point x to point y cannot be greater than the sum of the distance from x to any other point z and the distance from z to y.

Fig. 2. Key Tasks of Clustering

A. Distance Measures of Clusters

Euclidean distance: Attributes with the largest values tend to dominate, so the attributes should be properly scaled.
$$ D(x, y) = \left( \sum_{i} (x_i - y_i)^2 \right)^{1/2} \tag{1} $$

Manhattan distance: The largest-valued attributes dominate less than they do under Euclidean distance.

$$ D(x, y) = \sum_{i} |x_i - y_i| \tag{2} $$

Chebyshev distance: This is based on the maximum attribute difference.

$$ D(x, y) = \max_{i} |x_i - y_i| \tag{3} $$

Categorical distance: Used when many attributes have categorical values with only a small number of possible values. Let N be the total number of categorical attributes.

$$ D(x, y) = \frac{\text{number of attributes } i \text{ with } x_i \neq y_i}{N} \tag{4} $$

Key tasks of clustering (Fig. 2):
Document Representation: Convert the documents into structured form.
Clustering Logic: Determine which cluster each document is assigned to, based on the similarity measure.
Definition of Similarity Measure: Define the similarity between two documents.
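The four distance measures in equations (1) to (4) can be sketched in a few lines of Python. This is a minimal illustration rather than code from the paper: the function names are hypothetical, and the categorical distance is assumed to be the fraction of attributes on which the two objects differ.

```python
from math import sqrt


def euclidean(x, y):
    """Equation (1): square root of the summed squared differences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Equation (2): sum of the absolute attribute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def chebyshev(x, y):
    """Equation (3): largest absolute attribute difference."""
    return max(abs(a - b) for a, b in zip(x, y))


def categorical(x, y):
    """Equation (4), assumed reading: share of attributes whose values differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


x = [1.0, 2.0, 3.0]
y = [4.0, 6.0, 3.0]
print(euclidean(x, y))  # 5.0
print(manhattan(x, y))  # 7.0
print(chebyshev(x, y))  # 4.0
# One of the three attributes differs, so the distance is 1/3.
print(categorical(["red", "small", "round"], ["red", "large", "round"]))
```

On the same pair of points the three numeric measures give 5.0, 7.0, and 4.0, which shows how the choice of measure changes the weight given to the largest single difference.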

I. Types of Clustering [5]

Partitional clustering
The given n data objects are divided into k partitions, each representing a cluster, where k <= n. The partitioned data should satisfy the following criteria:
(i) At least one data object should be in each cluster.
(ii) A data object should belong to only one cluster.
Widely used methods are iterative (reallocation) clustering, in which data objects move from one cluster to another, and single-pass clustering, in which each data object is processed only once.

K-Means Clustering: The most widely used partitional clustering method is K-Means, which assigns each point to the cluster whose center, called the centroid, is nearest. The center is the average of all the points in the cluster: its coordinates are the arithmetic mean, taken separately for each dimension, over all the points in the cluster [6].

The Steps of K-Means:
Step 1: Choose the number of clusters, k.
Step 2: Randomly generate k points as the initial cluster centers.
Step 3: Determine the Euclidean distance of each object to all centroids.
Step 4: Assign each point to the nearest centroid.
Step 5: Re-compute the new cluster centers.
Step 6: Repeat Steps 3 to 5 until convergence.

This algorithm aims to minimize the following function for k clusters and n data points:

$$ J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - c_j \rVert^2 \tag{5} $$

where j = 1 to k indexes the clusters, i = 1 to n indexes the data points, C_j is the set of data points assigned to cluster j, and \lVert x_i - c_j \rVert is the chosen Euclidean distance measure between data point x_i and cluster center c_j.

K-Means still has some limitations: it cannot handle outliers, and it does not produce intermediate solutions. Nevertheless, the algorithm is traditionally used in most applications because it is easy to implement and its time complexity is O(N) [10], where N is the number of objects to be grouped, for a fixed number of clusters and iterations. Table 1 contains the advantages of K-Means clustering.

Hierarchical Clustering
These methods either start with one cluster and split it into smaller and smaller clusters, or start with small clusters and merge similar ones into larger and larger clusters. In both cases the result is a tree of clusters.
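The K-Means steps listed above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `k_means` and `objective` are hypothetical, the initial centers are assumed to be k randomly chosen data points, convergence is taken to mean that the assignments no longer change, and a cluster that becomes empty simply keeps its previous center.

```python
import random
from math import dist  # Euclidean distance between two points (Python 3.8+)


def k_means(points, k, max_iter=100, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k groups."""
    rng = random.Random(seed)
    # Step 2 (assumption): use k distinct data points as the initial centers.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Steps 3 and 4: assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:  # Step 6: assignments are stable, so stop
            break
        labels = new_labels
        # Step 5: recompute each centroid as the per-dimension mean.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


def objective(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
print(objective(points, labels, centroids))
```

On this toy data the first three points and the last three points end up in separate clusters. Which cluster receives label 0 depends on the random seed, so the labels should be compared for equality rather than read as fixed identifiers.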
Density Based Clustering
For each data point in a cluster, at least a minimum number of points must exist within a given radius. Each cluster is a dense region of points surrounded by regions of low density.

Grid Based Clustering
The object space is divided into a grid according to the characteristics of the data. This method is not affected by data ordering and can deal with non-numeric data easily.

Model Based Clustering
This algorithm builds clusters with a high level of similarity within them and a low level of similarity between them. It works on the mean values and minimizes the squared error function.

Table 1: Advantages of K-Means

Advantages                                       K-Means
Type of attributes the algorithm can handle      Numeric
Time complexity                                  Low
Data ordering dependency                         Yes
Prior knowledge and user-defined parameters      Yes
Interpretability of results                      Clusters
Ability to memorize results                      Centroids

J. Clustering Implementation in Python

The following partial code is implemented in the Python language [22].
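Independently of the authors' listing, the document representation step that any text clustering implementation needs can be sketched as follows. This is not the code from [22]; it is a minimal illustration under stated assumptions: the names `STOP_WORDS`, `tokenize`, and `vectorize` are hypothetical, the stop-word list is a tiny made-up sample, documents are represented as raw term-frequency vectors, and no stemming is applied.

```python
import re
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

# A tiny illustrative stop-word list (an assumption, not taken from the paper).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "to", "for"}


def tokenize(text):
    """Lower-case the text, keep alphabetic words, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def vectorize(documents):
    """Return (vocabulary, term-frequency vectors) for a list of strings."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    vectors = [tuple(c[term] for term in vocabulary) for c in counts]
    return vocabulary, vectors


docs = [
    "Clustering of text documents",
    "Text clustering and document clustering",
    "The price of gold in the market",
]
vocabulary, vectors = vectorize(docs)
print(vocabulary)
# The two documents about clustering are closer to each other than to the third.
print(dist(vectors[0], vectors[1]), dist(vectors[0], vectors[2]))
```

Removing stop words is one simple form of the dimension reduction discussed earlier. Because no stemming is done, "document" and "documents" remain separate terms; a fuller pipeline would normally merge them before clustering the vectors.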

Fig. 3. Sample Clustering Implementation using Python

IV. RELATED WORK OF K-MEANS CLUSTERING IN OTHER APPLICATIONS

Oyelade, O. J. et al. present the k-means clustering algorithm as a simple and efficient tool for monitoring the progression of students' performance in higher institutions. They analyzed the students' results using cluster analysis and used standard statistical algorithms to arrange the score data according to the level of performance [11].

Bader Aljaber et al. report that the use of citation contexts, when combined with the vocabulary in the full text of documents in High Energy Physics and Genomics, is a promising alternative means of capturing the critical topics covered by journal articles. The authors use a link-based clustering algorithm, which determines the similarity between documents from the number of co-citations. They used a bi-clustering algorithm and, at the end, applied the K-Means algorithm to reduce the size of the bi-clusters by merging their similar documents [12].

V. RELATED WORK OF TEXT MINING APPLICATIONS USING K-MEANS CLUSTERING ALGORITHM

Anil Kumar Pandey et al. use the k-means algorithm to cluster web documents to help researchers. The authors extract document features and apply the Apriori

algorithm, which generates mutually exclusive frequent sets that are taken as the initial points of the k-means clustering algorithm. As a result, highly related documents with the same features appear together [13].

Neetu Sharma et al. use the K-Means algorithm and the Random Forest classifier in the WEKA tool, and conclude that applying clustering before classification to the data file poach.arff from WORDNET optimized the performance [14].

VI. CONCLUSION

The performance of a clustering algorithm depends on the structure, the amount, and the representativeness of the data. The applications in which clustering is widely used, discussed in this paper, show the importance of clustering in text mining. Many other clustering algorithms are available, each with its own pros and cons, and they can be combined to get better results.

REFERENCES

[1] Francis Musembi Kwale, A Critical Review of K-Means Text Clustering Algorithms, International Journal of Advanced Research in Computer Science, Volume 4, No. 9, ISSN 0976-5697.
[2] Dan Munteanu, Severin Bumbaru, A Survey of Text Clustering Techniques Used for Web Mining, The Annals of Dunarea de Jos University of Galati, Fascicle III, ISSN 1221-454X.
[3] C. J. Van Rijsbergen, Information Retrieval, Butterworths, London.
[4] Pushplata, Mr. Ram Chatterjee, An Analytical Assessment on Document Clustering, I.J. Computer Network and Information Security, 5, 63-71, DOI: 10.5815/ijcnis.2012.05.08.
[5] Ms. S. Prabha, Dr. K. Duraiswamy, Ms. M. Sharmila, Analysis of Different Clustering Techniques in Data and Text Mining, International Journal of Computer Science Engineering (IJCSE), Vol. 3, No. 02, ISSN 2319-7323.
[6] Mrs. S. C. Punitha, Dr. M. Punithavalli, A Comparative Study to Find a Suitable Method for Text Document Clustering, International Journal of Computer Science & Information Technology, Vol. 3, No. 6.
[7] Mrs. Sayantani Ghosh, Mr. Sudipta Roy, Prof. Samir K. Bandyopadhyay, A Tutorial Review on Text Mining Algorithms, International Journal of Advanced Research in Computer and Communication Engineering, Vol. 1, Issue 4, ISSN 2278-1021.
[8] Vishal Gupta, Gurpreet S. Lehal, A Survey of Text Mining Techniques and Applications, Journal of Emerging Technologies in Web Intelligence, Vol. 1, No. 1.
[9] R. Sagayam, S. Srinivasan, S. Roshni, A Survey of Text Mining: Retrieval, Extraction and Indexing Techniques, International Journal of Computational Engineering Research, Vol. 2, Issue 5, pp. 1443-1446.
[10] Comparative Study of Clustering Algorithms on Textual Databases, Thesis submitted to Technical University Ilmenau, Germany.
[11] O. J. Oyelade, O. O. Oladipupo, I. C. Obagbuwa, Application of K-Means Clustering Algorithm for Prediction of Students Academic Performance, (IJCSIS) International Journal of Computer Science and Information Security, Vol. 7, Issue 1.
[12] Bader Aljaber, Nicola Stokes, James Bailey, Jian Pei, Document Clustering of Scientific Texts Using Citation Contexts, Information Retrieval, DOI 10.1007/s10791-009-9108-x, Springer Science+Business Media, LLC.
[13] Anil Kumar Pandey, T. Jaya Laxmi, Web Document Clustering for Finding Expertise in Research Area, BVICAM's International Journal of Information Technology, Vol. 1, No. 2, ISSN 0973-5658.
[14] Neetu Sharma, Dr. S. Niranjan, Optimization of Word Sense Disambiguation Using Clustering in Weka, International Journal of Computer Technology & Applications, Vol. 3 (4), 1598-1604, ISSN 2229-6093.
[15] L. V. Bijuraj, Clustering and its Applications, Proceedings of National Conference on New Horizons in IT, ISBN 978-93-82338-79-6.
[16] https://code.google.com/p/sofia-ml
[17] http://nlp.fi.muni.cz/projekty/gensim
[18] http://mahout.apache.org
[19] http://radimrehurek.com/gensim
[20] http://carrotsearch.com/lingo3g
[21] http://graphlab.org
[22] Toby Segaran, Programming Collective Intelligence: Building Smart Web 2.0 Applications, Sebastopol, CA: O'Reilly Media.