Clustering of text documents by implementation of K-means algorithms


Mr. Hardeep Singh
Assistant Professor, Department of Professional Studies, Post Graduate Government College, Sector 11, Chandigarh

Abstract

Document clustering has been investigated for use in a number of different areas of text mining and information retrieval. Initially, document clustering was investigated for improving precision or recall in information retrieval systems and as an efficient way of finding the nearest neighbours of a document. Clustering has also been proposed for browsing a collection of documents and for organizing the results returned by a search engine in response to a user's query. Document clustering has also been used to automatically generate hierarchical clusters of documents. The automatic generation of a taxonomy of web documents, like that provided by Yahoo, is often cited as a goal. A different approach finds natural clusters in an already existing document taxonomy and then uses these clusters to produce an effective document classifier for new documents. Initially I also believed that hierarchical clustering was superior to K-means clustering for text documents. During the course of my experiments, I found that a variant of simple K-means, namely spherical K-means (SKM), can produce clusters of documents that are better than those produced by regular K-means. I have also been able to find what I think is a reasonable explanation for this behaviour. I applied K-means and spherical K-means code written in MATLAB 7.7 to a waste water treatment plant dataset and to the 20 Newsgroups (20 Ng) dataset. I took test data of 200 documents from 20 Ng and clustered them with different numbers of clusters, k = 5, 10, 20 and 25. The clustering produced efficient results.

Keywords: Clustering, K-means, SKM, KDD, TDM, cosine similarity, Euclidean distance

1) Introduction

1.1) Data mining
Data mining is the extraction of hidden predictive information from large databases. It is also known as knowledge discovery in databases (KDD). It is a powerful technology with great potential to help companies focus on the most important information in their data warehouses [1]. Data mining tools predict future trends and behaviours, allowing businesses to make proactive, knowledge-driven decisions. Data mining techniques are the result of a long process of research and product development. This evolution began when business data was first stored on computers, continued with improvements in data access and, more recently, generated technologies that allow users to navigate through their data in real time. Data mining takes this evolutionary process beyond retrospective data access and navigation to prospective and proactive information delivery. Data mining is ready for application in the business community because it is supported by three technologies that are now sufficiently mature: massive data collection, powerful multiprocessor computers and data mining algorithms [2].

1.2) Text mining
We are facing an ever-increasing volume of text documents. The abundant texts flowing over the Internet, huge collections of documents in digital libraries and repositories, and digitized personal information such as blog articles and emails are piling up quickly every day. These have brought challenges for the effective and efficient organization of text documents. Text mining, also referred to as text data mining (TDM) and roughly equivalent to text analytics, is the process of deriving high-quality information from text [3]. The high-quality information is derived by identifying patterns and trends through means such as statistical pattern learning.
Text mining can help an organization derive potentially valuable business insights from text-based content such as word-processing documents and postings on social media streams like Facebook, Twitter and LinkedIn. Mining unstructured data with natural language processing (NLP), statistical modelling and machine learning techniques can be challenging because natural language text is often inconsistent [4]. It contains ambiguities caused by inconsistent syntax and semantics, including slang, language specific to vertical industries and age groups, double entendres and sarcasm. Text analytics software can help by transposing words and phrases in unstructured data into numerical values, which can then be linked with structured data in a database and analyzed with traditional data mining techniques. With an iterative approach, an organization can successfully use text analysis to gain insight into content-specific values such as sentiment, emotion, intensity and relevance. Because text analytics is considered an emerging technology, however, results and depth of analysis can vary widely from vendor to vendor.

1.3) Technologies used for text mining
Classification, clustering, feature extraction, association, regression and anomaly detection are the major technologies used for mining text from databases. I used the clustering technique for text data mining.

2) Objective
The main objective of my work is to analyze document clustering algorithms and to cluster documents by applying the K-means and spherical K-means (SKM) algorithms. For this purpose, I applied these clustering algorithms with different numbers of clusters. A comparative analysis of the algorithms indicates which one gives the better clustering results.

3) Review of literature

3.1) Documents clustering
Clustering is the division of data points into homogeneous classes or clusters such that:
- points in the same group are as similar as possible;
- points in different groups are as dissimilar as possible.
When a collection of objects is given, we put the objects into groups based on similarity. Clustering is a useful technique that organizes a large quantity of unordered text documents into a small number of meaningful and coherent clusters, thereby providing a basis for intuitive and informative navigation and browsing mechanisms [5]. Partitional clustering algorithms are considered more suitable than hierarchical clustering schemes for processing large datasets. A wide variety of distance functions and similarity measures have been used for clustering, such as squared Euclidean distance and cosine similarity.
Text document clustering groups similar documents to form a coherent cluster, while documents that are different are separated into different clusters [6]. For example, when clustering research papers, two documents are regarded as similar if they share similar thematic topics. When clustering is employed on web sites, we are usually more interested in clustering the component pages according to the type of information presented in each page. Accurate clustering requires a precise definition of the closeness between a pair of objects, in terms of either pairwise similarity or distance [7].

3.2) Similarity measures
Before clustering, a similarity measure must be determined. The measure reflects the degree of closeness or separation of the target objects and should correspond to the characteristics that are believed to distinguish the clusters embedded in the data. In many cases, these characteristics depend on the data or the problem context at hand; there is no measure that is universally best for all kinds of clustering problems. Moreover, choosing an appropriate similarity measure is crucial for cluster analysis, especially for particular types of clustering algorithms.

3.3) Euclidean distance
Euclidean distance is one of the distance measures used with the K-means algorithm. The Euclidean distance between an observation and each of the initial cluster centroids (for example, centroids 1 and 2) is calculated, and each observation is assigned to the cluster at minimum distance. Euclidean distance is a standard metric for geometrical problems. It is the ordinary distance between two points and can easily be measured with a ruler in two- or three-dimensional space. Euclidean distance is widely used in clustering problems, including clustering text. It is also the default distance measure used with the K-means algorithm [2].

3.4) Cosine Similarity
When documents are represented as term vectors, the similarity of two documents corresponds to the correlation between the vectors.
This is quantified as the cosine of the angle between the vectors, the so-called cosine similarity. Cosine similarity is one of the most popular similarity measures applied to text documents, for example in numerous information retrieval applications and in clustering.
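The two measures of Sections 3.3 and 3.4 can be illustrated with a minimal sketch. It is written in Python using only the standard library and is not the MATLAB code used in this work; the helper names (`term_vector`, `euclidean_distance`, `cosine_similarity`), the toy vocabulary and the toy documents are all assumptions made for illustration, and raw term counts are assumed as the vector weights.

```python
import math
from collections import Counter


def term_vector(text, vocabulary):
    """Represent a text as raw term counts over a fixed vocabulary (assumed weighting)."""
    counts = Counter(text.lower().split())
    return [float(counts[term]) for term in vocabulary]


def euclidean_distance(a, b):
    """Ordinary straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; defined as 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical vocabulary and documents, chosen only for illustration.
vocabulary = ["engine", "car", "hockey", "goal"]
doc_a = term_vector("car engine car", vocabulary)
doc_b = term_vector("car engine car car engine car", vocabulary)  # doc_a repeated twice
doc_c = term_vector("hockey goal hockey", vocabulary)
```

In this toy case `doc_b` has the same direction as `doc_a` but twice its length, so the Euclidean distance between them is non-zero while their cosine similarity is 1 (up to rounding). This length-insensitivity is one reason cosine similarity is often preferred for text.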

3.5) Information Retrieval
Information retrieval deals with the retrieval of information from a large number of text-based documents. Some of the problems of database systems are not usually present in information retrieval systems, because the two handle different kinds of data. Examples of information retrieval systems include online library catalogue systems, online document management systems and web search systems. The main problem in an information retrieval system is to locate relevant documents in a document collection based on a user's query. Such a query consists of some keywords describing an information need [8]. In such search problems, the user takes the initiative to pull relevant information out of a collection. This is appropriate when the user has a short-term need. However, if the user has a long-term information need, the retrieval system can also take the initiative to push any newly arrived information item to the user. This kind of access to information is known as information filtering, and the corresponding systems are known as filtering systems.

3.6) Application of clustering
Clustering is used in many fields. Here are some applications:
1. Clustering helps marketers improve their customer base and work on target areas. It helps group people (according to criteria such as willingness and purchasing power) based on their similarity in ways related to the product under consideration.
2. Clustering helps in identifying groups of houses based on their value, type and geographical location.
3. Clustering is used to study earthquakes. Based on the areas hit by earthquakes in a region, clustering can help analyse the next probable location where an earthquake may occur [9].

3.7) Clustering algorithms
A clustering algorithm tries to find natural groups in data on the basis of some similarity. It locates the centroid of each group of data points. To carry out effective clustering, the algorithm evaluates the distance of each point from the centroid of the cluster. The goal of clustering is to determine the intrinsic grouping in a set of unlabelled data [7].

3.8) K-means clustering algorithm

The K-means clustering algorithm is known to be efficient in clustering large data sets. It was developed by MacQueen and is one of the simplest and best-known unsupervised learning algorithms for solving the clustering problem. The K-means algorithm aims to partition a set of objects, based on their attributes or features, into k clusters, where k is a predefined or user-defined constant [19]. The main idea is to define k centroids, one for each cluster. The centroid of a cluster is formed in such a way that it is closely related, in terms of a similarity function, to all objects in that cluster; similarity can be measured by different methods such as cosine similarity, Euclidean distance and extended Jaccard.

For example, a pizza chain wants to open delivery centres across a city. What would be the possible challenges?
- They need to analyse the areas from which pizza is ordered frequently.
- They need to understand how many pizza stores have to be opened to cover delivery in the area.
- They need to figure out the locations for the pizza stores within these areas so as to keep the distance between the store and the delivery points to a minimum.
Resolving these challenges involves a lot of analysis and mathematics. Clustering can provide a meaningful and easy method of sorting out such real-life challenges.

3.9) K-means Clustering Method
If k is given, the K-means algorithm can be executed by the following steps:
1. Partition the objects into k non-empty subsets.
2. Identify the cluster centroids (mean points) of the current partition.
3. Assign each point to a specific cluster.
4. Compute the distances from each point to the centroids and allot the point to the cluster whose centroid is at minimum distance.
5. After re-allotting the points, find the centroid of each new cluster.
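The steps above can be sketched as follows. This is a minimal Python illustration using only the standard library, not the MATLAB code used in this work. It assumes Euclidean distance, and, as a simplification of step 1, it starts from caller-supplied initial centroids rather than from an initial partition; the function names (`kmeans`, `mean_point`), the stopping rule (stop when assignments no longer change) and the toy points are all assumptions.

```python
import math


def euclidean_distance(a, b):
    """Ordinary straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_point(points):
    """Coordinate-wise mean (centroid) of a non-empty list of points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def kmeans(points, initial_centroids, max_iterations=100):
    """Alternate assignment and centroid update until assignments stop changing."""
    centroids = [list(c) for c in initial_centroids]
    assignments = None
    for _ in range(max_iterations):
        # Steps 3-4: allot each point to the cluster with the nearest centroid.
        new_assignments = [
            min(range(len(centroids)),
                key=lambda j: euclidean_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the centroids are stable
        assignments = new_assignments
        # Steps 2 and 5: recompute each centroid as the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean_point(members)
    return assignments, centroids


# Hypothetical two-dimensional points forming two obvious groups.
points = [[1.0, 1.0], [1.5, 2.0], [1.0, 0.5], [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]]
assignments, centroids = kmeans(points, initial_centroids=[points[0], points[3]])
```

With these toy points the first three points end up in cluster 0 and the last three in cluster 1. The result of K-means in general depends on the initial centroids, which is why the sketch takes them as an explicit argument.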

In each step, K-means computes the distances between element vectors and cluster centroids and reassigns each document to the cluster whose centroid is closest. Then all centroids are recomputed [7].

3.10) Spherical K-means algorithm
Spherical K-means, that is, the K-means algorithm with cosine similarity, is a popular method for clustering high-dimensional text data. In this algorithm, each document as well as each cluster mean is represented as a high-dimensional unit-length vector [11]. However, it has mainly been used in batch mode. That is, each cluster mean vector is updated only after all document vectors have been assigned, as the (normalized) average of all the document vectors assigned to that cluster. It has been demonstrated that the online spherical K-means algorithm can achieve significantly better clustering results than general K-means.

4) Implementation and Results
I implemented the K-means code for two types of databases. K-means was implemented first for general numeric data, then for the waste water treatment plant database and for the 20 Newsgroups dataset. For spherical K-means, only the 20 Ng dataset was used. Both algorithms were coded in MATLAB 7.7 and run on these datasets. Brief information about the datasets and MATLAB is given below.

4.1) Datasets Information
The 20 Newsgroups data set: The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned evenly across 20 different newsgroups. It was originally collected by Ken Lang. The 20 Newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering. The data is organized into 20 different newsgroups, each corresponding to a different topic. Some of the newsgroups are very closely related to each other, while others are highly unrelated. Here is a list of the 20 newsgroups, partitioned (more or less) according to subject matter:

- comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x
- rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey
- sci.crypt, sci.electronics, sci.med, sci.space
- misc.forsale
- talk.politics.misc, talk.politics.guns, talk.politics.mideast
- talk.religion.misc, alt.atheism, soc.religion.christian

The second dataset is "Faults in an urban waste water treatment plant". This dataset comes from the daily measurements of sensors in an urban waste water treatment plant. The objective is to classify the operational state of the plant in order to predict faults through the state variables of the plant at each stage of the treatment process. This domain has been described as an ill-structured domain. The dataset description includes source information, relevant information, the number of instances (527) and the number of attributes (38). Each attribute has a numeric value for each input and output reading of the water treatment plant. The specific software tool used for this work is MATLAB 7.7.

4.2) MATLAB
MATLAB (matrix laboratory) is a multi-paradigm numerical computing environment and fourth-generation programming language. A proprietary programming language developed by MathWorks, MATLAB allows matrix manipulations, plotting of functions and data, implementation of algorithms, creation of user interfaces, and interfacing with programs written in other languages, including C, C++, Java, Fortran and Python. MATLAB has millions of users across industry and academia, coming from various backgrounds in engineering, science and economics.
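To make the batch spherical K-means procedure of Section 3.10 concrete before turning to the results, here is a minimal sketch. It is written in Python with only the standard library and is not the MATLAB 7.7 code used in this work. The function names (`spherical_kmeans`, `normalize`, `dot`), the caller-supplied initial means, the stopping rule and the toy term-count vectors are assumptions for illustration only.

```python
import math


def normalize(v):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    """Dot product; for unit-length vectors this equals their cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


def spherical_kmeans(documents, initial_means, max_iterations=100):
    """Batch spherical K-means: means are updated only after all documents are assigned."""
    docs = [normalize(d) for d in documents]
    means = [normalize(m) for m in initial_means]
    assignments = None
    for _ in range(max_iterations):
        # Assign every document to the mean with the highest cosine similarity.
        new_assignments = [
            max(range(len(means)), key=lambda j: dot(d, means[j])) for d in docs
        ]
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Batch update: each mean becomes the normalized sum (equivalently,
        # the normalized average) of the document vectors assigned to it.
        for j in range(len(means)):
            members = [d for d, a in zip(docs, assignments) if a == j]
            if members:  # an empty cluster keeps its previous mean
                means[j] = normalize([sum(c) for c in zip(*members)])
    return assignments, means


# Hypothetical term-count vectors over a four-term vocabulary.
documents = [[2, 1, 0, 0], [4, 2, 0, 0], [0, 0, 1, 3], [0, 0, 2, 5], [1, 1, 0, 0]]
assignments, means = spherical_kmeans(documents, [documents[0], documents[2]])
```

In this toy run the documents that use only the first two terms fall into cluster 0 and the others into cluster 1, regardless of document length, because every vector is reduced to its direction before comparison.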
I applied K-means and spherical K-means to the two datasets above and obtained results for k = 5, k = 10, k = 20 and k = 25 clusters. Graphs were drawn showing the documents with their corresponding cluster number for each number of clusters. First, graphs were plotted for the documents and their two attribute values. I also plotted a graph showing 200 documents for the sake of cluster visibility; then both K-means and SKM were run for the above values of k and the graphs were plotted. A clear difference between the two types of graphs is visible. The graphs are not included in this research paper.

5) Conclusions
During this work, I analyzed the clustering of documents and its implementation, with the aim of clustering documents by using partitioning clustering algorithms. For partitioning clustering methods, the advantage lies only in the improved clustering result, whereas the clustering time tends to increase. The algorithms were applied to numeric data and text documents. The system gave relevant results in my experiments. As explained, K-means was implemented for the waste water treatment plant data set, which has thirty-eight (38) attributes and four (4) objects, and it works efficiently. For the 20 Ng dataset, K-means was applied to two hundred (200) test documents having three attributes. When SKM is applied to these 200 documents of 20 Ng for different numbers of clusters, K-means works properly and shows sufficient results. However, when it is applied to many documents, it does not show sufficient results. The comparative analysis of the algorithms on the basis of the following criteria demonstrates that SKM yields better clustering quality than basic K-means.
1. Time complexity: The time complexity of K-means is O(n k i), where n is the number of data points in the dataset, k is the number of clusters and i is the number of iterations. Since I used K-means and then applied my procedure, the time complexity is equal to the sum of the two times. At first, my method finds the centroids for each pair of the k clusters. The average running time is high for K-means and low for SKM.
2. Sensitivity to noise: Noise and outliers present in the data strongly affect the quality of the clusters, so it is essential that an algorithm not be sensitive to noise. K-means is sensitive to noise, whereas SKM is partially sensitive.
3. Effect of high dimensionality of the database: SKM handles high-dimensional data better than K-means.
4. Effect of size of dataset: K-means is not suitable for large datasets. SKM provides better results than K-means for large databases.

5. Average density of clusters: The average density of a cluster is calculated by dividing the number of data points in the cluster by the area of that cluster. It is lower for K-means than for SKM.

6) Future Scope
In my experiment, a limitation was that one document can belong to only one cluster; it cannot be jointly assigned to more than one cluster. An improvement would be to assign one document to more than one cluster if its properties are relevant to them, and to include the clustering relationship between them. As future work, it was noticed that the intra-cluster evaluation measure would be more precise if it were combined with the inter-cluster distance stop criterion. Noise removal algorithms may be used to remove inconsistent data from the documents. The variant of K-means may be improved for clustering high-dimensional textual data.

7) References
[1] Raymond J. Mooney, Razvan Bunescu, "Mining knowledge from text using information extraction", Deptt. of Computer Science, University of Texas at Austin, USA (2002).
[2] Raj Kumar, "Classification Algorithms for Data Mining: a Survey", Department of Computer Science and Engineering, Jind Institute of Engg. & Technology, Jind, Haryana, India (2012).
[3] M. Hearst, "Untangling text mining" (1999).
[4] Jochen Dorre, Peter Gerstl and Roland Seiffert, "Text mining: finding nuggets in mountains of textual data", in Knowledge Discovery and Data Mining (1999).
[5] Michael Steinbach, Vipin Kumar, "Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data", IEEE, University of Minnesota, MN, USA (2003).
[6] Padmini Srinivasan, "The search for novelty in text" (2005).
[7] Anoop Jain, Aruna Bajpai, Manish Kumar Rohila, "Efficient Clustering Technique for Information Retrieval in Data Mining", Department of Computer Applications, Samrat Ashok Technological Institute, Vidisha (M.P.), India (2012).
[8] A. Christy & P. Thambidurai, "Improved information extraction for text mining with soft matching rules and expectations" (2003).
[9] Anand V. Saurkar, Vaibhav Bhujade, Priti Bhagat, Amit Khaparde, "Various Data Mining Techniques", Department of Computer Science & Engineering, Department of Information Technology, DMIETR, Sawangi (M), Wardha, Maharashtra, India (2014).
[10] Michael Steinbach, Vipin Kumar, "A comparison of document clustering techniques", IEEE, Deptt. of Computer Science & Engg., University of Minnesota (2000).
[11] Shi Zhong, "Efficient online spherical K-means clustering", Deptt. of Computer Science & Engineering, Florida Atlantic University, IEEE, USA (2008).

[12] [13]

Streamed Info-Ocean
