Document Clustering using Concept Space and Cosine Similarity Measurement


2009 International Conference on Computer Technology and Development

Document Clustering using Concept Space and Cosine Similarity Measurement

Lailil Muflikhah, Department of Computer and Information Science, Universiti Teknologi Petronas / Brawijaya University, Bandar Seri Iskandar, Tronoh, Perak, Malaysia, laililmf@gmail.com
Baharum Baharudin, Department of Computer and Information Science, Universiti Teknologi Petronas, Bandar Seri Iskandar, Tronoh, Perak, Malaysia, baharbh@gmail.com

Abstract — Document clustering is related to the concept of data clustering, one of the data mining tasks and a form of unsupervised classification. It is often applied to huge data sets in order to partition them on the basis of similarity. It was initially used in Information Retrieval to improve the precision and recall of queries. Clustering is easy for small data sets containing only the important attributes, and document clustering is very useful in retrieval applications for reducing query time while obtaining high precision and recall. We therefore propose to integrate an information retrieval method into document clustering as a concept-space approach. The method is known as Latent Semantic Indexing (LSI) and uses Singular Value Decomposition (SVD) or Principal Component Analysis (PCA). The aim of this method is to reduce the dimensionality of the term-document matrix by finding patterns of co-occurring terms in the document collection. Each method is applied to the term-document weights of the vector space model (VSM), and the resulting representation is clustered using the fuzzy c-means algorithm. Besides reducing the term-document matrix, this research also uses cosine similarity in place of Euclidean distance inside fuzzy c-means. As a result, the proposed method performs better than the existing method, with an f-measure of around 0.9 and an entropy of around 0.5.

Keywords — data mining; clustering; LSI; SVD; PCA; fuzzy c-means; Euclidean distance; cosine similarity

I.
INTRODUCTION

Data mining is a technique for extracting patterns from hidden information: it finds and describes structural patterns in a data collection as a tool for explaining the data and making predictions from it. Data mining tasks generally fall into two major categories: predictive tasks, which aim to predict the value of a particular attribute based on the values of other attributes, and descriptive tasks, which aim to derive patterns (correlations, trends, clusters, trajectories, and anomalies) [1]. Clustering is a method for automatically organizing a large data collection by partitioning a data set so that objects in the same cluster are more similar to one another than to objects in other clusters. Document clustering organizes a large collection of text documents. In the field of Information Retrieval (IR), document clustering is used to automatically group documents belonging to the same topic in order to support the user's browsing of retrieval results [2], and there is experimental evidence that IR applications benefit from the use of document clustering [3]. Document clustering has long been used as a tool to improve retrieval performance and navigation of large data sets. Clustering methods can be classified into hierarchical and partitioning methods, and partitioning methods can in turn be divided into hard clustering and overlapping (fuzzy) clustering. Any given document may cover multiple subjects or categories; this motivates the fuzzy clustering approach, which allows a document to appear in multiple clusters. It differs from hard clustering, in which a document belongs to exactly one cluster, under the assumption of well-defined boundaries between clusters. Much prior research on document clustering has used hard clustering methods.
Research shows that the Bisecting K-Means algorithm performs better than the basic K-Means algorithm and agglomerative hierarchical clustering when similarity is measured with the cosine formulation [4]. Other work has obtained better groupings by using the variance value [5] or the distance [6] in the grouping process. There are also several applications of the Fuzzy C-Means algorithm: it has been applied to clustering symbolic data, where the resulting clusters were of better quality than those of a hierarchical (hard clustering) method [7], and it has been applied in text mining [8]. When grouping or clustering documents, the problem is that a very large number of terms or words is needed to represent the full text in vector space, and lexical matching at the term level is often inaccurate: one word can have several meanings, and several words can share one meaning, so that matching returns irrelevant documents. It is difficult to judge which documents belong to the same cluster for a specific category without selecting the terms that are meaningful or correlated with one another. Therefore, this research uses a concept from the Information Retrieval approach, Latent Semantic Indexing (LSI). In this method, the documents are projected onto a small subspace of the vector space and clustered there; a new abstract vector space is created in which the important terms are captured [2].

978-0-7695-3892-1/09 $26.00 (c) 2009 IEEE. DOI 10.1109/ICCTD.2009.26

In this paper, we first describe information retrieval using the LSI concept. We then describe document similarity as the basic concept of clustering, together with one clustering algorithm applied to document clustering, Fuzzy C-Means. After that, the methodology used to implement document clustering with LSI and cosine similarity is presented, followed by the experimental evaluation. From the experiments, the performance can be analysed and conclusions drawn.

II. INFORMATION RETRIEVAL

Information retrieval is the task of finding the information that matches the user's request. An unrelated document may be retrieved simply because some query terms occur in it by accident; on the other hand, related documents may be missed because no term in the document occurs in the query. Retrieval can therefore be based on concepts rather than on terms, by first mapping items to a concept space and ranking by similarity, as shown in Fig. 1 [2].

Figure 1. Using concepts for information retrieval

Fig. 1 shows a middle layer of two concepts (c1 and c2) onto which queries and documents are mapped, instead of relating documents and terms directly as in vector retrieval. In this view, a query on concept c2 through term t3 returns d2, d3 and d4 in the answer set, based on the observation that they relate to concept c2, without requiring that the documents contain the term t3.

A. Latent Semantic Indexing (LSI)

Latent semantic indexing (LSI) was originally developed to find patterns in a document collection in order to improve the accuracy of Information Retrieval. It uses Singular Value Decomposition (SVD) or Principal Component Analysis (PCA) to decompose the original matrix A of the document representation and to retain only the k largest singular values of the singular value matrix. Only the largest singular values are kept, along with the corresponding columns of the two other matrices U and V^T. The choice of k determines how many of the important concepts the ranking is based on, under the assumption that concepts with small singular values are noise and can be ignored. LSI can thus be depicted by how the sizes of the involved matrices shrink when only the first k singular values are kept for computing the ranking, and by the positions of terms and documents in the matrix [9]. Fig. 2 depicts how LSI finds the pattern in a document collection.

Figure 2. Description of LSI in a data collection

B. Singular Value Decomposition (SVD)

The Singular Value Decomposition (SVD) is a method that can find the patterns in the matrix and identify which words and documents are similar to each other. From the term (t) x document (d) matrix A it creates new matrices U, S and V such that A = U S V^T, as illustrated in Fig. 3 [10].

Figure 3. The relative sizes of the three matrices when t > d

In Fig. 3, U has orthogonal, unit-length columns (U^T U = I) and holds the left singular vectors; V likewise has orthogonal, unit-length columns (V^T V = I) and holds the right singular vectors; and S is the (k x k) diagonal matrix of singular values, where k is the rank of A (k <= min(t, d)). In general, the matrices of A = U S V^T must all be of full rank. The amount of dimension reduction, i.e. the choice of k, must be made correctly in order to represent the real structure in the data [12].

C. Principal Component Analysis (PCA)

Principal component analysis is a method for finding the k principal axes, an orthonormal coordinate system that captures most of the variance in the data. PCA is essentially the Singular Value Decomposition (SVD) of the covariance matrix, i.e. it uses the eigenvectors and eigenvalues of the covariance matrix [11, 12].

III. DOCUMENT DISSIMILARITY

The dissimilarity of data objects (documents) is expressed by the distance between the document at the cluster center and the others. The distance d between two points x and y in one-, two-, three-, or higher-dimensional space is given by

d(x, y) = sqrt( sum_{k=1}^{n} (x_k - y_k)^2 )        (1)

where n is the number of dimensions and x_k and y_k are, respectively, the k-th attributes (components) of x and y [1].
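The LSI reduction described above can be sketched with a small, made-up term-document matrix (numpy assumed; the matrix values are illustrative only, not data from the paper):

```python
import numpy as np

# Toy term-document matrix A (t terms x d documents), e.g. TF-IDF weights.
A = np.array([
    [0.8, 0.7, 0.0, 0.1],
    [0.6, 0.9, 0.1, 0.0],
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.7, 0.9],
    [0.3, 0.2, 0.2, 0.3],
])

# Full SVD: A = U S V^T
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values (the "concepts").
k = 2
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

# Reduced document representation: each column of S_k V_k^T is a document
# expressed in the k-dimensional concept space.
docs_k = (np.diag(sk) @ Vtk).T               # shape (d, k)

# Rank-k product approximates A (best rank-k approximation in Frobenius norm).
A_k = Uk @ np.diag(sk) @ Vtk

print(docs_k.shape)                          # (4, 2)
print(np.allclose(A, U @ np.diag(s) @ Vt))   # True: the full SVD is exact
```

Clustering then operates on the rows of `docs_k` instead of on the full t-dimensional term vectors.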

In contrast, the similarity between data objects (documents) corresponds to a small distance within one cluster. Documents are often represented as vectors in which each attribute holds the frequency with which a particular term (word) occurs in the document. A common similarity measure for document clustering is the cosine of the angle between two vectors:

cos(d_i, d_j) = (d_i . d_j) / (||d_i|| ||d_j||)        (2)

where d_i and d_j are two different documents [1].

IV. FUZZY C-MEANS CLUSTERING

There are various fuzzy clustering algorithms; one simple fuzzy clustering technique is the fuzzy c-means algorithm (FCM) [13], which marked the birth of the fuzzy method. FCM is known to produce a reasonable partitioning of the original data in many cases (see [14]) and is very fast compared to some other approaches. Moreover, FCM converges from any initialization to a minimum or saddle point of the objective function, which may be either a local or the global minimum [15]. In principle, the algorithm is based on minimizing the objective function J(X; U, V), which in general is the sum of the dissimilarities weighted by the membership degrees u_ik [16]:

J(X; U, V) = sum_{i=1}^{c} sum_{k=1}^{n} (u_ik)^m d^2(x_k, v_i)        (3)

where d is the distance, V is the set of cluster centers and X is the data (the document-term matrix).

V. METHODOLOGY

The proposed method uses the LSI concept to obtain a small vector space. The detailed steps for document clustering are as follows:
1. Document preprocessing, which includes case folding, parsing, stop-word removal and stemming.
2. Removal of the terms whose global frequency is less than 2 or whose local frequency is more than half of the total number of documents.
3. Representation of the full-text documents as the term-document matrix A (vector space model) using TF-IDF term weights.
4. Mapping of the term-document matrix A to the document matrix V in concept space using the LSI approach. Since A = U S V^T (a property of the SVD), the document matrix can be obtained as V^T = S^{-1} U^T A.
5. Application of the Fuzzy C-Means clustering algorithm to the document matrix V as representative of the document collection, using cosine similarity in place of distance, that is, the dissimilarity 1 - cos(x, v), inside the objective function of Fuzzy C-Means.

VI. PERFORMANCE EVALUATION

Two measurements are used to assess clustering quality: F-measure and entropy [17]. The basic idea comes from information retrieval: each cluster is treated as if it were the result of a query, and each class as if it were the desired set of documents for that query. The F-measure is built from the precision and recall of each cluster j with respect to each class i:

Precision(i, j) = n_ij / n_j ;  Recall(i, j) = n_ij / n_i        (4)

where n_ij is the number of documents with class label i in cluster j, n_i is the number of documents with class label i, and n_j is the number of documents in cluster j. The F-measure of cluster j and class i is then

F(i, j) = 2 . Precision(i, j) . Recall(i, j) / (Precision(i, j) + Recall(i, j))        (5)

The higher the F-measure, the higher the accuracy of the cluster, taking both precision and recall into account. Another measurement, related to the internal quality of the clustering, is the entropy E_j of a cluster j:

E_j = - sum_i P(i, j) . log P(i, j)        (6)

where P(i, j) is the probability that a document has class label i given that it is assigned to cluster j. The total entropy of the clustering is obtained by summing the entropies of the clusters, weighted by cluster size:

E = sum_j (n_j / n) . E_j        (7)

where n_j is the size of cluster j and n is the total number of documents in the corpus. The lower the entropy, the higher the internal quality of the clustering.

VII. EXPERIMENTAL RESULT

We evaluated the performance of the proposed method using data sets taken from the 20 Newsgroups collection [18]. The data is made up of four groups of increasing volume: Binary2, Multi5, Multi7 and Multi10. They consist of short news postings on various topics, which serve as the class reference in the clustering process.
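The clustering step — fuzzy c-means with the cosine-based dissimilarity 1 - cos(x, v) — can be sketched as follows. This is an illustration using the standard FCM update rules with the dissimilarity swapped in (numpy assumed), not the authors' exact code:

```python
import numpy as np

def fuzzy_cmeans_cosine(X, c, m=2.0, eps=1e-4, max_iter=100, seed=0):
    """Fuzzy c-means using 1 - cos(x, v) as the dissimilarity."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((c, n))
    U /= U.sum(axis=0)                            # memberships sum to 1 per point
    for _ in range(max_iter):
        W = U ** m
        V = W @ X / W.sum(axis=1, keepdims=True)  # weighted-mean cluster centers
        # Dissimilarity D[i, k] = 1 - cos(v_i, x_k)
        Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
        Vn = V / np.linalg.norm(V, axis=1, keepdims=True)
        D = np.fmax(1.0 - Vn @ Xn.T, 1e-12)       # avoid division by zero
        # Standard FCM membership update with D in place of distance
        U_new = D ** (-2.0 / (m - 1.0))
        U_new /= U_new.sum(axis=0)
        if np.abs(U_new - U).max() < eps:
            U = U_new
            break
        U = U_new
    return U, V

# Two well-separated toy "document" groups in a 2-D concept space.
X = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])
U, V = fuzzy_cmeans_cosine(X, c=2)
labels = U.argmax(axis=0)
print(labels)  # the first two documents share one cluster, the last two the other
```

Hard labels are obtained by taking each document's highest membership, although the fuzzy memberships themselves allow a document to belong partially to several clusters.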
The Binary2 data set contains 200 documents, with 100 documents in each topic; the other data sets likewise contain 100 documents per topic. The Multi5 data set contains 500 documents (8367 terms), the Multi7 data set contains 700 documents (8660 terms), and the Multi10 data set contains 1000 documents (25627 terms). Each data set is clustered separately according to its number of topics. The first step is document preprocessing, which reduces the volume of the data; the result after preprocessing is shown in Table I.
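The F-measure and entropy of Section VI can be computed from class and cluster label lists roughly as follows (a sketch; the natural logarithm is assumed for the entropy, and the overall F-measure is taken as the class-size-weighted best match per class):

```python
import math
from collections import Counter

def f_measure_and_entropy(classes, clusters):
    """Overall F-measure and entropy of a clustering against true class labels."""
    n = len(classes)
    counts = Counter(zip(classes, clusters))  # n_ij
    n_i = Counter(classes)                    # documents per class
    n_j = Counter(clusters)                   # documents per cluster

    # F-measure: for each class take its best-matching cluster, weight by class size.
    total_f = 0.0
    for i in n_i:
        best = 0.0
        for j in n_j:
            nij = counts[(i, j)]
            if nij == 0:
                continue
            prec, rec = nij / n_j[j], nij / n_i[i]
            best = max(best, 2 * prec * rec / (prec + rec))
        total_f += (n_i[i] / n) * best

    # Entropy: per-cluster entropy weighted by cluster size (lower is better).
    total_e = 0.0
    for j in n_j:
        e_j = 0.0
        for i in n_i:
            p = counts[(i, j)] / n_j[j]       # P(i, j)
            if p > 0:
                e_j -= p * math.log(p)
        total_e += (n_j[j] / n) * e_j
    return total_f, total_e

classes  = ["a", "a", "b", "b"]
clusters = [0, 0, 1, 1]                       # a perfect clustering
f, e = f_measure_and_entropy(classes, clusters)
print(f, e)                                   # 1.0 0.0
```

A perfect clustering gives F-measure 1 and entropy 0; a completely mixed one pushes the F-measure toward the class proportions and the entropy toward log of the number of classes.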

TABLE I. DESCRIPTION OF THE DATA SETS AFTER PREPROCESSING

Dataset  | Total Topics | Total Docs | Total Terms | Topics
Binary2  | 2            | 200        | 7432        | talk.politics.mideast, talk.politics.misc
Multi5   | 5            | 500        | 3646        | comp.graphics, rec.motorcycles, rec.sport.baseball, sci.space, talk.politics.mideast
Multi7   | 7            | 700        | 3959        | alt.atheism, comp.sys.mac.hardware, misc.forsale, rec.autos, rec.sport.hockey, sci.electronics, talk.politics.guns
Multi10  | 10           | 1000       | 8655        | alt.atheism, comp.sys.mac.hardware, misc.forsale, rec.autos, rec.sport.hockey, sci.crypt, sci.electronics, sci.med, sci.space, talk.politics.guns

After that, the term weights are represented in a matrix (the vector space model) using the TF-IDF method. By applying the LSI method, the dimension of the term-document matrix is reduced as shown in Table II. The size of the reduction depends on the selected k-rank, with the optimum condition found in the interval [2...50].

TABLE II. DIMENSION REDUCTION OF THE TERM-DOCUMENT MATRIX USING LSI

Dataset  | Total Documents | Pattern-terms (SVD) | Pattern-terms (PCA)
Binary2  | 200             | 18                  | 26
Multi5   | 500             | 22                  | 22
Multi7   | 700             | 28                  | 24
Multi10  | 1000            | 20                  | 24

The document collection is then clustered with the Fuzzy C-Means algorithm, parameterized by the fuzziness m, the error tolerance, and a specific cluster number c equal to the number of topics (classes). With the LSI method applied (SVD and PCA), the distribution of the term-document data for the Binary2 data set and the positions of the cluster centers at a certain k-rank are illustrated in Fig. 4; the distribution of the data set in clustering differs between the two methods.

Figure 4. Data set and cluster-center distribution under the LSI approach (SVD method, left; PCA method, right) for the Binary2 data set

VIII. DISCUSSION

The reduction means that there are k patterns in each document collection. In order to determine the effect of the k-rank on cluster quality and to find the optimum condition (highest performance), we applied both methods with various k-ranks. For the Binary2 data set, SVD and PCA were applied with k ranging from 2 to 50, and the results are shown in Fig. 5 and 6.

Figure 5. Performance of clustering for Binary2 using SVD (precision, recall, f-measure and entropy plotted against k-rank from 2 to 50)

Figure 6. Performance of clustering for Binary2 using PCA (precision, recall, f-measure and entropy plotted against k-rank from 2 to 50)

The clustering performance of each LSI method reaches its optimum at a different k-rank: for Binary2 the optimum is at k-rank = 18 for SVD and k-rank = 16 for PCA. The same procedure was applied to the other data sets: Multi5, Multi7 and Multi10. Furthermore, Fig. 7 compares, for all data sets, the performance of document clustering without LSI against clustering with LSI and with cosine similarity replacing Euclidean distance.

Figure 7. Performance comparison of document clustering (precision, recall, f-measure and entropy for Binary2, Multi5, Multi7 and Multi10, with Euclidean distance and with cosine similarity)

Fig. 7 shows that the accuracy of document clustering without LSI is very low, especially for the large data volumes (Multi7 and Multi10) when Euclidean distance is used. On the Multi7 data set it achieves precision = 0.464, recall = 0.324 and f-measure = 0.38, with a high entropy of 2.337. The Multi10 data set shows the worst performance, with precision = 0.4, recall = 0.34, f-measure = 0.372 and entropy = 2.422. In contrast, when the LSI approach is used and cosine similarity replaces Euclidean distance in the Fuzzy C-Means algorithm, both the external and the internal cluster quality are very high. The Multi7 data set obtains an f-measure of 0.96 (SVD) and 0.93 (PCA) and an entropy of 0.597 (SVD) and 0.69 (PCA), and Multi10 an f-measure of 0.888 (SVD) and 0.887 (PCA) and an entropy of 0.7 (SVD) and 0.732 (PCA). The performance on the remaining data sets also increases, although not significantly.

IX. CONCLUSION

Document clustering can be performed using concept space and cosine similarity. This yields a significant reduction of the term-document matrix dimension with respect to the k-rank (the total number of patterns), and the average performance is very high, with an f-measure of about 0.9 and an entropy of about 0.5.
The improvement is significant when the method is applied to large data volumes (the Multi7 and Multi10 data sets), reaching an increase of more than 50%.

REFERENCES

[1] Pang-Ning Tan, M. Steinbach and Vipin Kumar, Introduction to Data Mining, Pearson International ed., Pearson Education, Inc., 2006.
[2] M.A. Hearst and J.O. Pedersen, "Reexamining the cluster hypothesis," in Proceedings of SIGIR '96, 1996.
[3] N. Jardine and C.J. van Rijsbergen, "The Use of Hierarchical Clustering in Information Retrieval," Information Storage and Retrieval, vol. 7, 1971.
[4] M. Steinbach, G. Karypis and V. Kumar, "A Comparison of Document Clustering Techniques," University of Minnesota, 2000.
[5] S.M. Savaresi, D.L. Boley, S. Bittanti and G. Gazzaniga, "Cluster Selection in Divisive Clustering Algorithms," 2002.
[6] D.T. Larose, An Introduction to Data Mining: Discovering Knowledge in Data, Wiley & Sons, Inc., 2005.
[7] Y. El-Sonbaty and M.A. Ismail, "Fuzzy Clustering for Symbolic Data," IEEE Transactions on Fuzzy Systems, vol. 6, 1998.
[8] M.E.S.M. Rodrigues and L. Sacks, "A Scalable Hierarchical Fuzzy Clustering Algorithm for Text Mining," in The 5th International Conference on Recent Advances in Soft Computing, 2004.
[9] K. Aberer, lecture notes, EPFL-SSC, Laboratoire de systemes d'informations repartis, 2003.
[10] S. Deerwester et al., "Indexing by latent semantic analysis," Journal of the American Society for Information Science, vol. 41, 1990, pp. 391-407.
[11] L. Smith, "A Tutorial on Principal Component Analysis," 2002.
[12] J. Shlens, "A Tutorial on Principal Component Analysis," 2009.
[13] J.C. Bezdek, Fuzzy Mathematics in Pattern Classification, Cornell University, Ithaca, New York, 1973.
[14] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, 1981.
[15] R. Hathaway, J. Bezdek and W. Tucker, "An Improved Convergence Theory for the Fuzzy ISODATA Clustering Algorithms," Analysis of Fuzzy Information, vol. 3, Boca Raton: CRC Press, 1987.
[16] Sadaaki Miyamoto, Hidetomo Ichihashi and Katsuhiro Honda, Algorithms for Fuzzy Clustering: Methods in c-Means Clustering with Applications, Studies in Fuzziness and Soft Computing, vol. 229, Springer, 2008.
[17] Bjornar Larsen and Chinatsu Aone, "Fast and Effective Text Mining Using Linear-time Document Clustering," in KDD-99, San Diego, California, 1999.
[18] http://kdd.ics.uci.edu/databases/20newsgroups/