Clustering Documents in Large Text Corpora


Bin He
Faculty of Computer Science, Dalhousie University, Halifax, Canada B3H 1W5
bhe

Yongzheng Zhang
Faculty of Computer Science, Dalhousie University, Halifax, Canada B3H 1W5
yongzhen

Abstract

In this report, two clustering algorithms, PDDP and EM, are tested on a corpus of neural network articles. Terms extracted by the C-value / NC-value method are used to construct the vector space model. The resulting clusters are evaluated with the mean scatter value for the PDDP algorithm and with the log likelihood for the EM algorithm.

1 Introduction

As the World Wide Web and online systems continue to grow at a tremendous rate, text clustering is becoming increasingly widespread [10]. The topic of clustering has been extensively studied in many scientific disciplines, and over the years a variety of different approaches have been developed [4, 6, 10]. Fast, high-quality document clustering algorithms play an important role in organizing large amounts of data into a small number of meaningful clusters [12].

Typically, clustering approaches can be categorized as agglomerative or partitional based on the underlying methodology of the algorithm, or as hierarchical or non-hierarchical based on the structure of the final solution [13].

In general, text clustering involves constructing a vector space model and representing documents by feature vectors. First, a set of keywords (or significant terms) is extracted from the document corpus to form the feature vector. Second, each document is represented by the feature vector, which consists of frequency and weight statistics of all significant terms. Finally, clustering proceeds by measuring the similarity (usually a function of Euclidean distance) between documents and assigning documents to appropriate clusters. In this project, we used significant terms as feature vectors, which differs from the work in [3], where keywords are used.

Traditionally, the frequency-of-occurrence method is used to extract useful terms. Research in [5] shows that the C-value / NC-value method, which combines linguistic and statistical information, performs better than the conventional frequency method. In this project, we applied both the frequency method and the C-value / NC-value method to extract terms from the corpus.

In our work, we took advantage of two software packages, PDDP [2] and weka [7], to cluster documents in a large text corpus of neural network articles. These two tools implement the PDDP algorithm [3] and the EM algorithm [9], respectively. We aim to evaluate the quality of the clustering results based on different feature vectors, generated using the two different methods, i.e., frequency of occurrence and C-value / NC-value.

2 Text Clustering Algorithms

In this section we briefly describe how the PDDP and EM algorithms work on a collection of documents.

2.1 PDDP algorithm

The method of Principal Direction Divisive Partitioning (PDDP) was developed by Boley [4]. It falls into both the partitional and the hierarchical categories. It first computes a root hyperplane, then a child hyperplane for each cluster formed from the root hyperplane, and so on. The algorithm proceeds by splitting a leaf node into two child nodes using the leaf's associated hyperplane. The final result is a binary tree of clusters defined by their associated principal directions and hyperplanes [3].

As indicated in [4], each document is represented by a column of term frequencies, and all the columns together form a term frequency matrix, say $M$. Specifically, the $(i,j)$-th entry, $M_{ij}$, is the number of occurrences of term $t_i$ in document $d_j$. In order to make the results independent of document length, each column is scaled to have unit length in the usual Euclidean norm: $\hat{M}_{ij} = M_{ij} / \sqrt{\sum_i M_{ij}^2}$, so that $\sum_i \hat{M}_{ij}^2 = 1$.

At each stage of the algorithm a cluster is split as follows. The centroid vector of the cluster is the vector $c$ whose $i$-th component is $c_i = \sum_j \hat{M}_{ij} / k$, where the sum is taken over all documents in the cluster and $k$ is the number of documents in the cluster. The principal direction of the cluster is the direction of maximum variance, defined to be the eigenvector corresponding to the largest eigenvalue of the unscaled sample covariance matrix $(\hat{M} - ce)(\hat{M} - ce)^T$, where $e$ is a row vector of all ones and $^T$ denotes the matrix transpose. All the documents are projected onto this principal direction; those with positive projections are allocated to the right child cluster, and the remaining documents are allocated to the left child cluster.
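To make the split step concrete, the following is a minimal Python sketch of one split under the definitions above, assuming NumPy; the function name pddp_split is hypothetical, and a full implementation such as the PDDP package [2] also handles the binary-tree bookkeeping and the choice of which leaf to split next.

    import numpy as np

    def pddp_split(M):
        # M is a terms-by-documents matrix of raw term frequencies
        # for the documents in one cluster.
        # Scale each column (document) to unit Euclidean length.
        Mhat = M / np.linalg.norm(M, axis=0, keepdims=True)
        # Centroid c: component-wise mean of the document columns.
        c = Mhat.mean(axis=1, keepdims=True)
        # Principal direction: leading left singular vector of (Mhat - ce),
        # i.e., the top eigenvector of (Mhat - ce)(Mhat - ce)^T.
        U, _, _ = np.linalg.svd(Mhat - c, full_matrices=False)
        u = U[:, 0]
        # Project the centered documents onto the principal direction
        # and split by the sign of the projection.
        proj = u @ (Mhat - c)
        right = np.flatnonzero(proj > 0)    # positive projections
        left = np.flatnonzero(proj <= 0)    # remaining documents
        return left, right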

2.2 EM algorithm

The expectation-maximization (EM) algorithm was first introduced by Dempster et al. [9] as an approach to estimating missing model parameters, and it can be seen as an iterative approach to optimization. It first finds the expected value of the log likelihood with respect to the current parameter estimates. The second step is to maximize the expectation computed in the first step. These two steps are iterated as necessary. Each iteration is guaranteed to increase the log likelihood, and the algorithm is guaranteed to converge to a local maximum of the likelihood function [1]. EM has many applications, such as text classification and text clustering, and it is one of the most widely used statistical unsupervised learning algorithms.

3 Evaluation of Clustering Results

In this section we discuss how we conducted various clustering tasks using the PDDP and weka EM clustering packages, and how we evaluated the quality of the clustering results.

3.1 Experiment setup

PDDP for Matlab [2] and EM for Java [7] are available online; our task was to create separate term frequency matrices for clustering with these two packages. Precisely, we applied two automatic term extraction methods (frequency of occurrence and C-value / NC-value) [5] to generate a set of candidate terms, and then manually extracted the real terms. Next, each document was scanned and the frequency of term occurrences was recorded to construct the term frequency matrix.

In terms of term selection, we had to select real terms with high frequencies of occurrence to construct the feature vectors, because intuitively similar documents must share a high rate of common terms. However, terms that appear too frequently, such as "neural network", might not be good features: most articles in this corpus are about neural networks, and using such features might result in most papers being clustered into a single cluster. So we eliminated the top 10 terms when constructing the term list, as in the sketch below. Consequently, we constructed eight different term frequency matrices, as described below, for the two software packages, corresponding to four different term lists. Then we ran the clustering software and compared the quality of both methods against the different feature vectors.
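The following Python sketch illustrates this construction for one term list; build_tfm is a hypothetical helper, documents are assumed to be pre-tokenized so that each extracted term appears as a single token, and no such code appears in the original report.

    import numpy as np
    from collections import Counter

    def build_tfm(documents, candidate_terms, drop_top=10):
        # documents: list of token lists; candidate_terms: extracted terms.
        # Rank the candidate terms by total frequency across the corpus.
        totals = Counter()
        for doc in documents:
            counts = Counter(doc)
            for term in candidate_terms:
                totals[term] += counts[term]
        ranked = [t for t, _ in totals.most_common()]
        # Drop the overly frequent top terms (e.g., "neural network").
        terms = ranked[drop_top:]
        # Term frequency matrix: one row per term, one column per document.
        M = np.zeros((len(terms), len(documents)))
        for j, doc in enumerate(documents):
            counts = Counter(doc)
            for i, term in enumerate(terms):
                M[i, j] = counts[term]
        return terms, M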

3.2 Evaluation Schemes

For clustering there are usually two kinds of measures of cluster goodness or quality [11]. One type allows us to compare different sets of clusters without reference to external knowledge and is called an internal quality measure; it uses the overall similarity, which is based on the pairwise similarity of documents in a cluster. The second type allows us to evaluate how well the clustering is working by comparing the groups produced by the clustering technique to known classes; it is called an external quality measure. One external measure is entropy [11], which provides a measure of goodness for un-nested clusters or for the clusters at one level of a hierarchical clustering. Another external measure is the F-measure [11], which is more oriented toward measuring the effectiveness of a hierarchical clustering.

For the PDDP algorithm, we can use scatter to measure the overall similarity, since the scatter value in PDDP plays the role that the distance measure plays in other clustering algorithms [4]: at each iteration, the cluster with the largest scatter value is the one that is split. For the EM algorithm, we can use the log likelihood to measure the overall similarity, since at each iteration EM optimizes the expected log likelihood of the parameters.

3.3 Clustering results using PDDP

Four experiments were carried out to test the performance of the PDDP algorithm. We constructed the term lists with both the frequency-of-occurrence and the C-value / NC-value methods [5]. The data sets for the experiments are summarized in Table 1. As stated in the previous subsection, the scatter value can be used as an indicator of overall similarity: the bigger the scatter, the more distinct the two clusters. As shown in Table 2, the mean scatters of the four experiments differ little. The largest mean scatters are from experiments Frq400 and Cv400, whereas the smallest is from Cv200. This suggests that for the PDDP algorithm, the more terms used in the vector space model, the better the resulting clusters.

Experiment   Data
Frq200       extract 200 terms by frequency from 1000 articles
Frq400       extract 400 terms by frequency from 1000 articles
Cv200        extract 200 terms by C-value from 1000 articles
Cv400        extract 400 terms by C-value from 1000 articles

Table 1: Experiment data set summary

Experiment   mean (scatter)   standard deviation (scatter)
Frq200
Frq400
Cv200
Cv400

Table 2: Scatter values of the four experiments

Figure 1 illustrates the distributions of the scatter values for the four experiments against the cluster node number. The overall distributions for the four experiments differ little.

3.4 Clustering results using EM

In our experiments with the EM algorithm, four term lists, namely TL1, TL2, TL3, and TL4, were constructed from 200 neural network articles. TL1 was created using the C-value / NC-value method and consists of the 100 terms ranked 11 to 110 in the final term list. TL2 was created using the same method, but with only the 50 terms ranked 11 to 60. TL3 and TL4 were generated using the frequency-of-occurrence method, with 100 and 50 terms, respectively. In this set of experiments, four term frequency matrices (TFM1, TFM2, TFM3, and TFM4) in the Attribute-Relation File Format (ARFF) [8] were constructed based on the term lists TL1, TL2, TL3, and TL4, respectively; a sketch of the ARFF encoding is given below. Next we ran the EM algorithm on the four term frequency matrices one by one. First we examine the quality of EM clustering on term frequency matrix TFM1, which was generated using the 100 terms extracted by the C-value / NC-value method.
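In ARFF, each document becomes one data row whose attributes are its term frequencies. A minimal sketch of what such a file might look like, with a hypothetical relation name and three hypothetical term attributes standing in for the 100 real ones:

    @relation tfm_example

    @attribute backpropagation numeric
    @attribute hidden_layer numeric
    @attribute learning_rate numeric

    @data
    3,0,1
    0,2,5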

Figure 1: Scatter distribution along the cluster node number for the four experiments (Cv400, Cv200, Frq400, and Frq200).

The EM algorithm uses the log likelihood to measure how likely a particular clustering is: the greater the log likelihood, the better the clustering result. We noticed that a minimum allowable standard deviation, σ, must be set for the normal density calculation, because text clustering in a high-dimensional space often involves large sparse data, and the joint density consequently overflows. Table 3 shows various values of σ and the corresponding clustering results, where N is the number of resulting clusters and l is the log likelihood.

σ   N   l

Table 3: Various σ values and clustering results

As we can see in Table 3, different values of σ result in different numbers of clusters and different log likelihoods. More importantly, for a particular term frequency matrix there exists a threshold t that allows meaningful clustering, such as t = 0.04 in this case.
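In weka, this minimum allowable standard deviation is an option of the EM clusterer. The following Python sketch shows the kind of invocation meant here, assuming a weka where weka.clusterers.EM accepts -M (minimum allowable standard deviation) and -N (number of clusters); the jar path and the file name tfm1.arff are hypothetical.

    import subprocess

    # Run weka's EM clusterer on one term frequency matrix, with the
    # minimum allowable standard deviation set to 0.05 and 7 clusters.
    subprocess.run([
        "java", "-cp", "weka.jar", "weka.clusterers.EM",
        "-t", "tfm1.arff",  # training file in ARFF format
        "-M", "0.05",       # minimum allowable standard deviation
        "-N", "7",          # number of clusters to fit
    ], check=True)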

With σ < t, EM achieves a positive log likelihood, which indicates a wrong clustering. We also observe that there is an optimal σ that achieves the greatest log likelihood (i.e., the best clustering performance), such as σ = 0.05 in this case. As shown in Table 4, seven clusters were built, where i is the cluster number, |C_i| is the number of documents in cluster C_i, and p is the percentage of all documents that fall in that cluster.

i   |C_i|   p
1   15      7.5%
2   28      14.0%
3   13      6.5%
4   17      8.5%
5   23      11.5%
6   49      24.5%
7   55      27.5%

Table 4: Best clustering results achieved with σ = 0.05

We are also interested in the EM clustering quality when the number of clusters is given as an additional option besides σ = 0.05. Table 5 shows the clustering results with the number of clusters, N, varying from 2 to 10.

N   l

Table 5: Clustering results for different numbers of clusters

As we can see in Table 5, these 200 documents are most probably clustered into 7 clusters, which gives the greatest log likelihood of -13.4; 6 or 8 clusters are also acceptable. Next we performed similar tests on TFM2, TFM3, and TFM4, respectively. The best clustering results achieved in each test are shown in Table 6, where N is the number of resulting clusters and l is the greatest log likelihood achieved in that test. We observe that different term frequency matrices of the same document set lead to different clustering quality. Precisely, the term frequency matrices (TFM1 and TFM2) built on terms generated by C-value / NC-value achieve better results than those (TFM3 and TFM4) built with the frequency method, and the higher-dimensional space produces better quality (100 terms vs. 50).

TFM   N   l

Table 6: Best clustering results for all TFMs

4 Conclusion

We have carried out a number of tests aiming to show the difference in clustering performance between different feature vectors and different dimensions. The experiments show that the quality of the clustering results depends on the feature vectors: clustering based on terms that are more representative of the document corpus achieves better quality. Moreover, text clustering usually suffers from high dimensionality, where the distances between documents all seem to be the same and the documents therefore all seem similar. For the PDDP algorithm, however, this is not the case, since PDDP does not use distance as its similarity measure.

Due to the limited time, we have only explored how to take advantage of mature software packages to conduct clustering tasks, and we have gotten a sense of how clustering proceeds on a small document set. In terms of future work, we are interested in digging deeper into the following topics:

- We are interested in using more formal evaluation criteria, such as entropy and the F-measure, as indicated in [11].

- As we have seen, the EM clustering algorithm falls into the non-hierarchical category. However, it can be extended to a hierarchical method by applying it to further decompose the clusters obtained in the first iteration: we force the EM algorithm to produce two clusters (bisecting) and iterate on this to produce a hierarchical decomposition of the same type as PDDP (see the sketch after this list). We could then compare the computational efficiency and clustering quality of the PDDP and EM algorithms in a high-dimensional space.

- The text corpus we use is a set of computer science articles on neural networks, and currently we are staying with a small collection. It will be interesting to experiment with different-sized subsets of the whole collection, and to estimate the time required to cluster the full papers in the whole neural network corpus.
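The bisecting idea in the second item can be sketched as follows in Python; fit_em is a hypothetical helper that runs a two-component EM clustering (e.g., via weka) and returns one cluster label per document, and no such code appears in the original report.

    def bisecting_em(docs, fit_em, depth):
        # Recursively split a document set into a PDDP-style binary tree
        # by forcing EM to produce exactly two clusters at each node.
        if depth == 0 or len(docs) < 2:
            return docs                       # leaf: an unsplit cluster
        labels = fit_em(docs, n_clusters=2)   # EM forced to bisect
        left = [d for d, g in zip(docs, labels) if g == 0]
        right = [d for d, g in zip(docs, labels) if g == 1]
        return (bisecting_em(left, fit_em, depth - 1),
                bisecting_em(right, fit_em, depth - 1))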

References

[1] J. Bilmes. A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. Technical Report ICSI-TR, University of Berkeley.

[2] Daniel Boley. Experimental software for Principal Direction Divisive Partitioning. boley/distribution/pddp.html, last accessed on Dec. 11, 2002.

[3] Daniel Boley. Principal direction divisive partitioning. Data Mining and Knowledge Discovery, 2(4), 1998.

[4] Daniel Boley, Maria Gini, Robert Gross, Eui-Hong Han, Kyle Hastings, George Karypis, Vipin Kumar, Bamshad Mobasher, and Jerome Moore. Document categorization and query generation on the World Wide Web using WebACE. Artificial Intelligence Review, 13:365-391, 1999.

[5] K. Frantzi, S. Ananiadou, and H. Mima. Automatic recognition of multi-word terms. International Journal on Digital Libraries, 3(2), 2000.

[6] A. Hotho, A. Maedche, and S. Staab. Ontology-based text clustering. In Proceedings of the IJCAI-2001 Workshop "Text Learning: Beyond Supervision", Seattle, USA, August 2001.

[7] The University of Waikato. Weka 3: Machine Learning Software in Java. ml/weka/index.html, last accessed on Dec. 11, 2002.

[8] The University of Waikato. Attribute-Relation File Format (ARFF). ml/weka/arff.html, last accessed on Dec. 11, 2002.

[9] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1-38, 1977.

[10] H. Schütze and C. Silverstein. Projections for efficient document clustering. In Proceedings of SIGIR '97, Philadelphia, 1997.

[11] M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques. In KDD Workshop on Text Mining, 2000.

[12] Y. Zhao and G. Karypis. Criterion functions for document clustering: Experiments and analysis. Technical Report TR #01-40, Department of Computer Science, University of Minnesota, Minneapolis, MN, 2001.

[13] Ying Zhao and George Karypis. Evaluation of hierarchical clustering algorithms for document datasets.
