Automatic Text Clustering via Particle Swarm Optimization


Xing Gao (2), Yanping Lu (1)*
(1) Department of Cognitive Science, Xiamen University, China, Yanpinglv@xmu.edu.cn
(2) Software School of Xiamen University, China, Gaoxing@xmu.edu.cn

Abstract

An automatic clustering algorithm based on particle swarm optimization, termed ATCPSO, is proposed for text clustering in this paper. autopso has been used to evolve the correct number of clusters while simultaneously identifying the clusters, and has been shown to improve the performance of high-dimensional data clustering. Extending autopso to text clustering requires a few modifications to the algorithm, namely to the similarity measure, the parameter selection and the criterion function. Our experimental results on ten structured text datasets built from 20 newsgroups and on four text datasets selected from CLUTO show that the proposed algorithm greatly improves the quality of text clustering compared to four typical clustering algorithms and one competitive subspace clustering method.

Keywords: Text Clustering, Automatic Determination of k, Particle Swarm Optimization

1. Introduction

With the growing popularity of the Internet and the continuing development of the Web, there has been an explosion in the volume and variety of textual data. This has led to a need for new techniques that help users effectively navigate, organize and manage the available web documents, with the goal of finding those that best match their needs. Text clustering, which consists of automatically grouping a set of documents into a predefined number of categories, is one of the most popular techniques for achieving this objective. Traditional clustering algorithms have been extensively investigated for text clustering over the years.
These algorithms fall into two classes: partitional methods and agglomerative hierarchical approaches. Partitional algorithms, such as k-means [1], bisection k-means [2] and graph-based methods [3], find the clusters by partitioning the dataset into a predefined number of clusters, while agglomerative algorithms, such as UPGMA [4], single-link [5] and Chameleon [6], start by treating each data object as its own cluster and then repeatedly merge pairs of clusters until a termination criterion is met. For high-dimensional document clustering, partitional algorithms have been shown to lead to better solutions than agglomerative hierarchical algorithms [7]. However, partitional algorithms require the number of clusters k to be predefined, and they are often sensitive to the initial cluster centroids. As a result, they fail to produce satisfactory clustering results on document datasets.

Given the good performance of stochastic search procedures, one way to determine the number of clusters automatically is to use evolutionary techniques. In this regard, genetic algorithms (GA) have been the most frequently proposed for automatically clustering data sets; in these algorithms clustering is formulated as a continuous optimization problem. Since PSO has been demonstrated to perform better than GA on such optimization problems, this paper presents a particle swarm optimization method that clusters documents and simultaneously identifies the correct number of clusters, called Automatic Text Clustering via Particle Swarm Optimization (ATCPSO). It extends the particle swarm optimizer for automatically clustering high-dimensional data (autopso) [8] to handle the problem of clustering documents.
From the experimental results on ten groups of text datasets with different structures built from 20 newsgroups and four datasets selected from CLUTO, we find that ATCPSO overall outperforms four typical clustering algorithms, namely k-means, Bisection k-means, Agglomeration and Graph-based, and one competitive subspace clustering algorithm, PSOVW [9].

The rest of the paper is organized as follows. Section 2 summarizes the autopso algorithm and gives the modifications of autopso for text clustering. Section 3 presents the experimental results of ATCPSO, including the description of the text datasets from different sources, the text clustering evaluation methods, and the experimental results. Section 4 draws some conclusions and points out directions for future work.

International Journal of Digital Content Technology and its Applications (JDCTA), Volume 6, Number 23, December 2012, doi: /jdcta.vol6.issue

2. Automatic Text Clustering via Particle Swarm Optimization (ATCPSO)

A. autopso

An effective clustering algorithm consists of two key components: the criterion function and the search algorithm. Lu et al. [8] employed a criterion function that makes it possible to directly compare partitions with similar or different numbers of clusters, in the same generation or between adjacent generations, together with a particle swarm optimization global search strategy, called autopso, that minimizes this criterion function in order to determine the number of clusters k automatically. The Davies-Bouldin (DB) index [10] used in autopso aims at both minimizing within-cluster scatter and maximizing between-cluster separation. The objective function employed in autopso is

$$\min J(Z) = \frac{1}{k} \sum_{i=1}^{k} R_i \qquad (1)$$

where

$$R_i = \max_{j,\, j \neq i} \frac{S_i + S_j}{B_{ij}}$$

is the ratio of the sum of the within-cluster scatters, with $S_i = \frac{1}{|C_i|} \sum_{x \in C_i} d(x, z_i)$, to the between-cluster separation $B_{ij} = d(z_i, z_j)$. Here, $x$ is a data object and $z_i$ is the representative of cluster $C_i$, taken to be the cluster centroid $z_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$, where $|C_i|$ is the number of data objects belonging to cluster $C_i$, and $d$ is the distance function that measures the dissimilarity between two data objects. DB takes values in the interval $[0, +\infty)$ and needs to be minimized.

Given the objective function, clustering is formulated as a continuous function optimization problem with bound constraints. Instead of employing local search strategies, autopso uses a particle swarm optimization algorithm, which is simple to implement, to minimize the given objective function. Furthermore, in order to encode a variable number of clusters, autopso represents a particle by a real-number matrix and a binary vector.
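As a rough illustration of Eq. (1), the DB index of a fixed partition can be computed as sketched below. This is a minimal sketch rather than the authors' implementation: it assumes NumPy, takes the Euclidean distance for d, and the function name `davies_bouldin` is hypothetical.

```python
import numpy as np

def davies_bouldin(X, labels):
    """Mean over clusters of R_i = max_{j != i} (S_i + S_j) / B_ij, as in Eq. (1).

    Assumption: d is the Euclidean distance. Coinciding centroids give
    B_ij = 0 and therefore an infinite index.
    """
    ids = np.unique(labels)
    if len(ids) < 2:
        raise ValueError("the DB index needs at least two non-empty clusters")
    # z_i: centroid of each cluster
    centroids = np.array([X[labels == c].mean(axis=0) for c in ids])
    # S_i: mean distance from the members of cluster i to its centroid
    scatter = np.array([
        np.linalg.norm(X[labels == c] - z, axis=1).mean()
        for c, z in zip(ids, centroids)
    ])
    # B_ij: distance between the centroids of clusters i and j
    sep = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
    k = len(ids)
    off_diag = ~np.eye(k, dtype=bool)
    ratios = np.full((k, k), -np.inf)   # the diagonal is excluded from the max
    ratios[off_diag] = (scatter[:, None] + scatter[None, :])[off_diag] / sep[off_diag]
    return ratios.max(axis=1).mean()

# Toy data: two tight pairs of points that are far apart from each other
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
print(davies_bouldin(X, np.array([0, 0, 1, 1])))  # compact, well-separated partition
print(davies_bouldin(X, np.array([0, 1, 0, 1])))  # partition that mixes the two pairs
```

The compact partition yields the smaller value, which is why the index is minimized.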
Then, a new crossover matrix learning procedure, governed by the associated binary vector, is used to maintain the population diversity, making the algorithm immune to the premature convergence problem. autopso has been applied to high-dimensional data clustering and has been shown to be less sensitive to the initial cluster centroids and to greatly improve the cluster quality.

B. Modifications for Text Clustering

Extending autopso to text clustering requires some modifications. First, due to the sparsity and high dimensionality of text data, a measure other than the Euclidean distance must be chosen to measure the similarity between two text documents. In addition, we added a parameter, the crossover probability $P_c$, in order to maintain the population diversity of PSO; in ATCPSO it is set specifically for text datasets with very high dimensions.

Throughout this paper, we use the symbols $N$, $M$ and $k$ to denote the number of text documents, the number of terms and the number of clusters, respectively. We use the symbol $D$ to denote the text set, $D_1, D_2, \ldots, D_k$ to denote each one of the k categories, and $N_1, N_2, \ldots, N_k$ to denote the sizes of the corresponding clusters.

Since the vector-space model is the best-known representation of text documents, ATCPSO employs it to represent each document. In this model, each document is considered as a vector in the term space. We used the tf-idf term weighting model [11], in which each document can be represented as

$$\left( tf_1 \log\frac{N}{df_1},\ tf_2 \log\frac{N}{df_2},\ \ldots,\ tf_M \log\frac{N}{df_M} \right)$$

where $tf_i$ is the frequency of the $i$th term in the document and $df_i$ is the number of documents that contain the $i$th term.

Since text matrices are very sparse and high-dimensional, the Euclidean distance is not appropriate for measuring the similarity between two text vectors. ATCPSO therefore uses the cosine coefficient [16] to determine which cluster centroid is closest to a given document:

$$\cos(x, z) = \frac{x \cdot z}{\|x\| \, \|z\|}$$

where $x$ represents one document and $z$ represents a corresponding cluster centroid. This measure becomes one if the documents are identical, and zero if there is nothing in common between them (i.e., the vectors are orthogonal to each other). To reduce the computational cost, all the document vectors were normalized to unit length in the experiments, so for any two document vectors $x$ and $z$ the cosine function simplifies to $\cos(x, z) = x \cdot z$, where $\cdot$ denotes the dot product of the two vectors. The objective function in ATCPSO is therefore to maximize the overall similarity of documents in a cluster, and it is defined as follows:

$$\max J(Z) = \frac{1}{k} \sum_{i=1}^{k} R_i, \qquad R_i = \min_{j,\, j \neq i} \frac{S_i + S_j}{B_{ij}} \qquad (2)$$

When clustering high-dimensional data such as text data, particles in PSO tend to converge to local optima due to the curse of dimensionality. To obtain good population diversity, particles in ATCPSO are required to have different exploration and exploitation abilities. We therefore added another parameter to PSO, the crossover probability $P_c$, which is empirically set for each particle $i$ to

$$P_c(i) = 0.5 \cdot \frac{e^{f(i)} - e^{f(1)}}{e^{f(s)} - e^{f(1)}}, \qquad f(i) = 5 \cdot \frac{i}{s - 1}$$

The values of the learning probability $P_c$ in ATCPSO range from 0.05 to 0.5, as for CLPSO [16].

3. Experiments and evaluations

To evaluate the performance of the proposed ATCPSO method, a comprehensive performance study was conducted using a number of different text datasets.
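The document representation used throughout these experiments, namely the tf-idf weighting and the unit-length simplification of the cosine measure from Section 2, can be illustrated with a small sketch. It assumes NumPy and a dense term-frequency matrix; the function names `tfidf_unit_vectors` and `cosine` and the toy matrix are hypothetical and not taken from the paper.

```python
import numpy as np

def tfidf_unit_vectors(tf):
    """Turn an (N documents x M terms) term-frequency matrix into unit-length tf-idf rows."""
    tf = np.asarray(tf, dtype=float)
    n_docs = tf.shape[0]
    # df_i: number of documents that contain term i
    df = np.count_nonzero(tf, axis=0)
    # Guard against df_i = 0; such a term has tf = 0 everywhere, so its weight stays 0
    idf = np.log(n_docs / np.maximum(df, 1))
    weighted = tf * idf
    norms = np.linalg.norm(weighted, axis=1, keepdims=True)
    # Leave all-zero documents as zero vectors instead of dividing by zero
    return weighted / np.where(norms == 0, 1.0, norms)

def cosine(x, z):
    """Cosine coefficient between two non-zero vectors."""
    return float(x @ z / (np.linalg.norm(x) * np.linalg.norm(z)))

# Toy corpus: documents 0 and 2 share their only discriminative term
tf = np.array([[2, 1, 0],
               [0, 1, 3],
               [4, 1, 0]])
V = tfidf_unit_vectors(tf)

# For unit-length vectors the cosine reduces to a plain dot product
print(cosine(V[0], V[2]), V[0] @ V[2])
print(cosine(V[0], V[1]), V[0] @ V[1])
```

The middle term occurs in every document, so its idf weight is log(3/3) = 0 and it drops out of the representation, which is the intended effect of the weighting.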
Four typical clustering algorithms, namely Bisection k-means [2], Agglomeration [7], Graph-based [3] and k-means [1], and one competitive subspace clustering algorithm, namely PSOVW [9], were chosen for comparison. The goal of the experiments was to assess the accuracy and the efficiency of ATCPSO in clustering text documents. We compared the clustering quality of the six algorithms, examined the variance of the cluster solutions and measured the running time of the six algorithms. These experiments provide information on how well and how fast each algorithm is able to retrieve known categories of high-dimensional text datasets and how sensitive each algorithm is to the initial centroids.

3.1 Text Datasets

The first group of datasets was derived from the popular 20 newsgroups corpus, which is a collection of approximately 20,000 newsgroup documents, distributed evenly across 20 different newsgroups. In the experiments, the version from the Machine Learning Group at UCD [11] was used, in which the datasets were constructed from the original 20 newsgroups and have been preprocessed, with stop-word removal and

stemming already applied. All the documents are represented in the vector space model and, in the following experiments, terms occurring in fewer than four documents have been eliminated. Ten subgroups of datasets in the group os* of the UCD version were used for the experiments, as illustrated in Table 1. All the datasets contain differently structured text data built from the 20 newsgroups dataset. In particular, documents from two different classes may share common topics, so that the two classes overlap considerably. For example, the classes autos and motor overlap in the entire feature space. Furthermore, the datasets are comprised of unbalanced classes, where at least one class contains far fewer documents than the other classes in the dataset. The aim of these 20 newsgroups datasets is to challenge the clustering algorithms in terms of class imbalance and overlap.

Table 1. Details of the 20 newsgroups datasets. Each subgroup of these datasets contains 4 datasets with 500, 1000, 2000 and 3000 dimensions, respectively, so a total of 40 datasets were used in the experiments. Each document was scaled to unit length. k represents the number of clusters in each subgroup of datasets, each of which contains one category of unbalanced documents. The classes autos and motor are very closely related to each other.

No.  Datasets  k  Category (its number of documents)
1    os_4_     4  autos(200), med(600), baseball(600), motor(600)
2    os_4_     4  graphics(200), med(600), mac(600), space(600)
3    os_5_     5  autos(250), med(562), baseball(562), politics(562), space(562)
4    os_5_     5  motor(562), electronics(562), atheism(250), politics(562), space(562)
5    os_6_     6  crypt(300), electronics(540), med(540), politics(540), religion(540), space(540)
6    os_6_     6  motor(540), electronics(540), baseball(540), politics(540), autos(300), space(540)
7    os_7_     7  motor(525), electronics(525), baseball(525), politics(525), autos(350), space(525), med(525)
8    os_7_     7  autos(350), electronics(525), baseball(525), politics(525), religion(525), guns(525), med(525)
9    os_8_     8  autos(400), graphics(514), baseball(514), politics(514), mac(514), motor(514), med(514), space(514)
10   os_8_     8  autos(514), electronics(514), baseball(514), politics(514), atheism(400), motor(514), med(514), space(514)

The second group of datasets used in the experiments was obtained from the TREC collection. We obtained five TREC subsets, namely tr12, tr23, tr31, tr41 and tr45, from the well-known CLUTO toolkit [13], where the categories correspond to the documents relevant to particular queries. The datasets have been pre-processed and are represented in the vector space model, where each element of a vector indicates the frequency of the corresponding keyword in the document. The details of these datasets are given in Table 2.

Table 2. Details of the TREC subsets. The table is directly cited from [12].

Datasets  N
tr12
tr23
tr31
tr41
tr45

3.2 The clustering algorithms with parameters

To investigate the performance of ATCPSO for text clustering, five clustering algorithms, namely k-means, Bisection k-means, Agglomeration, Graph-based and PSOVW, were chosen for comparison in the experiments. A brief introduction to these algorithms is given below:

k-means: The k-means algorithm starts with a random partitioning and loops over the following steps: 1) compute the current cluster centroids, i.e. the average vector of each cluster in the data set; 2) assign each data object to the cluster whose centroid is closest to it. The algorithm terminates when there is no further change in the cluster centroids. k-means maximizes compactness by minimizing the within-cluster scatter. In our implementation an empty cluster sometimes occurs, in which case we reassign to it one data object randomly selected from the data set.

Bisection k-means: To generate a clustering solution with the desired k clusters, Bisection k-means performs a series of k-means runs. It starts with a universal cluster containing all data objects. Then it loops: it selects the cluster with the largest variance and calls k-means, which optimizes a particular clustering criterion function, to split this cluster into exactly two subclusters. The loop is repeated until k non-overlapping clusters have been generated. Note that this approach ensures that the criterion function is locally optimized within each bisection, but in general it is not globally optimized.

Graph-based: In the Graph-based clustering algorithm, the desired k-way clustering solution is obtained by first modeling the objects using a nearest-neighbor graph. Each data object becomes a vertex, and each data object is connected to the other objects most similar to it. Then, the graph is split into k clusters using a min-cut graph partitioning algorithm.

PSOVW: PSOVW proposed by Lu et al.
[11] is a soft projected clustering algorithm via variable weighting for high-dimensional data clustering. A suitable k-means objective weighting function is minimized by an effective particle swarm optimization strategy in order to search for global optima of the variable weighting problem in clustering. Experimental results on both synthetic and real data show that PSOVW greatly improves cluster quality for high-dimensional data clustering.

We implemented k-means, ATCPSO and the subspace clustering algorithm PSOVW in MATLAB, while the other three algorithms, Bisection k-means, Agglomeration and Graph-based, can be downloaded from the website of the CLUTO clustering toolkit [13]. Following the experimental routine in [9], the similarity between two documents is measured by the extended Jaccard coefficient in PSOVW and by the cosine function in the other algorithms, namely Bisection k-means, Agglomeration, Graph-based, k-means and the proposed ATCPSO algorithm. For the experiments, we ran Bisection k-means, Agglomeration and Graph-based in CLUTO using the default parameters.

3.3 Experimental Metrics

The quality of a clustering solution was determined by several metrics that examine the class labels of the documents assigned to each cluster. Assume that a data set with $K$ categories was grouped into $K$ clusters and $N$ is the total number of documents in the data set. Given a particular category $L_k$ of size $N_k$ and a particular cluster $S_i$ of size $N_i$, let $N_{ki}$ be the number of documents belonging to category $L_k$ that are assigned to cluster $S_i$.

Table 3. Experimental Metrics

Accuracy:   $\mathrm{Accuracy} = \frac{\sum_{k=1}^{K} N_{kk}}{N} \times 100\%$
Recall:     $\mathrm{Recall} = \frac{1}{K} \sum_{k=1}^{K} \mathrm{Recall}_k$, with $\mathrm{Recall}_k = \frac{N_{kk}}{N_k} \times 100\%$
Precision:  $\mathrm{Precision} = \frac{1}{K} \sum_{i=1}^{K} \mathrm{Precision}_i$, with $\mathrm{Precision}_i = \frac{N_{ii}}{N_i} \times 100\%$
FScore:     $\mathrm{FScore} = \frac{1}{K} \sum_{k=1}^{K} \mathrm{FScore}_k$, with $\mathrm{FScore}_k = \frac{2 \cdot \mathrm{Recall}_k \cdot \mathrm{Precision}_k}{\mathrm{Recall}_k + \mathrm{Precision}_k}$
Entropy:    $\mathrm{Entropy} = \sum_{i=1}^{K} \frac{N_i}{N} \left( -\frac{1}{\log K} \sum_{k=1}^{K} \frac{N_{ki}}{N_i} \log \frac{N_{ki}}{N_i} \right)$

Table 3 presents the evaluation metrics used in this paper. Recall measures the fraction of the documents relevant to the query that are successfully retrieved in each category. Precision is the fraction of retrieved documents that are relevant to the search. The two measures are used together in FScore, the harmonic mean of precision and recall, which measures the degree to which each cluster contains documents from the original category [14]. Entropy [15] examines how the documents of all categories are distributed within each cluster. In general, Accuracy, Recall, Precision and FScore rank the clustering quality from zero (worst) to one (best), while Entropy ranges from one (worst) to zero (best). The Accuracy value is one when every document is assigned to its corresponding dominant cluster. The Recall value of a category is one when that category has a corresponding cluster containing the same set of documents. The Precision value of a cluster is one when that cluster has a corresponding category that contains the same set of documents. The FScore value is one when every category has a corresponding cluster containing the same set of documents. The Entropy value is zero when every cluster contains only documents from a single category.
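The metrics of Table 3 can all be derived from one contingency matrix of category-by-cluster counts. The following is a minimal sketch under stated assumptions: it uses NumPy, assumes that cluster k has already been matched to category k (as the $N_{kk}$ terms imply), averages the per-category FScore values, needs at least two categories, and reports fractions in [0, 1] rather than percentages. The function name `clustering_metrics` is hypothetical.

```python
import numpy as np

def clustering_metrics(C):
    """C[k, i] = N_ki, the number of documents of category k assigned to cluster i.

    Assumptions: C is square with K >= 2, and cluster k is matched to category k.
    """
    C = np.asarray(C, dtype=float)
    K = C.shape[0]
    N = C.sum()
    N_k = C.sum(axis=1)   # category sizes
    N_i = C.sum(axis=0)   # cluster sizes
    diag = np.diag(C)     # N_kk
    recall_k = np.divide(diag, N_k, out=np.zeros(K), where=N_k > 0)
    precision_k = np.divide(diag, N_i, out=np.zeros(K), where=N_i > 0)
    denom = recall_k + precision_k
    fscore_k = np.divide(2 * recall_k * precision_k, denom,
                         out=np.zeros(K), where=denom > 0)
    # p[k, i] = N_ki / N_i, with 0 * log(0) taken as 0
    p = np.divide(C, N_i, out=np.zeros_like(C), where=N_i > 0)
    plogp = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0)
    cluster_entropy = -plogp.sum(axis=0) / np.log(K)
    return {
        "accuracy": float(diag.sum() / N),
        "recall": float(recall_k.mean()),
        "precision": float(precision_k.mean()),
        "fscore": float(fscore_k.mean()),
        "entropy": float((N_i / N) @ cluster_entropy),
    }

# A perfect clustering and a completely mixed one, for two categories of 50 documents
print(clustering_metrics([[50, 0], [0, 50]]))
print(clustering_metrics([[25, 25], [25, 25]]))
```

The two toy matrices hit the extremes described above: the diagonal matrix gives the best value of every metric, while the uniform matrix gives the worst possible Entropy.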
3.4 Experimental Results & Analysis

1) Evaluations on the os* datasets

Our first set of experiments focused on evaluating the average quality of the clustering solutions produced by the various algorithms and the influence of dataset characteristics, such as the number of clusters and the relatedness of clusters, on the algorithms. On each of the ten structured text datasets built from 20 newsgroups, we ran the above algorithms 10 times. The average Recall, Precision, FScore and Entropy values of the six algorithms on the ten os* datasets are presented in graphs (a)-(d) of Figure 1, respectively.

(a) Recall (b) Precision


(c) FScore (d) Entropy

Figure 1. The Recall, Precision, FScore and Entropy values of each algorithm on the ten os* datasets.

A number of observations can be made by analyzing the results in Figure 1. First, ATCPSO, Graph-based and PSOVW lead to clustering solutions that are consistently better than the solutions obtained by the other algorithms over all the experiments on the ten os* text datasets 1-10. This holds whether the clustering quality is evaluated using the Recall, Precision and FScore measures or the Entropy measure. They produced solutions that are about 5%-30% better in terms of Recall, Precision and FScore and around 5%-40% better in terms of Entropy than the other algorithms.

Second, ATCPSO leads to the best solutions irrespective of the number of clusters, the relatedness of clusters and the measure used to evaluate the clustering quality. Over the first set of experiments, the solutions achieved by ATCPSO are always the best. On average, ATCPSO outperforms the next best algorithm, PSOVW, by around 2%-8% in terms of Recall, Precision and FScore and 9%-30% in terms of Entropy. In general, it achieves higher Recall, Precision and FScore values and lower Entropy values than the other algorithms. The results in the Recall, Precision and FScore graphs are in agreement with those in the Entropy graph: both show that ATCPSO greatly surpasses the other algorithms for text clustering.

In general, the performance of ATCPSO and PSOVW remains stable on text datasets with multiple highly related categories, such as datasets 5 and 8-10, while Graph-based and Bisection k-means, on the contrary, deteriorate on them. In such datasets, each category has a subset of words and two categories may share many of the same words, as with the categories autos and motor. Thus, we would expect that a document can often fail to find the correct centroid regardless of the similarity measure used.
The probability that a document finds a wrong centroid varies from one dataset to another, but it basically increases with the relatedness of the categories. The success of ATCPSO and PSOVW is due to the PSO global search strategies used in these two algorithms.

Third, except on text datasets 3 and 4, Graph-based performs third best on all measures. Bisection k-means also performs quite well when the categories in a dataset are highly unrelated, as in text datasets 1-4. The solutions yielded by k-means fluctuate across the experiments because of its local search strategy. Agglomeration performs the worst of the six algorithms, yielding small Recall, Precision and FScore values and large Entropy values on all datasets. This means that the documents of one genuine category are spread in a more balanced way across the clusters yielded by Agglomeration and, as a result, the number of documents that Agglomeration correctly identifies in the genuine category is very low.

2) Evaluations on the tr* datasets

To further investigate the behavior of ATCPSO, we performed another sequence of experiments. Our second set of experiments focused on evaluating the FScore and Entropy performance of each algorithm on the tr* datasets. Each algorithm was run independently 10 times on each of the five tr* datasets. The average FScore and Entropy values of each algorithm on the five tr* datasets are presented in graphs (a) and (b) of Figure 2, while a similar comparison based on the average accuracy and the corresponding variance of the clustering results achieved by each algorithm is shown in graphs (c) and (d) of Figure 2.


(a) FScore (b) Entropy (c) Accuracy (d) Corresponding variance of clustering results

Figure 2. The average FScore, Entropy, Accuracy values and corresponding variance of each algorithm on the five tr* datasets.

Looking at the four graphs in Figure 2 and comparing the performance of each algorithm over the five tr* datasets, we can see that ATCPSO outperformed all other algorithms, although PSOVW and Graph-based are very close. In terms of FScore and Entropy, Graph-based, Agglomeration and Bisection k-means are on average 3%-10% and 8%-15% worse than ATCPSO, respectively. PSOVW and Graph-based performed quite well on the tr41 dataset, where the keywords of different categories overlap considerably (for example, the categories cable and television, or multimedia and media), because PSOVW, through its subspace clustering mechanism, and Graph-based, through its graph mechanism, take the relationships between clusters into account. However, both of them require the number of clusters k to be preset, which is difficult because the structure of these datasets is unknown. The k-means and Bisection k-means algorithms occasionally work well, achieving over 85% cluster accuracy on the dataset tr41 where categories such as cable, politics and sports are highly unrelated, but their cluster accuracy is often low on the other datasets.

Since ATCPSO, PSOVW, k-means and Bisection k-means were randomly initialized in our experiments, we also examined the variance of the 10 clustering solutions produced by each of them on the five tr* datasets, in order to see how sensitive their clustering results were to the initial cluster centroids. The variance of the cluster accuracy is shown in graph (d) of Figure 2. We ignore the Graph-based and Agglomeration clustering algorithms, because they are deterministic and therefore produce the same clustering solution on each trial.
From graph (d) of Figure 2, we can see that ATCPSO and PSOVW yield much less variance in their clustering results than k-means and Bisection k-means over the five tr* datasets. That is because the particles in ATCPSO and PSOVW do not work independently but cooperate with each other in order to move to better regions. k-means and Bisection k-means are sensitive to the initial cluster centroids, although they occasionally work well, achieving high clustering accuracy on some trials.
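The kind of initialization-sensitivity check described above can be sketched for the k-means baseline as follows. This is an illustrative sketch only, not the MATLAB code used in the experiments: it assumes NumPy, cosine similarity on unit-length vectors, synthetic data in place of the tr* datasets, and a dominant-class accuracy; the names `spherical_kmeans` and `dominant_class_accuracy` are hypothetical. For empty clusters it moves a random object taken from a cluster with more than one member, a slight restriction of the random reassignment described in Section 3.2.

```python
import numpy as np

def spherical_kmeans(X, k, seed, max_iter=100):
    """k-means with cosine similarity on unit-length rows of X, started from a random partition."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=X.shape[0])
    for _ in range(max_iter):
        for c in range(k):
            if not np.any(labels == c):
                # Empty cluster: move in one random object from a cluster that can spare it
                sizes = np.bincount(labels, minlength=k)
                donors = np.flatnonzero(sizes[labels] > 1)
                labels[rng.choice(donors)] = c
        # Step 1: centroids are the average vectors of the clusters, rescaled to unit length
        centroids = np.array([X[labels == c].mean(axis=0) for c in range(k)])
        norms = np.linalg.norm(centroids, axis=1, keepdims=True)
        centroids = centroids / np.maximum(norms, 1e-12)
        # Step 2: assign each object to the most similar centroid
        new_labels = np.argmax(X @ centroids.T, axis=1)
        if np.array_equal(new_labels, labels):
            break   # assignments, and hence centroids, no longer change
        labels = new_labels
    return labels

def dominant_class_accuracy(labels, truth, k):
    """Fraction of objects that belong to the majority class of their cluster."""
    hits = 0
    for c in range(k):
        members = truth[labels == c]
        if members.size:
            hits += np.bincount(members).max()
    return hits / truth.size

# Synthetic stand-in for a text dataset: 3 classes, each with its own block of active terms
data_rng = np.random.default_rng(0)
truth = np.repeat(np.arange(3), 40)
X = data_rng.random((120, 30)) * 0.2
for c in range(3):
    X[truth == c, 10 * c:10 * (c + 1)] += 1.0
X /= np.linalg.norm(X, axis=1, keepdims=True)

# Ten independent runs that differ only in the random initial partition
accs = np.array([
    dominant_class_accuracy(spherical_kmeans(X, 3, seed), truth, 3)
    for seed in range(10)
])
print("mean accuracy:", accs.mean(), "variance:", accs.var())
```

A variance near zero indicates that the result barely depends on the initial partition; running the same loop for a second algorithm gives the kind of comparison plotted in graph (d).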

4. Conclusions & Future Work

In response to the needs arising from the emergence of high-dimensional text datasets and the low quality of existing clustering algorithms, this paper extends a particle swarm optimization algorithm for clustering high-dimensional data while automatically determining k to handle the problem of text clustering; the resulting algorithm is called ATCPSO. The extension involves some modifications. Due to the sparsity and very high dimensionality of text datasets, the similarity between two documents is measured by the cosine similarity instead of the Euclidean distance, and an additional parameter $P_c$ is added in ATCPSO in order to maintain the population diversity of the particles in PSO. Accordingly, the objective function is adapted to text clustering, so as to maximize the overall similarity of documents within a cluster and simultaneously minimize the overall similarity of documents between two clusters, and the search strategy is changed to escape from the premature convergence of PSO. Experiments on ten structured text datasets built from 20 newsgroups as well as text datasets selected from CLUTO show that ATCPSO is able to greatly improve the quality of text clustering compared to four typical clustering algorithms and one competitive subspace clustering method. Although it runs slower than typical clustering algorithms, its running time is acceptable; moreover, clustering quality is of prime importance in text clustering and the time taken to obtain it is at most secondary. ATCPSO also shows that the effectiveness of autopso is not confined to the Euclidean distance, because the objective function used in autopso is independent of the similarity measure. Other similarity measures can thus be used for different applications, such as the cosine similarity for text data. More importantly, due to the use of the DB objective function, ATCPSO does not need the number of clusters k to be predefined.
One issue remains to be addressed by ATCPSO. ATCPSO cannot recover the relevant terms for different clusters, because there is no subspace-based mechanism within the objective function, which consequently cannot fully capture the effect of the variances along each dimension. We will continue to work on recovering the relevant dimensions for different clusters in the future.

References

[1] A.K. Jain, R.C. Dubes, Algorithms for Clustering Data, Prentice Hall, USA,
[2] E.H. Han, G. Karypis, V. Kumar, B. Mobasher, Hypergraph based clustering in high-dimensional data sets: A summary of results, Bulletin of the Technical Committee on Data Engineering, vol.21, no.1, pp.15-22,
[3] M. Steinbach, G. Karypis, V. Kumar, A comparison of document clustering techniques, In Proceeding of the Text mining workshop, pp ,
[4] P.H. Sneath, R.R. Sokal, Numerical Taxonomy, Freeman, London UK,
[5] G. Karypis, E.H. Han, V. Kumar, Chameleon: A hierarchical clustering algorithm using dynamic modeling, IEEE Computer, vol.32, no.8, pp.68-75,
[6] Ying Zhao, G. Karypis, Hierarchical Clustering Algorithms for Document Datasets, Data Mining and Knowledge Discovery, vol.10, no.2, pp ,
[7] Yanping Lu, Suping Xu, Xing Gao, Particle Swarm Optimizer for Automatically Clustering High-dimensional Data, Swarm Intelligence Symposium, pp.37-44,
[8] Yanping Lu, Shengrui Wang, Shaozi Li, Particle Swarm Optimizer for Optimal Variable Weighting in Clustering High-dimensional Data, Machine Learning, pp.43-70,
[9] D.L. Davies, D.W. Bouldin, A cluster separation measure, IEEE Trans Pattern Analysis and Machine Intelligence, vol.1, no.4, pp ,
[10] Lifei Chen, Gongde Guo, Kaijun Wang, Class-dependent Projection based Method for Text Categorization, Pattern Recognition Letters, vol.32, no.10, pp ,
[11] Zhao Y., Karypis G., Criterion functions for document clustering: experiments and analysis, Technical Report of Department of Computer Science and Engineering, USA,
[12] Zhou X., Hu X., Zhang X., Lin X., Song I.Y., Context-Sensitive Semantic Smoothing for the Language Modeling Approach to Genomic IR, In Proceeding of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pp ,
[13] J.J. Liang, A.K. Qin, P.N. Suganthan, S. Baskar, Comprehensive learning particle swarm optimizer for global optimization of multimodal functions, IEEE Transactions on Evolutionary Computation, vol.10, no.3, pp ,
[14] Shang Gao, Cungen Cao, Convergence Analysis of Particle Swarm Optimization Algorithm, AISS: Advances in Information Sciences and Service Sciences, vol.4, no.14, pp.25-32,
[15] Donghui Chen, Zhijing Liu, Zonghu Wang, A novel fuzzy clustering algorithm based on Kernel method and Particle Swarm Optimization, JCIT: Journal of Convergence Information Technology, vol.7, no.3, pp ,
[16] Lv Li, Vector Particle Swarm Optimization, IJACT: International Journal of Advancements in Computing Technology, vol.4, no.17, pp ,
[17] Huizhen Yang, Yaoqiu Li, Particle Swarm Optimization for Control of Adaptive Optics System, AISS: Advances in Information Sciences and Service Sciences, vol.4, no.22, pp , 2012

*Corresponding author, Yanping Lu, Tel , Fax , Yanpinglv@xmu.edu.cn; Xing Gao, gaoxing@xmu.edu.cn. This work is in part supported by the Natural Science Foundation of Fujian Province under Grant 2010J01345, the Fundamental Research Funds for the Central Universities under Grant , the Research Fund for the Doctoral Program of Higher Education of China under Grant and Shenzhen Key Laboratory for High Performance Data Mining with Shenzhen New Industry Development Fund under Grant CXB A.
