Automatic Text Clustering via Particle Swarm Optimization


Xing Gao 2, Yanping Lu 1 *
1 Department of Cognitive Science, Xiamen University, China, Yanpinglv@xmu.edu.cn
2 Software School of Xiamen University, China, Gaoxing@xmu.edu.cn

Abstract

An automatic clustering algorithm based on particle swarm optimization, termed ATCPSO, is proposed for text clustering in this paper. autopso has been used to evolve the correct number of clusters while simultaneously identifying the clusters, and it has been demonstrated to improve the performance of high-dimensional data clustering. To extend autopso to text clustering, a few modifications to the algorithm are necessary, namely to the similarity measure, the parameter selection and the criterion function. Our experimental results on ten structured text datasets built from 20 newsgroups and on four text datasets selected from CLUTO show that the proposed algorithm greatly improves the quality of text clustering compared to four typical clustering algorithms and one competitive subspace clustering method.

Keywords: Text Clustering, Automatic Determination of K, Particle Swarm Optimization

1. Introduction

With the growing popularity of the Internet and the continuing development of the Web, there has been an explosion in the volume and variety of textual data. This has led to the need for new techniques that help users effectively navigate, organize and manage the available web documents, with the goal of finding those that best match their needs. Text clustering, which consists of automatically grouping a set of documents into a predefined number of categories, is one of the most popular techniques for achieving this objective. Traditional clustering algorithms have been extensively investigated for text clustering over the years.
Basically, these algorithms can be categorized into two classes: partitional methods and agglomerative hierarchical approaches. Partitional algorithms, such as k-means [1], bisection k-means [2] and graph-based methods [3], find the clusters by partitioning the dataset into a predefined number of clusters. Agglomerative algorithms, such as UPGMA [4], single-link [5] and Chameleon [6], start by treating each data object as its own cluster and then repeatedly merge pairs of clusters until a termination criterion is met. For high-dimensional document clustering, partitional algorithms have been demonstrated to lead to better solutions than agglomerative hierarchical algorithms [7]. However, partitional algorithms require the number of clusters k to be predefined, and they are often sensitive to the initial cluster centroids. As a result, they often fail to produce satisfactory clustering results on document datasets.

Given the good performance of stochastic search procedures, one way to determine the number of clusters automatically is to use evolutionary techniques. In this regard, genetic algorithms (GA) have been the most frequently proposed approach for automatically clustering data sets. In the GA-based algorithms, clustering is formulated as a continuous optimization problem. Since PSO has been demonstrated to perform better than GA on such optimization problems, we present in this paper a particle swarm optimization method that clusters documents and simultaneously identifies the correct number of clusters. The method, called Automatic Text Clustering via Particle Swarm Optimization (ATCPSO), extends the particle swarm optimizer for automatically clustering high-dimensional data (autopso) [8] to handle the problem of clustering documents.
From the experimental results on ten groups of text datasets with different structures built from 20 newsgroups and four datasets selected from CLUTO, we find that ATCPSO overall outperforms four typical clustering algorithms, namely k-means, Bisection k-means, Agglomeration and Graph-based, and one competitive subspace clustering algorithm, PSOVW [9].

The rest of the paper is organized as follows. Section 2 summarizes the autopso algorithm and describes the modifications to autopso needed for text clustering. The experimental results of ATCPSO are given in Section 3, which contains the description of the text datasets from different sources, the text clustering evaluation methods, and the experimental results. Section 4 draws some conclusions and points out directions for future work.

International Journal of Digital Content Technology and its Applications (JDCTA), Volume 6, Number 23, December 2012, doi: /jdcta.vol6.issue

2. Automatic Text Clustering via Particle Swarm Optimization (ATCPSO)

A. autopso

An effective clustering algorithm consists of two key components: the criterion function and the search algorithm. Lu et al. [8] therefore combined a criterion function that makes it possible to directly compare partitions with similar or different numbers of clusters, in the same generation or between adjacent generations, with a particle swarm optimization global search strategy, called autopso, that minimizes this criterion function and thereby automatically determines the number of clusters k. The Davies-Bouldin (DB) index [10] used in autopso aims at both minimizing within-cluster scatter and maximizing between-cluster separation. The objective function employed in autopso is:

$$\min J(Z) = \frac{1}{k} \sum_{i=1}^{k} R_i \qquad (1)$$

where

$$R_i = \max_{j,\, j \neq i} \frac{S_i + S_j}{B_{ij}}$$

is the ratio of the sum of within-cluster scatters, with $S_i = \frac{1}{|C_i|} \sum_{x \in C_i} d(x, z_i)$, to the between-cluster separation $B_{ij} = d(z_i, z_j)$. Here, $x$ is a data object and $z_i$ is the representative of cluster $C_i$, taken to be the cluster centroid $z_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$, where $|C_i|$ is the number of data objects belonging to cluster $C_i$. $d$ is the distance function that measures the dissimilarity between two data objects. DB takes values in the interval $[0, \infty]$ and needs to be minimized.

Given the objective function, clustering is formulated as a continuous function optimization problem with bound constraints. Instead of employing local search strategies, autopso uses a particle swarm optimization algorithm, which is simple to implement, to minimize the given objective function. Furthermore, in order to encode a variable number of clusters, autopso represents a particle by a real-number matrix and a binary vector.
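The DB criterion of Eq. (1) can be illustrated with a short sketch. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function names `euclidean`, `centroid` and `db_index` are hypothetical, a partition is assumed to be given as a list of at least two non-empty clusters with distinct centroids, and the Euclidean distance stands in for the generic dissimilarity d.

```python
import math

def euclidean(a, b):
    # Stand-in for the generic dissimilarity d(., .) of Eq. (1).
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def centroid(points):
    # z_i: component-wise mean of the objects in one cluster.
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def db_index(clusters, dist=euclidean):
    """Davies-Bouldin index of a partition.

    clusters: list of k >= 2 non-empty lists of equal-length vectors
    (assumed to have pairwise distinct centroids).
    """
    z = [centroid(c) for c in clusters]
    # S_i: average distance of the members of cluster i to its centroid.
    s = [sum(dist(x, zi) for x in c) / len(c) for c, zi in zip(clusters, z)]
    k = len(clusters)
    # R_i: worst-case ratio of summed scatter to centroid separation.
    r = [max((s[i] + s[j]) / dist(z[i], z[j]) for j in range(k) if j != i)
         for i in range(k)]
    return sum(r) / k

# Two compact, well-separated clusters give a small DB value.
demo = [[[0.0, 0.0], [0.0, 2.0]], [[10.0, 0.0], [10.0, 2.0]]]
print(db_index(demo))
```

In the demo each cluster has scatter 1 and the centroids are 10 apart, so both ratios equal 2/10. Because the function takes the whole partition as input, partitions with different numbers of clusters can be scored on the same scale, which is the property the criterion is chosen for.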
Then, a new crossover matrix learning procedure, governed by the associated binary vector, is used to maintain the population diversity, which makes the algorithm resistant to premature convergence. autopso has been applied to high-dimensional data clustering and has been demonstrated to be less sensitive to the initial cluster centroids and to greatly improve the cluster quality.

B. Modifications for Text Clustering

Extending autopso to text clustering requires some modifications. First, due to the sparsity and high dimensionality of text data, a measure other than the Euclidean distance must be chosen to assess the similarity between two text documents. In addition, we introduce an additional parameter, the crossover probability P_c, which is set specifically for text datasets with very high dimensions in order to maintain the population diversity of PSO.

Throughout this paper, we use the symbols N, M and K to denote the number of text documents, the number of terms and the number of clusters, respectively. We use the symbol D to denote the text set, D_1, D_2, ..., D_K to denote each one of the k categories, and N_1, N_2, ..., N_K to denote the sizes of the corresponding clusters. Since the vector-space model is a well-known representation of text documents, ATCPSO uses it to represent each document. In this model, each document is considered as a vector in the term space. We used the tf-idf term weighting model [11], in which each document can be represented as

$$\left( tf_1 \log\frac{N}{df_1},\; tf_2 \log\frac{N}{df_2},\; \ldots,\; tf_M \log\frac{N}{df_M} \right)$$

where $tf_i$ is the frequency of the i-th term in the document and $df_i$ is the number of documents that contain the i-th term. Since text matrices are very sparse and high-dimensional, the Euclidean distance is not appropriate for measuring the similarity between two text vectors. Therefore, ATCPSO uses the cosine coefficient [16] to determine which cluster centroid is closest to a given document. It is defined by the following equation:

$$\cos(x, z) = \frac{x \cdot z}{\|x\| \, \|z\|}$$

where $x$ represents one document and $z$ represents a cluster centroid. This measure is one if the two vectors are identical and zero if they have nothing in common (i.e., the vectors are orthogonal to each other). To reduce the computational cost, all document vectors were normalized to unit length in the experiments. For any two such vectors $x$ and $z$, the cosine function simplifies to $\cos(x, z) = x \cdot z$, where $\cdot$ denotes the dot product of the two vectors. The objective function in ATCPSO is therefore to maximize the overall similarity of the documents in a cluster, and it is defined as follows:

$$\max J(Z) = \frac{1}{k} \sum_{i=1}^{k} R_i, \qquad R_i = \min_{j,\, j \neq i} \frac{S_i + S_j}{B_{ij}}$$

In clustering high-dimensional data such as text data, particles in PSO tend to converge to local optima due to the curse of very high dimensionality. In order to obtain good population diversity, particles in ATCPSO are required to have different exploration and exploitation abilities. We therefore added another parameter to PSO, the crossover probability P_c, which is set empirically for each particle $i$ to

$$P_c(i) = 0.5 \cdot \frac{e^{f(i)} - e^{f(1)}}{e^{f(s)} - e^{f(1)}}, \qquad f(i) = 5 \cdot \frac{i}{s-1}$$

The values of the learning probability P_c in ATCPSO range from 0.05 to 0.5, as for CLPSO [16].

3. Experiments and evaluations

To evaluate the performance of the proposed ATCPSO method, a comprehensive performance study was conducted using a number of different text datasets.
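Before turning to the comparison, the document representation defined in Section 2 can be made concrete with a small sketch. It is a minimal illustration under stated assumptions rather than the preprocessing used in the experiments: the helper names `tfidf_vectors` and `cosine` and the three toy documents are hypothetical, documents are assumed to be already tokenized, and the natural logarithm is assumed for the idf factor.

```python
import math

def tfidf_vectors(docs):
    """docs: list of token lists. Returns (vocabulary, unit-length tf-idf vectors)."""
    n = len(docs)                                   # N: number of documents
    vocab = sorted({t for d in docs for t in d})    # the M terms
    # df_i: number of documents that contain term i.
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    vectors = []
    for d in docs:
        # tf_i * log(N / df_i) for every term of the vocabulary.
        v = [d.count(t) * math.log(n / df[t]) for t in vocab]
        norm = math.sqrt(sum(w * w for w in v))
        # Scale to unit length; an all-zero vector is left unchanged.
        vectors.append([w / norm for w in v] if norm > 0 else v)
    return vocab, vectors

def cosine(x, z):
    # For unit-length vectors the cosine coefficient is just the dot product.
    return sum(a * b for a, b in zip(x, z))

docs = [["car", "engine", "car"], ["car", "wheel"], ["space", "orbit"]]
vocab, vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

The first two toy documents share the term "car" and so have a positive cosine, while the third shares no term with the first and its cosine with it is zero. Note that a term occurring in every document receives weight log(1) = 0 under this weighting.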
Four typical clustering algorithms, namely Bisection k-means [2], Agglomeration [7], Graph-based [3] and k-means [1], and one competitive subspace clustering algorithm, namely PSOVW [9], were chosen for comparison. The goal of the experiments was to assess the accuracy and the efficiency of ATCPSO in clustering text documents. We compared the clustering quality of the six algorithms, examined the variance of the cluster solutions and measured the running time of the six algorithms. These experiments provide information on how well and how fast each algorithm is able to retrieve known categories of high-dimensional text datasets and how sensitive each algorithm is to the initial centroids.

3.1 Text Datasets

The first group of datasets was derived from the popular 20 newsgroups corpus, a collection of approximately 20,000 newsgroup documents distributed evenly across 20 different newsgroups. In the experiments, the version of the Machine Learning Group at UCD [11] was used, in which the datasets were constructed from the original 20 newsgroups and have been preprocessed with stop-word removal and

stemming already applied. All the documents are represented in the vector space model and, in the following experiments, the terms occurring in fewer than four documents have been eliminated. Ten subgroups of datasets in the group os* of the UCD version were used for the experiments, as illustrated in Table 1. All the datasets contain differently structured text data built from the 20 newsgroups dataset. In particular, the documents of two different classes may share some common topics, so that the two classes overlap considerably. For example, the classes autos and motor overlap in the entire feature space. Furthermore, the datasets are comprised of unbalanced classes, where at least one class contains far fewer documents than the other classes in the dataset. The aim of these 20 newsgroups datasets is to challenge the clustering algorithms in terms of class imbalance and overlap.

Table 1. Details of the 20 newsgroups datasets. Each subgroup of these datasets contains 4 datasets with 500, 1000, 2000 and 3000 dimensions, respectively. Therefore, a total of 40 datasets were used in the experiments. Each document was scaled to unit length. K and N represent the number of clusters and the number of instances in each subgroup of datasets, which contains one category of unbalanced documents. The classes autos and motor are very closely related to each other.

No.  Dataset  Category (its number of documents)
1    os_4_    autos(200), med(600), baseball(600), motor(600)
2    os_4_    graphics(200), med(600), mac(600), space(600)
3    os_5_    autos(250), med(562), baseball(562), politics(562), space(562)
4    os_5_    motor(562), electronics(562), atheism(250), politics(562), space(562)
5    os_6_    crypt(300), electronics(540), med(540), politics(540), religion(540), space(540)
6    os_6_    motor(540), electronics(540), baseball(540), politics(540), autos(300), space(540)
7    os_7_    motor(525), electronics(525), baseball(525), politics(525), autos(350), space(525), med(525)
8    os_7_    autos(350), electronics(525), baseball(525), politics(525), religion(525), guns(525), med(525)
9    os_8_    autos(400), graphics(514), baseball(514), politics(514), mac(514), motor(514), med(514), space(514)
10   os_8_    autos(514), electronics(514), baseball(514), politics(514), atheism(400), motor(514), med(514), space(514)

The second group of datasets used in the experiments was obtained from the TREC collection. We obtained five TREC subsets, namely tr12, tr23, tr31, tr41 and tr45, from the well-known CLUTO toolkit [13], where the categories correspond to the documents relevant to particular queries. The datasets have been pre-processed and are represented in the vector space model, where each element of the vector indicates the frequency of the corresponding keyword in the document. The details of these datasets are illustrated in Table 2.

Table 2. Details of the TREC subsets (tr12, tr23, tr31, tr41 and tr45). The table is directly cited from [12].

3.2 The clustering algorithms with parameters

To investigate the performance of ATCPSO for text clustering, five clustering algorithms, namely k-means, Bisection k-means, Agglomeration, Graph-based and PSOVW, were chosen for comparison in the experiments. A brief introduction to these algorithms is given below.

k-means: The k-means algorithm starts with a random partitioning and loops over the following steps: 1) compute the current cluster centroids, i.e. the average vector of each cluster in the data set; 2) assign each data object to the cluster whose centroid is closest to it. The algorithm terminates when there is no further change in the cluster centroids. k-means maximizes compactness by minimizing the within-cluster scatter. In our implementation, an empty cluster sometimes occurs, in which case we reassign to it one data object randomly selected from the data set.

Bisection k-means: In order to generate the desired k-way clustering solution, Bisection k-means performs a series of k-means runs. It is initiated with a single cluster containing all data objects. It then loops: it selects the cluster with the largest variance and calls k-means, which optimizes a particular clustering criterion function, in order to split this cluster into exactly two subclusters. The loop is repeated until k non-overlapping clusters have been generated. Note that this approach ensures that the criterion function is locally optimized within each bisection, but in general it is not globally optimized.

Graph-based: In the Graph-based clustering algorithm, the desired k-way clustering solution is obtained by first modeling the objects using a nearest-neighbor graph. Each data object becomes a vertex and is connected to the other objects most similar to it. The graph is then split into k clusters using a min-cut graph partitioning algorithm.

PSOVW: PSOVW, proposed by Lu et al.
[11] is a soft projected clustering algorithm via variable weighting for high-dimensional data clustering. A suitable k-means-type weighting objective function is minimized by an effective particle swarm optimization strategy in order to search for global optima of the variable weighting problem in clustering. Experimental results on both synthetic and real data show that PSOVW greatly improves cluster quality for high-dimensional data clustering.

We implemented k-means, ATCPSO and the subspace clustering algorithm PSOVW in MATLAB, while the other three compared algorithms, Bisection k-means, Agglomeration and Graph-based, can be downloaded from the website of the CLUTO clustering toolkit [13]. Following the experimental routine of [9], the similarity between two documents is measured by the extended Jaccard coefficient in PSOVW and by the cosine function in the other algorithms, namely Bisection k-means, Agglomeration, Graph-based, k-means and the proposed ATCPSO algorithm. For the experiments, we ran Bisection k-means, Agglomeration and Graph-based in CLUTO using the default parameters.

3.3 Experimental Metrics

The quality of a clustering solution was determined by several metrics that examine the class labels of the documents assigned to each cluster. Assume that a data set with K categories was grouped into K clusters and that N is the total number of documents in the data set. Given a particular category L_k of size N_k and a particular cluster S_i of size N_i, let N_ki be the number of documents that belong to category L_k and are assigned to cluster S_i.

Table 3. Experimental Metrics

Accuracy:  $Accuracy = \frac{\sum_{k=1}^{K} N_{kk}}{N} \times 100\%$
Recall:    $Recall = \frac{1}{K} \sum_{k=1}^{K} Recall_k$, with $Recall_k = \frac{N_{kk}}{N_k} \times 100\%$
Precision: $Precision = \frac{1}{K} \sum_{i=1}^{K} Precision_i$, with $Precision_i = \frac{N_{ii}}{N_i} \times 100\%$
FScore:    $FScore = \frac{1}{K} \sum_{k=1}^{K} FScore_k$, with $FScore_k = \frac{2 \cdot Recall_k \cdot Precision_k}{Recall_k + Precision_k}$
Entropy:   $Entropy = \sum_{i=1}^{K} \frac{N_i}{N} \left( -\frac{1}{\log K} \sum_{k=1}^{K} \frac{N_{ki}}{N_i} \log \frac{N_{ki}}{N_i} \right)$

Table 3 presents the evaluation metrics used in this paper. Recall measures the fraction of the relevant documents that are successfully retrieved in each category. Precision is the fraction of the retrieved documents that are relevant. The two measures are combined in FScore, the harmonic mean of precision and recall, which measures the degree to which each cluster contains documents from the original category [14]. Entropy [15] examines how the documents of all categories are distributed within each cluster. In general, Accuracy, Recall, Precision and FScore rank the clustering quality from zero (worst) to one (best), while Entropy ranges from one (worst) to zero (best). The Accuracy value will be one when every document is assigned to its corresponding dominant cluster. The Recall value will be one when every category has a corresponding cluster containing the same set of documents. The Precision value will be one when every cluster has a corresponding category that contains the same set of documents. The FScore value will be one when every category has a corresponding cluster containing the same set of documents. The Entropy value will be zero when every cluster contains only documents from a single category.
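The metrics of Table 3 can be computed from a single K x K contingency table, as the following sketch suggests. It is a minimal illustration under stated assumptions, not the evaluation code used in the experiments: the function name `clustering_metrics` is hypothetical, entry n[k][i] is assumed to hold N_ki with cluster i already matched to category i, every category is assumed non-empty, K is assumed to be at least 2, and the values are returned as fractions in [0, 1] rather than percentages.

```python
import math

def clustering_metrics(n):
    """n[k][i] = N_ki: documents of category k assigned to cluster i (K x K, K >= 2)."""
    K = len(n)
    cat_size = [sum(row) for row in n]                              # N_k
    clu_size = [sum(n[k][i] for k in range(K)) for i in range(K)]   # N_i
    total = sum(cat_size)                                           # N
    accuracy = sum(n[k][k] for k in range(K)) / total
    recall = [n[k][k] / cat_size[k] for k in range(K)]
    # An empty cluster contributes zero precision.
    precision = [n[i][i] / clu_size[i] if clu_size[i] else 0.0 for i in range(K)]
    fscore = [2 * r * p / (r + p) if r + p else 0.0
              for r, p in zip(recall, precision)]
    entropy = 0.0
    for i in range(K):
        # Entropy of the category distribution inside cluster i (0 log 0 skipped).
        h = -sum((n[k][i] / clu_size[i]) * math.log(n[k][i] / clu_size[i])
                 for k in range(K) if n[k][i] > 0)
        # Normalize by log K and weight by the relative cluster size.
        entropy += (clu_size[i] / total) * h / math.log(K)
    return {
        "accuracy": accuracy,
        "recall": sum(recall) / K,
        "precision": sum(precision) / K,
        "fscore": sum(fscore) / K,
        "entropy": entropy,
    }

perfect = clustering_metrics([[5, 0], [0, 5]])
mixed = clustering_metrics([[4, 1], [2, 3]])
print(perfect)
print(mixed)
```

For the perfect table every score is one and the entropy is zero, matching the boundary cases described above. The mixed table lowers all four agreement scores and raises the entropy, since each cluster then contains documents from both categories.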
3.4 Experimental Results & Analysis

1) Evaluations on the os* datasets

Our first set of experiments focused on evaluating the average quality of the clustering solutions produced by the various algorithms and the influence on the algorithms of dataset characteristics such as the number of clusters and the relatedness of clusters. On each of the ten structured text datasets built from 20 newsgroups, we ran the above algorithms 10 times. The average Recall, Precision, FScore and Entropy values for the six algorithms on the ten os* datasets are presented in graphs (a)-(d) of Figure 1, respectively.

(a) Recall (b) Precision

(c) FScore (d) Entropy
Figure 1. The Recall, Precision, FScore and Entropy values of each algorithm on the ten os* datasets.

A number of observations can be made by analyzing the results in Figure 1. First, ATCPSO, Graph-based and PSOVW lead to clustering solutions that are consistently better than the solutions obtained by the other algorithms over all experiments on the ten os* text datasets. This is true when the clustering quality is evaluated using the Recall, Precision and FScore measures as well as the Entropy measure. They produced solutions that are about 5%-30% better in terms of Recall, Precision and FScore and around 5%-40% better in terms of Entropy than the other algorithms.

Second, ATCPSO leads to the best solutions irrespective of the number of clusters, the relatedness of clusters and the measure used to evaluate the clustering quality. Over the first set of experiments, the solutions achieved by ATCPSO are always the best. On average, ATCPSO outperforms the next best algorithm, PSOVW, by around 2%-8% in terms of Recall, Precision and FScore and by 9%-30% in terms of Entropy. In general, it achieves higher Recall, Precision and FScore values and lower Entropy values than the other algorithms. The results in the Recall, Precision and FScore graphs agree with those in the Entropy graph: all of them show that ATCPSO greatly surpasses the other algorithms for text clustering. In general, the performance of ATCPSO and PSOVW remains stable on text datasets that contain multiple highly related categories, such as datasets 5 and 8-10, whereas the performance of Graph-based and Bisection k-means deteriorates on them. In such datasets, each category has its own subset of words, but two categories, such as autos and motor, can share many of the same words. We would therefore expect that a document can often fail to find the correct centroid regardless of the similarity measure used.
The probability that a document finds a wrong centroid varies from one dataset to another, but it generally increases with the relatedness of the categories. The success of ATCPSO and PSOVW is due to the PSO global search strategies used in these two algorithms.

Third, except on text datasets 3 and 4, Graph-based performs third best on all measures. Bisection k-means also performs quite well when the categories in a dataset are highly unrelated, as in text datasets 1-4. The solutions yielded by k-means fluctuate over all experiments because of its local search strategy. Agglomeration performs the worst of the six algorithms and leads to small Recall, Precision and FScore values and large Entropy values on all datasets. This means that the documents of each genuine category are spread more evenly across the clusters yielded by Agglomeration and, as a result, the number of documents that Agglomeration correctly identifies in each genuine category is very low.

2) Evaluations on the tr* datasets

To further investigate the behavior of ATCPSO, we performed another sequence of experiments. Our second set of experiments focused on evaluating the FScore and Entropy performance of each algorithm on the tr* datasets. Each algorithm was run independently 10 times on each of the five tr* datasets. The average FScore and Entropy values of each algorithm on the five tr* datasets are presented in graphs (a) and (b) of Figure 2, while a similar comparison based on the average accuracy and the corresponding variance of the clustering results achieved by each algorithm is shown in graphs (c) and (d) of Figure 2.

(a) FScore (b) Entropy (c) Accuracy (d) Corresponding variance of clustering results
Figure 2. The average FScore, Entropy, Accuracy values and corresponding variance of each algorithm on the five tr* datasets.

Looking at the four graphs in Figure 2 and comparing the performance of each algorithm over the five tr* datasets, we can see that ATCPSO outperformed all other algorithms, although PSOVW and Graph-based are very close. In terms of FScore and Entropy, Graph-based, Agglomeration and Bisection k-means are on average 3%-10% and 8%-15% worse than ATCPSO, respectively. PSOVW and Graph-based performed quite well on the tr41 dataset, where the key words of different categories overlap a lot (for example the categories cable and television, or multimedia and media), because PSOVW, through its subspace clustering mechanism, and Graph-based, through its graph mechanism, take the relationships between clusters into account. However, both of them require the number of clusters k to be preset, which is difficult because the structure of these datasets is unknown. The k-means and Bisection k-means algorithms occasionally work well, achieving over 85% cluster accuracy on the dataset tr41 where categories such as cable, politics and sports are highly unrelated, but their cluster accuracy is often low on the other datasets.

Since ATCPSO, PSOVW, k-means and Bisection k-means were randomly initialized in our experiments, we also examined the variance of the 10 clustering solutions they produced on the five tr* datasets in order to see how sensitive their clustering results were to the initial cluster centroids. The variance of the cluster accuracy is shown in graph (d) of Figure 2. We ignore the Graph-based and Agglomeration clustering algorithms because they are deterministic and therefore produce the same clustering solution on each trial.
From graph (d) of Figure 2, we can see that ATCPSO and PSOVW yield much lower variance in their clustering results than k-means and Bisection k-means over the five tr* datasets. That is because the particles in ATCPSO and PSOVW do not work independently but cooperate with each other in order to move to better regions. k-means and Bisection k-means are sensitive to the initial cluster centroids, although they occasionally work well and achieve high clustering accuracy on some trials.

4. Conclusions & Future Work

In response to the needs arising from the emergence of high-dimensional text datasets and the low quality of existing clustering algorithms, this paper extends a particle swarm optimization algorithm for high-dimensional data clustering with simultaneous automatic determination of k to the problem of text clustering. The resulting algorithm is called ATCPSO. The extension required some modifications. Due to the sparsity and very high dimensionality of text datasets, the similarity between two documents is measured by the cosine similarity instead of the Euclidean distance, and an additional parameter P_c is added in ATCPSO in order to maintain the population diversity of the particles in PSO. The objective function is thus adapted to text clustering, so that it maximizes the overall similarity of documents within a cluster and simultaneously minimizes the overall similarity of documents between two clusters, and the search strategy is changed to escape from the premature convergence of PSO.

Experiments on ten structured text datasets built from 20 newsgroups as well as text datasets selected from CLUTO show that ATCPSO greatly improves the quality of text clustering compared to four typical clustering algorithms and one competitive subspace clustering method. Although it runs slower than typical clustering algorithms, its running time is acceptable. Moreover, in text clustering the clustering quality is of prime importance and the time taken to obtain it is at most secondary. ATCPSO also shows that the effectiveness of autopso is not confined to the Euclidean distance, because the objective function used in autopso is independent of the similarity measure. Other similarity measures can thus be used for different applications, such as the cosine similarity for text data. More importantly, due to the use of the DB objective function, ATCPSO does not need the number of clusters k to be predefined.
One issue remains to be addressed by ATCPSO. ATCPSO cannot recover the relevant terms for different clusters, because there is no subspace-based mechanism within the objective function, which consequently cannot fully capture the effect of the variances along each dimension. We will continue to work on recovering the relevant dimensions for different clusters in the future.

References

[1] A. K. Jain, R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, USA.
[2] E.H. Han, G. Karypis, V. Kumar, B. Mobasher, Hypergraph based clustering in high-dimensional data sets: A summary of results, Bulletin of the Technical Committee on Data Engineering, vol.21, no.1, pp.15-22.
[3] M. Steinbach, G. Karypis, V. Kumar, A comparison of document clustering techniques, In Proceeding of the Text mining workshop.
[4] P. H. Sneath, R. R. Sokal, Numerical Taxonomy, Freeman, London UK.
[5] G. Karypis, E.H. Han, V. Kumar, Chameleon: A hierarchical clustering algorithm using dynamic modeling, IEEE Computer, vol.32, no.8, pp.68-75.
[6] Ying Zhao, G. Karypis, Hierarchical Clustering Algorithms for Document Datasets, Data Mining and Knowledge Discovery, vol.10, no.2.
[7] Yanping Lu, Suping Xu, Xing Gao, Particle Swarm Optimizer for Automatically Clustering High-dimensional Data, Swarm Intelligence Symposium, pp.37-44.
[8] Yanping Lu, Shengrui Wang, Shaozi Li, Particle Swarm Optimizer for Optimal Variable Weighting in Clustering High-dimensional Data, Machine Learning, pp.43-70.
[9] D. L. Davies, D. W. Bouldin, A cluster separation measure, IEEE Trans Pattern Analysis and Machine Intelligence, vol.1, no.4.
[10] Lifei Chen, Gongde Guo, Kaijun Wang, Class-dependent Projection based Method for Text Categorization, Pattern Recognition Letters, vol.32, no.10.
[11] Zhao Y, Karypis G, Criterion functions for document clustering: experiments and analysis, Technical Report of Department of Computer Science and Engineering, USA.
[12] Zhou X., Hu X, Zhang X., Lin X, Song I. Y, Context-Sensitive Semantic Smoothing for the Language Modeling Approach to Genomic IR, In Proceeding of the 29th annual international ACM SIGIR conference on Research and development in information retrieval.
[13] J.J. Liang, A. K. Qin, P.N. Suganthan, S. Baskar, Comprehensive learning particle swarm optimizer for global optimization of multimodal functions, IEEE Transactions on Evolutionary Computation, vol.10, no.3.
[14] Shang Gao, Cungen Cao, Convergence Analysis of Particle Swarm Optimization Algorithm, AISS: Advances in Information Sciences and Service Sciences, vol.4, no.14, pp.25-32.
[15] Donghui Chen, Zhijing Liu, Zonghu Wang, A novel fuzzy clustering algorithm based on Kernel method and Particle Swarm Optimization, JCIT: Journal of Convergence Information Technology, vol.7, no.3.
[16] Lv Li, Vector Particle Swarm Optimization, IJACT: International Journal of Advancements in Computing Technology, vol.4, no.17.
[17] Huizhen Yang, Yaoqiu Li, "Particle Swarm Optimization for Control of Adaptive Optics System", AISS: Advances in Information Sciences and Service Sciences, vol.4, no.22, 2012.

*Corresponding author, Yanping Lu, Yanpinglv@xmu.edu.cn; Xing Gao, gaoxing@xmu.edu.cn. This work is in part supported by the Natural Science Foundation of Fujian Province under Grant 2010J01345, the Fundamental Research Funds for the Central Universities under Grant , the Research Fund for the Doctoral Program of Higher Education of China under Grant and Shenzhen Key Laboratory for High Performance Data Mining with Shenzhen New Industry Development Fund under Grant CXB A.


More information

Hierarchical Document Clustering

Hierarchical Document Clustering Hierarchical Document Clustering Benjamin C. M. Fung, Ke Wang, and Martin Ester, Simon Fraser University, Canada INTRODUCTION Document clustering is an automatic grouping of text documents into clusters

More information

Multi-Modal Data Fusion: A Description

Multi-Modal Data Fusion: A Description Multi-Modal Data Fusion: A Description Sarah Coppock and Lawrence J. Mazlack ECECS Department University of Cincinnati Cincinnati, Ohio 45221-0030 USA {coppocs,mazlack}@uc.edu Abstract. Clustering groups

More information

Clustering CS 550: Machine Learning

Clustering CS 550: Machine Learning Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf

More information

[Gidhane* et al., 5(7): July, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116

[Gidhane* et al., 5(7): July, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116 IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY AN EFFICIENT APPROACH FOR TEXT MINING USING SIDE INFORMATION Kiran V. Gaidhane*, Prof. L. H. Patil, Prof. C. U. Chouhan DOI: 10.5281/zenodo.58632

More information

Dynamic Clustering of Data with Modified K-Means Algorithm

Dynamic Clustering of Data with Modified K-Means Algorithm 2012 International Conference on Information and Computer Networks (ICICN 2012) IPCSIT vol. 27 (2012) (2012) IACSIT Press, Singapore Dynamic Clustering of Data with Modified K-Means Algorithm Ahamed Shafeeq

More information

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li Learning to Match Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li 1. Introduction The main tasks in many applications can be formalized as matching between heterogeneous objects, including search, recommendation,

More information

International Journal of Advanced Research in Computer Science and Software Engineering

International Journal of Advanced Research in Computer Science and Software Engineering Volume 3, Issue 3, March 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue:

More information

Document Clustering using Feature Selection Based on Multiviewpoint and Link Similarity Measure

Document Clustering using Feature Selection Based on Multiviewpoint and Link Similarity Measure Document Clustering using Feature Selection Based on Multiviewpoint and Link Similarity Measure Neelam Singh neelamjain.jain@gmail.com Neha Garg nehagarg.february@gmail.com Janmejay Pant geujay2010@gmail.com

More information

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler BBS654 Data Mining Pinar Duygulu Slides are adapted from Nazli Ikizler 1 Classification Classification systems: Supervised learning Make a rational prediction given evidence There are several methods for

More information

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search Informal goal Clustering Given set of objects and measure of similarity between them, group similar objects together What mean by similar? What is good grouping? Computation time / quality tradeoff 1 2

More information

Feature-Guided K-Means Algorithm for Optimal Image Vector Quantizer Design

Feature-Guided K-Means Algorithm for Optimal Image Vector Quantizer Design Journal of Information Hiding and Multimedia Signal Processing c 2017 ISSN 2073-4212 Ubiquitous International Volume 8, Number 6, November 2017 Feature-Guided K-Means Algorithm for Optimal Image Vector

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2008 CS 551, Spring 2008 c 2008, Selim Aksoy (Bilkent University)

More information

Fast Efficient Clustering Algorithm for Balanced Data

Fast Efficient Clustering Algorithm for Balanced Data Vol. 5, No. 6, 214 Fast Efficient Clustering Algorithm for Balanced Data Adel A. Sewisy Faculty of Computer and Information, Assiut University M. H. Marghny Faculty of Computer and Information, Assiut

More information

Particle swarm optimizer for variable weighting in clustering high-dimensional data

Particle swarm optimizer for variable weighting in clustering high-dimensional data Mach Learn (2011) 82: 43 70 DOI 10.1007/s10994-009-5154-2 Particle swarm optimizer for variable weighting in clustering high-dimensional data Yanping Lu Shengrui Wang Shaozi Li Changle Zhou Received: 30

More information

Cluster Analysis. Angela Montanari and Laura Anderlucci

Cluster Analysis. Angela Montanari and Laura Anderlucci Cluster Analysis Angela Montanari and Laura Anderlucci 1 Introduction Clustering a set of n objects into k groups is usually moved by the aim of identifying internally homogenous groups according to a

More information

APPLICATION OF MULTIPLE RANDOM CENTROID (MRC) BASED K-MEANS CLUSTERING ALGORITHM IN INSURANCE A REVIEW ARTICLE

APPLICATION OF MULTIPLE RANDOM CENTROID (MRC) BASED K-MEANS CLUSTERING ALGORITHM IN INSURANCE A REVIEW ARTICLE APPLICATION OF MULTIPLE RANDOM CENTROID (MRC) BASED K-MEANS CLUSTERING ALGORITHM IN INSURANCE A REVIEW ARTICLE Sundari NallamReddy, Samarandra Behera, Sanjeev Karadagi, Dr. Anantha Desik ABSTRACT: Tata

More information

Text Documents clustering using K Means Algorithm

Text Documents clustering using K Means Algorithm Text Documents clustering using K Means Algorithm Mrs Sanjivani Tushar Deokar Assistant professor sanjivanideokar@gmail.com Abstract: With the advancement of technology and reduced storage costs, individuals

More information

Automatic Cluster Number Selection using a Split and Merge K-Means Approach

Automatic Cluster Number Selection using a Split and Merge K-Means Approach Automatic Cluster Number Selection using a Split and Merge K-Means Approach Markus Muhr and Michael Granitzer 31st August 2009 The Know-Center is partner of Austria's Competence Center Program COMET. Agenda

More information

Hierarchical Clustering Algorithms for Document Datasets

Hierarchical Clustering Algorithms for Document Datasets Data Mining and Knowledge Discovery, 10, 141 168, 2005 c 2005 Springer Science + Business Media, Inc. Manufactured in The Netherlands. Hierarchical Clustering Algorithms for Document Datasets YING ZHAO

More information

Module: CLUTO Toolkit. Draft: 10/21/2010

Module: CLUTO Toolkit. Draft: 10/21/2010 Module: CLUTO Toolkit Draft: 10/21/2010 1) Module Name CLUTO Toolkit 2) Scope The module briefly introduces the basic concepts of Clustering. The primary focus of the module is to describe the usage of

More information

Study and Implementation of CHAMELEON algorithm for Gene Clustering

Study and Implementation of CHAMELEON algorithm for Gene Clustering [1] Study and Implementation of CHAMELEON algorithm for Gene Clustering 1. Motivation Saurav Sahay The vast amount of gathered genomic data from Microarray and other experiments makes it extremely difficult

More information

Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Information Retrieval and Web Search Engines Lecture 7: Document Clustering May 25, 2011 Wolf-Tilo Balke and Joachim Selke Institut für Informationssysteme Technische Universität Braunschweig Homework

More information

Performance Assessment of DMOEA-DD with CEC 2009 MOEA Competition Test Instances

Performance Assessment of DMOEA-DD with CEC 2009 MOEA Competition Test Instances Performance Assessment of DMOEA-DD with CEC 2009 MOEA Competition Test Instances Minzhong Liu, Xiufen Zou, Yu Chen, Zhijian Wu Abstract In this paper, the DMOEA-DD, which is an improvement of DMOEA[1,

More information

Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Information Retrieval and Web Search Engines Lecture 7: Document Clustering December 4th, 2014 Wolf-Tilo Balke and José Pinto Institut für Informationssysteme Technische Universität Braunschweig The Cluster

More information

An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data

An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data An Intelligent Clustering Algorithm for High Dimensional and Highly Overlapped Photo-Thermal Infrared Imaging Data Nian Zhang and Lara Thompson Department of Electrical and Computer Engineering, University

More information

Clustering Documents in Large Text Corpora

Clustering Documents in Large Text Corpora Clustering Documents in Large Text Corpora Bin He Faculty of Computer Science Dalhousie University Halifax, Canada B3H 1W5 bhe@cs.dal.ca http://www.cs.dal.ca/ bhe Yongzheng Zhang Faculty of Computer Science

More information

Collaborative Rough Clustering

Collaborative Rough Clustering Collaborative Rough Clustering Sushmita Mitra, Haider Banka, and Witold Pedrycz Machine Intelligence Unit, Indian Statistical Institute, Kolkata, India {sushmita, hbanka r}@isical.ac.in Dept. of Electrical

More information

Cluster Analysis. Ying Shen, SSE, Tongji University

Cluster Analysis. Ying Shen, SSE, Tongji University Cluster Analysis Ying Shen, SSE, Tongji University Cluster analysis Cluster analysis groups data objects based only on the attributes in the data. The main objective is that The objects within a group

More information

A Feature Selection Method to Handle Imbalanced Data in Text Classification

A Feature Selection Method to Handle Imbalanced Data in Text Classification A Feature Selection Method to Handle Imbalanced Data in Text Classification Fengxiang Chang 1*, Jun Guo 1, Weiran Xu 1, Kejun Yao 2 1 School of Information and Communication Engineering Beijing University

More information

AN IMPROVED K-MEANS CLUSTERING ALGORITHM FOR IMAGE SEGMENTATION

AN IMPROVED K-MEANS CLUSTERING ALGORITHM FOR IMAGE SEGMENTATION AN IMPROVED K-MEANS CLUSTERING ALGORITHM FOR IMAGE SEGMENTATION WILLIAM ROBSON SCHWARTZ University of Maryland, Department of Computer Science College Park, MD, USA, 20742-327, schwartz@cs.umd.edu RICARDO

More information

Enhancing Clustering Results In Hierarchical Approach By Mvs Measures

Enhancing Clustering Results In Hierarchical Approach By Mvs Measures International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 10, Issue 6 (June 2014), PP.25-30 Enhancing Clustering Results In Hierarchical Approach

More information

CHAPTER 6 ORTHOGONAL PARTICLE SWARM OPTIMIZATION

CHAPTER 6 ORTHOGONAL PARTICLE SWARM OPTIMIZATION 131 CHAPTER 6 ORTHOGONAL PARTICLE SWARM OPTIMIZATION 6.1 INTRODUCTION The Orthogonal arrays are helpful in guiding the heuristic algorithms to obtain a good solution when applied to NP-hard problems. This

More information

Comparison of Agglomerative and Partitional Document Clustering Algorithms

Comparison of Agglomerative and Partitional Document Clustering Algorithms Comparison of Agglomerative and Partitional Document Clustering Algorithms Ying Zhao and George Karypis Department of Computer Science, University of Minnesota, Minneapolis, MN 55455 {yzhao, karypis}@cs.umn.edu

More information

A Modified Hierarchical Clustering Algorithm for Document Clustering

A Modified Hierarchical Clustering Algorithm for Document Clustering A Modified Hierarchical Algorithm for Document Merin Paul, P Thangam Abstract is the division of data into groups called as clusters. Document clustering is done to analyse the large number of documents

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

Implementation of Text clustering using Genetic Algorithm

Implementation of Text clustering using Genetic Algorithm Implementation of Text clustering using Genetic Algorithm Dhanya P.M #, Jathavedan M *, Sreekumar A* # Department of Computer Science Rajagiri School of Engineering and Technology, Kochi, India, 682039

More information

Criterion Functions for Document Clustering Experiments and Analysis

Criterion Functions for Document Clustering Experiments and Analysis Criterion Functions for Document Clustering Experiments and Analysis Ying Zhao and George Karypis University of Minnesota, Department of Computer Science / Army HPC Research Center Minneapolis, MN 55455

More information

Chapter DM:II. II. Cluster Analysis

Chapter DM:II. II. Cluster Analysis Chapter DM:II II. Cluster Analysis Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis Density-Based Cluster Analysis Cluster Evaluation Constrained Cluster Analysis DM:II-1

More information

Scheme of Big-Data Supported Interactive Evolutionary Computation

Scheme of Big-Data Supported Interactive Evolutionary Computation 2017 2nd International Conference on Information Technology and Management Engineering (ITME 2017) ISBN: 978-1-60595-415-8 Scheme of Big-Data Supported Interactive Evolutionary Computation Guo-sheng HAO

More information

Open Access Research on the Prediction Model of Material Cost Based on Data Mining

Open Access Research on the Prediction Model of Material Cost Based on Data Mining Send Orders for Reprints to reprints@benthamscience.ae 1062 The Open Mechanical Engineering Journal, 2015, 9, 1062-1066 Open Access Research on the Prediction Model of Material Cost Based on Data Mining

More information

Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation

Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation Clustering Web Documents using Hierarchical Method for Efficient Cluster Formation I.Ceema *1, M.Kavitha *2, G.Renukadevi *3, G.sripriya *4, S. RajeshKumar #5 * Assistant Professor, Bon Secourse College

More information

Modified Particle Swarm Optimization

Modified Particle Swarm Optimization Modified Particle Swarm Optimization Swati Agrawal 1, R.P. Shimpi 2 1 Aerospace Engineering Department, IIT Bombay, Mumbai, India, swati.agrawal@iitb.ac.in 2 Aerospace Engineering Department, IIT Bombay,

More information

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York Clustering Robert M. Haralick Computer Science, Graduate Center City University of New York Outline K-means 1 K-means 2 3 4 5 Clustering K-means The purpose of clustering is to determine the similarity

More information

Texture Image Segmentation using FCM

Texture Image Segmentation using FCM Proceedings of 2012 4th International Conference on Machine Learning and Computing IPCSIT vol. 25 (2012) (2012) IACSIT Press, Singapore Texture Image Segmentation using FCM Kanchan S. Deshmukh + M.G.M

More information

Text clustering based on a divide and merge strategy

Text clustering based on a divide and merge strategy Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 55 (2015 ) 825 832 Information Technology and Quantitative Management (ITQM 2015) Text clustering based on a divide and

More information

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of

More information

Chapter 4: Text Clustering

Chapter 4: Text Clustering 4.1 Introduction to Text Clustering Clustering is an unsupervised method of grouping texts / documents in such a way that in spite of having little knowledge about the content of the documents, we can

More information

University of Florida CISE department Gator Engineering. Clustering Part 2

University of Florida CISE department Gator Engineering. Clustering Part 2 Clustering Part 2 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville Partitional Clustering Original Points A Partitional Clustering Hierarchical

More information

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering INF4820 Algorithms for AI and NLP Evaluating Classifiers Clustering Erik Velldal & Stephan Oepen Language Technology Group (LTG) September 23, 2015 Agenda Last week Supervised vs unsupervised learning.

More information

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods

More information

A Fuzzy C-means Clustering Algorithm Based on Pseudo-nearest-neighbor Intervals for Incomplete Data

A Fuzzy C-means Clustering Algorithm Based on Pseudo-nearest-neighbor Intervals for Incomplete Data Journal of Computational Information Systems 11: 6 (2015) 2139 2146 Available at http://www.jofcis.com A Fuzzy C-means Clustering Algorithm Based on Pseudo-nearest-neighbor Intervals for Incomplete Data

More information

Multi-Stage Rocchio Classification for Large-scale Multilabeled

Multi-Stage Rocchio Classification for Large-scale Multilabeled Multi-Stage Rocchio Classification for Large-scale Multilabeled Text data Dong-Hyun Lee Nangman Computing, 117D Garden five Tools, Munjeong-dong Songpa-gu, Seoul, Korea dhlee347@gmail.com Abstract. Large-scale

More information

INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering

INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering Erik Velldal University of Oslo Sept. 18, 2012 Topics for today 2 Classification Recap Evaluating classifiers Accuracy, precision,

More information

Keywords: clustering algorithms, unsupervised learning, cluster validity

Keywords: clustering algorithms, unsupervised learning, cluster validity Volume 6, Issue 1, January 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Clustering Based

More information

An Application of Genetic Algorithm for Auto-body Panel Die-design Case Library Based on Grid

An Application of Genetic Algorithm for Auto-body Panel Die-design Case Library Based on Grid An Application of Genetic Algorithm for Auto-body Panel Die-design Case Library Based on Grid Demin Wang 2, Hong Zhu 1, and Xin Liu 2 1 College of Computer Science and Technology, Jilin University, Changchun

More information

CS570: Introduction to Data Mining

CS570: Introduction to Data Mining CS570: Introduction to Data Mining Scalable Clustering Methods: BIRCH and Others Reading: Chapter 10.3 Han, Chapter 9.5 Tan Cengiz Gunay, Ph.D. Slides courtesy of Li Xiong, Ph.D., 2011 Han, Kamber & Pei.

More information

A Miniature-Based Image Retrieval System

A Miniature-Based Image Retrieval System A Miniature-Based Image Retrieval System Md. Saiful Islam 1 and Md. Haider Ali 2 Institute of Information Technology 1, Dept. of Computer Science and Engineering 2, University of Dhaka 1, 2, Dhaka-1000,

More information

Leveraging Set Relations in Exact Set Similarity Join

Leveraging Set Relations in Exact Set Similarity Join Leveraging Set Relations in Exact Set Similarity Join Xubo Wang, Lu Qin, Xuemin Lin, Ying Zhang, and Lijun Chang University of New South Wales, Australia University of Technology Sydney, Australia {xwang,lxue,ljchang}@cse.unsw.edu.au,

More information

A Hybrid Fireworks Optimization Method with Differential Evolution Operators

A Hybrid Fireworks Optimization Method with Differential Evolution Operators A Fireworks Optimization Method with Differential Evolution Operators YuJun Zheng a,, XinLi Xu a, HaiFeng Ling b a College of Computer Science & Technology, Zhejiang University of Technology, Hangzhou,

More information

Performance impact of dynamic parallelism on different clustering algorithms

Performance impact of dynamic parallelism on different clustering algorithms Performance impact of dynamic parallelism on different clustering algorithms Jeffrey DiMarco and Michela Taufer Computer and Information Sciences, University of Delaware E-mail: jdimarco@udel.edu, taufer@udel.edu

More information

Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering

Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering Team 2 Prof. Anita Wasilewska CSE 634 Data Mining All Sources Used for the Presentation Olson CF. Parallel algorithms

More information

Research Article Path Planning Using a Hybrid Evolutionary Algorithm Based on Tree Structure Encoding

Research Article Path Planning Using a Hybrid Evolutionary Algorithm Based on Tree Structure Encoding e Scientific World Journal, Article ID 746260, 8 pages http://dx.doi.org/10.1155/2014/746260 Research Article Path Planning Using a Hybrid Evolutionary Algorithm Based on Tree Structure Encoding Ming-Yi

More information

Enhancing K-means Clustering Algorithm with Improved Initial Center

Enhancing K-means Clustering Algorithm with Improved Initial Center Enhancing K-means Clustering Algorithm with Improved Initial Center Madhu Yedla #1, Srinivasa Rao Pathakota #2, T M Srinivasa #3 # Department of Computer Science and Engineering, National Institute of

More information

Redefining and Enhancing K-means Algorithm

Redefining and Enhancing K-means Algorithm Redefining and Enhancing K-means Algorithm Nimrat Kaur Sidhu 1, Rajneet kaur 2 Research Scholar, Department of Computer Science Engineering, SGGSWU, Fatehgarh Sahib, Punjab, India 1 Assistant Professor,

More information

Clustering Part 4 DBSCAN

Clustering Part 4 DBSCAN Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of

More information

Minoru SASAKI and Kenji KITA. Department of Information Science & Intelligent Systems. Faculty of Engineering, Tokushima University

Minoru SASAKI and Kenji KITA. Department of Information Science & Intelligent Systems. Faculty of Engineering, Tokushima University Information Retrieval System Using Concept Projection Based on PDDP algorithm Minoru SASAKI and Kenji KITA Department of Information Science & Intelligent Systems Faculty of Engineering, Tokushima University

More information

An Approach to Improve Quality of Document Clustering by Word Set Based Documenting Clustering Algorithm

An Approach to Improve Quality of Document Clustering by Word Set Based Documenting Clustering Algorithm ORIENTAL JOURNAL OF COMPUTER SCIENCE & TECHNOLOGY An International Open Free Access, Peer Reviewed Research Journal www.computerscijournal.org ISSN: 0974-6471 December 2011, Vol. 4, No. (2): Pgs. 379-385

More information

CHAPTER 4: CLUSTER ANALYSIS

CHAPTER 4: CLUSTER ANALYSIS CHAPTER 4: CLUSTER ANALYSIS WHAT IS CLUSTER ANALYSIS? A cluster is a collection of data-objects similar to one another within the same group & dissimilar to the objects in other groups. Cluster analysis

More information

Clustering and Visualisation of Data

Clustering and Visualisation of Data Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some

More information

Weighted Suffix Tree Document Model for Web Documents Clustering

Weighted Suffix Tree Document Model for Web Documents Clustering ISBN 978-952-5726-09-1 (Print) Proceedings of the Second International Symposium on Networking and Network Security (ISNNS 10) Jinggangshan, P. R. China, 2-4, April. 2010, pp. 165-169 Weighted Suffix Tree

More information

Comparative Study Of Different Data Mining Techniques : A Review

Comparative Study Of Different Data Mining Techniques : A Review Volume II, Issue IV, APRIL 13 IJLTEMAS ISSN 7-5 Comparative Study Of Different Data Mining Techniques : A Review Sudhir Singh Deptt of Computer Science & Applications M.D. University Rohtak, Haryana sudhirsingh@yahoo.com

More information

Ranking Web Pages by Associating Keywords with Locations

Ranking Web Pages by Associating Keywords with Locations Ranking Web Pages by Associating Keywords with Locations Peiquan Jin, Xiaoxiang Zhang, Qingqing Zhang, Sheng Lin, and Lihua Yue University of Science and Technology of China, 230027, Hefei, China jpq@ustc.edu.cn

More information

Comment Extraction from Blog Posts and Its Applications to Opinion Mining

Comment Extraction from Blog Posts and Its Applications to Opinion Mining Comment Extraction from Blog Posts and Its Applications to Opinion Mining Huan-An Kao, Hsin-Hsi Chen Department of Computer Science and Information Engineering National Taiwan University, Taipei, Taiwan

More information

A Clustering Method with Efficient Number of Clusters Selected Automatically Based on Shortest Path

A Clustering Method with Efficient Number of Clusters Selected Automatically Based on Shortest Path A Clustering Method with Efficient Number of Clusters Selected Automatically Based on Shortest Path Makki Akasha, Ibrahim Musa Ishag, Dong Gyu Lee, Keun Ho Ryu Database/Bioinformatics Laboratory Chungbuk

More information

2 Proposed Methodology

2 Proposed Methodology 3rd International Conference on Multimedia Technology(ICMT 2013) Object Detection in Image with Complex Background Dong Li, Yali Li, Fei He, Shengjin Wang 1 State Key Laboratory of Intelligent Technology

More information

The Comparative Study of Machine Learning Algorithms in Text Data Classification*

The Comparative Study of Machine Learning Algorithms in Text Data Classification* The Comparative Study of Machine Learning Algorithms in Text Data Classification* Wang Xin School of Science, Beijing Information Science and Technology University Beijing, China Abstract Classification

More information

Olmo S. Zavala Romero. Clustering Hierarchical Distance Group Dist. K-means. Center of Atmospheric Sciences, UNAM.

Olmo S. Zavala Romero. Clustering Hierarchical Distance Group Dist. K-means. Center of Atmospheric Sciences, UNAM. Center of Atmospheric Sciences, UNAM November 16, 2016 Cluster Analisis Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster)

More information

The Application of K-medoids and PAM to the Clustering of Rules
