A Study on Clustering Method by Self-Organizing Map and Information Criteria

Satoru Kato, Tadashi Horiuchi, and Yoshio Itoh
Matsue College of Technology, 4-4 Nishi-ikuma, Matsue, Shimane, JAPAN, kato@matsue-ct.ac.jp
Tottori University, 4-0 Koyama-cho minami, Tottori, JAPAN

Abstract. In this paper, we propose a clustering method based on the Self-Organizing Map (SOM) and information criteria. In this method, initial cluster candidates are derived by SOM, and these candidates are then merged appropriately based on an information criterion such as BIC (Bayesian Information Criterion) or AIC (Akaike Information Criterion). Through clustering experiments on artificial datasets and UCI Machine Learning Repository datasets, we confirm that the proposed method can extract clusters more accurately and stably than the SOM-only method.

1 Introduction

Clustering by the Self-Organizing Map (SOM) [1] can extract clusters of arbitrary distribution shapes based on the distance between the code vectors (representative points of the input data) [2]. Recently, several improved methods that alter the basic SOM algorithm have been proposed [3][4]. Hence, this is one of the distance-based clustering approaches. On the other hand, there are distribution-based clustering approaches that consider the distribution of the input data when extracting clusters. For example, the x-means method [5] adopts the Bayesian Information Criterion (BIC) into the k-means method. Information criteria can also be easily introduced into the SOM-based clustering method. In this paper, we propose a clustering method based on SOM and information criteria. In the proposed method, initial cluster candidates are derived by SOM, and these candidates are then merged appropriately based on an information criterion such as BIC or AIC (Akaike Information Criterion). Through clustering experiments on artificial datasets and UCI Machine Learning Repository datasets, we confirm that the proposed method can extract clusters more accurately and stably than the SOM-only method.
Furthermore, we show that AIC is better suited to the proposed method than BIC.
2 Clustering by SOM

2.1 Basic algorithm

The SOM, proposed by Kohonen, is configured as shown in Fig. 1. In the basic learning algorithm [1], the code vectors are updated by the following equations:

w_i(t+1) = w_i(t) + α(t) Φ(p_i) (x − w_i(t))    (1)

Φ(p_i) = exp( −p_i² / σ²(t) )    (2)

Here α(t) is the learning coefficient after t learning steps. The coefficient starts from its initial value α_ini and then decreases monotonically as t increases, reaching its minimum at the pre-set maximum number of learning steps T. In addition, Φ(p_i) is a neighborhood function centered at the winner cell c, and p_i is the distance from cell i to the winner cell c in the competitive layer. In Eq. (2), σ(t) is a time-varying parameter that defines the neighborhood size in the competitive layer. Like α(t) in Eq. (1), this parameter decreases monotonically from its initial value σ_ini as learning proceeds. As a result of learning, the similarity between learning data is expressed by closeness on the grid of the competitive layer. In addition, the data density in the input data space is reflected in the distribution of the code vectors after learning.

2.2 Cluster extraction from SOM

In the maps built by SOM learning, the code vectors of adjacent cells in the grid of the competitive layer are similar, and the data density in the input space is reflected in the distribution of the code vectors after learning. Using these features, as pointed out by Terashima et al. [2], clustering can be performed by detecting cluster boundaries as portions where the code vectors of adjacent cells are substantially different. The specific clustering procedure is presented below. A one-dimensional SOM is used for simplicity of analysis; the m cells in the competitive layer are arranged in a one-dimensional array.

1. Map building: The input data are subjected to SOM learning to obtain a set of code vectors.
2. Map analysis
(a) For every cell i (i = 1, 2, ..., m−1), the code vector density dw_i is found from the following equation as the Euclidean distance between the code vectors of cells i and i+1:

dw_i = ||w_i − w_{i+1}||    (3)
[Fig. 1. Basic structure of the one-dimensional SOM: input vector x in the input layer, code vector w_i of neuron cell i in the competitive layer]

(b) The code vector density dw_i for every cell i (i = 1, 2, ..., m−1) is normalized by its maximum and minimum to the range [0, 1], giving the normalized density dw'_i:

dw'_i = (dw_i − dw_min) / (dw_max − dw_min)    (4)

(c) The histogram of dw'_i is derived. A cluster boundary is recognized between a cell i corresponding to a histogram peak and its neighbor cell i+1.

3. Labeling: The competitive layer is divided according to the dw'_i histogram, and every group of cells is labeled appropriately.

3 Proposed method

3.1 Basic idea

There are many upward peaks in the density histogram of the code vectors. Each of them may or may not clearly indicate a cluster boundary, so many cluster candidates can be extracted from the density histogram. The basic idea of the proposed method is to merge these cluster candidates appropriately by using information criteria, according to the following procedure:

A. Make the code-vector density (dw_i) histogram after the learning process of a one-dimensional SOM.
B. Extract cluster candidates from the density histogram and assign a consecutive number to each candidate. These numbers are ordered according to the neuron cells in the competitive layer of the SOM.
C. Decide which cluster candidates should be merged from arbitrary pairs of candidates whose numbers are adjacent to each other.

Fig. 2 shows a practical sequence of the proposed method. Through procedure A, we obtain Fig. 2(a) and (b), and Fig. 2(c) is obtained after procedure B. Then, cluster candidates are gradually merged by applying procedure C repeatedly until the number of clusters agrees with the original one (see Fig. 2(d)-(f)).
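The learning rule and the map-analysis steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a Gaussian neighborhood on a 1-D grid, linearly decaying α(t) and σ(t), and a simple threshold on the normalized density as the peak-detection criterion; all function names and parameter defaults are our own.

```python
import numpy as np

def train_som_1d(data, m=30, T=None, alpha_ini=0.5, sigma_ini=None, seed=0):
    """One-dimensional SOM learning (sketch of Eqs. (1)-(2))."""
    rng = np.random.default_rng(seed)
    T = T if T is not None else 100 * len(data)   # max learning steps
    sigma_ini = sigma_ini if sigma_ini is not None else m / 2.0
    w = data[rng.integers(len(data), size=m)].astype(float)  # init from data
    for t in range(T):
        x = data[rng.integers(len(data))]
        frac = t / T
        alpha = alpha_ini * (1.0 - frac)           # decreasing learning coeff.
        sigma = sigma_ini * (1.0 - frac) + 1e-3    # shrinking neighborhood
        c = int(np.argmin(np.linalg.norm(w - x, axis=1)))  # winner cell
        p = np.abs(np.arange(m) - c)               # grid distance to winner
        phi = np.exp(-(p ** 2) / sigma ** 2)       # neighborhood function
        w += alpha * phi[:, None] * (x - w)        # update all code vectors
    return w

def code_vector_density(w):
    """Normalized code-vector density dw'_i (sketch of Eqs. (3)-(4))."""
    dw = np.linalg.norm(w[:-1] - w[1:], axis=1)    # distance of adjacent cells
    return (dw - dw.min()) / (dw.max() - dw.min() + 1e-12)

def split_at_peaks(dw_norm, threshold=0.5):
    """Label cells, cutting between cells i and i+1 where dw'_i peaks."""
    labels = np.zeros(len(dw_norm) + 1, dtype=int)  # one label per cell
    for b in np.where(dw_norm > threshold)[0]:
        labels[b + 1:] += 1
    return labels
```

With a small map trained on two well-separated blobs, `code_vector_density` spikes at the cells whose code vectors straddle the gap, and `split_at_peaks` assigns the cells on each side to different candidate clusters.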
[Fig. 2. Clustering process of the proposed method: (a) distribution of code vectors, (b) density histogram of code vectors, (c) cluster candidates (initial state), (d) 1st merge, (e) 2nd merge, (f) 3rd merge]

3.2 BIC and AIC

When a distribution of data x is observed, a family of alternative models which could generate the distribution can be considered. An information criterion is a useful guideline for determining which model is the most suitable. The Bayesian Information Criterion (BIC) and the Akaike Information Criterion (AIC) are typical ones, calculated by the following equations, respectively:

BIC = −2 log L(θ̂; x) + q log n    (5)

AIC = −2 log L(θ̂; x) + 2q    (6)

Here, q is the dimension of the parameter vector θ̂, n is the number of samples of the empirical distribution, and L(·) = f(·), where f(·) is the p-dimensional Gaussian distribution:

f(θ̂; x) = (2π)^{−p/2} |V|^{−1/2} exp{ −(1/2) (x − μ)^T V^{−1} (x − μ) }    (7)
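As a concrete illustration of Eqs. (5)-(7), the criterion for a single Gaussian fitted to a set of samples might be computed as follows. This is a sketch under our own assumptions: L is taken as the product of f over the samples, q counts the free mean and covariance parameters, and the small ridge added to V is purely for numerical safety; the function name is ours.

```python
import numpy as np

def gaussian_ic(x, criterion="BIC"):
    """BIC/AIC of a single p-variate Gaussian fitted to samples x of
    shape (n, p), in the spirit of Eqs. (5)-(7)."""
    n, p = x.shape
    mu = x.mean(axis=0)
    V = np.cov(x, rowvar=False, bias=True) + 1e-6 * np.eye(p)  # regularized
    diff = x - mu
    Vinv = np.linalg.inv(V)
    # log-likelihood: sum of log f over all samples
    loglik = -0.5 * (n * (p * np.log(2 * np.pi) + np.log(np.linalg.det(V)))
                     + np.einsum("ij,jk,ik->", diff, Vinv, diff))
    q = p + p * (p + 1) // 2          # free parameters: mean + covariance
    penalty = q * np.log(n) if criterion == "BIC" else 2 * q
    return -2 * loglik + penalty
```

For n > e² ≈ 7.4 samples, the BIC penalty q log n exceeds the AIC penalty 2q, so BIC punishes model complexity more heavily as the sample grows; this is the effect of n discussed in the experimental results.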
In Eqs. (5) and (6), the first term is the logarithmic likelihood when the model described by the parameter θ̂ is applied to the empirical distribution of x, and the second term, called the penalty term, indicates the complexity of the model.

3.3 Cluster merging using an information criterion

The procedure for selective cluster merging (procedure C in Sec. 3.1) is divided into the following steps in practice. Here, it is assumed that procedures A and B in Sec. 3.1 have already finished.

C1. Merge a pair of adjacent cluster candidates temporarily, and calculate the two values IC_single and IC_twin by using either Eq. (5) or Eq. (6). Here, IC_single and IC_twin denote the value of BIC or AIC when the distribution model applied to the unified clusters is a single distribution or a twin distribution, respectively.
C2. Calculate ΔIC, the difference between IC_single and IC_twin, by the following equation:

ΔIC = IC_single − IC_twin    (8)

C3. After calculating ΔIC for all pairs of adjacent cluster candidates, find the pair with the minimum ΔIC and conclusively merge the two cluster candidates of that pair. Then, the consecutive numbers of the cluster candidates, including the new cluster, are refreshed.
C4. Repeat steps C1 to C3 until the number of clusters reaches a specified value.

Note that IC_single < IC_twin holds when fitting a single distribution to the unified clusters is more suitable than fitting a twin distribution. Therefore, ΔIC measures the propriety of merging two adjacent clusters.

4 Clustering experiments

4.1 Experimental method

We use four kinds of data distributions as experimental datasets. Two datasets are generated artificially to consist of two or three clusters whose densities or distribution shapes differ. The other two datasets are the Iris and BCW datasets from the UCI Machine Learning Repository [6], as examples of practical data. Performance evaluation is carried out using the classification error.
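Steps C1-C4 can be sketched as a greedy merging loop. This is an illustrative implementation, not the authors' code: `merge_candidates` and the `ic` argument are our names, `ic` can be any per-cluster information-criterion function, and IC_twin is taken here as the sum of the two candidates' individual criteria.

```python
import numpy as np

def merge_candidates(clusters, k_target, ic):
    """Greedily merge adjacent cluster candidates (steps C1-C4).
    `clusters` is a list of sample arrays ordered along the SOM grid;
    `ic(samples)` returns an information-criterion value for one cluster.
    Repeatedly merges the adjacent pair with minimum
    dIC = IC_single - IC_twin until k_target clusters remain."""
    clusters = list(clusters)
    while len(clusters) > k_target:
        deltas = []
        for a, b in zip(clusters[:-1], clusters[1:]):
            merged = np.vstack([a, b])
            ic_single = ic(merged)       # one model for the unified pair
            ic_twin = ic(a) + ic(b)      # one model per candidate
            deltas.append(ic_single - ic_twin)
        j = int(np.argmin(deltas))       # most merge-worthy adjacent pair
        clusters[j:j + 2] = [np.vstack([clusters[j], clusters[j + 1]])]
    return clusters
```

Because only adjacent pairs are considered, the candidate ordering along the competitive layer is preserved throughout the merging, matching the renumbering described in step C3.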
The classification error is calculated by comparing the class indices of the original dataset with those obtained from the clustering result. When applying the SOM learning algorithm in the proposed method, we set the number of learning iterations to 00 times the number of input data, and the number of cells in the competitive layer to ,0,,30 or 3. We make 00 trials for each SOM learning setting and apply the cluster merging procedure with either BIC or AIC to each learning result, so that 00 kinds of clustering results (00 trials × patterns of competitive-layer size) are obtained for each dataset.
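The comparison of class indices can be made concrete as follows. The matching rule is our assumption (the paper only states that the indices are compared): cluster labels are relabeled by the one-to-one assignment to the original classes that maximizes agreement, and the error is the remaining fraction of mismatches.

```python
from itertools import permutations

import numpy as np

def classification_error(true_labels, pred_labels):
    """Classification error (%) between original class indices and
    cluster labels, under the best one-to-one relabeling of clusters
    (assumed matching rule; feasible for small numbers of clusters)."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    cluster_ids = np.unique(pred_labels)
    best = 0.0
    for perm in permutations(np.unique(true_labels), len(cluster_ids)):
        mapping = dict(zip(cluster_ids, perm))     # cluster -> class
        relabeled = np.array([mapping[p] for p in pred_labels])
        best = max(best, float(np.mean(relabeled == true_labels)))
    return 100.0 * (1.0 - best)
```

For example, a perfect two-cluster result with swapped labels scores 0% error, since the relabeling absorbs the arbitrary cluster numbering.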
[Fig. 3. Artificial and practical datasets for the clustering experiments: (a) artificial dataset (different densities, 3 clusters), (b) artificial dataset (distorted distributions), (c) UCI Iris data, (d) UCI BCW data; (c) and (d) are plotted on the 1st and 2nd principal components]

4.2 Experimental results

Fig. 4 shows the results of the clustering performance evaluation for each dataset. We calculate the average classification error over 00 trials for each of the five patterns of SOM competitive-layer size, so that in the legend of each figure, Worst, Average, and Best indicate the maximum, average, and minimum of the average classification error, respectively, among the five settings. SOM+BIC and SOM+AIC correspond to the proposed method, and SOM-only is the conventional method, which extracts clusters from the histogram of code-vector density, as shown in Fig. 2(b), with an appropriate threshold setting. In the case of the artificial dataset and the UCI BCW dataset, the SOM-only method shows a very high classification error. These datasets include clusters whose densities differ considerably from each other, and it is hard to estimate the cluster boundaries correctly using only the code-vector density histogram. On the other
[Fig. 4. Comparison of clustering performance (classification error of k-means, SOM+BIC, SOM+AIC, and SOM-only, with Worst/Average/Best bars): (a) artificial dataset, (b) artificial dataset, (c) UCI Iris data, (d) UCI BCW data]

hand, the proposed method can extract each cluster in the dataset more accurately than the other methods, except for the k-means method in the case of the BCW dataset. The BCW dataset contains comparatively high-dimensional data (each data point has 0 attributes); therefore, distribution-based approaches such as SOM+BIC and SOM+AIC may not be able to estimate the parameters of the distribution model, such as μ and V in Eq. (7), correctly. Comparing the classification errors of SOM+BIC and SOM+AIC, both methods show almost the same clustering performance except in the case of the UCI Iris dataset. In Eq. (5), the penalty term includes the number of samples n, and ΔIC becomes small if n is large. Hence, in the case of the SOM+BIC method, one cluster candidate with a large number of samples tends to absorb adjoining candidates one after another.
5 Conclusion

In this paper, we combined the SOM clustering methodology with an appropriate cluster merging approach based on information criteria such as BIC and AIC. Since it pays attention to the naturalness of each data distribution as a cluster, the proposed method can extract clusters more correctly than conventional methods, especially when the dataset consists of clusters whose densities differ from each other. In the clustering experiments using several kinds of artificial and practical datasets, the proposed method shows a lower classification error than other conventional methods such as the k-means method and the SOM-based simple clustering method. Furthermore, we confirmed that AIC is better suited to the proposed method than BIC. As future work, it is necessary to examine the effectiveness of the proposed method on more kinds of practical datasets.

References

1. T. Kohonen: Self-Organizing Maps, 3rd ed., Springer-Verlag, Berlin (2001)
2. M. Terashima, F. Shiratani, K. Yamamoto: Unsupervised Cluster Segmentation Method Using Data Density Histogram on Self-Organizing Feature Map, IEICE Trans., Vol. J79-D-II, No. 7, pp. 80-90 (in Japanese)
3. S. Kato, K. Koike, T. Horiuchi, Y. Itoh: A Study on Two-Stage Self-Organizing Map Suitable for Clustering Problems, Proceedings of the International Symposium on Intelligent Signal Processing and Communication Systems, pp. 77-80
4. H. Matsushita, Y. Nishio: Reunifying Self-Organizing Map and Disconnecting Self-Organizing Map, RISP Journal of Signal Processing, pp. 44-4 (2007)
5. D. Pelleg, A. Moore: X-means: Extending K-means with Efficient Estimation of the Number of Clusters, Proc. of the 17th International Conference on Machine Learning, pp. 727-734 (2000)
6. UCI Machine Learning Repository, http://www.ics.uci.edu/~mlearn/MLRepository.html