A Novel Validity Index for Determination of the Optimal Number of Clusters


IEICE TRANS. INF. & SYST., VOL.E84-D, NO.2 FEBRUARY 2001, p.281

LETTER

A Novel Validity Index for Determination of the Optimal Number of Clusters

Do-Jong KIM, Yong-Woon PARK, and Dong-Jo PARK, Nonmembers

SUMMARY The structural characteristics of clusters are investigated in the partitioning process. Two partition functions, which show opposite properties around the optimal cluster number, are found, and a new cluster validity index is presented based on the combination of these functions. Some properties of the index function are discussed and numerical examples are presented.

key words: clustering, validity index, optimal cluster number

1. Introduction

The hard c-means algorithm (HCMA) and the fuzzy c-means algorithm (FCMA) are well known for their efficiency in clustering large data sets [2]. Although these algorithms require several parameters, the one that most significantly affects performance is the number of clusters $c$. Different choices of $c$ may lead to different clustering results. Thus, the estimation of the optimal cluster number ($c^*$) during the clustering process is a prime concern. Many functions, called cluster validity or validity criteria, have been proposed in the literature in order to find an optimal number of clusters. The partition coefficient ($v_{PC}$) and the partition entropy ($v_{PE}$), which use the partition matrix, were introduced by Bezdek [2]. Other criteria, which take into account the geometric properties of the input data, were proposed by Fukayama and Sugeno ($v_{FS}$) [3] and Xie and Beni ($v_{XB}$) [4]. The indices $v_{PC}$ and $v_{PE}$ are sensitive to noise or to the weighting exponent $m$, and $v_{FS}$ is sensitive to both high and low values of $m$. Moreover, the indices $v_{PC}$, $v_{PE}$ and $v_{FS}$ are not useful for HCMA. According to Pal and Bezdek's analysis [5], the index $v_{XB}$ provided a good response over a wide range of choices, both for $c = 2$ to 10 and for $m = 1.01$ to 7. However, $v_{XB}$ decreases monotonically as the number of clusters becomes very large and close to the number of data $n$.
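The letter does not restate the FCMA updates or the partition-matrix indices it refers to. The following is a minimal sketch of the textbook forms, given only to make the roles of $c$, $m$ and the partition matrix concrete. The function names, the stopping rule (largest change in a membership value below `eps`) and the toy data are assumptions for illustration, not the authors' implementation; `m = 2` and `eps = 1e-5` are the settings the letter reports in its experiments.

```python
import numpy as np


def prototypes(X, U, m):
    """Membership-weighted cluster centers, shape (c, p)."""
    Um = U ** m
    return (Um @ X) / Um.sum(axis=1, keepdims=True)


def fcm(X, c, m=2.0, eps=1e-5, max_iter=300, seed=0):
    """Fuzzy c-means by alternating optimization (textbook form).

    Starts from a random partition matrix U of shape (c, n) whose
    columns sum to 1. Stopping on max |U_new - U| < eps is an assumption.
    """
    rng = np.random.default_rng(seed)
    U = rng.random((c, X.shape[0]))
    U /= U.sum(axis=0, keepdims=True)
    for _ in range(max_iter):
        V = prototypes(X, U, m)
        # Distance from every prototype to every point, shape (c, n).
        d = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2)
        inv = np.fmax(d, 1e-12) ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=0, keepdims=True)
        converged = np.abs(U_new - U).max() < eps
        U = U_new
        if converged:
            break
    return U, prototypes(X, U, m)


def partition_coefficient(U):
    """v_PC = (1/n) * sum of squared memberships (larger means crisper)."""
    return float((U ** 2).sum() / U.shape[1])


def partition_entropy(U):
    """v_PE = -(1/n) * sum of u * ln(u) (smaller means crisper)."""
    return float(-(U * np.log(np.fmax(U, 1e-12))).sum() / U.shape[1])


# Toy data (hypothetical): two well-separated groups of four points.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
U, V = fcm(X, c=2)
print(V[np.argsort(V[:, 0])])
print(partition_coefficient(U), partition_entropy(U))
```

Both indices depend only on `U`, which is why they cannot be used with a hard partition in any informative way: for a crisp `U` they always take their extreme values.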
To eliminate the monotonically decreasing tendency, Kweon [6] added an ad hoc punishing term and proposed a new validity index ($v_K$). Recently, another approach based on a dynamic estimation method ($v_{crit}$) was presented by Boudraa [7]. In order to overcome some limitations of the previous studies, we present a new validity criterion which considers the structural characteristics around the optimal cluster number in the partitioning process. It shows a clear valley at $c = c^*$ and eliminates the decreasing tendency for large $c$. It is also applicable to both HCMA and FCMA.

Manuscript received July 31, 2000. Manuscript revised October 10, 2000. The authors are with the Department of Electrical Engineering, Korea Advanced Institute of Science and Technology, 373-1 Kusong-dong, Yusong-gu, Taejon 305-701, Republic of Korea. The authors are with the Agency for Defence Development, Yusong P.O.Box 35-1, Taejon, Republic of Korea.

2. Cluster Structure Related Functions

The clustering algorithms based on the objective function (FCMA and HCMA) minimize the sum of the intra-cluster distances in the process of optimization. Figure 1 shows a simple partitioning process for $c = 2$ to 4. The data consist of three compact classes ($c^* = 3$), $v_i$ is the prototype associated with the $i$-th cluster, and each cluster is distinguished by a different marker. In this paper, clusters are said to be in the under-partitioned state when $c < c^*$ and in the over-partitioned state when $c > c^*$. In addition, the mean intra-cluster distance (MICD) of the $i$-th cluster is defined as $MD_i = \sum_{x \in \chi_i} \|v_i - x\| / n_i$, where $\chi_i$ is the data set of the $i$-th cluster and $n_i$ is the number of data in the $i$-th cluster. When the data are structurally under-partitioned, as shown in Fig. 1 (a), at least one cluster maintains a large MICD. As the partition state moves to the optimal and over-partitioned ones ($c \ge c^*$), the large MICD abruptly decreases.
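The MICD definition can be sketched as follows. The function name `micd`, the use of hard labels to represent the sets $\chi_i$, the convention that an empty cluster gets $MD_i = 0$, and the toy data (three compact classes on a line) are all assumptions made for illustration.

```python
import numpy as np


def micd(X, labels, V):
    """Mean intra-cluster distance MD_i of every cluster (hard labels)."""
    md = np.zeros(V.shape[0])
    for i in range(V.shape[0]):
        members = X[labels == i]
        if len(members):  # an empty cluster keeps MD_i = 0 (assumption)
            md[i] = np.linalg.norm(members - V[i], axis=1).mean()
    return md


# Toy data (hypothetical): three compact classes, two points each.
X = np.array([[-0.5, 0.0], [0.5, 0.0],
              [9.5, 0.0], [10.5, 0.0],
              [19.5, 0.0], [20.5, 0.0]])

# Optimal partition, c = 3: every class has its own prototype.
labels3 = np.array([0, 0, 1, 1, 2, 2])
V3 = np.array([[0.0, 0.0], [10.0, 0.0], [20.0, 0.0]])

# Under-partitioned, c = 2: the first two classes share one prototype.
labels2 = np.array([0, 0, 0, 0, 1, 1])
V2 = np.array([[5.0, 0.0], [20.0, 0.0]])

print(micd(X, labels3, V3))  # every MD_i equals 0.5
print(micd(X, labels2, V2))  # the merged cluster has MD = 5.0, the other 0.5
```

In the under-partitioned case the merged cluster's MICD is ten times larger than in the optimal partition, which is the abrupt change the text describes.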
On the other hand, the inter-cluster minimum distance (ICMD), which is defined as $d_{min} = \min_{i \ne j} \|v_i - v_j\|$ [4], is large in the under-partitioned and optimally partitioned states. As the state enters the over-partitioned one, ICMD becomes very small because at least one of the compact classes is subdivided, as shown in Fig. 1 (c). Therefore, it is possible to find an optimal cluster number by using two measures, MICD and ICMD, each of which varies differently around $c^*$. That is, at least one of the MICDs abruptly changes at $c^*$, and so does ICMD at $c^* + 1$. A simple illustration of this tendency is shown in Fig. 2 (a).

Fig. 1 Example of partitioning process. (a) c = 2. (b) c = 3. (c) c = 4.

Fig. 2 Illustrations of cluster related functions w.r.t. c. (a) MICD and ICMD. (b) v_u(c) and v_o(c).

Let $X = [x_1, x_2, \ldots, x_n]^T$ be a finite data set in a $p$-dimensional feature space, where $x_i$ is a $1 \times p$ vector. Let $V = [v_1, v_2, \ldots, v_c]^T$ be a $c \times p$ prototype matrix, where each $v_i$ is a $1 \times p$ vector that characterizes one of the clusters.

2.1 Under-Partition Measure Function

To find the under-partitioned status, we define $v_u(c, V; X)$ as an under-partition measure function:

$$v_u(c, V; X) = \frac{1}{c} \sum_{i=1}^{c} MD_i, \quad 2 \le c \le c_{max}. \tag{1}$$

It is the mean of MICD over the $c$ clusters and measures the structural compactness of each and every class. When the data are optimally or over-partitioned, every class becomes compact and this makes $v_u(c)$ small. Furthermore, as the cluster number becomes very large and close to the number of data points $n$, the mean distance becomes 0. However, in the under-partitioned state, $v_u(c)$ becomes relatively large because some of the compact classes may be grouped into a single cluster. Therefore, this function produces a break point at the optimal cluster number, that is, it has very small values for $c \ge c^*$ and relatively large values for $c < c^*$, as shown in Fig. 2 (b). Thus, it plays a key role in determining whether or not an under-partitioned status has occurred in the partitioning process.

2.2 Over-Partition Measure Function

An over-partition measure function $v_o(c, V)$ is defined as follows:

$$v_o(c, V) = \frac{c}{d_{min}}, \quad 2 \le c \le c_{max}. \tag{2}$$

The denominator of this function ($d_{min}$), which is the minimum distance between cluster centers, measures inter-cluster separation. When the data are optimally or under-partitioned, $d_{min}$ becomes large, hence $v_o(c)$ yields a small value.
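The two measure functions of Eqs. (1) and (2) can be sketched as below for a hard partition. The function names, the handling of empty clusters and the toy partitions (three compact classes on a line, partitioned with $c = 2, 3, 4$) are assumptions for illustration only.

```python
import numpy as np


def under_partition(X, labels, V):
    """v_u of Eq. (1): mean over clusters of the mean intra-cluster distance."""
    md = [np.linalg.norm(X[labels == i] - V[i], axis=1).mean()
          if np.any(labels == i) else 0.0  # empty cluster -> 0 (assumption)
          for i in range(len(V))]
    return float(np.mean(md))


def over_partition(V):
    """v_o of Eq. (2): c divided by the minimum distance between prototypes."""
    c = len(V)
    d = np.linalg.norm(V[:, None, :] - V[None, :, :], axis=2)
    d_min = d[~np.eye(c, dtype=bool)].min()  # ignore the zero diagonal
    return float(c / d_min)


# Toy data (hypothetical): three compact classes, two points each.
X = np.array([[-0.5, 0.0], [0.5, 0.0],
              [9.5, 0.0], [10.5, 0.0],
              [19.5, 0.0], [20.5, 0.0]])

labels2 = np.array([0, 0, 0, 0, 1, 1])            # under-partitioned
V2 = np.array([[5.0, 0.0], [20.0, 0.0]])
labels3 = np.array([0, 0, 1, 1, 2, 2])            # optimal
V3 = np.array([[0.0, 0.0], [10.0, 0.0], [20.0, 0.0]])
labels4 = np.array([0, 0, 1, 1, 2, 3])            # over-partitioned
V4 = np.array([[0.0, 0.0], [10.0, 0.0], [19.5, 0.0], [20.5, 0.0]])

for c, (lab, V) in zip((2, 3, 4), ((labels2, V2), (labels3, V3), (labels4, V4))):
    print(c, under_partition(X, lab, V), over_partition(V))
```

On this toy data $v_u$ drops from 2.75 to 0.5 when $c$ goes from 2 to 3, while $v_o$ jumps from 0.3 to 4.0 when $c$ goes from 3 to 4, so the two break points sit on either side of $c^* = 3$.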
However, as the data are over-partitioned, $d_{min}$ becomes very small because some of the compact classes may be subdivided into several clusters. Therefore, this function also produces a break point at $c^*$, that is, it has very large values for $c > c^*$ and relatively small values for $c \le c^*$, as shown in Fig. 2 (b). Thus, it also plays a key role in determining whether or not an over-partitioned status has occurred in the partitioning process.

3. Validity Index

As described in the previous section, both partition measure functions have break points at the optimal cluster number: $v_u(c)$ becomes small for $c \ge c^*$ and $v_o(c)$ becomes small for $c \le c^*$. Since both functions have small values only at $c = c^*$, an appropriate combination of the two functions yields the optimal number of clusters easily. On the other hand, the functions have different scales with respect to the structure and number of data. In order to accommodate this relative mismatch, we apply a normalization process. Let us define the partition measure vectors as

$$\mathbf{v}_u = [v_u(2, V; X), \ldots, v_u(c_{max}, V; X)], \tag{3}$$

$$\mathbf{v}_o = [v_o(2, V), \ldots, v_o(c_{max}, V)]. \tag{4}$$

For each vector, the maximum and minimum values are computed as

$$v_{max} = \max_{c} v_u(c, V; X), \quad v_{min} = \min_{c} v_u(c, V; X), \quad c = 2, 3, \ldots, c_{max}, \tag{5}$$

then the normalization of each element becomes

$$v_{un}(c, V; X) = \frac{v_u(c, V; X) - v_{min}}{v_{max} - v_{min}}. \tag{6}$$

Thus, $v_{un}(c)$ always lies between 0 and 1; $v_{on}(c)$ is obtained from $\mathbf{v}_o$ in the same way. Consequently, the normalized partition measure vectors are written as

$$\mathbf{v}_{un} = [v_{un}(2, V; X), \ldots, v_{un}(c_{max}, V; X)], \tag{7}$$

$$\mathbf{v}_{on} = [v_{on}(2, V), \ldots, v_{on}(c_{max}, V)]. \tag{8}$$

By adding the two normalized partition measure functions, a new validity index, $v_{SV}$, is formulated as follows:

$$v_{SV}(c, V; X) = v_{un}(c, V; X) + v_{on}(c, V). \tag{9}$$

The goal is to find the optimal cluster number as the one with the smallest value of $v_{SV}(c)$ for $c = 2$ to $c_{max}$.

4. Experimental Results

Two kinds of experiments were performed to verify the proposed method. The first one is intended to determine the optimal number of clusters with two noisy data sets, Data 1 and Data 2. The second one is to show how effectively $v_{SV}$ works on object extraction in real images. The proposed index, $v_{SV}$, is also compared with other ones: $v_{PC}$, $v_{PE}$, $v_{FS}$, $v_{XB}$, $v_K$, and $v_{crit}$. For each of the data sets, FCMA was performed with the weighting exponent $m = 2$ and a terminating condition $\epsilon = 10^{-5}$. An alternating optimization technique is used for the clustering, and the fuzzy partition matrix is chosen as the initial value ($U_0$) instead of the prototype. Also, the partition matrix is randomly generated such that the fuzzy membership values satisfy the following conditions: $u_{ik} \in [0, 1]$ for all $i, k$ and $\sum_i u_{ik} = 1$ for all $k$.

In the first experiment, the cluster number is varied from 2 to 10, for both the low-noise data and the relatively high-noise data, to find the optimal number of clusters. As Figs. 3 (a) and 4 (a) suggest, the expected optimal cluster numbers are $c^* = 6$ for Data 1 and $c^* = 5$ for Data 2. Figures 3 (b) and 4 (b) show the validity related functions with respect to the cluster number. For the low-noise case (Data 1), the two functions $v_{un}(c)$ and $v_{on}(c)$ show a fast gradient change around $c = 6$. Naturally, the index function $v_{SV}(c)$ shows a steep valley at $c = 6$. Similar results are obtained for the relatively high-noise case (Data 2). It is also observed that more noise in the data produces a less steep valley at the optimal cluster number. In other words, the steepness of the valley at $c = c^*$ indicates how compact and well separated the clusters are.

Fig. 3 Data 1 and results. (a) Input data. (b) Index functions.

Fig. 4 Data 2 and results. (a) Input data. (b) Index functions.

Table 1 shows the values of the seven validity indices for $c = 2$ to 10. We highlighted the optimal value of $c$ chosen by each index. For the noisy data sets, $v_{PC}$ or $v_{PE}$ gives incorrect numbers. Although $v_{FS}$ works fairly well for the two data sets, it is known that this index becomes unreliable for large or small $m$ [5].
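The min-max normalization of Eqs. (5) and (6) and the sum of Eq. (9) can be sketched as follows. The function names, the guard for a constant vector, and the three-element input vectors (hand-computed toy values for $c = 2, 3, 4$, not values from the letter) are assumptions for illustration.

```python
import numpy as np


def min_max(v):
    """Normalize a partition measure vector to [0, 1], as in Eqs. (5)-(6)."""
    v = np.asarray(v, dtype=float)
    span = v.max() - v.min()
    if span == 0.0:
        # Constant vector: the letter does not cover this case; returning
        # zeros is a guard added here to avoid dividing by zero.
        return np.zeros_like(v)
    return (v - v.min()) / span


def v_sv(v_u, v_o):
    """Validity index of Eq. (9): sum of the two normalized vectors."""
    return min_max(v_u) + min_max(v_o)


def best_c(v_u, v_o, c_min=2):
    """Cluster number with the smallest v_SV; entry 0 corresponds to c_min."""
    return c_min + int(np.argmin(v_sv(v_u, v_o)))


# Toy measure values (hypothetical) for c = 2, 3, 4.
v_u = [2.75, 0.5, 0.25]     # large only when under-partitioned
v_o = [2 / 15, 0.3, 4.0]    # large only when over-partitioned

print(v_sv(v_u, v_o))
print(best_c(v_u, v_o))     # 3
```

Because each normalized vector reaches 1 at its own worst end, $v_{SV}$ is close to 1 at both extremes of the scanned range and dips in between. The normalization depends on the whole range $2, \ldots, c_{max}$, so the index values change if $c_{max}$ changes.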
$v_{XB}$ and $v_K$ also work well in these experiments; however, they have a decreasing tendency when the number of clusters becomes very large. The punishing term added by Kweon [6] is not effective in eliminating the decreasing tendency, since this ad hoc term becomes relatively small for a large data set, according to the following relation:

$$\sum_{j=1}^{n} \sum_{i=1}^{c} u_{ij}^{2} \|x_j - v_i\| \gg \frac{1}{c} \sum_{i=1}^{c} \|v_i - \bar{v}\|. \tag{10}$$

Both $v_{SV}$ and $v_{crit}$ indicate the optimal cluster number correctly and provide two advantages compared with the other methods. First, the index functions do not decrease when $c$ becomes large. Second, these indices show a steep valley at the optimal cluster number. It is also observed that the proposed index shows a steeper valley at $c^*$ than $v_{crit}$. Moreover, the index values remain around 0 to 1 regardless of the number and structure of the data.

The second experiment shows how the proposed index works when applied to the problem of object extraction from a real image. The image is taken from the Hamburg Taxi sequence, as shown in Fig. 5 (a). The image contains two cars: one, a taxi, is moving to the left and upward, and the other is parked near the taxi. The objective is to extract each car from the background image using some of the features. We choose a segmented image as the first feature and motion vectors as the second one, as shown in Figs. 5 (b), (c). Otsu's method [1] was used for the image segmentation, and the result is a binary image which has 1 for the pixels classified as object and 0 for the background. The motion vectors are computed by the block matching algorithm with a block size of $4 \times 4$. The selected features in this experiment are expressed as

$$X = [x_1, x_2, \ldots, x_n]^T, \tag{11}$$

where $x_i = [s_i, m_{xi}, m_{yi}]$, and $s_i$, $m_{xi}$ and $m_{yi}$ are the segmentation label and the $x$- and $y$-directional motions of the $i$-th pixel, respectively. The optimal cluster numbers found by the seven validity indices for $c = 2$ to 5 are shown in Table 2. Only two indices, $v_{crit}$ and $v_{SV}$, find the correct cluster number $c^* = 3$.
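The feature matrix of Eq. (11) could be assembled along the following lines. The letter does not say how the per-block motion vectors are mapped to pixels; this sketch assumes that every pixel inherits the vector of the $4 \times 4$ block containing it. The function name `build_features`, the array layout and the toy inputs are likewise hypothetical.

```python
import numpy as np


def build_features(seg, mx_blocks, my_blocks, block=4):
    """Stack [s_i, m_xi, m_yi] for every pixel into an (n, 3) matrix.

    seg       : (H, W) binary segmentation mask (1 = object, 0 = background)
    mx_blocks : x-motion per block, shape (ceil(H/block), ceil(W/block))
    my_blocks : y-motion per block, same shape as mx_blocks
    Assumption: each pixel takes the motion vector of its enclosing block.
    """
    H, W = seg.shape
    expand = np.ones((block, block))
    mx = np.kron(mx_blocks, expand)[:H, :W]  # repeat each block value 4x4
    my = np.kron(my_blocks, expand)[:H, :W]
    return np.column_stack([seg.ravel(), mx.ravel(), my.ravel()]).astype(float)


# Toy input (hypothetical): an 8x8 image whose lower-right block is a moving object.
seg = np.zeros((8, 8), dtype=int)
seg[4:, 4:] = 1
mx_blocks = np.array([[0, 0], [0, -2]])
my_blocks = np.array([[0, 0], [0, -1]])

X = build_features(seg, mx_blocks, my_blocks)
print(X.shape)        # (64, 3)
print(X[5 * 8 + 6])   # pixel at row 5, column 6: label 1, motion (-2, -1)
```

With such features, a parked object and a moving object share the segmentation label but differ in the motion components, which is what lets a three-cluster partition separate taxi, parked car and background.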
The other indices decrease or increase monotonically for $c = 2$ to 5. Hence, the clustering algorithm based on $c = 3$ will provide the best results, such that all clusters are structurally compact and well separated. Figures 6 (a)-(c) are the clustering results, which show the objects extracted from the input image.

Fig. 5 Input image and features.

Fig. 6 Clustering results.

Table 1 Performance comparison.

Data 1
 c    v_PC   v_PE     v_FS   v_XB      v_K   v_crit   v_SV
 2    0.68   0.69    32.78   0.15   147.76    66.64   1.00
 3    0.61   0.97   -24.14   0.07    72.14    38.92   0.52
 4    0.63   1.03   -57.02   0.04    46.22    28.08   0.32
 5    0.67   0.99   -80.34   0.03    36.50    22.45   0.21
 6    0.69   1.01   -83.45   0.02    27.28    16.72   0.09
 7    0.63   1.18   -78.92   0.07    71.55    58.46   0.60
 8    0.60   1.31   -76.49   0.05    51.58    43.19   0.46
 9    0.57   1.40   -75.58   0.04    44.32    44.22   0.51
10    0.54   1.51   -70.08   0.05    54.45    74.05   1.00

Data 2
 c    v_PC   v_PE     v_FS   v_XB      v_K   v_crit   v_SV
 2    0.69   0.67    36.07   0.14   179.31    48.46   1.00
 3    0.62   0.96   -26.01   0.08   100.63    30.39   0.54
 4    0.66   0.94   -92.67   0.04    52.20    19.98   0.26
 5    0.64   1.05   -99.00   0.03    47.36    17.41   0.16
 6    0.59   1.26   -95.24   0.04    58.49    22.69   0.25
 7    0.55   1.41   -91.80   0.04    53.14    23.75   0.29
 8    0.53   1.52   -89.80   0.03    49.68    25.17   0.35
 9    0.50   1.59   -87.49   0.04    60.47    49.58   0.81
10    0.49   1.68   -85.88   0.04    56.51    55.61   1.00

Table 2 Test results of the image.

 c     v_PC     v_PE         v_FS     v_XB        v_K    v_crit     v_SV
 2   0.9639   0.0833   -1412.1284   0.0284   198.2599   13.0161   1.0000
 3   0.9821   0.0550   -1912.4026   0.0088    62.7172    7.4593   0.4959
 4   0.9955   0.0133   -2192.3681   0.0021    16.8974   11.9804   0.7985
 5   0.9997   0.0008   -2250.0552   0.0001     3.2486   12.0756   1.0000

5. Conclusion

We investigated the structural characteristics of the clusters in the partitioning process. An efficient validity index is developed by combining the two partition functions, i.e., an over-partition measure function and an under-partition measure function. The proposed index was successfully applied to two numerical data sets and a real image. It provided enhanced performance when compared with the previous studies. Most of all, $v_{SV}$ showed the steepest valley at $c = c^*$ for the three different data sets and did not decrease for large $c$. In addition, since only the structural characteristics are used instead of the partition matrix, this validity index works effectively not only for HCMA but also for FCMA.

References

[1] N. Otsu, "A threshold selection method from gray-level histogram," IEEE Trans. Syst., Man. & Cybern., vol.SMC-9, no.1, pp.62-66, 1979.
[2] J.C. Bezdek, Pattern recognition with fuzzy objective function algorithms, New York, 1981.
[3] Y. Fukayama and M. Sugeno, "A new method of choosing the number of clusters for the fuzzy c-means method," Proc. 5th Fuzzy Syst. Symp., pp.247-250, 1989.
[4] N.L. Xie and G.A. Beni, "A validity measure for fuzzy clustering," IEEE Trans. PAMI, vol.13, no.8, pp.841-847, 1991.
[5] N.R. Pal and J.C. Bezdek, "On cluster validity for the fuzzy c-means model," IEEE Trans. Fuzzy Syst., vol.3, no.3, pp.370-379, 1995.
[6] S.H. Kweon, "Cluster validity index for fuzzy clustering," Electron. Lett., vol.34, no.22, pp.2176-2177, 1999.
[7] A.O. Boudraa, "Dynamic estimation of number of clusters in data sets," Electron. Lett., vol.35, no.19, pp.1606-1607, 1999.