IJCSES International Journal of Computer Sciences and Engineering Systems, Vol. 1, No. 4, October 2007
© CSES International 2007 ISSN 0973-4406 253

An Optimized Approach on Applying Genetic Algorithm to Adaptive Cluster Validity Index

Tzu-Chieh LIN 1, Hsiang-Cheh HUANG 2, Bin-Yih LIAO 1 and Jeng-Shyang PAN 1
1 Department of Electronic Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung, 807, Taiwan
E-mail: {tclin, byliao, jspan}@bit.kuas.edu.tw
2 Department of Electrical Engineering, National University of Kaohsiung, Kaohsiung, 811, Taiwan
E-mail: hhuang@nuk.edu.tw

Abstract
The partitioning or clustering method is an important research branch in the data mining area; it divides a dataset into an arbitrary number of clusters based on the correlation attributes of the elements of the dataset. Most datasets have an original number of clusters, which is estimated with a cluster validity index, but most existing methods give erroneous estimates for many real datasets. To solve this problem, this paper applies the optimization technique of the genetic algorithm (GA) to a new adaptive cluster validity index, which is called the Gene Index (GI). The algorithm applies GA to adjust the weighting factors of the adaptive cluster validity index in order to train an optimal cluster validity index. It is tested with many real datasets, and the results show that the proposed algorithm gives higher performance and accurately estimates the original cluster number of real datasets compared with current cluster validity index methods.

Key words: Clustering, Genetic Algorithm, Cluster Validity Index, Optimization, Data Mining.

1. Introduction
Data partitioning is commonly encountered in real applications. Many schemes have been proposed in the literature to assess the performance of specific algorithms. The main concern of data partitioning is how to correctly divide the data points into clusters. Some algorithms in the literature are specifically designed for certain databases; thus, they may perform well in some cases but not always in general.
In this paper, we would like to propose a generalized scheme, integrated with optimization techniques, for better partitioning of the data. A number of indices have been proposed in the literature to assess the performance of data clustering. The main ideas are twofold: (1) data points within the same cluster should be located as close together as possible, and (2) data points in different clusters should be as far apart as possible. Based on these two concepts, a variety of cluster validity indices have been proposed. We make the necessary simulations and verify that not all the indices perform well. Therefore, we employ the genetic algorithm (GA) [1] to achieve better performance in data partitioning.

This paper is organized as follows. In Section 2 we describe the data partitioning schemes and the cluster validity indices. In Section 3 we describe the proposed algorithm, which integrates existing indices and trains with GA. Simulation results are demonstrated in Section 4. Finally, we conclude this paper in Section 5.

2. Data Partitioning Schemes and Cluster Validity Indices
In this paper, we employ the fuzzy C-means (FCM) [2] algorithm for data clustering, and then make comparisons among several indices. Using the concepts of fuzzy theory, a data point does not absolutely belong to a certain cluster; a floating-point number represents its degree of belonging to each cluster. The major drawback of FCM and similar algorithms is that the correct number of clusters cannot be known exactly in advance. Thus, cluster validity indices with several kinds of representations have been proposed to estimate the correct number of clusters. Every index has its advantages and drawbacks. We cite several commonly encountered indices, perform verifications in Sec. 2.7, and finally combine the advantages of these indices into the proposed genetic-based cluster validity index in Sec. 3.

2.1 Cluster validity index: PC index
The PC (partition coefficient) index [3] was one of the measures used in the early days, with the definition in Eq.
(1):

$$PC(U) = \frac{1}{n}\sum_{k=1}^{n}\sum_{i=1}^{c} u_{ik}^{2} \qquad (1)$$

Manuscript received July 30, 2007. Manuscript revised September 20, 2007.
where $u_{ik}$ denotes the degree of membership of the data point $x_k$ in cluster $i$, $x_k$ is the $k$-th of the $d$-dimensional measured data (we use $d = 2$ here as an example), under the conditions that $u_{ik} \in [0, 1]$ for all $i, k$, and $\sum_{i=1}^{c} u_{ik} = 1$ for all $k$. To assess the effectiveness of a clustering algorithm, the larger the PC index value, the better the performance.

2.2 Cluster validity index: PE index
The PE (partition entropy) index was also proposed in [3], with the definition in Eq. (2):

$$PE(U) = -\frac{1}{n}\sum_{k=1}^{n}\sum_{i=1}^{c} u_{ik}\log(u_{ik}) \qquad (2)$$

To assess the effectiveness of a clustering algorithm, the smaller the PE index value, the better the performance.

2.3 Cluster validity index: XB index
The XB index was proposed by Xie and Beni in [4] with the two important concepts of compactness and separation. For a good clustering result, the data points within the same cluster should be as compact as possible, while any two different clusters should be as far apart as possible. It can be formulated by Eq. (3):

$$XB(U, V; X) = \frac{\sum_{i=1}^{c}\sum_{k=1}^{n} u_{ik}^{2}\,\lVert x_k - \nu_i\rVert^{2}}{n \cdot \min_{i \neq k}\left(\lVert \nu_i - \nu_k\rVert^{2}\right)} \qquad (3)$$

where $x_k$ is the $k$-th of the $d$-dimensional measured data (and we use $d = 2$ here), and $\nu_i$ is the $d$-dimensional center of cluster $i$. In Eq. (3), the numerator implies the compactness and the denominator denotes the separation. Therefore, to assess the effectiveness of a clustering algorithm, the smaller the XB index value, the better the performance.

2.4 Cluster validity index: K index
The K index was proposed by Kwon [5] as an improvement of the XB index. From Eq. (3), we find that when $c \to n$, $XB \to 0$, which is generally incorrect for practical applications. By modifying Eq. (3), we obtain Eq. (4):

$$K(U, V; X) = \frac{\sum_{i=1}^{c}\sum_{k=1}^{n} u_{ik}^{2}\,\lVert x_k - \nu_i\rVert^{2} + \frac{1}{c}\sum_{i=1}^{c}\lVert \nu_i - \bar{\nu}\rVert^{2}}{\min_{i \neq k}\left(\lVert \nu_i - \nu_k\rVert^{2}\right)} \qquad (4)$$

where $\bar{\nu}$ denotes the geometric center of the data points. To assess the effectiveness of a clustering algorithm, the smaller the K index value, the better the performance.

2.5 Cluster validity index: B_rit index
The B_rit index was proposed in [6].
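For concreteness, the indices of Eqs. (1)-(4) can be sketched in Python. This is an illustrative sketch rather than the authors' implementation; it assumes NumPy, a $c \times n$ membership matrix U whose columns sum to 1, data X of shape (n, d), and centers V of shape (c, d). A minimal FCM [2] update step is included to produce such memberships.

```python
import numpy as np

def fcm_step(X, V, m=2.0):
    """One fuzzy C-means iteration: update memberships, then centers.
    X: (n, d) data; V: (c, d) centers; m is the usual fuzzifier (m = 2).
    Returns (U, V_new) with U of shape (c, n), columns summing to 1."""
    d2 = ((V[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)   # (c, n)
    d2 = np.maximum(d2, 1e-12)           # guard points lying exactly on a center
    inv = d2 ** (-1.0 / (m - 1.0))       # u_ik proportional to d_ik^(-2/(m-1))
    U = inv / inv.sum(axis=0, keepdims=True)
    W = U ** m                           # center update: weighted means
    return U, (W @ X) / W.sum(axis=1, keepdims=True)

def pc_index(U):
    """Partition coefficient, Eq. (1); in [1/c, 1], larger is better."""
    return float((U ** 2).sum() / U.shape[1])

def pe_index(U, eps=1e-12):
    """Partition entropy, Eq. (2); smaller is better."""
    return float(-(U * np.log(U + eps)).sum() / U.shape[1])

def _sq_dists(X, V):
    # Squared Euclidean distances, shape (c, n): entry (i, k) is ||x_k - v_i||^2.
    return ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)

def _min_center_sep(V):
    # min over i != k of ||v_i - v_k||^2, the separation denominator.
    c = len(V)
    return min(((V[i] - V[k]) ** 2).sum()
               for i in range(c) for k in range(c) if i != k)

def xb_index(U, V, X):
    """Xie-Beni index, Eq. (3): fuzzy compactness over n times separation."""
    compact = ((U ** 2) * _sq_dists(X, V)).sum()
    return float(compact / (X.shape[0] * _min_center_sep(V)))

def k_index(U, V, X):
    """Kwon's index, Eq. (4): adds a center-spread penalty so the value
    does not vanish as c approaches n; smaller is better."""
    compact = ((U ** 2) * _sq_dists(X, V)).sum()
    penalty = ((V - V.mean(axis=0)) ** 2).sum() / len(V)
    return float((compact + penalty) / _min_center_sep(V))
```

Running a few FCM steps on well-separated data drives the memberships toward crisp values, which the four indices then reward (large PC, small PE/XB/K).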
It is also composed of compactness and separation parameters in order to obtain the optimal number of clusters, and the two measures are derived independently. First, the separation between clusters is denoted by $G(c)$,

$$G(c) = \frac{\max_{i, j}\,\delta(c_i, c_j)}{\min_{i \neq j}\,\delta(c_i, c_j)} \qquad (5)$$

where $\delta(c_i, c_j)$ is a distance measure between the geometric centers of clusters $i$ and $j$, with the definition

$$\delta(c_i, c_j) = \left[(c_i - c_j)^{T} A\,(c_i - c_j)\right]^{1/2} \qquad (6)$$

and $A$ denotes a positive definite matrix of dimension $d \times d$ (or $2 \times 2$ here). For simplicity, the identity matrix $I$ is commonly used in place of the matrix $A$ in Eq. (6) to evaluate the distance measure. Next, the compactness is represented by the ratio between the variances of the data points within every cluster and the variance of the whole data set, denoted by $wt(c)$,

$$wt(c) = \frac{\sum_{q=1}^{d}\sum_{k=1}^{c} var_q(k)}{\sum_{q=1}^{d} var_{total}(q)} \qquad (7)$$

where $var_q(k)$ denotes the variance of the $q$-th feature within cluster $k$ and $var_{total}(q)$ denotes the variance of the $q$-th feature over the whole data set. From experimental results, the value of $G(c)$ is much larger than that of $wt(c)$, with the ranges $G(c) \in [10, 20]$ and $wt(c) \in [0, 0.8]$; thus we need to include a weighting factor $\alpha$ to balance the effects of both factors, and we obtain

$$B_{rit}(c) = G(c) + \alpha \cdot wt(c) \qquad (8)$$

where $\alpha = \max G(c) / \max wt(c)$ denotes the weighting factor. From the derivations above, the smaller the obtained B_rit index, the better the clustering performance.

2.6 Cluster validity index: S index
The S index was proposed in [7]. It also adopts the concepts of compactness and separation. Unlike the B_rit index of Sec. 2.5, both factors are normalized to values between 0 and 1 to balance their effects. In measuring the compactness, the mean distance within each cluster is calculated and summed over all clusters,

$$u(c, V; X) = \sum_{i=1}^{c}\frac{1}{n_i}\sum_{x \in c_i}\lVert x - \nu_i\rVert \qquad (9)$$

where $n_i$ denotes the number of data points within cluster $i$ and $\nu_i$ is the geometric center of cluster $i$. The separation measure is simply denoted by $o(c, V) = 1/d_{min}$, where $d_{min}$ denotes the minimum distance between any two cluster centers.
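The two measures above can be sketched as follows. This is an illustrative sketch, not the authors' code: it takes $A = I$ in Eq. (6), assumes a hard label vector and precomputed centers, and leaves $\alpha$ (the paper's max G over max wt) and the min-max normalization of the S index to the caller, since both are taken over the whole candidate range of $c$.

```python
import numpy as np

def brit_index(X, labels, centers, alpha):
    """Sketch of B_rit (Eqs. (5)-(8)) with A = I in Eq. (6).
    G(c): ratio of max to min pairwise center distance (separation).
    wt(c): summed within-cluster per-feature variances over the
    total per-feature variances (compactness). Smaller is better."""
    c = len(centers)
    dists = [np.linalg.norm(centers[i] - centers[j])
             for i in range(c) for j in range(c) if i != j]
    G = max(dists) / min(dists)
    within = sum(X[labels == k].var(axis=0).sum() for k in range(c))
    wt = within / X.var(axis=0).sum()
    return G + alpha * wt

def s_ingredients(X, labels, centers):
    """Raw compactness u(c, V; X) of Eq. (9) and separation o = 1/d_min.
    The S index (Eq. (12)) min-max normalizes both quantities over the
    candidate range of c and sums them."""
    c = len(centers)
    u = sum(np.linalg.norm(X[labels == i] - centers[i], axis=1).mean()
            for i in range(c))
    d_min = min(np.linalg.norm(centers[i] - centers[j])
                for i in range(c) for j in range(c) if i != j)
    return u, 1.0 / d_min
```

For two degenerate point-clusters, wt and u are exactly 0, and with only two centers G reduces to 1, which makes the behavior easy to check by hand.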
Next, Eq. (9) and the separation measure are normalized by

$$u_n(c, V; X) = \frac{u(c, V; X) - \min_c\left[u(c, V; X)\right]}{\max_c\left[u(c, V; X)\right] - \min_c\left[u(c, V; X)\right]}, \qquad (10)$$

$$o_n(c, V) = \frac{o(c, V) - \min_c\left[o(c, V)\right]}{\max_c\left[o(c, V)\right] - \min_c\left[o(c, V)\right]}. \qquad (11)$$

Finally, the S index is defined by

$$S(c, V; X) = u_n(c, V; X) + o_n(c, V). \qquad (12)$$

To assess the effectiveness of a clustering algorithm, the smaller the S index value, the better the performance.

2.7 Preliminary results with existing indices
To evaluate the effectiveness of the existing indices, we generate a two-dimensional, 2000-point, 9-cluster testing database called `My_sample', illustrated in Fig. 1. All six indices are examined, and the results are listed in Table 1. With this database, we expect the column with k = 9 to perform best, i.e., the largest PC value and the smallest values of the other five indices should be obtained there. As we can see, not all of the indices indicate that the correct clustering result is at k = 9. Moreover, the criterion for PC is to search for its maximum value, while for the remaining indices the criterion is to find their minimum values. Based on these two findings, optimization techniques can be incorporated into the clustering algorithm to search for better and more accurate results.

3. Genetic-based Cluster Validity Index
As we can see from Secs. 2.1 to 2.6, every index has its own specific concept for data clustering, and the results in Sec. 2.7 show a diversity of performances. Therefore, we employ the genetic algorithm (GA) to find an optimized result based on the concepts of the indices above. GA consists of three major steps: crossover, mutation, and selection. Based on the fitness function, we integrate our clustering scheme with the GA procedures.

Table 1: The index values for clustering from 2 to 10 clusters in six different schemes for the My_sample database. The shaded blocks represent the correct clustering results.
index  k=2    k=3    k=4    k=5    k=6    k=7    k=8    k=9    k=10
PC     0.722  0.673  0.625  0.640  0.674  0.720  0.757  0.797  0.770
PE     0.63   0.852  1.05   1.07   1.030  0.936  0.855  0.754  0.834
XB     0.277  0.0    0.88   0.224  0.095  0.00   0.068  0.042  0.628
K      555    22     377    449    9      202    39     87     300
B_rit  7.87   0.85   9.86   0.77   8.4    8.79   8.59   8.39   7.42
S      1.000  0.69   0.69   0.485  0.334  0.32   0.26   0.220  1.000

Fig. 1. The two-dimensional, 2000-point, 9-cluster database My_sample.

3.1 Preprocessing in GA
We need chromosomes to perform the three steps in GA. We employ five popularly used databases, namely auto-mpg [8], bupa [9], cmc [10], iris [11], and wine [12], listed in Table 2, for GA optimization. Half of the data set in each database is used for training, and the other half is used for testing.

3.2 Deciding the fitness function
After considering practical implementations of GA, and based on the indices described in Secs. 2.1 to 2.6, we propose the genetic-based index for data clustering. The fitness function is denoted by

$$gene(U, V; X) = \alpha \cdot \frac{\sum_{k=1}^{c} INTRA(k)}{MSD_t} + \beta \cdot \frac{\max_{i, j}\, d(\nu_i, \nu_j)}{\min_{i \neq j}\, d(\nu_i, \nu_j)}. \qquad (13)$$

The first term denotes the compactness, with

$$INTRA(k) = \frac{1}{n_k}\sum_{x \in c_k}\lVert x - \bar{x}_k\rVert^{2}, \qquad (14)$$

and

$$MSD_t = \frac{1}{n}\sum_{t=1}^{n}\lVert x_t - \bar{x}\rVert^{2}, \qquad (15)$$

where $\bar{x}_k$ and $n_k$ denote the geometric center and the number of data points of cluster $k$, and $\bar{x}$ is the geometric center of the whole data set of $n$ points. In the second term, $d(\nu_i, \nu_j)$ is the same distance as defined in Eq. (6). Also, $\alpha$ and $\beta$ are the weighting factors, which act as the output after GA training. The goal of the optimization is to minimize the fitness function; under the best condition, the fitness value reaches 0.

3.3 Procedures in GA training
The GA procedures for the optimized cluster validity index are described as follows.

Step 1. Producing the chromosomes: 40 chromosomes are produced. Each chromosome denotes the weighting factors in the fitness function, i.e., $(\alpha_i, \beta_i)$, $1 \le i \le 40$. Because the fitness function is composed of two opposing conditions, only the ratio between the two weights matters; we set $0 \le \alpha_i, \beta_i \le 1$.
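To make Eq. (13) concrete, the following is an illustrative Python sketch of the GI fitness for one candidate partition (not the authors' code). It reads Eq. (14) as the mean squared distance of cluster k's points to their center, Eq. (15) as the mean squared distance of all points to the global center, and takes $A = I$ so that $d(\cdot,\cdot)$ in Eq. (6) is the Euclidean distance; `labels` is an assumed hard assignment of points to clusters.

```python
import numpy as np

def gene_index(X, labels, centers, alpha, beta):
    """Sketch of the GI fitness, Eqs. (13)-(15); smaller is better.
    alpha and beta are the weighting factors output by GA training."""
    c = len(centers)
    # Eq. (14) summed over clusters: within-cluster mean squared distance.
    intra = sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2)
                .sum(axis=1).mean() for k in range(c))
    # Eq. (15): mean squared distance of the whole set to its center.
    msd_t = ((X - X.mean(axis=0)) ** 2).sum(axis=1).mean()
    # Second term of Eq. (13): max/min ratio of pairwise center distances.
    dists = [np.linalg.norm(centers[i] - centers[j])
             for i in range(c) for j in range(c) if i != j]
    return alpha * intra / msd_t + beta * max(dists) / min(dists)
```

With two degenerate point-clusters the compactness term vanishes, and with only two centers the separation ratio is exactly 1, so the two terms can be checked independently by setting one weight to zero.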
Table 2: The five databases used in this paper.
Training database   # of data points   Testing database   # of data points
auto-mpg_train      196                auto-mpg_test      196
bupa_train          173                bupa_test          172
cmc_train           737                cmc_test           736
iris_train          75                 iris_test          75
wine_train          89                 wine_test          89

Fitness values are calculated from the training databases in Table 2. At the beginning of the first iteration, the chromosome values are randomly set. During training, the chromosome values are modified based on the output of the previous iteration.

Step 2. Selecting the better chromosomes: All 40 chromosomes are substituted into the fitness function and the corresponding fitness scores are calculated. The 20 chromosomes with the smaller fitness values are kept for use in the next iteration, and the other 20 are discarded. The 20 new chromosomes for the next iteration are produced by crossover and mutation from the 20 remaining chromosomes.

Step 3. Crossover of chromosomes: From the 20 remaining chromosomes, we randomly choose 10 of them and gather them into 5 pairs to perform the crossover operation. By swapping the $\alpha$ or $\beta$ values of every pair, 10 new chromosomes are produced.

Step 4. Mutation of chromosomes: The 10 chromosomes not chosen in Step 3 are used in this step. The $\alpha$ values of the first five chromosomes are replaced by randomly generated new $\alpha$ values; a similar operation is performed on the $\beta$ values of the other five.

Step 5. The stopping condition: Once the pre-determined number of iterations is reached, or when the fitness value equals 0, the training is stopped, and the weighting factors corresponding to the smallest fitness score in the final iteration, $(\alpha, \beta)$, are the output.

4. Simulation Results
After training the GA optimization of Sec. 3.3 for 1000 iterations, we obtain the optimized weighting factors $(\alpha, \beta) = (0.856, 0.0826)$. With the two values, we can compare the GA-optimized result with those of Secs. 2.1
to 2.6 by verifying the five test databases in Table 2. We depict the detailed results for the auto-mpg database in Table 3, the bupa database in Table 4, the cmc database in Table 5, the iris database in Table 6, and the wine database in Table 7, respectively.

Table 3: Index values from 2 to 10 clusters in seven different schemes. Shaded blocks show the correct results for the auto-mpg database.
index  k=2    k=3    k=4    k=5    k=6    k=7    k=8    k=9    k=10
PC     0.866  0.80   0.787  0.764  0.726  0.76   0.73   0.704  0.700
PE     0.330  0.522  0.596  0.679  0.804  0.847  0.869  0.95   0.944
XB     0.056  0.073  0.083  0.067  0.45   0.2    0.2    0.04   0.23
K      1.3    5.30   8.2    5.74   35.53  33.5   36.00  32.70  4.44
B_rit  3.57   8.20   6.94   6.47   9.2    9.89   11.1   11.2   3.24
S      1.000  0.633  0.466  0.45   0.548  0.592  0.705  0.77   1.000
GI     0.523  0.487  0.52   0.536  0.780  0.854  0.960  0.974  1.48

Table 4: Index values from 2 to 10 clusters in seven different schemes. Shaded blocks show the correct results for the bupa database.
index  k=2    k=3    k=4    k=5    k=6    k=7    k=8    k=9    k=10
PC     0.882  0.664  0.562  0.476  0.4    0.383  0.346  0.328  0.295
PE     0.304  0.809  1.36   1.435  1.676  1.826  2.0    2.3    2.309
XB     0.065  0.5    0.587  0.623  1.480  1.307  1.073  1.395  1.407
K      1.64   94.6   0.2    8.5    284.2  256.1  22.8   27.9   282.5
B_rit  59.03  46.69  45.75  47.62  67.55  83.79  49.4   56.06  63.48
S      1.000  0.78   0.67   0.555  0.702  0.786  0.699  0.882  1.000
GI     1.088  1.225  1.286  1.328  1.754  1.787  1.690  1.903  1.98

Numerical values in Table 3 depict the results for the auto-mpg database, which has three clusters. We can see that only the proposed GA-based index gives the correct result. For the bupa, cmc, iris, and wine databases, similar results are obtained, and detailed comparisons can be found in Table 4 to Table 7, respectively. In addition, from Table 8, we see that the proposed GI results in the correct cluster number for four of the five test databases. Compared with the other six indices, each of which finds only one correct cluster number, our scheme achieves better performance. Regarding the cmc database, none of the seven indices obtains the correct cluster number.
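The training procedure of Steps 1 to 5 in Sec. 3.3 can be sketched as follows. This is an illustrative sketch rather than the authors' code: `fitness` stands for the summed Eq. (13) values over the training sets of Table 2, and the population sizes (40 chromosomes, 20 kept, 10 crossed over, 10 mutated) follow the paper.

```python
import numpy as np

def train_weights(fitness, iters=1000, seed=0):
    """GA of Sec. 3.3: evolve (alpha, beta) in [0, 1]^2 to minimize
    `fitness(alpha, beta)`; returns the best pair found."""
    pop_size = 40
    rng = np.random.default_rng(seed)
    pop = rng.random((pop_size, 2))                    # Step 1: random (alpha_i, beta_i)
    for _ in range(iters):
        scores = np.array([fitness(a, b) for a, b in pop])
        keep = pop[np.argsort(scores)[:pop_size // 2]]  # Step 2: keep the best 20
        picked = rng.permutation(pop_size // 2)
        cross_idx, mut_idx = picked[:10], picked[10:20]
        children = []
        for i in range(0, 10, 2):                      # Step 3: 5 pairs swap alpha/beta
            p, q = keep[cross_idx[i]], keep[cross_idx[i + 1]]
            children += [[p[0], q[1]], [q[0], p[1]]]
        for j, idx in enumerate(mut_idx):              # Step 4: random reset of alpha
            child = keep[idx].copy()                   # (first five) or beta (last five)
            child[0 if j < 5 else 1] = rng.random()
            children.append(child)
        pop = np.vstack([keep, np.array(children)])
        if scores.min() == 0:                          # Step 5: early stop at fitness 0
            break
    scores = np.array([fitness(a, b) for a, b in pop])
    return tuple(pop[np.argmin(scores)])
```

Because the 20 best chromosomes always survive into the next iteration, the best fitness found is non-increasing over the training run.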
5. Conclusion
In this paper, we discussed data clustering schemes and proposed a new cluster validity index based on GA. The GI index outperforms the six existing indices in the literature. However, the clustering results for some databases are still incorrect even after GA training, and this motivates our future research.

Acknowledgments
This work is partially supported by the National Science Council (Taiwan) under grant NSC95-228-E-005-034.
Table 5: Index values from 2 to 10 clusters in seven different schemes. Shaded blocks show the correct results for the cmc database.
index  k=2    k=3    k=4    k=5    k=6    k=7    k=8    k=9    k=10
PC     0.809  0.704  0.597  0.528  0.474  0.423  0.378  0.342  0.32
PE     0.459  0.773  1.089  1.323  1.523  1.723  1.905  2.066  2.89
XB     0.096  0.25   0.97   0.222  0.23   0.296  0.388  0.604  0.539
K      70.86  92.96  46.9   65.7   73.6   223.1  293.0  458.3  40.4
B_rit  8.57   3.26   1.96   3.35   3.37   7.26   6.77   9.7    22.55
S      1.000  0.580  0.452  0.428  0.440  0.54   0.664  0.935  1.000
GI     0.67   0.595  0.660  0.72   0.77   0.866  0.990  1.24   1.206

Table 6: Index values from 2 to 10 clusters in seven different schemes. Shaded blocks show the correct results for the iris database.
index  k=2    k=3    k=4    k=5    k=6    k=7    k=8    k=9    k=10
PC     0.888  0.790  0.738  0.678  0.60   0.584  0.562  0.538  0.535
PE     0.290  0.559  0.736  0.933  1.08   1.26   1.337  1.435  1.486
XB     0.058  0.5    0.60   0.265  0.36   0.549  0.239  0.227  0.289
K      4.622  9.920  4.72   25.25  33.2   6.43   26.73  28.05  36.45
B_rit  8.46   2.3    0.40   0.77   7.03   2.2    6.08   6.80   6.53
S      1.000  0.724  0.598  0.628  0.695  0.907  0.700  0.832  1.000
GI     0.442  0.50   0.602  0.755  0.887  1.47   0.88   0.953  1.09

Table 7: Index values from 2 to 10 clusters in seven different schemes. Shaded blocks show the correct results for the wine database.
index  k=2    k=3    k=4    k=5    k=6    k=7    k=8    k=9    k=10
PC     0.868  0.783  0.772  0.746  0.75   0.784  0.786  0.760  0.738
PE     0.328  0.572  0.636  0.720  0.738  0.663  0.677  0.764  0.830
XB     0.067  0.4    0.0    0.08   0.23   0.07   0.097  0.209  0.26
K      6.264  3.8    1.28   1.00   8.96   4.83   22.75  50.47  67.97
B_rit  22.85  4.7    1.64   9.69   1.0    0.06   2.82   9.8    22.89
S      1.000  0.672  0.569  0.43   0.406  0.357  0.46   0.772  1.000
GI     0.570  0.566  0.605  0.64   0.84   0.828  1.06   1.594  1.896

Table 8: Comparisons of the seven indices for the five test databases. Our scheme performs the best.
Database  Original clusters  PC  PE  XB  K  B_rit  S  GA
auto-mpg  3                  2   2   2   2  5      5  3
bupa      2                  2   2   2   2  5      5  2
cmc       3                  2   2   2   2  4      4  2
iris      3                  2   2   2   2  4      5  3
wine      3                  2   2   2   2  5      7  3

References
[1] D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Boston, MA: Kluwer, 1989.
[2] J.C. Bezdek, R. Ehrlich, and W. Full, "FCM: The Fuzzy C-Means clustering algorithm", Computers and Geosciences, Vol. 10, No. 2-3, 1984, pp. 191-203.
[3] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, New York, NY: Plenum, 1981.
[4] X.L. Xie and G. Beni, "A validity measure for fuzzy clustering", IEEE Trans. Pattern Anal. Machine Intell., Vol. 13, No. 8, 1991, pp. 841-847.
[5] S.H. Kwon, "Cluster validity index for fuzzy clustering", Electronics Letters, Vol. 34, No. 22, 1998, pp. 2176-2177.
[6] A.-O. Boudraa, "Dynamic estimation of number of clusters in data sets", Electronics Letters, Vol. 35, No. 19, 1999, pp. 1606-1608.
[7] D.-J. Kim, Y.-W. Park, and D.-J. Park, "A novel validity index for determination of the optimal number of clusters", IEICE Trans. Inf. & Syst., Vol. E84-D, No. 2, 2001, pp. 281-285.
[8] R. Quinlan, "Auto-mpg data", ftp://ftp.ics.uci.edu/pub/machine-learning-databases/auto-mpg/, 1993.
[9] BUPA Medical Research Ltd., "BUPA liver disorders", ftp://ftp.ics.uci.edu/pub/machine-learning-databases/liver-disorders/, 1990.
[10] T.S. Lim, "Contraceptive method choice", ftp://ftp.ics.uci.edu/pub/machine-learning-databases/cmc/, 1999.
[11] R.A. Fisher, "Iris plants database", ftp://ftp.ics.uci.edu/pub/machine-learning-databases/iris/, 1988.
[12] S. Aeberhard, "Wine recognition data", ftp://ftp.ics.uci.edu/pub/machine-learning-databases/wine/, 1998.