Optimal Fuzzy Clustering in Overlapping Clusters

The International Arab Journal of Information Technology, Vol. 5, No. 4, October 2008

Ouafa Ammor 1, Abdelmonaime Lachkar 2, Khadija Slaoui 3, and Noureddine Rais 1
1 Department of Mathematics, Faculty of Sciences and Technology of Fes, Morocco
2 ESTM, Moulay Ismail University, Morocco
3 Department of Physics, Faculty of Sciences Dhar Mehraz of Fes, Morocco

Abstract: The fuzzy c-means clustering algorithm has been widely used to obtain fuzzy k-partitions. This algorithm requires the user to supply the number of clusters k. To find the right number of clusters, k, automatically for a given data set, many validity indexes have been proposed in the literature. Most of these indexes do not work well for clusters with different degrees of overlap: they tend to fail to select the correct number of clusters when the data set contains overlapping clusters. To overcome this limitation, we propose in this paper a new and efficient cluster validity measure for determining the optimal number of clusters, which works both with and without overlap. This measure is based on the maximum entropy principle. Our approach does not require any parameter adjustment; it is therefore completely automatic. Many simulated and real examples are presented, showing the superiority of our measure to the existing ones.

Keywords: Unsupervised clustering, cluster validity index, optimal clusters number, overlapping clusters, maximum entropy principle.

Received November 30, 2006; accepted June, 2007

1. Introduction
Cluster analysis plays an important role in solving many problems in pattern recognition, image processing, colour image segmentation, machine learning, data mining, and in fields such as medicine, biology, technology and marketing. The aim of cluster validity is to find the partitioning that best fits the underlying data. A wide variety of clustering algorithms have been proposed for different applications, and a good overview can be found in the literature [9, 14, 16, 17].
Since there are no predefined classes, it is difficult to find an appropriate metric for measuring whether the cluster configuration found is acceptable. The results of different clustering algorithms can differ greatly on the same data set, and the input parameters of an algorithm can strongly modify its behaviour and execution. Usually, 2D data sets with well-separated clusters are used for evaluating clustering algorithms, as the reader can easily verify the result. In the case of high-dimensional data, however, visualization and visual validation are not trivial tasks, so formal methods are needed.

The process of evaluating the results of a clustering algorithm is called cluster validity assessment. There are three different techniques for this: external criteria, internal criteria and relative criteria. Both internal and external criteria are based on statistical methods and are computationally demanding. A review of clustering validity indexes based on external and internal criteria can be found in [6]. Also, as mentioned in [7], the validity assessment approaches based on relative criteria work well in non-overlapping cases. If the data set contains overlapping clusters, the majority of the existing validity indexes fail to detect the right number of clusters.

In this paper, we propose a new and efficient measure to determine the optimal number of clusters, based on the Maximum Entropy Principle (MEP), which not only handles highly overlapped cases efficiently but is also completely automatic, requiring no parameter determination. We also show in some examples that it works well with non-Gaussian mixture models.

The rest of the paper is organized as follows. In Section 2, we briefly review some validity measures related to our work and present some of their shortcomings. In Section 3, the proposed measure based on MEP is presented together with the corresponding algorithm. Section 4 presents many examples using artificial and real data sets to demonstrate the effectiveness of the proposed measure.
Finally, Section 5 concludes the paper.

2. Related Work
The Fuzzy C-Means (FCM) clustering algorithm has been widely used to obtain the fuzzy c-partition. This algorithm requires the user to predefine the number of clusters c; however, it is not always possible to know the number of clusters in advance. Different fuzzy partitions are obtained at different values of c. Thus, an evaluation methodology is required to validate each of the fuzzy c-partitions and to obtain an optimal partition (or optimal number of clusters c). This quantitative evaluation is the subject of cluster validity. The mathematical formula used to compute the validation is referred to as a cluster validity index.

Many cluster validity indexes for fuzzy clustering have been proposed in the literature in order to find an optimal number of clusters. Bezdek [3] proposed two cluster validity indexes for fuzzy clustering: the Partition Coefficient (V_PC) and the Partition Entropy (V_PE). Both V_PC and V_PE are sensitive to noise and to the weighting exponent m. Other indexes that take into account the geometric properties of the input data, V_FS and V_XB, were proposed respectively by Fukuyama and Sugeno [5] and Xie and Beni [15]. The V_FS index is sensitive to both high and low values of the exponent m. V_XB provides a good response over a wide range of choices, both for c = 2 to 10 and for 1 < m ≤ 7. However, V_XB decreases monotonically as the number of clusters c becomes very large and close to the number of data points n. Kwon [12] extended the Xie-Beni index V_XB to eliminate its monotonically decreasing tendency. To achieve this, a punishing function was introduced into the numerator of the original Xie and Beni validity index.

Halkidi and Vazirgiannis [7] defined a new validity index, V_S_Dbw, which also exploits the compactness and separation properties of the data set. The compactness is measured by the cluster variance, whereas the separation is measured by the density between clusters. As mentioned in [8], the index V_S_Dbw is optimized for data sets that include compact and well-separated clusters, that is, for non-overlapping cases. Kim et al. [10] attempted to determine the optimal number of clusters by measuring the status of the given partition with both an under-partition function and an over-partition function. The proposed index V_SV is the sum of the two functions. V_SV provides enhanced performance when compared with the previous studies.
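The two indexes of [3] are named above without their formulas. As a minimal sketch, assuming the standard textbook definitions (which are not reproduced in this paper), V_PC is the mean of the squared memberships and V_PE is the mean membership entropy. The matrix `u` (one row per point, one column per cluster) and the function names are illustrative only.

```python
import math


def partition_coefficient(u):
    """V_PC = (1/n) * sum_i sum_j u[i][j]**2 for an n-by-c membership matrix."""
    n = len(u)
    return sum(m * m for row in u for m in row) / n


def partition_entropy(u):
    """V_PE = -(1/n) * sum_i sum_j u[i][j] * ln(u[i][j]), taking 0 * ln(0) as 0."""
    n = len(u)
    return -sum(m * math.log(m) for row in u for m in row if m > 0) / n


# Two hypothetical partitions of three points into two clusters.
crisp = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]   # every point fully in one cluster
fuzzy = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]   # every point shared equally

print(partition_coefficient(crisp), partition_entropy(crisp))
print(partition_coefficient(fuzzy), partition_entropy(fuzzy))
```

A crisp partition gives the largest V_PC and the smallest V_PE, while a completely ambiguous one gives the opposite, which is why V_PC is maximized and V_PE minimized when these indexes are used to choose c.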
More recently, a new validity index V_OS was proposed [11]. V_OS exploits an overlap measure and a separation measure between clusters, and is defined as the ratio of the overlap degree to the separation. The overlap measure, which indicates the degree of overlap between fuzzy clusters, is obtained by computing an inter-cluster overlap. The separation measure, which indicates the isolation distance between fuzzy clusters, is obtained by computing a distance between fuzzy clusters. As mentioned in [11], the index V_OS is more reliable than other indexes. Unfortunately, in the tests on the IRIS data, which contain genuinely overlapping clusters, the authors observed that V_OS does not discriminate between the two overlapping clusters.

3. The Validity Index
For a given data set, we obtain, after some clustering process, a partition into k clusters $c_1, c_2, \ldots, c_k$. Now define $P_{ij}$ as a measure of the link between any point $i$ and the cluster $c_j$, for $j = 1, \ldots, k$. As the membership of every point in those clusters is known, we can set $P_{ij} = 0$ for $i \notin c_j$ and, for $i \in c_j$, the values $P_{ij} > 0$ are normalized by:

$$\sum_{i \in c_j} P_{ij} = 1, \quad \text{for } j = 1, \ldots, k \qquad (1)$$

For all the clusters, we have:

$$\sum_{j=1}^{k} \sum_{i \in c_j} P_{ij} = k \qquad (2)$$

$$\sum_{j=1}^{k} \sum_{i \in c_j} \frac{P_{ij}}{k} = 1 \qquad (3)$$

The entropy of all the clusters is defined by:

$$S = -\sum_{j=1}^{k} \sum_{i \in c_j} \frac{P_{ij}}{k} \ln\!\left(\frac{P_{ij}}{k}\right) \qquad (4)$$

Using the normalization (1), S is given by:

$$S = -\frac{1}{k} \sum_{j=1}^{k} \sum_{i \in c_j} P_{ij} \ln(P_{ij}) + \ln(k) \qquad (5)$$

$$S = \frac{1}{k} \sum_{j=1}^{k} S_j + \ln(k) \qquad (6)$$

where

$$S_j = -\sum_{i \in c_j} P_{ij} \ln(P_{ij}) \qquad (7)$$

$S_j$ is the entropy corresponding to the cluster $c_j$. This entropy is maximal when all the data points of each cluster have the same association with their cluster centre. Therefore, the optimal number of clusters is the number k whose value of entropy is maximal.

In addition to maximizing the above entropy, we use a second constraint, which is to be minimized. In this second constraint, for each cluster, the data points nearest to the cluster centre are privileged. The proposed constraint is given by the following formula:

$$W = \frac{1}{k} \sum_{j=1}^{k} \sum_{i \in c_j} P_{ij} \, \|x_i - g_j\|^2 \qquad (8)$$

where $\|\cdot\|$ is the Euclidean distance, $x_i$ represents the point $i$, and $g_j$ is the centre of cluster $c_j$. We are trying to reach the highest possible concentration around or near each cluster centre.
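The step from the entropy of equation 4 to the per-cluster form of equations 6 and 7 can be checked numerically. The following is a minimal sketch, assuming each cluster is given simply as a list of its positive weights $P_{ij}$ summing to 1; the sample weights and the function names are hypothetical.

```python
import math


def cluster_entropy(p_j):
    """S_j = -sum_i P_ij * ln(P_ij) over the points of one cluster (equation 7)."""
    return -sum(p * math.log(p) for p in p_j if p > 0)


def total_entropy_direct(p):
    """Equation 4: entropy of the values P_ij / k taken over all clusters."""
    k = len(p)
    return -sum((x / k) * math.log(x / k) for p_j in p for x in p_j if x > 0)


def total_entropy_decomposed(p):
    """Equation 6: (1/k) * sum_j S_j + ln(k)."""
    k = len(p)
    return sum(cluster_entropy(p_j) for p_j in p) / k + math.log(k)


# Hypothetical weights: one list per cluster, each summing to 1 (equation 1).
uneven = [[0.7, 0.2, 0.1], [0.5, 0.5]]
uniform = [[1 / 3, 1 / 3, 1 / 3], [0.5, 0.5]]

print(total_entropy_direct(uneven), total_entropy_decomposed(uneven))
print(total_entropy_decomposed(uniform))
```

The two forms agree for the same weights, and the uniform weights give the larger entropy, which illustrates the remark that the entropy is maximal when all points of a cluster have the same association with its centre.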

To satisfy the above two constraints, that is, to maximize S while minimizing W, is equivalent to minimizing the following expression:

$$T = W - S \qquad (9)$$

$$T = \frac{1}{k} \sum_{j=1}^{k} \sum_{i \in c_j} P_{ij} \, \|x_i - g_j\|^2 + \frac{1}{k} \sum_{j=1}^{k} \sum_{i \in c_j} P_{ij} \ln(P_{ij}) - \ln(k) \qquad (10)$$

under the constraints:

$$\sum_{i \in c_j} P_{ij} = 1, \quad \text{for } j = 1, \ldots, k$$

The Lagrangian for optimizing the formula given in equation 10 under these constraints is:

$$L = \frac{1}{k} \left[ \sum_{j=1}^{k} \sum_{i \in c_j} P_{ij} \, \|x_i - g_j\|^2 + \sum_{j=1}^{k} \sum_{i \in c_j} P_{ij} \ln(P_{ij}) + \sum_{j=1}^{k} \alpha_j \Big( \sum_{i \in c_j} P_{ij} - 1 \Big) \right] - \ln(k) \qquad (11)$$

where $\alpha_j$ is the Lagrange multiplier associated with the j-th constraint. We then set the derivative of L with respect to $P_{ij}$ to zero:

$$\|x_i - g_j\|^2 + \ln(P_{ij}) + 1 + \alpha_j = 0 \qquad (12)$$

We can then give the expression of $P_{ij}$, for $i = 1, \ldots, N$ and $j = 1, \ldots, k$:

$$P_{ij} = Z_j \exp\!\left[-\|x_i - g_j\|^2\right] \qquad (13)$$

where $Z_j$ is a normalization coefficient given by $Z_j = \exp(-(1 + \alpha_j))$. By substituting the expression of $P_{ij}$ given by (13) into the corresponding constraint, we obtain the expression of $Z_j$:

$$Z_j = \frac{1}{\sum_{i \in c_j} \exp\!\left[-\|x_i - g_j\|^2\right]} \qquad (14)$$

The coefficients $P_{ij}$ can then be computed by:

$$P_{ij} = \frac{\exp\!\left[-\|x_i - g_j\|^2\right]}{\sum_{l \in c_j} \exp\!\left[-\|x_l - g_j\|^2\right]} \qquad (15)$$

We now define our proposed index V_MEP as the whole entropy:

$$V_{MEP} = S = \frac{1}{k} \sum_{j=1}^{k} S_j + \ln(k) \qquad (16)$$

where $S_j$ is defined by equation 7, using the $P_{ij}$ defined in equation 15. The optimal number of clusters is then the number k whose value of V_MEP is maximal.

The proposed new algorithm using the new index V_MEP

We propose in this section our new general algorithm based on the novel index V_MEP. The steps of the algorithm are:

A. Fix the maximal number of classes K_max
B. K ← K_max
C. Do while K ≥ 2
   C.1. Apply a clustering algorithm (e.g., Fuzzy C-means or k-means) to define the classes c_1, ..., c_K and determine the centres g_j, for j = 1, ..., K
   C.2. Compute the probabilities P_ij with formula 15
   C.3. Compute the entropy S_K with formulas 7 and 16
   C.4. K ← K - 1
   End
D. V_MEP = max S_K, for K = 2, ..., K_max (the correct number of clusters is then the K for which the maximum is attained)

4. Experimental Results
To test the performance of the proposed validity index V_MEP, we use it to determine the optimal number of clusters in some synthetic data sets and also in a well-known real data set.
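Before turning to the experiments, steps C.2, C.3 and D of the algorithm in Section 3 can be sketched as follows. This is a minimal sketch rather than the authors' implementation: it assumes the squared Euclidean distance in formula 15, and it assumes the clustering step C.1 has already been run for each candidate K by whatever algorithm the user prefers, so that each partition is supplied as hard labels plus centres. The names `v_mep`, `select_k` and the toy data are illustrative.

```python
import math


def sq_dist(x, g):
    """Squared Euclidean distance between a point and a centre."""
    return sum((a - b) ** 2 for a, b in zip(x, g))


def v_mep(points, labels, centres):
    """V_MEP = (1/k) * sum_j S_j + ln(k), with P_ij computed as in formula 15."""
    k = len(centres)
    total = 0.0
    for j, g in enumerate(centres):
        d = [sq_dist(x, g) for x, lab in zip(points, labels) if lab == j]
        if not d:
            continue  # an empty cluster contributes S_j = 0
        d_min = min(d)
        # Shifting by d_min avoids underflow; the shift cancels in the ratio.
        w = [math.exp(-(v - d_min)) for v in d]
        z = sum(w)
        total += -sum((wi / z) * math.log(wi / z) for wi in w if wi > 0)
    return total / k + math.log(k)


def select_k(points, partitions):
    """partitions maps K -> (labels, centres); returns the K maximising V_MEP."""
    scores = {k: v_mep(points, labs, cents) for k, (labs, cents) in partitions.items()}
    return max(scores, key=scores.get), scores


# Toy data: two groups of four points, each group equidistant from its centre.
points = [(1, 0), (-1, 0), (0, 1), (0, -1), (11, 0), (9, 0), (10, 1), (10, -1)]
partitions = {
    1: ([0] * 8, [(5, 0)]),
    2: ([0, 0, 0, 0, 1, 1, 1, 1], [(0, 0), (10, 0)]),
}
best_k, scores = select_k(points, partitions)
print(best_k, scores)
```

With the two-cluster partition every point has the same weight within its cluster, so the entropy reaches its upper bound ln(8), whereas the single-cluster partition concentrates the weight on the two points nearest the centre and scores lower.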
In earlier publications, V_SV, proposed in [10], was compared with the validity indexes V_PC, V_PE, V_FS, V_XB, V_K and V_S_Dbw, and provided enhanced performance. In previous work of one of the authors [13], it was also implemented and used to find the optimal number of clusters with a Gaussian Mixture Model (GMM) and the EM algorithm as the clustering process; this scheme was successfully applied to extract the design regions in colour textile images. Therefore, in this investigation, we only compare our proposed index V_MEP to V_SV.

First, we review some applications of V_MEP to GMM. We generate sixteen artificial data sets. The first one, Data Set 1, is like the well-known four Polonaise balls [4]; Figure 1 shows the scatter plot of this data set. It has 4 compact and well-separated clusters aligned on the diagonal. Each cluster was generated using a normal distribution; the parameters used for generating this data set are given in Table 1.

Table 1. Parameters used for generating Data Set 1.

Cluster number   Number of points   Mean vector   Covariance matrix
Cluster 1        ?000               (-4; -4)      (? 0; 0 ?)
Cluster 2        ?000               (0; 0)        (? 0; 0 ?)
Cluster 3        ?000               (4; 4)        (? 0; 0 ?)
Cluster 4        ?000               (8; 8)        (? 0; 0 ?)

The other fifteen data sets, Data Set 2 to Data Set 16, are derived from the first one by producing two overlapping clusters with different degrees of overlap. We move the centre of cluster 2, initially at (0; 0) as shown in Table 1, to the following series of centres: (1; 1), (1.5; 1.5), (1.6; 1.6), (1.7; 1.7), (1.8; 1.8), (2; 2), (2.5; 2.5), (2.9; 2.9), (3; 3), (3.25; 3.25), (3.5; 3.5), (3.6; 3.6), (3.7; 3.7), (3.9; 3.9), and finally (4; 4), which is the centre of cluster 3, as shown in Table 1. We apply V_SV and V_MEP to these data sets to see whether V_MEP can outperform V_SV and, if so, by how much and up to what limit.

The cluster validation results using V_SV and V_MEP are shown in Figure 1. For Data Set 1, which has well-separated clusters, both V_SV and V_MEP correctly select 4 as the optimal number of clusters. For Data Sets 2 to 4, which have two overlapping clusters with a low degree of overlap, both V_SV and V_MEP again correctly select 4. For Data Set 5, V_SV selects 3, which is a failure. As the degree of overlap increases in Data Sets 6 and 7, V_SV also fails: it selects 3, which is not the correct number of clusters. V_MEP, by contrast, correctly selects 4 clusters for all these data sets (5, 6, and 7).

Figure 1. Results of clusters validation using V_SV (minimal value) and V_MEP (maximal value), displayed for Data Sets 1 to 7.

From the above results, we conclude that V_SV works correctly only in the presence of a low degree of overlap, and fails when dealing with a relatively high degree of overlap. We therefore stop applying V_SV to data sets with a higher degree of overlap, namely Data Sets 8 to 16, and continue with V_MEP only. The results of applying V_MEP to Data Sets 8 to 13 are presented in Figure 2.
V_MEP still works well; it correctly selects 4 as the optimal number of clusters. In Data Sets 14 to 16, the centres of the moved cluster 2 are respectively (3.7; 3.7), (3.9; 3.9), and (4; 4). These centres are very close to that of the fixed cluster 3, whose centre is (4; 4). This yields a very high degree of overlap. In this case, we can see in Figure 3 that the two overlapping clusters form approximately one cluster. V_MEP cannot select 4 as the optimal number of clusters; it selects 3 clusters, which can be considered the expected result.
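The construction of the sixteen synthetic data sets can be sketched as below. This is only an illustration under stated assumptions: the number of points per cluster (`n`) and the isotropic standard deviation (`sigma`) are placeholder values, since the corresponding entries of Table 1 are not legible here, and the list of moved centres and the centre (4; 4) of cluster 3 follow the series as read from the text. The function names are hypothetical.

```python
import random


def gaussian_cluster(rng, centre, n, sigma):
    """n two-dimensional points drawn from an isotropic normal distribution."""
    return [(rng.gauss(centre[0], sigma), rng.gauss(centre[1], sigma))
            for _ in range(n)]


def make_data_set(moved_centre, n=1000, sigma=1.0, seed=0):
    """Four clusters on the diagonal; cluster 2 is placed at moved_centre.

    n and sigma are assumed values, not taken from Table 1.
    """
    rng = random.Random(seed)
    centres = [(-4.0, -4.0), moved_centre, (4.0, 4.0), (8.0, 8.0)]
    points, labels = [], []
    for j, c in enumerate(centres):
        points.extend(gaussian_cluster(rng, c, n, sigma))
        labels.extend([j] * n)
    return points, labels


# Centre of cluster 2 for Data Sets 1 to 16, in the order given in the text.
MOVED = [(0, 0), (1, 1), (1.5, 1.5), (1.6, 1.6), (1.7, 1.7), (1.8, 1.8),
         (2, 2), (2.5, 2.5), (2.9, 2.9), (3, 3), (3.25, 3.25), (3.5, 3.5),
         (3.6, 3.6), (3.7, 3.7), (3.9, 3.9), (4, 4)]

data_sets = [make_data_set(c) for c in MOVED]
print(len(data_sets), len(data_sets[0][0]))
```

Each generated data set can then be clustered for every candidate number of clusters and scored with the validity index; as the moved centre approaches (4; 4), clusters 2 and 3 become harder to tell apart.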

Figure 2. Results of clusters validation using the proposed V_MEP, displayed for Data Sets 8 to 13.

We thus see that V_MEP clearly outperforms V_SV: it still works well and selects the correct optimal number of clusters for all data sets up to Data Set 13, as shown in Figures 2 and 3. For Data Sets 14 to 16, as shown in Figure 3, V_MEP cannot select 4 as the optimal number of clusters, because clusters 2 and 3 overlap to such an extent that they may be regarded as one cluster. We conclude that V_SV works correctly only in the presence of a low degree of overlap and fails when dealing with a relatively high degree of overlap, while V_MEP still works well and correctly selects 4 as the optimal number of clusters even with a very high degree of overlap. V_MEP therefore clearly outperforms V_SV for GMM, as verified in our earlier work [1].

Figure 3. Results of clusters validation using the proposed V_MEP, displayed for Data Sets 14 to 16.

The performance of V_MEP is also examined using the well-known real Iris data set [2]. Results are shown in Figure 4. Both indexes, V_MEP and V_SV, correctly select 3 as the optimal number of clusters. Here, V_SV works well because of the low degree of overlap.

Figure 4. Results of clusters validation using the index V_SV (minimal value) and the proposed V_MEP (maximal value), applied to the Iris Data Set.

Now, what about non-GMM data? Figure 5 shows the results when V_MEP is applied to other forms, such as banana forms. In the present work, we generate 4 banana forms named respectively Banana set1, Banana set2, Banana set3 and Banana set4. In all of them, V_MEP detects the correct number of clusters.

Banana set1 describes two banana forms enclosed in one circle, which is wrapped by one banana form. Applying V_MEP to Banana set1 selects 4 clusters, which is the correct number of clusters for this set. For Banana set2, we keep the same two banana forms, now enclosed in two symmetric banana forms with the same centre but different radii. In this case V_MEP also selects 4 clusters, which is the correct number for Banana set2. Banana set3 shows two symmetric banana forms with the same centre and the same radius, inside which we keep the same two banana forms used in Banana set1 and Banana set2. V_MEP again works well and selects 3 clusters, which is the logical and correct number of clusters. Finally, we test our new index on a combination of different forms with overlap. The result of this application, shown in panel BSet4, is very interesting: V_MEP detects 5 clusters, which is the correct number of clusters. This last result confirms the performance and the robustness of V_MEP.

Figure 5. Results of clusters validation using V_MEP for some banana forms (panels BSet1, BSet2, BSet3, BSet4).

5. Conclusion
We introduced in this paper a new formulation of a cluster validity index for the validation of the fuzzy k-partitions generated by the FCM clustering algorithm. This new index can play an important role in solving many problems in pattern recognition and in the improvement of product quality in marketing. The proposed index V_MEP is based on the MEP, and the optimal number of clusters is the number k* whose value of V_MEP is maximal. The performance of V_MEP was examined on both our generated synthetic data sets and a real data example, and the robustness of this new index is confirmed by the extension of the method to non-GMM data with overlap. The experimental results show the superiority of our measure V_MEP to the existing ones and its capacity to detect the right number of clusters with different shapes and degrees of overlap.

Finally, we report another advantage of our index. The definition of V_MEP does not use any parameter produced by the adopted clustering algorithm. Therefore, V_MEP is independent of the clustering algorithm. This allows us to choose any algorithm, such as the Gustafson-Kessel (GK) algorithm, which can deal with ellipsoidal clusters, or the EM clustering algorithm. This will be the subject of our next investigation.

References
[1] Ammor O., Lachkar A., Slaoui K., and Rais N., New Efficient Approach to Determine the Optimal Number of Clusters in Overlapping Cases, in Proceedings of the IEEE on Advances in Cybernetic Systems, UK, pp. 6-3, 2006.
[2] Anderson E., The Irises of the Gaspe Peninsula, Bulletin of the American Iris Society, vol. 59, no. 35, pp. 2-5, 1935.
[3] Bezdek J., Cluster Validity with Fuzzy Sets, Cybernetics and Systems, vol. 3, no. 3, pp. 58-7, 1975.
[4] Cembrzynski T., Banc D'essai Sur Les Boules Polonaises, Des Trois Critères de Décision Utilisés Dans La Procédure de Classification, MNDOPT Pour Choisir un Nombre de Classes, RR-0784 Rapport de recherche de l'INRIA, 1986.
[5] Fukuyama Y. and Sugeno M., A New Method of Choosing the Number of Clusters for the Fuzzy C-Means Method, in Proceedings of the 5th Fuzzy Systems Symposium, Japan, pp. 247-250, 1989.
[6] Halkidi M., Batistakis Y., and Vazirgiannis M., Quality Scheme Assessment in the Clustering Process, in Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery, London, pp. 265-276, 2000.
[7] Halkidi M. and Vazirgiannis M., Clustering Validity Assessment: Finding the Optimal Partitioning of a Data Set, in Proceedings of the 1st IEEE International Conference on Data Mining (ICDM'01), USA, pp. 187-194, 2001.
[8] Halkidi M., Batistakis Y., and Vazirgiannis M., Clustering Validity Checking Methods: Part II, Special Interest Group on Management of Data (SIGMOD), vol. 31, no. 3, pp. 19-27, 2002.
[9] Jain A., Murty M., and Flynn P., Data Clustering: a Review, ACM Computing Surveys, vol. 31, no. 3, pp. 264-323, 1999.
[10] Kim D., Park Y., and Park D., A Novel Validity Index for Determination of the Optimal Number of Clusters, IEICE Transactions on Information System, vol. D-E84, no. 2, pp. 281-285, 2001.
[11] Kim D., Kwang A., Lee H., and Lee D., On Cluster Validity Index for Estimation of the Optimal Number of Fuzzy Clusters, Pattern Recognition, vol. 37, pp. 2009-2025, 2004.
[12] Kwon S., Cluster Validity Index for Fuzzy Clustering, Pattern Recognition Letters, vol. 34, no. 22, pp. 2176-2177, 1998.
[13] Lachkar A., Benslimane R., D'Orazio L., and Martuscelli E., A System for Textile Design Patterns Retrieval Part 1: Design Patterns Extraction by Adaptive and Efficient Color Image Segmentation Method, Journal of the Textile Institute, no. 0.533. ot, 4r, 2005.
[14] Liu J. and Yang Y., Multiresolution Color Image Segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 16, no. 7, pp. 689-700, 1994.
[15] Lu X. and Beni G., A Validity Measure for Fuzzy Clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13, no. 8, pp. 841-847, 1991.
[16] Sharma S., Applied Multivariate Techniques, John Wiley and Sons, 1996.
[17] Trivedi M. and Bezdek J., Low-Level Segmentation of Aerial Images with Fuzzy Clustering, IEEE Transactions on Systems, Man and Cybernetics, vol. 16, no. 4, pp. 589-598, 1986.

Ouafae Ammor is a professor in the Department of Mathematics at the Faculty of Sciences and Technology of Fes, Morocco. She obtained her Master's degree in statistics from the Polytechnique School of Montreal, Canada. Her research focuses on clustering in data analysis, pattern recognition, colour image segmentation, medical image processing, fuzzy logic and spatial data analysis.

Abdelmonaime Lachkar received his Master's degree in automatic and systems analysis, and his PhD degree in computer sciences from the USMBA, Morocco, in 2004. He is an associate professor in Computer Sciences at ESTM, Moulay Ismail University, Morocco. His current research interests include image indexing and retrieval, shape representation, indexing and retrieval in large shape databases, colour image segmentation, unsupervised clustering, cluster validity indexes, pattern recognition, Arabic and Latin handwriting recognition, document clustering and categorisation, medical image processing, image compression and watermarking.

Khadija Slaoui is a professor in the Department of Physics at the Faculty of Sciences Dhar Mehraz of Fes, Morocco. She obtained her Master's degree in image processing at the Polytechnique Institute of Toulouse, France, and her PhD in data analysis at University Sidi Mohammed Ben Abdellah. Her research focuses on signal processing, clustering, and image processing.

Noureddine Rais is a full professor at the University of Fez in Morocco. He has a PhD in mathematical statistics from the University of Montreal, Canada, and a 3rd Cycle Doctorate in Statistics and Mathematical Models from the University of Paris-Sud, Orsay, France. His current research interests cover bootstrap, spatial data, survey sampling, decision theory, data analysis, data mining, experimental design, linear models and analysis of variance, with applications in many fields, especially statistical processing control, Arabic handwriting recognition, and medical image processing.