TOWARDS FUZZY-HARD CLUSTERING MAPPING PROCESSES. MINYAR SASSI National Engineering School of Tunis BP. 37, Le Belvédère, 1002 Tunis, Tunisia


Although the validation step can appear crucial in the case of clustering adopting fuzzy approaches, the problem of the validity of the partition obtained by approaches adopting hard ones has not been tackled. To address this problem, we propose in this paper fuzzy-hard mapping processes for clustering, benefitting from those adopting the fuzzy case. These mapping processes concern: (1) local and global clustering evaluation measures, the first for detecting the worst clusters in order to merge or split them, the second for evaluating the partition obtained at each iteration; (2) merging and splitting processes taking the proposed measures into account; and (3) automatic clustering algorithms implementing these new concepts.

1. Introduction

The classification problem is the process of grouping a set of data into groups so that data within a group have high similarity but are very dissimilar to data in other groups. This problem comes in two variants: supervised [1,2] and unsupervised [3] approaches. In the first, the possible groups are known and some data are already classified, serving as a training set; the problem consists in assigning each datum to the most suitable group with the help of those already labeled. In unsupervised classification, also called clustering, the possible groups (or clusters) are not known in advance and the available data are not classified. The goal is then to place in the same cluster the data considered similar. Since clustering is an unsupervised method, some kind of validation of the quality of the clustering result is needed. This quality is judged in general on the basis of two contradictory criteria [4]. The first requires that the generated clusters be as distinct from each other as possible, and the second requires that each cluster be as homogeneous as possible. The data are grouped into clusters based on a number of different approaches.
Partition-based clustering and hierarchical clustering are two of the main techniques. Hierarchical clustering techniques [5] generate a nested series of partitions based on a criterion which measures the similarity between clusters or the separability of a cluster, for merging or splitting clusters. We can mention the BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) algorithm [6] and the CURE (Clustering Using REpresentatives) algorithm [5]. Partition-based clustering often starts from an initial partition and optimizes (usually locally) a clustering criterion. A widely accepted clustering scheme subdivides these techniques into two main groups: hard and fuzzy [3]. The difference between them is mainly the degree of membership of each datum in the clusters. During the construction of the clusters, in the hard case each datum belongs to only one cluster, with a unitary degree of membership, whereas in the fuzzy case each datum can belong to several clusters with different degrees of membership. We mention algorithms such as k-means [7] and Fuzzy C-Means (FCM) [8]. In this work we limit ourselves to partition-based clustering methods.

Most basic clustering algorithms assume that the number of clusters is a user-defined parameter, which is difficult to know in advance in real applications; it is therefore difficult to guarantee that the clustering result reflects the natural cluster structure of the dataset. Several works have tackled this problem [9,10,11]. When confronted with the problem of determining the number of clusters, we are led to make assumptions about it. To avoid requiring the user to choose this number, a solution consists in iterating until an optimal number of clusters is obtained. Each iteration tries to minimize (or maximize) an objective function called a validity index [8,12,13,14], which measures the clustering quality in order to choose the optimal partition among all those obtained with the various plausible values of the required number of clusters. In [11,15], clustering approaches adopting the fuzzy concept were presented and proven; they are based on merging and splitting processes. Although these processes can appear crucial in the fuzzy case, the problem of the validity of the partition obtained by automatic hard clustering methods was not tackled.
To address this problem, we propose in this article mapping rules from fuzzy partition-based clustering to hard partition-based clustering, benefitting from the approaches adopting the fuzzy case. These mapping rules concern:
- the definition of local and global measures for clustering evaluation: the first for detecting the worst clusters in order to merge or split them, the second for evaluating the obtained partition at each iteration;
- the binarization of the merging and splitting processes;
- the modeling and implementation of automatic hard clustering algorithms taking the new concepts into account.

The rest of the article is organized as follows. In Section 2, we discuss background related to fuzzy and hard clustering techniques. In Section 3, we present our motivation. In Section 4, we present the fuzzy-hard mapping processes. Section 5 gives the experimentation and, finally, Section 6 concludes the paper and gives some future work.

2. Background

Clustering methods group homogeneous data in clusters by maximizing the similarity of the objects in the same cluster while minimizing it for objects in different clusters. To make it easier for readers to understand the ideas of the clustering techniques, we have tried to unify the notation used for them. The following definitions are assumed: X \in R^{N \times M} denotes a set of data items representing N objects x_i \in R^M; C_j denotes the j-th cluster, and c_opt denotes the optimal number of clusters found. In this section, we present basic concepts related to fuzzy and hard clustering.

2.1. Fuzzy Clustering

Fuzzy clustering methods allow objects to belong to several clusters simultaneously, with different degrees of membership [16]. The dataset X is thus partitioned into c fuzzy subsets [17]. The result is the partition matrix U = [\mu_{ij}] for i = 1, ..., N and j = 1, ..., c. Several studies have been carried out on the automatic determination of the number of clusters and on the quality evaluation [3,5,11] of the obtained partitions. Bezdek [8] introduced a family of algorithms known under the name Fuzzy C-Means (FCM). The objective function J_m minimized by FCM is defined as follows:

J_m(U, V) = \sum_{i=1}^{N} \sum_{j=1}^{c} \mu_{ij}^{m} \|x_i - c_j\|^2

U and V can be calculated as:

\mu_{ij} = 1 / \sum_{k=1}^{c} ( \|x_i - c_j\| / \|x_i - c_k\| )^{2/(m-1)},    c_j = \sum_{i=1}^{N} \mu_{ij}^{m} x_i / \sum_{i=1}^{N} \mu_{ij}^{m}

where \mu_{ij} is the membership value of the i-th example, x_i, in the j-th cluster, and m > 1 is the fuzzifier. The FCM clustering algorithm is as follows:

Algorithm 1: FCM algorithm
Inputs: the dataset X = {x_i : i = 1, ..., N} \subset R^M, the number of clusters c, the fuzzifier parameter m and the Euclidean distance function.
Outputs: the cluster centers c_j (j = 1, ..., c), the membership matrix U, and the elements of each cluster, i.e., all the x_i such that \mu_{ij} > \mu_{ik} for all k \neq j.
Step 1: Input the number of clusters c, the fuzzifier m and the distance function.
Step 2: Initialize the cluster centers c_j^0 (j = 1, ..., c).
Step 3: Calculate \mu_{ij} (i = 1, ..., N; j = 1, ..., c).
Step 4: Calculate c_j (j = 1, ..., c).
Step 5: If max_j \|c_j - c_j^0\| \leq \varepsilon then stop; else let c_j^0 = c_j (j = 1, ..., c) and go to Step 3.

A simple way to prevent bad clustering due to inadequate cluster centers is to modify the basic FCM algorithm. We start with a large (resp. small) number of uniformly distributed seeds in the bounded M-dimensional feature space, and we decrease (resp. increase) this number considerably by merging (resp. splitting) the worst clusters until the quality measure stops increasing (resp. decreasing). Consequently, compactness and separation are two reasonable measures for evaluating the quality of the obtained clusters. The automatic fuzzy clustering algorithm is based on a global measure representing separation-compactness for clustering quality evaluation. Given a cluster scheme C = {c_1, c_2, ..., c_c} for a dataset X = {x_1, x_2, ..., x_N}, let C' = {c_j : c_j \in C and c_j is not a singleton, j = 1, ..., m_1, where m_1 = |C'|}; the global separation-compactness SC of the cluster scheme C is given by:
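The FCM loop above can be sketched as follows. This is a minimal NumPy version under the paper's update equations; initializing the membership matrix rather than the centers, and the helper name `fcm`, are our implementation choices, not the paper's.

```python
import numpy as np

def fcm(X, c, m=2.0, eps=1e-5, max_iter=300, rng=None):
    """Fuzzy C-Means sketch: returns centers V (c x M) and memberships U (N x c)."""
    rng = np.random.default_rng(rng)
    N = len(X)
    # Initialization: random memberships whose rows sum to 1
    U = rng.random((N, c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        # Step 4: centers as membership-weighted means of the data
        W = U ** m
        V = (W.T @ X) / W.sum(axis=0)[:, None]
        # Step 3: memberships from distances to centers,
        # mu_ij = d_ij^(-2/(m-1)) / sum_k d_ik^(-2/(m-1))
        d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)                      # guard against division by zero
        U_new = 1.0 / (d ** (2.0 / (m - 1.0)))
        U_new /= U_new.sum(axis=1, keepdims=True)
        if np.abs(U_new - U).max() < eps:          # Step 5: convergence test
            U = U_new
            break
        U = U_new
    return V, U
```

With m = 2 the membership update reduces to normalized inverse squared distances, which is the most common choice of fuzzifier.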

SC = (1 / m_1) \sum_{j=1}^{m_1} [ \min_{k \neq j} \|v_j - v_k\| \cdot \sum_{x \in C_j} \mu_j(x) / \sum_{x \in C_j} \mu_j(x) \|x - v_j\|^2 ]

where \mu_j(x) is the membership value of x in the j-th cluster, v_j is the center of cluster c_j, c is the number of clusters and c \leq N. Based on these measures, the merging (resp. splitting) clustering algorithm is as follows:

Algorithm 2: Iterative clustering process
Inputs: a dataset X = {x_i : i = 1, ..., N} \subset R^M and the Euclidean distance function.
Outputs: C = {c_1, c_2, ..., c_opt}, the optimal cluster centers.
Step 1: Initialize the parameters related to FCM: c = c_max (resp. c = c_min), c_min = 2.
Step 2: Apply the basic FCM algorithm to update the membership matrix U and the cluster scheme.
Step 3: Test for convergence; if not converged, go to Step 2.
Step 4: Calculate the SC measure.
Step 5: Repeat: perform the merging (resp. splitting) process to get candidate cluster centers, decreasing c \leftarrow c - 1 (resp. increasing c \leftarrow c + 1); perform the basic FCM algorithm with parameter c to find the cluster centers; calculate the SC measure for the new clusters, denoted SC'; if SC' > SC then SC = SC' and c_opt = c. Until c = c_min (resp. c = c_max).

2.1.1. Fuzzy Merging Process

The fuzzy merging process generally used in earlier studies involves some similarity or compatibility measure to choose the most similar or compatible pair of clusters to merge into one. In our merging process, we instead choose the worst cluster and delete it. Each element included in this cluster is then placed into its nearest remaining cluster, and the centers of all clusters are adjusted. Our merging process may therefore affect multiple clusters, which we consider more practical. How do we choose the worst cluster? We again use the measures of separation and compactness to evaluate individual clusters (except singletons). Given a cluster scheme C = {c_1, c_2, ..., c_c} for a dataset X = {x_1, x_2, ..., x_N}, for each c_j \in C, if c_j is not a singleton, the local separation-compactness of c_j, denoted sc_j, is given by:

sc_j = \min_{k \neq j} \|c_j - c_k\| \cdot \sum_{x \in C_j} u_j(x) / \sum_{x \in C_j} u_j(x) \|x - c_j\|^2

where u_j(x) is the membership value of x in the j-th cluster, c_j is the center of cluster c_j, c is the number of clusters and c \leq N. A small value of sc_j indicates the worst cluster, i.e., the one to be merged.

2.1.2. Fuzzy Splitting Process

In this process, we operate by splitting the worst cluster at each stage while testing the number of clusters c from c_min to c_max. The global separation-compactness measure is used. The general strategy adopted for the new algorithm is as follows: at each step, we identify the worst cluster and split it into two clusters while keeping the other c - 1 clusters, thus increasing the value of c by one. Our major contribution lies in the definition of the criterion for identifying the worst cluster. To identify it, a score function S(j) is associated with each cluster j, as follows:

S(j) = \sum_{i=1}^{N} \mu_{ij} / (number of data vectors in cluster j)

In general, when S(j) is small, cluster j tends to contain a large number of data vectors with low membership values. The lower the membership value, the farther the object is from its cluster center. Therefore, a small S(j) means that cluster j is large in volume and sparse in distribution. This is why the cluster corresponding to the minimum of S(j) is chosen as the candidate to split when the value of c is increased. Conversely, a larger S(j) tends to mean that cluster j has a smaller number of elements and exerts a strong attraction on them.
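The two worst-cluster criteria just described can be sketched as follows. This is our reconstruction of the local sc_j measure and of the split score S(j); the helper names and the one-hot test of non-singleton clusters are ours, not the paper's.

```python
import numpy as np

def local_sc(X, V, U, labels, j):
    """Local separation-compactness sc_j: nearest-center distance times the
    ratio of summed memberships to membership-weighted scatter of cluster j."""
    sep = min(np.linalg.norm(V[j] - V[k]) for k in range(len(V)) if k != j)
    members = labels == j
    u = U[members, j]
    scatter = np.sum(u * np.linalg.norm(X[members] - V[j], axis=1) ** 2)
    return sep * u.sum() / scatter

def split_score(U, labels, j):
    """Score S(j): mean membership of the data vectors assigned to cluster j.
    The cluster minimizing S(j) is the candidate for splitting."""
    return U[labels == j, j].sum() / np.count_nonzero(labels == j)

def worst_cluster_to_merge(X, V, U, labels):
    """Cluster with the smallest sc_j (singletons excluded) is merged away."""
    candidates = [j for j in range(len(V)) if np.count_nonzero(labels == j) > 1]
    return min(candidates, key=lambda j: local_sc(X, V, U, labels, j))
```

A cluster that is both spread out (large scatter) and close to a neighboring center gets a low sc_j, which matches the text's reading of the measure.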

2.2. Hard Clustering

In hard clustering, the data can be gathered in a table with N rows and M columns. If the data belong to a set of clusters, it is possible to associate with this data table a membership table whose values, 1 or 0, respectively indicate membership or non-membership in cluster C_j, with j = 1, 2, ..., c. One of the best-known hard clustering algorithms is k-means. Its goal is to minimize the distance from each datum to the center of the cluster to which it belongs. The algorithm corresponds to the search for the centers c_j minimizing the following criterion [19]:

E = \sum_{j=1}^{c} \sum_{i=1}^{N} \delta_{ij} \|x_i - c_j\|^2,  where \delta_{ij} = 1 if x_i \in C_j and 0 otherwise.

The k-means algorithm consists in choosing initial centers and improving the obtained partition iteratively in three steps.

Algorithm 3: k-means algorithm
Inputs: a dataset X = {x_i : i = 1, ..., N} \subset R^M and the Euclidean distance function.
Output: the cluster scheme C = {c_j \in R^M : j = 1, ..., c}.
Step 1: Initialize the c centers with random values.
Step 2: Assign each datum to the nearest cluster: x_i \in C_j if \|x_i - c_j\| \leq \|x_i - c_l\| for i = 1, ..., N and all l \neq j.
Step 3: Recalculate the position of each new center: c_j^* = (1/N_j) \sum_{x_i \in C_j} x_i, where N_j is the cardinality of cluster C_j, j = 1, ..., c.

Steps 2 and 3 are repeated until convergence, i.e., until the centers no longer change. The result of this algorithm does not depend on the input order of the data. It has linear complexity [19], adapts to large datasets, and requires fixing the number of clusters,

which influences the output. The result is sensitive to the starting situation, both to the number of clusters and to the initial center positions. To avoid requiring the user to choose this number, a solution again consists in iterating until an optimal number of clusters is obtained, each iteration optimizing a validity index [8,12,13,14] that measures the clustering quality. In this subsection, we first present some validity indices; we then give new definitions of separation and compactness quality measures applicable in the context of hard partition-based clustering.

The Mean Squared Error (MSE) [20] of an estimator is one of many ways to quantify the amount by which an estimator differs from the true value of the quantity being estimated. The MSE measures the average of the square of the error, the error being the amount by which the estimator differs from the quantity to be estimated; the difference occurs because of randomness or because the estimator does not account for information that could produce a more accurate estimate. In clustering, this measure corresponds to compactness:

MSE = (1/N) \sum_{j=1}^{c} \sum_{i=1}^{N} \delta_{ij} \|x_i - c_j\|^2

In the case of hard partition-based clustering, the Dunn index [21] takes into account both the compactness and the separability of the clusters: the value of this index is all the higher as the clusters are compact and well separated. Note that the complexity of the Dunn index becomes prohibitive for large datasets; it is consequently seldom used.

I_Dunn = \min_{j \neq k} D_min(c_j, c_k) / \max_{j} D_max(c_j)

where D_min(c_j, c_k) is the minimal distance separating a datum of cluster c_j from a datum of cluster c_k, and D_max(c_j) is the maximal distance separating two data of cluster c_j:

D_min(c_j, c_k) = \min \{ \|x - y\| : x \in c_j and y \in c_k \},    D_max(c_j) = \max \{ \|x - y\| : (x, y) \in c_j \times c_j \}
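The two indices just defined can be sketched in NumPy as follows (a minimal sketch; the function names are ours):

```python
import numpy as np

def mse(X, centers, labels):
    """Mean Squared Error: average squared distance of each datum to the
    center of the cluster it is assigned to (a compactness measure)."""
    return np.mean(np.linalg.norm(X - centers[labels], axis=1) ** 2)

def dunn(X, labels):
    """Dunn index: smallest between-cluster distance divided by the largest
    cluster diameter. It needs all pairwise distances, hence the O(N^2)
    cost noted in the text; higher values mean compact, well-separated clusters."""
    ids = np.unique(labels)
    d_min = min(np.linalg.norm(X[labels == a][:, None] - X[labels == b][None, :],
                               axis=2).min()
                for a in ids for b in ids if a < b)
    d_max = max(np.linalg.norm(X[labels == a][:, None] - X[labels == a][None, :],
                               axis=2).max()
                for a in ids)
    return d_min / d_max
```

For two unit-height clusters placed 10 apart, for example, the Dunn index is 10/1 = 10, while the MSE only reflects the within-cluster spread.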

The Davies-Bouldin index [22] compares centroid diameters between clusters. In its computation, the centroid linkage is used as the inter-cluster distance; the centroid inter-cluster and intra-cluster measures are selected for compatibility with the k-means clustering algorithm (which essentially computes the centroids of the clusters at each iteration). This index takes into account the compactness and the separability of the clusters: its value is all the lower as the clusters are compact and well separated. It is defined by the following expression:

I_DB = (1/c) \sum_{j=1}^{c} \max_{k \neq j} [ (D_c(c_j) + D_c(c_k)) / D_ce(c_j, c_k) ]

where D_c(c_j) is the average distance between the data of cluster c_j and its center, and D_ce(c_j, c_k) is the distance separating the centers c_j and c_k:

D_c(c_j) = (1/N_j) \sum_{x_i \in C_j} \|x_i - c_j\|,    D_ce(c_j, c_k) = \|c_j - c_k\|

3. Motivations

When confronted with the problem of determining the number of clusters, we are led to make assumptions about it. To avoid requiring the user to choose this number, a solution consists in iterating until an optimal number of clusters is obtained, each iteration optimizing a validity index which measures the clustering quality in order to choose the optimal partition among all those obtained with the various plausible values of the required number of clusters. The previously presented indices are applicable in the case of hard partition-based clustering but show their limits in the determination of the optimal number of clusters. Moreover, all these measures are global, i.e., they evaluate the partition as a whole. In automatic clustering, however, where the number of clusters is unknown to the user, we need to iterate a clustering algorithm until the optimal number of clusters is obtained. We therefore resort to local evaluation for the detection of the worst clusters to be merged or split.
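The Davies-Bouldin formula above translates directly into code (a sketch; the function name is ours):

```python
import numpy as np

def davies_bouldin(X, centers, labels):
    """Davies-Bouldin index I_DB: for each cluster, take the worst ratio of
    summed intra-cluster scatters to the centroid distance, then average.
    Lower values mean more compact, better separated clusters."""
    c = len(centers)
    # D_c(c_j): average distance of cluster j's data to its center
    s = np.array([np.linalg.norm(X[labels == j] - centers[j], axis=1).mean()
                  for j in range(c)])
    # max over k != j of (D_c(c_j) + D_c(c_k)) / D_ce(c_j, c_k), averaged over j
    return np.mean([max((s[j] + s[k]) / np.linalg.norm(centers[j] - centers[k])
                        for k in range(c) if k != j)
                    for j in range(c)])
```

On the same two-cluster toy layout used above (scatters of 0.5, centers 10 apart) the index evaluates to (0.5 + 0.5)/10 = 0.1.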

Based on these principles, we propose to evaluate the obtained partition globally and locally in order to reach the optimum. The global evaluation makes it possible to judge the quality of the generated partition, whereas the local evaluation makes it possible to detect the worst clusters.

4. Mapping Processes

As mentioned in Section 2, clustering algorithms partition data into clusters by maximizing the similarity of the data in the same cluster while minimizing it for data in different clusters. Consequently, compactness and separation are the two reasonable criteria for evaluating cluster quality, and the proposed quality evaluation measures are based on them. The global evaluation judges the quality of the generated partition. We propose here new formulations which define hard global compactness and hard global separation within the partition.

Hard Global Compactness. Given a cluster scheme C = {C_1, C_2, ..., C_c} for a dataset X = {x_1, x_2, ..., x_N}, let C' = {C_j : C_j \in C and C_j is not a singleton, j = 1, 2, ..., k, where k = |C'|}; the hard global compactness Cmp of the cluster scheme C is given by:

Cmp = (1/k) \sum_{j=1}^{k} Var_j

where Var_j is the variance of the j-th cluster, given by the following expression:

Var_j = (1 / card(C_j)) \sum_{x_i \in C_j} \|x_i - c_j\|^2

where c_j is the center of cluster C_j.

Hard Global Separation. The hard global separation Sep of a cluster scheme C = {C_1, C_2, ..., C_c} for a dataset X = {x_1, x_2, ..., x_N} is given by:

Sep = \min_{j \neq k} \|c_j - c_k\| / c
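The two global measures can be computed as follows. This is a sketch of our reading of the definitions; in particular, the 1/c factor in Sep is our reconstruction of a garbled formula, and the function names are ours.

```python
import numpy as np

def variance(Xj, center):
    """Var_j: mean squared distance of cluster j's data to its center."""
    return np.mean(np.linalg.norm(Xj - center, axis=1) ** 2)

def hard_global_measures(X, centers, labels):
    """Cmp: mean variance over the non-singleton clusters.
    Sep: minimum pairwise center distance divided by the cluster count c."""
    c = len(centers)
    non_singleton = [j for j in range(c) if np.count_nonzero(labels == j) > 1]
    cmp_ = np.mean([variance(X[labels == j], centers[j]) for j in non_singleton])
    sep = min(np.linalg.norm(centers[j] - centers[k])
              for j in range(c) for k in range(j + 1, c)) / c
    return sep, cmp_
```

A good partition should score high on Sep and low on Cmp; the ratio of the two is the SepCmp measure introduced next in the text.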

Hard Global Separation-Compactness. Given a scheme C = {C_1, C_2, ..., C_c} for a dataset X = {x_1, x_2, ..., x_N}, let C' = {C_j : C_j \in C and C_j is not a singleton, j = 1, 2, ..., k, where k = |C'|}; the hard global separation-compactness SepCmp of the cluster scheme C is given by:

SepCmp = Sep / Cmp

Consequently, the best partition is the one maximizing the SepCmp measure. For the determination of the optimal number of clusters, we propose two approaches. The first is based on the merging principle; we call it EMk-means (Enhanced Merging k-means). We start with a maximum number of clusters c_max, which decreases during the iterations by identifying the worst cluster and merging it. The second is based on the splitting principle; we call it ESk-means (Enhanced Splitting k-means). We start with a minimum number of clusters c_min, which increases during the iterations; this increase is done by dividing the cluster having the maximum value of variance. The principle of the two approaches is summarized in the following algorithm:

Algorithm 4: Iterative clustering approach
Step 1: Initialize c_max and c_min; initialize the cluster centers.
Step 2: Apply the k-means algorithm.
Step 3: Calculate SepCmp.
Step 4: For c from c_max down to c_min (respectively from c_min up to c_max): merge (respectively split) the clusters.
Step 5: Calculate SepCmp.
Step 6: Determine the optimal number of clusters c_opt, the one whose SepCmp value is maximal.

4.1. Merging Process Mapping

The local evaluation is used to identify the worst cluster to be merged with the others. Each datum belonging to this cluster is assigned to the nearest remaining cluster; the centers of all clusters are then adjusted. To identify the worst cluster at each iteration, separation and compactness measures are used for local evaluation. Based on the hard local separation-compactness measures, we present in this subsection the mapping rules from fuzzy to hard.

Hard Local Compactness. Given a cluster scheme C = {C_1, C_2, ..., C_c} for a dataset X = {x_1, x_2, ..., x_N}, for each C_j \in C, if C_j is not a singleton, the hard local compactness of C_j, denoted cmp_j, is given by:

cmp_j = Var_j = (1 / card(C_j)) \sum_{x_i \in C_j} \|x_i - c_j\|^2

where Var_j is the variance of the j-th cluster.

Hard Local Separation. Given a cluster scheme C = {C_1, C_2, ..., C_c} for a dataset X = {x_1, x_2, ..., x_N}, the hard local separation of C_j, denoted sep_j, is given by:

sep_j = \min_{l \neq j} \|c_j - c_l\|

where c_j and c_l are the centers of clusters C_j and C_l respectively.

Hard Local Separation-Compactness. Given a cluster scheme C = {C_1, C_2, ..., C_c} for a dataset X = {x_1, x_2, ..., x_N}, for each C_j \in C, if C_j is not a singleton, the hard local separation-compactness of C_j, denoted sepcmp_j, is given by:

sepcmp_j = sep_j / cmp_j

Thus, the worst cluster is the one with the smallest sepcmp value. By combining these measures with the k-means algorithm, the number of clusters is obtained automatically. The proposed algorithm, called EMk-means, is based on a merge strategy. We start with a maximum number of clusters c_max, which decreases during the iterations by identifying and merging the worst cluster. For c_max, we adopted the following suggestion of Bezdek [8]: c_max = \sqrt{N} (N is the size of the dataset). At each iteration, the algorithm seeks to maximize the hard global separation-compactness measure SepCmp obtained for the various plausible values of the required number of clusters. The worst cluster is identified and merged with the remaining clusters. The EMk-means algorithm is described below.

Algorithm 5: EMk-means algorithm
Inputs: a dataset X = {x_1, x_2, ..., x_N} and the Euclidean distance function.
Output: the optimal cluster scheme C = {C_1, C_2, ..., C_c}.
Step 1: Initialize the parameters related to the k-means algorithm: c = c_max, c_min = 2.
Step 2: Apply the k-means algorithm.

Step 3: If convergence, go to Step 4; else go to Step 2.
Step 4: Calculate the SepCmp measure.
Step 5: Repeat: apply the merge-based procedure to obtain the candidate cluster scheme; decrease the cluster number, c \leftarrow c - 1; calculate the SepCmp measure for the new clusters. Until c = c_min.
Step 6: Retain the cluster scheme whose SepCmp value is maximal.

The hard merge-based procedure is presented as follows:

Procedure 1: Hard merge-based procedure
Input: a cluster scheme C = {C_1, C_2, ..., C_c}.
Output: a candidate cluster scheme C* = {C_1, C_2, ..., C_{c-1}}.
Step 1: Calculate the sepcmp measure for each cluster belonging to C.
Step 2: Delete the worst cluster, the one with the minimal value of sepcmp.
Step 3: Assign the data of this cluster to the various remaining clusters.
Step 4: Calculate the values of the new cluster centers according to the median formula.
Step 5: Apply the k-means algorithm to the new clusters.

4.2. Splitting Process Mapping

The proposed algorithm is based on a construction strategy: it is initialized with a minimum number of clusters c_min, which is incremented during the iterations according to a splitting process until the maximum cluster number c_max = \sqrt{N} is reached (N is the size of the dataset). At each iteration, the SepCmp measure is calculated for the determination of the optimal number of clusters c_opt. In this strategy, we select the worst cluster and divide it into two new clusters. Each datum belonging to this cluster is then assigned to a new cluster whose center must be recalculated. The splitting-based process is carried out by calculating the variance of each cluster, given by the following equation:

Var(C_j) = (1 / N_{C_j}) \sum_{x_i \in C_j} D(x_i, c_j)^2

where N_{C_j} is the number of data belonging to cluster C_j, and D(x_i, c_j) is the Euclidean distance between the datum x_i belonging to cluster C_j and its center c_j. The ESk-means algorithm aims at minimizing the average intra-cluster distance; this is why we choose the cluster corresponding to the maximum value of Var(C_j) as the candidate for the splitting-based process. When the value of Var(C_j) is high, the data belonging to cluster C_j are dispersed, i.e., they tend to move away from the center c_j of cluster C_j.

The splitting-based procedure consists in identifying the worst cluster, the one having the maximum value of variance; the set of its data is denoted E. We then calculate, for each of its data, the total distance to the centers of the non-selected clusters. The two data having the maximum total distance to the centers are selected as initial centers for starting the k-means algorithm, in order to partition E into two new sets E_1 and E_2. The splitting procedure is detailed as follows:

Procedure 2: Hard splitting-based procedure
Input: a cluster scheme C = {C_1, C_2, ..., C_c}.
Output: a candidate cluster scheme C* = {C_1, C_2, ..., C_c, C_{c+1}}.
Step 1: Identify the cluster to be divided by calculating its variance. Denote by E the set of data belonging to this cluster and by c_0 its center.
Step 2: Seek in E the two data vectors whose total distances to all the remaining data of E are maximal. Denote these two vectors c_1 and c_2.
Step 3: Apply the k-means algorithm with the new centers c_1 and c_2 in order to obtain two new partitions E_1 and E_2.

The aim of the algorithm is to select the worst cluster, remove it and replace it with two new clusters by applying the splitting-based process. At the end of each iteration, the SepCmp measure is calculated and the number of clusters is incremented until c_max clusters are obtained. The number retained is the one whose SepCmp value is maximal.

The proposed algorithm for the determination of the center values and of the c_opt clusters is described below:

Algorithm 6: ESk-means algorithm
Inputs: a dataset X = {x_1, x_2, ..., x_N} and the Euclidean distance function.
Output: the optimal cluster scheme C = {C_1, C_2, ..., C_c}.
Step 1: Initialize the parameters related to the k-means algorithm: c = c_min = 2, c_max = \sqrt{N}.
Step 2: Apply the k-means algorithm.
Step 3: If convergence, go to Step 4; else go to Step 2.
Step 4: Calculate the SepCmp measure.
Step 5: Repeat: apply the splitting-based procedure to obtain the candidate cluster scheme; increase the cluster number, c \leftarrow c + 1; calculate the SepCmp measure for the new clusters. Until c = c_max.
Step 6: Retain the cluster scheme whose SepCmp value is maximal.

5. Experimentation

5.1. Data of Experimentation

To validate the proposed approaches for the determination of the number of clusters and the evaluation of clustering quality, three datasets are used among the various data files made available to the machine learning community by the University of California at Irvine (UCI) [23], as well as an artificial data file coming from the benchmark Concentric:

Iris: this dataset contains 3 clusters. Each cluster refers to a type of iris flower: Setosa, Versicolor or Virginica. Each cluster contains 50 patterns, each with 4 components. The 1st cluster (Setosa) is linearly separable from the others; the two other clusters (Versicolor and Virginica) overlap.

Wine: this dataset records the results of a chemical analysis of various wines produced in the same area of Italy from various types of vines. The concentration of 13 components is given for each of the 178 analyzed wines, which are distributed as follows: 59 in the 1st cluster, 71 in the 2nd cluster and 48 in the 3rd cluster.

Diabetes: this dataset records the results of an analysis concerning diabetes, carried out on donors in order to diagnose the disease. The size of this dataset is 768 patterns distributed in two clusters: 500 for the 1st cluster and 268 for the 2nd cluster. Patterns have 8 dimensions.

Concentric: this dataset is artificial and rather complex. It contains 2500 patterns, 579 for the 1st cluster and 1921 for the 2nd cluster. The 1st cluster is inside the second. Patterns have 2 dimensions.

5.2. Comparative Study

In this subsection, we evaluate the proposed quality measures and clustering algorithms.

5.2.1. Clustering Quality Evaluation

We present a comparative study of the proposed clustering algorithms via the quality evaluation measures. We report only some iterations for each dataset.

Table 1. Results of the evaluation of the EMk-means and ESk-means algorithms on the Iris dataset (columns: iteration, algorithm, MSE, I_Dunn, I_DB, SepCmp; the numeric entries could not be recovered from the source).

As shown in Table 1, only the I_DB and SepCmp measures determined the optimal number of clusters, with respective values (0.) and (0.69). This is explained by the fact that the maximal value of SepCmp (resp. the minimal value of I_DB) is associated with the optimal number of clusters. For the two proposed algorithms (EMk-means and ESk-means), the MSE and I_Dunn measures could not determine this number.

Table 2. Results of the evaluation of the EMk-means and ESk-means algorithms on the Wine dataset (columns: iteration, algorithm, MSE, I_Dunn, I_DB, SepCmp; the numeric entries could not be recovered from the source).

As shown in Table 2, only the SepCmp measure could determine the optimal number of clusters with both proposed clustering algorithms. With the I_DB measure, the optimal number of clusters is obtained only with the EMk-means algorithm, with the value (0.5). The MSE and I_Dunn measures did not determine this number with either algorithm.

Table 3. Results of the evaluation of the EMk-means and ESk-means algorithms on the Diabetes dataset (columns: iteration, algorithm, MSE, I_Dunn, I_DB, SepCmp; the numeric entries could not be recovered from the source).

As shown in Table 3, with the EMk-means algorithm all measures (MSE, I_Dunn, I_DB and SepCmp) determine the optimal number of clusters, with the respective values (0.0), (0.), (0.8) and (.63). With the ESk-means algorithm, only the I_Dunn and SepCmp measures determine this number, with the respective values (0.7) and (.45).

Table 4. Results of the evaluation of the EMk-means and ESk-means algorithms on the Concentric dataset (columns: iteration, algorithm, MSE, I_Dunn, I_DB, SepCmp; the numeric entries could not be recovered from the source).

As shown in Table 4, with the EMk-means algorithm all measures (MSE, I_Dunn, I_DB and SepCmp) determine the optimal number of clusters, with the respective values (0.), (0.9), (0.) and (3.09). With the ESk-means algorithm, none of the used measures could determine this number; however, the values obtained by the SepCmp measure via ESk-means for 2 and 3 clusters, respectively (.83) and (.8), are very close. Based on these results, we can say that the proposed clustering algorithms give the most consistent results across all of the tested datasets.
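As an end-to-end illustration of the merge-driven loop evaluated above, the following is a compact sketch of EMk-means under the definitions of Section 4; the helper names, the seeding of the initial centers on data points, and the empty-cluster handling are our choices, not the paper's.

```python
import numpy as np

def _kmeans(X, centers, iters=50):
    """Plain k-means refinement of the given centers."""
    for _ in range(iters):
        labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(1)
        new = np.array([X[labels == j].mean(0) if np.any(labels == j) else centers[j]
                        for j in range(len(centers))])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, labels

def _sepcmp(X, centers, labels):
    """Global SepCmp = Sep / Cmp over non-singleton clusters."""
    c = len(centers)
    var = [np.mean(np.linalg.norm(X[labels == j] - centers[j], axis=1) ** 2)
           for j in range(c) if np.count_nonzero(labels == j) > 1]
    sep = min(np.linalg.norm(centers[j] - centers[k])
              for j in range(c) for k in range(j + 1, c)) / c
    return sep / np.mean(var)

def emk_means(X, rng=None):
    """EMk-means sketch: start at c_max = sqrt(N), repeatedly drop the cluster
    with the lowest local sep/cmp ratio, keep the partition maximizing SepCmp."""
    rng = np.random.default_rng(rng)
    c = max(2, int(np.sqrt(len(X))))
    centers, labels = _kmeans(X, X[rng.choice(len(X), c, replace=False)].copy())
    best = (_sepcmp(X, centers, labels), centers, labels)
    while c > 2:
        # local sep/cmp of each non-singleton cluster; merge away the smallest
        scores = {}
        for j in range(c):
            members = labels == j
            if np.count_nonzero(members) < 2:
                continue
            sep = min(np.linalg.norm(centers[j] - centers[k])
                      for k in range(c) if k != j)
            cmp_ = np.mean(np.linalg.norm(X[members] - centers[j], axis=1) ** 2)
            scores[j] = sep / cmp_
        worst = min(scores, key=scores.get)
        centers = np.delete(centers, worst, axis=0)
        c -= 1
        centers, labels = _kmeans(X, centers)   # reassign data, re-center
        score = _sepcmp(X, centers, labels)
        if score > best[0]:
            best = (score, centers, labels)
    return best[1], best[2]
```

On well-separated synthetic blobs, this loop visits every candidate c from c_max down to 2 and retains the partition whose SepCmp is maximal, as Algorithm 5 prescribes.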

5.2.2. Distribution Uniformity

In this subsection, we evaluate the two proposed clustering algorithms in terms of the distribution of the data over the various clusters.

Table 5. Results of the evaluation of the data distributions via EMk-means and ESk-means (columns: dataset, optimal number of clusters, real distribution, obtained distribution, deviation; several numeric entries could not be recovered from the source).

As shown in Table 5, in the case of the Iris dataset, the deviation between the real and the obtained data distributions with the EMk-means (resp. ESk-means) algorithm is equal to .6% (resp. 4.0%). In the case of the Wine dataset, the deviation with EMk-means (resp. ESk-means) is equal to .4% (resp. 3.0%). In the case of the Diabetes dataset, the deviation with EMk-means (resp. ESk-means) is equal to 0.8% (resp. .4%). In the case of the Concentric dataset, the deviation with EMk-means (resp. ESk-means) is equal to 0.69% (resp. .03%). Compared to the size of the used datasets, we can conclude that these percentages are very satisfactory when the number of clusters is unknown to the user. We also note that the results obtained with the merge-based strategy are better than those given by the splitting-based strategy.

5.2.3. About Complexity

The theoretical complexity of the SepCmp measure is based on the complexity of its two terms, the Sep and Cmp measures.

For a data set X = {x_i : i = 1, ..., N} ⊂ R^M, where c is the initial number of clusters, the theoretical complexity of the Sep measure is about O(NMc), while the complexity of the Cmp measure is about O(NMc). The complexity of SepCmp is therefore about O(NMc). Usually M ≪ N, so the complexity of this measure for a specific cluster scheme is about O(Nc).

We present in Table 6 a study of the theoretical complexity of the proposed clustering algorithms.

Table 6. Study of the complexity of the proposed clustering algorithms.

Algorithm     Theoretical complexity
k-means       O(Nc)
EMk-means     O(Nc²)
ESk-means     O(Nc²)

As shown in Table 6, the complexity of the proposed algorithms is proportional to the size of the data and to the square of the maximum number of clusters.

6. Conclusion

The majority of clustering algorithms run up against the problem of determining the optimal number of clusters to generate, and therefore of evaluating the quality of the obtained clusters. A solution consists in iterating until a satisfactory number of clusters is obtained, each iteration trying to minimize a quality measure called a validity index. This quality is judged in general on the basis of two contradictory criteria: the first supposes that the generated clusters must be as distinct as possible from each other with respect to certain characteristics, while the second requires that each cluster be as homogeneous as possible with respect to these characteristics. Drawing on published algorithms, we have proposed mapping rules for generalizing them. Our major contributions are as follows:

- Definition of global quality measures of a partition generated by a clustering algorithm, based on compactness and separation measures,
- Definition of local quality measures to identify the worst clusters to be merged or split,
- Binarization of the merging and splitting processes.
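The O(NMc) term comes from one pass over all N points against all c centroids in M dimensions. The paper's exact Sep and Cmp formulas are not reproduced above, so the score below is only an illustrative stand-in with the same cost structure:

```python
import numpy as np

def sep_cmp_score(X, centroids):
    # Compactness: mean distance of each of the N points (M features) to
    # its nearest of the c centroids. Building the N x c distance table
    # dominates the cost: O(N*M*c).
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
    cmp_term = d.min(axis=1).mean()
    # Separation: mean pairwise distance between centroids, O(c^2 * M),
    # negligible next to the O(N*M*c) term since c << N.
    c = len(centroids)
    sep_term = np.mean([np.linalg.norm(centroids[i] - centroids[j])
                        for i in range(c) for j in range(i + 1, c)])
    # Larger is better: well-separated centroids and tight clusters.
    return sep_term / cmp_term
```

Combining separation and compactness into a single ratio is one common design; the point of the sketch is only that evaluating such a measure once costs O(NMc), which repeated over up to c candidate partitions gives the O(Nc²) figures of Table 6 (for fixed M).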

- Modeling and implementation of two clustering approaches implementing the newly proposed measures and processes.

The first proposed approach is based on the principle of merging: it starts with a maximum number of clusters and decreases it over the iterations. The second proposed approach is based on the principle of splitting: it starts with a minimum number of clusters, which is incremented during execution. For both approaches, the basic idea consists in determining the optimal number of clusters through global and local evaluations of the obtained partition. The proposed measures, processes and approaches have been exploited successfully on various data sets. Future work primarily concerns testing these approaches on large data sets; we intend to simplify this operation by data sampling while allowing a better evaluation of the results.

References

[1] M. B. Dale, P. E. R. Dale and P. Tan, Supervised clustering using decision trees and decision graphs: An ecological comparison, Eco. Model. 204(1-2) (2007).
[2] M. Grimaldi, K. Demal, R. Redon, J. L. Jamet and B. Rossetto, Reconnaissance de formes et classification supervisée appliquée au comptage automatique de phytoplankton, 40 (2003).
[3] B. Everitt, Cluster Analysis, 3rd ed., Edward Arnold, 1993.
[4] E. Aitnouri, F. Dubeau, S. Wang and D. Ziou, Controlling mixture component overlap for clustering algorithms evaluation, J. Pattern Recog. Image Anal. (4) (2002).
[5] S. Guha, R. Rastogi and K. Shim, CURE: an efficient clustering algorithm for large databases, ACM SIGMOD Int. Conf. Management of Data, 1998.
[6] T. Zhang, R. Ramakrishnan and M. Livny, BIRCH: an efficient data clustering method for very large databases, ACM SIGMOD Int. Conf. Management of Data, Montreal, Canada, 1996.
[7] J. MacQueen, Some methods for classification and analysis of multivariate observations, The Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1967.
[8] J. C. Bezdek, Chapter F6: Pattern Recognition, in Handbook of Fuzzy Computation, IOP Publishing Ltd., 1998.
[9] R. Agrawal, J. Gehrke, D. Gunopulos and P. Raghavan, Automatic subspace clustering of high dimensional data for data mining applications, ACM SIGMOD Int. Conf. Management of Data (SIGMOD 98), Seattle, WA, 1998.
[10] J. Costa and M. Netto, Estimating the number of clusters in multivariate data by self-organizing maps, Int. J. Neural Systems 9(3) (1999).
[11] H. Sun, S. Wang and Q. Jiang, FCM-based model selection algorithms for determining the number of clusters, Pattern Recog. 37(10) (2004).
[12] D. W. Kim, K. H. Lee and D. Lee, On cluster validity index for estimation of the optimal number of fuzzy clusters, Pattern Recog. 37(10) (2004).
[13] M. Sassi, A. Grissa Touzi and H. Ounelli, Two levels of extensions of validity function based fuzzy clustering, 4th International Multiconference on Computer Science and Information Technology, Amman, Jordan, 2006.
[14] X. Xie and G. Beni, A validity measure for fuzzy clustering, IEEE Trans. Pattern Anal. Mach. Intell. 13(8) (1991).

[15] M. Sassi, A. Grissa Touzi and H. Ounelli, Using Gaussian functions to determine representative clustering prototypes, 17th IEEE International Conference on Database and Expert Systems Applications, Poland, 2006.
[16] M. Sato, Y. Sato and L. C. Jain, Fuzzy Clustering Models and Applications, Physica-Verlag, Heidelberg, New York, 1997.
[17] L. A. Zadeh, Fuzzy sets, Inf. Control 8 (1965).
[18] J. Bezdek, R. Hathaway, M. Sabin and W. Tucker, Convergence theory for fuzzy c-means: counterexamples and repairs, IEEE Trans. Systems, Man and Cybernetics 17(5) (1987).
[19] J. MacQueen, Some methods for classification and analysis of multivariate observations, The Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1967.
[20] G. Casella and E. L. Lehmann, Theory of Point Estimation, Springer, 1999.
[21] J. Dunn, Well separated clusters and optimal fuzzy partitions, J. Cybernetics 4 (1974).
[22] D. L. Davies and D. W. Bouldin, A cluster separation measure, IEEE Trans. Pattern Anal. Mach. Intell. 1(2) (1979), 224-227.


More information

An Efficient Genetic Algorithm with Fuzzy c-means Clustering for Traveling Salesman Problem

An Efficient Genetic Algorithm with Fuzzy c-means Clustering for Traveling Salesman Problem An Effcent Genetc Algorthm wth Fuzzy c-means Clusterng for Travelng Salesman Problem Jong-Won Yoon and Sung-Bae Cho Dept. of Computer Scence Yonse Unversty Seoul, Korea jwyoon@sclab.yonse.ac.r, sbcho@cs.yonse.ac.r

More information

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers IOSR Journal of Electroncs and Communcaton Engneerng (IOSR-JECE) e-issn: 78-834,p- ISSN: 78-8735.Volume 9, Issue, Ver. IV (Mar - Apr. 04), PP 0-07 Content Based Image Retreval Usng -D Dscrete Wavelet wth

More information

Classifying Acoustic Transient Signals Using Artificial Intelligence

Classifying Acoustic Transient Signals Using Artificial Intelligence Classfyng Acoustc Transent Sgnals Usng Artfcal Intellgence Steve Sutton, Unversty of North Carolna At Wlmngton (suttons@charter.net) Greg Huff, Unversty of North Carolna At Wlmngton (jgh7476@uncwl.edu)

More information

A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS

A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS Proceedngs of the Wnter Smulaton Conference M E Kuhl, N M Steger, F B Armstrong, and J A Jones, eds A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS Mark W Brantley Chun-Hung

More information

An Application of the Dulmage-Mendelsohn Decomposition to Sparse Null Space Bases of Full Row Rank Matrices

An Application of the Dulmage-Mendelsohn Decomposition to Sparse Null Space Bases of Full Row Rank Matrices Internatonal Mathematcal Forum, Vol 7, 2012, no 52, 2549-2554 An Applcaton of the Dulmage-Mendelsohn Decomposton to Sparse Null Space Bases of Full Row Rank Matrces Mostafa Khorramzadeh Department of Mathematcal

More information

Load-Balanced Anycast Routing

Load-Balanced Anycast Routing Load-Balanced Anycast Routng Chng-Yu Ln, Jung-Hua Lo, and Sy-Yen Kuo Department of Electrcal Engneerng atonal Tawan Unversty, Tape, Tawan sykuo@cc.ee.ntu.edu.tw Abstract For fault-tolerance and load-balance

More information

Fuzzy Modeling of the Complexity vs. Accuracy Trade-off in a Sequential Two-Stage Multi-Classifier System

Fuzzy Modeling of the Complexity vs. Accuracy Trade-off in a Sequential Two-Stage Multi-Classifier System Fuzzy Modelng of the Complexty vs. Accuracy Trade-off n a Sequental Two-Stage Mult-Classfer System MARK LAST 1 Department of Informaton Systems Engneerng Ben-Guron Unversty of the Negev Beer-Sheva 84105

More information

A NOTE ON FUZZY CLOSURE OF A FUZZY SET

A NOTE ON FUZZY CLOSURE OF A FUZZY SET (JPMNT) Journal of Process Management New Technologes, Internatonal A NOTE ON FUZZY CLOSURE OF A FUZZY SET Bhmraj Basumatary Department of Mathematcal Scences, Bodoland Unversty, Kokrajhar, Assam, Inda,

More information

A Clustering Algorithm for Chinese Adjectives and Nouns 1

A Clustering Algorithm for Chinese Adjectives and Nouns 1 Clusterng lgorthm for Chnese dectves and ouns Yang Wen, Chunfa Yuan, Changnng Huang 2 State Key aboratory of Intellgent Technology and System Deptartment of Computer Scence & Technology, Tsnghua Unversty,

More information

CHAPTER 2 DECOMPOSITION OF GRAPHS

CHAPTER 2 DECOMPOSITION OF GRAPHS CHAPTER DECOMPOSITION OF GRAPHS. INTRODUCTION A graph H s called a Supersubdvson of a graph G f H s obtaned from G by replacng every edge uv of G by a bpartte graph,m (m may vary for each edge by dentfyng

More information

A Robust LS-SVM Regression

A Robust LS-SVM Regression PROCEEDIGS OF WORLD ACADEMY OF SCIECE, EGIEERIG AD ECHOLOGY VOLUME 7 AUGUS 5 ISS 37- A Robust LS-SVM Regresson József Valyon, and Gábor Horváth Abstract In comparson to the orgnal SVM, whch nvolves a quadratc

More information

Outline. Discriminative classifiers for image recognition. Where in the World? A nearest neighbor recognition example 4/14/2011. CS 376 Lecture 22 1

Outline. Discriminative classifiers for image recognition. Where in the World? A nearest neighbor recognition example 4/14/2011. CS 376 Lecture 22 1 4/14/011 Outlne Dscrmnatve classfers for mage recognton Wednesday, Aprl 13 Krsten Grauman UT-Austn Last tme: wndow-based generc obect detecton basc ppelne face detecton wth boostng as case study Today:

More information

Decision Strategies for Rating Objects in Knowledge-Shared Research Networks

Decision Strategies for Rating Objects in Knowledge-Shared Research Networks Decson Strateges for Ratng Objects n Knowledge-Shared Research etwors ALEXADRA GRACHAROVA *, HAS-JOACHM ER **, HASSA OUR ELD ** OM SUUROE ***, HARR ARAKSE *** * nsttute of Control and System Research,

More information

Intra-Parametric Analysis of a Fuzzy MOLP

Intra-Parametric Analysis of a Fuzzy MOLP Intra-Parametrc Analyss of a Fuzzy MOLP a MIAO-LING WANG a Department of Industral Engneerng and Management a Mnghsn Insttute of Technology and Hsnchu Tawan, ROC b HSIAO-FAN WANG b Insttute of Industral

More information

Insertion Sort. Divide and Conquer Sorting. Divide and Conquer. Mergesort. Mergesort Example. Auxiliary Array

Insertion Sort. Divide and Conquer Sorting. Divide and Conquer. Mergesort. Mergesort Example. Auxiliary Array Inserton Sort Dvde and Conquer Sortng CSE 6 Data Structures Lecture 18 What f frst k elements of array are already sorted? 4, 7, 1, 5, 1, 16 We can shft the tal of the sorted elements lst down and then

More information

A Simple Methodology for Database Clustering. Hao Tang 12 Guangdong University of Technology, Guangdong, , China

A Simple Methodology for Database Clustering. Hao Tang 12 Guangdong University of Technology, Guangdong, , China for Database Clusterng Guangdong Unversty of Technology, Guangdong, 0503, Chna E-mal: 6085@qq.com Me Zhang Guangdong Unversty of Technology, Guangdong, 0503, Chna E-mal:64605455@qq.com Database clusterng

More information

BIN XIA et al: AN IMPROVED K-MEANS ALGORITHM BASED ON CLOUD PLATFORM FOR DATA MINING

BIN XIA et al: AN IMPROVED K-MEANS ALGORITHM BASED ON CLOUD PLATFORM FOR DATA MINING An Improved K-means Algorthm based on Cloud Platform for Data Mnng Bn Xa *, Yan Lu 2. School of nformaton and management scence, Henan Agrcultural Unversty, Zhengzhou, Henan 450002, P.R. Chna 2. College

More information

From Comparing Clusterings to Combining Clusterings

From Comparing Clusterings to Combining Clusterings Proceedngs of the Twenty-Thrd AAAI Conference on Artfcal Intellgence (008 From Comparng Clusterngs to Combnng Clusterngs Zhwu Lu and Yuxn Peng and Janguo Xao Insttute of Computer Scence and Technology,

More information

SHAPE RECOGNITION METHOD BASED ON THE k-nearest NEIGHBOR RULE

SHAPE RECOGNITION METHOD BASED ON THE k-nearest NEIGHBOR RULE SHAPE RECOGNITION METHOD BASED ON THE k-nearest NEIGHBOR RULE Dorna Purcaru Faculty of Automaton, Computers and Electroncs Unersty of Craoa 13 Al. I. Cuza Street, Craoa RO-1100 ROMANIA E-mal: dpurcaru@electroncs.uc.ro

More information

NAG Fortran Library Chapter Introduction. G10 Smoothing in Statistics

NAG Fortran Library Chapter Introduction. G10 Smoothing in Statistics Introducton G10 NAG Fortran Lbrary Chapter Introducton G10 Smoothng n Statstcs Contents 1 Scope of the Chapter... 2 2 Background to the Problems... 2 2.1 Smoothng Methods... 2 2.2 Smoothng Splnes and Regresson

More information

Fuzzy Filtering Algorithms for Image Processing: Performance Evaluation of Various Approaches

Fuzzy Filtering Algorithms for Image Processing: Performance Evaluation of Various Approaches Proceedngs of the Internatonal Conference on Cognton and Recognton Fuzzy Flterng Algorthms for Image Processng: Performance Evaluaton of Varous Approaches Rajoo Pandey and Umesh Ghanekar Department of

More information