Understanding K-Means Non-hierarchical Clustering

SUNY Albany Technical Report
Ian Davidson, State University of New York, 1400 Washington Ave., Albany. DAVIDSON@CS.ALBANY.EDU

Abstract

The K-Means algorithm is a popular approach to finding clusters due to its simplicity of implementation and fast execution. It appears extensively in the machine learning literature and in most data-mining suites of tools. However, its implementation simplicity masks an algorithm whose behavior is complex. Understanding the algorithm's behavior is vital to using it successfully and to interpreting its results. In this paper we discuss nine insights into the behavior of the algorithm in the clustering context and illustrate how they affect preparation/representation of the data, formulation of the learning problem and, finally, how we interpret and use the resultant clusters. We also discuss implicit problems with K-Means clustering that cannot be overcome.

1. Introduction

The field of intrinsic classification attempts to take a collection of homogeneously described instances and group them into sub-populations commonly known as classes or clusters. Intrinsic classification is inherently density estimation, as one tries to find the various dense collections of instances in an m-dimensional space when m attributes represent each instance. Intrinsic classification is also parameter estimation, as we often identify the distribution of each of the m attributes for each cluster. We will use the terms cluster and class interchangeably.

Intrinsic classification has two popular sub-fields: clustering and mixture modeling, though the names are often used interchangeably. Clustering attempts to maximally separate the sub-populations by exclusively assigning (known as hard assignment) an instance to only one class. This is effectively finding the best set partition of the instances, or installing hard hyperplanes that partition the m-dimensional space. Colloquially, clustering attempts to find groups of instances so that the instances within a group are similar whilst being dissimilar to the instances in all other groups. The aim of mixture modeling is to model the underlying sub-populations, not to find classes that maximally separate the instances. Depending on the instance's degree of belonging (or distance) to a class, it is partially assigned (soft assignment) to one or more classes. Mixture modeling techniques are usually probabilistic and often Bayesian. To illustrate the difference, consider Figure 2. A clustering technique (like K-Means) will most likely find that there exists only one cluster. A mixture modeler will separate the two classes, because there is an underlying difference between the classes even though the instances are quite similar. We can say that a mixture modeler attempts to find the implicit structure in the data, whilst a clustering tool attempts to explicitly introduce a segmentation scheme to group together similar instances.

In this paper we describe the K-Means algorithm for clustering. The algorithm is extremely popular and appears in at least five popular commercial data mining suites [1]. We will outline and describe nine insights into K-Means clustering. Each insight affects at least one of: a) how we prepare the data for clustering, b) how we formulate the problem and use the clustering tool, and c) how we interpret the clusters and use them. The insights are both notes of caution (which can be overcome) and inherent limitations of the approach (which cannot be overcome). A Bayesian mixture modeling tool that uses the MML principle and MCMC sampling can in theory overcome the limitations of K-Means clustering [2]. The insights we will discuss are:

1. The effect of exclusive assignment
2. The inconsistency of the learning algorithm
3. The learning algorithm is not invariant to non-linear transformations of the data
4. The learning algorithm is not invariant to scale
5. The learning algorithm finds local minima of its loss function, which is the vector quantization error (distortion)
6. The learning algorithm provides biased class parameter estimates
7. The learning algorithm requires the a priori specification of the number of classes
8. Euclidean distance measures can unequally weight attributes
9. Non-parametric modeling of continuous attributes

We begin the paper by outlining the history of the algorithm, then describe the algorithm itself and discuss its computational behavior. We then illustrate and discuss our nine insights and how each insight affects application of the clustering algorithm. We conclude by illustrating the contribution of this paper.

2. The Algorithm

The K-Means clustering algorithm was postulated in a number of papers in the nineteen sixties [3][4]. For an m-attribute problem, each instance maps into an m-dimensional space. The cluster centroid describes the cluster and is a point in m-dimensional space around which instances belonging to the cluster occur. The distance from an instance to a cluster center is typically the Euclidean distance, though variations such as the Manhattan distance (step-wise distance) are common. As most implementations of K-Means clustering use Euclidean distance, this paper will focus on Euclidean space.

The K-Means algorithm consists of two primary steps:

1) The assignment step, where the instances are placed in the closest class.
2) The re-estimation step, where the class centroids are recalculated from the instances assigned to the class.

We repeat the two steps until convergence occurs, which is when the re-estimation step leads to minimal change in the class centroids. Algorithmically, the K-Means algorithm in its general form differs only slightly from the EM algorithm [5], though the loss functions and the results obtained using them can differ greatly. The first step of the K-Means algorithm involves exclusive assignment of instances to the closest class. The EM algorithm partially assigns instances to clusters, the portion of the instance assigned depending on how probable (or likely) it is that the class generated the object. The Euclidean distance between two instances X and Y, each represented by m continuous attributes, is:

d(X, Y) = \sqrt{(X_1 - Y_1)^2 + (X_2 - Y_2)^2 + \cdots + (X_m - Y_m)^2}    (1)

In the second step, the algorithm uses the attribute values of the instances assigned to a cluster to recalculate the cluster's centroid. We recompute the estimates from only those instances currently assigned to the class. Suppose an instance belongs to class A with probability a and to class B with probability b. If a is larger than b, then under the K-Means algorithm the instance is assigned entirely to class A, even if the difference between a and b is minimal. The instance contributes only to class A's centroid location.
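
To make the two steps concrete, the following is a minimal NumPy sketch of the procedure (an illustration, not the implementation studied in this report): centroids are initialized from randomly chosen instances, the assignment and re-estimation steps alternate until the centroids stop moving, and the distortion of equation (2) is reported. The initialization strategy and function names are assumptions made for the example.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal K-Means: alternate exclusive assignment and centroid re-estimation."""
    rng = np.random.default_rng(seed)
    # Initialize centroids with k distinct instances chosen at random (one common choice).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: each instance goes to the closest centroid (Euclidean distance, eq. 1).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Re-estimation step: recompute each centroid from the instances assigned to it.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Convergence: the re-estimation step leads to minimal change in the centroids.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Distortion / vector quantization error (eq. 2): squared distances to assigned centroids.
    distortion = float(((X - centroids[labels]) ** 2).sum())
    return centroids, labels, distortion
```

Calling kmeans(X, k) on an (n, m) data matrix returns the centroids, the exclusive assignment of each instance and the final distortion; the result depends on the random initialization, a point returned to in insight 4.5.
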
3. Computational Behavior of the Algorithm

The K-Means algorithm aims to find the minimum distortion within each cluster, for all clusters. The distortion is also known as the vector quantization error. Let the k classes partition the instances into the subsets C_1, ..., C_k, let the cluster centroids be w_1, ..., w_k and let the n elements to cluster be S_1, ..., S_n. The distortion or vector quantization error that the K-Means algorithm attempts to minimize is:

E = \sum_{j=1}^{k} \sum_{S_i \in C_j} \| S_i - w_j \|^2    (2)

The mathematically trivial solution that minimizes this expression is to have a cluster for each instance. Recent work has investigated the algorithm from an information-theoretic perspective [6]. The major insight from this work is that any algorithm that attempts to minimize the distortion must manage a trade-off the authors refer to as the information-modeling trade-off. The trade-off is between distributing the instances amongst the k clusters and how well each cluster models the instances assigned to it.

For a two-class problem, the expectation of the partition loss with respect to the sampling density Q (the true distribution of the objects) is:

\mathrm{E}_{x \sim Q}[\chi_F(x)] = \omega_0 \, KL(Q_0 \| P_0) + \omega_1 \, KL(Q_1 \| P_1) - I_F(Q)    (3)

where Q_0, Q_1 are the sampling densities of the first and second sub-populations respectively, P_0, P_1 are the density estimates of the first and second clusters respectively, \omega_0, \omega_1 are the weights of the first and second clusters respectively, and F is the set partition imparted by the clustering solution.

The first two terms of this expression are the Kullback-Leibler distance between the sampling density and the hypothesized sub-population density for each cluster. This measures how well the clusters model the two sub-populations individually. The third term is how much the partition reduces the entropy of Q, and measures the effectiveness of distributing the instances amongst the clusters. From equation (3) we see that the individual classes are modeled separately, as is the distribution of the instances amongst the classes. There are no terms considering the interaction between the classes.
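
The trivial minimum mentioned above is easy to verify numerically. The short sketch below (illustrative only; the data and helper name are invented for the example) evaluates the distortion of equation (2) for two partitions of the same four instances:

```python
import numpy as np

def distortion(X, labels, centroids):
    """Vector quantization error (eq. 2): squared Euclidean distance of each
    instance to the centroid of the cluster it is assigned to."""
    return float(((X - centroids[labels]) ** 2).sum())

# Toy data: one cluster per instance drives the loss to zero, which is why raw
# distortion cannot be compared across different values of k (see insights 4.2 and 4.7).
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 4.9]])

one_per_instance = distortion(X, np.arange(len(X)), X.copy())   # k = n, every centroid on its instance
two_clusters = distortion(X, np.array([0, 0, 1, 1]),
                          np.array([X[:2].mean(axis=0), X[2:].mean(axis=0)]))
print(one_per_instance, two_clusters)   # 0.0 versus a small positive value
```
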

4. Nine Insights into K-Means Clustering

4.1 The Use of Exclusive Assignment

The assignment mechanism of K-Means clustering requires that an instance belong to only one of the clusters. Consider the following situations:

1. If an instance could belong to two classes, it must be assigned to only one.
2. Instances that do not belong to any particular cluster are still assigned to one. These nuisance or outlier instances can affect the position of class centroids in relatively small clusters.

This has two implications. Firstly, if there are two highly overlapping classes (i.e. Figure 2) then K-Means will not be able to resolve that there are two clusters. Secondly, the class centroids will be biased away from the true class centroids (this is described in a later insight). If the size of a class is small or the number of nuisance instances is large, then the cluster centroids may be distorted from their true values. The inability to model overlapping classes is inherent to the K-Means technique and cannot be overcome. We can see this clearly in equation (3), as the classes are modeled separately and an instance can only belong to one class. Therefore, instances in the regions where the classes overlap are problematic. The effect of nuisance instances on the cluster centroids can be partially overcome if the cluster sizes are relatively large.

4.2 The learning algorithm is Inconsistent

Consider a model space Θ_k, which contains models of only k classes, in which θ_TRUE is the true model that generated the instances. Initially there may be only a small number of instances, so θ_TRUE is not the most probable model. If a learning algorithm is consistent then we find that:

\lim_{n \to \infty} P(\theta_{TRUE}) = 1, where n is the number of observations    (4)

That is, as the amount of data increases, the probability that the true model is the most probable model approaches certainty. An inconsistent learning algorithm does not have this property, and results in overlooking the true model in favor of increasingly complex (more classes) models. Consider the loss function of K-Means, equation (2): the (trivial) optimal solution is to have a cluster for each instance. It is precisely this bias which leads the learning algorithm to consistently favor increasingly complicated models as more data becomes available.

4.3 The learning algorithm is not invariant to non-linear transformations

Learning algorithms that are invariant to non-linear transformations have the desirable property that, regardless of the data representation, the most probable model will be the same. Consider the representation of the same data using Euclidean coordinates and polar coordinates. We would hope that in both representations the best model would be the same, as the data is the same. However, this is not necessarily the case with all learning algorithms. Euclidean distance is not invariant to non-linear transformations, and as the K-Means algorithm uses this distance to assign instances, we find that different representations of the same data can give different results. Therefore, the result of the K-Means clustering algorithm will depend on the representation of the data as well as the intrinsic structure amongst the instances.

4.4 The learning algorithm is not invariant to scale

From the definition of the Euclidean distance (1) we can see that attributes with a larger scale provide a larger contribution to the distance. This means attributes with a larger range of values contribute more to the distance function than those with smaller scales. A common solution to this is to transform each attribute to Z scores (x_j' = (x_j - µ_j)/σ_j) or 0-1 scores (x_j' = (x_j - min_j)/(max_j - min_j)). This transforms all attributes to be within the approximate range [-1, 1] and [0, 1] respectively. However, this transformation effectively gives more weight to those attributes with a smaller standard deviation.
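
As an illustration of the scale sensitivity just described, the following sketch (not from the report; the example values are invented) compares raw and standardized Euclidean distances when one attribute is measured on a much larger scale than the other:

```python
import numpy as np

def z_score(X):
    """Standardize each attribute: (x - mean) / std, so every column has unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def min_max(X):
    """Rescale each attribute to [0, 1]: (x - min) / (max - min)."""
    return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# One attribute in metres, one in millimetres: the second dominates eq. (1) unless rescaled.
X = np.array([[1.2, 3400.0], [1.1, 5200.0], [5.3, 3500.0], [5.2, 5100.0]])
print(np.linalg.norm(X[0] - X[2]))                    # dominated by the large-scale column
print(np.linalg.norm(z_score(X)[0] - z_score(X)[2]))  # both attributes now contribute comparably
```
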
4.5 The learning algorithm provides the local optima of the vector quantization error (distortion)

The vector quantization error, equation (2), is locally minimized by the K-Means algorithm. The vector quantization error is the error surface (or objective function) which K-Means tries to minimize. The algorithm performs a gradient descent on this error function and can therefore become stuck in local minima. For most interesting practical problems the error surface will contain many local minima [7].
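
A common practical mitigation, which follows from this insight but is not prescribed by the report, is to restart the algorithm from several random initializations and keep the solution with the lowest distortion. A sketch assuming scikit-learn's KMeans, whose inertia_ attribute is the distortion of equation (2):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated blobs; an unlucky initialization can still merge two of them.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ([0, 0], [4, 0], [2, 3])])

# Restart from several random initializations and keep the solution with the lowest
# distortion (inertia_ is the vector quantization error of eq. 2).
runs = [KMeans(n_clusters=3, n_init=1, init="random", random_state=s).fit(X)
        for s in range(10)]
best = min(runs, key=lambda km: km.inertia_)
# Runs started from different initializations may converge to different local minima.
print(sorted(round(km.inertia_, 1) for km in runs))
print(best.cluster_centers_)
```
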

4.6 The learning algorithm provides biased class parameter estimates

The use of exclusive assignment means the algorithm does not model overlapping classes well. Consider a univariate example of two classes created from the Gaussian distributions N(µ=0, σ=1) and N(µ=2, σ=1). We will find that instances generated from each of the classes will overlap in the same region of instance space. Diagrammatically, we can illustrate the situation in Figure 1. Due to the requirement of exclusive assignment, instances generated from the right-hand tail of class 1 will be assigned to class 2, and instances from the left-hand tail of class 2 will be assigned to class 1. When the class centroids are then calculated, they will deviate from the generation mechanisms. The mean of class 1 will be under-estimated and the mean of class 2 over-estimated. The algorithm's estimates of the two sub-population means are biased.

Figure 1. Two overlapping classes centered on N(µ=0, σ=1) and N(µ=2, σ=1).

4.7 The learning algorithm requires the a priori specification of the number of classes

The K-Means algorithm requires the a priori specification of the number of classes. The model space it explores is all possible models with k classes, which effectively removes k (an important unknown in intrinsic classification) from the problem. One can select a desirable range of k and use the algorithm for each value within the range, but because the algorithm finds only a local minimum, this process would need to be repeated many times for each value to get the best model for each k. However, we cannot easily compare models obtained for different values of k. The distortion for models with a large k will have a greater potential to be lower than for models with a small k. We are somewhat stuck: the algorithm requires a specification of k, but we cannot compare the loss function across different values of k. Of course, the models could be compared qualitatively in terms of fitness for business purposes.

4.8 Euclidean distance measures can unequally weight underlying factors

Consider the situation where our instances are represented by, say, ten attributes. Suppose that these attributes are manifestations of two underlying factors A and B, and that eight of the attributes represent A and two represent B. That is, the attributes representing a factor are highly correlated. As the number of attributes representing each factor is not equal, the algorithm's dependence on the Euclidean distance (1) will weight factor A more highly than B in the analysis. There are two methods of overcoming this problem. Firstly, one can perform factor analysis to reduce the ten attributes to two attributes (one for each factor) to use in the analysis; that is, the input to the clustering process is two attributes, not ten. Alternatively, one may directly compute the correlation between each and every attribute and use a modified form of Euclidean distance called the Mahalanobis distance [8]. The Mahalanobis distance effectively performs a transformation on a Euclidean space so that correlated attributes are closer together and therefore do not contribute overwhelmingly to the distance measure. In Euclidean distance the axes are at right angles to each other; the Mahalanobis distance effectively transforms the axes so that the angle between two axes is inversely proportional to the correlation between the two attributes. However, the Mahalanobis distance is computationally intensive to compute as the number of attributes increases.
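
A sketch of the Mahalanobis idea (illustrative; the synthetic two-factor data is invented for the example): distances are computed after whitening by the inverse covariance of the attributes, so the eight correlated attributes no longer dominate the measure.

```python
import numpy as np

rng = np.random.default_rng(0)
# Eight highly correlated attributes standing in for factor A, two for factor B.
a = rng.normal(size=(200, 1))
b = rng.normal(size=(200, 1))
X = np.hstack([a + 0.05 * rng.normal(size=(200, 8)),
               b + 0.05 * rng.normal(size=(200, 2))])

VI = np.linalg.inv(np.cov(X, rowvar=False))   # inverse covariance of the attributes

def mahalanobis(u, v, VI):
    """Mahalanobis distance: Euclidean distance after decorrelating the attributes."""
    d = u - v
    return float(np.sqrt(d @ VI @ d))

u, v = X[0], X[1]
print(np.linalg.norm(u - v))   # Euclidean: factor A effectively counted eight times
print(mahalanobis(u, v, VI))   # Mahalanobis: correlated attributes are down-weighted
```
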
4.9 Non-parametric modeling of continuous attributes

As stated earlier, intrinsic classification is density estimation with the aim of finding patterns which isolate dense sub-populations of instances. How we represent the sub-populations limits whether we will be able to find them. The K-Means algorithm specifies a class by describing its centroid. However, it is well established that continuous attributes can be successfully modeled as being drawn from some parametric population such as the Gaussian, Poisson or binomial. This provides a richer class description language and hence the ability to specify more complex patterns. As this method of modeling attributes is ignored in K-Means clustering, some patterns which exist in the population are overlooked. Consider the univariate two-class situation shown in Figure 2.

Figure 2. Two overlapping classes with the same mean.

The classes differ only in their standard deviation, as their means are the same. However, the K-Means algorithm would not differentiate between the two classes because it does not model the specific aspect which differentiates them.
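
To see this limitation concretely, the sketch below (an illustration assuming scikit-learn, not part of the original report) compares K-Means with a two-component Gaussian mixture on the situation of Figure 2: two classes with the same mean and different standard deviations.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two univariate classes with the same mean (0) but different standard deviations (1 and 4).
X = np.concatenate([rng.normal(0, 1, 500), rng.normal(0, 4, 500)]).reshape(-1, 1)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
gm = GaussianMixture(n_components=2, random_state=0).fit(X)

# K-Means only places two centroids, so it tends to split the data near x = 0
# rather than recovering the two generating distributions.
print(km.cluster_centers_.ravel())            # typically one negative and one positive centre
# The mixture model also estimates a variance per class, so it can recover a narrow
# component and a wide component centred near the same mean.
print(gm.means_.ravel(), gm.covariances_.ravel())
```

The mixture model succeeds here precisely because its class description language includes the variance, the aspect that actually differentiates the two classes.
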

5. Conclusion

A classic dilemma posed by philosophers can be paraphrased as: do the theories/models we discover describe the implicit structure that exists in nature, or is it structure that we force onto it? If we use K-Means clusterings to induce the theory/model, then the answer is the latter. K-Means clustering is not oriented towards finding the implicit structure in the instances; rather, it finds structure within the bounds we explicitly state for it. This is an important message to understand: the K-Means algorithm produces results which are an interaction between the intrinsic structure in the instances and how we represent the data and define the clustering problem.

To use the K-Means clustering algorithm successfully, care must be taken to fully understand the behavior of the algorithm. Our nine insights help to provide a more complete understanding of the algorithm for practitioners. We raise several problems with the algorithm. Some can be overcome, at least partially, and we suggest ways to achieve this; the remaining problems cannot be overcome, and their effects need to be considered when representing the data, formulating the problem and interpreting the results of the clustering process.

References

[1] The Two Crows Report: 1999. Available at http://www.twocrows.com/
[2] Davidson, I., Minimum Message Length Clustering Using Gibbs Sampling, The 16th Uncertainty in Artificial Intelligence Conference, Stanford University, 2000.
[3] Max, J., Quantizing for Minimum Distortion, IEEE Transactions on Information Theory, 6, pages 7-12, 1960.
[4] MacQueen, J., Some Methods for Classification and Analysis of Multivariate Observations, Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281-297, 1967.
[5] Dempster, A.P. et al., Maximum Likelihood from Incomplete Data via the EM Algorithm, Journal of the Royal Statistical Society B, Vol. 39, pages 1-39, 1977.
[6] Kearns, M., Mansour, Y., Ng, A., An Information-Theoretic Analysis of Hard and Soft Assignment Methods for Clustering, Proceedings of the International Conference on Uncertainty in Artificial Intelligence, 1996.
[7] Gilks, W.R., Richardson, S. and Spiegelhalter, D., Markov Chain Monte Carlo in Practice, Chapman and Hall, 1996.
[8] Mahalanobis, P.C., On Tests and Measures of Group Divergence, Journal of the Asiatic Society of Bengal, 26:541, 1930.