This excerpt is from Foundations of Statistical Natural Language Processing, Christopher D. Manning and Hinrich Schütze, 1999, The MIT Press.

14 Clustering

Clustering algorithms partition a set of objects into groups or clusters. Figure 14.1 gives an example of a clustering of 22 high-frequency words from the Brown corpus. The figure is an example of a dendrogram, a branching diagram where the apparent similarity between nodes at the bottom is shown by the height of the connection which joins them. Each node in the tree represents a cluster that was created by merging two child nodes.

Figure 14.1 A single-link clustering of 22 frequent English words represented as a dendrogram. The words clustered are be, not, he, I, it, this, the, his, a, and, but, in, on, with, for, at, from, of, to, as, is, was.

For example, in and on form a cluster and so do with and for. These two subclusters are then merged into one cluster with four objects. The height of a node corresponds to the decreasing similarity of the two clusters that are being merged (or, equivalently, to the order in which the merges were executed). The greatest similarity between any two clusters is the similarity between in and on, corresponding to the lowest horizontal line in the figure. The least similarity is between be and the cluster with the 21 other words, corresponding to the highest horizontal line in the figure.

While the objects in the clustering are all distinct as tokens, normally objects are described and clustered using a set of features and values (often known as the data representation model), and multiple objects may have the same representation in this model. We will therefore define our clustering algorithms to work over bags: objects like sets, except that they allow multiple identical items.

The goal is to place similar objects in the same group and to assign dissimilar objects to different groups. What is the notion of similarity between words being used here? First, the left and right neighbors of tokens of each word in the Brown corpus were tallied. These distributions give a fairly true implementation of Firth's idea that one can categorize a word by the words that occur around it. But now, rather than looking for distinctive collocations, as in

chapter 5, we are capturing and using the whole distributional pattern of the word. Word similarity was then measured as the degree of overlap in the distributions of these neighbors for the two words in question. For example, the similarity between in and on is large because both words occur with similar left and right neighbors (both are prepositions and tend to be followed by articles or other words that begin noun phrases, for instance). The similarity between is and he is small because they share fewer immediate neighbors due to their different grammatical functions. Initially, each word formed its own cluster, and then at each step in the

clustering, the two clusters that are closest to each other are merged into a new cluster.

There are two main uses for clustering in Statistical NLP. The figure demonstrates the use of clustering for exploratory data analysis (EDA). Somebody who does not know English would be able to derive a crude grouping of words into parts of speech from figure 14.1, and this insight may make subsequent analysis easier. Or we can use the figure to evaluate neighbor overlap as a measure of part-of-speech similarity, assuming we know what the correct parts of speech are. The clustering makes apparent both strengths and weaknesses of a neighbor-based representation. It works well for prepositions (which are all grouped together), but seems inappropriate for other words such as this and the, which are not grouped together with grammatically similar words.

Exploratory data analysis is an important activity in any pursuit that deals with quantitative data. Whenever we are faced with a new problem and want to develop a probabilistic model or just understand the basic characteristics of the phenomenon, EDA is the first step. It is always a mistake to not first spend some time getting a feel for what the data at hand look like. Clustering is a particularly important technique for EDA in Statistical NLP because there is often no direct pictorial visualization for linguistic objects. Other fields, in particular those dealing with numerical or geographic data, often have an obvious visualization, for example, maps of the incidence of a particular disease in epidemiology. Any technique that lets one visualize the data better is likely to bring to the fore new generalizations and to stop one from making wrong assumptions about the data. There are other well-known techniques for displaying a set of objects in a two-dimensional plane (such as pages of books); see section 14.3 for references. When used for EDA, clustering is thus only one of a number of techniques that one might employ, but it has the advantage that it can produce a richer hierarchical structure. It may also be more convenient to work with, since visual displays are more complex: one has to worry about how to label objects shown on the display and, in contrast to clustering, one cannot give a comprehensive description of the object next to its visual representation.

The other main use of clustering in NLP is for generalization. We referred to this as forming bins or equivalence classes in section 6.1. But there we grouped data points in certain predetermined ways, whereas here we induce the bins from data.
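The neighbor-overlap similarity behind figure 14.1 can be made concrete with a short sketch. This is not the authors' implementation; it is a minimal illustration under simplifying assumptions: only immediate left and right neighbors are tallied, the count vectors are compared with the cosine measure, and the corpus below is a toy example rather than the Brown corpus.

from collections import Counter
from math import sqrt

def neighbor_profile(tokens, word):
    """Tally the left and right neighbors of every token of `word`."""
    profile = Counter()
    for i, tok in enumerate(tokens):
        if tok == word:
            if i > 0:
                profile[("L", tokens[i - 1])] += 1
            if i + 1 < len(tokens):
                profile[("R", tokens[i + 1])] += 1
    return profile

def cosine(p, q):
    """Cosine similarity of two sparse count vectors stored as Counters."""
    dot = sum(p[f] * q[f] for f in p.keys() & q.keys())
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

# Toy corpus; on a corpus like Brown, `in` and `on` would receive very
# similar neighbor profiles, which is what groups them in the dendrogram.
corpus = "the cat sat on the mat and the dog sat in the house".split()
print(cosine(neighbor_profile(corpus, "on"), neighbor_profile(corpus, "in")))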

As an example, suppose we want to determine the correct preposition to use with the noun Friday for translating a text from French into English. Suppose also that we have an English training text that contains the phrases on Sunday, on Monday, and on Thursday, but not on Friday. That on is the correct preposition to use with Friday can be inferred as follows. If we cluster English nouns into groups with similar syntactic and semantic environments, then the days of the week will end up in the same cluster. This is because they share environments like "until day-of-the-week," "last day-of-the-week," and "day-of-the-week morning." Under the assumption that an environment that is correct for one member of the cluster is also correct for the other members of the cluster, we can infer the correctness of on Friday from the presence of on Sunday, on Monday and on Thursday. So clustering is a way of learning. We group objects into clusters and generalize from what we know about some members of the cluster (like the appropriateness of the preposition on) to others.

Another way of partitioning objects into groups is classification, which is the subject of chapter 16. The difference is that classification is supervised and requires a set of labeled training instances for each group. Clustering does not require training data and is hence called unsupervised because there is no teacher who provides a training set with class labels. The result of clustering only depends on natural divisions in the data, for example the different neighbors of prepositions, articles and pronouns in the above dendrogram, not on any pre-existing categorization scheme. Clustering is sometimes called automatic or unsupervised classification, but we will not use these terms in order to avoid confusion.

There are many different clustering algorithms, but they can be classified into a few basic types. There are two types of structures produced by clustering algorithms, hierarchical clusterings and flat or non-hierarchical clusterings. Flat clusterings simply consist of a certain number of clusters and the relation between clusters is often undetermined. Most algorithms that produce flat clusterings are iterative. They start with a set of initial clusters and improve them by iterating a reallocation operation that reassigns objects. A hierarchical clustering is a hierarchy with the usual interpretation that each node stands for a subclass of its mother's node. The leaves of the tree are the single objects of the clustered set. Each node represents the cluster that contains all the objects of its descendants. Figure 14.1 is an example of a hierarchical cluster structure.

Another important distinction between clustering algorithms is whether they perform a soft clustering or hard clustering. In a hard assignment, each object is assigned to one and only one cluster. Soft assignments allow degrees of membership and membership in multiple clusters. In a probabilistic framework, an object x_i has a probability distribution P(· | x_i) over clusters c_j, where P(c_j | x_i) is the probability that x_i is a member of c_j. In a vector space model, degree of membership in multiple clusters can be formalized as the similarity of a vector to the center of each cluster. In a vector space, the center of the M points in a cluster c, otherwise known as the centroid or center of gravity, is the point:

(14.1)   \vec{\mu} = \frac{1}{M} \sum_{\vec{x} \in c} \vec{x}

In other words, each component of the centroid vector \vec{\mu} is simply the average of the values for that component in the M points in c.

In hierarchical clustering, assignment is usually hard. In non-hierarchical clustering, both types of assignment are common. Even most soft assignment models assume that an object is assigned to only one cluster. The difference from hard clustering is that there is uncertainty about which cluster is the correct one. There are also true multiple assignment models, so-called disjunctive clustering models, in which an object can truly belong to several clusters. For example, there may be a mix of syntactic and semantic categories in word clustering, and book would fully belong to both the semantic "object" and the syntactic "noun" category. We will not cover disjunctive clustering models here. See (Saund 1994) for an example of a disjunctive clustering model.

Nevertheless, it is worth mentioning at the beginning the limitations that follow from the assumptions of most clustering algorithms. A hard clustering algorithm has to choose one cluster to which to assign every item. This is rather unappealing for many problems in NLP. It is a commonplace that many words have more than one part of speech. For instance play can be a noun or a verb, and fast can be an adjective or an adverb. And many larger units also show mixed behavior. Nominalized clauses show some verb-like (clausal) behavior and some noun-like (nominalization) behavior. And we suggested in chapter 7 that several senses of a word were often simultaneously activated. Within a hard clustering framework, the best we can do in such cases is to define additional clusters corresponding to words that can be either nouns or verbs, and so on. Soft clustering is therefore somewhat more appropriate for many problems in NLP, since a soft clustering algorithm can assign an ambiguous word like play partly to the cluster of verbs and partly to the cluster of nouns.
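As a small illustration of the centroid in (14.1) and of the difference between hard and soft assignment, the sketch below computes two centroids and then converts an object's distances to them into graded membership weights. The inverse-distance weighting is just one simple choice made for the example; the chapter's own soft models are probabilistic (see the EM algorithm in section 14.2.2).

import numpy as np

def centroid(points):
    """Equation (14.1): the component-wise mean of the points in a cluster."""
    return np.mean(points, axis=0)

c1 = centroid(np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]]))
c2 = centroid(np.array([[4.0, 4.0], [5.0, 4.0], [4.0, 5.0]]))
x = np.array([2.0, 2.0])                      # an "ambiguous" object

# Hard assignment: the single closest centroid wins.
hard = 1 if np.linalg.norm(x - c1) <= np.linalg.norm(x - c2) else 2

# Soft assignment: weights inversely related to distance, normalized to sum to 1.
d1, d2 = np.linalg.norm(x - c1), np.linalg.norm(x - c2)
w1, w2 = 1.0 / d1, 1.0 / d2
soft = (w1 / (w1 + w2), w2 / (w1 + w2))

print("hard:", hard, "soft:", soft)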

The remainder of the chapter looks in turn at various hierarchical and non-hierarchical clustering methods, and some of their applications in NLP. In table 14.1, we briefly characterize some of the features of clustering algorithms for the reader who is just looking for a quick solution to an immediate clustering need. For a discussion of the pros and cons of different clustering algorithms see Kaufman and Rousseeuw (1990). The main notations that we will use in this chapter are summarized in table 14.2.

Hierarchical clustering:
- Preferable for detailed data analysis
- Provides more information than flat clustering
- No single best algorithm (each of the algorithms we describe has been found to be optimal for some application)
- Less efficient than flat clustering (for n objects, one minimally has to compute an n × n matrix of similarity coefficients, and then update this matrix as one proceeds)

Non-hierarchical clustering:
- Preferable if efficiency is a consideration or data sets are very large
- K-means is the conceptually simplest method and should probably be used first on a new data set because its results are often sufficient
- K-means assumes a simple Euclidean representation space, and so cannot be used for many data sets, for example, nominal data like colors
- In such cases, the EM algorithm is the method of choice. It can accommodate definition of clusters and allocation of objects based on complex probabilistic models.

Table 14.1 A summary of the attributes of different clustering algorithms.

Notation                         Meaning
X = {x_1, ..., x_n}              the set of n objects to be clustered
C = {c_1, ..., c_j, ..., c_k}    the set of clusters (or cluster hypotheses)
P(X)                             the powerset (set of subsets) of X
sim(·,·)                         similarity function
S(·)                             group-average similarity function
m                                dimensionality of the vector space R^m
M_j                              number of points in cluster c_j
\vec{s}(c_j)                     vector sum of the vectors in cluster c_j
N                                number of word tokens in the training corpus
w_{i,...,j}                      tokens i through j of the training corpus
π(·)                             function assigning words to clusters
C(w^1 w^2)                       number of occurrences of the string w^1 w^2
C(c^1 c^2)                       number of occurrences of a string w^1 w^2 such that π(w^1) = c^1 and π(w^2) = c^2
\vec{\mu}_j                      centroid of cluster c_j
Σ_j                              covariance matrix of cluster c_j

Table 14.2 Symbols used in the clustering chapter.

14.1 Hierarchical Clustering

The tree of a hierarchical clustering can be produced either bottom-up, by starting with the individual objects and grouping the most similar

ones, or top-down, whereby one starts with all the objects and divides them into groups so as to maximize within-group similarity.

Figure 14.2 describes the bottom-up algorithm, also called agglomerative clustering. Agglomerative clustering is a greedy algorithm that starts with a separate cluster for each object (lines 3 and 4). In each step, the two most similar clusters are determined (line 8) and merged into a new cluster (line 9). The algorithm terminates when one large cluster containing all objects of S has been formed, which then is the only remaining cluster in C (line 7).

1  Given: a set X = {x_1, ..., x_n} of objects
2         a function sim: P(X) × P(X) → R
3  for i := 1 to n do
4      c_i := {x_i}
   end
5  C := {c_1, ..., c_n}
6  j := n + 1
7  while |C| > 1
8      (c_{n_1}, c_{n_2}) := argmax_{(c_u, c_v) ∈ C × C} sim(c_u, c_v)
9      c_j := c_{n_1} ∪ c_{n_2}
10     C := (C \ {c_{n_1}, c_{n_2}}) ∪ {c_j}
11     j := j + 1

Figure 14.2 Bottom-up hierarchical clustering.

Let us flag one possibly confusing issue. We have phrased the clustering algorithm in terms of similarity between clusters, and therefore we join things with maximum similarity (line 8). Sometimes people think in terms of distances between clusters, and then you want to join things that are the minimum distance apart. So it is easy to get confused between whether you are taking maximums or minimums. It is straightforward to produce a similarity measure from a distance measure d, for example by sim(x, y) = 1/(1 + d(x, y)).

1  Given: a set X = {x_1, ..., x_n} of objects
2         a function coh: P(X) → R
3         a function split: P(X) → P(X) × P(X)
4  C := {X} (= {c_1})
5  j := 1
6  while there is a c_i ∈ C with |c_i| > 1
7      c_u := argmin_{c_v ∈ C} coh(c_v)
8      (c_{j+1}, c_{j+2}) := split(c_u)
9      C := (C \ {c_u}) ∪ {c_{j+1}, c_{j+2}}
10     j := j + 2

Figure 14.3 Top-down hierarchical clustering.

Figure 14.3 describes top-down hierarchical clustering, also called divisive clustering (Jain and Dubes 1988: 57). Like agglomerative clustering,

it is a greedy algorithm. Starting from a cluster with all objects (line 4), each iteration determines which cluster is least coherent (line 7) and splits this cluster (line 8). Clusters with similar objects are more coherent than clusters with dissimilar objects. For example, a cluster with several identical members is maximally coherent.

Hierarchical clustering only makes sense if the similarity function is monotonic:

(14.2) Monotonicity.  \forall c, c', c'' \subseteq S: \min(\text{sim}(c, c'), \text{sim}(c, c'')) \geq \text{sim}(c, c' \cup c'')

In other words, the operation of merging is guaranteed to not increase similarity. A similarity function that does not obey this condition makes the hierarchy uninterpretable, since dissimilar clusters which are placed far apart in the tree can become similar in subsequent merging, so that closeness in the tree does not correspond to conceptual similarity anymore.
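The pseudocode of figure 14.2 translates almost line by line into a runnable sketch. The version below is only an illustration, not the book's implementation: objects are represented as small feature sets, and cluster similarity is taken to be the best pairwise feature overlap (a single-link-style choice; the standard similarity functions are discussed in section 14.1.1).

from itertools import combinations

def overlap(x, y):
    """Similarity of two objects: the number of features they share."""
    return len(x & y)

def cluster_sim(ci, cj):
    """Single-link-style similarity: the best pairwise similarity across clusters."""
    return max(overlap(x, y) for x in ci for y in cj)

def agglomerate(objects):
    """Bottom-up clustering as in figure 14.2; returns the sequence of merged clusters."""
    clusters = [frozenset([o]) for o in objects]      # lines 3-5: one cluster per object
    history = []
    while len(clusters) > 1:                          # line 7
        a, b = max(combinations(clusters, 2),         # line 8: most similar pair
                   key=lambda pair: cluster_sim(*pair))
        merged = a | b                                # line 9
        clusters = [c for c in clusters if c not in (a, b)] + [merged]   # line 10
        history.append(merged)
    return history

objects = [frozenset({"det", "sg"}), frozenset({"det", "pl"}),
           frozenset({"prep", "loc"}), frozenset({"prep", "tmp"})]
for merged in agglomerate(objects):
    print([sorted(member) for member in merged])

Each element of the history corresponds to one internal node of the dendrogram, so the output is the cluster hierarchy rather than a single flat partition.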

Most hierarchical clustering algorithms follow the schemes outlined in figures 14.2 and 14.3. The following sections discuss specific instances of these algorithms.

Function        Definition
single link     similarity of the two most similar members
complete link   similarity of the two least similar members
group-average   average similarity between members

Table 14.3 Similarity functions used in clustering. Note that for group-average clustering, we average over all pairs, including pairs from the same cluster. For single-link and complete-link clustering, we quantify over the subset of pairs from different clusters.

14.1.1 Single-link and complete-link clustering

Table 14.3 shows three similarity functions that are commonly used in information retrieval (van Rijsbergen 1979: 36ff). Recall that the similarity function determines which clusters are merged in each step in bottom-up clustering. In single-link clustering the similarity between two clusters is the similarity of the two closest objects in the clusters. We search over all pairs of objects that are from the two different clusters and select the pair with the greatest similarity. Single-link clusterings have clusters with good local coherence since the similarity function is locally defined. However, clusters can be elongated or straggly, as shown in figure 14.6.

Figure 14.4 A cloud of points in a plane.

Figure 14.5 Intermediate clustering of the points in figure 14.4.

To see why single-link clustering produces such elongated clusters, observe first that the best moves in figure 14.4 are to merge the two top pairs of points and then the two bottom pairs of points, since the similarities a/b, c/d, e/f, and g/h are the largest for any pair of objects. This gives us the clusters in figure 14.5. The next two steps are to first merge the top two clusters, and then the bottom two clusters, since the pairs b/c and f/g are closer than all others that are not in the same cluster (e.g., closer than b/f and c/g). After doing these two merges we get figure 14.6. We end up with two clusters that

are locally coherent (meaning that close objects are in the same cluster), but which can be regarded as being of bad global quality. An example of bad global quality is that a is much closer to e than to d, yet a and d are in the same cluster whereas a and e are not. The tendency of single-link clustering to produce this type of elongated cluster is sometimes called the chaining effect, since we follow a chain of large similarities without taking into account the global context.

Single-link clustering is closely related to the minimum spanning tree (MST) of a set of points. The MST is the tree that connects all objects with edges that have the largest similarities. That is, of all trees connecting the set of objects, the sum of the lengths of the edges of the MST is minimal.
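The single-link and complete-link criteria of table 14.3 can be written down in a few lines. The sketch below converts Euclidean distance to a similarity by sim = 1/(1 + d), as suggested earlier, and uses invented coordinates in the spirit of figure 14.4 (two horizontal rows of points), not the figure's actual values.

import numpy as np

def sim(x, y):
    """Distance turned into a similarity, as suggested in the text: 1 / (1 + d)."""
    return 1.0 / (1.0 + np.linalg.norm(np.asarray(x) - np.asarray(y)))

def single_link(ci, cj):
    """Similarity of the two most similar (closest) members."""
    return max(sim(x, y) for x in ci for y in cj)

def complete_link(ci, cj):
    """Similarity of the two least similar (most distant) members."""
    return min(sim(x, y) for x in ci for y in cj)

top = [(0.0, 4.0), (1.0, 4.0), (3.0, 4.0), (4.0, 4.0)]       # invented points
bottom = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (4.0, 0.0)]

print("single-link:  ", single_link(top, bottom))    # driven by the closest pair
print("complete-link:", complete_link(top, bottom))  # driven by the farthest pair

Because single-link only looks at the closest pair, a chain of nearby points can pull two globally distant groups together; complete-link, which looks at the farthest pair, resists exactly this chaining.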

Figure 14.6 Single-link clustering of the points in figure 14.4.

Figure 14.7 Complete-link clustering of the points in figure 14.4.

A single-link hierarchy can be constructed top-down from an MST by removing the longest edge in the MST so that two unconnected components are created, corresponding to two subclusters. The same operation is then recursively applied to these two subclusters (which are also MSTs).

Complete-link clustering has a similarity function that focuses on global cluster quality (as opposed to locally coherent clusters as in the case of single-link clustering). The similarity of two clusters is the similarity of their two most dissimilar members. Complete-link clustering avoids elongated clusters. For example, in complete-link clustering the two best merges in figure 14.5 are to merge the two left clusters, and then the

two right clusters, resulting in the clusters in figure 14.7. Here, the minimally similar pair for the left clusters (a/f or b/e) is tighter than the minimally similar pair of the two top clusters (a/d).

So far we have made the assumption that tight clusters are better than straggly clusters. This reflects an intuition that a cluster is a group of objects centered around a central point, and so compact clusters are to be preferred. Such an intuition corresponds to a model like the Gaussian distribution (section 2.1.9), which gives rise to sphere-like clusters. But this is only one possible underlying model of what a good cluster is. It is really a question of our prior knowledge about and model of the data which determines what a good cluster is. For example, the Hawaiian islands were produced (and are being produced) by a volcanic process which moves along a straight line and creates new volcanoes at more or less regular intervals. Single-link is a very appropriate clustering model here, since local coherence is what counts and elongated clusters are what we would expect (say, if we wanted to group several chains of volcanic islands). It is important to remember that the different clustering algorithms that we discuss will generally produce different results which incorporate the somewhat ad hoc biases of the different algorithms. Nevertheless, in most NLP applications, the sphere-shaped clusters of complete-link clustering are preferable to the elongated clusters of single-link clustering.

The disadvantage of complete-link clustering is that it has time complexity O(n^3), since there are n merging steps and each step requires O(n^2) comparisons to find the smallest similarity between any two objects for each cluster pair (where n is the number of objects to be clustered).(1) In contrast, single-link clustering has complexity O(n^2). Once the n × n similarity matrix for all objects has been computed, it can be updated after each merge in O(n): if clusters c_u and c_v are merged into c_j = c_u ∪ c_v, then the similarity of the merge with another cluster c_k is simply the maximum of the two individual similarities:

sim(c_j, c_k) = max(sim(c_u, c_k), sim(c_v, c_k))

Each of the n - 1 merges requires at most n constant-time updates. Both merging and similarity computation thus have complexity O(n^2) in single-link clustering, which corresponds to an overall complexity of O(n^2).

(1) O(n^3) is an instance of Big Oh notation for algorithmic complexity. We assume that the reader is familiar with it, or else is willing to skip issues of algorithmic complexity. It is defined in most books on algorithms, including (Cormen et al. 1990). The notation describes just the basic dependence of an algorithm on certain parameters, while ignoring constant factors.
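The O(n) update that keeps single-link clustering quadratic can be sketched as follows. The pair-keyed dictionary and the tiny example clusters are hypothetical; the point is only the update rule sim(c_j, c_k) = max(sim(c_u, c_k), sim(c_v, c_k)).

def merge_single_link(sim, clusters, u, v):
    """Merge clusters u and v; update similarities to every other cluster in O(n)."""
    j = u | v                                              # c_j = c_u ∪ c_v
    for k in clusters:
        if k not in (u, v):
            sim[(j, k)] = sim[(k, j)] = max(sim[(u, k)], sim[(v, k)])
    remaining = [c for c in clusters if c not in (u, v)] + [j]
    return sim, remaining

a, b, c = frozenset({1}), frozenset({2}), frozenset({3})
sim = {(a, b): 0.9, (b, a): 0.9, (a, c): 0.2, (c, a): 0.2, (b, c): 0.4, (c, b): 0.4}
sim, clusters = merge_single_link(sim, [a, b, c], a, b)
print(sim[(frozenset({1, 2}), c)])     # max(0.2, 0.4) = 0.4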

Single-link and complete-link clustering can be interpreted graph-theoretically as finding a maximally connected graph and a maximally complete graph (or clique), respectively, hence the term "complete link" for the latter. See (Jain and Dubes 1988: 64).

14.1.2 Group-average agglomerative clustering

Group-average agglomerative clustering is a compromise between single-link and complete-link clustering. Instead of the greatest similarity between elements of clusters (single-link) or the least similarity (complete link), the criterion for merges is average similarity. We will see presently that average similarity can be computed efficiently in some cases, so that the complexity of the algorithm is only O(n^2). The group-average strategy is thus an efficient alternative to complete-link clustering while avoiding the elongated and straggly clusters that occur in single-link clustering.

Some care has to be taken in implementing group-average agglomerative clustering. The complexity of computing average similarity directly is O(n^2). So if the average similarities are computed from scratch each time a new group is formed, that is, in each of the n merging steps, then the algorithm would be O(n^3). However, if the objects are represented as length-normalized vectors in an m-dimensional real-valued space and if the similarity measure is the cosine, defined as in (14.3):

(14.3)   \text{sim}(\vec{v}, \vec{w}) = \frac{\vec{v} \cdot \vec{w}}{|\vec{v}|\,|\vec{w}|} = \frac{\sum_{i=1}^{m} v_i w_i}{\sqrt{\sum_{i=1}^{m} v_i^2}\,\sqrt{\sum_{i=1}^{m} w_i^2}}

(which, for length-normalized vectors, is simply the dot product \vec{v} \cdot \vec{w}), then there exists an algorithm that computes the average similarity of a cluster in constant time from the average similarity of its two children. Given the constant time for an individual merging operation, the overall time complexity is O(n^2).

We write X for the set of objects to be clustered, each represented by an m-dimensional vector:

X \subseteq R^m

For a cluster c_j \subseteq X, the average similarity S between vectors in c_j is

defined as follows. (The factor |c_j|(|c_j| - 1) calculates the number of (non-zero) similarities added up in the double summation.)

(14.4)   S(c_j) = \frac{1}{|c_j|(|c_j| - 1)} \sum_{\vec{x} \in c_j} \sum_{\substack{\vec{y} \in c_j \\ \vec{y} \neq \vec{x}}} \text{sim}(\vec{x}, \vec{y})

Let C be the set of current clusters. In each iteration, we identify the two clusters c_u and c_v which maximize S(c_u \cup c_v). This corresponds to step 8 in figure 14.2. A new, smaller, partition C' is then constructed by merging c_u and c_v (step 10 in figure 14.2):

C' = (C - \{c_u, c_v\}) \cup \{c_u \cup c_v\}

For cosine as the similarity measure, the inner maximization can be done in linear time (Cutting et al. 1992: 328). One can compute the average similarity between the elements of a candidate pair of clusters in constant time by precomputing for each cluster the sum of its members, \vec{s}(c_j):

\vec{s}(c_j) = \sum_{\vec{x} \in c_j} \vec{x}

The sum vector \vec{s}(c_j) is defined in such a way that (i) it can be easily updated after a merge (namely by simply summing the \vec{s} of the clusters that are being merged), and (ii) the average similarity of a cluster can be easily computed from it. This is so because the following relationship between \vec{s}(c_j) and S(c_j) holds:

(14.5)   \vec{s}(c_j) \cdot \vec{s}(c_j) = \sum_{\vec{x} \in c_j} \vec{x} \cdot \vec{s}(c_j) = \sum_{\vec{x} \in c_j} \sum_{\vec{y} \in c_j} \vec{x} \cdot \vec{y} = |c_j|(|c_j| - 1) S(c_j) + \sum_{\vec{x} \in c_j} \vec{x} \cdot \vec{x} = |c_j|(|c_j| - 1) S(c_j) + |c_j|

Thus,

(14.6)   S(c_j) = \frac{\vec{s}(c_j) \cdot \vec{s}(c_j) - |c_j|}{|c_j|(|c_j| - 1)}

Therefore, if \vec{s}(\cdot) is known for two groups c_i and c_j, then the average similarity of their union can be computed in constant time as follows:

S(c_i \cup c_j) = \frac{(\vec{s}(c_i) + \vec{s}(c_j)) \cdot (\vec{s}(c_i) + \vec{s}(c_j)) - (|c_i| + |c_j|)}{(|c_i| + |c_j|)(|c_i| + |c_j| - 1)}
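The sum-vector bookkeeping above is easy to check numerically. The sketch below computes the average similarity of the union of two clusters from their sum vectors and sizes, and compares it against the direct definition (14.4); the vectors are arbitrary examples, length-normalized so that cosine reduces to the dot product.

import numpy as np

def normalize(v):
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def union_average_sim(s_i, n_i, s_j, n_j):
    """Average pairwise similarity of c_i ∪ c_j from sum vectors and sizes."""
    s, n = s_i + s_j, n_i + n_j
    return (s @ s - n) / (n * (n - 1))

c_i = [normalize([1, 0, 1]), normalize([1, 1, 0])]
c_j = [normalize([0, 1, 1])]

print(union_average_sim(sum(c_i), len(c_i), sum(c_j), len(c_j)))

# Brute-force check against definition (14.4) on the union:
union = c_i + c_j
direct = np.mean([x @ y for a, x in enumerate(union)
                        for b, y in enumerate(union) if a != b])
print(direct)

Because the sum vector of a merged cluster is just the sum of its children's sum vectors, each candidate merge can be scored in constant time once the initial sums are known, which is what keeps the algorithm quadratic overall.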

Given this result, this approach to group-average agglomerative clustering has complexity O(n^2), reflecting the fact that initially all pairwise similarities have to be computed. The following step performs the n mergers, each in linear time, so that the overall complexity remains quadratic. This form of group-average agglomerative clustering is efficient enough to deal with a large number of features (corresponding to the dimensions of the vector space) and a large number of objects.

Unfortunately, the constant-time computation for merging two groups (by making use of the quantities \vec{s}(c_j)) depends on the properties of vector spaces. There is no general algorithm for group-average clustering that would be efficient independent of the representation of the objects that are to be clustered.

14.1.3 An application: Improving a language model

Now that we have introduced some of the best known hierarchical clustering algorithms, it is time to look at an example of how clustering can be used for an application. The application is building a better language model. Recall that language models are useful in speech recognition and machine translation for choosing among several candidate hypotheses. For example, a speech recognizer may find that President Kennedy and precedent Kennedy are equally likely to have produced the acoustic observations. However, a language model can tell us what are a priori likely phrases of English. Here it tells us that President Kennedy is much more likely than precedent Kennedy, and so we conclude that President Kennedy is probably what was actually said.

This reasoning can be formalized by the equation for the noisy channel model, which we introduced in section 2.2.4. It says that we should choose the hypothesis H that maximizes the product of the probability given by the language model, P(H), and the conditional probability of observing the speech signal D (or the foreign language text in machine translation) given the hypothesis, P(D|H):

\hat{H} = \arg\max_H P(H \mid D) = \arg\max_H \frac{P(D \mid H) P(H)}{P(D)} = \arg\max_H P(D \mid H) P(H)

Clustering can play an important role in improving the language model (the computation of P(H)) by way of generalization. As we saw in chapter 6, there are many rare events for which we do not have enough training data for accurate probabilistic modeling. If we mediate probabilistic

inference through clusters, for which we have more evidence in the training set, then our predictions for rare events are likely to be more accurate. This approach was taken by Brown et al. (1992c). We first describe the formalization of the language model and then the clustering algorithm.

The language model

The language model under discussion is a bigram model that makes a first-order Markov assumption that a word depends only on the previous word. The criterion that we optimize is a decrease in cross entropy or, equivalently, perplexity (section 2.2.8), the amount by which the language model reduces the uncertainty about the next word. Our aim is to find a function π that assigns words to clusters which decreases perplexity compared to a simple word bigram model.

We first approximate the cross entropy of the corpus L = w_1 ... w_N for the cluster assignment function π by making the Markov assumption that a word's occurrence only depends on its predecessor:

(14.7)   H(L, \pi) = -\frac{1}{N} \log P(w_{1,...,N})
(14.8)            \approx -\frac{1}{N-1} \sum_{i=2}^{N} \log P(w_i \mid w_{i-1})
(14.9)            = -\frac{1}{N-1} \sum_{w^1 w^2} C(w^1 w^2) \log P(w^2 \mid w^1)

Now we make the basic assumption of cluster-based generalization that the occurrence of a word from cluster c^2 only depends on the cluster c^1 of the preceding word:(2)

(14.10)  H(L, \pi) \approx -\frac{1}{N-1} \sum_{w^1 w^2} C(w^1 w^2) \log P(c^2 \mid c^1) P(w^2 \mid c^2)

(2) One can observe that this equation is very similar to the probabilistic models used in tagging, which we discuss in chapter 10, except that we induce the word classes from corpus evidence instead of taking them from our linguistic knowledge about parts of speech.

Formula (14.10) can be simplified as follows:

14.1 Herarchcal Clusterng 511 (14.12) (14.13) (14.14) + w 2 ) N 1 [log P(c 2 c 1 ) log P(c 2 )] w 1 w 2 = w 1 C(w1 w 2 ) log P(w 2 c 2 )P(c 2 ) N 1 w 2 + C(c 1 c 2 ) c 1 c 2 N 1 log P(c 2 c 1 ) P(c 2 ) P(w)log P(w)+ P(c 1 c 2 ) log P(c 1c 2 ) w c 1 c 2 P(c 1 )P(c 2 ) = H(w) I(c 1 ; c 2 ) w In (14.13) we rely on the approxmatons w 2 ) N 1 P(w 2 ) and C(c 1 c 2 ) N 1 P(c 1 c 2 ), whch hold for large n. In addton, P(w 2 c 2 )P(c 2 ) = P(w 2 c 2 ) = P(w 2 ) holds snce π(w 2 ) = c 2. Equaton (14.14) shows that we can mnmze the cross entropy by choosng the cluster assgnment functon π such that the mutual nformaton between adjacent clusters I(c 1 ; c 2 ) s maxmzed. Thus we should get the optmal language model by choosng clusters that maxmze ths mutual nformaton measure. Clusterng (14.15) The clusterng algorthm s bottom-up wth the followng merge crteron whch maxmzes the mutual nformaton between adjacent classes: MI-loss(c,c j ) = I(c k ; c ) + I(c k ; c j ) I(c k ; c c j ) c k C\{c,c j } In each step, we select the two clusters whose merge causes the smallest loss n mutual nformaton. In the descrpton of bottom-up clusterng n fgure 14.2, ths would correspond to the followng selecton crteron for the par of clusters that s to be merged next: (c n1,c n2 ) := arg mn MI-loss(c,c j ) (c,c j ) CC The clusterng s stopped when a pre-determned number k of clusters has been reached (k = 1000 n (Brown et al. 1992c)). Several shortcuts are necessary to make the computaton of the MI-loss functon and the clusterng of a large vocabulary effcent. In addton, the greedy algorthm

("do the merge with the smallest MI-loss") does not guarantee an optimal clustering result. The clusters can be (and were) improved by moving individual words between clusters. The interested reader can look up the specifics of the algorithm in (Brown et al. 1992c). Here are three of the 1000 clusters found by Brown et al. (1992c):

plan, letter, request, memo, case, question, charge, statement, draft
day, year, week, month, quarter, half
evaluation, assessment, analysis, understanding, opinion, conversation, discussion

We observe that these clusters are characterized by both syntactic and semantic properties, for example, nouns that refer to time periods.

The perplexity for the cluster-based language model was 277, compared to a perplexity of 244 for a word-based model (Brown et al. 1992c: 476), so no direct improvement was achieved by clustering. However, a linear interpolation (see section 6.3.1) between the word-based and the cluster-based model had a perplexity of 236, which is an improvement over the word-based model (Brown et al. 1992c: 476). This example demonstrates the utility of clustering for the purpose of generalization.

We conclude our discussion by pointing out that clustering and cluster-based inference are integrated here. The criterion we optimize on in clustering, the minimization of H(L, π) = H(w) - I(c^1; c^2), is at the same time a measure of the quality of the language model, the ultimate goal of the clustering. Other researchers first induce clusters and then use these clusters for generalization in a second, independent step. An integrated approach to clustering and cluster-based inference is preferable because it guarantees that the induced clusters are optimal for the particular type of generalization that we intend to use the clustering for.

14.1.4 Top-down clustering

Hierarchical top-down clustering as described in figure 14.3 starts out with one cluster that contains all objects. The algorithm then selects the least coherent cluster in each iteration and splits it. The functions we introduced in table 14.3 for selecting the best pair of clusters to merge in bottom-up clustering can also serve as measures of cluster coherence in top-down clustering. According to the single-link measure, the coherence of a cluster is the smallest similarity in the minimum spanning tree

for the cluster; according to the complete-link measure, the coherence is the smallest similarity between any two objects in the cluster; and according to the group-average measure, coherence is the average similarity between objects in the cluster. All three measures can be used to select the least coherent cluster in each iteration of top-down clustering.

Splitting a cluster is also a clustering task, the task of finding two subclusters of the cluster. Any clustering algorithm can be used for the splitting operation, including the bottom-up algorithms described above and non-hierarchical clustering. Perhaps because of this recursive need for a second clustering algorithm, top-down clustering is less often used than bottom-up clustering. However, there are tasks for which top-down clustering is the more natural choice. An example is the clustering of probability distributions using the Kullback-Leibler (KL) divergence. Recall that KL divergence, which we introduced in section 2.2.5, is defined as follows:

(14.16)  D(p \,\|\, q) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)}

This dissimilarity measure is not defined for p(x) > 0 and q(x) = 0. In cases where individual objects have probability distributions with many zeros, one cannot compute the matrix of similarity coefficients for all objects that is required for bottom-up clustering.

An example of such a constellation is the approach to distributional clustering of nouns proposed by (Pereira et al. 1993). Object nouns are represented as probability distributions over verbs, where q_n(v) is estimated as the relative frequency that, given the object noun n, the verb v is its predicate. So for example, for the noun apple and the verb eat, we will have q_n(v) = 0.2 if one fifth of all occurrences of apple as an object noun are with the verb eat. Any given noun only occurs with a limited number of verbs, so we have the above-mentioned problem with singularities in computing KL divergence here, which prevents us from using bottom-up clustering. To address this problem, distributional noun clustering instead performs top-down clustering. Cluster centroids are computed as (weighted and normalized) sums of the probability distributions of the member nouns. This leads to cluster centroid distributions with few zeros that have a defined KL divergence with all their members. See Pereira et al. (1993) for a complete description of the algorithm.
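Equation (14.16) and the singularity that motivates top-down clustering here are easy to see in a short sketch. The verb distributions are invented; the point is that D(p||q) is undefined (infinite below) as soon as p gives mass to a verb that q has never seen, whereas a centroid obtained by averaging member distributions has support wherever any member does.

import math

def kl(p, q):
    """D(p || q) = sum_x p(x) log(p(x)/q(x)); undefined when p(x) > 0 and q(x) = 0."""
    total = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue
        qx = q.get(x, 0.0)
        if qx == 0.0:
            return float("inf")       # the singularity discussed in the text
        total += px * math.log(px / qx)
    return total

apple = {"eat": 0.6, "peel": 0.4}                 # invented q_n(v) for two nouns
policy = {"adopt": 0.7, "criticize": 0.3}

print(kl(apple, policy))                           # inf: no shared verbs

centroid = {v: (apple.get(v, 0.0) + policy.get(v, 0.0)) / 2
            for v in apple.keys() | policy.keys()}
print(kl(apple, centroid), kl(policy, centroid))   # both finite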

14.2 Non-Hierarchical Clustering

Non-hierarchical algorithms often start out with a partition based on randomly selected seeds (one seed per cluster), and then refine this initial partition. Most non-hierarchical algorithms employ several passes of reallocating objects to the currently best cluster, whereas hierarchical algorithms need only one pass. However, reallocation of objects from one cluster to another can improve hierarchical clusterings too. We saw an example in section 14.1.3, where after each merge objects were moved around to improve global mutual information.

If the non-hierarchical algorithm has multiple passes, then the question arises when to stop. This can be determined based on a measure of goodness or cluster quality. We have already seen candidates of such a measure, for example, group-average similarity and mutual information between adjacent clusters. Probably the most important stopping criterion is the likelihood of the data given the clustering model, which we will introduce below. Whichever measure we choose, we simply continue clustering as long as the measure of goodness improves enough in each iteration. We stop when the curve of improvement flattens or when goodness starts decreasing.

The measure of goodness can address another problem: how to determine the right number of clusters. In some cases, we may have some prior knowledge about the right number of clusters (for example, the right number of parts of speech in part-of-speech clustering). If this is not the case, we can cluster the data into n clusters for different values of n. Often the goodness measure improves with n. For example, the more clusters the higher the maximum mutual information that can be attained for a given data set. However, if the data naturally fall into a certain number k of clusters, then one can often observe a substantial increase in goodness in the transition from k-1 to k clusters and a small increase in the transition from k to k+1. In order to automatically determine the number of clusters, we can look for a k with this property and then settle on the resulting k clusters.
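The heuristic just described, a large gain in goodness going into k followed by a small gain going out of it, can be sketched as follows. The goodness values are hypothetical; in practice they would come from whatever measure is in use (group-average similarity, mutual information, or data likelihood).

def pick_k(goodness):
    """goodness maps a number of clusters k to its goodness score (consecutive k's).
    Return the k whose gain over k-1 most exceeds the gain of k+1 over k."""
    best_k, best_drop = None, float("-inf")
    for k in sorted(goodness)[1:-1]:
        gain_in = goodness[k] - goodness[k - 1]
        gain_out = goodness[k + 1] - goodness[k]
        if gain_in - gain_out > best_drop:
            best_k, best_drop = k, gain_in - gain_out
    return best_k

goodness = {1: 0.20, 2: 0.45, 3: 0.70, 4: 0.73, 5: 0.75}   # flattens after k = 3
print(pick_k(goodness))                                    # 3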

A more principled approach to finding an optimal number of clusters is the Minimum Description Length (MDL) approach in the AUTOCLASS system (Cheeseman et al. 1988). The basic idea is that the measure of goodness captures both how well the objects fit into the clusters (which is what the other measures we have seen do) and how many clusters there are. A high number of clusters will be penalized, leading to a lower goodness value. In the framework of MDL, both the clusters and the objects are specified by code words whose length is measured in bits. The more clusters there are, the fewer bits are necessary to encode the objects. In order to encode an object, we only encode the difference between it and the cluster it belongs to. If there are more clusters, the clusters describe objects better, and we need fewer bits to describe the difference between objects and clusters. However, more clusters obviously take more bits to encode. Since the cost function captures the length of the code for both data and clusters, minimizing this function (which maximizes the goodness of the clustering) will determine both the number of clusters and how to assign objects to clusters.(3)

(3) AUTOCLASS can be downloaded from the internet. See the website.

It may appear that it is an advantage of hierarchical clustering that the number of clusters need not be determined. But the full cluster hierarchy of a set of objects does not define a particular clustering, since the tree can be cut in many different ways. For a usable set of clusters in hierarchical clustering one often needs to determine a desirable number of clusters or, alternatively, a value of the similarity measure at which links of the tree are cut. So there is not really a difference between hierarchical and non-hierarchical clustering in this respect. For some non-hierarchical clustering algorithms, an advantage is their speed.

We cover two non-hierarchical clustering algorithms in this section, K-means and the EM algorithm. K-means clustering is probably the simplest clustering algorithm and, despite its limitations, it works sufficiently well in many applications. The EM algorithm is a general template for a family of algorithms. We describe its incarnation as a clustering algorithm first and then relate it to the various instantiations that have been used in Statistical NLP, some of which, like the inside-outside algorithm and the forward-backward algorithm, are more fully treated in other chapters of this book.

14.2.1 K-means

K-means is a hard clustering algorithm that defines clusters by the center of mass of their members. We need a set of initial cluster centers in the beginning. Then we go through several iterations of assigning each object to the cluster whose center is closest. After all objects have been assigned, we recompute the center of each cluster as the centroid or mean

\vec{\mu} of its members (see figure 14.8), that is, \vec{\mu} = \frac{1}{|c_j|} \sum_{\vec{x} \in c_j} \vec{x}. The distance function is Euclidean distance.

1  Given: a set X = {\vec{x}_1, ..., \vec{x}_n} ⊆ R^m
2         a distance measure d: R^m × R^m → R
3         a function for computing the mean µ: P(R^m) → R^m
4  Select k initial centers \vec{f}_1, ..., \vec{f}_k
5  while stopping criterion is not true do
6      for all clusters c_j do
7          c_j := {\vec{x}_i | ∀ \vec{f}_l : d(\vec{x}_i, \vec{f}_j) ≤ d(\vec{x}_i, \vec{f}_l)}
8      end
9      for all means \vec{f}_j do
10         \vec{f}_j := µ(c_j)
11     end
12 end

Figure 14.8 The K-means clustering algorithm.

A variant of K-means is to use the L_1 norm instead (section 8.5.2):

L_1(\vec{x}, \vec{y}) = \sum_l |x_l - y_l|

This norm is less sensitive to outliers. K-means clustering in Euclidean space often creates singleton clusters for outliers. Clustering in L_1 space will pay less attention to outliers, so that there is higher likelihood of getting a clustering that partitions objects into clusters of similar size. The L_1 norm is often used in conjunction with medoids as cluster centers. The difference between medoids and centroids is that a medoid is one of the objects in the cluster, a prototypical class member. A centroid, the average of a cluster's members, is in most cases not identical to any of the objects.

The time complexity of K-means is O(n), since both steps of the iteration are O(n) and only a constant number of iterations is computed.

Figure 14.9 shows an example of one iteration of the K-means algorithm. First, objects are assigned to the cluster whose mean is closest. Then the means are recomputed. In this case, any further iterations will not change the clustering, since an assignment to the closest center does not change the cluster membership of any object, which in turn means that no center will be changed in the recomputation step. But this is not the case in general. Usually several iterations are required before the algorithm converges.
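Figure 14.8 maps directly onto a small numpy implementation. The sketch below makes its own choices for the parts the pseudocode leaves open: random initial centers, a fixed number of iterations as the stopping criterion, and ties broken by argmin (lowest cluster index wins).

import numpy as np

def kmeans(X, k, iterations=10, seed=0):
    """Minimal K-means for an (n, m) array X; returns (centers, assignment)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]     # line 4
    for _ in range(iterations):                                # line 5
        # lines 6-8: assign every object to its closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assignment = dists.argmin(axis=1)
        # lines 9-11: recompute each center as the mean of its members
        for j in range(k):
            members = X[assignment == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers, assignment

X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
              [8.0, 8.0], [8.5, 8.0], [9.0, 9.0]])
centers, assignment = kmeans(X, k=2)
print(assignment)     # the two tight groups fall into separate clusters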

Figure 14.9 One iteration of the K-means algorithm. The first step assigns objects to the closest cluster mean (left panel, "assignment"). Cluster means are shown as circles. The second step recomputes cluster means as the center of mass of the set of objects that are members of the cluster (right panel, "recomputation of means").

One implementation problem that the description in figure 14.8 does not address is how to break ties in cases where there are several centers with the same distance from an object. In such cases, one can either assign objects randomly to one of the candidate clusters (which has the disadvantage that the algorithm may not converge) or perturb objects slightly so that their new positions do not give rise to ties.

Here is an example of how to use K-means clustering. Consider these twenty words from the New York Times corpus in chapter 5: Barbara, Edward, Gov, Mary, NFL, Reds, Scott, Sox, ballot, finance, inning, payments, polls, profit, quarterback, researchers, science, score, scored, seats.

Cluster   Members
1         ballot (0.28), polls (0.28), Gov (0.30), seats (0.32)
2         profit (0.21), finance (0.21), payments (0.22)
3         NFL (0.36), Reds (0.28), Sox (0.31), inning (0.33), quarterback (0.30), scored (0.30), score (0.33)
4         researchers (0.23), science (0.23)
5         Scott (0.28), Mary (0.27), Barbara (0.27), Edward (0.29)

Table 14.4 An example of K-means clustering. Twenty words represented as vectors of co-occurrence counts were clustered into 5 clusters using K-means. The distance from the cluster centroid is given after each word.

Table 14.4 shows the result of clustering these words using K-means with k = 5. We used the data representation from chapter 8 that is also the basis of table 8.8 on page 302. The first four clusters correspond to the topics government, finance, sports, and research, respectively. The last cluster contains names. The benefit of clustering is obvious here. The clustered display of the words makes it easier to understand what types of words occur in the sample and what their relationships are.

Initial cluster centers for K-means are usually picked at random. It depends on the structure of the set of objects to be clustered whether the choice of initial centers is important or not. Many sets are well-behaved and most initializations will result in clusterings of about the same quality. For ill-behaved sets, one can compute good cluster centers by first running a hierarchical clustering algorithm on a subset of the objects. This is the basic idea of the Buckshot algorithm. Buckshot first applies group-average agglomerative clustering (GAAC) to a random sample of the data that has size square root of the complete set. GAAC has quadratic time complexity, but since (\sqrt{n})^2 = n, applying GAAC to this sample results in overall linear complexity of the algorithm. The K-means reassignment step is also linear, so that the overall complexity is O(n).

14.2.2 The EM algorithm

One way to introduce the EM algorithm is as a soft version of K-means clustering. Figure 14.10 shows an example. As before, we start with a set of random cluster centers, c_1 and c_2. In K-means clustering we would

arrive at the final centers shown on the right side in one iteration. The EM algorithm instead does a soft assignment, which, for example, makes the lower right point mostly a member of c_2, but also partly a member of c_1. As a result, both cluster centers move towards the centroid of all three objects in the first iteration. Only after the second iteration do we reach the stable final state.

Figure 14.10 An example of using the EM algorithm for soft clustering. The three panels show the initial state, the state after iteration 1, and the state after iteration 2, with the cluster centers c_1 and c_2 marked in each.

An alternative way of thinking of the EM algorithm is as a way of estimating the values of the hidden parameters of a model. We have seen some data X, and can estimate P(X | p(Θ)), the probability of the data according to some model p with parameters Θ. But how do we find the model which maximizes the likelihood of the data? This point will be a maximum in the parameter space, and therefore we know that the probability surface will be flat there. So for each model parameter θ, we want to set \frac{\partial}{\partial \theta} \log P(...) = 0 and solve for the θ. Unfortunately this (in general) gives a non-linear set of equations for which no analytical methods of solution are known. But we can hope to find the maximum using the EM algorithm.

In this section, we will first introduce the EM algorithm for the estimation of Gaussian mixtures, the soft clustering algorithm that figure 14.10 is an example of. Then we will describe the EM algorithm in its most general form and relate the general form to specific instances like the inside-outside algorithm and the forward-backward algorithm.

EM for Gaussian mixtures

In applying EM to clustering, we view clustering as estimating a mixture of probability distributions. The idea is that the observed data are generated by several underlying causes. Each cause contributes independently to the generation process, but we only see the final mixture, without information about which cause contributed what. We formalize this notion by representing the data as a pair. There is the observable data X = {\vec{x}_i}, where each \vec{x}_i = (x_{i1}, ..., x_{im})^T is simply the vector that corresponds to the i-th data point. And then there is the unobservable data Z = {\vec{z}_i}, where, within each \vec{z}_i = (z_{i1}, ..., z_{ik}), the component z_{ij} is 1 if object i is a member of cluster j (that is, it is assumed to be generated by that underlying cause) and 0 otherwise.

We can cluster with the EM algorithm if we know the type of distribution of the individual clusters (or causes). When estimating a Gaussian mixture, we make the assumption that each cluster is a Gaussian. The EM algorithm then determines the most likely estimates for the parameters of the distributions (in our case, the mean and variance of each Gaussian), and the prior probability (or relative prominence or weight) of the individual causes. So in sum, we are supposing that the data to be clustered consist of n m-dimensional objects X = {\vec{x}_1, ..., \vec{x}_n} ⊆ R^m generated by k Gaussians n_1, ..., n_k.

Once the mixture has been estimated, we can view the result as a clustering by interpreting each cause as a cluster. For each object \vec{x}_i, we can compute the probability P(ω_j | \vec{x}_i) that cluster j generated \vec{x}_i. An object can belong to several clusters, with varying degrees of confidence.

Multivariate normal distributions. The (multivariate) m-dimensional Gaussian family is parameterized by a mean or center \vec{\mu}_j and an m × m invertible positive definite symmetric matrix, the covariance matrix Σ_j. The probability density function for a Gaussian is given by:

(14.17)  n_j(\vec{x}; \vec{\mu}_j, \Sigma_j) = \frac{1}{\sqrt{(2\pi)^m |\Sigma_j|}} \exp\left[ -\frac{1}{2} (\vec{x} - \vec{\mu}_j)^T \Sigma_j^{-1} (\vec{x} - \vec{\mu}_j) \right]

Since we are assuming that the data are generated by k Gaussians, we wish to find the maximum likelihood model of the form:

(14.18)  \sum_{j=1}^{k} \pi_j \, n(\vec{x}; \vec{\mu}_j, \Sigma_j)

In this model, we need to assume a prior or weight π_j for each Gaussian, so that the integral of the combined Gaussians over the whole space is 1.
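A minimal sketch of the density (14.17) and of one EM update for such a mixture is given below. To keep it short it assumes spherical covariances Σ_j = σ²I and re-estimates only the weights and means; the responsibilities computed in the E-step are the soft memberships P(ω_j | x_i) discussed above.

import numpy as np

def gaussian_density(x, mu, sigma2):
    """Equation (14.17) specialized to a spherical covariance sigma2 * I."""
    m = len(x)
    diff = x - mu
    return np.exp(-0.5 * diff @ diff / sigma2) / ((2 * np.pi * sigma2) ** (m / 2))

def em_step(X, weights, means, sigma2):
    """One EM iteration for a spherical Gaussian mixture (weights and means only)."""
    # E-step: responsibilities proportional to pi_j * n_j(x_i), normalized per object.
    resp = np.array([[w * gaussian_density(x, mu, sigma2)
                      for w, mu in zip(weights, means)] for x in X])
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: re-estimate mixture weights and means from the soft counts.
    counts = resp.sum(axis=0)
    return counts / len(X), (resp.T @ X) / counts[:, None], resp

X = np.array([[0.0, 0.0], [0.5, 0.0], [3.0, 3.0]])
weights, means = np.array([0.5, 0.5]), np.array([[0.0, 0.5], [2.5, 2.5]])
for _ in range(2):
    weights, means, resp = em_step(X, weights, means, sigma2=0.5)
print(resp)           # soft cluster memberships after two iterations

Initializing the means with K-means centroids, as in table 14.5, is the common practice the text mentions.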

Main cluster   Word          P(w | c_j) = n_j(\vec{x}; \vec{\mu}_j, Σ_j), for clusters j = 1, ..., 5
1              ballot        0.63   0.12   0.04   0.09   0.11
1              polls         0.58   0.11   0.06   0.10   0.14
1              Gov           0.58   0.12   0.03   0.10   0.17
1              seats         0.55   0.14   0.08   0.08   0.15
2              profit        0.11   0.59   0.02   0.14   0.15
2              finance       0.15   0.55   0.01   0.13   0.16
2              payments      0.12   0.66   0.01   0.09   0.11
3              NFL           0.13   0.05   0.58   0.09   0.16
3              Reds          0.05   0.01   0.86   0.02   0.06
3              Sox           0.05   0.01   0.86   0.02   0.06
3              inning        0.03   0.01   0.93   0.01   0.02
3              quarterback   0.06   0.02   0.82   0.03   0.07
3              score         0.12   0.04   0.65   0.06   0.13
3              scored        0.08   0.03   0.79   0.03   0.07
4              researchers   0.08   0.12   0.02   0.68   0.10
4              science       0.12   0.12   0.03   0.54   0.19
5              Scott         0.12   0.12   0.11   0.11   0.54
5              Mary          0.10   0.10   0.05   0.15   0.59
5              Barbara       0.15   0.11   0.04   0.12   0.57
5              Edward        0.16   0.18   0.02   0.12   0.51

Table 14.5 An example of a Gaussian mixture. The five cluster centroids from table 14.4 are the means \vec{\mu}_j of the five clusters. A uniform diagonal covariance matrix Σ = 0.05 I and uniform priors π_j = 0.2 were used. The posterior probabilities P(w | c_j) can be interpreted as cluster membership probabilities.

Table 14.5 gives an example of a Gaussian mixture, using the centroids from the K-means clustering in table 14.4 as cluster centroids \vec{\mu}_j (this is a common way of initializing EM for Gaussian mixtures). For each word, the cluster from table 14.4 is still the dominating cluster. For example, ballot has a higher membership probability in cluster 1 (its cluster from the K-means clustering) than in other clusters. But each word also has some non-zero membership in all other clusters. This is useful for assessing the strength of association between a word and a topic. Comparing two