Semi-Supervised Kernel Mean Shift Clustering


Saket Anand, Student Member, IEEE, Sushil Mittal, Member, IEEE, Oncel Tuzel, Member, IEEE, and Peter Meer, Fellow, IEEE

S. Anand and P. Meer are with the Department of Electrical and Computer Engineering, Rutgers University, Piscataway, NJ. E-mail: {anands@eden, meer@cronos}.rutgers.edu. S. Mittal is with Scibler Technologies, Bangalore, India. E-mail: mittal@scibler.com. O. Tuzel is with Mitsubishi Electric Research Laboratories, Cambridge, MA. E-mail: oncel@merl.com.

Abstract — Mean shift clustering is a powerful nonparametric technique that does not require prior knowledge of the number of clusters and does not constrain the shape of the clusters. However, being completely unsupervised, its performance suffers when the original distance metric fails to capture the underlying cluster structure. Despite recent advances in semi-supervised clustering methods, there has been little effort towards incorporating supervision into mean shift. We propose a semi-supervised framework for kernel mean shift clustering (SKMS) that uses only pairwise constraints to guide the clustering procedure. The points are first mapped to a high-dimensional kernel space where the constraints are imposed by a linear transformation of the mapped points. This is achieved by modifying the initial kernel matrix by minimizing a log det divergence-based objective function. We show the advantages of SKMS by evaluating its performance on various synthetic and real datasets while comparing with state-of-the-art semi-supervised clustering algorithms.

Index Terms — semi-supervised kernel clustering; log det Bregman divergence; mean shift clustering

1 INTRODUCTION

Mean shift is a popular mode seeking algorithm, which iteratively locates the modes in the data by maximizing the kernel density estimate (KDE). Although the procedure was initially described decades ago [17], it was not popular in the vision community until its potential uses for feature space analysis and optimization were understood [8], [12]. The nonparametric nature of mean shift makes it a powerful tool to discover arbitrarily shaped clusters present in the data. Additionally, the number of clusters is automatically determined by the number of discovered modes. Due to these properties, mean shift has been applied to solve several computer vision problems, e.g., image smoothing and segmentation [11], [41], visual tracking [9], [19] and information fusion [7], [10]. Mean shift clustering was also extended to nonlinear spaces, for example, to perform clustering on analytic manifolds [36], [38] and kernel induced feature spaces [34], [39] under an unsupervised setting.

In many clustering problems, in addition to the unlabeled data, some additional information is often easily available. Depending on the particular problem, this additional information could be available in different forms. For example, the number of clusters or a few must-link (similarity) and cannot-link (dissimilarity) pairs could be known a priori. In the last decade, semi-supervised clustering methods that aim to incorporate prior information into the clustering algorithm as a guide have received considerable attention in machine learning and computer vision [2], [6], [18]. Increasing interest in this area has triggered the adaptation of several popular unsupervised clustering algorithms into a semi-supervised framework, e.g., background constrained k-means [40], constrained spectral clustering [23], [28] and kernel graph clustering [24]. It has been shown that unlabeled data, when used in conjunction with a small amount of labeled data, can produce significant improvement in clustering accuracy.
However, despite these advances, mean shift clustering has largely been ignored under the semi-supervised learning framework. To leverage all the useful properties of mean shift, we adapt the original unsupervised algorithm into a semi-supervised clustering technique. The work in [37] introduced weak supervision into the kernel mean shift algorithm where the additional information was provided through a few pairwise must-link constraints. In that framework, each pair of must-link points was collapsed to a single point through a linear projection operation, guaranteeing their association with the same cluster. In this paper, we extend that work by generalizing the linear projection operation to a linear transformation of the kernel space that permits us to scale the distance between the constraint points. Using this transformation, the must-link points are moved closer to each other, while the cannot-link points are moved farther apart. We show that this transformation can be achieved implicitly by modifying the kernel matrix. We also show that, given one constraint pair, the corresponding kernel update is equivalent to minimizing the log det divergence between the updated and the initial kernel matrix. For multiple constraints, this result proves to be very useful since we can learn the kernel matrix by solving a constrained log det minimization problem [22]. Fig. 1 illustrates the basic approach.

Fig. 1: Olympic Circles. (a) Original data in the input space with five different clusters. (b) Labeled points used to generate pairwise constraints. (c) Pairwise distance matrix (PDM) using the modes discovered by unsupervised mean shift clustering performed in the kernel space. The points are ordered according to class. (d) PDM using the modes found after supervision was added. Sample clustering results using (e) unsupervised and (f) semi-supervised kernel mean shift clustering.

The original data, with five circles each containing 300 points, is shown in Fig. 1a. We assume that 20 labeled points from each cluster are selected at random (only 6.7% of the data) to generate pairwise must-link and cannot-link constraints (Fig. 1b). We use all the $5\binom{20}{2}$ unique intra-class pairs as must-link constraints and an equal number of cannot-link constraints. Note that, although we generate pairwise constraints using a small fraction of labeled points, the algorithm does not require explicit knowledge of the class labels. The data is first mapped to a kernel space using a Gaussian kernel (σ = 0.5). In the first case, kernel mean shift is directly applied to the data points in the absence of any supervision. Fig. 1c shows the corresponding pairwise distance matrix (PDM) obtained using the modes recovered by mean shift. The lack of a block diagonal structure implies the absence of meaningful clusters discovered in the data. In the second case, the constraint pairs are used to transform the initial kernel space by learning an appropriate kernel matrix via log det divergence minimization. Fig. 1d shows the PDM between the modes obtained when kernel mean shift is applied to the transformed feature points. In this case, the five-cluster structure is clearly visible in the PDM. Finally, Fig. 1e and 1f show the corresponding clustering labels mapped back to the input space. Through this example, it becomes evident that a small amount of supervision can improve the clustering performance significantly.

In the following, we list our key contributions and present the overall organization of the paper. In Section 2, we present an overview of past work in semi-supervised clustering. In Section 3.1, we briefly review the Euclidean mean shift algorithm and in Section 3.2 we discuss its extension to high-dimensional kernel spaces. By employing the kernel trick, the explicit representation of the kernel mean shift algorithm in terms of the mapping function is transformed into an implicit representation described in terms of the kernel function inducing that mapping. In Section 4.1, we derive the expression for performing kernel updates using orthogonal projection of the must-link constraint points. With little manipulation of the distances between the constraint points, in Section 4.2, we extend this projection operation to the more general linear transformation, which can also utilize cannot-link constraints. By allowing relaxation of the constraints, we motivate the formulation of learning the kernel matrix as a Bregman divergence minimization problem. In Section 5, we review the optimization algorithm of [22], [25] for learning the kernel matrix using log det Bregman divergences. More details about this algorithm are furnished in the supplementary material. In Section 6, we describe the complete semi-supervised kernel mean shift (SKMS) algorithm. In addition, several practical enhancements, like automatically estimating various algorithmic parameters and low-rank kernel learning for significant computational gains, are also discussed in detail.
In Section 7, we show experimental results on several synthetic and real data sets that encompass a wide spectrum of challenges encountered in practical machine learning problems. We also present a detailed comparison with state-of-the-art semi-supervised clustering techniques. Finally, in Section 8, we conclude with a discussion of the key strengths and limitations of our work.

2 RELATED WORK

Semi-supervised clustering has received a lot of attention in the past few years due to its highly improved performance over traditional unsupervised clustering methods [3], [6]. As compared to fully-supervised learning algorithms, these methods require a weaker form of supervision in terms of both the amount and the form of labeled data used. In clustering, this is usually achieved by using only a few pairwise must-link and cannot-link constraints. Since generating pairwise constraints does not require knowledge of the class labels or the number of clusters, they can easily be generated using additional information.

For example, while clustering images of objects, two images with similar text annotations could be used as a must-link pair. We discuss some of the traditional unsupervised clustering methods and their semi-supervised variants below.

Partitioning based clustering: k-means is one of the oldest and most popular clustering algorithms. See [21] for an extensive review of the variants of the k-means algorithm. One of its popular semi-supervised extensions using pairwise constraints was proposed in [40]. The method partitions the data into k non-overlapping regions such that must-link pairs are enforced to lie in the same cluster while cannot-link pairs are enforced to lie in different clusters. However, since the method enforces all constraints rigidly, it is sensitive to labeling errors in the pairwise constraints. Basu et al. [2] proposed a more robust, probabilistic model by explicitly allowing relaxation of a few constraints in k-means clustering. Similarly, Bilenko et al. [4] proposed a metric learning based approach to incorporate constraints into the k-means clustering framework. Since the clustering in both these methods is performed in the input space, they fail to handle nonlinear cluster boundaries. More recently, Kulis et al. proposed the semi-supervised kernel k-means (SSKK) algorithm [24], which constructs a kernel matrix by adding the input similarity matrix to the constraint penalty matrix. This kernel matrix is then used to perform k-means clustering.

Graph based clustering: Spectral clustering [31] is another very popular technique that can also cluster data points with nonlinear cluster boundaries. In [23], this method was extended to incorporate weak supervision by updating the computed affinity matrix for the specified constraint points. Later, Lu et al. [28] further modified the algorithm by propagating the specified constraints to the remaining points using a Gaussian process. More recently, Lu and Ip [29] showed improved clustering performance by applying exhaustive constraint propagation and handling soft constraints (E2CP). One of the fundamental problems with this method is that it can be sensitive to labeling noise, since the effect of a mislabeled data point pair could easily get propagated throughout the affinity matrix. Moreover, in general, spectral clustering methods suffer when the clusters have very different scales and densities [30].

Density based clustering: Clustering methods in this category make use of the estimated local density of the data to discover clusters. Gaussian Mixture Models (GMM) are often used for soft clustering where each mixture represents a cluster distribution [27]. Mean shift [11], [12], [37] was employed in computer vision for clustering data in the feature space by locating the modes of the nonparametric estimate of the underlying density. There exist other density based clustering methods that are less known in the vision community, but have been applied to data mining applications. For example, DBSCAN [15] groups together points that are connected through a chain of high-density neighbors determined by two parameters: the neighborhood size and the minimum allowable density. SSDBSCAN [26] is a semi-supervised variant of DBSCAN that explicitly uses the labels of a few points to determine the neighborhood parameters. C-DBSCAN [33] performs a hierarchical density based clustering while enforcing the must-link and cannot-link constraints.
3 KERNEL MEAN SHIFT CLUSTERING

First, we briefly describe the Euclidean mean shift clustering algorithm in Section 3.1 and then derive the kernel mean shift algorithm in Section 3.2.

3.1 Euclidean Mean Shift Clustering

Given n data points $\mathbf{x}_i$ in a d-dimensional space $\mathbb{R}^d$ and the associated diagonal bandwidth matrices $h_i \mathbf{I}_{d \times d}$, $i = 1, \ldots, n$, the sample point density estimator obtained with the kernel profile $k(x)$ is given by

$f(\mathbf{x}) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{h_i^d}\, k\!\left(\left\|\frac{\mathbf{x}-\mathbf{x}_i}{h_i}\right\|^2\right).$  (1)

We utilize the multivariate normal profile

$k(x) = e^{-\frac{1}{2}x}, \qquad x \ge 0.$  (2)

Taking the gradient of (1), we observe that the stationary points of the density function satisfy

$\frac{2}{n}\sum_{i=1}^{n}\frac{\mathbf{x}_i-\mathbf{x}}{h_i^{d+2}}\, g\!\left(\left\|\frac{\mathbf{x}-\mathbf{x}_i}{h_i}\right\|^2\right) = \mathbf{0}$  (3)

where $g(x) = -k'(x)$. The solution is a local maximum of the density function which can be iteratively reached using the mean shift procedure

$\delta\mathbf{x} = \frac{\sum_{i=1}^{n}\frac{\mathbf{x}_i}{h_i^{d+2}}\, g\!\left(\left\|\frac{\mathbf{x}-\mathbf{x}_i}{h_i}\right\|^2\right)}{\sum_{i=1}^{n}\frac{1}{h_i^{d+2}}\, g\!\left(\left\|\frac{\mathbf{x}-\mathbf{x}_i}{h_i}\right\|^2\right)} - \mathbf{x}$  (4)

where $\mathbf{x}$ is the current mean and $\delta\mathbf{x}$ is the mean shift vector. To recover from saddle points, adding a small perturbation to the current mean vector is usually sufficient. Comaniciu and Meer [11] showed that convergence to a local mode of the distribution is guaranteed.

3.2 Mean Shift Clustering in Kernel Spaces

We extend the original mean shift algorithm from the Euclidean space to a general inner product space. This makes it possible to apply the algorithm to a larger class of nonlinear problems, such as clustering on manifolds [38]. We also note that a similar derivation for the fixed bandwidth mean shift algorithm was given in [39]. Let $\mathcal{X}$ be the input space such that the data points $\mathbf{x}_i \in \mathcal{X}$, $i = 1, \ldots, n$. Although, in general, $\mathcal{X}$ may not necessarily be a Euclidean space, for the sake of simplicity, we assume $\mathcal{X}$ corresponds to $\mathbb{R}^d$. Every point $\mathbf{x}$ is then mapped to a $d_\phi$-dimensional feature space $\mathcal{H}$ by applying the mapping functions $\phi_l$, $l = 1, \ldots, d_\phi$, where

$\phi(\mathbf{x}) = [\phi_1(\mathbf{x})\;\; \phi_2(\mathbf{x})\;\; \ldots\;\; \phi_{d_\phi}(\mathbf{x})]^\top.$  (5)
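To make the iteration concrete before continuing with the kernel formulation, the following is a minimal numerical sketch of the Euclidean procedure (1)-(4) with the normal profile (2). The toy data, bandwidths, tolerance and iteration cap are illustrative assumptions, not values used in the paper.

```python
import numpy as np

def mean_shift_mode(x, X, h, tol=1e-6, max_iter=500):
    """Iterate the mean shift update (4) from a starting point x.

    X : (n, d) data matrix, h : (n,) per-point bandwidths.
    For the normal profile (2), g(u) is proportional to exp(-u/2);
    the constant factor cancels between numerator and denominator."""
    for _ in range(max_iter):
        d2 = np.sum((x - X) ** 2, axis=1) / h ** 2       # ||(x - x_i)/h_i||^2
        w = np.exp(-0.5 * d2) / h ** (X.shape[1] + 2)    # g(.) / h_i^(d+2)
        x_new = (w[:, None] * X).sum(axis=0) / w.sum()   # x + delta x
        if np.linalg.norm(x_new - x) < tol:              # mean shift vector ~ 0
            return x_new
        x = x_new
    return x

# Toy usage: two well-separated blobs, a fixed bandwidth for every point.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
h = np.full(len(X), 0.5)
modes = np.array([mean_shift_mode(x0.copy(), X, h) for x0 in X])
```

Points whose iterations converge to (numerically) the same mode are then grouped into one cluster.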

Fig. 2: Clustering with must-link constraints. (a) Input space. Red markers are the constraint pair $(\mathbf{x}_1, \mathbf{x}_2)$. (b) The input space is mapped to the feature space via the quadratic mapping $\phi(x) = [x\;\; x^2]^\top$. The black arrow is the constraint vector $\phi(\mathbf{x}_1) - \phi(\mathbf{x}_2)$, and the red dashed line is its null space. (c) The feature space is projected to the null space of the constraint vector. Constraint points collapse to a single point and are guaranteed to be clustered together. Two clusters can be easily identified.

Note that in many problems, this mapping is sufficient to achieve the desired separability between different clusters. We first derive the mean shift procedure on the feature space $\mathcal{H}$ in terms of the explicit representation of the mapping $\phi$. The point sample density estimator at $\mathbf{y} \in \mathcal{H}$, with the diagonal bandwidth matrices $h_i \mathbf{I}_{d_\phi \times d_\phi}$, is

$f_{\mathcal{H}}(\mathbf{y}) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{h_i^{d_\phi}}\, k\!\left(\left\|\frac{\mathbf{y}-\phi(\mathbf{x}_i)}{h_i}\right\|^2\right).$  (6)

Taking the gradient of (6) w.r.t. $\phi$, like (4), the solution can be found iteratively using the mean shift procedure

$\delta\mathbf{y} = \frac{\sum_{i=1}^{n}\frac{\phi(\mathbf{x}_i)}{h_i^{d_\phi+2}}\, g\!\left(\left\|\frac{\mathbf{y}-\phi(\mathbf{x}_i)}{h_i}\right\|^2\right)}{\sum_{i=1}^{n}\frac{1}{h_i^{d_\phi+2}}\, g\!\left(\left\|\frac{\mathbf{y}-\phi(\mathbf{x}_i)}{h_i}\right\|^2\right)} - \mathbf{y}.$  (7)

By employing the kernel trick, we now derive the implicit formulation of the kernel mean shift algorithm. We define $K : \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$, a positive semidefinite, scalar kernel function satisfying, for all $\mathbf{x}, \mathbf{x}' \in \mathcal{X}$,

$K(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^\top \phi(\mathbf{x}').$  (8)

$K(\cdot)$ defines an inner product which makes it possible to map the data implicitly to a high-dimensional kernel space. Let

$\Phi = [\phi(\mathbf{x}_1)\;\; \phi(\mathbf{x}_2)\;\; \ldots\;\; \phi(\mathbf{x}_n)]$  (9)

be the $d_\phi \times n$ matrix of the mapped points and $\mathbf{K} = \Phi^\top\Phi$ be the $n \times n$ kernel (Gram) matrix. We observe that at each iteration of the mean shift procedure (7), the estimate $\bar{\mathbf{y}} = \mathbf{y} + \delta\mathbf{y}$ always lies in the column space of $\Phi$. Any such point $\bar{\mathbf{y}}$ can be written as

$\bar{\mathbf{y}} = \Phi\,\boldsymbol{\alpha}_{\bar{\mathbf{y}}}$  (10)

where $\boldsymbol{\alpha}_{\bar{\mathbf{y}}}$ is an n-dimensional weighting vector. The distance between two points $\mathbf{y}$ and $\mathbf{y}'$ in this space can be expressed in terms of their inner product and their respective weighting vectors

$\|\mathbf{y}-\mathbf{y}'\|^2 = \|\Phi\boldsymbol{\alpha}_{\mathbf{y}} - \Phi\boldsymbol{\alpha}_{\mathbf{y}'}\|^2 = \boldsymbol{\alpha}_{\mathbf{y}}^\top \mathbf{K}\boldsymbol{\alpha}_{\mathbf{y}} + \boldsymbol{\alpha}_{\mathbf{y}'}^\top \mathbf{K}\boldsymbol{\alpha}_{\mathbf{y}'} - 2\boldsymbol{\alpha}_{\mathbf{y}}^\top \mathbf{K}\boldsymbol{\alpha}_{\mathbf{y}'}.$  (11)

Let $\mathbf{e}_i$ denote the i-th canonical basis vector for $\mathbb{R}^n$. Applying (11) to compute distances in (7) by using the equivalence $\phi(\mathbf{x}_i) = \Phi\mathbf{e}_i$, the mean shift algorithm iteratively updates the weighting vector $\boldsymbol{\alpha}_{\mathbf{y}}$

$\boldsymbol{\alpha}_{\bar{\mathbf{y}}} = \frac{\sum_{i=1}^{n}\frac{\mathbf{e}_i}{h_i^{d_\phi+2}}\, g\!\left(\frac{\boldsymbol{\alpha}_{\mathbf{y}}^\top\mathbf{K}\boldsymbol{\alpha}_{\mathbf{y}} + \mathbf{e}_i^\top\mathbf{K}\mathbf{e}_i - 2\boldsymbol{\alpha}_{\mathbf{y}}^\top\mathbf{K}\mathbf{e}_i}{h_i^2}\right)}{\sum_{i=1}^{n}\frac{1}{h_i^{d_\phi+2}}\, g\!\left(\frac{\boldsymbol{\alpha}_{\mathbf{y}}^\top\mathbf{K}\boldsymbol{\alpha}_{\mathbf{y}} + \mathbf{e}_i^\top\mathbf{K}\mathbf{e}_i - 2\boldsymbol{\alpha}_{\mathbf{y}}^\top\mathbf{K}\mathbf{e}_i}{h_i^2}\right)}.$  (12)

The clustering starts on the data points on $\mathcal{H}$, therefore the initial weighting vectors are given by $\boldsymbol{\alpha}_{\mathbf{y}_i} = \mathbf{e}_i$, such that $\mathbf{y}_i = \Phi\boldsymbol{\alpha}_{\mathbf{y}_i} = \phi(\mathbf{x}_i)$, $i = 1, \ldots, n$. At convergence, the mode $\bar{\mathbf{y}}$ can be recovered using (10) as $\Phi\bar{\boldsymbol{\alpha}}_{\mathbf{y}}$. The points converging close to the same mode are clustered together, following the original proof [11]. Since any positive semidefinite matrix $\mathbf{K}$ is a kernel for some feature space [13], the derived method implicitly applies mean shift on the feature space induced by $\mathbf{K}$. Note that under this formulation, mean shift in the input space can be implemented by simply choosing the mapping function $\phi(\cdot)$ to be the identity, i.e., $\phi(\mathbf{x}) = \mathbf{x}$.

An important point is that the dimensionality of the feature space $d_\phi$ can be very large, for example it is infinite in the case of the Gaussian kernel function. In such cases it may not be possible to explicitly compute the point sample density estimator (6) and consequently the mean shift vectors (7). Since the dimensionality of the subspace spanned by the feature points is equal to rank $\mathbf{K} \le n$, it is sufficient to use the implicit form of the mean shift procedure using (12).

4 KERNEL LEARNING USING LINEAR TRANSFORMATIONS

A nonlinear mapping of the input data to a higher-dimensional kernel space often improves cluster separability.
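As an aside on the implicit procedure derived above, here is a minimal sketch of the update (12). For simplicity it assumes a single global bandwidth h, so that the $h_i^{d_\phi+2}$ factor cancels, and takes a precomputed kernel matrix; all names and the convergence tolerance are illustrative.

```python
import numpy as np

def kernel_mean_shift(K, h, tol=1e-8, max_iter=300):
    """Run the implicit mean shift update (12) for every point.

    K : (n, n) kernel (Gram) matrix, h : scalar bandwidth.
    Each point is represented by its weighting vector alpha (10)."""
    n = K.shape[0]
    diagK = np.diag(K)
    alphas = np.eye(n)                                 # alpha_i = e_i initially
    for i in range(n):
        a = alphas[i].copy()
        for _ in range(max_iter):
            Ka = K @ a
            d2 = (a @ Ka + diagK - 2.0 * Ka) / h ** 2  # squared distances via (11)
            w = np.exp(-0.5 * d2)                      # profile weights g(.)
            a_new = w / w.sum()                        # convex combination of the e_i
            if (a_new - a) @ K @ (a_new - a) < tol:    # ||delta y||^2 in feature space
                break
            a = a_new
        alphas[i] = a
    return alphas                                      # row i encodes the mode of point i
```

Points are subsequently grouped together when the distance (11) between their converged modes is (numerically) zero.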
By effectively enforcing the available constraints, it is possible to transform the entire space and guide the clustering to discover the desired structure in the data. To illustrate this intuition, we present a simple two class example from [37] in Fig. 2. The original data lies along the x-axis, with the blue points associated with one class and the black points with the second class (Fig. 2a).

This data appears to have originated from three clusters. Let $(\mathbf{x}_1, \mathbf{x}_2)$ be the pair of points marked in red which are constrained to be clustered together. We map the data explicitly to a two-dimensional feature space via a simple quadratic mapping $\phi(x) = [x\;\; x^2]^\top$ (Fig. 2b). Although the data is linearly separable, it still appears to form three clusters. Using the must-link constraint pair, we enforce the two red points to be clustered together. The black arrow denotes the constraint vector $\phi(\mathbf{x}_1) - \phi(\mathbf{x}_2)$ and the dashed red line is its null space. By projecting the feature points to the null space of the constraint vector, the points $\phi(\mathbf{x}_1)$ and $\phi(\mathbf{x}_2)$ are collapsed to the same point, guaranteeing their association with the same mode. From Fig. 2c, we can see that in the projected space, the data has the desired cluster structure which is consistent with its class association.

This approach, although simple, does not scale well with an increasing number of constraints for a simple nonlinear mapping like the one above. Given m linearly independent constraint vectors in a $d_\phi$-dimensional space, the dimensionality of the null space of the constraint matrix is $d_\phi - m$. This implies that if $d_\phi$ or more constraints are specified, all the points collapse to a single point and are therefore grouped together in one cluster. This problem can be alleviated if a mapping function $\phi(\cdot)$ is chosen such that $d_\phi$ is very large. Since explicitly designing the mapping function $\phi(\mathbf{x})$ is not always practical, we use a kernel function $K(\mathbf{x}, \mathbf{x}')$ to implicitly map the input data to a very high-dimensional kernel space. As we show in Section 4.1, the subsequent projection to the null space of the constraint vectors can also be achieved implicitly by appropriately updating the kernel matrix. In Section 4.2, we further generalize the projection operation to a linear transformation that also utilizes cannot-link constraints.

4.1 Kernel Updates Using Orthogonal Projection

Recall the matrix $\Phi$ in (9), obtained by mapping the input points to $\mathcal{H}$ via the nonlinear function $\phi$. Let $(j_1, j_2)$ be a must-link constraint pair such that $\phi(\mathbf{x}_{j_1}) = \Phi\mathbf{e}_{j_1}$ and $\phi(\mathbf{x}_{j_2}) = \Phi\mathbf{e}_{j_2}$ are to be clustered together. Given a set $\mathcal{M}$ of m such must-link constraint pairs, for every $(j_1, j_2) \in \mathcal{M}$, the $d_\phi$-dimensional constraint vector can be written as $\mathbf{a}_j = \Phi(\mathbf{e}_{j_1} - \mathbf{e}_{j_2}) = \Phi\mathbf{z}_j$. We refer to the n-dimensional vector $\mathbf{z}_j$ as the indicator vector for the j-th constraint. The $d_\phi \times m$ constraint matrix $\mathbf{A}$ can be obtained by column stacking all the m constraint vectors, i.e., $\mathbf{A} = \Phi\mathbf{Z}$, where $\mathbf{Z} = [\mathbf{z}_1, \ldots, \mathbf{z}_m]$ is the $n \times m$ matrix of indicator vectors. Similar to the example in Fig. 2, we impose the constraints by projecting the matrix $\Phi$ to the null space of $\mathbf{A}$ using the projection matrix

$\mathbf{P} = \mathbf{I}_{d_\phi} - \mathbf{A}\left(\mathbf{A}^\top\mathbf{A}\right)^{+}\mathbf{A}^\top$  (13)

where $\mathbf{I}_{d_\phi}$ is the $d_\phi$-dimensional identity matrix and $+$ denotes the matrix pseudoinverse operation. Let $\mathbf{S} = \mathbf{A}^\top\mathbf{A}$ be the $m \times m$ scaling matrix. The matrix $\mathbf{S}$ can be computed using the indicator vectors and the initial kernel matrix $\mathbf{K}$, without knowing the mapping $\phi$, as

$\mathbf{S} = \mathbf{A}^\top\mathbf{A} = \mathbf{Z}^\top\Phi^\top\Phi\mathbf{Z} = \mathbf{Z}^\top\mathbf{K}\mathbf{Z}.$  (14)

Given the constraint set, the new mapping function is computed as $\hat{\phi}(\mathbf{x}) = \mathbf{P}\phi(\mathbf{x})$. Since all the constraints in $\mathcal{M}$ are satisfied in the projected subspace, we have

$\|\mathbf{P}\Phi\mathbf{Z}\|_F^2 = 0$  (15)

where $\|\cdot\|_F$ denotes the Frobenius norm. The initial kernel function (8) correspondingly transforms to the projected kernel function $\hat{K}(\mathbf{x}, \mathbf{x}') = \hat{\phi}(\mathbf{x})^\top\hat{\phi}(\mathbf{x}')$. Using the projection matrix $\mathbf{P}$, it can be rewritten in terms of the initial kernel function $K(\mathbf{x}, \mathbf{x}')$ and the constraint matrix $\mathbf{A}$

$\hat{K}(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^\top\mathbf{P}^\top\mathbf{P}\,\phi(\mathbf{x}') = \phi(\mathbf{x})^\top\mathbf{P}\,\phi(\mathbf{x}') = \phi(\mathbf{x})^\top\left(\mathbf{I}_{d_\phi} - \mathbf{A}\mathbf{S}^{+}\mathbf{A}^\top\right)\phi(\mathbf{x}') = K(\mathbf{x}, \mathbf{x}') - \phi(\mathbf{x})^\top\mathbf{A}\mathbf{S}^{+}\mathbf{A}^\top\phi(\mathbf{x}').$  (16)
Note that the identity $\mathbf{P}^\top\mathbf{P} = \mathbf{P}$ follows from the fact that $\mathbf{P}$ is a symmetric projection matrix. The m-dimensional vector $\mathbf{A}^\top\phi(\mathbf{x})$ can also be written in terms of the initial kernel function as

$\mathbf{A}^\top\phi(\mathbf{x}) = \mathbf{Z}^\top\Phi^\top\phi(\mathbf{x}) = \mathbf{Z}^\top\left[K(\mathbf{x}_1, \mathbf{x}), \ldots, K(\mathbf{x}_n, \mathbf{x})\right]^\top.$  (17)

Let the vector $\mathbf{k}_{\mathbf{x}} = [K(\mathbf{x}_1, \mathbf{x}), \ldots, K(\mathbf{x}_n, \mathbf{x})]^\top$. From (16), the projected kernel function can be written as

$\hat{K}(\mathbf{x}, \mathbf{x}') = K(\mathbf{x}, \mathbf{x}') - \mathbf{k}_{\mathbf{x}}^\top\mathbf{Z}\mathbf{S}^{+}\mathbf{Z}^\top\mathbf{k}_{\mathbf{x}'}.$  (18)

The projected kernel matrix can be directly computed as

$\hat{\mathbf{K}} = \mathbf{K} - \mathbf{K}\mathbf{Z}\mathbf{S}^{+}\mathbf{Z}^\top\mathbf{K}$  (19)

where the symmetric $n \times n$ initial kernel matrix $\mathbf{K}$ has rank $r \le n$. The rank of the matrix $\mathbf{S}$ is equal to the number of linearly independent constraints, and the rank of the projected kernel matrix $\hat{\mathbf{K}}$ is $r - \mathrm{rank}\,\mathbf{S}$.

4.2 Kernel Updates Using Linear Transformation

By projecting the feature points to the null space of the constraint vector $\mathbf{a}$, their components along $\mathbf{a}$ are fully eliminated. This operation guarantees that the two points belong to the same cluster. As proposed earlier, by appropriately choosing the kernel function $K(\cdot)$ (such that $d_\phi$ is very large), we can make the initial kernel matrix $\mathbf{K}$ full rank. However, a sufficiently large number of such linearly independent constraints could still result in a projected space with dimensionality that is too small for meaningful clustering. From a clustering perspective, it might suffice to bring the two must-link constraint points sufficiently close to each other. This can be achieved through a linear transformation of the kernel space that only scales down the component along the constraint vector. Such a linear transformation preserves the rank of the kernel matrix and is potentially capable of handling a very large number of must-link constraints. A similar transformation that increases the distance between two points also enables us to incorporate cannot-link constraints.
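Before generalizing the projection to a scaled transformation, a minimal sketch of the implicit update (19) may help: building the indicator matrix Z from must-link pairs and computing the projected kernel matrix. The toy points, the Gaussian kernel bandwidth and the constraint indices are illustrative assumptions.

```python
import numpy as np

def project_kernel(K, must_links):
    """Projected kernel matrix (19): K_hat = K - K Z S^+ Z^T K."""
    n = K.shape[0]
    Z = np.zeros((n, len(must_links)))
    for j, (j1, j2) in enumerate(must_links):
        Z[j1, j] = 1.0                          # indicator vector z_j = e_j1 - e_j2
        Z[j2, j] = -1.0
    S = Z.T @ K @ Z                             # scaling matrix (14)
    return K - K @ Z @ np.linalg.pinv(S) @ Z.T @ K

# Toy usage with a Gaussian kernel on random 2-D points (sigma assumed).
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))
sq = ((X[:, None] - X[None]) ** 2).sum(-1)
K = np.exp(-sq / (2 * 0.5 ** 2))
K_hat = project_kernel(K, [(0, 1), (2, 3)])
# Distances (11) between the constrained points are now (numerically) zero.
```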

Given a constraint vector $\mathbf{a} = \Phi\mathbf{z}$, a symmetric transformation matrix of the form

$\mathbf{T} = \mathbf{I}_{d_\phi} - s\left(\mathbf{a}\mathbf{a}^\top\right)$  (20)

allows us to control the scaling along the vector $\mathbf{a}$ using the scaling factor s. When $s = (\mathbf{a}^\top\mathbf{a})^{-1}$, the transformation becomes a projection to the null space of $\mathbf{a}$. The transformation decreases distances along $\mathbf{a}$ for $0 < s < 2(\mathbf{a}^\top\mathbf{a})^{-1}$, while distances increase for $s < 0$ or $s > 2(\mathbf{a}^\top\mathbf{a})^{-1}$. We can set a target distance $d > 0$ for the constraint point pair by applying an appropriate transformation $\mathbf{T}$. The constraint equation can be written as

$\|\mathbf{T}\Phi\mathbf{z}\|_F^2 = \mathbf{z}^\top\hat{\mathbf{K}}\mathbf{z} = d$  (21)

where $\hat{\mathbf{K}} = \Phi^\top\mathbf{T}^\top\mathbf{T}\Phi = \Phi^\top\mathbf{T}^2\Phi$ is the corresponding transformed kernel matrix. To handle must-link constraints, d should be small, while it should be large for cannot-link constraint pairs. Using the specified d, we can compute s and therefore the transformed kernel matrix

$\hat{\mathbf{K}} = \Phi^\top\left(\mathbf{I}_{d_\phi} - s\,\mathbf{a}\mathbf{a}^\top\right)^2\Phi.$  (22)

Substituting $\mathbf{a} = \Phi\mathbf{z}$, the expression for $\hat{\mathbf{K}}$ is

$\hat{\mathbf{K}} = \mathbf{K} - 2s\,\mathbf{K}\mathbf{z}\mathbf{z}^\top\mathbf{K} + s^2\left(\mathbf{z}^\top\mathbf{K}\mathbf{z}\right)\mathbf{K}\mathbf{z}\mathbf{z}^\top\mathbf{K}.$  (23)

From (21), we can solve for s

$s = \frac{1}{p}\left(1 \pm \sqrt{\frac{d}{p}}\right) \qquad \text{where } p = \mathbf{z}^\top\mathbf{K}\mathbf{z} > 0$  (24)

and the choice of s does not affect the following kernel update, such that the constraint (21) is satisfied

$\hat{\mathbf{K}} = \mathbf{K} + \beta\,\mathbf{K}\mathbf{z}\mathbf{z}^\top\mathbf{K}, \qquad \beta = \frac{1}{p}\left(\frac{d}{p} - 1\right).$  (25)

The case $p = 0$ implies that the constraint vector is a zero vector, so $\mathbf{T} = \mathbf{I}_{d_\phi}$ and a value of $\beta = 0$ should be used. This update rule is equivalent to minimizing the log det divergence (29) for a single equality constraint using Bregman projections (31).

When multiple must-link and cannot-link constraints are available, the kernel matrix can be updated iteratively for each constraint by computing the update parameter $\beta_j$ using the corresponding $d_j$ and $p_j$. However, by enforcing the distance between constraint pairs to be exactly $d_j$, the linear transformation imposes hard constraints for learning the kernel matrix, which could potentially have two adverse consequences:

1) The hard constraints could make the algorithm difficult (or even impossible) to converge, since previously satisfied constraints could easily get violated during subsequent updates of the kernel matrix.

2) Even when the constraints are non-conflicting in nature, in the absence of any relaxation, the method becomes sensitive to labeling errors. This is illustrated through the following example.

The input data consists of points along five concentric circles, as shown in Fig. 3a. Each cluster comprises 150 noisy points along a circle. Four labeled points per class (shown by square markers) are used to generate a set of $\binom{4}{2} \cdot 5 = 30$ must-link constraints. The initial kernel matrix $\mathbf{K}$ is computed using a Gaussian kernel with σ = 5, and updated by imposing the provided constraints (19). The updated kernel matrix $\hat{\mathbf{K}}$ is used for kernel mean shift clustering (Section 3.2) and the corresponding results are shown in Fig. 3b. To test the performance of the method under labeling errors, we also add one mislabeled must-link constraint (shown as a black line). The clustering performance deteriorates drastically with just one mislabeled constraint, as shown in Fig. 3c.

In order to overcome these limitations, the learning algorithm must accommodate systematic relaxation of constraints. This can be achieved by observing that the kernel update in (25) is equivalent to minimizing the log det divergence between $\hat{\mathbf{K}}$ and $\mathbf{K}$ subject to constraint (21) [25]. In the following section, we formulate the kernel learning algorithm as a log det minimization problem with soft constraints.

5 KERNEL LEARNING USING BREGMAN DIVERGENCE

Bregman divergences have been studied in the context of clustering, matrix nearness and metric and kernel learning [22], [25], [1].
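As a concrete aside on the linear transformation of Section 4.2, here is a minimal sketch of the single-constraint update (25); the target distance and toy indices are illustrative.

```python
import numpy as np

def scale_constraint(K, j1, j2, d):
    """Kernel update (25): move points j1, j2 to squared distance d.

    A small d acts as a must-link, a large d as a cannot-link."""
    z = np.zeros(K.shape[0])
    z[j1], z[j2] = 1.0, -1.0
    p = z @ K @ z                        # current squared distance (11)
    if p == 0:                           # zero constraint vector: no update
        return K.copy()
    beta = (d / p - 1.0) / p
    Kz = K @ z
    return K + beta * np.outer(Kz, Kz)   # K + beta * K z z^T K

# After the update, (e_j1 - e_j2)^T K_hat (e_j1 - e_j2) equals d exactly.
```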
We briefly discuss the log det Bregman divergence and its properties in Section 5.1. We summarize the kernel learning problem using Bregman divergences, which was introduced in [25], [22], in Section 5.2.

5.1 The LogDet Bregman Divergence

The Bregman divergence [5] between real, symmetric $n \times n$ matrices $\mathbf{X}$ and $\mathbf{Y}$ is a scalar function given by

$D_{\varphi}(\mathbf{X}, \mathbf{Y}) = \varphi(\mathbf{X}) - \varphi(\mathbf{Y}) - \mathrm{tr}\!\left(\nabla\varphi(\mathbf{Y})^\top(\mathbf{X} - \mathbf{Y})\right)$  (26)

where $\varphi$ is a strictly convex function and $\nabla$ denotes the gradient operator. For $\varphi(\mathbf{X}) = -\log\det(\mathbf{X})$, the resulting divergence is called the log det divergence

$D_{ld}(\mathbf{X}, \mathbf{Y}) = \mathrm{tr}\left(\mathbf{X}\mathbf{Y}^{-1}\right) - \log\det\left(\mathbf{X}\mathbf{Y}^{-1}\right) - n$  (27)

and is defined when $\mathbf{X}$, $\mathbf{Y}$ are positive definite. In [25], this definition was extended to rank deficient (positive semidefinite) matrices by restricting the convex function to the range spaces of the matrices. For positive semidefinite matrices $\mathbf{X}$ and $\mathbf{Y}$, both having rank $r \le n$ and singular value decompositions $\mathbf{X} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^\top$ and $\mathbf{Y} = \mathbf{U}\boldsymbol{\Theta}\mathbf{U}^\top$, the log det divergence is defined as

$D_{ld}(\mathbf{X}, \mathbf{Y}) = \sum_{i, j \le r} \left(\mathbf{v}_i^\top\mathbf{u}_j\right)^2 \left(\frac{\lambda_i}{\theta_j} - \log\frac{\lambda_i}{\theta_j} - 1\right).$  (28)

Moreover, the log det divergence, like any Bregman divergence, is convex with respect to its first argument $\mathbf{X}$ [1]. This property is useful in formulating the kernel learning as a convex minimization problem in the presence of multiple constraints.
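A minimal sketch of the dense form (27); the example matrices are arbitrary positive definite matrices chosen only for illustration.

```python
import numpy as np

def logdet_div(X, Y):
    """LogDet divergence (27) for positive definite X, Y."""
    n = X.shape[0]
    XYinv = X @ np.linalg.inv(Y)
    logdet = np.linalg.slogdet(XYinv)[1]
    return np.trace(XYinv) - logdet - n

# The divergence is zero iff X == Y, and positive otherwise.
A = np.array([[2.0, 0.3], [0.3, 1.0]])
print(logdet_div(A, A))            # ~0.0
print(logdet_div(A, np.eye(2)))    # > 0
```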

Fig. 3: Five concentric circles; kernel learning without relaxation. (a) Input data. The square markers indicate the data points used to generate pairwise constraints. The black line shows the only mislabeled similarity constraint used. (b) Clustering results when only the correctly labeled similarity constraints were used. (c) Clustering results when the mislabeled constraint was also used.

5.2 Kernel Learning with LogDet Divergences

Let $\mathcal{M}$ and $\mathcal{C}$ denote the sets of m must-link and c cannot-link pairs respectively, such that $m + c = n_c$. Let $d_m$ and $d_c$ be the target squared distance thresholds for must-link and cannot-link constraints respectively. The problem of learning a kernel matrix using linear transformations for multiple constraints, discussed in Section 4.2, can be equivalently formulated as the following constrained log det minimization problem

$\min_{\hat{\mathbf{K}}} \; D_{ld}\left(\hat{\mathbf{K}}, \mathbf{K}\right)$  (29)
s.t. $(\mathbf{e}_{j_1}-\mathbf{e}_{j_2})^\top\hat{\mathbf{K}}(\mathbf{e}_{j_1}-\mathbf{e}_{j_2}) = d_m \quad (j_1, j_2) \in \mathcal{M}$
$(\mathbf{e}_{j_1}-\mathbf{e}_{j_2})^\top\hat{\mathbf{K}}(\mathbf{e}_{j_1}-\mathbf{e}_{j_2}) = d_c \quad (j_1, j_2) \in \mathcal{C}.$

The final kernel matrix $\hat{\mathbf{K}}$ is obtained by iteratively updating the initial kernel matrix $\mathbf{K}$. In order to permit relaxation of constraints, we rewrite the learning problem using a soft margin formulation similar to [22], where each constraint pair $(j_1, j_2)$ is associated with a slack variable $\hat{\xi}_j$, $j = 1, \ldots, n_c$

$\min_{\hat{\mathbf{K}}, \hat{\boldsymbol{\xi}}} \; D_{ld}\left(\hat{\mathbf{K}}, \mathbf{K}\right) + \gamma\, D_{ld}\left(\mathrm{diag}(\hat{\boldsymbol{\xi}}), \mathrm{diag}(\boldsymbol{\xi})\right)$  (30)
s.t. $(\mathbf{e}_{j_1}-\mathbf{e}_{j_2})^\top\hat{\mathbf{K}}(\mathbf{e}_{j_1}-\mathbf{e}_{j_2}) \le \hat{\xi}_j \quad (j_1, j_2) \in \mathcal{M}$
$(\mathbf{e}_{j_1}-\mathbf{e}_{j_2})^\top\hat{\mathbf{K}}(\mathbf{e}_{j_1}-\mathbf{e}_{j_2}) \ge \hat{\xi}_j \quad (j_1, j_2) \in \mathcal{C}.$

The $n_c$-dimensional vector $\hat{\boldsymbol{\xi}}$ is the vector of slack variables and $\boldsymbol{\xi}$ is the vector of target distance thresholds $d_m$ and $d_c$. The regularization parameter γ controls the trade-off between fidelity to the original kernel and the training error. Note that by changing the equality constraints to inequality constraints, the algorithm allows must-link pairs to be closer and cannot-link pairs to be farther than their corresponding distance thresholds.

The optimization problem in (30) is solved using the method of Bregman projections [5]. A Bregman projection (not necessarily orthogonal) is performed to update the current matrix such that the updated matrix satisfies that constraint. In each iteration, it is possible that the current update violates a previously satisfied constraint. Since the problem in (30) is convex [25], the algorithm converges to the global minimum after repeatedly updating the kernel matrix for each constraint in the set $\mathcal{M} \cup \mathcal{C}$.

For the log det divergence, the Bregman projection that minimizes the objective function in (30) for a given constraint $(j_1, j_2) \in \mathcal{M} \cup \mathcal{C}$ can be written, as derived in [25],

$\hat{\mathbf{K}}_{t+1} = \hat{\mathbf{K}}_t + \beta_t\, \hat{\mathbf{K}}_t\,(\mathbf{e}_{j_1}-\mathbf{e}_{j_2})(\mathbf{e}_{j_1}-\mathbf{e}_{j_2})^\top\hat{\mathbf{K}}_t.$  (31)

For the t-th iteration the parameter $\beta_t$ is computed in closed form, as explained in the supplementary material. The algorithm converges when $\beta_t$ approaches zero for all $(j_1, j_2) \in \mathcal{M} \cup \mathcal{C}$, with the final learned kernel matrix $\hat{\mathbf{K}} = \hat{\Phi}^\top\hat{\Phi}$. Using $\hat{\mathbf{K}}$, the kernel function that defines the inner product in the corresponding transformed kernel space can be written as

$\hat{K}(\mathbf{x}, \mathbf{y}) = K(\mathbf{x}, \mathbf{y}) + \mathbf{k}_{\mathbf{x}}^\top\,\mathbf{K}^{+}\left(\hat{\mathbf{K}} - \mathbf{K}\right)\mathbf{K}^{+}\,\mathbf{k}_{\mathbf{y}}$  (32)

where $K(\cdot)$ is the scalar initial kernel function (8), and the vectors $\mathbf{k}_{\mathbf{x}} = [K(\mathbf{x}, \mathbf{x}_1), \ldots, K(\mathbf{x}, \mathbf{x}_n)]^\top$ and $\mathbf{k}_{\mathbf{y}} = [K(\mathbf{y}, \mathbf{x}_1), \ldots, K(\mathbf{y}, \mathbf{x}_n)]^\top$ [22]. The points $\mathbf{x}, \mathbf{y} \in \mathcal{X}$ could be out of sample points, i.e., points that are not in the sample set $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ used to learn $\hat{\mathbf{K}}$. Note that (18) also generalizes the inner product in the projected kernel space to out of sample points.

Fig. 4: Block diagram describing the semi-supervised kernel mean shift clustering algorithm. The bold boxes indicate the user input to the algorithm.
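To illustrate the out-of-sample extension (32), the following sketch assumes the initial kernel function is Gaussian and that the learned matrix is already available; the function name and parameters are illustrative assumptions.

```python
import numpy as np

def learned_kernel(x, y, X, K, K_hat, sigma):
    """Evaluate the learned kernel (32) at two (possibly unseen) points.

    X : (n, d) training points, K : initial kernel matrix on X,
    K_hat : learned kernel matrix, sigma : Gaussian scale parameter."""
    gauss = lambda a, b: np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))
    kx = np.array([gauss(x, xi) for xi in X])   # k_x = [K(x, x_1), ..., K(x, x_n)]
    ky = np.array([gauss(y, xi) for xi in X])
    Kp = np.linalg.pinv(K)
    return gauss(x, y) + kx @ Kp @ (K_hat - K) @ Kp @ ky
```

Evaluating this function on all pairs of new points yields the Gram matrix needed to run kernel mean shift on out of sample data.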
6 SEMI-SUPERVISED KERNEL MEAN SHIFT CLUSTERING ALGORITHM

We present the complete algorithm for semi-supervised kernel mean shift clustering (SKMS) in this section. Fig. 4 shows a block diagram for the overall algorithm. We explain each of the modules using two examples: Olympic circles, a synthetic data set, and a 1000 sample subset of USPS digits, a real data set with images of handwritten digits. In Section 6.1, we propose a method to select the scale parameter σ for the initial Gaussian kernel function using the sets $\mathcal{M}$, $\mathcal{C}$ and target distances $d_m$, $d_c$. We show the speed-up achieved by performing low-rank kernel matrix updates (as opposed to updating the full kernel matrix) in Section 6.2. For mean shift clustering, we present a strategy to automatically select the bandwidth parameter using the pairwise must-link constraints in Section 6.3. Finally, in Section 6.4, we discuss the selection of the trade-off parameter γ.

Fig. 5: Selecting the scale parameter σ̂ that minimizes $D_{ld}(\boldsymbol{\xi}, \boldsymbol{\xi}_\sigma)$ using grid search, for Olympic circles (σ̂ = 0.75) and USPS digits.

6.1 Initial Parameter Selection

Given two input sample points $\mathbf{x}_i, \mathbf{x}_j \in \mathbb{R}^d$, the Gaussian kernel function is given by

$K_\sigma(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2}\right) \in [0, 1]$  (33)

where σ is the scale parameter. From (33) and (11) it is evident that pairwise distances between sample points in the feature space induced by $K_\sigma(\cdot)$ lie in the interval [0, 2]. This provides us with an effective way of setting up the target distances $d_m = \min(d_1, 0.05)$ and $d_c = \max(d_{99}, 1.95)$, where $d_1$ and $d_{99}$ are the 1st and 99th percentiles of distances between all pairs of points in the kernel space. We select the scale parameter σ such that, in the initial kernel space, distances between must-link points are small while those between cannot-link points are large. This results in good regularization and faster convergence of the learning algorithm.

Recall that $\boldsymbol{\xi}$ is the $n_c$-dimensional vector of target squared distances between the constraint pairs

$\xi_j = \begin{cases} d_m & (j_1, j_2) \in \mathcal{M} \\ d_c & (j_1, j_2) \in \mathcal{C}. \end{cases}$  (34)

Let $\boldsymbol{\xi}_\sigma$ be the vector of distances computed for the $n_c$ constraint pairs using the kernel matrix $\mathbf{K}_\sigma$. The scale parameter that minimizes the log det divergence between the vectors $\boldsymbol{\xi}$ and $\boldsymbol{\xi}_\sigma$ is

$\hat\sigma = \arg\min_{\sigma \in S} D_{ld}\left(\mathrm{diag}(\boldsymbol{\xi}), \mathrm{diag}(\boldsymbol{\xi}_\sigma)\right)$  (35)

and the corresponding kernel matrix is $\mathbf{K}_{\hat\sigma}$. The kernel learning is insensitive to small variations in σ̂, thus it is sufficient to do a search over a discrete set S. The elements of the set S are roughly the centers of equal probability bins over the range of all pairwise distances between the input sample points $\mathbf{x}_i$, $i = 1, \ldots, n$. For Olympic circles, we used S = {0.025, 0.05, 0.1, 0.3, 0.5, 0.75, 1, 1.5, 2, 3, 5} and for USPS digits, S = {1, 2, ..., 10, 15, 20, 25}. A similar set can be automatically generated by mapping a uniform probability grid (over 0–1) via the inverse cdf of the pairwise distances. Fig. 5 shows the plot of the objective function in (35) against different values of σ for the Olympic circles data set and the USPS digits data set.

6.2 Low Rank Kernel Learning

When the initial kernel matrix has rank $r \le n$, the $n \times n$ matrix updates (31) can be modified to achieve a significant computational speed-up [25] (see the supplementary material for the low-rank kernel learning algorithm). We learn a kernel matrix for clustering the two example data sets: Olympic circles (5 classes, n = 1500) and USPS digits (10 classes, n = 1000). The must-link constraints are generated using 15 labeled points from each class: 525 for Olympic circles and 1050 for USPS digits, while an equal number of cannot-link constraints is used. The $n \times n$ initial kernel matrix $\mathbf{K}_{\hat\sigma}$ is computed as described in the previous section. Using singular value decomposition (SVD), we compute an $n \times n$ low-rank kernel matrix $\mathbf{K}$ such that rank $\mathbf{K} = r \le n$ and $\|\mathbf{K}\|_F \ge 0.99\,\|\mathbf{K}_{\hat\sigma}\|_F$. For Olympic circles, this results in a rank 58 approximation of the matrix $\mathbf{K}$, reducing the kernel learning time to 0.92 secs. In the case of USPS digits, the matrix $\mathbf{K}$ has rank 499 and the run time reduces to 68.2 secs. We also observed that in all our experiments in Section 7, the clustering performance did not deteriorate significantly when the low-rank approximation of the kernel matrix was used.

6.3 Setting the Mean Shift Parameters

For kernel mean shift clustering, we define the bandwidth for each point as the distance to its k-th nearest neighbor. In general, clustering is an interactive process where the bandwidth parameter for mean shift is provided by the user, or is estimated based on the desired number of clusters.
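As an aside, returning to the scale selection of Section 6.1, a minimal sketch of the grid search (35); for brevity it uses fixed targets $d_m = 0.05$ and $d_c = 1.95$, omitting the percentile adjustment described above, and all names are illustrative.

```python
import numpy as np

def select_sigma(X, must, cannot, sigmas, dm=0.05, dc=1.95):
    """Grid search (35): pick sigma minimizing D_ld(diag(xi), diag(xi_sigma)).

    For diagonal matrices the divergence (27) reduces to a sum over the
    ratios t = xi_j / (xi_sigma)_j of (t - log t - 1)."""
    pairs = list(must) + list(cannot)
    targets = np.array([dm] * len(must) + [dc] * len(cannot))
    best, best_obj = None, np.inf
    sq = ((X[:, None] - X[None]) ** 2).sum(-1)
    for s in sigmas:
        K = np.exp(-sq / (2 * s ** 2))                         # Gaussian kernel (33)
        d = np.array([K[i, i] + K[j, j] - 2 * K[i, j] for i, j in pairs])
        t = targets / np.maximum(d, 1e-12)
        obj = np.sum(t - np.log(t) - 1.0)
        if obj < best_obj:
            best, best_obj = s, obj
    return best
```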
In this section we propose an automatic recommendation method for the bandwidth parameter k using only the must-link constraint pairs. We build upon the intuition that, in the transformed kernel space, the neighborhood of a must-link pair comprises points similar to the constraint points. Given the j-th constraint pair $(j_1, j_2) \in \mathcal{M}$, we compute the pairwise distances in the transformed kernel space between the first constraint point $\hat\Phi\mathbf{e}_{j_1}$ and all other feature points as $d_i = (\mathbf{e}_{j_1} - \mathbf{e}_i)^\top\hat{\mathbf{K}}(\mathbf{e}_{j_1} - \mathbf{e}_i)$, $i = 1, \ldots, n$, $i \ne j_1$. These points are then sorted in increasing order of $d_i$. The bandwidth parameter $k_j$ for the j-th constraint corresponds to the index of $j_2$ in this sorted list. Therefore, the point $\hat\Phi\mathbf{e}_{j_2}$ is the $k_j$-th neighbor of $\hat\Phi\mathbf{e}_{j_1}$. Finally, the value of k is selected as the median over the set $\{k_j, j = 1, \ldots, m\}$.

For a correct choice of k, we expect the performance of mean shift to be insensitive to small changes in the value of k. In Fig. 6, we plot the number of clusters recovered by kernel mean shift as the bandwidth parameter is varied. For learning the kernel, we used 5 points from each class to generate must-link constraints: 50 for the Olympic circles and 100 for USPS digits. An equal number of cannot-link constraints is used. Fig. 6a shows the plot for Olympic circles (5 classes), where the median based value (k = 6) underestimates the bandwidth parameter.

Fig. 6: Selecting the mean shift bandwidth parameter k, given five labeled points per class. (a) Olympic circles. The number of recovered clusters is sensitive at the median based estimate k = 6, but not at k = 15 (see text). (b) USPS digits. The number of recovered clusters is not sensitive at the median based estimate k = 4. The asterisk indicates the median based estimate, while the square marker shows a good estimate of k.

A good choice is k = 15, which lies in the range (10–23) where the clustering output is insensitive to changes in k. As seen in Fig. 6b, in the case of USPS digits (10 classes), the number of clusters recovered by mean shift is insensitive to the median based estimate (k = 4). For all other data sets we used in our experiments, the median based estimates produced good clustering output. Therefore, with the exception of Olympic circles, we used the bandwidth parameter obtained using the median based approach. However, for completeness, the choice of k should be verified, and corrected if necessary, by analyzing the sensitivity of the clustering output to small perturbations in k. For computational efficiency, we run the mean shift algorithm in a lower dimensional subspace spanned by the singular vectors corresponding to at most the 25 largest singular values of $\hat{\mathbf{K}}$. For all the experiments in Section 7, a 25-dimensional representation of the kernel matrix was large enough to represent the data in the kernel space.

6.4 Selecting the Trade-off Parameter

The trade-off parameter γ is used to weight the objective function for the kernel learning. We select γ by performing a two-fold cross-validation over different values of γ, and the clustering performance is evaluated using the scalar measure Adjusted Rand (AR) index [20]. The kernel matrix is learned using half of the data, with 15 points used to generate the pairwise constraints: 525 must-link constraints for the Olympic circles and 1050 for the USPS digits. An equal number of cannot-link constraint pairs is also generated. Each cross-validation step involves learning the kernel matrix with the specified γ and clustering the testing subset using the kernel mean shift algorithm and the transformed kernel function (32). Fig. 7 shows the average AR index plotted against log γ for the two data sets. In both examples an optimum value of γ = 100 was obtained. In general, this value may not be optimum for other applications. However, in all our experiments in Section 7, we use γ = 100, since we obtained similar curves with small variations in the AR index in the vicinity of γ = 100. Fig. 8 shows a step-by-step summary of the semi-supervised kernel mean shift (SKMS) algorithm.

Fig. 8: Semi-supervised kernel mean shift algorithm (SKMS).
Input: X — n × d data matrix; M, C — similarity and dissimilarity constraint sets; S — set of possible σ values; γ — the trade-off parameter.
Procedure:
- Initialize slack variables ξ̂_j = ξ_j = d_m for (j_1, j_2) ∈ M and ξ̂_j = ξ_j = d_c for (j_1, j_2) ∈ C
- Select σ̂ ∈ S by minimizing (35) and compute the initial Gaussian kernel matrix K_σ̂
- Compute the n × n low-rank kernel matrix K such that rank K = r ≤ n and ‖K‖_F ≥ 0.99 ‖K_σ̂‖_F
- Learn K̂ using log det divergences (see supplementary material)
- Using K̂ and M, estimate the bandwidth parameter k
- Compute the bandwidths h_i as the k-th smallest distance from the point using K̂ and (11); d_φ = rank K̂
- For all data points i = 1, ..., n: initialize ᾱ_i = α_{y_i} = e_i and update ᾱ_i using K̂ in (12) until convergence to a local mode
- Group data points x_i and x_j together if ᾱ_i^⊤ K̂ ᾱ_i + ᾱ_j^⊤ K̂ ᾱ_j − 2 ᾱ_i^⊤ K̂ ᾱ_j = 0
- Return cluster labels

Fig. 7: Selecting the trade-off parameter γ. The AR index vs. log γ for 50 cross-validation runs; the asterisks mark the selected value of γ = 100.
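Returning to the bandwidth selection of Section 6.3, a minimal sketch of the median-rank estimate of k from must-link pairs, assuming the learned kernel matrix and the constraint list are available.

```python
import numpy as np

def estimate_k(K_hat, must_links):
    """Median-rank bandwidth estimate of Section 6.3.

    For each must-link pair (j1, j2), k_j is the rank of j2 among all
    other points sorted by squared distance (11) from j1."""
    n = K_hat.shape[0]
    diag = np.diag(K_hat)
    ks = []
    for j1, j2 in must_links:
        d = diag[j1] + diag - 2.0 * K_hat[j1]      # distances from j1 to all points
        idx = np.delete(np.arange(n), j1)          # exclude j1 itself
        order = np.argsort(np.delete(d, j1))
        ks.append(int(np.where(idx[order] == j2)[0][0]) + 1)
    return int(np.median(ks))
```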
7 EXPERIMENTS

We show the performance of the semi-supervised kernel mean shift (SKMS) algorithm on two synthetic examples and four real-world examples. We also compare our method with two state-of-the-art methods: the semi-supervised kernel k-means (SSKK) [24] and the constrained spectral clustering (E2CP) [29].

Fig. 9: Olympic circles. (a) Input data. (b) AR index as the number of pairwise constraints is varied. (c) AR index as the fraction of mislabeled constraints is varied.

In the past, the superior performance of E2CP over other recent methods has been demonstrated. Please see [29] for more details. In addition to this, we also compare SKMS with the kernelized k-means (Kkm) and kernelized spectral clustering (KSC) algorithms. The learned kernel matrix is used to compute distances in SKMS and Kkm, while KSC uses it as the affinity matrix. By providing the same learned kernel matrix to the three algorithms, we compare the clustering performance of mean shift with that of k-means and spectral clustering. Note that, unlike Kkm and KSC, the methods SSKK and E2CP do not use the learned kernel matrix, and supervision is supplied by alternative means. With the exception of SKMS, all the methods require the user to provide the correct number of clusters.

Comparison metric. The clustering performance of the different algorithms is compared using the Adjusted Rand (AR) index [20]. It is an adaptation of the Rand index that penalizes random cluster assignments, while measuring the agreement between the clustering output and the ground truth labels. The AR index is a scalar and takes values between zero and one, with perfect clustering yielding a value of one.

Experimental setup. To generate pairwise constraints, we randomly select b labeled points from each class. All possible must-link constraints are generated for each class, such that $m = \binom{b}{2}$ per class, and a subset of all cannot-link constraint pairs is selected at random such that $c = m = n_c/2$. For each experimental setting, we average the clustering results over 50 independent runs, each with randomly selected pairwise constraints. The initial kernel matrix is computed using a Gaussian kernel (33) for all the methods. We hand-picked the scale parameter σ for SSKK and E2CP from a wide range of values such that their final clustering performance on each data set was maximized. For Kkm, KSC and SKMS, the values of σ and the target distances $d_m$ and $d_c$ are estimated as described in Section 6.1. Finally, the mean shift bandwidth parameter k is estimated as described in Section 6.3. For each experiment, we specify the scale parameter σ used for SSKK and E2CP. For Kkm, KSC and SKMS, σ is chosen using (35) and the most frequent value is reported. We also list the range of k, the bandwidth parameter for SKMS, for each application. For the log det divergence based kernel learning, we cap the maximum number of iterations and set the trade-off parameter γ to 100.

7.1 Synthetic Examples

Olympic Circles. As shown in Fig. 9a, the data consists of noisy points along five intersecting circles, each comprising 300 points. For the SSKK and E2CP algorithms, the value of the initial kernel parameter σ was 0.5. For Kkm, KSC and SKMS the most frequently selected σ was 0.75. We performed two sets of experiments. In the first experiment, the performance of all the algorithms is compared by varying the total number of pairwise constraints. The number of labeled points per class varies as {5, 7, 10, 12, 15, 17, 20, 25} and is used to generate must-link and cannot-link constraints that vary between 100 and 3000. Fig. 9b demonstrates that SKMS performs better than E2CP and KSC while its performance is similar to SSKK and Kkm. For 100 constraints, SKMS recovered 5–8 clusters, making a mistake 22% of the time. For all other settings together, it recovered an incorrect number of clusters (4–6) only 6.3% of the time.
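A small sketch of this constraint-generation protocol (b labeled points per class, all intra-class must-link pairs, an equal number of random cannot-link pairs); the use of scikit-learn's adjusted_rand_score for the AR index is an assumption about tooling, not part of the paper.

```python
import itertools
import numpy as np
from sklearn.metrics import adjusted_rand_score

def make_constraints(labels, b, rng):
    """Sample b labeled points per class; return must-link and cannot-link pairs."""
    must, chosen = [], {}
    for c in np.unique(labels):
        idx = rng.choice(np.where(labels == c)[0], size=b, replace=False)
        chosen[c] = idx
        must += list(itertools.combinations(idx, 2))   # all C(b, 2) intra-class pairs
    classes = list(chosen)
    cannot = set()
    while len(cannot) < len(must):                     # c = m cannot-link pairs
        c1, c2 = rng.choice(classes, 2, replace=False)
        cannot.add((rng.choice(chosen[c1]), rng.choice(chosen[c2])))
    return must, list(cannot)

# Evaluation of a clustering result against ground truth:
# ari = adjusted_rand_score(ground_truth_labels, predicted_labels)
```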
In the second experiment, we introduced labeling errors by randomly swapping the labels of a fraction of the pairwise constraints. We use 20 labeled sample points per class to generate 1900 pairwise constraints and vary the fraction of mislabeled constraints between 0 and 0.6 in steps of 0.1. Fig. 9c shows the clustering performance of all the methods. The performance of SKMS degrades only slightly even when half the constraint points are labeled wrongly, but it starts deteriorating when 60% of the constraint pairs are mislabeled.

Concentric Circles. The data consists of ten concentric circles, each comprising 100 noisy points (Fig. 10a). For Kkm, KSC and SKMS the most frequently selected σ was 1. Both SSKK and E2CP used σ = 0.2. In the first experiment, we vary the number of labeled points per class between 5 and 25, as in the previous example, to generate between 200 and 6000 pairwise constraints. For 200 constraints, SKMS incorrectly recovered nine clusters 50% of the time, while for all the other settings it correctly detected 10 clusters every time. In the second experiment, we use 25 labeled points per class to generate 6000 pairwise constraints.

Fig. 10: Ten concentric circles. (a) Input data. (b) AR index as the number of pairwise constraints is varied. (c) AR index as the fraction of mislabeled constraints is varied.

A fraction of randomly selected constraints was then mislabeled, varied between 0 and 0.6 in steps of 0.1; the performance of all the algorithms in the two experiments is shown in Fig. 10b–c.

7.2 Real-World Applications

In this section, we demonstrate the performance of our method on two real applications having a small number of classes, USPS digits (10 classes) and MIT scene (8 classes), and two real applications with a large number of classes, PIE faces (68 classes) and a Caltech-101 subset (50 classes). We also show the performance of SKMS while clustering out of sample points using (32) on the USPS digits and the MIT scene data sets. Since the sample size per class is much smaller for PIE faces and the Caltech-101 subset, results for generalization are not shown. We observe that the superiority of SKMS over other competing algorithms is clearer when the number of clusters is large.

USPS Digits. The USPS digits data set is a collection of grayscale images of natural handwritten digits and is available from roweis/data.html. Each class contains 1100 images of one of the ten digits. Fig. 11 shows sample images from each class. Each image is represented with a 256-dimensional vector where the columns of the image are concatenated. We vary the number of labeled points per class between 5 and 25, as in the previous example, to generate from 200 to 6000 pairwise constraints. The maximum number of labeled points per class used comprises only 2.27% of the total data.

Fig. 11: Sample images from the USPS digits data set.

Since the whole data set has 11,000 points, we select 100 points from each class at random to generate a 1000 sample subset. The labeled points are selected at random from this subset for learning the kernel matrix. The value of σ used was 5 for SSKK and 2 for E2CP. For Kkm, KSC and SKMS the most frequently selected σ was 7. In the first experiment, we compare the performance of all the algorithms on this subset of 1000 points. Fig. 12a shows the clustering performance of all the methods. Once the number of constraints was increased beyond 500, SKMS outperformed the other algorithms. For 200 constraints, SKMS discovered 9–11 clusters, making a mistake 18% of the time, while it recovered exactly 10 clusters in all the other cases. In the second experiment, we evaluated the performance of SKMS for the entire data set using 25 labeled points per class. Compared to Fig. 12a, the AR index averaged over 50 runs showed only a marginal decrease in clustering accuracy. The pairwise distance matrix (PDM) after performing mean shift clustering is shown in Fig. 12b. For illustration purposes, the data is ordered such that the images belonging to the same class appear together. The block diagonal structure indicates good generalization of SKMS for out of sample points, with little confusion between classes. Neither SSKK nor E2CP could be generalized to out of sample points because these methods need to learn a new kernel or affinity matrix.

MIT Scene. The data set is available from MIT /spatialenvelope/ and contains 2688 labeled images. Each image belongs to one of the eight outdoor scene categories, four natural and four man-made. Fig. 13 shows one example image from each of the eight categories.
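To inspect results the way Fig. 12b does, the pairwise distance matrix between the converged modes can be computed with (11) and reordered by class; a minimal sketch, assuming the converged weighting vectors from the kernel mean shift sketch given earlier.

```python
import numpy as np

def mode_distance_matrix(alphas, K, labels):
    """Pairwise distances (11) between converged modes, ordered by class label."""
    order = np.argsort(labels)              # group same-class points together
    A = alphas[order]
    G = A @ K @ A.T                         # inner products between modes
    d = np.diag(G)
    return np.sqrt(np.maximum(d[:, None] + d[None] - 2 * G, 0.0))

# A block-diagonal pattern of small distances indicates well-separated clusters.
```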
Using the code provided with the data, we extracted the GIST descriptors [32], which were then used to test all the algorithms. We vary the number of labeled points per class as in the previous example, but only between 5 and 20, to generate from 160 to 3040 pairwise constraints. The maximum number of labeled points used comprises only 7.44% of the total data. We select 100 points at random from each of the eight classes to generate the initial kernel matrix. The pairwise constraints are obtained from the randomly chosen labeled points from each class. The value of σ used was 1 for SSKK and 0.5 for E2CP. For Kkm, KSC and SKMS the most frequently selected σ was 1.75. Fig. 14a shows the clustering performance of all the algorithms as the number of constraint points is varied.

Fig. 12: USPS digits. (a) AR index as the number of pairwise constraints is varied. (b) The pairwise distance matrix (PDM) after performing mean shift clustering.

Fig. 14: MIT Scene data set. (a) AR index on the kernel matrix as the number of pairwise constraints is varied. (b) AR index on the entire data as the number of pairwise constraints is varied.

For 160 constraint pairs, SKMS incorrectly discovered 7–10 clusters about 22% of the time, while it correctly recovered eight clusters in all other settings. In the second experiment, the whole data set was clustered using the learned kernel matrix and generalizing to the out of sample points (32). Both SSKK and E2CP were used to learn the full kernel and affinity matrix respectively. Fig. 14b shows the performance of all the algorithms as the number of pairwise constraints was varied. Note that in [29], for similar experiments, the superior performance of E2CP was probably due to the use of a spatial Markov kernel instead of the Gaussian kernel.

PIE Faces. From the CMU PIE face data set [35], we use only the frontal pose and neutral expression of all 68 subjects under 21 different lighting conditions. We coarsely aligned the images with respect to eye and mouth locations and resized them. In Fig. 15, we show eight illumination conditions for three different subjects. Due to significant illumination variation, intra-class variability is very large and some of the samples from different subjects appear to be much closer to each other than samples within the same class. We convert the images from color to grayscale and normalize the intensities between zero and one. Each image is then represented with a vector formed by concatenating the columns of the image. We vary the number of labeled points per class as {3, 4, 5, 6, 7} to generate between 408 and 2856 pairwise constraints. The maximum number of labeled points comprises 30% of the total data. We generate the initial kernel matrix using all the data. The value of σ used was 10 for SSKK and 22 for E2CP. For Kkm, KSC and SKMS the most frequently selected σ was 25 and the range of the bandwidth parameter k was 4–7. Fig. 16 shows the performance comparison, and it can be observed that SKMS outperforms all the other algorithms. Note that SKMS approaches near perfect clustering for more than 5 labeled points per class.

Fig. 13: Sample images from each of the eight categories of the MIT scene data set.

Fig. 15: PIE faces data set. Sample images showing eight different illuminations for three subjects.

Fig. 16: PIE Faces data set. AR index as the number of pairwise constraints is varied.

When three labeled points per class were used, SKMS made a mistake in the number of discovered clusters about 84% of the time. For all other settings, SKMS correctly recovered 68 clusters about 62% of the time, while it recovered 67 clusters about 32% of the time and a different number of clusters in the remaining runs. The Kkm and KSC methods perform poorly in spite of explicitly using the number of clusters and the same learned kernel matrix as SKMS.

Caltech-101 Objects. The Caltech-101 data set [16] is a collection of variable sized images across 101 object categories. This is a particularly hard data set with large intra-class variability. Fig. 17 shows sample images from eight different categories. We randomly sampled a subset of 50 categories, as listed in Table 1, with each class containing 31 to 40 samples. For each sample image, we extract GIST descriptors [32] and use them for the evaluation of all the competing clustering algorithms. We vary the number of labeled points per class as {5, 7, 10, 12, 15} to generate increasing numbers of pairwise constraints, starting from 500. The maximum number of labeled points comprises 38% of the total data. We use a larger number of constraints in order to overcome the large variability in the data set. We generate the initial kernel matrix using all the data. The value of σ used is 0.2 for E2CP and 0.3 for SSKK. For Kkm, KSC and SKMS the most frequently selected σ was 0.5.

Fig. 18 shows the comparison of SKMS with the other competing algorithms. It can be seen that SKMS outperforms all the other methods. For five labeled points per class, SKMS made mistakes in the number of detected clusters 75% of the time. For all the other settings together, SKMS recovered an incorrect number (48–51) of clusters only 9% of the time. A tabular summary of all comparisons across different data sets is provided in the supplementary material.

Fig. 17: Sample images from eight of the 50 classes used from the Caltech-101 data set.

Fig. 18: Caltech-101 data set. AR index as the number of pairwise constraints is varied.

8 DISCUSSION

We presented the semi-supervised kernel mean shift (SKMS) clustering algorithm, where the inherent structure of the data points is learned using a few user supplied pairwise constraints. The data is nonlinearly mapped to a higher-dimensional kernel space where the constraints are effectively imposed by applying a linear transformation. This transformation is learned by minimizing a log det Bregman divergence between the initial and the learned kernels. The method also estimates the parameters for the mean shift algorithm and recovers an unknown number of clusters automatically. We evaluate the performance of SKMS on challenging real and synthetic data sets and compare it with state-of-the-art methods.

We compared the SKMS algorithm with the kernelized k-means (Kkm) and kernelized spectral clustering (KSC) algorithms, which used the same learned kernel matrix. The linear transformation applied to the initial kernel space imposes soft distance (inequality) constraints, and may result in clusters that have very different densities in the transformed space. This explains the relatively poor performance of KSC in most experiments, because spectral clustering methods perform poorly when the clusters have significantly different densities [30]. The k-means method is less affected by cluster density, but it is sensitive to initialization, the shape of the clusters and outliers, i.e., sparse points far from the cluster center.
Unlike the other methods, mean shift clustering does not need the number of clusters as input and can identify clusters of different shapes, sizes and densities. Since locality is imposed by the bandwidth parameter, mean shift is more robust to outliers. As shown in our experiments, this advantage gets pronounced when the number of clusters in the data is large. Clustering, in general, becomes very challenging when the data has large intra-class variability. For example, Fig. 19 shows three images from the highway category of the MIT scene data set that were misclassified. The misclassification error rate on the Caltech-101 data set was even higher.

TABLE 1: Object classes used from the Caltech-101 data set: elephant, flamingo head, emu, faces, gerenuk, stegosaurus, accordion, ferry, cougar face, mayfly, chair, scissors, menorah, platypus, butterfly, tick, metronome, inline skate, bass, pyramid, leopards, sea horse, cougar body, stop sign, lotus, dalmatian, gramophone, camera, trilobite, dragonfly, grand piano, headphone, sunflower, ketch, wild cat, crayfish, nautilus, buddha, yin yang, dolphin, minaret, anchor, car side, rooster, wheelchair, octopus, joshua tree, ant, umbrella, crocodile.

Fig. 19: Three images from the class highway of the MIT Scene data set that were misclassified.

On large scale data sets with over 10,000 categories [14], all methods typically perform poorly, as an image may qualify for multiple categories. For example, the images in Fig. 19 were classified in the street category, which is semantically correct. To deal with a large number of categories, clustering (classification) algorithms should incorporate the ability to use higher level semantic features that connect an image to its possible categories.

The code for SKMS is written in MATLAB and C and is available for download at /code/skms/index.html

REFERENCES

[1] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6.
[2] S. Basu, M. Bilenko, and R. Mooney. A probabilistic framework for semi-supervised clustering. In Proc. 10th Intl. Conf. on Knowledge Discovery and Data Mining.
[3] S. Basu, I. Davidson, and K. Wagstaff. Constrained Clustering: Advances in Algorithms, Theory, and Applications. Chapman & Hall/CRC Press.
[4] M. Bilenko, S. Basu, and R. J. Mooney. Integrating constraints and metric learning in semi-supervised clustering. In Intl. Conf. Machine Learning, pages 81-88.
[5] L. M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7.
[6] O. Chapelle, B. Schölkopf, and A. Zien. Semi-Supervised Learning. MIT Press.
[7] H. Chen and P. Meer. Robust fusion of uncertain information. IEEE Sys. Man Cyber. Part B, 35.
[8] Y. Cheng. Mean shift, mode seeking, and clustering. IEEE PAMI, 17.
[9] R. Collins. Mean shift blob tracking through scale space. In CVPR, volume 2.
[10] D. Comaniciu. Variable bandwidth density-based fusion. In CVPR, volume 1, pages 56-66.
[11] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE PAMI, 24.
[12] D. Comaniciu, V. Ramesh, and P. Meer. Real-time tracking of non-rigid objects using mean shift. In CVPR, volume 1.
[13] N. Cristianini and J. Shawe-Taylor. Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge U. Press.
[14] J. Deng, A. C. Berg, K. Li, and L. Fei-Fei. What does classifying more than 10,000 image categories tell us? In ECCV, pages 71-84.
[15] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Second Intl. Conf. on Knowledge Discovery and Data Mining.
[16] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In IEEE CVPR Workshop on Generative Model Based Vision (WGMBV).
[17] K. Fukunaga and L. D. Hostetler. The estimation of the gradient of a density function with applications in pattern recognition. IEEE Information Theory, 21:32-40.
[18] R. Ghani. Combining labeled and unlabeled data for multiclass text categorization. In Proc. 19th Intl. Conf. on Machine Learning.
[19] G. Hager, M. Dewan, and C. Stewart. Multiple kernel tracking with SSD. In CVPR, volume 1.
[20] L. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 2.
[21] A. K. Jain. Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31.
[22] P. Jain, B. Kulis, J. V. Davis, and I. S. Dhillon. Metric and kernel learning using a linear transformation. Journal of Machine Learning Research, 13.
[23] S. D. Kamvar, D. Klein, and C. D. Manning. Spectral learning. In Intl. Conf. on Artificial Intelligence.
[24] B. Kulis, S. Basu, I. S. Dhillon, and R. J. Mooney. Semi-supervised graph clustering: A kernel approach. Machine Learning, 74:1-22.
[25] B. Kulis, M. A. Sustik, and I. S. Dhillon. Low-rank kernel learning with Bregman matrix divergences. Journal of Machine Learning Research, 10.
[26] L. Lelis and J. Sander. Semi-supervised density-based clustering. In IEEE Intl. Conf. on Data Mining.
[27] M. Liu, B. Vemuri, S.-I. Amari, and F. Nielsen. Shape retrieval using hierarchical total Bregman soft clustering. IEEE PAMI, 34.
[28] Z. Lu and M. A. Carreira-Perpiñán. Constrained spectral clustering through affinity propagation. In CVPR.
[29] Z. Lu and H. H.-S. Ip. Constrained spectral clustering via exhaustive and efficient constraint propagation. In ECCV, pages 1-14.
[30] B. Nadler and M. Galun. Fundamental limitations of spectral clustering. In NIPS.
[31] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In NIPS.
[32] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV, 42.
[33] C. Ruiz, M. Spiliopoulou, and E. M. Ruiz. Density-based semi-supervised clustering. Data Min. Knowl. Discov., 21.
[34] Y. Sheikh, E. Khan, and T. Kanade. Mode-seeking by medoidshifts. In ICCV, pages 1-8.
[35] T. Sim, S. Baker, and M. Bsat. The CMU Pose, Illumination, and Expression (PIE) database. In Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition.
[36] R. Subbarao and P. Meer. Nonlinear mean shift for clustering over analytic manifolds. In CVPR.
[37] O. Tuzel, F. Porikli, and P. Meer. Kernel methods for weakly supervised mean shift clustering. In ICCV, pages 48-55.
[38] O. Tuzel, R. Subbarao, and P. Meer. Simultaneous multiple 3D motion estimation via mode finding on Lie groups. In ICCV, pages 18-25.
[39] A. Vedaldi and S. Soatto. Quick shift and kernel methods for mode seeking. In ECCV.
[40] K. Wagstaff, C. Cardie, S. Rogers, and S. Schrödl. Constrained k-means clustering with background knowledge. In Intl. Conf. Machine Learning.
[41] J. Wang, B. Thiesson, Y. Xu, and M. Cohen. Image and video segmentation by anisotropic kernel mean shift. In ECCV, volume 2, 2004.

Saket Anand received his B.E. degree in Electronics Engineering from the University of Pune, India. He completed his MS in Electrical and Computer Engineering from Rutgers University, New Jersey. From 2007 to 2009, he worked on handwriting recognition as a Research Engineer at Read-Ink Technologies, Bangalore, India. He is currently pursuing a PhD degree in Electrical and Computer Engineering at Rutgers University, New Jersey. His research interests include computer vision, statistical pattern recognition and machine learning. He is a student member of the IEEE.

Sushil Mittal received his B.Tech degree in Electronics and Communication Engineering from the National Institute of Technology, Warangal, India. From 2004 to 2005, he was in the Department of Electrical Engineering at the Indian Institute of Science as a research associate. He completed his MS and PhD degrees in Electrical and Computer Engineering from Rutgers University, New Jersey, in 2008 and 2011, respectively. After his PhD, he worked in the Department of Statistics at Columbia University as a postdoctoral research scientist. He is currently a co-founder at Scibler Technologies, Bangalore. His research interests include computer vision, machine learning and applied statistics. He is a member of the IEEE.

Peter Meer received the Dipl. Engn. degree from the Bucharest Polytechnic Institute, Romania, in 1971, and the D.Sc. degree from the Technion, Israel Institute of Technology, Haifa, in 1986, both in electrical engineering. From 1971 to 1979 he was with the Computer Research Institute, Cluj, Romania, working on R&D of digital hardware. Between 1986 and 1990 he was Assistant Research Scientist at the Center for Automation Research, University of Maryland at College Park. In 1991 he joined the Department of Electrical and Computer Engineering, Rutgers University, Piscataway, NJ, and is currently a Professor. He has held visiting appointments in Japan, Korea, Sweden, Israel and France, and was on the organizing committees of numerous international workshops and conferences. He was an Associate Editor of the IEEE Transactions on Pattern Analysis and Machine Intelligence between 1998 and 2002, a member of the Editorial Board of Pattern Recognition between 1990 and 2005, and was a Guest Editor of Computer Vision and Image Understanding for a special issue on robustness in computer vision. He is coauthor of an award winning paper in Pattern Recognition in 1989, the best student paper in 1999, the best paper in 2000 and the runner-up paper in 2007 of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). With coauthors Dorin Comaniciu and Visvanathan Ramesh he received at the 2010 CVPR the Longuet-Higgins prize for fundamental contributions in computer vision in the past ten years. His research interest is in the application of modern statistical methods to image understanding problems. He is an IEEE Fellow.

Oncel Tuzel received the BS and MS degrees in computer engineering from the Middle East Technical University, Ankara, Turkey, in 1999 and 2002, and the PhD degree in computer science from Rutgers University, Piscataway, New Jersey. He is a principal member of research staff at Mitsubishi Electric Research Labs (MERL), Cambridge, Massachusetts. His research interests are in computer vision, machine learning, and robotics. He coauthored more than 30 technical publications, and has applied for more than 20 patents. He was on the program committee of several international conferences such as CVPR, ICCV, and ECCV.
He received the best paper runner-up award at the IEEE Computer Vision and Pattern Recognition Conference. He is a member of the IEEE.
