On Information-Maximization Clustering: Tuning Parameter Selection and Analytic Solution


Masashi Sugiyama, Makoto Yamada, Manabu Kimura, Hirotaka Hachiya
Department of Computer Science, Tokyo Institute of Technology, Tokyo, Japan.

Abstract

Information-maximization clustering learns a probabilistic classifier in an unsupervised manner so that mutual information between feature vectors and cluster assignments is maximized. A notable advantage of this approach is that it only involves continuous optimization of model parameters, which is substantially easier to solve than discrete optimization of cluster assignments. However, existing methods still involve non-convex optimization problems, and therefore finding a good local optimal solution is not straightforward in practice. In this paper, we propose an alternative information-maximization clustering method based on a squared-loss variant of mutual information. This novel approach gives a clustering solution analytically in a computationally efficient way via kernel eigenvalue decomposition. Furthermore, we provide a practical model selection procedure that allows us to objectively optimize tuning parameters included in the kernel function. Through experiments, we demonstrate the usefulness of the proposed approach.

1. Introduction

The goal of clustering is to classify data samples into disjoint groups in an unsupervised manner. K-means is a classic but still popular clustering algorithm. However, since k-means only produces linearly separated clusters, its usefulness is rather limited in practice.

Appearing in Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, USA, 2011. Copyright 2011 by the author(s)/owner(s).

To cope with this problem, various non-linear clustering methods have been developed. Kernel k-means (Girolami, 2002) performs k-means in a feature space induced by a reproducing kernel function. Spectral clustering (Shi & Malik, 2000) first unfolds non-linear data manifolds by a spectral embedding method, and then performs k-means in the embedded space. Blurring mean-shift (Fukunaga & Hostetler, 1975) uses a non-parametric kernel density estimator for modeling the data-generating probability density and finds clusters based on the modes of the estimated density. Discriminative clustering (Xu et al., 2005; Bach & Harchaoui, 2008) learns a discriminative classifier for separating clusters, where class labels are also treated as parameters to be optimized. Dependence-maximization clustering (Song et al., 2007; Faivishevsky & Goldberger, 2010) determines cluster assignments so that their dependence on input data is maximized.

These non-linear clustering techniques would be capable of handling highly complex real-world data. However, they suffer from a lack of objective model selection strategies¹. More specifically, the above non-linear clustering methods contain tuning parameters such as the width of Gaussian functions and the number of nearest neighbors in kernel functions or similarity measures, and these tuning parameter values need to be heuristically determined in an unsupervised manner. The problem of learning similarities/kernels was addressed in earlier works, but they considered supervised setups, i.e., labeled samples are assumed to be given. Zelnik-Manor & Perona (2005) provided a useful unsupervised heuristic to determine the similarity in a data-dependent way. However, it still requires the number of nearest neighbors to be determined manually (although the magic number 7 was shown to work well in their experiments).

¹ Model selection in this paper refers to the choice of tuning parameters in kernel functions or similarity measures, not the choice of the number of clusters.

Another line of clustering framework called information-maximization clustering (Agakov & Barber, 2006; Gomes et al., 2010) exhibited the state-of-the-art performance. In this information-maximization approach, probabilistic classifiers such as a kernelized Gaussian classifier (Agakov & Barber, 2006) and a kernel logistic regression classifier (Gomes et al., 2010) are learned so that mutual information (MI) between feature vectors and cluster assignments is maximized in an unsupervised manner. A notable advantage of this approach is that classifier training is formulated as a continuous optimization problem, which is substantially simpler than discrete optimization of cluster assignments. Indeed, classifier training can be carried out in computationally efficient manners by a gradient method (Agakov & Barber, 2006) or a quasi-Newton method (Gomes et al., 2010). Furthermore, Agakov & Barber (2006) provided a model selection strategy based on the common information-maximization principle. Thus, kernel parameters can be systematically optimized in an unsupervised way.

However, in the above MI-based clustering approach, the optimization problems are non-convex, and finding a good local optimal solution is not straightforward in practice. The goal of this paper is to overcome this problem by providing a novel information-maximization clustering method. More specifically, we propose to employ a variant of MI called squared-loss MI (SMI), and develop a new clustering algorithm whose solution can be computed analytically in a computationally efficient way via eigenvalue decomposition. Furthermore, for kernel parameter optimization, we propose to use a non-parametric SMI estimator called least-squares MI (LSMI) (Suzuki et al., 2009), which was proved to achieve the optimal convergence rate with analytic-form solutions.
Through experiments on various real-world datasets such as images, natural languages, accelerometric sensors, and speech, we demonstrate the usefulness of the proposed clustering method.

2. Information-Maximization Clustering with Squared-Loss Mutual Information

In this section, we describe our novel clustering algorithm.

2.1. Formulation of Information-Maximization Clustering

Suppose we are given d-dimensional i.i.d. feature vectors of size n, {x_i | x_i ∈ R^d}_{i=1}^n, which are assumed to be drawn independently from a distribution with density p*(x). The goal of clustering is to give cluster assignments, {y_i | y_i ∈ {1, ..., c}}_{i=1}^n, to the feature vectors {x_i}_{i=1}^n, where c denotes the number of classes. Throughout this paper, we assume that c is known.

In order to solve the clustering problem, we take the information-maximization approach (Agakov & Barber, 2006; Gomes et al., 2010). That is, we regard clustering as an unsupervised classification problem, and learn the class-posterior probability p*(y|x) so that information between feature vector x and class label y is maximized.

The dependence-maximization approach (Song et al., 2007; Faivishevsky & Goldberger, 2010) is related to, but substantially different from, the above information-maximization approach. In the dependence-maximization approach, cluster assignments {y_i}_{i=1}^n are directly determined so that their dependence on the feature vectors {x_i}_{i=1}^n is maximized. Thus, the dependence-maximization approach intrinsically involves combinatorial optimization with respect to {y_i}_{i=1}^n. On the other hand, the information-maximization approach involves continuous optimization with respect to the parameter α included in a class-posterior model p(y|x; α). This continuous optimization of α is substantially easier to solve than discrete optimization of {y_i}_{i=1}^n.
Another advantage of the information-maximization approach is that it naturally allows out-of-sample clustering based on the discriminative model p(y|x; α), i.e., a cluster assignment for a new feature vector can be obtained based on the learned discriminative model.

2.2. Squared-Loss Mutual Information

As an information measure, we adopt squared-loss mutual information (SMI). SMI between feature vector x and class label y is defined by

$$\mathrm{SMI} := \frac{1}{2} \sum_{y=1}^{c} \int p^*(x) p^*(y) \left( \frac{p^*(x, y)}{p^*(x) p^*(y)} - 1 \right)^2 \mathrm{d}x, \quad (1)$$

where p*(x, y) denotes the joint density of x and y, and p*(y) is the marginal probability of y. SMI is the Pearson divergence (Pearson, 1900) from p*(x, y) to p*(x)p*(y), while the ordinary MI (Cover & Thomas, 2006) is the Kullback-Leibler divergence (Kullback & Leibler, 1951) from p*(x, y) to p*(x)p*(y):

$$\mathrm{MI} := \sum_{y=1}^{c} \int p^*(x, y) \log \frac{p^*(x, y)}{p^*(x) p^*(y)} \, \mathrm{d}x. \quad (2)$$

The Pearson divergence and the Kullback-Leibler divergence both belong to the class of Ali-Silvey-Csiszár divergences (also known as f-divergences, see Ali & Silvey, 1966; Csiszár, 1967), and thus they share similar properties. For example, SMI is non-negative and takes zero if and only if x and y are statistically independent, as is the case for the ordinary MI.

In the existing information-maximization clustering methods (Agakov & Barber, 2006; Gomes et al., 2010), MI is used as the information measure. On the other hand, in this paper, we adopt SMI because it allows us to develop a clustering algorithm whose solution can be computed analytically in a computationally efficient way via eigenvalue decomposition, as described below.

2.3. Clustering by SMI Maximization

Here, we give a computationally efficient clustering algorithm based on SMI (1). We can express SMI as

$$\mathrm{SMI} = \frac{1}{2} \sum_{y=1}^{c} \int \frac{p^*(x, y)^2}{p^*(x) p^*(y)} \, \mathrm{d}x - \frac{1}{2} \quad (3)$$

$$= \frac{1}{2} \sum_{y=1}^{c} \int p^*(y|x) p^*(x) \frac{p^*(y|x)}{p^*(y)} \, \mathrm{d}x - \frac{1}{2}. \quad (4)$$

Suppose that the class-prior probability p*(y) is set to be uniform: p*(y) = 1/c. Then Eq.(4) is expressed as

$$\frac{c}{2} \sum_{y=1}^{c} \int p^*(y|x) p^*(x) p^*(y|x) \, \mathrm{d}x - \frac{1}{2}. \quad (5)$$

Let us approximate the class-posterior probability p*(y|x) by the following kernel model:

$$p(y|x; \alpha) := \sum_{i=1}^{n} \alpha_{y,i} K(x, x_i), \quad (6)$$

where K(x, x') denotes a kernel function with a kernel parameter t.
In the experiments, we will use a sparse variant of the local-scaling kernel (Zelnik-Manor & Perona, 2005):

$$K(x_i, x_j) := \begin{cases} \exp\left( -\dfrac{\|x_i - x_j\|^2}{2 \sigma_i \sigma_j} \right) & \text{if } x_i \in \mathcal{N}_t(x_j) \text{ or } x_j \in \mathcal{N}_t(x_i), \\ 0 & \text{otherwise}, \end{cases} \quad (7)$$

where N_t(x) denotes the set of t nearest neighbors for x (t is the kernel parameter), σ_i is a local scaling factor defined as σ_i = ‖x_i − x_i^(t)‖, and x_i^(t) is the t-th nearest neighbor of x_i.

Further approximating the expectation with respect to p*(x) included in Eq.(5) by the empirical average of samples {x_i}_{i=1}^n, we arrive at the following SMI approximator:

$$\widehat{\mathrm{SMI}} := \frac{c}{2n} \sum_{y=1}^{c} \alpha_y^\top K^2 \alpha_y - \frac{1}{2}, \quad (8)$$

where ⊤ denotes the transpose, α_y := (α_{y,1}, ..., α_{y,n})^⊤, and K_{i,j} := K(x_i, x_j).

For each cluster y, we maximize α_y^⊤ K² α_y under ‖α_y‖ = 1.² Since this is the Rayleigh quotient, the maximizer is given by the normalized principal eigenvector of K (Horn & Johnson, 1985). To avoid all the solutions {α_y}_{y=1}^c being reduced to the same principal eigenvector, we impose mutual orthogonality: α_y^⊤ α_{y'} = 0 for y ≠ y'. Then the solutions are given by the normalized eigenvectors φ_1, ..., φ_c associated with the eigenvalues λ_1 ≥ ... ≥ λ_n ≥ 0 of K. Since the sign of φ_y is arbitrary, we set the sign as φ̃_y = φ_y × sign(φ_y^⊤ 1_n), where sign(·) denotes the sign of a scalar and 1_n denotes the n-dimensional vector with all ones.

On the other hand, since p*(y) = ∫ p*(y|x) p*(x) dx ≈ (1/n) Σ_{i=1}^n p(y|x_i; α) = (1/n) α_y^⊤ K 1_n, and the class-prior probability p*(y) was set to be uniform, we have the following normalization condition: (1/n) α_y^⊤ K 1_n = 1/c. Furthermore, probability estimates should be non-negative, which can be achieved by rounding up negative outputs to zero. Taking these issues into account, cluster assignments {y_i}_{i=1}^n for {x_i}_{i=1}^n are determined as

$$y_i = \mathop{\mathrm{argmax}}_{y} \; \frac{[\max(0_n, \tilde{\varphi}_y)]_i}{\max(0_n, \tilde{\varphi}_y)^\top 1_n},$$

where the max operation for vectors is applied in the element-wise manner and [·]_i denotes the i-th element of a vector. Note that we used K φ_y = λ_y φ_y in the above derivation. We call the above method SMI-based clustering (SMIC).

² Note that this unit-norm constraint is not essential since the obtained solution is renormalized later.

2.4. Kernel Parameter Choice by SMI Maximization

Since the above clustering approach was developed in the framework of SMI maximization, it would be natural to determine the kernel parameters so that SMI is maximized. A direct approach is to use the above SMI estimator ŜMI also for kernel parameter choice. However, this direct approach is not favorable because ŜMI is an unsupervised SMI estimator (i.e., SMI is estimated only from unlabeled samples {x_i}_{i=1}^n). In the model selection stage, however, we have already obtained labeled samples {(x_i, y_i)}_{i=1}^n, and thus supervised estimation of SMI is possible. For supervised SMI estimation, a non-parametric SMI estimator called least-squares mutual information (LSMI) (Suzuki et al., 2009) was shown to achieve the optimal convergence rate. For this reason, we propose to use LSMI for model selection, instead of ŜMI (8).

LSMI is an estimator of SMI based on paired samples {(x_i, y_i)}_{i=1}^n. The key idea of LSMI is to learn the following density-ratio function,

$$r^*(x, y) := \frac{p^*(x, y)}{p^*(x) p^*(y)}, \quad (9)$$

without going through density estimation of p*(x, y), p*(x), and p*(y). More specifically, let us employ the following density-ratio model:

$$r(x, y; \theta) := \sum_{\ell : y_\ell = y} \theta_\ell L(x, x_\ell), \quad (10)$$

where L(x, x') is a kernel function with kernel parameter γ. In the experiments, we will use the Gaussian kernel:

$$L(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2 \gamma^2} \right). \quad (11)$$

The parameter θ in the above density-ratio model is learned so that the following squared error is minimized:

$$\frac{1}{2} \sum_{y=1}^{c} \int \left( r(x, y; \theta) - r^*(x, y) \right)^2 p^*(x) p^*(y) \, \mathrm{d}x. \quad (12)$$

Among the n cluster assignments {y_i}_{i=1}^n, let n_y be the number of samples in cluster y.
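As a concrete illustration of the SMIC steps just described (sparse local-scaling kernel, principal eigenvectors of K, sign fixing, and element-wise assignment), here is a minimal NumPy sketch. This is our own illustration rather than the authors' MATLAB implementation; the function and variable names are ours, the dense-matrix construction is a simplification, and the kernel parameter t is taken as given (its selection is discussed next).

```python
import numpy as np

def smic(X, c, t):
    """Sketch of SMI-based clustering (SMIC): build the sparse local-scaling
    kernel, take the c principal eigenvectors of K, assign element-wise."""
    n = X.shape[0]
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    order = np.argsort(sq, axis=1)                   # order[i, 0] is i itself
    sigma = np.sqrt(sq[np.arange(n), order[:, t]])   # distance to t-th neighbor
    K = np.exp(-sq / (2.0 * sigma[:, None] * sigma[None, :]))
    # keep K[i, j] only if j is among the t nearest neighbors of i, or vice versa
    nn = np.zeros((n, n), dtype=bool)
    nn[np.repeat(np.arange(n), t), order[:, 1:t + 1].ravel()] = True
    K *= (nn | nn.T)
    vals, vecs = np.linalg.eigh(K)                   # ascending eigenvalues
    phi = vecs[:, ::-1][:, :c]                       # c principal eigenvectors
    phi *= np.where(phi.sum(axis=0) >= 0, 1.0, -1.0) # sign(phi_y^T 1_n)
    pos = np.maximum(phi, 0.0)                       # round negatives up to zero
    scores = pos / np.maximum(pos.sum(axis=0), 1e-12)
    return scores.argmax(axis=1)                     # y_i = argmax_y [...]_i
```

Out-of-sample assignment via the learned model p(y|x; α) and the LSMI-based choice of t are omitted for brevity.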
Let θ_y be the parameter vector corresponding to the kernel bases {L(x, x_ℓ)}_{ℓ: y_ℓ = y}, i.e., θ_y is the n_y-dimensional sub-vector of θ = (θ_1, ..., θ_n)^⊤ consisting of the indices {ℓ | y_ℓ = y}. Then an empirical and regularized version of the optimization problem (12) is given for each y as follows:

$$\min_{\theta_y} \left[ \frac{1}{2} \theta_y^\top \widehat{H}^{(y)} \theta_y - \theta_y^\top \widehat{h}^{(y)} + \frac{\delta}{2} \theta_y^\top \theta_y \right], \quad (13)$$

where δ (≥ 0) is the regularization parameter. Ĥ^(y) is the n_y × n_y matrix and ĥ^(y) is the n_y-dimensional vector defined as

$$\widehat{H}^{(y)}_{\ell,\ell'} := \frac{n_y}{n^2} \sum_{i=1}^{n} L(x_i, x_\ell^{(y)}) L(x_i, x_{\ell'}^{(y)}),$$

$$\widehat{h}^{(y)}_\ell := \frac{1}{n} \sum_{i : y_i = y} L(x_i, x_\ell^{(y)}),$$

where x_ℓ^(y) is the ℓ-th sample in class y (which corresponds to θ_ℓ^(y)). A notable advantage of LSMI is that the solution θ̂^(y) can be computed analytically as

$$\widehat{\theta}^{(y)} = (\widehat{H}^{(y)} + \delta I)^{-1} \widehat{h}^{(y)}.$$

Then a density-ratio estimator is obtained analytically as follows:

$$\widehat{r}(x, y) = \sum_{\ell=1}^{n_y} \widehat{\theta}_\ell^{(y)} L(x, x_\ell^{(y)}).$$

The accuracy of the above least-squares density-ratio estimator depends on the choice of the kernel parameter γ and the regularization parameter δ. They can be systematically optimized based on cross-validation as follows (Suzuki et al., 2009). The samples Z = {(x_i, y_i)}_{i=1}^n are divided into M disjoint subsets {Z_m}_{m=1}^M of approximately the same size. Then a density-ratio estimator r̂_m(x, y) is obtained using Z∖Z_m (i.e., all samples except Z_m), and its out-of-sample error (which corresponds to Eq.(12) without an irrelevant constant) for the hold-out samples Z_m is computed as

$$\mathrm{CV}_m := \frac{1}{2|Z_m|^2} \sum_{x, y \in Z_m} \widehat{r}_m(x, y)^2 - \frac{1}{|Z_m|} \sum_{(x, y) \in Z_m} \widehat{r}_m(x, y).$$
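The class-wise analytic solution and the plug-in SMI estimate of Eq.(14) below can be sketched in NumPy as follows. This is our own illustration; the default values of γ and δ stand in for the cross-validated choices, and the hold-out loop itself is omitted.

```python
import numpy as np

def lsmi(X, y, gamma=1.0, delta=0.1):
    """Sketch of LSMI: fit the density-ratio model class-wise with the
    analytic solution (H + delta I)^{-1} h, then plug the fitted ratio
    into the SMI expression."""
    n = X.shape[0]
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    L = np.exp(-sq / (2.0 * gamma ** 2))      # Gaussian kernel of Eq.(11)
    r_hat = np.empty(n)
    for cls in np.unique(y):
        idx = np.where(y == cls)[0]           # kernel centers of this class
        ny = len(idx)
        Phi = L[:, idx]                       # L(x_i, x_l^{(y)}) for all i
        H = (ny / n ** 2) * Phi.T @ Phi       # hat H^{(y)}
        h = Phi[idx].sum(axis=0) / n          # hat h^{(y)}
        theta = np.linalg.solve(H + delta * np.eye(ny), h)
        r_hat[idx] = Phi[idx] @ theta         # hat r(x_i, y_i) for i in class
    return r_hat.sum() / (2.0 * n) - 0.5      # plug-in SMI estimate
```

On dependent pairs this should return a clearly larger value than on independent (shuffled) pairs, which is exactly the signal used for model selection.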

This procedure is repeated for m = 1, ..., M, and the average of the above hold-out error over all m is computed. Finally, the kernel parameter γ and the regularization parameter δ that minimize the average hold-out error are chosen as the most suitable ones.

Based on the expression of SMI given by Eq.(3), an SMI estimator called LSMI is given as follows:

$$\mathrm{LSMI} := \frac{1}{2n} \sum_{i=1}^{n} \widehat{r}(x_i, y_i) - \frac{1}{2}, \quad (14)$$

where r̂(x, y) is the density-ratio estimator obtained above. Since r̂(x, y) can be computed analytically, LSMI can also be computed analytically.

We use LSMI for model selection of SMIC. More specifically, we compute LSMI as a function of the kernel parameter t of K(x, x') included in the cluster-posterior model (6), and choose the one that maximizes LSMI. A MATLAB implementation of the proposed clustering method is available from sugi/software/smic.

3. Existing Methods

In this section, we qualitatively compare the proposed approach with existing methods.

3.1. Spectral Clustering

The basic idea of spectral clustering (Shi & Malik, 2000) is to first unfold non-linear data manifolds by a spectral embedding method, and then perform k-means in the embedded space. More specifically, given sample-sample similarities W_{i,j} ≥ 0, the minimizer of the following criterion with respect to {ξ_i}_{i=1}^n is obtained under some normalization constraint:

$$\frac{1}{2} \sum_{i,j=1}^{n} W_{i,j} \left\| \frac{\xi_i}{\sqrt{D_{i,i}}} - \frac{\xi_j}{\sqrt{D_{j,j}}} \right\|^2,$$

where D is the diagonal matrix with i-th diagonal element given by D_{i,i} := Σ_{j=1}^n W_{i,j}. Consequently, the embedded samples are given by the principal eigenvectors of D^{-1/2} W D^{-1/2}, followed by normalization. Note that spectral clustering was shown to be equivalent to a weighted variant of kernel k-means with some specific kernel (Dhillon et al., 2004).

The performance of spectral clustering depends heavily on the choice of the sample-sample similarity W_{i,j}.
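For concreteness, the embedding step can be sketched as follows: compute the c principal eigenvectors of D^{-1/2} W D^{-1/2} and row-normalize them. This is a generic illustration with names of our own choosing; the subsequent k-means step on the returned rows is omitted.

```python
import numpy as np

def spectral_embedding(W, c):
    """Embed samples as the c principal eigenvectors of D^{-1/2} W D^{-1/2},
    row-normalized; k-means would then be run on the returned rows."""
    d = W.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    M = inv_sqrt[:, None] * W * inv_sqrt[None, :]    # D^{-1/2} W D^{-1/2}
    vals, vecs = np.linalg.eigh(M)                   # ascending eigenvalues
    U = vecs[:, ::-1][:, :c]                         # c principal eigenvectors
    norms = np.linalg.norm(U, axis=1, keepdims=True)
    return U / np.maximum(norms, 1e-12)              # row normalization
```

For a similarity matrix with two disconnected blocks, the embedded rows are constant within each block and orthogonal across blocks, so k-means trivially recovers the two clusters.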
Zelnik-Manor & Perona (2005) proposed a useful unsupervised heuristic to determine the similarity in a data-dependent manner, called local scaling:

$$W_{i,j} = \exp\left( -\frac{\|x_i - x_j\|^2}{\sigma_i \sigma_j} \right),$$

where σ_i is a local scaling factor defined as σ_i = ‖x_i − x_i^(t)‖, and x_i^(t) is the t-th nearest neighbor of x_i. t is the tuning parameter in the local scaling similarity, and t = 7 was shown to be useful (Zelnik-Manor & Perona, 2005; Sugiyama, 2007). However, this magic number 7 does not always seem to work well in general.

If D^{-1/2} W D^{-1/2} is regarded as a kernel matrix, spectral clustering will be similar to the proposed SMIC method described in Section 2.3. However, SMIC does not require the post k-means processing since the principal components have a clear interpretation as parameter estimates of the class-posterior model (6). Furthermore, our proposed approach provides a systematic model selection strategy, which is a notable advantage over spectral clustering.

3.2. Blurring Mean-Shift Clustering

Blurring mean-shift (Fukunaga & Hostetler, 1975) is a non-parametric clustering method based on the modes of the data-generating probability density. In the blurring mean-shift algorithm, a kernel density estimator (Silverman, 1986) is used for modeling the data-generating probability density:

$$\widehat{p}(x) = \frac{1}{n} \sum_{i=1}^{n} K\left( \|x - x_i\|^2 / \sigma^2 \right),$$

where K(ξ) is a kernel function such as the Gaussian kernel K(ξ) = e^{−ξ/2}. Taking the derivative of p̂(x) with respect to x and equating the derivative at x = x_i to zero, we obtain the following update formula for sample x_i (i = 1, ..., n):

$$x_i \longleftarrow \frac{\sum_{j=1}^{n} W_{i,j}\, x_j}{\sum_{j'=1}^{n} W_{i,j'}},$$

where W_{i,j} := K'(‖x_i − x_j‖²/σ²) and K'(ξ) is the derivative of K(ξ). Each mode of the density is regarded as a representative of a cluster, and each data point is assigned to the cluster to which it converges. Carreira-Perpiñán (2007) showed that the blurring mean-shift algorithm can be interpreted as an EM algorithm (Dempster et al., 1977), where W_{i,j}/(Σ_{j'=1}^n W_{i,j'}) is regarded as the posterior probability of the i-th sample belonging to the j-th cluster.
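A minimal sketch of this iteration with the Gaussian choice K(ξ) = e^{−ξ/2}, for which the weights reduce to Gaussian weights up to a constant that cancels in the ratio. The names and the fixed iteration count are our assumptions; as discussed below, a proper stopping criterion would be needed in practice.

```python
import numpy as np

def blurring_mean_shift(X, sigma, n_iter=30):
    """Blurring mean-shift sketch: repeatedly replace every sample by the
    weighted average of all current samples, with Gaussian weights
    W_ij proportional to exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    Z = X.astype(float).copy()
    for _ in range(n_iter):
        sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
        W = np.exp(-sq / (2.0 * sigma ** 2))
        Z = (W @ Z) / W.sum(axis=1, keepdims=True)  # x_i <- sum_j W_ij x_j / sum_j W_ij
    return Z
```

With a small bandwidth, well-separated groups of points quickly collapse onto their respective modes, while the fixed n_iter stands in for the stopping rule that would prevent the eventual merge into a single point.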
Furthermore, the above update rule can be expressed in matrix form as X ← XP, where X = (x_1, ..., x_n) is the sample matrix and P := W D^{-1} is a stochastic matrix of the random walk in a graph with adjacency W (Chung, 1997). D is the diagonal matrix defined as D_{i,i} := Σ_{j=1}^n W_{i,j} and D_{i,j} = 0 for i ≠ j. If P were independent of X, the above iterative algorithm would correspond to the power method (Golub & Van Loan, 1996) for finding the leading left eigenvector of P. The algorithm is thus highly related to spectral clustering, which computes the principal eigenvectors of D^{-1/2} W D^{-1/2} (see Section 3.1). Although P depends on X in reality, Carreira-Perpiñán (2006) insisted that this analysis is still valid since P and X quickly reach a quasi-stable state.

An attractive property of blurring mean-shift is that the number of clusters is automatically determined as the number of modes in the probability density estimate. However, this choice depends on the kernel parameter σ, and there is no systematic way to determine σ, which is restrictive compared with the proposed method. Another critical drawback of the blurring mean-shift algorithm is that it eventually converges to a single point (Cheng, 1995), and therefore a sensible stopping criterion is necessary in practice. Although Carreira-Perpiñán (2006) gave a useful heuristic for stopping the iteration, it is not clear whether this heuristic always works well in practice.

4. Experiments

In this section, we experimentally evaluate the performance of the proposed and existing clustering methods.

4.1. Illustration

First, we illustrate the behavior of the proposed method using the artificial datasets described in the top row of Figure 1. The dimensionality is d = 2 and the sample size is n = 200. As a kernel function, we used the sparse local-scaling kernel (7) for SMIC, where the kernel parameter t was chosen from {1, ..., 10} based on LSMI with the Gaussian kernel (11). The top graphs in Figure 1 depict the cluster assignments obtained by SMIC, and the bottom graphs in Figure 1 depict the model selection curves obtained by LSMI. The results show that SMIC combined with LSMI works well for these toy datasets.

[Figure 1: Illustrative examples. Cluster assignments obtained by SMIC (top) and model selection curves, plotting the SMI estimate against the kernel parameter t, obtained by LSMI (bottom).]

4.2. Performance Comparison

Next, we systematically compare the performance of the proposed and existing clustering methods using various real-world datasets such as images, natural languages, accelerometric sensors, and speech. We compared the performance of the following methods, which all contain no open tuning parameters, so that the experimental results are fair and objective: k-means (KM), spectral clustering with the self-tuning local-scaling similarity (SC) (Zelnik-Manor & Perona, 2005), mean nearest-neighbor clustering (MNN) (Faivishevsky & Goldberger, 2010), MI-based clustering for kernel logistic models (MIC) (Gomes et al., 2010) with model selection by maximum-likelihood MI (Suzuki et al., 2008), and the proposed SMIC.

The clustering performance was evaluated by the adjusted Rand index (ARI) (Hubert & Arabie, 1985) between the inferred cluster assignments and the ground-truth categories. Larger ARI values mean better performance, and ARI takes its maximum value 1 when two sets of cluster assignments are identical. In addition, we also evaluated the computational efficiency of each method by the CPU computation time.

We used various real-world datasets including images, natural languages, accelerometric sensors, and speech: the USPS hand-written digit dataset ("digit"), the Olivetti Face dataset ("face"), the 20-Newsgroups dataset ("document"), the SENSEVAL-2 dataset ("word"), the ALKAN dataset ("accelerometry"), and an in-house speech dataset ("speech"). Detailed explanation of the datasets is omitted due to lack of space. For each dataset, the experiment was repeated 100 times with random choice of samples from a pool. Samples were centralized and their variance was normalized in the dimension-wise manner before feeding them to the clustering algorithms.
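Since ARI is central to the evaluation, here is a small self-contained implementation of the Hubert & Arabie (1985) adjusted index computed from the contingency table of the two labelings. This is our own sketch, not the evaluation code used in the experiments.

```python
import numpy as np

def adjusted_rand_index(a, b):
    """ARI = (Index - Expected) / (Max - Expected), from the contingency
    table n_ij of labelings a and b."""
    _, ai = np.unique(a, return_inverse=True)
    _, bi = np.unique(b, return_inverse=True)
    C = np.zeros((ai.max() + 1, bi.max() + 1))
    np.add.at(C, (ai, bi), 1)                 # contingency counts n_ij
    comb2 = lambda m: m * (m - 1) / 2.0       # "m choose 2", element-wise
    index = comb2(C).sum()
    row, col = comb2(C.sum(axis=1)).sum(), comb2(C.sum(axis=0)).sum()
    expected = row * col / comb2(len(ai))
    max_index = 0.5 * (row + col)
    return (index - expected) / (max_index - expected)
```

Identical partitions give 1 regardless of how the labels are named, and unrelated partitions give values around 0, which is why ARI is preferred over raw pair-counting agreement here.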
The experimental results are described in Table 1. For the digit dataset, MIC and SMIC outperform KM, SC, and MNN in terms of ARI. The entire computation time of SMIC including model selection is shorter than that of KM, SC, and MIC, and is comparable to that of MNN, which does not include a model selection procedure. For the face dataset, SC, MIC, and SMIC are comparable to each other and are better than KM and MNN in terms of ARI. For the document and word datasets, SMIC tends to outperform the other methods. For the accelerometry dataset, MNN and SMIC work better than the other methods. Finally, for the speech dataset, MIC and SMIC work comparably well, and are significantly better than KM, SC, and MNN.

Table 1. Experimental results on real-world datasets (with equal cluster sizes). The average clustering accuracy (with its standard deviation in brackets) in terms of ARI and the average CPU computation time in seconds over 100 runs are described. The best method in terms of the average ARI and methods judged to be comparable to the best one by the t-test at the significance level 1% are described in boldface. Computation time of MIC and SMIC corresponds to the time for computing a clustering solution after model selection has been carried out; for reference, computation time for the entire procedure including model selection is described in square brackets. Methods are, from left to right, KM, SC, MNN, MIC, and SMIC.

Digit (d = 256, n = 5000, c = 10):
  ARI   0.4(0.01)   0.4(0.0)   0.44(0.03)   0.63(0.08)   0.63(0.05)
  Time  [3631.7]    14.4[359.5]
Face (d = 4096, n = 100, c = 10):
  ARI   0.60(0.11)  0.6(0.11)  0.47(0.10)   0.64(0.1)    0.65(0.11)
  Time  [30.8]      0.0[19.3]
Document (d = 50, n = 700, c = 7):
  ARI   0.00(0.00)  0.09(0.0)  0.09(0.0)    0.01(0.0)    0.19(0.03)
  Time  [530.5]     0.3[115.3]
Word (d = 50, n = 300, c = 3):
  ARI   0.04(0.05)  0.0(0.01)  0.0(0.0)     0.04(0.04)   0.08(0.05)
  Time  [369.6]     0.[03.9]
Accelerometry (d = 5, n = 300, c = 3):
  ARI   0.49(0.04)  0.58(0.14) 0.71(0.05)   0.57(0.3)    0.68(0.1)
  Time  [410.6]     0.[9.6]
Speech (d = 50, n = 400, c = 2):
  ARI   0.00(0.00)  0.00(0.00) 0.04(0.15)   0.18(0.16)   0.1(0.5)
  Time  [413.4]     0.3[179.7]

Overall, MIC was shown to work reasonably well, implying that model selection by maximum-likelihood MI is practically useful. SMIC was shown to work even better than MIC, with much less computation time.
The accuracy improvement of SMIC over MIC was gained by computing the SMIC solution in a closed form without any heuristic initialization. The computational efficiency of SMIC was brought by the analytic computation of the optimal solution and the class-wise optimization of LSMI (see Section 2.4).

The performance of MNN and SC was rather unstable because of the heuristic averaging of the number of nearest neighbors and the heuristic choice of local scaling. In terms of computation time, they are relatively efficient for small- to medium-sized datasets, but they are expensive for the largest dataset, digit. KM was not reliable for the document and speech datasets because of the restriction that the cluster boundaries are linear. For the digit, face, and document datasets, KM was computationally very expensive since a large number of iterations were needed until convergence to a local optimum solution.

Table 2. Experimental results on real-world datasets under the imbalanced setup. ARI values are described in the table. Class imbalance was realized by setting the sample size of the first class m times larger than that of the other classes. The results for m = 1 are the same as the ones reported in Table 1. Methods are, from left to right, KM, SC, MNN, MIC, and SMIC.

Digit (d = 256, n = 5000, c = 10):
  m = 1  0.4(0.01)   0.4(0.0)    0.44(0.03)  0.63(0.08)   0.63(0.05)
  m = 2  0.5(0.01)   0.1(0.0)    0.43(0.04)  0.60(0.05)   0.63(0.05)
Document (d = 50, n = 700, c = 7):
  m = 1  0.00(0.00)  0.09(0.0)   0.09(0.0)   0.01(0.0)    0.19(0.03)
  m = 2  0.01(0.01)  0.10(0.03)  0.10(0.0)   0.01(0.0)    0.19(0.04)
  m = 3  (0.01)      0.10(0.03)  0.09(0.0)   -0.01(0.03)  0.16(0.05)
  m = 4  0.0(0.01)   0.09(0.03)  0.08(0.0)   -0.00(0.04)  0.14(0.05)
Word (d = 50, n = 300, c = 3):
  m = 1  0.04(0.05)  0.0(0.01)   0.0(0.0)    0.04(0.04)   0.08(0.05)
  m = 2  0.00(0.07)  -0.01(0.01) 0.01(0.0)   -0.0(0.05)   0.03(0.05)
Accelerometry (d = 5, n = 300, c = 3):
  m = 1  0.49(0.04)  0.58(0.14)  0.71(0.05)  0.57(0.3)    0.68(0.1)
  m = 2  0.48(0.05)  0.54(0.14)  0.58(0.11)  0.49(0.19)   0.69(0.16)
  m = 3  (0.05)      0.47(0.10)  0.4(0.1)    0.4(0.14)    0.66(0.0)
  m = 4  (0.06)      0.38(0.11)  0.31(0.09)  0.40(0.18)   0.56(0.)
Finally, we performed similar experiments under the imbalanced setup, where the sample size of the first class was set to be m times larger than that of the other classes. The results are summarized in Table 2, showing that the performance of all methods tends to be degraded as the degree of imbalance increases. Thus, clustering becomes more challenging when the cluster sizes are imbalanced. Among the compared methods, the proposed SMIC still worked better than the other methods.

Overall, the proposed SMIC combined with LSMI was shown to be a useful alternative to existing clustering approaches.

5. Conclusions

In this paper, we proposed a novel information-maximization clustering method, which learns class-posterior probabilities in an unsupervised manner so that the squared-loss mutual information (SMI) between feature vectors and cluster assignments is maximized. The proposed algorithm, called SMI-based clustering (SMIC), allows us to obtain clustering solutions analytically by solving a kernel eigenvalue problem. Thus, unlike the previous information-maximization clustering methods (Agakov & Barber, 2006; Gomes et al., 2010), SMIC does not suffer from the problem of local optima. Furthermore, we proposed to use an optimal non-parametric SMI estimator called least-squares mutual information (LSMI) for data-driven parameter optimization. Through experiments, SMIC combined with LSMI was demonstrated to compare favorably with existing clustering methods.

Acknowledgments

We would like to thank Ryan Gomes for providing us with his program code of information-maximization clustering. MS was supported by SCAT, AOARD, and the FIRST program. MY and MK were supported by the JST PRESTO program, and HH was supported by the FIRST program.

References

Agakov, F. and Barber, D. Kernelized infomax clustering. NIPS 18. MIT Press, 2006.

Ali, S. M. and Silvey, S. D. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society, Series B, 28(1):131-142, 1966.

Bach, F. and Harchaoui, Z. DIFFRAC: A discriminative and flexible framework for clustering. NIPS 20, 2008.

Carreira-Perpiñán, M. Á. Fast nonparametric clustering with Gaussian blurring mean-shift. ICML, 2006.

Carreira-Perpiñán, M. Á. Gaussian mean shift is an EM algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29, 2007.

Cheng, Y. Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17, 1995.

Chung, F. R. K. Spectral Graph Theory. American Mathematical Society, Providence, 1997.

Cover, T. M. and Thomas, J. A. Elements of Information Theory. John Wiley & Sons, Inc., 2nd edition, 2006.

Csiszár, I. Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica, 2:299-318, 1967.

Dempster, A. P., Laird, N. M., and Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1-38, 1977.

Dhillon, I. S., Guan, Y., and Kulis, B. Kernel k-means, spectral clustering and normalized cuts. ACM SIGKDD, 2004.

Faivishevsky, L. and Goldberger, J. A nonparametric information theoretic clustering algorithm. ICML, 2010.

Fukunaga, K. and Hostetler, L. D. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Transactions on Information Theory, 21(1):32-40, 1975.

Girolami, M. Mercer kernel-based clustering in feature space. IEEE Transactions on Neural Networks, 13(3), 2002.

Golub, G. H. and Van Loan, C. F. Matrix Computations. Johns Hopkins University Press, 1996.

Gomes, R., Krause, A., and Perona, P. Discriminative clustering by regularized information maximization. NIPS 23, 2010.

Horn, R. A. and Johnson, C. R. Matrix Analysis. Cambridge University Press, 1985.

Hubert, L. and Arabie, P. Comparing partitions. Journal of Classification, 2(1):193-218, 1985.

Kullback, S. and Leibler, R. A. On information and sufficiency. Annals of Mathematical Statistics, 22:79-86, 1951.

Pearson, K. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine, 50, 1900.

Shi, J. and Malik, J. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 2000.

Silverman, B. W. Density Estimation for Statistics and Data Analysis. Chapman and Hall, 1986.

Song, L., Smola, A., Gretton, A., and Borgwardt, K. A dependence maximization view of clustering. ICML, 2007.

Sugiyama, M. Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis. Journal of Machine Learning Research, 8, 2007.

Suzuki, T., Sugiyama, M., Sese, J., and Kanamori, T. Approximating mutual information by maximum likelihood density ratio estimation. JMLR Workshop and Conference Proceedings, 4:5-20, 2008.

Suzuki, T., Sugiyama, M., Kanamori, T., and Sese, J. Mutual information estimation reveals global associations between stimuli and biological processes. BMC Bioinformatics, 10(1):S52, 2009.

Xu, L., Neufeld, J., Larson, B., and Schuurmans, D. Maximum margin clustering. NIPS 17, 2005.

Zelnik-Manor, L. and Perona, P. Self-tuning spectral clustering. NIPS 17, 2005.


More information

A New Supervised Clustering Algorithm Based on Min-Max Modular Network with Gaussian-Zero-Crossing Functions

A New Supervised Clustering Algorithm Based on Min-Max Modular Network with Gaussian-Zero-Crossing Functions 2006 Internationa Joint Conference on Neura Networks Sheraton Vancouver Wa Centre Hote, Vancouver, BC, Canada Juy 16-21, 2006 A New Supervised Custering Agorithm Based on Min-Max Moduar Network with Gaussian-Zero-Crossing

More information

Distance Weighted Discrimination and Second Order Cone Programming

Distance Weighted Discrimination and Second Order Cone Programming Distance Weighted Discrimination and Second Order Cone Programming Hanwen Huang, Xiaosun Lu, Yufeng Liu, J. S. Marron, Perry Haaand Apri 3, 2012 1 Introduction This vignette demonstrates the utiity and

More information

Lecture Notes for Chapter 4 Part III. Introduction to Data Mining

Lecture Notes for Chapter 4 Part III. Introduction to Data Mining Data Mining Cassification: Basic Concepts, Decision Trees, and Mode Evauation Lecture Notes for Chapter 4 Part III Introduction to Data Mining by Tan, Steinbach, Kumar Adapted by Qiang Yang (2010) Tan,Steinbach,

More information

A HYBRID FEATURE SELECTION METHOD BASED ON FISHER SCORE AND GENETIC ALGORITHM

A HYBRID FEATURE SELECTION METHOD BASED ON FISHER SCORE AND GENETIC ALGORITHM Journa of Mathematica Sciences: Advances and Appications Voume 37, 2016, Pages 51-78 Avaiabe at http://scientificadvances.co.in DOI: http://dx.doi.org/10.18642/jmsaa_7100121627 A HYBRID FEATURE SELECTION

More information

Mobile App Recommendation: Maximize the Total App Downloads

Mobile App Recommendation: Maximize the Total App Downloads Mobie App Recommendation: Maximize the Tota App Downoads Zhuohua Chen Schoo of Economics and Management Tsinghua University chenzhh3.12@sem.tsinghua.edu.cn Yinghui (Catherine) Yang Graduate Schoo of Management

More information

Sparse Representation based Face Recognition with Limited Labeled Samples

Sparse Representation based Face Recognition with Limited Labeled Samples Sparse Representation based Face Recognition with Limited Labeed Sampes Vijay Kumar, Anoop Namboodiri, C.V. Jawahar Center for Visua Information Technoogy, IIIT Hyderabad, India Abstract Sparse representations

More information

A Design Method for Optimal Truss Structures with Certain Redundancy Based on Combinatorial Rigidity Theory

A Design Method for Optimal Truss Structures with Certain Redundancy Based on Combinatorial Rigidity Theory 0 th Word Congress on Structura and Mutidiscipinary Optimization May 9 -, 03, Orando, Forida, USA A Design Method for Optima Truss Structures with Certain Redundancy Based on Combinatoria Rigidity Theory

More information

A Comparison of a Second-Order versus a Fourth- Order Laplacian Operator in the Multigrid Algorithm

A Comparison of a Second-Order versus a Fourth- Order Laplacian Operator in the Multigrid Algorithm A Comparison of a Second-Order versus a Fourth- Order Lapacian Operator in the Mutigrid Agorithm Kaushik Datta (kdatta@cs.berkeey.edu Math Project May 9, 003 Abstract In this paper, the mutigrid agorithm

More information

Image Segmentation Using Semi-Supervised k-means

Image Segmentation Using Semi-Supervised k-means I J C T A, 9(34) 2016, pp. 595-601 Internationa Science Press Image Segmentation Using Semi-Supervised k-means Reza Monsefi * and Saeed Zahedi * ABSTRACT Extracting the region of interest is a very chaenging

More information

Comparative Analysis of Relevance for SVM-Based Interactive Document Retrieval

Comparative Analysis of Relevance for SVM-Based Interactive Document Retrieval Comparative Anaysis for SVM-Based Interactive Document Retrieva Paper: Comparative Anaysis of Reevance for SVM-Based Interactive Document Retrieva Hiroshi Murata, Takashi Onoda, and Seiji Yamada Centra

More information

A Novel Linear-Polynomial Kernel to Construct Support Vector Machines for Speech Recognition

A Novel Linear-Polynomial Kernel to Construct Support Vector Machines for Speech Recognition Journa of Computer Science 7 (7): 99-996, 20 ISSN 549-3636 20 Science Pubications A Nove Linear-Poynomia Kerne to Construct Support Vector Machines for Speech Recognition Bawant A. Sonkambe and 2 D.D.

More information

A Petrel Plugin for Surface Modeling

A Petrel Plugin for Surface Modeling A Petre Pugin for Surface Modeing R. M. Hassanpour, S. H. Derakhshan and C. V. Deutsch Structure and thickness uncertainty are important components of any uncertainty study. The exact ocations of the geoogica

More information

Layer-Specific Adaptive Learning Rates for Deep Networks

Layer-Specific Adaptive Learning Rates for Deep Networks Layer-Specific Adaptive Learning Rates for Deep Networks arxiv:1510.04609v1 [cs.cv] 15 Oct 2015 Bharat Singh, Soham De, Yangmuzi Zhang, Thomas Godstein, and Gavin Tayor Department of Computer Science Department

More information

University of Illinois at Urbana-Champaign, Urbana, IL 61801, /11/$ IEEE 162

University of Illinois at Urbana-Champaign, Urbana, IL 61801, /11/$ IEEE 162 oward Efficient Spatia Variation Decomposition via Sparse Regression Wangyang Zhang, Karthik Baakrishnan, Xin Li, Duane Boning and Rob Rutenbar 3 Carnegie Meon University, Pittsburgh, PA 53, wangyan@ece.cmu.edu,

More information

Solutions to the Final Exam

Solutions to the Final Exam CS/Math 24: Intro to Discrete Math 5//2 Instructor: Dieter van Mekebeek Soutions to the Fina Exam Probem Let D be the set of a peope. From the definition of R we see that (x, y) R if and ony if x is a

More information

Neural Networks. Aarti Singh. Machine Learning Nov 3, Slides Courtesy: Tom Mitchell

Neural Networks. Aarti Singh. Machine Learning Nov 3, Slides Courtesy: Tom Mitchell Neura Networks Aarti Singh Machine Learning 10-601 Nov 3, 2011 Sides Courtesy: Tom Mitche 1 Logis0c Regression Assumes the foowing func1ona form for P(Y X): Logis1c func1on appied to a inear func1on of

More information

AN EVOLUTIONARY APPROACH TO OPTIMIZATION OF A LAYOUT CHART

AN EVOLUTIONARY APPROACH TO OPTIMIZATION OF A LAYOUT CHART 13 AN EVOLUTIONARY APPROACH TO OPTIMIZATION OF A LAYOUT CHART Eva Vona University of Ostrava, 30th dubna st. 22, Ostrava, Czech Repubic e-mai: Eva.Vona@osu.cz Abstract: This artice presents the use of

More information

Application of Intelligence Based Genetic Algorithm for Job Sequencing Problem on Parallel Mixed-Model Assembly Line

Application of Intelligence Based Genetic Algorithm for Job Sequencing Problem on Parallel Mixed-Model Assembly Line American J. of Engineering and Appied Sciences 3 (): 5-24, 200 ISSN 94-7020 200 Science Pubications Appication of Inteigence Based Genetic Agorithm for Job Sequencing Probem on Parae Mixed-Mode Assemby

More information

Neural Network Enhancement of the Los Alamos Force Deployment Estimator

Neural Network Enhancement of the Los Alamos Force Deployment Estimator Missouri University of Science and Technoogy Schoars' Mine Eectrica and Computer Engineering Facuty Research & Creative Works Eectrica and Computer Engineering 1-1-1994 Neura Network Enhancement of the

More information

A Robust Sign Language Recognition System with Sparsely Labeled Instances Using Wi-Fi Signals

A Robust Sign Language Recognition System with Sparsely Labeled Instances Using Wi-Fi Signals A Robust Sign Language Recognition System with Sparsey Labeed Instances Using Wi-Fi Signas Jiacheng Shang, Jie Wu Center for Networked Computing Dept. of Computer and Info. Sciences Tempe University Motivation

More information

A Memory Grouping Method for Sharing Memory BIST Logic

A Memory Grouping Method for Sharing Memory BIST Logic A Memory Grouping Method for Sharing Memory BIST Logic Masahide Miyazai, Tomoazu Yoneda, and Hideo Fuiwara Graduate Schoo of Information Science, Nara Institute of Science and Technoogy (NAIST), 8916-5

More information

Design of IP Networks with End-to. to- End Performance Guarantees

Design of IP Networks with End-to. to- End Performance Guarantees Design of IP Networks with End-to to- End Performance Guarantees Irena Atov and Richard J. Harris* ( Swinburne University of Technoogy & *Massey University) Presentation Outine Introduction Mutiservice

More information

Utility-based Camera Assignment in a Video Network: A Game Theoretic Framework

Utility-based Camera Assignment in a Video Network: A Game Theoretic Framework This artice has been accepted for pubication in a future issue of this journa, but has not been fuy edited. Content may change prior to fina pubication. Y.LI AND B.BHANU CAMERA ASSIGNMENT: A GAME-THEORETIC

More information

On Upper Bounds for Assortment Optimization under the Mixture of Multinomial Logit Models

On Upper Bounds for Assortment Optimization under the Mixture of Multinomial Logit Models On Upper Bounds for Assortment Optimization under the Mixture of Mutinomia Logit Modes Sumit Kunnumka September 30, 2014 Abstract The assortment optimization probem under the mixture of mutinomia ogit

More information

Endoscopic Motion Compensation of High Speed Videoendoscopy

Endoscopic Motion Compensation of High Speed Videoendoscopy Endoscopic Motion Compensation of High Speed Videoendoscopy Bharath avuri Department of Computer Science and Engineering, University of South Caroina, Coumbia, SC - 901. ravuri@cse.sc.edu Abstract. High

More information

JOINT IMAGE REGISTRATION AND EXAMPLE-BASED SUPER-RESOLUTION ALGORITHM

JOINT IMAGE REGISTRATION AND EXAMPLE-BASED SUPER-RESOLUTION ALGORITHM JOINT IMAGE REGISTRATION AND AMPLE-BASED SUPER-RESOLUTION ALGORITHM Hyo-Song Kim, Jeyong Shin, and Rae-Hong Park Department of Eectronic Engineering, Schoo of Engineering, Sogang University 35 Baekbeom-ro,

More information

Polygonal Approximation of Point Sets

Polygonal Approximation of Point Sets Poygona Approximation of Point Sets Longin Jan Latecki 1, Rof Lakaemper 1, and Marc Sobe 2 1 CIS Dept., Tempe University, Phiadephia, PA 19122, USA, atecki@tempe.edu, akamper@tempe.edu 2 Statistics Dept.,

More information

Chapter Multidimensional Direct Search Method

Chapter Multidimensional Direct Search Method Chapter 09.03 Mutidimensiona Direct Search Method After reading this chapter, you shoud be abe to:. Understand the fundamentas of the mutidimensiona direct search methods. Understand how the coordinate

More information

Alpha labelings of straight simple polyominal caterpillars

Alpha labelings of straight simple polyominal caterpillars Apha abeings of straight simpe poyomina caterpiars Daibor Froncek, O Nei Kingston, Kye Vezina Department of Mathematics and Statistics University of Minnesota Duuth University Drive Duuth, MN 82-3, U.S.A.

More information

Fuzzy Equivalence Relation Based Clustering and Its Use to Restructuring Websites Hyperlinks and Web Pages

Fuzzy Equivalence Relation Based Clustering and Its Use to Restructuring Websites Hyperlinks and Web Pages Fuzzy Equivaence Reation Based Custering and Its Use to Restructuring Websites Hyperinks and Web Pages Dimitris K. Kardaras,*, Xenia J. Mamakou, and Bi Karakostas 2 Business Informatics Laboratory, Dept.

More information

Hiding secrete data in compressed images using histogram analysis

Hiding secrete data in compressed images using histogram analysis University of Woongong Research Onine University of Woongong in Dubai - Papers University of Woongong in Dubai 2 iding secrete data in compressed images using histogram anaysis Farhad Keissarian University

More information

Load Balancing by MPLS in Differentiated Services Networks

Load Balancing by MPLS in Differentiated Services Networks Load Baancing by MPLS in Differentiated Services Networks Riikka Susitaiva, Jorma Virtamo, and Samui Aato Networking Laboratory, Hesinki University of Technoogy P.O.Box 3000, FIN-02015 HUT, Finand {riikka.susitaiva,

More information

Resource Optimization to Provision a Virtual Private Network Using the Hose Model

Resource Optimization to Provision a Virtual Private Network Using the Hose Model Resource Optimization to Provision a Virtua Private Network Using the Hose Mode Monia Ghobadi, Sudhakar Ganti, Ghoamai C. Shoja University of Victoria, Victoria C, Canada V8W 3P6 e-mai: {monia, sganti,

More information

Optimization and Application of Support Vector Machine Based on SVM Algorithm Parameters

Optimization and Application of Support Vector Machine Based on SVM Algorithm Parameters Optimization and Appication of Support Vector Machine Based on SVM Agorithm Parameters YAN Hui-feng 1, WANG Wei-feng 1, LIU Jie 2 1 ChongQing University of Posts and Teecom 400065, China 2 Schoo Of Civi

More information

CLOUD RADIO ACCESS NETWORK WITH OPTIMIZED BASE-STATION CACHING

CLOUD RADIO ACCESS NETWORK WITH OPTIMIZED BASE-STATION CACHING CLOUD RADIO ACCESS NETWORK WITH OPTIMIZED BASE-STATION CACHING Binbin Dai and Wei Yu Ya-Feng Liu Department of Eectrica and Computer Engineering University of Toronto, Toronto ON, Canada M5S 3G4 Emais:

More information

Transformation Invariance in Pattern Recognition: Tangent Distance and Propagation

Transformation Invariance in Pattern Recognition: Tangent Distance and Propagation Transformation Invariance in Pattern Recognition: Tangent Distance and Propagation Patrice Y. Simard, 1 Yann A. Le Cun, 2 John S. Denker, 2 Bernard Victorri 3 1 Microsoft Research, 1 Microsoft Way, Redmond,

More information

Quality of Service Evaluations of Multicast Streaming Protocols *

Quality of Service Evaluations of Multicast Streaming Protocols * Quaity of Service Evauations of Muticast Streaming Protocos Haonan Tan Derek L. Eager Mary. Vernon Hongfei Guo omputer Sciences Department University of Wisconsin-Madison, USA {haonan, vernon, guo}@cs.wisc.edu

More information

Binarized support vector machines

Binarized support vector machines Universidad Caros III de Madrid Repositorio instituciona e-archivo Departamento de Estadística http://e-archivo.uc3m.es DES - Working Papers. Statistics and Econometrics. WS 2007-11 Binarized support vector

More information

A Novel Method for Early Software Quality Prediction Based on Support Vector Machine

A Novel Method for Early Software Quality Prediction Based on Support Vector Machine A Nove Method for Eary Software Quaity Prediction Based on Support Vector Machine Fei Xing 1,PingGuo 1;2, and Michae R. Lyu 2 1 Department of Computer Science Beijing Norma University, Beijing, 1875, China

More information

Further Optimization of the Decoding Method for Shortened Binary Cyclic Fire Code

Further Optimization of the Decoding Method for Shortened Binary Cyclic Fire Code Further Optimization of the Decoding Method for Shortened Binary Cycic Fire Code Ch. Nanda Kishore Heosoft (India) Private Limited 8-2-703, Road No-12 Banjara His, Hyderabad, INDIA Phone: +91-040-3378222

More information

Neural Networks. Aarti Singh & Barnabas Poczos. Machine Learning / Apr 24, Slides Courtesy: Tom Mitchell

Neural Networks. Aarti Singh & Barnabas Poczos. Machine Learning / Apr 24, Slides Courtesy: Tom Mitchell Neura Networks Aarti Singh & Barnabas Poczos Machine Learning 10-701/15-781 Apr 24, 2014 Sides Courtesy: Tom Mitche 1 Logis0c Regression Assumes the foowing func1ona form for P(Y X): Logis1c func1on appied

More information

Improvement of Nearest-Neighbor Classifiers via Support Vector Machines

Improvement of Nearest-Neighbor Classifiers via Support Vector Machines From: FLAIRS-01 Proceedings. Copyright 2001, AAAI (www.aaai.org). A rights reserved. Improvement of Nearest-Neighbor Cassifiers via Support Vector Machines Marc Sebban and Richard Nock TRIVIA-Department

More information

Joint disparity and motion eld estimation in. stereoscopic image sequences. Ioannis Patras, Nikos Alvertos and Georgios Tziritas y.

Joint disparity and motion eld estimation in. stereoscopic image sequences. Ioannis Patras, Nikos Alvertos and Georgios Tziritas y. FORTH-ICS / TR-157 December 1995 Joint disparity and motion ed estimation in stereoscopic image sequences Ioannis Patras, Nikos Avertos and Georgios Tziritas y Abstract This work aims at determining four

More information

MACHINE learning techniques can, automatically,

MACHINE learning techniques can, automatically, Proceedings of Internationa Joint Conference on Neura Networks, Daas, Texas, USA, August 4-9, 203 High Leve Data Cassification Based on Network Entropy Fiipe Aves Neto and Liang Zhao Abstract Traditiona

More information

Automatic Hidden Web Database Classification

Automatic Hidden Web Database Classification Automatic idden Web atabase Cassification Zhiguo Gong, Jingbai Zhang, and Qian Liu Facuty of Science and Technoogy niversity of Macau Macao, PRC {fstzgg,ma46597,ma46620}@umac.mo Abstract. In this paper,

More information

MULTITASK MULTIVARIATE COMMON SPARSE REPRESENTATIONS FOR ROBUST MULTIMODAL BIOMETRICS RECOGNITION. Heng Zhang, Vishal M. Patel and Rama Chellappa

MULTITASK MULTIVARIATE COMMON SPARSE REPRESENTATIONS FOR ROBUST MULTIMODAL BIOMETRICS RECOGNITION. Heng Zhang, Vishal M. Patel and Rama Chellappa MULTITASK MULTIVARIATE COMMON SPARSE REPRESENTATIONS FOR ROBUST MULTIMODAL BIOMETRICS RECOGNITION Heng Zhang, Visha M. Pate and Rama Cheappa Center for Automation Research University of Maryand, Coage

More information

A Local Optimal Method on DSA Guiding Template Assignment with Redundant/Dummy Via Insertion

A Local Optimal Method on DSA Guiding Template Assignment with Redundant/Dummy Via Insertion A Loca Optima Method on DSA Guiding Tempate Assignment with Redundant/Dummy Via Insertion Xingquan Li 1, Bei Yu 2, Jiani Chen 1, Wenxing Zhu 1, 24th Asia and South Pacific Design T h e p i c Automation

More information

Outline. Introduce yourself!! What is Machine Learning? What is CAP-5610 about? Class information and logistics

Outline. Introduce yourself!! What is Machine Learning? What is CAP-5610 about? Class information and logistics Outine Introduce yoursef!! What is Machine Learning? What is CAP-5610 about? Cass information and ogistics Lecture Notes for E Apaydın 2010 Introduction to Machine Learning 2e The MIT Press (V1.0) About

More information

Forgot to compute the new centroids (-1); error in centroid computations (-1); incorrect clustering results (-2 points); more than 2 errors: 0 points.

Forgot to compute the new centroids (-1); error in centroid computations (-1); incorrect clustering results (-2 points); more than 2 errors: 0 points. Probem 1 a. K means is ony capabe of discovering shapes that are convex poygons [1] Cannot discover X shape because X is not convex. [1] DBSCAN can discover X shape. [1] b. K-means is prototype based and

More information

Semi-Supervised Learning with Sparse Distributed Representations

Semi-Supervised Learning with Sparse Distributed Representations Semi-Supervised Learning with Sparse Distributed Representations David Zieger dzieger@stanford.edu CS 229 Fina Project 1 Introduction For many machine earning appications, abeed data may be very difficut

More information

Solving Large Double Digestion Problems for DNA Restriction Mapping by Using Branch-and-Bound Integer Linear Programming

Solving Large Double Digestion Problems for DNA Restriction Mapping by Using Branch-and-Bound Integer Linear Programming The First Internationa Symposium on Optimization and Systems Bioogy (OSB 07) Beijing, China, August 8 10, 2007 Copyright 2007 ORSC & APORC pp. 267 279 Soving Large Doube Digestion Probems for DNA Restriction

More information

Constellation Models for Recognition of Generic Objects

Constellation Models for Recognition of Generic Objects 1 Consteation Modes for Recognition of Generic Objects Wei Zhang Schoo of Eectrica Engineering and Computer Science Oregon State University zhangwe@eecs.oregonstate.edu Abstract Recognition of generic

More information

Relative Positioning from Model Indexing

Relative Positioning from Model Indexing Reative Positioning from Mode Indexing Stefan Carsson Computationa Vision and Active Perception Laboratory (CVAP)* Roya Institute of Technoogy (KTH), Stockhom, Sweden Abstract We show how to determine

More information

ACTIVE LEARNING ON WEIGHTED GRAPHS USING ADAPTIVE AND NON-ADAPTIVE APPROACHES. Eyal En Gad, Akshay Gadde, A. Salman Avestimehr and Antonio Ortega

ACTIVE LEARNING ON WEIGHTED GRAPHS USING ADAPTIVE AND NON-ADAPTIVE APPROACHES. Eyal En Gad, Akshay Gadde, A. Salman Avestimehr and Antonio Ortega ACTIVE LEARNING ON WEIGHTED GRAPHS USING ADAPTIVE AND NON-ADAPTIVE APPROACHES Eya En Gad, Akshay Gadde, A. Saman Avestimehr and Antonio Ortega Department of Eectrica Engineering University of Southern

More information

Response Surface Model Updating for Nonlinear Structures

Response Surface Model Updating for Nonlinear Structures Response Surface Mode Updating for Noninear Structures Gonaz Shahidi a, Shamim Pakzad b a PhD Student, Department of Civi and Environmenta Engineering, Lehigh University, ATLSS Engineering Research Center,

More information

MULTIGRID REDUCTION IN TIME FOR NONLINEAR PARABOLIC PROBLEMS: A CASE STUDY

MULTIGRID REDUCTION IN TIME FOR NONLINEAR PARABOLIC PROBLEMS: A CASE STUDY MULTIGRID REDUCTION IN TIME FOR NONLINEAR PARABOLIC PROBLEMS: A CASE STUDY R.D. FALGOUT, T.A. MANTEUFFEL, B. O NEILL, AND J.B. SCHRODER Abstract. The need for paraeism in the time dimension is being driven

More information

Research of Classification based on Deep Neural Network

Research of  Classification based on Deep Neural Network 2018 Internationa Conference on Sensor Network and Computer Engineering (ICSNCE 2018) Research of Emai Cassification based on Deep Neura Network Wang Yawen Schoo of Computer Science and Engineering Xi

More information

1682 IEEE TRANSACTIONS ON FUZZY SYSTEMS, VOL. 22, NO. 6, DECEMBER Backward Fuzzy Rule Interpolation

1682 IEEE TRANSACTIONS ON FUZZY SYSTEMS, VOL. 22, NO. 6, DECEMBER Backward Fuzzy Rule Interpolation 1682 IEEE TRANSACTIONS ON FUZZY SYSTEMS, VOL. 22, NO. 6, DECEMBER 2014 Bacward Fuzzy Rue Interpoation Shangzhu Jin, Ren Diao, Chai Que, Senior Member, IEEE, and Qiang Shen Abstract Fuzzy rue interpoation

More information

A probabilistic fuzzy method for emitter identification based on genetic algorithm

A probabilistic fuzzy method for emitter identification based on genetic algorithm A probabitic fuzzy method for emitter identification based on genetic agorithm Xia Chen, Weidong Hu, Hongwen Yang, Min Tang ATR Key Lab, Coege of Eectronic Science and Engineering Nationa University of

More information

Fastest-Path Computation

Fastest-Path Computation Fastest-Path Computation DONGHUI ZHANG Coege of Computer & Information Science Northeastern University Synonyms fastest route; driving direction Definition In the United states, ony 9.% of the househods

More information

Quaternion Support Vector Classifier

Quaternion Support Vector Classifier Quaternion Support Vector Cassifier G. López-Gonzáez, Nancy Arana-Danie, and Eduardo Bayro-Corrochano CINVESTAV - Unidad Guadaajara, Av. de Bosque 1145, Coonia e Bajo, Zapopan, Jaisco, México {geopez,edb}@gd.cinvestav.mx

More information

A Discriminative Global Training Algorithm for Statistical MT

A Discriminative Global Training Algorithm for Statistical MT Discriminative Goba Training gorithm for Statistica MT Christoph Timann IBM T.J. Watson Research Center Yorktown Heights, N.Y. 10598 cti@us.ibm.com Tong Zhang Yahoo! Research New York Cit, N.Y. 10011 tzhang@ahoo-inc.com

More information

Handling Outliers in Non-Blind Image Deconvolution

Handling Outliers in Non-Blind Image Deconvolution Handing Outiers in Non-Bind Image Deconvoution Sunghyun Cho 1 Jue Wang 2 Seungyong Lee 1,2 sodomau@postech.ac.kr juewang@adobe.com eesy@postech.ac.kr 1 POSTECH 2 Adobe Systems Abstract Non-bind deconvoution

More information

Optimized Base-Station Cache Allocation for Cloud Radio Access Network with Multicast Backhaul

Optimized Base-Station Cache Allocation for Cloud Radio Access Network with Multicast Backhaul Optimized Base-Station Cache Aocation for Coud Radio Access Network with Muticast Backhau Binbin Dai, Student Member, IEEE, Ya-Feng Liu, Member, IEEE, and Wei Yu, Feow, IEEE arxiv:804.0730v [cs.it] 28

More information

Crossing Minimization Problems of Drawing Bipartite Graphs in Two Clusters

Crossing Minimization Problems of Drawing Bipartite Graphs in Two Clusters Crossing Minimiation Probems o Drawing Bipartite Graphs in Two Custers Lanbo Zheng, Le Song, and Peter Eades Nationa ICT Austraia, and Schoo o Inormation Technoogies, University o Sydney,Austraia Emai:

More information

Fault detection and classification by unsupervised feature extraction and dimensionality reduction

Fault detection and classification by unsupervised feature extraction and dimensionality reduction Compex Inte. Syst. (2015) 1:25 33 DOI 10.1007/s40747-015-0004-2 ORIGINAL ARTICLE Faut detection and cassification by unsupervised feature extraction and dimensionaity reduction Praveen Chopra 1,2 Sandeep

More information

WATERMARKING GIS DATA FOR DIGITAL MAP COPYRIGHT PROTECTION

WATERMARKING GIS DATA FOR DIGITAL MAP COPYRIGHT PROTECTION WATERMARKING GIS DATA FOR DIGITAL MAP COPYRIGHT PROTECTION Shen Tao Chinese Academy of Surveying and Mapping, Beijing 100039, China shentao@casm.ac.cn Xu Dehe Institute of resources and environment, North

More information

Formulation of Loss minimization Problem Using Genetic Algorithm and Line-Flow-based Equations

Formulation of Loss minimization Problem Using Genetic Algorithm and Line-Flow-based Equations Formuation of Loss minimization Probem Using Genetic Agorithm and Line-Fow-based Equations Sharanya Jaganathan, Student Member, IEEE, Arun Sekar, Senior Member, IEEE, and Wenzhong Gao, Senior member, IEEE

More information

Reference trajectory tracking for a multi-dof robot arm

Reference trajectory tracking for a multi-dof robot arm Archives of Contro Sciences Voume 5LXI, 5 No. 4, pages 53 57 Reference trajectory tracking for a muti-dof robot arm RÓBERT KRASŇANSKÝ, PETER VALACH, DÁVID SOÓS, JAVAD ZARBAKHSH This paper presents the

More information

GPU Implementation of Parallel SVM as Applied to Intrusion Detection System

GPU Implementation of Parallel SVM as Applied to Intrusion Detection System GPU Impementation of Parae SVM as Appied to Intrusion Detection System Sudarshan Hiray Research Schoar, Department of Computer Engineering, Vishwakarma Institute of Technoogy, Pune, India sdhiray7@gmai.com

More information

Development of a hybrid K-means-expectation maximization clustering algorithm

Development of a hybrid K-means-expectation maximization clustering algorithm Journa of Computations & Modeing, vo., no.4, 0, -3 ISSN: 79-765 (print, 79-8850 (onine Scienpress Ltd, 0 Deveopment of a hybrid K-means-expectation maximization custering agorithm Adigun Abimboa Adebisi,

More information

FACE RECOGNITION WITH HARMONIC DE-LIGHTING. s: {lyqing, sgshan, wgao}jdl.ac.cn

FACE RECOGNITION WITH HARMONIC DE-LIGHTING.  s: {lyqing, sgshan, wgao}jdl.ac.cn FACE RECOGNITION WITH HARMONIC DE-LIGHTING Laiyun Qing 1,, Shiguang Shan, Wen Gao 1, 1 Graduate Schoo, CAS, Beijing, China, 100080 ICT-ISVISION Joint R&D Laboratory for Face Recognition, CAS, Beijing,

More information

A NEW APPROACH FOR BLOCK BASED STEGANALYSIS USING A MULTI-CLASSIFIER

A NEW APPROACH FOR BLOCK BASED STEGANALYSIS USING A MULTI-CLASSIFIER Internationa Journa on Technica and Physica Probems of Engineering (IJTPE) Pubished by Internationa Organization of IOTPE ISSN 077-358 IJTPE Journa www.iotpe.com ijtpe@iotpe.com September 014 Issue 0 Voume

More information

A Column Generation Approach for Support Vector Machines

A Column Generation Approach for Support Vector Machines A Coumn Generation Approach for Support Vector Machines Emiio Carrizosa Universidad de Sevia (Spain). ecarrizosa@us.es Beén Martín-Barragán Universidad de Sevia (Spain). bemart@us.es Doores Romero Moraes

More information

As Michi Henning and Steve Vinoski showed 1, calling a remote

As Michi Henning and Steve Vinoski showed 1, calling a remote Reducing CORBA Ca Latency by Caching and Prefetching Bernd Brügge and Christoph Vismeier Technische Universität München Method ca atency is a major probem in approaches based on object-oriented middeware

More information

Clustering via Mode Seeking by Direct Estimation of the Gradient of a Log-Density

Clustering via Mode Seeking by Direct Estimation of the Gradient of a Log-Density Clustering via Mode Seeking by Direct Estimation of the Gradient of a Log-Density Hiroaki Sasaki 1, Aapo Hyvärinen 2,3, and Masashi Sugiyama 1 1 Graduate School of Information Science and Engineering,

More information

Fast b-matching via Sufficient Selection Belief Propagation

Fast b-matching via Sufficient Selection Belief Propagation Fast b-matching via Sufficient Seection Beief Propagation Bert Huang Computer Science Department Coumbia University New York, NY 127 bert@cs.coumbia.edu Tony Jebara Computer Science Department Coumbia

More information

Intro to Programming & C Why Program? 1.2 Computer Systems: Hardware and Software. Why Learn to Program?

Intro to Programming & C Why Program? 1.2 Computer Systems: Hardware and Software. Why Learn to Program? Intro to Programming & C++ Unit 1 Sections 1.1-3 and 2.1-10, 2.12-13, 2.15-17 CS 1428 Spring 2018 Ji Seaman 1.1 Why Program? Computer programmabe machine designed to foow instructions Program a set of

More information

Learning to Learn Second-Order Back-Propagation for CNNs Using LSTMs

Learning to Learn Second-Order Back-Propagation for CNNs Using LSTMs Learning to Learn Second-Order Bac-Propagation for CNNs Using LSTMs Anirban Roy SRI Internationa Meno Par, USA anirban.roy@sri.com Sinisa Todorovic Oregon State University Corvais, USA sinisa@eecs.oregonstate.edu

More information

Outline. Parallel Numerical Algorithms. Forward Substitution. Triangular Matrices. Solving Triangular Systems. Back Substitution. Parallel Algorithm

Outline. Parallel Numerical Algorithms. Forward Substitution. Triangular Matrices. Solving Triangular Systems. Back Substitution. Parallel Algorithm Outine Parae Numerica Agorithms Chapter 8 Prof. Michae T. Heath Department of Computer Science University of Iinois at Urbana-Champaign CS 554 / CSE 512 1 2 3 4 Trianguar Matrices Michae T. Heath Parae

Special Edition Using Microsoft Excel Selecting and Naming Cells and Ranges

Special Edition Using Microsoft Excel 2000, Lesson 3: Selecting and Naming Cells and Ranges. [Figures are not included in this sample chapter]

Real-Time Feature Descriptor Matching via a Multi-Resolution Exhaustive Search Method

Real-Time Feature Descriptor Matching via a Multi-Resolution Exhaustive Search Method 297 Rea-Time Feature escriptor Matching via a Muti-Resoution Ehaustive Search Method Chi-Yi Tsai, An-Hung Tsao, and Chuan-Wei Wang epartment of Eectrica Engineering, Tamang University, New Taipei City,

Lecture outline Graphics and Interaction Scan Converting Polygons and Lines. Inside or outside a polygon? Scan conversion.

Lecture outline Graphics and Interaction Scan Converting Polygons and Lines. Inside or outside a polygon? Scan conversion. Lecture outine 433-324 Graphics and Interaction Scan Converting Poygons and Lines Department of Computer Science and Software Engineering The Introduction Scan conversion Scan-ine agorithm Edge coherence

A study of comparative evaluation of methods for image processing using color features

A study of comparative evaluation of methods for image processing using color features A study of comparative evauation of methods for image processing using coor features FLORENTINA MAGDA ENESCU,CAZACU DUMITRU Department Eectronics, Computers and Eectrica Engineering University Pitești

A Two-Step Approach to Hallucinating Faces: Global Parametric Model and Local Nonparametric Model

A Two-Step Approach to Hallucinating Faces: Global Parametric Model and Local Nonparametric Model A Two-Step Approach to aucinating Faces: Goba Parametric Mode and Loca Nonparametric Mode Ce Liu eung-yeung Shum Chang-Shui Zhang State Key Lab of nteigent Technoogy and Systems, Dept. of Automation, Tsinghua

Complex Human Activity Searching in a Video Employing Negative Space Analysis

Complex Human Activity Searching in a Video Employing Negative Space Analysis Compex Human Activity Searching in a Video Empoying Negative Space Anaysis Shah Atiqur Rahman, Siu-Yeung Cho, M.K.H. Leung 3, Schoo of Computer Engineering, Nanyang Technoogica University, Singapore 639798

Multiple Plane Phase Retrieval Based On Inverse Regularized Imaging and Discrete Diffraction Transform

Multiple Plane Phase Retrieval Based On Inverse Regularized Imaging and Discrete Diffraction Transform Mutipe Pane Phase Retrieva Based On Inverse Reguaried Imaging and Discrete Diffraction Transform Artem Migukin, Vadimir Katkovnik, and Jaakko Astoa Department of Signa Processing, Tampere University of

Multi-Robot Pose Graph Localization and Data Association from Unknown Initial Relative Poses

Multi-Robot Pose Graph Localization and Data Association from Unknown Initial Relative Poses 1 Muti-Robot Pose Graph Locaization and Data Association from Unknown Initia Reative Poses Vadim Indeman, Erik Neson, Nathan Michae and Frank Deaert Institute of Robotics and Inteigent Machines (IRIM)

Multi-task hidden Markov modeling of spectrogram feature from radar high-resolution range profiles

Multi-task hidden Markov modeling of spectrogram feature from radar high-resolution range profiles http://asp.eurasipjournas.com/content/22//86 RESEARCH Open Access Muti-task hidden Markov modeing of spectrogram feature from radar high-resoution range profies Mian Pan, Lan Du *, Penghui Wang, Hongwei

LEARNING causal structures is one of the main problems

LEARNING causal structures is one of the main problems cupc: CUDA-based Parae PC Agorithm for Causa Structure Learning on PU Behrooz Zare, Foad Jafarinejad, Matin Hashemi, and Saber Saehkaeybar arxiv:8.89v [cs.dc] Dec 8 Abstract The main goa in many fieds

Quality Assessment using Tone Mapping Algorithm

Quality Assessment using Tone Mapping Algorithm Quaity Assessment using Tone Mapping Agorithm Nandiki.pushpa atha, Kuriti.Rajendra Prasad Research Schoar, Assistant Professor, Vignan s institute of engineering for women, Visakhapatnam, Andhra Pradesh,

fastruct: model-based clustering made faster

fastruct: model-based clustering made faster fastruct: mode-based custering made faster Chibiao Chen Forence Forbes Oivier François INRIA Rhone-Apes, 655 Avenue de Europe, Montbonnot, 38334 St Ismier France TIMC-TIMB, Dept Math Bioogy, Facuty of
