A Spectral Clustering Approach to Optimally Combining Numerical Vectors with a Modular Network


Motoi Shiga
Bioinformatics Center, Kyoto University, Gokasho, Uji 6-, Japan
shiga@kuicr.kyoto-u.ac.jp

Ichigaku Takigawa
Bioinformatics Center, Kyoto University, Gokasho, Uji 6-, Japan
takigawa@kuicr.kyoto-u.ac.jp

Hiroshi Mamitsuka
Bioinformatics Center, Kyoto University, Gokasho, Uji 6-, Japan
mami@kuicr.kyoto-u.ac.jp

ABSTRACT

We address the issue of clustering numerical vectors with a network. The problem setting is basically equivalent to constrained clustering by Wagstaff and Cardie [] and semi-supervised clustering by Basu et al. [], but our focus is more on the optimal combination of two heterogeneous data sources. One application of this setting is web pages, which can be numerically vectorized by their contents, e.g. term frequencies, and which are hyperlinked to each other, forming a network. Another typical application is genes, whose behavior can be numerically measured and for which a gene network can be given from another data source. We first define a new graph clustering measure, which we call normalized network modularity, by balancing the original modularity by the cluster size. We then propose a new clustering method which integrates the cost of clustering numerical vectors with the cost of maximizing the normalized network modularity into a spectral relaxation problem. Our learning algorithm is based on spectral clustering, which turns our issue into an eigenvalue problem and uses k-means for the final cluster assignments. A significant advantage of our method is that we can optimize the weight parameter for balancing the two costs from the given data by choosing the value with the minimum total cost. We evaluated the performance of our proposed method using a variety of datasets, including synthetic data as well as real-world data from molecular biology. Experimental results showed that our method is effective enough to give good results for clustering by numerical vectors and a network.

Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications - Data mining; I.5.3 [Pattern Recognition]: Clustering - Algorithms

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. KDD'07, August 12-15, 2007, San Jose, California, USA. Copyright 2007 ACM /7/8 $5

General Terms
Algorithms and Experimentation

Keywords
Network modularity, spectral clustering, heterogeneous data sources, k-means, eigenvalue problem

1. INTRODUCTION

Clustering, a major research subject in data mining, has been successfully applied to a wide variety of areas in the real world. In this paper, we address the issue of clustering numerical vectors with a network. This is a general setting which can be found in many applications and is basically equivalent to constrained clustering by Wagstaff and Cardie [] and semi-supervised clustering by Basu et al. [], but our focus is more on the optimal combination of two heterogeneous data sources, numerical vectors and a network.

A typical application is web pages. In this case, web pages are clustered by their contents, say term frequencies, based on the assumption that if the contents of one page are very similar to those of another, these pages can be in the same cluster. At the same time, web pages are linked together, forming a network in which nodes and edges correspond to web pages and hyperlinks between them, respectively, and they can also be clustered based on the hyperlink connectivity.

Clustering web pages based on hyperlinks is exactly graph partitioning. Standard criteria for graph partitioning are ratio cut [8] and normalized cut [5]. Simply speaking, these criteria minimize the number of inter-cluster edges relative to the size of a cluster. The number of inter-cluster edges, which is called the graph min-cut, measures the partitioning performance, while the size of a cluster is used to avoid a small cluster which might be generated by outliers. In other words, these criteria are obtained by normalizing the graph min-cut by the cluster size. Given a network, minimizing normalized cut (or ratio cut) is a trace optimization problem which is NP-hard. This problem is therefore usually converted by spectral relaxation into an optimization problem with a constraint which can be solved by Lagrange multipliers, and the solution is given by an eigenvalue problem. This approach is called spectral graph clustering. Because the relaxed solution does not definitely assign a cluster label to each node, k-means is usually applied to the resultant eigenvectors for the final cluster assignments.

In the problem setting of graph partitioning, a numerical weight can be attached to each edge of a given graph, and one may think that the similarity between two web pages can be a weight for the hyperlink between them. However, we assume that a given graph and a given set of numerical vectors are independently observed. This assumption is natural. For example, the contents of web pages and their links are independently generated. More concretely, two web pages can be very similar to each other even if there are no links between them. Thus we note that our problem cannot be solved directly by using an existing graph partitioning method only.

Another application is genes. Genes are expressed and function in a cell. Currently the quantitative expression of thousands of genes can be measured simultaneously by using cDNA microarrays, a recent technology in genetic engineering. We can obtain a numerical vector (generally called a profile) for each gene by repeating the cDNA microarray experiment. However, cDNA microarray data is very noisy and unreliable. Thus we often need another data source in clustering genes, since precise gene clustering is important in predicting gene function [6]. We can have more reliable information on genes in the form of a gene network, although such networks are confined to relatively well-studied genes. For example, literature information provides us with the co-occurrence frequencies of genes in medical documents, which can be turned into a network of genes by setting a cut-off value against the frequencies. Similarly, metabolic or gene regulatory networks which are generated from the literature are much more reliable than microarray data.

A potential approach for our problem setting would be to integrate the two data sources, numerical vectors and a network. A typical example of this direction is semi-supervised clustering based on a hidden Markov random field (HMRF) []. Semi-supervised clustering by HMRF clusters numerical vectors by minimizing an objective function containing the squared Euclidean distance as well as weighted network constraints. This method was extended to a more general framework in which the objective function can be expressed as a trace optimization problem, for which an efficient weighted kernel k-means algorithm was proposed []. This work is based on the idea that minimizing the cost (or the objective function) of semi-supervised clustering by HMRF can be a trace optimization problem, which is also true of minimizing the objective function of the weighted kernel k-means and, more generally, of minimizing a graph cut criterion such as normalized cut or ratio cut [3]. This work inspired us to combine numerical vectors with a network, but we note that our criterion for graph partitioning is clearly different from that in [] and our focus is more on optimally combining the two data sources in terms of clustering.

Recent analyses of networks in real-world data have revealed that they have some common global characteristics such as small-world phenomena [], the scale-free property [], self-similarity [7] and hierarchical modularity [4]. We can expect that the performance of clustering in our problem setting might be improved more by using some global network property than by using vicinity information like the Markov property.

In light of the above, we propose a new method for clustering numerical vectors with a network. We focus on network modularity, a global feature found in many real-world networks such as gene networks [4], which should be a powerful criterion for clustering by graph connectivity. The original network modularity [3, 7, 6] is, intuitively, the number of intra-cluster edges minus the square of the number of inter-cluster edges. That is, this measure is given by using only the edges of a graph and is not balanced by the cluster size. Thus we first define normalized network modularity, which is obtained by dividing the original network modularity by the cluster size. We then integrate the normalized network modularity with the cost of clustering numerical vectors into the framework of a trace optimization problem. Our clustering algorithm is based on spectral clustering, by which our issue is relaxed into an eigenvalue problem, and the final clusters are assigned by the k-means clustering algorithm from the resultant eigenvectors. We stress that our work is an approach for clustering with not only numerical vectors but also the network modularity. A significant merit of our method is that we can optimize the weight parameter for balancing the two data sources by choosing the value which minimizes the total cost.

We evaluated the performance of our method using three types of datasets including both synthetic and real-world data. Our first dataset was synthetic both in numerical vectors and in the network. Numerical vectors were generated randomly according to a mixture of von Mises-Fisher distributions, and a network was generated by selecting node pairs randomly. We confirmed the effectiveness of our method by checking normalized mutual information (NMI), a measure of the performance of clustering methods, and the total cost of clustering, while varying the value of the weight parameter for balancing the two data sources. In particular, we found that NMI was mostly maximized at the weight parameter value which minimized the total cost, meaning that our strategy of choosing the weight parameter value of the minimum total cost worked successfully.

The second dataset consisted of synthetic numerical vectors and a real metabolic network having the scale-free property and unbalanced cluster sizes. From the experimental results on this dataset, we showed that our method of optimally combining numerical vectors with a given network worked favorably even on a real scale-free network with unbalanced cluster sizes.

The third dataset consisted of real microarray expression profiles, corresponding to numerical vectors, and the real gene network used in the second dataset. We note that gene expression measured by microarray is heavily noisy and unreliable, while the network we used is from a database which is manually curated and trustworthy. Interestingly, the resultant weight parameter value was extremely biased toward the network information, which is consistent with the above difference in reliability between the two input data sources.

2. METHOD

2.1 Preliminaries and Notations

We describe the notation that will be used throughout this paper. Let $N$ be the number of given numerical vectors (data points). Let $E$ be the $N \times N$ matrix whose entries are all one. Let $X := (x_1, \ldots, x_N)$ be the given numerical vectors. Each $x_n$ has $p$ entries, and let $x_n(i)$ be the $i$-th entry of $x_n$. Let $\tilde{X} := (\tilde{x}_1, \ldots, \tilde{x}_N)$ where $\tilde{x}_n := x_n / \sqrt{\sum_{i=1}^{p} x_n(i)^2}$, and let $Y := \tilde{X}^T \tilde{X}$.

Let $G$ be a given network with $N$ nodes and edges. Let $W \in \mathbb{R}^{N \times N}$ be a non-negative, symmetric matrix whose $(i, j)$ entry $w_{ij}$ is a non-negative weight between nodes $i$ and $j$. If there is no edge between nodes $i$ and $j$, $w_{ij}$ is zero. We note that in our problem setting, $W$ is an input carrying all information on the given graph $G$ and is often called a weight matrix or an affinity matrix. Let $\tilde{W}$ be the matrix whose $(i, j)$ entry $\tilde{w}_{ij}$ satisfies $\tilde{w}_{ij} = w_{ij} / \sum_{j=1}^{N} w_{ij}$. Let $D$ be the $N \times N$ diagonal matrix whose $(i, i)$ entry $d_i$ satisfies $d_i = \sum_{j=1}^{N} w_{ij}$. Let $\tilde{D} := d d^T$ where $d := (d_1, \ldots, d_N)^T$. Let $M$ be the matrix which satisfies $M := -D^{-1} W$.

Let $K$ be the number of clusters, which is an input. Let $I_K$ be the identity matrix of size $K$. Let $Z := (z_1, \ldots, z_K)$ be an unsigned cluster assignment in which $z_k^T = (z_{k,1}, \ldots, z_{k,N})$, where $z_{k,n} \in \{0, 1\}$ is one if $x_n$ is in cluster $k$ and zero otherwise. Let $\bar{Z} := Z (Z^T Z)^{-1/2}$. Let $\mu := (\mu_1, \ldots, \mu_K)$ where $\mu_k$ is the representative (or the cluster center) of cluster $k$. Let $\mathcal{Z}_k$ be the set of nodes in the given graph (or numerical vectors) in cluster $k$, and let $|\mathcal{Z}_k|$ be the number of nodes in cluster $k$. Let $\mathcal{Z} := \bigcup_{k=1}^{K} \mathcal{Z}_k$. Let $V$ be the diagonal matrix whose $(k, k)$ entry is $|\mathcal{Z}_k|$. Let $L(\mathcal{Z}', \mathcal{Z}'') := \sum_{i \in \mathcal{Z}'} \sum_{j \in \mathcal{Z}''} w_{ij}$, and $L := L(\mathcal{Z}, \mathcal{Z})$.

Let $J$ be a cost of clustering numerical vectors $X$ (and/or nodes in network $G$). Let $\omega$ be a numerical parameter which takes a value between zero and one and balances the two data sources, i.e. numerical vectors $X$ and network $G$.

2.2 k-means

We first briefly review the k-means clustering algorithm, which is widely used in many applications. The cost (or the objective function) of the k-means algorithm is given as follows:

$$J_{num}(X, Z; \mu) = \frac{1}{N} \sum_{k=1}^{K} \sum_{i \in \mathcal{Z}_k} \mathrm{Dist}(x_i, \mu_k), \qquad (1)$$

where $\mathrm{Dist}(x_i, \mu_k)$ is a distance between numerical vector $x_i$ and the representative $\mu_k$ of cluster $k$. We can use any distance, such as one based on the Euclidean distance, cosine similarity or the Pearson correlation coefficient. In our experiments, we used cosine similarity, which is used for clustering high-dimensional data such as text documents [5]:

$$\mathrm{Dist}(x_i, \mu_k) = 1 - \frac{x_i^T \mu_k}{\sqrt{x_i^T x_i}}.$$

The k-means algorithm minimizes the cost of Eq. (1) by repeating the following two steps alternately until convergence: 1) updating the cluster representatives and 2) updating the cluster labels. Figure 1 shows the pseudocode of the k-means clustering algorithm.

    Input: $X$, $K$, $Z^{(0)}$, $\mu^{(0)}$
    Output: $Z$, $\mu$, $J$
    k-means($X$, $K$, $Z^{(0)}$, $\mu^{(0)}$)
      $Z \leftarrow Z^{(0)}$, $\mu \leftarrow \mu^{(0)}$
      $\tilde{X} \leftarrow X / \sqrt{X^T X}$
      while $J(\tilde{X}, Z; \mu)$ is not converged do
        $\mu_k^{(t+1)} \leftarrow \frac{1}{|\mathcal{Z}_k^{(t)}|} \sum_{j \in \mathcal{Z}_k^{(t)}} \tilde{x}_j$  $(k = 1, \ldots, K)$
        $Z^{(t+1)} \leftarrow \arg\min_Z J(\tilde{X}, Z; \mu^{(t+1)})$
        $Z \leftarrow Z^{(t+1)}$, $\mu \leftarrow \mu^{(t+1)}$, $J \leftarrow J(\tilde{X}, Z; \mu)$
      end while

Figure 1: Pseudocode of k-means.

2.3 Maximizing Normalized Network Modularity

2.3.1 Spectral Graph Partitioning

We briefly review K-way graph partitioning, which divides the nodes of a given graph (network) into $K$ clusters. The standard criteria to be minimized in K-way graph partitioning are ratio cut and normalized cut, which are given as follows:

$$\text{Ratio cut:} \quad \sum_{k=1}^{K} \frac{L(\mathcal{Z}_k, \mathcal{Z} \setminus \mathcal{Z}_k)}{|\mathcal{Z}_k|}, \qquad \text{Normalized cut:} \quad \sum_{k=1}^{K} \frac{L(\mathcal{Z}_k, \mathcal{Z} \setminus \mathcal{Z}_k)}{L(\mathcal{Z}_k, \mathcal{Z})}.$$

The numerator, which is common to the two cuts, is the so-called graph min-cut. If we use the graph min-cut only, the clustering result is very sensitive to outliers. That is, a small cluster might be formed if it is relatively isolated from other nodes. So we need to normalize the graph min-cut by the number of nodes (ratio cut) or the total weight (normalized cut) in each cluster. The above criteria can be rewritten as follows:

$$\text{Ratio cut:} \quad \sum_{k=1}^{K} \frac{z_k^T (D - W) z_k}{z_k^T z_k}, \qquad \text{Normalized cut:} \quad \sum_{k=1}^{K} \frac{z_k^T (D - W) z_k}{z_k^T D z_k}.$$

Finding a set of clusters which minimizes either criterion is an NP-hard problem, so we solve it by relaxing the discrete cluster indicator matrix to a real-valued one. We then have the following optimization problem:

$$\text{minimize} \quad \mathrm{tr}\left(\bar{Z}^T (D - W) \bar{Z}\right) \qquad \text{subject to} \quad \bar{Z}^T \bar{Z} = I_K \ \text{(ratio cut)} \quad \left(\bar{Z}^T D \bar{Z} = I_K \ \text{(normalized cut)}\right).$$

By using Lagrange multipliers, we can easily find that the solution of this optimization is given by an eigenvalue problem. In the standard manner of spectral clustering, after solving the eigenvalue problem, we first select the resultant $K$ eigenvectors with the minimum eigenvalues. Then, since we cannot directly assign a cluster label to each numerical vector (or each node in the given graph) from the resultant eigenvectors, we apply the k-means clustering algorithm to the selected $K$ eigenvectors after normalizing them.

2.3.2 Maximizing Normalized Network Modularity with Spectral Graph Partitioning

Network modularity is originally defined as follows [6]:

$$Q(G) = \sum_{k=1}^{K} \left\{ \frac{e_k(G)}{L} - \left( \frac{g_k(G)}{L} \right)^2 \right\}, \qquad (2)$$

where $e_k(G)$ is the number of edges in cluster $k$ and $g_k(G)$ is the total sum of degrees over all nodes in cluster $k$. As shown above, the weight attached to each edge is not considered in the original definition of network modularity.
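As a concrete illustration of the modularity definition above, the following minimal sketch computes the measure from a symmetric weight matrix and a vector of cluster labels. It rests on an assumption that is ours rather than the paper's: the within-cluster term is taken as the sum of the weight matrix over the cluster's block (so each intra-cluster edge is counted in both directions) and $L$ is taken as the sum of all entries of $W$, which makes the two fractions comparable. The function name `network_modularity` is hypothetical.

```python
import numpy as np


def network_modularity(W, labels):
    """Modularity of a partition, computed from a symmetric weight matrix.

    Assumption (ours): within-cluster weight counts every edge in both
    directions, and the normaliser is the sum of all entries of W.
    """
    W = np.asarray(W, dtype=float)
    labels = np.asarray(labels)
    total = W.sum()              # L = L(Z, Z)
    degrees = W.sum(axis=1)      # d_i = sum_j w_ij
    q = 0.0
    for k in np.unique(labels):
        members = labels == k
        within = W[np.ix_(members, members)].sum()   # L(Z_k, Z_k)
        attached = degrees[members].sum()            # L(Z_k, Z)
        q += within / total - (attached / total) ** 2
    return q
```

For a 0/1 adjacency matrix this reduces to the usual unweighted count. A partition that puts every node in a single cluster scores zero, which is a quick sanity check on any implementation.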

Hereafter, we incorporate the edge weight into the network modularity, meaning that the binary definition on each edge is turned into a numerical one. We note that the property of the network modularity is fully kept in this extension. Eq. (2) can then be rewritten by using only $L$ as follows:

$$Q(W) = \sum_{k=1}^{K} \left\{ \frac{L(\mathcal{Z}_k, \mathcal{Z}_k)}{L} - \left( \frac{L(\mathcal{Z}_k, \mathcal{Z})}{L} \right)^2 \right\}.$$

This measure is defined by the (weighted) number of edges only. More importantly, the original network modularity is not balanced by the cluster size, meaning that a cluster might become small when affected by outliers. Thus we define a new measure, which we call normalized network modularity, whose cost (or objective function) is given as follows:

$$J_{net}(W, Z) = -\sum_{k=1}^{K} \frac{N}{|\mathcal{Z}_k|} \left\{ \frac{L(\mathcal{Z}_k, \mathcal{Z}_k)}{L} - \left( \frac{L(\mathcal{Z}_k, \mathcal{Z})}{L} \right)^2 \right\}.$$

This equation indicates that the more negative this cost is for a clustered network, the higher the normalized network modularity of that network. The problem of finding the set of clusters which minimizes this cost is NP-hard. We therefore apply the spectral clustering approach to minimize the cost of normalized network modularity, $J_{net}(W, Z)$. In the following derivation, we partly borrow the idea of the spectral graph clustering by White and Smith, which was developed for the original network modularity [].

Since $L(\mathcal{Z}_k, \mathcal{Z}_k) = z_k^T W z_k$, $L(\mathcal{Z}_k, \mathcal{Z})^2 = z_k^T \tilde{D} z_k$ and $|\mathcal{Z}_k| = z_k^T z_k$, we can first turn the problem of minimizing $J_{net}(W, Z)$ into the problem of minimizing the following trace:

$$J_{net}(W, Z) = \mathrm{tr}\left( (Z^T Z)^{-1} Z^T \frac{N}{L} \left( \frac{\tilde{D}}{L} - W \right) Z \right). \qquad (3)$$

As in spectral graph clustering using ratio (or normalized) cut, matrix $Z$ must satisfy the following constraint, since each node falls into one cluster only:

$$Z^T Z = V.$$

We can then replace $Z$ with $\bar{Z}$ and, by relaxing the discrete cluster indicators into real-valued indicators, rewrite the trace optimization problem as follows (up to the constant factor $N$, which does not change the minimizer):

$$\text{minimize} \quad \mathrm{tr}\left( \bar{Z}^T \left( \frac{1}{L^2} \tilde{D} - \frac{1}{L} W \right) \bar{Z} \right) \qquad \text{subject to} \quad \bar{Z}^T \bar{Z} = I_K.$$

The solution can be found via Lagrange multipliers as the following typical eigenvalue problem:

$$\left( \frac{1}{L^2} \tilde{D} - \frac{1}{L} W \right) \bar{Z} = \bar{Z} \Lambda,$$

where $\Lambda$ is a diagonal matrix of Lagrange multipliers. This can be further modified using $\tilde{W}$, the normalized $W$, into:

$$\left( -\frac{1}{L} \tilde{W} + \frac{1}{L^2} E \right) \bar{Z} = \bar{Z} \Lambda. \qquad (4)$$

When $N \to \infty$, the second term of Eq. (4) approaches zero faster than the first term. In addition, we can approximate $\tilde{W}$ by $D^{-1} W$. We then approximate Eq. (4) by the following simple eigenvalue problem:

$$M \bar{Z} = \bar{Z} \Lambda. \qquad (5)$$

    Input: $W$, $K$, $Z^{(0)}$, $\mu^{(0)}$
    Output: $Z$
      Compute $D$ from $W$
      $M \leftarrow -D^{-1} W$
      Compute $H$ of $M$
      Normalize $H$ into $\bar{H}$
      $[Z, \mu, J] \leftarrow$ k-means($\bar{H}$, $K$, $Z^{(0)}$, $\mu^{(0)}$)

Figure 2: Pseudocode of spectral clustering for normalized network modularity.

This modification allows $M$ to be a very sparse matrix, meaning that we can reduce the computational cost of solving the eigenvalue problem in Eq. (5). After solving Eq. (5), we have the resultant $K - 1$ eigenvectors with the minimum eigenvalues. That is, we have an $N \times (K - 1)$ matrix $H = (h_1, \ldots, h_{K-1})$, where $h_i$ is the $i$-th of the selected eigenvectors. We then normalize this matrix into an $N \times (K - 1)$ matrix $\bar{H}$ whose $(n, k)$ entry $\bar{h}_{n,k}$ satisfies $\bar{h}_{n,k} = h_{n,k} / \sqrt{\sum_{k=1}^{K-1} h_{n,k}^2}$, where $h_{n,k}$ is the $(n, k)$ entry of $H$. This eigenmatrix $\bar{H}$ can be an input of k-means. Figure 2 shows the pseudocode of this process.

2.4 Proposed Algorithm for Optimally Combining Numerical Vectors with Normalized Network Modularity

2.4.1 Spectral Clustering with Numerical Values and a Network

We describe our proposed algorithm for combining two data sources: numerical vectors and a given weighted network. We first set the cost (the objective function) of clustering numerical vectors as follows:

$$J_{num}(\tilde{X}, Z; \mu) = \frac{1}{N} \sum_{k=1}^{K} \sum_{i \in \mathcal{Z}_k} \left( 1 - \tilde{x}_i^T \mu_k \right),$$

where the cluster representative of cluster $k$ is given as follows:

$$\mu_k = \frac{1}{|\mathcal{Z}_k|} \sum_{j \in \mathcal{Z}_k} \tilde{x}_j.$$

The cost of k-means can be rewritten as follows:

$$J_{num}(\tilde{X}, Z) = 1 - \frac{1}{N} \sum_{k=1}^{K} \sum_{i \in \mathcal{Z}_k} \tilde{x}_i^T \mu_k = 1 - \frac{1}{N} \sum_{k=1}^{K} \sum_{i \in \mathcal{Z}_k} \tilde{x}_i^T \frac{1}{|\mathcal{Z}_k|} \sum_{j \in \mathcal{Z}_k} \tilde{x}_j = 1 - \frac{1}{N} \sum_{k=1}^{K} \frac{1}{|\mathcal{Z}_k|} \sum_{i \in \mathcal{Z}_k} \sum_{j \in \mathcal{Z}_k} \tilde{x}_i^T \tilde{x}_j,$$

and it can then be written as a trace optimization problem:

$$J_{num}(\tilde{X}, Z) = 1 - \mathrm{tr}\left( (Z^T Z)^{-1} Z^T \left( \frac{1}{N} Y \right) Z \right). \qquad (6)$$
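The step from the k-means cost to its trace form relies on one identity: the summed inner products between unit-normalized vectors and their cluster means equal $\mathrm{tr}\left((Z^T Z)^{-1} Z^T Y Z\right)$. The sketch below checks this numerically. It is a hedged illustration: it assumes the per-point cost is one minus the inner product with the cluster mean, stores vectors as rows (the text stores them as columns), and requires every cluster to be non-empty. The function names are hypothetical.

```python
import numpy as np


def cost_direct(X, labels, K):
    """Mean of (1 - x_i . mu_k) over all points, using cluster means."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit-length rows
    per_point = np.empty(len(Xn))
    for k in range(K):
        members = labels == k
        mu = Xn[members].mean(axis=0)                    # cluster representative
        per_point[members] = 1.0 - Xn[members] @ mu
    return per_point.mean()


def cost_trace(X, labels, K):
    """Same cost written as 1 - tr((Z^T Z)^{-1} Z^T (Y / N) Z)."""
    N = X.shape[0]
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Y = Xn @ Xn.T                                        # Gram matrix of unit vectors
    Z = np.eye(K)[labels]                                # N x K 0/1 indicator matrix
    V = Z.T @ Z                                          # diagonal matrix of cluster sizes
    return 1.0 - np.trace(np.linalg.solve(V, Z.T @ (Y / N) @ Z))
```

Both functions should agree to rounding error for any labelling with non-empty clusters, which is what makes the clustering cost expressible through the indicator matrix alone.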

We then combine the cost of clustering numerical vectors, which is shown in Eq. (6), with the cost of the normalized network modularity, which is given in Eq. (3), using $\omega$ for balancing the two costs:

$$J_{total}(\tilde{X}, W, Z) = \omega J_{net}(W, Z) + (1 - \omega) J_{num}(\tilde{X}, Z)$$
$$= \mathrm{tr}\left\{ (Z^T Z)^{-1} Z^T \left( \omega \frac{N}{L} \left( \frac{\tilde{D}}{L} - W \right) - \frac{1 - \omega}{N} Y \right) Z \right\} + (1 - \omega)$$
$$= \mathrm{tr}\left\{ \bar{Z}^T \left( \omega \frac{N}{L^2} \tilde{D} - \omega \frac{N}{L} W - \frac{1 - \omega}{N} Y \right) \bar{Z} \right\} + (1 - \omega).$$

Finding a set of clusters minimizing the integrated cost is also an NP-hard problem, so we solve this problem by relaxing the discrete cluster indicators into real-valued ones. The constant term $(1 - \omega)$ does not depend on $Z$, so we have the following optimization problem:

$$\text{minimize} \quad \mathrm{tr}\left\{ \bar{Z}^T \left( \omega \frac{N}{L^2} \tilde{D} - \omega \frac{N}{L} W - \frac{1 - \omega}{N} Y \right) \bar{Z} \right\} \qquad \text{subject to} \quad \bar{Z}^T \bar{Z} = I_K. \qquad (7)$$

This optimization problem can also be turned by Lagrange multipliers into the following eigenvalue problem for the total cost:

$$M_\omega \bar{Z} = \bar{Z} \Lambda, \qquad (8)$$

where $M_\omega = \omega \frac{N}{L^2} \tilde{D} - \omega \frac{N}{L} W - \frac{1 - \omega}{N} Y$.

Once we have the above eigenvalue problem, we can cluster in the same manner as in spectral graph clustering. That is, given the $N \times N$ matrix $M_\omega$, we obtain the resultant $K - 1$ eigenvectors with the minimum eigenvalues to generate the $N \times (K - 1)$ matrix $S = (s_1, \ldots, s_{K-1})$, where $s_i$ is the $i$-th of the selected eigenvectors. We then use the k-means clustering algorithm after normalizing $S$ into $\bar{S}$, the $N \times (K - 1)$ matrix whose $(n, k)$ entry $\bar{s}_{n,k}$ satisfies $\bar{s}_{n,k} = s_{n,k} / \sqrt{\sum_{k=1}^{K-1} s_{n,k}^2}$, where $s_{n,k}$ is the $(n, k)$ entry of $S$.

As we saw in spectral graph clustering, the constraint given in Eq. (7) can be modified into another form. For example, we can use $\bar{Z}^T D \bar{Z} = I_K$, which was used for normalized cut in graph partitioning. In this case, $M_\omega$ is given as follows:

$$M_\omega = D^{-1/2} \left( \omega \frac{N}{L^2} \tilde{D} - \omega \frac{N}{L} W - \frac{1 - \omega}{N} Y \right) D^{-1/2}.$$

We used this constraint in our experiments, since normalized cut is more often used in graph partitioning than ratio cut. In addition, White and Smith [] also used this modification when they applied their spectral graph clustering with the original network modularity to real-world datasets.

2.4.2 Estimating $\omega$

The spectral space which is generated by combining the two data sources depends on the parameter $\omega$, meaning that the choice of $\omega$ will heavily affect the clustering result. So we propose a method to estimate the optimal value of $\omega$ from the two given data sources. In this method, varying $\omega$ from zero to one, we choose the $\omega$ which gives the minimum total cost, $J_{total}$. Figure 3 shows the pseudocode of the whole procedure of our proposed algorithm.

    Input: $X$, $W$, $K$, $Z^{(0)}$, $\mu^{(0)}$
    Output: $Z$
      $\tilde{X} \leftarrow X / \sqrt{X^T X}$
      $Y \leftarrow \tilde{X}^T \tilde{X}$
      Compute $D$ and $\tilde{D}$ from $W$
      for $\omega = 0$ to $1$ do
        $M_\omega \leftarrow D^{-1/2} \left( \omega \frac{N}{L^2} \tilde{D} - \omega \frac{N}{L} W - \frac{1 - \omega}{N} Y \right) D^{-1/2}$
        Compute $S$ of $M_\omega$
        Normalize $S$ into $\bar{S}$
        $[Z_\omega, \mu, J_\omega] \leftarrow$ k-means($\bar{S}$, $K$, $Z^{(0)}$, $\mu^{(0)}$)
      end for
      $\omega_{min} \leftarrow \arg\min_\omega J_\omega$
      $Z \leftarrow Z_{\omega_{min}}$

Figure 3: Pseudocode of the proposed algorithm.

3. EXPERIMENTS

3.1 Dataset 1: Synthetic Numerical Vectors and Synthetic Random Network

3.1.1 Data

Synthetic Numerical Vectors: We first assumed that numerical vectors are randomly generated according to a mixture of von Mises-Fisher distributions [] in which each component corresponds to a cluster:

$$p(x) = \sum_{k=1}^{K} \alpha_k \, c_p(\kappa) \, e^{\kappa \mu_k^T x},$$

where $\alpha_k$ $(k = 1, \ldots, K)$ are mixture proportions satisfying $\sum_{k=1}^{K} \alpha_k = 1$ and $\alpha_k \geq 0$, and $c_p(\kappa)$ is given as follows:

$$c_p(\kappa) = \frac{\kappa^{p/2 - 1}}{(2\pi)^{p/2} \, I_{p/2 - 1}(\kappa)},$$

where $I_p(\kappa)$ is the modified Bessel function of the first kind of order $p$, which is given as follows:

$$I_p(\kappa) = \frac{1}{\pi} \int_0^{\pi} e^{\kappa \cos t} \cos(pt) \, dt.$$

Interested readers should refer to [4] for the method of randomly generating numerical values according to this distribution. We used the following settings: $K = 4$, $p = 3$, $\alpha_k = 0.25$ $(k = 1, \ldots, 4)$, $\mu_1 = (,, )^T$, $\mu_2 = (,, 5)^T$, $\mu_3 = ( 5, 3, 5)^T$, $\mu_4 = ( 5, 3, 5)^T$. Another parameter, $\kappa$, which is called the concentration parameter, behaves like the inverse of the variance of numerical vectors in a cluster. Thus numerical vectors which are generated with a larger $\kappa$ will be more concentrated on the cluster representatives and be more easily clustered. We therefore changed $\kappa$ in our experiments to check the effect of $\kappa$ on the clustering results.

Synthetic Random Network: To generate a network, we first assigned a cluster label to each node. We generated a network in which the number of nodes and the number of edges in each cluster are kept the same for all clusters. Thus the generated network was defined by three parameters: $N_v$ (the number of nodes in a cluster), $N_e$ (the total number of edges) and $N_{in}$ (the number of edges in a cluster), which is equal to $L(\mathcal{Z}_k, \mathcal{Z}_k)$ $(k = 1, \ldots, K)$. We then chose a value of $N_v$ to generate a set of nodes, and randomly chose node pairs to be connected by edges so as to satisfy the values of $N_e$ and $N_{in}$. In Dataset 1, we used $N_v =$ (meaning that the total number of nodes is 4) and $N_e =$ , 6, while $N_{in}$ was changed to check how the clustering performance is affected by $N_{in}$.

[Figure 4: Dataset 1: The distribution of eigenvectors $\bar{s}_n$ $(n = 1, \ldots, N)$, shown by four different symbols corresponding to four different clusters, at $N_{in}$ = 5, $\kappa$ = 5, (a) $\omega$ = and $J$ = 93, (b) $\omega$ = 3 and $J$ = 538 and (c) $\omega$ = and $J$ = 89.]

3.1.2 Performance Results

To evaluate our clustering results using true cluster labels, we used normalized mutual information (NMI) [8], which has been used in many applications to measure the performance of clustering methods [4]. A larger NMI value indicates a better clustering result. Interested readers should see [8] for the details of NMI.

We first checked the distribution of $\bar{s}_n$ $(n = 1, \ldots, N)$, i.e. the resultant eigenvectors of the eigenvalue problem in Eq. (8), when we fixed $\kappa$ = 5 and $N_{in}$ = 5. Figure 4 shows the three distributions of eigenvectors, plotted with four different symbols corresponding to the four different clusters, when $\omega$ was at , 3 and . This figure shows that the distribution of eigenvectors changes with varying $\omega$. In particular, we can see that when $\omega$ = 3, the eigenvectors were separated most clearly among the three cases. In fact, when $\omega$ = , 3 and , $J$ was 93, 538 and 89, respectively, indicating that the eigenvectors at $\omega$ = 3 were the most concentrated on the cluster representatives among the three cases of $\omega$.

We then checked the effectiveness of our method of optimally combining two different data sources by using the cost $J$ and NMI when $\omega$ is changed. We used each dataset of all combinations of $\kappa$ = , 5 and 5 and $N_{in}$ = 5, 8 and 3. Figure 5 shows $J$ for all these cases, and Figure 6 shows NMI for all these cases. From these figures, first of all, we can see that with increasing $N_{in}$ ((a) to (b) to (c)), the modularity became higher, resulting in a decreasing cost ($J$) and an increasing NMI. This is clearer if we focus on $\omega$ of one. On the other hand, we can see that with increasing $\kappa$, numerical vectors were more concentrated on their cluster representatives, resulting in a decreasing cost ($J$) and an increasing NMI. This is clearer if we focus on $\omega$ of zero.

In all 18 curves in Figures 5 and 6, the best value was obtained when $0 < \omega < 1$, indicating that combining the two data sources improved on the cost $J$ and NMI of $\omega = 0$ and $\omega = 1$. In particular, we emphasize that the value of $\omega$ at the minimum $J$ was mostly consistent with that at the maximum NMI. For example, when $N_{in}$ = 5 and $\kappa$ = , $J$ was minimized at $\omega$ = 5 and the maximum NMI was at $\omega$ = 4. Similarly, at , the minimum $J$ was at $\omega$ = 4, where the maximum NMI was obtained. This was also true of $N_{in}$ = 8 and $\kappa$ = , where $\omega$ = 5 provided the best value in both NMI and $J$. These results imply that our method of selecting $\omega$ worked effectively for optimally combining numerical vectors with a given network.

An interesting finding is that in a balanced case, in which the cost (and NMI) of $\omega = 0$ is almost the same as that of $\omega = 1$, the curve became concave (and convex for NMI), indicating that the minimum cost (and the maximum NMI) can be easily found. On the other hand, in an unbalanced case such as $\kappa$ = and $N_{in}$ = 3 in (c), the curve was not necessarily concave (and convex for NMI), meaning that one data source (i.e. the network information) is much more informative for clustering than the other (i.e. the numerical vectors). This is natural, because numerical vectors at $\kappa$ = were widely distributed, not concentrating on the cluster representatives, and it would be difficult to cluster by them, compared with the case of $N_{in}$ = 3, where clustering could be relatively easy by using the network information only. Thus in such a case, our method might choose $\omega$ = (or $\omega$ = ), but this would be the right selection, since the NMI at $\omega$ = (or $\omega$ = ) must be the maximum or very close to the maximum in this case.

Using the same parameter settings as in Figure 6, we finally checked NMI by repeating the random generation of datasets 4 times and averaging the performance over them. Figure 7 shows the average NMI obtained by this experiment. The curves in this figure were almost the same as those in Figure 6, implying that our results were very stable. Overall, we can say that our method optimally combined the two synthetic data sources.
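The whole procedure, including the choice of the weight by minimum cost examined above, can be sketched in a few lines of NumPy. This is an illustrative reading rather than the authors' implementation, and several choices are assumptions on our part: vectors are stored as rows, the combined matrix uses the degree-normalized form, the $K - 1$ eigenvectors with the smallest eigenvalues are kept, the per-point k-means cost is one minus the inner product with the cluster mean, and initial centers come from a deterministic farthest-first sweep instead of being passed in. All function names are hypothetical.

```python
import numpy as np


def unit_rows(A):
    """Scale each row to unit Euclidean norm; all-zero rows stay zero."""
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    return A / np.where(norms > 0, norms, 1.0)


def cosine_kmeans(S, K, n_iter=100):
    """k-means on unit-norm rows; returns labels and mean(1 - s_n . mu_k)."""
    centers = [S[0]]                               # farthest-first initialisation (assumption)
    for _ in range(1, K):
        closest = np.max(S @ np.array(centers).T, axis=1)
        centers.append(S[np.argmin(closest)])
    mu = np.array(centers)
    for _ in range(n_iter):
        labels = np.argmax(S @ mu.T, axis=1)       # most similar representative
        new_mu = mu.copy()
        for k in range(K):
            if np.any(labels == k):                # keep the old center if a cluster empties
                new_mu[k] = S[labels == k].mean(axis=0)
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    labels = np.argmax(S @ mu.T, axis=1)
    cost = float(np.mean(1.0 - np.sum(S * mu[labels], axis=1)))
    return labels, cost


def combine_and_cluster(X, W, K, omegas):
    """Return (best_omega, labels, cost) over the candidate weights."""
    X = np.asarray(X, dtype=float)
    W = np.asarray(W, dtype=float)
    N = X.shape[0]
    Xn = unit_rows(X)
    Y = Xn @ Xn.T                                  # Gram matrix of unit vectors
    d = W.sum(axis=1)                              # node degrees
    L = d.sum()                                    # total weight
    D_tilde = np.outer(d, d)
    inv_sqrt_d = 1.0 / np.sqrt(d)                  # assumes every node has positive degree
    best = None
    for omega in omegas:
        A = omega * (N / L) * (D_tilde / L - W) - (1.0 - omega) * Y / N
        M = inv_sqrt_d[:, None] * A * inv_sqrt_d[None, :]
        _, vecs = np.linalg.eigh(M)                # eigenvalues in ascending order
        S = unit_rows(vecs[:, :K - 1])             # K - 1 smallest eigenvalues (assumption)
        labels, cost = cosine_kmeans(S, K)
        if best is None or cost < best[2]:
            best = (float(omega), labels, cost)
    return best
```

On real data the candidate weights would typically be a grid over the unit interval, for example `np.linspace(0.0, 1.0, 11)`. A dense eigendecomposition is used here for brevity and costs $O(N^3)$ per candidate weight, so a sparse or iterative eigensolver would be the natural substitute for large networks.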

7 Cost (J) (a) κ = (b) κ = (c) κ = Figure 5: Dataset : J at N in = (a) 5, (b) 8 an (c) NMI 6 4 (a) κ = (b) κ = (c) κ = Figure 6: Dataset : NMI at N in = (a) 5, (b) 8 an (c) 3 3 Dataset : Synthetic Numerical Vectors an Real Scale-free Networ 3 Data Synthetic Numerical Vectors: Numerical values of a gene, such as gene expression, are usually measure experimentally, meaning that these values are noisy, comparing to the networ which we can erive from a curate atabase in molecular biology Thusinthisexperiment,wegenerate synthetic numerical vectors to chec the performance of our metho of combining two ata sources As we use a real metabolic networ of 636 Saccharomyces cerevisiae genes (See below the way to generate this networ), we first assigne a cluster label to each noe of the networ in the following way: ) We fixe the number of clusters an repeate running spectral graph clustering by White an Smith [], times on the real metabolic networ, measuring the original networ moularity on the resultant clusters at each time ) Out of the, runs, we then chose the clusters with the highest networ moularity, an these clusters were use to assign a cluster label We show this fact more clearly in the experiment using Dataset 3 We note that the clusters with the highest moularity canto each noe That is, these clusters were use as stanar ata for evaluation We then generate numerical vectors (corresponing to noes in the metabolic networ) in each cluster, assuming that they can be generate accoring to a component of the von Mises-Fisher istribution mixture as in Section 3 We use the following settings: K =, p = 5, α = ( =,, ), =(,,,, ) T, =(,,,, ) T, 3 =(,,,, ) T, 4 =(,,,, ) T, 5 =(,,,, ) T, 6 =(,,,, ) T, 7 =(,,,, ) T, 8 =(,,,, ) T, 9 =(,,,, ) T, =(,,,, ) T We change κ to chec how clustering results are affecte by κ Real Metabolic Networ: Metabolism is represente by a irecte graph, calle a metabolic pathway which shows biochemical processes of synthesizing small molecules in a cell Each 
noe is labele by a chemical compoun A irecte ege from a noe, say noe A, to another, say noe B, is a chemical reaction, meaning that the compoun corresponing to noe B is synthesize from that for noe A, an each ege is labele by a (enzyme) gene which catalyzes the corresponing chemical reaction From a metabolic pathnot be obtaine so easily by our metho of =, since in spectral graph clustering, the final solution is obtaine by -means an is always an approximation

way which is stored in the KEGG (Kyoto Encyclopedia of Genes and Genomes) database [10], we generated a network by connecting two genes by an undirected edge if they catalyze two neighboring chemical reactions in the metabolic pathway. The generated network, which we call the metabolic network, had 636 nodes and 3,4 edges. Most importantly, this network has two important network features: the scale-free property [1] as well as hierarchical network modularity [14].

Figure 7: Dataset 1: Average NMI over 4 replicates at N_in = (a) 5, (b) 8 and (c) 3.

Figure 8: Dataset 2: (a) Cost (J) and (b) NMI.

3.2.2 Performance Results

We used NMI again to validate the clustering results obtained by our method, with the standard data for evaluation. Figure 8 (a) shows the total cost J as the weight parameter changes, and Figure 8 (b) shows NMI as the weight parameter changes. The results we can derive from these figures were mostly consistent with those obtained for Dataset 1. For example, at κ = and 3, where the distribution of numerical vectors was relatively concentrated on their cluster representatives, the curves could be concave for J (and convex for NMI), and the value of the minimum cost and the value of the maximum NMI were easily found. Furthermore, they were mostly consistent with each other. In addition, at κ = , where numerical vectors were distributed broadly, the curve was not necessarily strongly concave for the cost J (and convex for NMI), implying that in this case the network information would be more useful for clustering than the numerical vectors. In fact, NMI at = reached around 9, a very high value, whereas that at = stays at less than 6.

Figure 9 shows the clustering results obtained by our method at three different values of the weight parameter when κ = . Ten different colors correspond to ten different clusters. Figure  shows the true cluster labels we used for both evaluation and generating numerical vectors. From these figures, we can easily see that the clustering result at =5 was the most similar to the true cluster labels among the three networks in Figure 9. For example, the center part of the true clusters was colored orange and dark blue, and this was consistent with the network at =5 in Figure 9 (b). On the other hand, this center part has many different colors at = in Figure 9 (a) and is colored dark blue only at = in Figure 9 (c). From these results, we can say that our method worked effectively for Dataset 2, which contains a real metabolic network with the scale-free property as well as unbalanced cluster sizes.

3.3 Dataset 3: Real Numerical Vectors and Real Scale-free Network

3.3.1 Data

Real Numerical Vectors: Microarray Gene Expression. We used a microarray gene expression dataset [9] which has 3 expression profiles (numerical vectors) with only around missing values. This dataset has often been used in the literature of microarray expression analysis [16, 23]. All missing values were interpolated by using the k-nearest least square method of [19].

Real Metabolic Network: We used the real metabolic network in Dataset 2.

3.3.2 Performance Results

In order to evaluate the clustering result of our method, we used ten categories in metabolism which were stored in the KEGG database. At least one of the ten categories could be assigned to each metabolic gene. We note that these ten categories are not defined directly from the metabolic pathways in the KEGG database, and so these ten clusters cannot be identified by using the metabolic network in our
experiment only. We then used NMI again to validate the performance of our clustering result. Figure  (a) shows the cost J of our method as the weight parameter changes. Figure  (b) shows NMI of our method as the weight parameter changes. The curves in these figures were not strongly concave for J (and convex for NMI), meaning that the two data sources are heavily unbalanced, as pointed out in the experimental results of Dataset 1 and Dataset 2. In fact, by looking carefully, we can see that the minimum cost was at around =8 to 9, where the best NMI was obtained, and the difference between the best NMI and NMI at = was insignificant. Figure  (b) also shows the average result over runs of each of the two spectral graph partitioning methods using normalized cut and ratio cut. We note that these methods use graph information only, and their results are shown as dotted straight lines in the figure. Our method significantly outperformed these two methods even at = , implying that normalized network modularity is more effective for clustering than normalized cut and ratio cut.

Figure 9: Dataset 2: Clustered metabolic networks at κ = and = (a), (b) 5 and (c).

Figure : Dataset 3: (a) Cost (J) and (b) NMI. Legend: Proposed method, Normalized cut, Ratio cut.

Figure : Dataset 2: True cluster labels.

We repeated the same experiment, replacing the above microarray dataset with datasets derived from a large microarray database [5], and found that the results were almost the same as in the above case. From this result, we can say that the metabolic network derived from a curated database is more reliable in clustering metabolic genes than microarray expression. This result is consistent with the existing understanding of data reliability in molecular biology.

For Dataset 3, the weight parameter at the minimum cost took a value very close to one, around which NMI was at or very close to its maximum. Thus we think that our method succeeded for this dataset. As in Dataset 2, our method worked favorably for optimally combining the real metabolic network with more reliable numerical vectors. Thus if we have more reliable numerical vectors on genes,
the clustering result would be improved much more. Overall, we can say that our method itself is very promising.

4 CONCLUDING REMARKS

We have presented a new spectral approach to clustering numerical vectors with a network. The focus of our method was on network modularity, a key network property in clustering, and we defined a new criterion, normalized network modularity, for combining the two different data sources in the framework of spectral clustering. A significant advantage of our method is that we can optimize the weight parameter for balancing the two data sources, i.e. numerical vectors and a network. Also, our algorithm is time-efficient: practical computation time was less than one minute for any dataset in our experiments. Experimental results obtained by using three different types of datasets showed that our method worked favorably for optimally combining numerical vectors and a network.

In this paper, we have focused on two data sources, i.e. numerical vectors and a network, both in methodology and experiments. Our method can be easily extended to a more general framework for combining multiple heterogeneous data sources for clustering, and by doing so, the resultant clustering performance might be improved by adding another different data source. Thus interesting future work is to apply the proposed method to a variety of real-world datasets to characterize more systematically the conditions under which the proposed method works well. This would be performed with not only two different data sources but also more than two heterogeneous data sources, say numerical vectors, a network and another different type of dataset. It would also be interesting to develop, in the context of clustering, a general criterion which can cover many distances for numerical values as well as graph partitioning criteria such as normalized cut, ratio cut and network modularity.

5 ACKNOWLEDGMENTS

This work is supported in part by Bioinformatics Education Program Education and Research Organization for Genome Information Science and Kyoto University 21st Century COE Program
Knowledge Information Infrastructure for Genome Science with support from MEXT, Japan.

6 REFERENCES

[1] A.-L. Barabási and A. Reka. Emergence of scaling in random networks. Science, 86:59 5, 999
[2] S. Basu, M. Bilenko, and R. J. Mooney. A probabilistic framework for semi-supervised clustering. In KDD, pages 59 68, August 4
[3] I. S. Dhillon, Y. Guan, and B. Kulis. Kernel k-means, spectral clustering and normalized cuts. In KDD, pages , 4
[4] I. S. Dhillon and S. Sra. Modeling data using directional distributions. Technical Report TR-6-3, University of Texas, Dept. of Computer Sciences, 3
[5] R. Edgar, M. Domrachev, and A. E. Lash. Gene expression omnibus: NCBI gene expression and hybridization array data repository. NAR, 3():7,
[6] R. Guimera and L. A. Nunes Amaral. Functional cartography of complex metabolic networks. Nature, 433(78):895 9, 5
[7] R. Guimera, M. Sales-Pardo, and L. A. N. Amaral. Modularity from fluctuations in random graphs and complex networks. Phys. Rev. E, 7:5, 4
[8] L. Hagen and A. B. Kahng. New spectral methods for ratio cut partitioning and clustering. IEEE TCAD, :74 85, 99
[9] T. R. Hughes et al. Functional discovery via a compendium of expression profiles. Cell, ():9 6,
[10] M. Kanehisa et al. From genomics to chemical genomics: new developments in KEGG. NAR, 34:D , 6
[11] B. Kulis, S. Basu, I. Dhillon, and R. J. Mooney. Semi-supervised graph clustering: A kernel approach. In ICML, pages , 5
[12] K. V. Mardia and P. E. Jupp. Directional Statistics. John Wiley & Sons, second edition,
[13] M. E. J. Newman and M. Girvan. Finding and evaluating community structure in networks. Phys. Rev. E, 69:63, 4
[14] E. Ravasz et al. Hierarchical organization of modularity in metabolic networks. Science, 97(5589):55 555,
[15] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE PAMI, (8):888 95,
[16] M. Shiga, I. Takigawa and H. Mamitsuka. Annotating gene function by combining expression data with a modular gene network. To appear in ISMB, 7
[17] C. Song, S. Havlin, and H. A. Makse. Self-similarity of complex networks. Nature, 433:39 395, 5
[18] A. Strehl and J. Ghosh. Relationship-based clustering and visualization for high-dimensional data mining. INFORMS Journal on Computing, 5():8 3, 3
[19] O. Troyanskaya et al. Missing value estimation methods for DNA microarrays. Bioinformatics, 7(6):5 55,
[20] K. Wagstaff and C. Cardie. Clustering with instance-level constraints. In ICML, pages 3,
[21] D. J. Watts and S. H. Strogatz. Collective dynamics of small-world networks. Nature, 393:44 44, 998
[22] S. White and P. Smyth. A spectral clustering approach to finding communities in graphs. In SDM, pages 76 84, 5
[23] L. F. Wu et al. Large-scale prediction of Saccharomyces cerevisiae gene function using overlapping transcriptional clusters. Nat. Genet., 3(3):55 65,
[24] S. Zhong and J. Ghosh. A unified framework for model-based clustering. JMLR, 4: 37, 3
[25] S. Zhong and J. Ghosh. Generative model-based document clustering: A comparative study. KAIS, 8(3): , 5
[26] X. Zhou, M. C. Kao, and W. H. Wong. Transitive functional annotation by shortest-path analysis of gene expression data. PNAS, 99(): ,
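The performance results above evaluate each clustering with NMI and pick the weight parameter at the minimum total cost J. The following Python sketch illustrates these two steps only; it is not the authors' implementation. It assumes the normalization I(X;Y) / sqrt(H(X) H(Y)) for NMI, which may differ from the exact definition used in the experiments, and the function names `nmi` and `select_weight` are hypothetical.

```python
import math
from collections import Counter


def nmi(labels_true, labels_pred):
    """Normalized mutual information, assumed here to be
    I(X;Y) / sqrt(H(X) * H(Y)) with natural logarithms."""
    n = len(labels_true)
    if n == 0 or n != len(labels_pred):
        raise ValueError("label lists must be non-empty and of equal length")

    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    count_joint = Counter(zip(labels_true, labels_pred))

    def entropy(counts):
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    h_true = entropy(count_true)
    h_pred = entropy(count_pred)
    if h_true == 0.0 or h_pred == 0.0:
        # Degenerate case: at least one labeling has a single cluster.
        return 1.0 if h_true == h_pred else 0.0

    # Mutual information from the joint and marginal label counts.
    mutual_info = sum(
        (c / n) * math.log(c * n / (count_true[a] * count_pred[b]))
        for (a, b), c in count_joint.items()
    )
    return mutual_info / math.sqrt(h_true * h_pred)


def select_weight(weights, costs):
    """Return the candidate weight whose total cost J is smallest.

    `costs[i]` is assumed to be the total cost obtained by running the
    clustering with `weights[i]`; computing J itself is not shown here.
    """
    if len(weights) == 0 or len(weights) != len(costs):
        raise ValueError("weights and costs must be non-empty and aligned")
    best = min(range(len(weights)), key=lambda i: costs[i])
    return weights[best]
```

In use, one would run the clustering once per candidate weight, record the total cost for each run, call `select_weight` on the two lists, and then report `nmi` between the true labels and the labels produced at the selected weight. NMI is invariant to how cluster labels are named, so a clustering that matches the true labels up to renaming scores 1.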
