Context-Specific Bayesian Clustering for Gene Expression Data


Yoseph Barash, School of Computer Science & Engineering, Hebrew University, Jerusalem, 91904, Israel
Nir Friedman, School of Computer Science & Engineering, Hebrew University, Jerusalem, 91904, Israel

Abstract

The recent growth in genomic data and measurements of genome-wide expression patterns allows us to apply computational tools to examine gene regulation by transcription factors. In this work, we present a class of mathematical models that help in understanding the connections between transcription factors and functional classes of genes based on genetic and genomic data. Such a model represents the joint distribution of transcription factor binding sites and of expression levels of a gene in a unified probabilistic model. Learning a combined probability model of binding sites and expression patterns enables us to improve the clustering of the genes based on the discovery of putative binding sites and to detect which binding sites and experiments best characterize a cluster. To learn such models from data, we introduce a new search method that rapidly learns a model according to a Bayesian score. We evaluate our method on synthetic data as well as on real-life data and analyze the biological insights it provides. Finally, we demonstrate the applicability of the method to other data analysis problems in gene expression data.

1 Introduction

A central goal of molecular biology is to understand the regulation of protein synthesis. With the advent of genome sequencing projects, we have access to DNA sequences of the promoter regions that contain the binding sites of transcription factors that regulate gene expression. In addition, the development of microarrays allows researchers to measure the abundance of thousands of mRNA targets simultaneously, providing a genomic viewpoint on gene expression. As a consequence, this technology facilitates new experimental approaches for understanding gene expression and regulation (Iyer et al. 1999, Spellman et al. 1998). The combination of these two important data sources can lead to a better understanding of gene regulation (Bittner et al. 1999, Brazma & Vilo 2000). The main biological hypothesis underlying most of these analyses is: genes with a common functional role have similar expression patterns across different experiments. This similarity of expression patterns is due to co-regulation of genes in the same functional group by specific transcription factors. Clearly, this assumption is only a first-order approximation of biological reality. There are gene functions for which this assumption definitely does not hold, and there are co-expressed genes that are not co-regulated. Nonetheless, this assumption is useful in finding the strong signals in the data.

Based on the above assumption, one can cluster genes by their expression levels, and then search for short DNA strings that appear in significant over-abundance in the promoter regions of these genes (Roth et al. 1998, Tavazoie et al. 1999, Vilo et al. 2000). Such an approach can discover new binding sites in promoter regions. Our aim here is complementary to this approach. Instead of discovering new binding sites, we focus on characterizing groups of genes based on their expression levels in different experiments and the presence of putative binding sites within their promoter regions. The biological hypothesis we described suggests that genes within a functional group will be similar with respect to both types of attributes. We treat expression level measurements and information on promoter binding sites in a symmetric fashion, and cluster genes based on both types of data. In doing so, our method characterizes the attributes that distinguish each cluster.

Due to the stochastic nature of biological processes and experimental artifacts, gene expression measurements are inherently noisy. In addition, identification of putative binding sites is also noisy, and can suffer from both false positive and false negative errors. All of this indicates that using a probabilistic model in which we treat both expression and pattern identification as random variables might lead to a better understanding of the biological mechanism as well as improve gene functional characterization and transcription site identification.

Using this probabilistic approach, we develop a class of clustering models that cluster genes based on random variables of two types. Random variables of the first type describe the expression level of the gene, or more precisely its mRNA transcript, in an experiment (microarray hybridization). Each experiment is denoted by a different random variable whose value is the expression level of the gene in that particular experiment. Random variables of the second type describe occurrences of putative binding sites in the promoter region of the genes. Again, each binding site is denoted by a random variable, whose value is the number of times the binding site was detected in the gene's promoter region. Our method clusters genes with similar expression patterns and promoter regions. In addition, the learned model provides insight on the regulation of genes within each cluster.

The key features of our approach are: (1) automatic detection of the number of clusters; (2) automatic detection of random variables that are irrelevant to the clusters; (3) robust clustering in the presence of many such random variables; and (4) a context-dependent representation that describes which clusters each attribute depends on. This allows us to discover the attributes (random variables) that characterize each cluster and distinguish it from the rest. We learn these cluster models using a Bayesian approach that uses structural EM (Friedman 1997, Friedman 1998), an efficient search method over different models. We evaluate the resulting method on synthetic data, and apply it to real-life data. Finally, we also demonstrate the applicability and generality of the method to other problems and data sources by introducing into the model data from phylogenetic profiling and by clustering experiments by their gene expression profiles.

In Section 2 we introduce the class of probabilistic models that we call Context-Specific Clustering models. In Section 3 we discuss how to score such models based on data. In Section 4 we describe our approach for finding a high-scoring clustering model. In Section 5 we evaluate the learning procedure on synthetic and real-life data. We conclude with a discussion of related work and possible extensions in Section 6.

2 Context-Specific Clustering

In this section we describe the class of probabilistic models we aim to learn from data. We develop the models in a sequence of steps, starting from a fairly well known model for Bayesian clustering, and refining the representation to explicitly capture the structures we want to learn. We stress that at this stage we are focusing on what can be represented by the class of models, and we examine how to learn them in subsequent sections.

2.1 Naive Bayesian Clustering

Let X_1, ..., X_N be random variables. In our main application, these random variables denote the attributes of a particular gene: the expression level of this gene in each of the experiments, and the numbers of occurrences of each binding site in the promoter region. Suppose that we receive a dataset D that consists of M joint instances of the random variables. The m-th instance is a joint assignment x_1[m], ..., x_N[m] to X_1, ..., X_N. In our application, instances correspond to genes: each gene is described by the values of the random variables. In modeling such data we assume that there is an underlying joint distribution P(X_1, ..., X_N) from which the training instances were sampled. The estimation task is to approximate this joint distribution based on the data set D. Such an estimate can help us understand the interactions between the variables.

A typical approach for estimating such a joint distribution is to define a probabilistic model that defines a set of distributions that can be described in a parametric form, and then find the particular parameters for the model that best fit the data in some sense. A simple model that is often used in data analysis is the naive Bayes model. In this model we assume that there is an unobserved random variable C that takes values 1, ..., K, and describes which cluster the example belongs to. We then assume that if we know the value of C, all the observed variables become independent of one another. That is, the form of the distribution is:

P(X_1, ..., X_N) = \sum_k P(C = k) P(X_1 | C = k) \cdots P(X_N | C = k)    (1)

In other words, we estimate a mixture of product distributions. One must also bear in mind that such models are not necessarily a representation of real biological structure but rather a mathematical model that can give us insights into the biological connections between variables.

The independence assumptions we make are conditional ones. For example, we assume that given the model, the genes are independent. That is, after we know the model, observing the expression levels of a single gene does not help predict better the expression levels of another gene. Similarly, we assume that expression levels of the same gene in different conditions are independent given the cluster the gene belongs to. This assumption states that the cluster captures the first-order description of the gene's behavior, and we treat (in the model) all other fluctuations as noise that is independent in each measurement. We attempt to be precise and explicit about the independence assumptions we make. However, we note that most clustering approaches we know of treat (explicitly or implicitly) genes as being independent of each other, and quite often also treat different measurements of the same gene as independent observations of the cluster.

The naive Bayes model is attractive for several reasons. First, from an estimation point of view we need to estimate relatively few parameters: the mixture coefficients P(C = k), and the parameters

of the conditional distributions P(X_i | C = k). Second, the estimated model can be interpreted as modeling the data by K clusters (one for each value k = 1, ..., K), such that the distributions of the different variables within each cluster are independent. Thus, dependencies between the observed variables are represented by the cluster variable. Finally, this model allows us to use fairly efficient learning algorithms, such as expectation maximization (EM) (Dempster et al. 1977).

The distribution form in Eq. (1) specifies the global structure of the naive Bayesian distribution. In addition, we also have to specify how to represent the conditional distributions. For this purpose we use parametric families. There are several families of conditional distributions we can use for modeling P(X_i | C = k). In this paper, we focus on two such families. If X_i is a discrete variable that takes a finite number of values (e.g., a variable that denotes the number of binding sites in a promoter region), we represent the conditional probability as a multinomial distribution P(X_i | C = k) ~ Multinomial({\theta_{x_i|k} : x_i \in Val(X_i)}): for each value x_i of X_i we have a parameter \theta_{x_i|k} that denotes the probability that X_i = x_i when C = k. These parameters must be non-negative, and satisfy \sum_{x_i} \theta_{x_i|k} = 1 for each k. If X_i is a continuous variable (e.g., a variable that denotes the expression level of a gene in a particular experiment), we use a Gaussian distribution P(X_i | C = k) ~ N(\mu_{X_i|k}, \sigma^2_{X_i|k}), such that

P(x_i | C = k) = \frac{1}{\sqrt{2\pi} \sigma_{X_i|k}} \exp\left( -\frac{(x_i - \mu_{X_i|k})^2}{2 \sigma^2_{X_i|k}} \right).

We use the Gaussian model in this situation for two reasons. First, as usual, the Gaussian distribution is one of the simplest continuous density models and allows efficient estimation. Second, when we use as observations the logarithm of the expression level (or logarithms of ratios of expression between a sample and a common control sample), gene expression has roughly Gaussian noise characteristics. We note, however, that most of the developments in this paper can be achieved with more detailed (and realistic) noise models for gene expression.

Once we have estimated the conditional probabilities, we can compute the probability of an example belonging to a cluster:

P(C = k | x_1, ..., x_N) \propto P(C = k) P(x_1 | C = k) \cdots P(x_N | C = k)

If the clusters are well-separated, then this conditional probability will assign each example to one cluster with high probability. However, it is possible that clusters overlap, and some examples are assigned to several clusters. If we compare the probabilities of two clusters, then

\log \frac{P(C = k | x_1, ..., x_N)}{P(C = k' | x_1, ..., x_N)} = \log \frac{P(C = k)}{P(C = k')} + \sum_i \log \frac{P(x_i | C = k)}{P(x_i | C = k')}    (2)

Thus, we can view the decision boundary between any two clusters as the sum of terms that represent the contribution of each attribute to this decision. The ratio P(x_i | C = k) / P(x_i | C = k') is the relative support that x_i gives to k versus k'.
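As an illustration of Eqs. (1) and (2), the following Python sketch computes the cluster posterior for a naive Bayes model with one Gaussian expression attribute and one multinomial binding-site attribute, working in log space for numerical stability. The sketch is only illustrative: the toy parameters and names are hypothetical and are not taken from the models learned in this paper.

```python
import numpy as np

# Toy naive Bayes clustering model with K = 2 clusters (illustrative parameters only).
log_prior = np.log(np.array([0.6, 0.4]))                  # log P(C = k)
mu = np.array([1.2, -0.8])                                # Gaussian means, one per cluster
sigma = np.array([0.5, 0.7])                              # Gaussian standard deviations
theta_site = np.array([[0.7, 0.2, 0.1],                   # P(site count | C = 1)
                       [0.1, 0.3, 0.6]])                  # P(site count | C = 2)

def log_gaussian(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def cluster_posterior(expr, site_count):
    """P(C = k | expr, site_count): log prior plus per-attribute log-likelihoods."""
    log_joint = (log_prior
                 + log_gaussian(expr, mu, sigma)           # Gaussian expression attribute
                 + np.log(theta_site[:, site_count]))      # multinomial binding-site attribute
    log_joint -= log_joint.max()                           # stabilize before exponentiating
    post = np.exp(log_joint)
    return post / post.sum()

print(cluster_posterior(expr=1.0, site_count=2))           # posterior over the two clusters
```

Taking the difference of two entries of log_joint recovers exactly the per-attribute decomposition of the decision boundary in Eq. (2).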

2.2 Selective Naive Bayesian Models

The naive Bayes model gives all variables equal status. This is a potential source of problems for two reasons. First, some variables should be considered as noise since they have no real interactions with the other variables. Suppose that X_1 is independent from the rest of the variables. By learning K conditional probability models P(X_1 | C = 1), ..., P(X_1 | C = K), we are increasing the variability of the estimated model. Second, since we are dealing with a relatively small number of training examples, if we fail to recognize that X_1 is independent of the rest, the observations of X_1 can bias our choice of clusters. Thus, a combination of many irrelevant variables might lead us to overlook the relevant ones. As a consequence, the learned model discriminates clusters by the values of the irrelevant variables. Such clusters suffer from high variability (because of their noisy character).

If we know that X_1 is independent from the rest, we can use the fact that P(X_1 | C) = P(X_1) and rewrite the model in a simpler form:

P(X_1, ..., X_N) = P(X_1) \sum_k P(C = k) P(X_2 | C = k) \cdots P(X_N | C = k).

This representation of the joint probability requires fewer parameters and thus the estimation of these parameters is more robust. More importantly, the structure of this model explicitly captures the fact that X_1 is independent of the other variables: its distribution does not depend on the cluster variable. Note that in this model, as expected, the value of X_1 does not impact the probability of the class C.

In our biological domain, we expect to see many variables that are independent (or almost independent) of the classification. For example, not all binding sites of transcription factors play an active role in the conditions in which expression levels were measured. Another example is a putative binding site (suggested by some search method or other) that does not correspond to a biological function. Thus, learning that these sites are independent of the measured expression levels is an important aspect of the data analysis process.

Based on this discussion, we want to consider models where several of the variables do not depend on the hidden class. Formally, we can describe these dependencies by specifying a set G \subseteq {X_1, ..., X_N} that represents the set of variables that depend on the cluster variable C. The joint distribution then takes the form

P(X_1, ..., X_N | G) = \left( \prod_{i : X_i \notin G} P(X_i) \right) \sum_k P(C = k) \prod_{i : X_i \in G} P(X_i | C = k)

We note that this class of models is essentially a special subclass of Bayesian networks (Pearl 1988). Similar models were considered for a somewhat different application in supervised learning by Langley and Sage (1994).

We note again that when we compare the posterior probabilities of two clusters, as in Eq. (2), we only need to consider variables that are not independent of C. That is,

\log \frac{P(C = k | x_1, ..., x_N)}{P(C = k' | x_1, ..., x_N)} = \log \frac{P(C = k)}{P(C = k')} + \sum_{i : X_i \in G} \log \frac{P(x_i | C = k)}{P(x_i | C = k')}.

This formally demonstrates the intuition that variables outside of G do not influence the choice of clusters.
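The cancellation of attributes outside G can be made concrete with a small sketch. The code below is ours and purely illustrative (the random tables and the set G are hypothetical); it computes the cluster log-odds for a selective model and shows that only attributes in G contribute.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative selective naive Bayes: K = 3 clusters, N = 4 binary attributes,
# of which only those in G depend on the cluster variable C.
K, N = 3, 4
G = {0, 2}
log_p_dep = np.log(rng.dirichlet([1.0, 1.0], size=(N, K)))  # P(X_i | C = k), shape (N, K, 2)
log_p_ind = np.log(rng.dirichlet([1.0, 1.0], size=N))       # P(X_i) for attributes outside G
log_prior = np.log(np.full(K, 1.0 / K))

def cluster_log_odds(x, k, k2):
    """log P(C = k | x) - log P(C = k2 | x): only attributes in G contribute."""
    total = log_prior[k] - log_prior[k2]
    for i in G:
        total += log_p_dep[i, k, x[i]] - log_p_dep[i, k2, x[i]]
    # attributes outside G would add log_p_ind[i, x[i]] to both terms and cancel
    return total

print(cluster_log_odds([1, 0, 1, 1], k=0, k2=1))
```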

(a) explicit table representation

          X=0    X=1    X=2    X=3
  C=1     0.1    0.1    0.5    0.3
  C=2     0.1    0.1    0.2    0.6
  C=3     0.7    0.2    0.05   0.05
  C=4     0.7    0.2    0.05   0.05
  C=5     0.7    0.2    0.05   0.05
  C=6     0.7    0.2    0.05   0.05

(b) default table

          X=0    X=1    X=2    X=3
  C=1     0.1    0.1    0.5    0.3
  C=2     0.1    0.1    0.2    0.6
  default 0.7    0.2    0.05   0.05

Figure 1: Example of two representations of the same conditional distribution P(X | C).

2.3 Context-Specific Independence

Suppose that a certain binding site, whose presence is denoted by the variable X_1, regulates genes in two functional categories. We would then expect this site to be present with high probability in promoter regions of genes in these two categories, and to have low probability of appearing in the promoter region of all other genes. Since X_1 is relevant to the expression level of (some) genes, it is not independent of the other variables, and so we would prefer models where X_1 \in G. In such a model, we need to specify P(X_1 | C = k) for k = 1, ..., K. That is, for each functional category, we learn a different probability distribution over X_1. However, since X_1 is relevant only for two classes, say 1 and 2, this introduces unnecessary complexity: once we know that C is not one of the two relevant functional classes (i.e., C > 2), we can predict P(X_1 | C) using a single distribution.

To capture such distinctions, we need to introduce a language that refines the ideas of selective naive Bayesian models. More precisely, we want to describe additional structure within the conditional distribution P(X_1 | C). The intuition here is that we need to specify context-specific independencies (CSI): once we know that C \notin {1, 2}, then X_1 is independent of C. This issue has received much attention in the probabilistic reasoning community (Boutilier et al. 1996, Chickering et al. 1997, Friedman & Goldszmidt 1998). Here, we choose a fairly simple representation of CSI that Friedman & Goldszmidt (1998) term default tables.

This representation is as follows. The structure of the distribution P(X_i | C) is represented by an object L_i = {k_1, ..., k_l} where k_j \in {1, ..., K}. Each k_j represents a case that has an explicit conditional probability. All other cases are treated by a special default conditional probability. Formally, the conditional probability has the form:

P(X_i | C = k) = P(X_i | C = k_j)   if k = k_j \in L_i
P(X_i | C = k) = P(X_i | default)   if k \notin L_i

It will be convenient for us to think of L_i as defining a random variable, which we will denote L_i, with l+1 values. This random variable is the characteristic function of C, such that L_i = j if C = k_j \in L_i, and L_i = l+1 if C = k \notin L_i. Then, P(X_i | C) is replaced by P(X_i | L_i). This representation requires l+1 different distributions rather than K different ones. Note that each of these conditional distributions can be multinomial, Gaussian, or any other parametric family we might choose to use.

Returning to our example above, instead of representing the probability P(X_1 | C) as a complete table, as in Figure 1(a), we can represent it using a more succinct table with the cases 1, 2, and

the default {3, ..., K}, as shown in Figure 1(b). This requires estimating a different probability of X_1 in each of the first two clusters, and one probability of X_1 in the remaining clusters.

We note that in the extreme case, when L_i is empty, we are rendering X_i independent of C. To see this, note that L_i has a single value in this situation, and thus P(X_i | C) is the same for all values of C. Thus, since CSI is a refinement of selective Bayesian models, it suffices to specify the choice of L_i for each variable.

Finally, we consider classifying a gene given a model. As in Eq. (2), the decision between two clusters is a sum of terms of the form P(x_i | C = k) / P(x_i | C = k'). Now, if both k and k' fall in the default category of L_i, then they map to the same value of L_i, and thus define the same conditional probability over X_i. In such a situation, the observation x_i does not contribute to the distinction between k and k'. On the other hand, we will say that X_i distinguishes a cluster k_j if k_j \in L_i, indicating a unique conditional distribution for X_i given the cluster k_j.

3 Scoring CSI Clustering

We want to learn CSI Clustering from data. By learning, we mean selecting the number of clusters K, the set of dependent random variables G, the corresponding local structures L_i, and, in addition, estimating the parameters of the conditional distributions in the model. We reiterate that CSI clustering is a special sub-class of Bayesian networks with default tables. Thus, we adopt standard learning approaches for Bayesian networks (Friedman 1998, Friedman & Goldszmidt 1998, Heckerman 1998) and specialize them for this class of models. In particular, we use a Bayesian approach for learning probabilistic models. In this approach learning is posed as an optimization problem of some scoring function. In this section we review the scoring functions over different choices of clustering models (including both structure and parameters). In the next section, we consider methods for finding high-scoring clustering models. That is, we describe computational procedures for searching the vast space of possible structures efficiently.

3.1 The Bayesian Score

We assume that the set of variables X_1, ..., X_N is fixed. We define a CSI Clustering model to be a tuple M = \langle K, {L_i} \rangle, where K specifies the number of values of the latent class and L_i specifies the choice of local structure for X_i. (Recall that X_i does not depend on C if L_i = \emptyset.) A model M is parameterized by a vector \theta_M of parameters. These include the mixture parameters \theta_k = P(C = k), and the parameters \theta_{X_i|l} of P(X_i | L_i = l). As input for the learning problem, we are given a dataset D that consists of M samples; the m-th sample specifies a joint assignment x_1[m], ..., x_N[m] to X_1, ..., X_N.

In the Bayesian approach, we compute the posterior probability of a model, given the particular data set D:

P(M | D) \propto P(D | M) P(M)

The term P(M) is the prior probability of the model M, and P(D | M) is the marginal likelihood of the data, given the model M.
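The model M = \langle K, {L_i} \rangle just defined is easy to represent directly. The following sketch is our own illustration (class and field names are hypothetical): each local structure is stored as a default table, and the characteristic function l_i(k) of Section 2.3 maps a cluster to the conditional distribution it uses.

```python
from dataclasses import dataclass, field

@dataclass
class DefaultTable:
    """Local structure L_i: clusters listed in `cases` get their own conditional
    distribution; all remaining clusters share a single default distribution."""
    cases: tuple = ()      # e.g. (1, 2) as in Figure 1(b); () means X_i is independent of C

    def l_of(self, k):
        """Characteristic function of C: index of the explicit case used by cluster k,
        or len(cases) for the shared default case."""
        return self.cases.index(k) if k in self.cases else len(self.cases)

    def n_distributions(self):
        return len(self.cases) + 1   # one distribution per explicit case plus the default

@dataclass
class CSIClusteringModel:
    K: int                                         # number of clusters
    local: dict = field(default_factory=dict)      # attribute index -> DefaultTable

# Figure 1(b): X_1 has explicit cases for clusters 1 and 2 and a default for clusters 3..6.
model = CSIClusteringModel(K=6, local={1: DefaultTable(cases=(1, 2))})
print([model.local[1].l_of(k) for k in range(1, 7)])   # -> [0, 1, 2, 2, 2, 2]
```

With cases (1, 2) and K = 6, clusters 3 through 6 all map to the default case, reproducing the structure of Figure 1(b).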

In this paper, we use a fairly simple class of priors over models, in which the model prior decomposes into several independent components, as suggested by Friedman & Goldszmidt (1998):

P(M) \propto P(K) P(G) \prod_i P(L_i).

We assume that P(K) \propto \gamma^K is a geometric distribution with parameter \gamma, which is fairly close to 1. The prior over G is designed to penalize dependencies; thus P(G) \propto \alpha^{|G|} for some parameter \alpha < 1. (Recall that G = {i : L_i \neq \emptyset}.) Finally, the prior distribution over local models is set so that P(L_i) \propto \binom{K}{|L_i|}^{-1}, normalized over cardinalities. That is, we set a uniform prior over the number of cases in L_i, and then put a uniform prior over all local structures with this cardinality. We choose these priors for their mathematical simplicity (which makes some of the computations below easier) and since they slightly favor simpler models.

We now consider the marginal likelihood term. This term evaluates the probability of generating the data set D from the model M. This probability requires averaging over all possible parameterizations of M:

P(D | M) = \int P(D | M, \theta_M) P(\theta_M | M) \, d\theta_M    (3)

where P(\theta_M | M) is the prior density over the parameters \theta_M, and P(D | M, \theta_M) is the likelihood of the data:

P(D | M, \theta_M) = \prod_m \sum_k P(C = k | M, \theta_M) \prod_i P(x_i[m] | l_i(k), M, \theta_M)    (4)

where l_i(k) is the value of L_i when C = k.

In this work we follow a standard approach to learning graphical models and use decomposable priors for a given model's parameters \theta_M that have the form

P(\theta_M | M) = P(\theta_C) \prod_i \prod_{l \in Val(L_i)} P(\theta_{X_i|l})

For multinomial X_i and for C, we use a Dirichlet (DeGroot 1970) prior over the parameters, and for normal X_i, we use a normal-gamma prior (DeGroot 1970). We review the details of both families of priors in Appendix A.

We stress that the Bayesian method is different from the maximum likelihood method. In the latter, one evaluates each model by the likelihood it achieves with the best parameters. That can be misleading, since poor models might have specific parameters that give the data high likelihood. Bayesian approaches avoid such over-fitting by averaging over all possible parameterizations. This averaging regularizes the score. In fact, a general theorem (Schwarz 1978) shows that for large data sets (i.e., as M \to \infty)

\log P(D | M) = \log P(D | M, \hat{\theta}_M) - \frac{1}{2} \log M \cdot \dim(M) + O(1)    (5)

where \hat{\theta}_M are the maximum a posteriori probability (MAP) parameters that maximize P(D | M, \theta_M) P(\theta_M | M), and \dim(M) is the dimensionality of the model M (the number of degrees of freedom in the parameterization of M). Thus, in the limit the Bayesian score behaves like a penalized maximum likelihood score, where the penalty depends on the complexity of the model.¹ Note that this approximation is closely related to the minimum description length (MDL) principle (Rissanen 1978).

¹ Note that as M \to \infty the maximum likelihood parameters and the MAP parameters converge to the same values.
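A small sketch of the penalized-likelihood form in Eq. (5) follows. It is our own illustration and assumes the usual parameter count, which the text does not spell out here: K-1 free mixture weights, |Val(X_i)|-1 free parameters per multinomial case, and a mean and variance per Gaussian case.

```python
import math

def csi_dim(K, local_structures, attr_info):
    """Degrees of freedom dim(M) of a CSI clustering model (standard count, our sketch).

    local_structures: attr -> number of explicit cases |L_i| (0 means X_i is independent of C).
    attr_info: attr -> ('multinomial', n_values) or ('gaussian',).
    """
    dim = K - 1                                   # mixture weights P(C = k)
    for i, n_cases in local_structures.items():
        n_dists = n_cases + 1                     # explicit cases plus the shared default
        if attr_info[i][0] == 'multinomial':
            dim += n_dists * (attr_info[i][1] - 1)
        else:                                     # Gaussian: mean and variance per case
            dim += n_dists * 2
    return dim

def bic_score(log_likelihood, n_samples, dim):
    """Eq. (5): penalized log-likelihood approximation of log P(D | M)."""
    return log_likelihood - 0.5 * math.log(n_samples) * dim

dim = csi_dim(K=5,
              local_structures={'expr1': 5, 'site1': 2, 'site2': 0},
              attr_info={'expr1': ('gaussian',),
                         'site1': ('multinomial', 3),
                         'site2': ('multinomial', 3)})
print(dim, bic_score(log_likelihood=-1234.5, n_samples=500, dim=dim))
```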

3.2 Complete Data

We briefly discuss the evaluation of the marginal likelihood in the case where the data is complete. This setting is easier than the setting we need to deal with; however, the developments here are needed for the ones below. In the complete data case we assume that we are learning from a data set D_c that contains M samples, each of which specifies values x_1[m], ..., x_N[m], c[m] for X_1, ..., X_N and C. (In this case, we also fix in advance the number of values of C.) For such data sets, the likelihood term P(D_c | M, \theta_M) can be decomposed into a product of local terms:

P(D_c | M, \theta_M) = L_{local}(C, S_C, \theta_C) \prod_i \prod_{l \in Val(L_i)} L_{local}(X_i, S_{X_i|l}, \theta_{X_i|l})    (6)

where the L_{local} terms denote the likelihood that depends on each conditional probability distribution and the associated sufficient statistics vectors S_C and S_{X_i|l}. These statistics are cumulative functions over the training samples. These include counts of the number of times a certain event occurred, or sums of the values of X_i, or of X_i^2, in the samples where L_i = l. The particular details of these likelihoods and sufficient statistics are less crucial for the developments below, and so we defer them to Appendix A. An important property of the sufficient statistics is that once we compute the statistics for the case in which |L_i| = |C|, i.e., we have a separate conditional distribution for each cluster in a node X_i, we can easily get statistics for other local structures as a sum over the relevant statistics for each l \in Val(L_i):

S_{X_i|l} = \sum_k P(L_i = l | C = k) S_{X_i|C=k}

(Note that since L_i is a deterministic function of C, P(L_i = l | C = k) is either 0 or 1.)

The important consequence of the decomposition of Eq. (6), and the corresponding decomposition of the prior, is that the marginal likelihood term also decomposes (see (Friedman & Goldszmidt 1998, Heckerman 1998)):

P(D_c | M) = S_{local}(C, S_C) \prod_i \prod_{l \in Val(L_i)} S_{local}(X_i, S_{X_i|l})    (7)

where

S_{local}(X_i, S_{X_i|l}) = \int L_{local}(X_i, S_{X_i|l}, \theta_{X_i|l}) P(\theta_{X_i|l} | M) \, d\theta_{X_i|l}

The decomposition of the marginal likelihood suggests that we can easily find the best model in the case of complete data. The intuition is that the observation of C decouples the modeling choices for each X_i from the other variables. Formally, we can easily see that changing L_i for X_i changes only the prior associated with that L_i and the marginal likelihood term \prod_{l \in Val(L_i)} S_{local}(X_i, S_{X_i|l}). Thus, we can optimize the choice of each L_i separately of the others.

Note that there are 2^K possible choices of L_i. For each such choice we compute the sufficient statistics and evaluate the score of the model. When K is small we can exhaustively evaluate all these choices. In such a situation we find the optimal model given the data. In most learning scenarios, however, K is large enough to make such enumeration infeasible. Thus, instead, we construct L_i by a greedy procedure (Friedman & Goldszmidt 1998) that at each iteration finds the best k to separate from the default case, until no improvement is made to the score.

To summarize, when we have complete data the problem of learning a CSI clustering model is straightforward: we collect the sufficient statistics S_{X_i|C=k} for every X_i and k = 1, ..., K, and then

we can efficiently evaluate every possible model. Moreover, we can choose the one with the highest posterior without explicitly enumerating all possible models. Instead, we simply decide what is the best L_i for each X_i, independently of the decisions made for the other variables.

3.3 Incomplete Data

We now return to the case of interest to us, where we do not observe the class labels. Such a learning problem is said to have incomplete data. In this learning scenario, the evaluation of the marginal likelihood Eq. (3) is problematic, as we need to sum over all completions of the missing data. We denote the missing part of the data as D_H. In our case, this consists of assignments to clusters for the M samples. Using this notation, we can write Eq. (3) as:

P(D | M) = \int \sum_{D_H} P(D, D_H | M, \theta_M) P(\theta_M | M) \, d\theta_M

Although P(D, D_H | M, \theta_M) is a product of local terms, we cannot decompose the marginal likelihood. Moreover, unlike the complete data case, we cannot learn the structure of P(X_i | C) independently of learning the structure of the other conditional probabilities. Since we do not observe the values of the cluster variables, these choices interact. As a consequence, we cannot compute the marginal likelihood in an analytical form. Instead, we need to resort to approximations. We refer the reader to Chickering & Heckerman (1997) for an overview of methods for approximating the marginal likelihood. In this paper we use two such approximations to the logarithm of the marginal likelihood.

The first is the Bayesian Information Criterion (BIC) approximation of Schwarz (1978) (see Eq. (5)):

BIC(M, \theta_M) = \log P(D | M, \theta_M) - \frac{1}{2} \log M \cdot \dim(M)

To evaluate this score, we perform expectation maximization (EM) iterations to find the MAP parameters (Lauritzen 1995); see also (Chickering & Heckerman 1997, Heckerman 1998). The benefit of this score is that once we find the MAP parameters, it is fairly easy to evaluate. Unfortunately, this score is only asymptotically correct, and can over-penalize models for complexity in practice.

Another possible approximation is the Cheeseman-Stutz (CS) score (Cheeseman & Stutz 1995); see also (Chickering & Heckerman 1997). This score approximates the marginal likelihood as:

CS(M, \theta_M) = \log P(D | M, \theta_M) - \log P(D_c | M, \theta_M) + \log P(D_c | M)

where D_c is a fictitious data set that is represented by a set of sufficient statistics. The computation of P(D_c | M, \hat{\theta}_M) and P(D_c | M) is then performed as though the data were complete. This simply amounts to evaluating Eq. (6) and Eq. (7) using the sufficient statistics for D_c. The choice of D_c is such that its sufficient statistics will match the expected sufficient statistics given M and \theta_M. These are defined by averaging over all possible completions D_c of the data:

E[S_{X_i|C=k} | M, \theta_M] = \sum_{D_c} S^{D_c}_{X_i|C=k} P(D_c | D, M, \theta_M)    (8)

where D_c represents a potential completion of the data (i.e., an assignment of a cluster value to each example) and S^{D_c}_{X_i|C=k} is the sufficient statistic for X_i given C = k evaluated on D_c. Using the

linearity of expectation, this term can be efficiently computed (Chickering & Heckerman 1997, Friedman 1998). Thus, to compute D_c, we find the MAP parameters \theta_M, and then compute the expected sufficient statistics given M, \theta_M. We then use these within Eq. (6) and Eq. (7) as the sufficient statistics of the fictional data set D_c.

4 Learning CSI Clustering

4.1 Structural EM

Once we set our prior probabilities, and decide on the type of approximation we use (either BIC or CS), we implicitly induce a score over all possible models. Our goal is to identify the model M that attains the highest score. Unfortunately, for a fixed K, there are O(2^{NK}) choices of models with K clusters and N variables; therefore we cannot exhaustively evaluate the score on all models. The typical way to handle this difficulty is by resorting to a heuristic search procedure. Local search procedures traverse the space of models by performing local changes (e.g., changing one of the L_i by adding or removing a case in the default table). The main computational cost of such a search is evaluating candidate models. Remember that since we have incomplete data, we cannot directly score a candidate model. Instead, for each candidate model we want to score, we perform another search in the parameter space (using techniques such as EM) to find the MAP parameters and then use these parameters for computing the score. Thus, the search procedure spends non-negligible computation per candidate. This severely limits the set of candidates that it can explore.

To avoid such expensive evaluations of candidates, we use the framework of Bayesian structural EM (Friedman 1998). In this framework, we use our current candidate to complete the missing values (i.e., cluster assignments). We then perform structure learning as though we have complete data, searching (efficiently) for a better model. This results in a new best model (with its optimized parameters). This new model forms the basis for the next iteration, and so on. This procedure has the benefit that structure selection is done in a situation that resembles complete data. In addition, each iteration can find a model that is quite different from the model at the beginning of the iteration. In this sense, the local moves of standard search procedures are replaced by global moves. Finally, the procedure is proven to improve the structure in each iteration.

More specifically, the Structural EM procedure consists of repeated iterations. We initialize the process with a model M^0, \theta^0. We discuss below the choice of this starting point. Then at the (\ell+1)-th iteration we start with the pair M^\ell, \theta^\ell of the previous iteration and construct a new pair M^{\ell+1}, \theta^{\ell+1}. This iteration consists of three steps.

E-Step: Compute expected sufficient statistics \tilde{S}^\ell_{X_i|C=k} = E[S_{X_i|C=k} | M^\ell, \theta^\ell] for each i = 1, ..., N and each k = 1, ..., K using Eq. (8).

M-Step: Learn a model M^{\ell+1} and parameters \theta^{\ell+1} using these expected sufficient statistics, as though they were observed in a complete data set. For each X_i, choose the best-scoring CSI local structure L_i with respect to the sufficient statistics. This is done independently for each of the variables.

Postprocessing-Step: Maximize the parameters for M^{\ell+1} by running parametric EM. This optimization is initialized by the MAP parameters given the expected sufficient statistics.
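The per-attribute structure selection in the M-Step can be illustrated with a runnable simplification. The sketch below is ours: for a single multinomial attribute with expected counts E[S_{X_i|C=k}], clusters are greedily separated from the default case, as in the greedy procedure of Section 3.2, using a BIC-style complete-data score as a stand-in for the Bayesian local scores actually used.

```python
import numpy as np

def attr_loglik(counts):
    """Complete-data log-likelihood of multinomial counts under their MLE."""
    total = counts.sum()
    if total == 0:
        return 0.0
    p = counts / total
    nz = counts > 0
    return float((counts[nz] * np.log(p[nz])).sum())

def default_table_score(expected_counts, explicit, n_samples):
    """BIC-style score of a default table: explicit clusters get their own
    multinomial; the remaining clusters share a pooled default distribution."""
    K, V = expected_counts.shape
    ll = sum(attr_loglik(expected_counts[k]) for k in explicit)
    rest = [k for k in range(K) if k not in explicit]
    if rest:
        ll += attr_loglik(expected_counts[rest].sum(axis=0))
    n_dists = len(explicit) + (1 if rest else 0)
    return ll - 0.5 * np.log(n_samples) * n_dists * (V - 1)

def greedy_default_table(expected_counts, n_samples):
    """Greedily move clusters out of the default case while the score improves."""
    explicit = []
    best = default_table_score(expected_counts, explicit, n_samples)
    K = expected_counts.shape[0]
    improved = True
    while improved:
        improved = False
        for k in set(range(K)) - set(explicit):
            s = default_table_score(expected_counts, explicit + [k], n_samples)
            if s > best:
                best, best_k, improved = s, k, True
        if improved:
            explicit.append(best_k)
    return explicit, best

# Expected counts E[S_{X|C=k}] for a 4-valued attribute in a 6-cluster model
# (illustrative numbers shaped like Figure 1: clusters 0 and 1 differ, 2..5 look alike).
counts = np.array([[10, 10, 50, 30], [10, 10, 20, 60],
                   [70, 20, 5, 5], [70, 20, 5, 5], [70, 20, 5, 5], [70, 20, 5, 5]], float)
print(greedy_default_table(counts, n_samples=600))
```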

These iterations are reminiscent of the standard EM algorithm. The main difference is that in the standard approach the M-Step involves re-estimating parameters, while in Structural EM we also relearn the structure. More precisely, Structural EM enables us to evaluate each possible new L_i based on the sufficient statistics computed with the current L_i, instead of doing an expensive EM procedure for each such candidate.

In applying this procedure, we can use different scores in choosing models at the M-Step. This depends on the approximation we set out to use on the incomplete data. Above we discussed two different scores. The first one is the BIC approximation. In this case, we simply evaluate structures in the M-step using BIC on complete data (the likelihood in this case decomposes, and the complexity penalty remains the same). The second one is the CS approximation. In this case, note that CS applied to complete data is simply the Bayesian score (since \log P(D | M, \hat{\theta}_M) and \log P(D_c | M, \hat{\theta}_M) cancel out). Thus, in this case we use the exact Bayesian score with respect to the expected sufficient statistics.

These iterations are guaranteed to improve the score in the following sense. Each iteration finds a candidate that has a better score (with respect to the incomplete training data) than the previous one. More precisely, if we use the BIC score (with respect to the expected sufficient statistics) in the M-step, then results of Friedman (1997) show that the BIC score of M^{\ell+1}, \theta^{\ell+1} is greater than the BIC score of M^\ell, \theta^\ell, unless the procedure converged, in which case the two scores will be equal. Thus, each step improves the score we set out to maximize, and at some point the procedure will reach a (local) maximum. When we use the CS score, the situation is more complicated. The results of Friedman (1998) show that each iteration is an approximate version of a procedure that does improve the Bayesian score on the incomplete data. In practice, most iterations do improve the CS score.

We use two different methods for initializing the structural EM procedure. In the first one, we start with the full model (where |L_i| = |C| for every variable X_i). This model is the most expressive in the class we consider, and thus allows the starting point to capture any type of trend in the data. The second initialization method is to use a random model, where G (i.e., the set of variables dependent on the hidden cluster variable) is chosen at random. In both cases, we apply aggressive parametric optimization to find the initial parameters. This is done by using 100 random starting points for parametric EM, and returning the parameter vector that achieves the highest score.

4.2 Escaping Local Maxima

The structural EM procedure, as described above, can get trapped in local maxima. That is, it can reach sub-optimal convergence points. This can be a serious problem, since some of these convergence points are much worse than the optimal model, and thus lead to a poor clustering. A naive way to avoid this problem is by multiple restarts. However, when the number of local maxima is large, such multiple restarts have limited utility. Instead, we want strategies for escaping local maxima that improve on the solution found by earlier iterations.

We implemented two approaches for escaping local maxima. In the first approach, we apply a directed search once the structural EM procedure converges. More specifically, assume that M^\ell is the convergence point of structural EM. Starting from this model, we apply a local search procedure that attempts to add and remove variables to the model. As explained above, such a procedure is costly since it has to separately evaluate each candidate it proposes.
To avoid evaluating all moves from the current model, we apply randomized moves

and evaluate each one. Once a candidate with a score higher than that of M^\ell is found, we restart structural EM from that point. If after a fixed number of random trials no improvement was found, the procedure terminates the search and returns the best model found so far.

In the second approach, we use an annealing-like procedure to introduce randomness at each step of the process. This randomness needs to serve two purposes. On the one hand, a certain amount of randomness will allow the procedure to escape convergence points of structural EM. On the other hand, we want our steps to exploit the sufficient statistics computed in the E-step to choose models that build on information learned in previous iterations. We achieve this goal by using a variant of Structural EM recently suggested by Elidan et al. (2001) and Friedman et al. (2002). The idea is simple: at each iteration of Structural EM, we perform a random reweighting of the training samples. More precisely, for each sample m, we sample a weight w^\ell_m from a Gamma distribution with mean 1 and variance \tau_\ell, where \tau_\ell is an additional parameter that controls the temperature of the search. In the modified E-step we compute weighted sufficient statistics

E[S_{X_i|C=k} | W^\ell, M, \theta_M] = \sum_{D_c} \sum_m w^\ell_m S^{D_c}_{X_i|C=k}[m] P(D_c | D, M, \theta_M)

where S^{D_c}_{X_i|C=k}[m] denotes the contribution of sample m to the statistic under the completion D_c. We then apply the M-Step with respect to these reweighted expected sufficient statistics. Additionally, we set \tau_{\ell+1} to be \delta \tau_\ell, where \delta < 1 is a decay factor. The search is terminated once \tau_\ell reaches a certain predetermined threshold. In our experiments, the annealed approach dominated in performance the approach described above.

5 Evaluation

5.1 Simulation Studies

To evaluate the applicability of our clustering method, we started by performing tests on synthetic data sets. These data sets were sampled from a known clustering model (which determined the number of clusters, which variables depend on which cluster value, and the conditional probabilities). Since we know the model that originated the data, we can measure the performance of our procedure. We examined two aspects. First, how well the procedure recovers the structure of the real model (number of clusters, false positive and false negative edges in the model). Second, how well the procedure recovers the original clustering. That is, how well the model classifies a new sample (gene). The aim of these tests is to understand how the performance of the method depends on various parameters of the learning problem. We will review our techniques for evaluating the learning process results and then turn to describe the details of our artificial data set generation, followed by a summary of the results.

We first address the issue of evaluating the classification success, which can be measured with many different techniques. We use the following criterion. A clustering model M defines a conditional probability distribution over clusters given a sample. Let M_t denote the true model, and let M_e denote the estimated model. Both define conditional distributions over clusters. We want to compare these two conditional distributions. We will denote the clusters of the true model as C_t and the clusters of the estimated model as C_e. Then, we can define a joint distribution over these two clusterings:

P(C_t, C_e) = \sum_x P(x | M_t) P(C_t | x, M_t) P(C_e | x, M_e)

Table 1: Summary of results on synthetic data. The results summarize the performance of the procedure on data generated from a model with 5 true clusters and additional background noise. We report: the number of clusters learned, the logarithm of the likelihood ratio between the learned model and the true model on training data (with noisy samples) and test data (unseen samples, without noise), the information the learned clusters contain about the original clusters (see text), the fraction of edges not recovered (# false negative edges / # edges in the true model), and the fraction of false edges recovered (# false positive edges / # edges in the learned model). For each figure of merit, we report the mean value and the standard deviation from results from 10 datasets (see text).

[Table 1 layout: rows for 10% and 30% noise, each with the BIC and CS scores and the training-set size N; columns for Cluster #, Likelihood (train), Likelihood (test), I(C_t; C_e)/H(C_t), #FalseNegatives/#TrueEdges, and #FalsePositives/#LearnedEdges, each reported as Mean and Std.]

Figure 2: Graphs comparing the scores for different cluster numbers. The x-axis denotes the number of clusters, and the y-axis denotes the score per sample (the logarithm of the BIC score divided by the number of samples). (a) Comparison of the directed search and the weight annealing search on training data with 30% noise and 500 training samples. (b) Comparison of the weight annealing search on training data with 10% noise, with 200, 500, and 800 training samples. Each point is the average of 5 data sets, and error bars denote one standard deviation.

where the sum is over all possible joint assignments to X. In practice we cannot sum over all these joint assignments, and thus we estimate this distribution by sampling from P(X | M_t). Once we have the joint distribution we can compute the mutual information

I(C_t; C_e) = \sum_{c_t, c_e} P(c_t, c_e) \log \frac{P(c_t, c_e)}{P(c_t) P(c_e)}

between the two clustering variables (Cover & Thomas 1991). This term denotes the number of bits one clustering carries about the other. In the table below we report the information ratio I(C_t; C_e) / H(C_t), which measures how much information C_e provides about C_t relative to the maximum possible (which is the entropy of C_t, since I(C_t; C_t) = H(C_t)).

We now turn to the second issue of evaluating the structure learning. We measure several aspects of the learned structure. To evaluate the selected number of clusters, we record both the number of clusters in the model and the number of identified clusters in the model. These are clusters for which there is at least one training sample that is assigned to them. For the CSI structure evaluation we recorded the number of false positive and false negative edges in the implied graph. Recall that an edge corresponds to an informative attribute in the discussion above.

We generated synthetic data from a model learned from the Gasch et al. dataset we describe below. This model had 5 clusters, 93 continuous variables, and 25 discrete nodes. As described in Section 5.2, this model (as most of the ones we learned from real data) had several characteristics. The continuous attributes were mostly informative (usually about several clusters). On the other hand, most discrete attributes were uninformative and the remaining ones distinguished mostly one cluster. From this model, we sampled 5 training sets of sizes 200, 500, and 800 samples (15 training sets in total), and a test set of 1000 samples.

We expect that biological data sets will contain many samples that do not fit into clusters. Thus, we want to ensure that our procedure is robust to the presence of such noise. To estimate this robustness, we injected additional noise into the training sets. This was done by adding samples whose values were sampled uniformly from the range of values each attribute had in the real samples we already had at hand. These obscure the clustering in the original model. We ran our procedure on the sampled data sets that we obtained by adding 10% or 30% additional noise samples to the original training data. The procedure was initialized with random starting points, and for each training data set we searched for the best scoring model with the number of clusters in the range K = 3, ..., 7. We then chose the model with the best score among these. Table 1 summarizes the average performance over the 5 training sets in each parameter setting (200, 500, or 800 samples with 10% or 30% noise) using the learning procedure with the two scoring methods. We briefly summarize the highlights of the results.

Search procedure: We compared the performance of the two variants of the search procedure. The directed approach applies structural EM iterations, and attempts to escape from local maxima by attempting stochastic moves and evaluating each one. The annealed approach applies structural EM iterations where in each iteration the samples are reweighted. In our experiments, we started the annealing procedure with initial temperature (variance of the Gamma distribution) \tau_0 = 2, and each iteration cooled the temperature by a factor of \delta = 0.9. In particular, in Figure 2(a) we see that the annealed search procedure clearly outperforms the directed search on a particular setting. This behavior was consistently observed in all settings, and we do not report it here.
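The reweighted E-step of the annealed search just described is simple to sketch. The code below is ours and only illustrative (a single discrete attribute, made-up posteriors): per-sample weights are drawn from a Gamma distribution with mean 1 and variance \tau, folded into the expected sufficient statistics, and the temperature is cooled by the factor \delta = 0.9 used in the experiments above.

```python
import numpy as np

rng = np.random.default_rng(0)

def gamma_weights(n_samples, tau, rng):
    """Per-sample weights with mean 1 and variance tau (shape 1/tau, scale tau)."""
    return rng.gamma(shape=1.0 / tau, scale=tau, size=n_samples)

def weighted_expected_counts(posteriors, x, n_values, weights):
    """Weighted expected sufficient statistics for a discrete attribute:
    counts[k, v] = sum_m w_m * P(C = k | sample m) * 1[x_m = v]."""
    K = posteriors.shape[1]
    counts = np.zeros((K, n_values))
    for k in range(K):
        for v in range(n_values):
            counts[k, v] = np.sum(weights * posteriors[:, k] * (x == v))
    return counts

# Toy setting: 200 samples, K = 3 clusters, a 4-valued attribute.
M, K, V = 200, 3, 4
posteriors = rng.dirichlet(np.ones(K), size=M)   # stand-in for P(C = k | x[m], current model)
x = rng.integers(0, V, size=M)

tau, delta = 2.0, 0.9                            # initial temperature and decay factor
for _ in range(5):
    w = gamma_weights(M, tau, rng)
    stats = weighted_expected_counts(posteriors, x, V, w)
    # ... the M-step / structure selection would use `stats` here ...
    tau *= delta                                 # cool the temperature
print(stats.round(1))
```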
Cluster number: In all runs, models learned with fewer clusters than the original model were sharply penalized. On the other hand, models learned with additional clusters got scores that were

close to the score received when learning with 5 or 6 clusters; see Figure 2(b). Most runs added another cluster that captured the noise samples we added in constructing the training data, and thus, most of the runs pick 6 or slightly more clusters (see Table 1). In general, runs with the BIC score had a stronger penalty for additional clusters, which resulted in choosing 6 clusters as the best scoring model more often. Runs with the CS score sometimes added more clusters. Additionally, as one might expect, the number of chosen clusters tends to grow with stronger noise and with larger sample size.

Likelihood: As expected, training likelihood is higher than that of the true model. This occurs both because the procedure fits the training data better, and because of the additional noisy samples in the training data. On the other hand, the learned models are always worse (as expected) on the test data. Additionally, the test data likelihood improves as the number of training samples increases, even in noisy data that also has additional noise samples. As expected, models trained with noisier data are somewhat worse than models learned from cleaner data. As a general trend, the training data likelihood of models learned with the CS score is as good as or better than that of models learned with the BIC score. This difference is significant mainly in the noisier data sets. The test set performance of both scores is roughly the same when learning with 10% noise (with BIC slightly better), and CS is better with 30% noise.

Structure accuracy: We measured the percentage of additional dependencies in the learned graph G when compared to the true structure (false positives) and of missing ones in the learned graph G (false negatives). In general, the procedure (using both the BIC and CS scores) tended to have a very small ratio of false negatives, which diminishes as more training samples are available. This shows the procedure is good at recognizing relevant attributes. On the other hand, the procedure had a nontrivial number of false positive dependencies, about 13%-17% depending on the sample size, the scoring function, and the percentage of noise. In general, when using the CS score, the procedure has a slightly higher ratio of false positives. Similarly, the presence of higher noise levels also increased the number of false positive dependencies.

Mutual Information Ratio: In this category all the runs with 800 training samples achieved the maximal information gain. Runs with 200 samples achieved an information gain of 94% and above. Runs with 500 samples had varying results that depended on the level of noise. For 10% noise we got maximal information gain, while results on the noisier data set reached 97% information gain. As with the likelihood of the data, the CS score had slightly better results compared to the BIC score. These results show that the learned clusters were informative about the original clusters.

Clearly, these simulations only explore a small part of the space of possible parameters. However, they show that on a model that has statistical characteristics similar to real-life datasets, our procedure can perform in a robust manner and discover clusterings that are close to the original one, even in the presence of noise.

5.2 Biological Data

We evaluated our procedure on two biological data sets of budding yeast gene expression. The first data set is from Spellman et al. (1998), who measured expression levels of genes during different cell-cycle stages. We examined the expression of the 800 genes that Spellman et al. identify as cell-cycle related in 77 experiments. The second data set is from Gasch et al. (2000), who measured expression levels of genes in response to different environmental changes.
Gasch et al. identified a cluster of genes that have a generic response to stress conditions. In addition, they identified

[Figure 3 layout: expression matrices for the alpha, cdc15, cdc28, and elu experiment series, shown with a color scale from 2x induced to 2x repressed.]

Figure 3: The clustering found for the cell-cycle data of Spellman et al. Light pixels correspond to over-expressed genes, and dark ones correspond to under-expressed genes. The clusters shown here were also characterized by the existence of the following binding sites. Clusters 2 and 5: STUAP (Aspergillus Stunted protein), Cluster 3: QA1 (DNA-binding protein with repressor and activator activities, also involved in silencing at telomeres and silent mating type loci), Clusters 4 and 6: HSF (Heat shock transcription factor).

[Figure 4 layout: panels (a) and (b) with rows labeled Schematics, CSI Mask, and Data; columns for Expression and TFs in (a), and Expression, TFs, and Phylogenetic fingerprint in (b); color scale from 4x induced to 4x repressed.]

Figure 4: Representation of the clustering found in the stress data of Gasch et al. (a) Clustering based on gene expression and TF putative binding sites. (b) Clustering based also on phylogenetic profiles. The top row contains a schematic representation of the clustering. The second row contains a CSI mask plot that hides all expression features that were considered uninformative by the model. The bottom row shows figures of all the genes, sorted by cluster identity. The following clusters were also characterized by putative binding sites: Clusters 6(a) and 8(b): GCN4 (transcription factor of the basic leucine zipper (bZIP) family, regulates general control in response to amino acid or purine starvation) and CBF1, Cluster 2(a): HAP234, Cluster 7(b): GCN4.

clusters of genes that responded to particular stress conditions, but not in a generic manner. Our data set consists of the 950 genes, selected by Segal et al. (2001), that responded to some stress conditions but are not part of the generic stress response. Both data sets are based on cDNA array technology, and the expression value of each gene is reported as the logarithm (base 2) of the ratio of expression in the sample compared to the expression of the same gene in a common baseline ("control sample").

In addition to the expression levels from these two data sets, we recorded for each gene the number of putative binding sites in the 1000bp upstream of the ORF. These were generated by using the fungi matrices in the TRANSFAC 5.1 database (Wingender et al. 2000, Wingender et al. 2001). We used the MAST program (Bailey & Gribskov 1998) to scan the upstream regions. We used these matches to count the number of putative sites in the 1000bp upstream region. This generated discrete valued random variables (with values 0, 1, and >= 2) that correspond to each putative site (either a whole promoter region or a sub-region).

We start by describing the parameters used in the algorithm that were reviewed in previous sections. We applied the annealed search procedure with the following parameters: \tau_0 = 2, 4 and \delta = 0.5, 0.75, 0.9, 0.95. Best results were obtained with \delta = 0.9 or 0.95, with either \tau_0 setting. For other variations, such as the technique for choosing the initial model structure, no technique clearly dominated the others.

We now discuss the results; they also appear, with the full data file and a description of the clusters, online. There were several common trends in the results on both expression data sets, when used with the MAST TF binding sites. First, the expression measurements were considered informative by the clustering. Most of the expression variables had an impact on many of the clusters. Second, most binding site measurements were considered non-informative. The learning procedure decided for most of these that they have no effect on any of the clusters. Those that were considered relevant usually had only 1 or 2 distinct contexts in their local structure. This can potentially be due to the fact that some of these factors were truly irrelevant, or to a large number of errors made by the binding site prediction programs that mask the informative signal in these putative sites. In any case this means these attributes had relatively small influence on the clustering results, and only the ones that seem to be correlated with a clear gene expression profile of one of the clusters were chosen by the model.

To illustrate the type of clusters we found, we show in Figures 3 and 4(a) two of the clusterings we learned. These describe qualitative cluster profiles that help see which experiments distinguish each cluster, and the general trend of expression in each cluster's experiments. Note the schematic illustration of the masks that denote the expression attributes that characterize each cluster. As we can see, these capture quite well the experiments in which genes in the cluster deviate from the average expression. Another clear observation is that clusters learned from the cell-cycle data all show periodic behavior. This can be expected since the 800 genes are all correlated with the cell cycle. However, the clusters differ in their phase. Such cluster profiles are characteristic of many of the models we found for the cell-cycle data. In the Gasch et al. data, the best scoring models had twelve clusters.
In the model shown in Figure 4(a), we see two clusters of genes that are under-expressed in stress conditions (Clusters 1 and 2), seven clusters of genes that are over-expressed in these conditions (Clusters 3, 5, 6, 7, 8, 10, and 12), two clusters of genes that are over-expressed in some stress conditions and under-expressed

in others (Clusters 4 and 9), and two clusters of genes with an undetermined response to stress conditions (Clusters 8 and 11). These latter clusters have high variance, while the others have relatively tight variance in most experiments. Some of the clusters correspond to clear biological functions. For example, Cluster 7 contains genes that are over-expressed in amino-acid starvation and nitrogen depletion. Examining the MIPS (Mewes et al. 1999) functional annotation of these genes suggests that many of the genes are involved in amino-acid biosynthesis and in transport. Another example is Cluster 2, which contains genes that are under-expressed in late stages of nitrogen depletion, diauxic shift, and under YPD growth medium. This cluster is associated with frequent occurrences of the HAP234 binding site. This binding site (of the complex HAP-2, HAP-3, and HAP-4) is associated with the control of gene expression under nonfermentative growth conditions. Many genes in this cluster are associated with mitochondrial organization and transport, respiration, and ATP transport. The association of the cluster with the HAP234 binding site strengthens the hypothesis that genes with unknown function in this cluster might be related to these pathways.

We suspect that one of the reasons few clusters are associated with transcription factor binding sites is the noisy prediction of these sites. To evaluate the effect of more informative sequence motif identification in the upstream region, we performed the following experiment. We applied our algorithm using expression values from the Gasch et al. data set. Then, we applied the procedure of Barash et al. (2001) to each of the clusters we identified. This procedure searches for motifs that discriminatively appear in the upstream regions of genes in particular clusters and are uncommon in other genes in the genome. We then annotated each gene with the set of motifs we found, and used these annotations as additional input to a new run of our algorithm. Although we applied a fairly simple unsupervised sequence motif identification algorithm, its impact on the learning algorithm was clear. Several hundred genes changed their hard assignment from the initial assignment made when clustering with only expression data, 26 out of 28 motifs were considered informative to the final clustering, and 2 motifs became relevant for 2 different clusters. When we fed the new hard assignments of genes to clusters back into the motif finding algorithm we got a general improvement in motif identification in clusters.

Next, in order to demonstrate the model's ability to integrate relevant biological data from different sources, we considered adding additional attributes extracted from the COG database (Tatusov et al. 2001). This database associates each yeast gene with orthologous genes in 43 other genomes. Thus, we create for each gene a phylogenetic pattern that denotes whether there is an orthologous gene in each of the 43 genomes. When we include these additional features, the clusters learned change. In general, we note that most of the phylogenetic patterns were considered informative by the model but still context specific. For example, we see pairs of clusters (e.g., Clusters 5 and 10) that are similar in terms of expression, yet have distinct phylogenetic profiles. One cluster contains genes that do not have orthologs, while the other cluster contains genes that have orthologs in many bacterial genomes. Phylogenetic patterns also allow us to gain additional insight into the functional aspects of the clusters. For example, Cluster 8 contains genes that are highly over-expressed in amino-acid starvation and nitrogen depletion.
Next, in order to demonstrate the model's ability to incorporate relevant biological data from different sources, we considered adding additional attributes extracted from the COG database (Tatusov et al. 2001). This database associates each yeast gene with orthologous genes in 43 other genomes. Thus, we create for each gene a phylogenetic pattern that denotes whether there is an orthologous gene in each of the 43 genomes. When we include these additional features, the learned clusters change. In general, we note that most of the phylogenetic patterns were considered informative by the model but still context-specific. For example, we see pairs of clusters (e.g., Clusters 5 and 10) that are similar in terms of expression, yet have distinct phylogenetic profiles. One cluster contains genes that do not have orthologs, while the other cluster contains genes that have orthologs in many bacterial genomes. Phylogenetic patterns also allow us to gain additional insight into the functional aspects of the clusters. For example, Cluster 8 contains genes that are highly over-expressed in amino-acid starvation and nitrogen depletion. It is characterized by occurrences of the binding sites of GCN4 and CBF1, and genes in it have a typical profile with orthologs in C. jejuni, P. multocida, Halobacterium sp. NRC-1, P. aeruginosa, M. tuberculosis, A. aeolicus, C. crescentus, H. pylori J99, M. leprae, D. radiodurans, T. volcanium, and T. acidophilum, and no orthologs in M. genitalium, B. burgdorferi, C. pneumoniae, C. trachomatis, S. pyogenes, T. pallidum, R. prowazekii, U. urealyticum, and Buchnera sp. APS. This cluster description suggests that this group of genes has common phylogenetic origins as well as common function and regulation.
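For concreteness, a minimal sketch of how these phylogenetic pattern attributes could be assembled is shown below; the `orthologs_of` mapping is a hypothetical stand-in for a parsed version of the COG ortholog assignments, not an actual COG interface.

```python
def phylogenetic_patterns(genes, genomes, orthologs_of):
    """Build one binary attribute per genome: 1 if the gene has an ortholog there.

    genes:        iterable of yeast gene names
    genomes:      list of the 43 genome identifiers taken from the COG database
    orthologs_of: dict gene -> set of genome identifiers with an ortholog
                  (assumed to be parsed from COG; hypothetical, not a COG API)
    """
    return {
        gene: [1 if genome in orthologs_of.get(gene, set()) else 0
               for genome in genomes]
        for gene in genes
    }
```

These binary vectors are appended to each gene's attribute list, in the same way the motif annotations were, before re-running the clustering.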

Figure 5: Clustering of arrays in the stress data set of Gasch et al. The left panel ("Data") shows the data rearranged according to the clustering; the right panel ("CSI Masked Data") shows only the positions that are informative in the learned model (note that cluster 1 is totally masked in this model).

As we noted in the introduction, our method can be used for other clustering tasks. As an example, we clustered the 92 samples in the stress data. In this clustering, we reversed the roles of conditions and genes: we now consider each condition as an (independent) sample, and each gene as a (continuous) attribute of the sample. The result of the clustering is a set of groups of samples; for each cluster we have the list of informative genes. Not surprisingly, this clustering recovered quite well the groups of samples with the same treatments. Table 2 shows the composition of each cluster, in a run with 10 clusters, in terms of the original treatments. Each of the following treatments was recovered in a separate cluster: DTT, diamide, YP, and steady state. In addition, the nitrogen depletion time course was split into two clusters. The earlier samples (30 minutes to 4 hours) appeared in a cluster with the amino acid starvation samples, while the later samples (8 hours to 5 days) were clustered separately. This is consistent with the clusters we learned over genes, which showed that some genes had distinct behavior in later parts of the nitrogen depletion time course. A similar phenomenon occurs with the H2O2 samples. Earlier samples (10 minutes to 50 minutes) are clustered with the menadione samples. Later H2O2 samples (60 minutes to 80 minutes, and also 40 minutes) were clustered with the sorbitol samples. Finally, both heat shock time courses (fixed temperature and variable temperature) were mostly clustered in one cluster, although some of the heat shock samples
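A minimal sketch of this reversal, assuming a generic clustering callable as a stand-in for our algorithm, is the following; it simply transposes the expression matrix so that conditions become the samples and genes become their continuous attributes.

```python
import numpy as np

def cluster_conditions(expression_matrix, condition_names, cluster_fn, k=10):
    """Cluster the arrays (conditions) rather than the genes.

    expression_matrix: genes x conditions array of expression levels
    condition_names:   column labels (e.g., treatment and time point)
    cluster_fn:        callable(samples, k) -> list of cluster labels;
                       a stand-in for the clustering algorithm in the text
    """
    samples = np.asarray(expression_matrix).T  # now conditions x genes
    labels = cluster_fn(samples, k)

    # Collect the conditions assigned to each cluster, as summarized in Table 2.
    groups = {}
    for name, label in zip(condition_names, labels):
        groups.setdefault(label, []).append(name)
    return groups
```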
