BIOINFORMATICS ORIGINAL PAPER


Vol. 21, doi: /bioinformatics/bti402
Sequence analysis

A boosting approach for motif modeling using ChIP-chip data

Pengyu Hong 1, X. Shirley Liu 2, Qing Zhou 1, Xin Lu 2, Jun S. Liu 1,2 and Wing H. Wong 1,2,*
1 Department of Statistics, Harvard University, Cambridge, MA 02138, USA and 2 Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115, USA

Received on July 30, 2004; revised on January 10, 2005; accepted on March 21, 2005
Advance Access publication April 7, 2005

ABSTRACT
Motivation: Building an accurate binding model for a transcription factor (TF) is essential to differentiate its true binding targets from spurious ones. This is an important step toward understanding gene regulation.
Results: This paper describes a boosting approach to modeling TF-DNA binding. Unlike the widely used weight matrix model, which predicts TF-DNA binding based on a linear combination of position-specific contributions, our approach builds a TF binding classifier by combining a set of weight matrix based classifiers, thus yielding a non-linear binding decision rule. The proposed approach was applied to the ChIP-chip data of Saccharomyces cerevisiae. When compared with the weight matrix method, our new approach showed significant improvements in specificity in a majority of cases.
Contact: wwong@hsph.harvard.edu
Supplementary information: The software and the Supplementary data are available at hong2004/MotifBooster/.

1 INTRODUCTION
With the continuing explosive growth of sequenced genomes and genome-wide mRNA expression data, scientists are increasingly interested in modeling regulatory motifs and predicting binding targets of transcription factors (TFs). In this paper, we propose a discriminant approach that builds models to distinguish positive sequences (i.e. binding targets of a TF) from negative sequences (i.e. non-targets of a TF). Several approaches for this discriminant task have been proposed previously.
DMotifs applies an enumerative search of the motif space and reports the best motif as a feature of the sequences that best differentiates positive from negative sequences (Sinha, 2002). Vilo et al. (2000) used a binomial formula for significance test to evaluate the occurrences of a motif in positive sequences against those in negative sequences. Similar to the approach of Vilo et al. (2000), the random selection null hypothesis approach in Barash et al. (2001) tests the significance of a motif against negative sequences based on a hypergeometric distribution. Takusagawa and Gifford (2004) extended the works of Vilo et al. (2000) and Barash et al. (2001) to consider the effects of the lengths of sequences. The above approaches report motifs as consensus words, which are arguably less sensitive and precise than the corresponding weight matrix representations (Stormo et al., 1982).

*To whom correspondence should be addressed.

Since the pioneering work of Stormo et al. (1982), the weight matrix model has become one of the most widely used models for representing motifs. A popular approach to estimating the parameters of a weight matrix de novo is to find a statistically enriched motif in positive sequences with respect to a background model (Stormo and Hartzell, 1989; Lawrence and Reilly, 1990; Lawrence et al., 1993; Liu et al., 1995; Barash et al., 2001). The background model, which usually is defined as an n-th order Markov model (n = 0, 1, 2 or 3), tries to capture all information in the non-binding sites that are much more heterogeneous than the binding sites. Such a background model is so general that the weight matrix model tends to have very low specificity. To better identify the non-binding sites that are very similar to the binding sites, Workman and Stormo (2000) proposed a discriminant method called ANN-Spec, which uses a Perceptron model and Gibbs sampling to train the weight matrix. They showed that the weight matrix models output by ANN-Spec have higher specificity than those built by non-discriminant approaches, such as MEME (Bailey and Elkan, 1994).
A motif reported as a weight matrix assumes that different positions of the motif are independent. Under this assumption, a weight matrix is essentially a linear classifier when used with a cutoff value to predict binding sites in sequences. Recent biological studies have demonstrated that individual positions of binding sites are not always independent (Bulyk et al., 2001, 2002; Man and Stormo, 2001), and suggested that some TFs recognize their targets in a non-linear fashion. Barash et al. (2003) adopted Bayesian networks to model dependences in binding motifs as trees and mixtures of trees. The Bayesian tree model is similar to the one used in an early work by Agarwal and Bafna (1998) to model the dependency between bases. It was recently reported (Zhou and Liu, 2004) that a simpler pair-correlation model can largely account for all observed correlations among motif positions, and that using such a model in conjunction with the Gibbs sampling method suffers no overfitting problem. However, such a model still cannot accommodate some non-linear factors in discriminating positive and negative sequences.

It is widely accepted that a TF participates in controlling the mRNA levels of its target genes through its binding sites in the corresponding promoter regions. Hence, the REDUCE method (Bussemaker et al., 2001) and Motif Regressor (Conlon et al., 2003) were proposed to discover motifs by associating motif abundances with real-valued changes in genome-wide expression data. The REDUCE method enumerates all K-mers (DNA segments of length K) and checks whether the combinatorial effects of a set of K-mers can be used to explain changes of gene-expression data in a regression manner.

The Author. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oupjournals.org

Motif Regressor first uses MDSCAN (Liu et al., 2002) to generate a large set of matrix-based motif candidates that are enriched in the promoter regions of genes with the highest fold changes in gene expression data. Then it uses regression analyses to select motif candidates that are most relevant to the change of gene expressions. Nevertheless, neither approach exploits the potential of using negative sequences to change the parameters of a motif so as to increase the specificity of the model.

We propose a novel discriminant approach to enhance TF-DNA binding models using the boosting technique. First, we use the ChIP-chip data to select positive and negative sequences. In ChIP-chip experiments, DNA is crosslinked in vivo to proteins at sites of DNA-protein interaction and sheared to 500 bp-2 kb fragments. The DNA-protein complexes are precipitated by antibodies specific to the TF of interest. The precipitated protein-bound DNA fragments are PCR amplified, fluorescently labeled and hybridized to microarrays containing every promoter (sometimes also every ORF) in the genome. DNA fragments that are consistently enriched by ChIP-chip over repeated experiments are identified as positive sequences containing the protein-DNA interacting loci at 1 kb resolution. When compared with the gene-expression data, the ChIP-chip data provide much more accurate information about the genome-wide location of in vivo TF-DNA interactions, which enables us to assign definitive class labels to some promoter sequences with high confidence. Consequently, we can model the TF-DNA binding problem as a classification problem. We modify the confidence-rated boosting (CRB) algorithm (Schapire and Singer, 1999) to train a TF-DNA binding classifier as an ensemble model, which is a weighted combination of a set of base classifiers. The modified CRB algorithm automatically decides the number of base classifiers to be used so as to avoid overfitting.
A key aspect of the boosting technique is that it forces some of the base classifiers to focus on the boundary between positive and negative samples, thus effectively reducing classification errors. We demonstrate the power of this approach by its performance on the ChIP-chip data of Saccharomyces cerevisiae (Lee et al., 2002).

2 METHODS
2.1 The ensemble model
We define a TF-DNA binding model as a weighted combination of a set of base classifiers $\{q_m(\cdot)\}$:

$$Q(S_i) = \sum_m \alpha_m q_m(S_i), \qquad (1)$$

where $\alpha_m$ is the weight of $q_m(\cdot)$. The model weights can be normalized so that they sum up to 1. The class label of a DNA sequence $S_i$ is decided by $\mathrm{sign}(Q(S_i))$, with +1 denoting that $S_i$ is a positive sequence. The base classifier has its root in the weight matrix method (Stormo et al., 1982). Let $f_m(\cdot)$ be the weight matrix model on which $q_m(\cdot)$ is based, and let the set $\{s_{ij}\}$ represent all K-mers in a DNA sequence $S_i$. The score of a K-mer $s_{ij}$, given $f_m(\cdot)$, is:

$$f_m(s_{ij}) = \sum_{k=1}^{K} \sum_{b \in \{A,C,G,T\}} w^m_{kb} I_{kb}(s_{ij}) - t, \qquad (2)$$

where (1) $w^m_{kb}$ is the parameter (in the logarithm scale) of the model $f_m(\cdot)$ for the nucleotide b at position k; (2) $I_{kb}(s_{ij}) = 1$ if the k-th base of $s_{ij}$ is b and $I_{kb}(s_{ij}) = 0$ otherwise; (3) t is a threshold decided by some criteria (e.g. P-value). The higher the score, the more likely a site will be bound by the TF. The weight matrix model decides $s_{ij}$ as a target of the TF if $f_m(s_{ij}) > 0$ and a non-target site otherwise. We will show later that the threshold can be embedded into the parameter matrix $[w^m_{kb}]$.

In many situations (e.g. ChIP-chip experiments), we only have information about whether a DNA sequence is bound by a TF, but do not know which sites in the sequence the TF binds to. Hence, given a weight matrix, we need to derive a scoring function to assess the likelihood of a DNA sequence as a target of a TF. This score should be affected by: (1) the number of matching sites in the sequence; and (2) the degree of the match for each matching site.
The following function takes into account the above factors and scores a sequence as:

$$h_m(S_i) = \log \sum_{(r)} e^{f_m(s_{ir})}, \qquad (3)$$

where the sum is over the r best matching K-mers. This equation is similar to that proposed by Motif Regressor (Conlon et al., 2003). However, we limit it to the best r sites to avoid favoring very long sequences. Details for deciding the value of r are explained in Section 3.2. The base classifier $q_m(\cdot)$ transforms the score of a sequence with a hyperbolic tangent function to a soft class prediction:

$$q_m(S_i) = \frac{1 - e^{-h_m(S_i)}}{1 + e^{-h_m(S_i)}} = \frac{\sum_{(r)} e^{f_m(s_{ir})} - 1}{\sum_{(r)} e^{f_m(s_{ir})} + 1}. \qquad (4)$$

The hyperbolic tangent function is a scaled and biased logistic function, which has been used for motif site predictions (Barash et al., 2001; Segal et al., 2002).

2.2 Learn the ensemble model via boosting
We adopt the CRB algorithm (Schapire and Singer, 1999) to perform the following tasks in building an ensemble model $Q(\cdot)$: (1) deciding the number of linear classifiers $q_m(\cdot)$ in $Q(\cdot)$ and (2) learning the parameters of each $q_m(\cdot)$ and its weight $\alpha_m$. Loosely speaking, in the first round, the CRB algorithm assigns equal weights to all samples and trains the first base classifier. In each of the rounds that follow, the boosting procedure gives higher weights to previously misclassified samples and learns a new base classifier with its weight using the reweighted samples. The final classifier is a linear assembly of weighted base classifiers from each round.

We made some modifications to the CRB algorithm to serve our purpose better. The modified CRB algorithm is outlined in Figure 1. Our first change tries to accommodate the unbalanced training set (the number of negative samples is much larger than that of positive ones) by assigning larger initial weights to the positive samples. Second, to prevent overfitting, we reserve some training sequences for internal test during training. The details of our implementations are explained in the next section.
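As a concrete illustration of Equations (1)-(4), the following minimal Python sketch scores K-mers with a weight matrix, combines the r best site scores into a sequence score, and maps it to a soft prediction. The function names, the list-of-dicts matrix layout and the single-strand scan are assumptions made for this illustration, not the authors' implementation; sequences are assumed to be at least K bases long.

```python
import math

def kmer_score(weights, kmer, threshold=0.0):
    """Equation (2): sum of position-specific log-scale weights minus a threshold.
    `weights` is a list of K dicts mapping a base to its weight (assumed layout)."""
    return sum(weights[k][base] for k, base in enumerate(kmer)) - threshold

def sequence_score(weights, seq, r, threshold=0.0):
    """Equation (3): log of the summed exp-scores of the r best-matching K-mers."""
    K = len(weights)
    scores = [kmer_score(weights, seq[j:j + K], threshold)
              for j in range(len(seq) - K + 1)]
    best = sorted(scores, reverse=True)[:r]
    top = best[0]
    # log-sum-exp, shifted by the maximum for numerical stability
    return top + math.log(sum(math.exp(s - top) for s in best))

def base_classifier(weights, seq, r, threshold=0.0):
    """Equation (4): (e^h - 1) / (e^h + 1), which equals tanh(h / 2)."""
    return math.tanh(sequence_score(weights, seq, r, threshold) / 2.0)

def ensemble(models, seq):
    """Equation (1): weighted sum of base classifiers.
    `models` is a list of (alpha, weights, r, threshold) tuples (assumed layout)."""
    return sum(alpha * base_classifier(w, seq, r, t) for alpha, w, r, t in models)
```

A sequence would then be called positive when `ensemble(models, seq) > 0`. Writing Equation (4) as `tanh(h / 2)` avoids computing `e^h` directly, which can overflow for large scores.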
3 IMPLEMENTATION
3.1 Initialize the weights of sequences
In our study, the number of negative sequences (usually in thousands) is often much larger than the positive ones (usually <100). Without proper adjustments, negative sequences would overwhelm a classifier and reduce its capability of recognizing positive sequences. As a remedy, we constrain the total weight of the positive sequences to be equal to that of the negative sequences (step b in Fig. 1). The sequences within each class have equal weights. This in effect imposes a higher penalty for misclassifying a positive sequence than misclassifying a negative one. Note that this heuristic is not equivalent to increasing the number of positive observations.

3.2 Learn base classifiers
The CRB algorithm (Schapire and Singer, 1999) is a Newton-like algorithm that constructs an ensemble model to minimize the upper bound on misclassification error

$$\mathrm{Err} = \sum_i d_i^{(1)} \exp(-y_i Q(S_i)), \qquad (5)$$

(a) Randomly reserve part of the training data for internal test. The remaining n training sequences and their class labels are denoted as $(S_1, y_1), \ldots, (S_n, y_n)$; $y_i \in \{-1, 1\}$.
(b) Initialize the weights of sequences $d_i^{(1)}$ ($i = 1, \ldots, n$).
(c) For $m = 1, \ldots, M$
(c.1) Train the parameters of $q_m(\cdot)$ and its weight $\alpha_m$ using the weighted sequences with the weights $\{d_i^{(m)}\}$.
(c.2) Update sequence weights: $d_i^{(m+1)} = \dfrac{d_i^{(m)} \exp(-\alpha_m y_i q_m(S_i))}{\sum_j d_j^{(m)} \exp(-\alpha_m y_j q_m(S_j))}$
(c.3) Use the reserved data to check if the overall model overfits the training data. Roll back ($m = m - 1$) and stop if it overfits.
(d) Output the final model $Q(\cdot) = \sum_m \alpha_m q_m(\cdot)$.
Fig. 1. The modified boosting algorithm.

In Equation (5), $d_i^{(1)}$ is the initial weight of $S_i$ and $y_i$ is the class label of $S_i$. Friedman et al. (2000) give a detailed discussion of the rationale for choosing the above criterion. In the m-th round, the CRB algorithm trains $q_m(\cdot)$ and its weight $\alpha_m$ to minimize the weighted error:

$$\varepsilon_m = \sum_i d_i^{(m)} \exp(-\alpha_m y_i q_m(S_i)), \qquad (6)$$

where $d_i^{(m)}$ is the weight of $S_i$ in the m-th round. In our case, the parameters to be estimated in each round include $\alpha_m$, r and $[w^m_{kb}]$. Basically, at step c.1 in Figure 1, we increase r from 1 to R (currently R = 5) by the step size 1. For each value of r, the parameters $\alpha_m$ and $[w^m_{kb}]$ are initialized and refined to minimize the weighted error. Finally, the m-th round reports the values of r, $\alpha_m$ and $[w^m_{kb}]$ that correspond to the minimum weighted error.

Initialization
Since the motif must be an enriched pattern in the positive sequences, we take advantage of Motif Regressor (Conlon et al., 2003) to generate a good seed weight matrix for initializing $[w^m_{kb}]$. The seed weight matrix, reported by Motif Regressor, has the best correlation between the logarithm of ChIP-chip P-value and motif-matching score of all training sequences. Let $[w^0_{kb}]$ be the seed weight matrix. Given a value of r, we initialize $\alpha_m$ and $[w^m_{kb}]$ as $\alpha_m(0) = 1$ and $w^m_{kb}(0) = w^0_{kb} + (\sigma_{kb} - t/K)$, respectively, where $\sigma_{kb}$ is randomly generated in the range [-0.2, 0.2] and t is the threshold as in Equation (2).
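The sample weighting in Figure 1 (steps b and c.2) and the weighted error of Equation (6) can be sketched in a few lines of Python. The function names are hypothetical, and scaling each class to a total weight of 0.5 is an assumption of this sketch: the text only requires that the two class totals be equal. Both classes are assumed to be non-empty.

```python
import math

def initial_weights(labels):
    """Step (b): equal total weight per class, equal weights within a class.
    The 0.5-per-class scaling is an assumption of this sketch."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    return [0.5 / n_pos if y == 1 else 0.5 / n_neg for y in labels]

def weighted_error(d, labels, preds, alpha):
    """Equation (6): sum_i d_i * exp(-alpha * y_i * q_m(S_i))."""
    return sum(di * math.exp(-alpha * y * q)
               for di, y, q in zip(d, labels, preds))

def update_weights(d, labels, preds, alpha):
    """Step (c.2): scale each weight by exp(-alpha * y_i * q_m(S_i)),
    then renormalize so the weights sum to 1."""
    raw = [di * math.exp(-alpha * y * q)
           for di, y, q in zip(d, labels, preds)]
    z = sum(raw)
    return [x / z for x in raw]
```

With one positive and three negative sequences, the positive starts at weight 0.5 and each negative at 1/6, so a misclassified positive costs three times as much as a misclassified negative. After an update, sequences whose soft prediction disagrees with their label gain weight relative to the others.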
The value of t is determined as follows. We first use the matrix $[w^0_{kb} + \sigma_{kb}]$ to score all sites in the training sequences and obtain the minimum and maximum site scores as $t_{\min}$ and $t_{\max}$. Then, we increase t from $t_{\min}$ to $t_{\max}$ by the step size 0.1 and select the value that corresponds to the minimum weighted error under the current values of r and $\alpha_m$.

Refinement
The parameters $[w^m_{kb}]$ and $\alpha_m$ are iteratively refined by a gradient-like method. In the n-th iteration ($n \geq 1$), use $[w^m_{kb}(n-1)]$ to find the best r sites in each sequence as its representative sites, and update $[w^m_{kb}(n)]$ and $\alpha_m(n)$ based on the corresponding gradients of the weighted error, i.e.:

$$w^m_{kb}(n) = w^m_{kb}(n-1) - \frac{\eta_1}{1 + n/10} \frac{\partial \varepsilon_m(n-1)}{\partial w^m_{kb}(n-1)}$$
$$\alpha_m(n) = \alpha_m(n-1) - \frac{\eta_2}{1 + n/10} \frac{\partial \varepsilon_m(n-1)}{\partial \alpha_m(n-1)}, \qquad (7)$$

where the update rates are set as $\eta_1 = 0.05$ and $\eta_2 = 0.1$ based on our experience. The iteration stops if (1) the weighted error increases, (2) the improvement of error falls below a preset threshold or (3) the maximum number of iterations (currently 100) is reached. Note that a site $s_{ij}$ is now scored as $\sum_{k=1}^{K} \sum_{b \in \{A,C,G,T\}} w^m_{kb}(n) I_{kb}(s_{ij})$, which is slightly different from Equation (2). The threshold t in Equation (2) is absorbed by $[w^m_{kb}(n)]$ and is updated implicitly.

3.3 Prevent overfitting
A main challenge with the small number of positive samples is that one can easily overtrain the classifiers. Our strategy to alleviate this effect is to reserve a subset of the negative training sequences (5% in our current setting) and one positive training sequence for internal validation during training. The sequences are randomly selected. The weight of each reserved sequence is set as the initial weight of a training sequence with the same class label. Overfitting is checked using the reserved data at step c.3 in Figure 1. The boosting procedure will stop if adding one more base classifier increases the error [as defined in Equation (5)] for the reserved sequence set. Sometimes, the ensemble model may have only one base classifier, say $q_1(\cdot)$.
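The refinement schedule of Equation (7) and the reserved-set check of step c.3 can be illustrated with a simplified Python sketch. To keep it short, only $\alpha_m$ is refined and the soft predictions $q_m(S_i)$ are held fixed, whereas the method described above also updates the weight matrix in each iteration. The names `refine_alpha`, `rounds_to_keep` and the tolerance `tol` are hypothetical, and the value of `tol` is a placeholder rather than a value taken from the text.

```python
import math

def weighted_error(alpha, d, labels, preds):
    """Equation (6) for fixed soft predictions q_m(S_i)."""
    return sum(di * math.exp(-alpha * y * q)
               for di, y, q in zip(d, labels, preds))

def refine_alpha(d, labels, preds, alpha0=1.0, eta2=0.1, tol=1e-6, max_iter=100):
    """Gradient steps on alpha with the decaying rate eta2 / (1 + n/10)."""
    alpha = alpha0
    err = weighted_error(alpha, d, labels, preds)
    for n in range(1, max_iter + 1):
        grad = sum(-di * y * q * math.exp(-alpha * y * q)
                   for di, y, q in zip(d, labels, preds))
        candidate = alpha - eta2 / (1.0 + n / 10.0) * grad
        new_err = weighted_error(candidate, d, labels, preds)
        if new_err > err:        # stopping rule (1): the weighted error increased
            break
        improvement = err - new_err
        alpha, err = candidate, new_err
        if improvement < tol:    # stopping rule (2): tol is a placeholder value
            break
    return alpha, err

def rounds_to_keep(d_reserved, y_reserved, rounds):
    """Step (c.3): `rounds` lists (alpha_m, predictions on the reserved set).
    Stop before the round that increases the Equation (5) error."""
    total = [0.0] * len(y_reserved)
    best_err, kept = None, 0
    for alpha, preds in rounds:
        trial = [t + alpha * q for t, q in zip(total, preds)]
        err = sum(d * math.exp(-y * q)
                  for d, y, q in zip(d_reserved, y_reserved, trial))
        if best_err is not None and err > best_err:
            break
        total, best_err, kept = trial, err, kept + 1
    return kept
```

The first base classifier is always kept in this sketch, which matches the remark that the ensemble may end up with a single base classifier.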
We build a base classifier $q_\upsilon(\cdot)$ with its parameters as $r_\upsilon$ and $[w^0_{kb} - t_\upsilon/K]$, where $r_\upsilon$ and $t_\upsilon$ are decided by the initialization method (without $\sigma$) described in Section 3.2. The weight of $q_\upsilon(\cdot)$ is set as 1. We compare $q_\upsilon(\cdot)$ with $q_1(\cdot)$ and choose the one with a smaller weighted error as defined in Equation (5). The rationale for this step is that the current way for training base classifiers may not find the best one. This limitation can be amended by a weighted combination of multiple base classifiers. If the final model has only one base classifier, $q_\upsilon(\cdot)$ could be a better alternative.

4 RESULTS
4.1 Data
We used the ChIP-chip data reported in Lee et al. (2002). Positive sequences are selected using ChIP-chip P-value as the cutoff. At this cutoff selection, the false positive rate is 6-10% and the false negative rate is 33% (Lee et al., 2002). Although the data are still noisy, they are the best genome-wide data of in vivo TF-DNA binding localization so far. To avoid having too few positive samples, we also required that each selected TF should have at least 25 positive sequences. Forty TFs (Lee et al., 2002) satisfy these criteria. Negative sequences were selected as those with ChIP-chip

ratio 1 and ChIP-chip P-value. Each selected TF has 3000 negative sequences. For each gene, we take its upstream sequence, up to 800 bp, not overlapping with the previous gene.

Table 1. Data summary and cross-validation results for 31 ChIP-chip data
Column headers: TF; Pos seq (no.); Neg seq (no.); Base classifiers (no.); Average FP of weight matrix; Average FP of boosting; Improvement of boosting over weight matrix (%)
Rows (TF): ABF ACE BAS CAD CBF CIN DAL FHL FKH FKH GCN HAP HSF MBP MCM NRG PDR PHD RAP REB RLM SKN SMP STE SUM SWI SWI SWI YAP YAP YAP
Columns 1, TF names; 2, number of positive sequences; 3, number of negative sequences; 4, number of base classifiers in the boosted classifier; 5, number of false positives $FP_w$ using the weight matrix reported by Motif Regressor as a classifier; 6, number of false positives $FP_b$ of the boosting method; 7, percentage of improvement of the boosting method over the weight matrix method, measured as $(FP_w - FP_b)/FP_w$.

4.2 Boosting improves the specificity of motif models
To evaluate our method, we used the following cross-validation procedure. In each run, we leave one positive sequence and 5% of randomly selected negative sequences as the test data and train a classifier on the remaining data. This procedure is repeated 10 times for each positive sequence. The cross-validation error of each run is calculated as the number of false positives if the number of the false negatives is zero. The results are then averaged for all runs and compared. The detailed data, which include the sequence data, the ensemble models of the TFs, the logos of the ensemble models and all the test results, are available as the Supplementary data at hong2004/motifbooster/.

We used Motif Regressor (Conlon et al., 2003) to find the seed weight matrix. For each TF, Motif Regressor called MDSCAN (Liu et al., 2002) to find candidate motifs of width 6-17 bases. At each width, MDSCAN reported the best 20 weight matrices enriched in the positive training sequences. Each weight matrix was used to score the training sequences.
Motif Regressor then performed simple linear regression between the logarithm of ChIP-chip P-values and sequence scores. We chose the motif corresponding to the best regression P-value as our seed motif. We observed that Motif Regressor did not find significant enough motifs for nine TFs (DIG1, GAL4, GAT3, GCR2, IME4, IXR1, NND1, PHO4 and ROX1). It is possible that under the asynchronized growth condition, these TFs were not activated, or the modified tagged TFs have changed their binding characteristics.

Table 1 summarizes the results for the remaining 31 TFs. Compared with the weight matrix reported by Motif Regressor, the ensemble models performed markedly better in 27 cases and evenly in 4 cases (FKH1, FKH2, RLM1 and YAP6). A closer examination of the four even cases reveals that each ensemble model only has one base classifier that is a direct conversion from the initial weight matrix. The boosting approach also reported final models with a single base classifier in 5 of the 27 cases that performed better. These five TFs are CIN5, MBP1, NRG1, SKN7 and STE12. Since the base classifier is equivalent to a weight matrix model, these results indicate that

using negative information can help discover better weight matrices in many cases. This is consistent with the findings of Workman and Stormo (2000). However, the first base classifier does not always perform better than the initial weight matrix. Table 2 summarizes the contributions of the base classifiers for the cases where the boosting method selected more than one base classifier. The base classifiers in the final models are arranged in the descending order of their weights. The performances of 13 first base classifiers, i.e. the ones with the largest weights, are worse than those of the weight matrices reported by Motif Regressor. This may suggest that when the binding sites of a TF are heterogeneous and may be grouped into clusters, our boosting method finds base classifiers corresponding to different cluster profiles, whereas Motif Regressor reports an average profile. Thus, a single base classifier may be too specific to a particular cluster and does not discriminate well globally.

Table 2. Contributions of the base classifiers (BCs) in the leave-one-out cross-validation tests
Column headers: TF; BC no.; Average FP of WM; Average FP of BC 1; Average FP of BC (1 + 2); Average FP of BC (1 + 2 + 3); Average FP of BC (1 + 2 + 3 + 4)
Rows (TF): ABF ACE BAS CAD CBF DAL FHL GCN HAP HSF MCM PDR PHD RAP REB SMP SUM SWI SWI SWI YAP YAP
Columns 1-7 are the TF names, number of BCs in the ensemble model, number of false positives of the weight matrix method and number of false positives of the ensemble model when its first 1, 2, 3 and 4 BCs are used, respectively. We order the base classifiers in each ensemble model so that their weights are in the descending order.

5 DISCUSSION
For some cases, the ensemble model can reveal dependences among motif positions. For example, Figure 2a displays the weight matrix found by Motif Regressor for RAP1, from which we can see that C and T dominate in position 5, and A and G dominate in position 8. But there is no further information on how these two positions might correlate with each other. In contrast, our boosting approach selected three base classifiers (Fig. 2b-d) to compose the final model. Two base classifiers favored C and A in positions 5 and 8, respectively, whereas the third one preferred T and G in those positions, respectively. This observation implies that positions 5 and 8 may cooperate in a certain way such that the change in one position correlates with the change in the other. As another example, we observe that positions 1, 10 and 13 of the REB1 motif (Fig. 3) can be decomposed in a similar way. In its first base classifier, position 13 strongly prefers G; positions 1 and 10 are ambivalent about G and C, respectively. In the second base classifier, however, position 13 strongly disfavors G, and positions 1 and 10 strongly favor G and C, respectively. This suggests that the three positions may cooperate to facilitate the protein-DNA binding.

The boosting approach terminates with an ensemble of 2-3 base classifiers for most cases. This is atypical for applications using the boosting technique, which usually can boost for hundreds to thousands of base classifiers. The small number of base classifiers could be due to three reasons. The first reason might be the unbalanced training data (100 positive versus 3000 negative sequences). We examined the sensitivity and specificity of each base classifier alone using the training samples (Fig. 4a). The sensitivity of base classifiers spreads out in the range of 40-90%, while their specificity concentrates in the range of 75-95%. This suggests that it is easier to train base classifiers to recognize negative samples in our case, although the negative samples are more heterogeneous than the positive ones. We modify the boosting algorithm by adding more initial weights to the positive samples such that the initial total weights of two classes are equal. We note that although this method helps to bring out a less biased classifier, it is not equivalent to increasing the number of positive observations. As shown in Figure 4b, base classifiers with higher sensitivity tend to have lower generalization errors. A similar trend can be observed for the specificity of base classifiers in Figure 4c. Figure 5a shows that it is more


likely to train base classifiers with relatively low training sensitivity and specificity when the number of positive sequences is small. Moreover, base classifiers trained with fewer positive samples are more likely to have higher generalization errors (Fig. 5b). Based on the above analyses, we reason that (1) base classifiers hardly overfit the training data in most cases and (2) the small size of positive samples does not provide enough information to boost for more base classifiers.

Fig. 2. Logos of the binding models of RAP1. (a) Position specific probability matrix. Logo of the weight matrix reported by Motif Regressor, drawn using the method of Schneider and Stephens (1990). (b), (c) and (d): Logos of the base classifiers 1, 2 and 3, respectively, in the ensemble model reported by the boosting approach (weight of base classifier 1 = 0.31; weight of base classifier 2 = 0.30; weight of base classifier 3 = 0.39). Base classifiers have negative parameters and cannot be visualized in the same way. (b), (c) and (d) are drawn in the following way. The height of a letter corresponds to the absolute magnitude of its weight scaled by a factor k (for visualization purposes, k = 3 for positive weights and k = 1 for negative weights). Letters are ordered by their weights. The black horizontal line represents zero. Letters above the zero line have positive weights, and those below the zero line have negative weights.

Fig. 3. Logos of the ensemble model of REB1. (a) The logo of base classifier 1 (Weight = 0.52). (b) The logo of base classifier 2 (Weight = 0.47).

Second, the binding mechanisms of some TFs may indeed depend almost linearly on the nucleotide types of the motif positions. For example, ABF1 has a much larger positive sample size (176) when compared with other TFs. Both the weight matrix and the ensemble model of ABF1 have low and comparable generalization errors (Table 1). The ensemble model has two base classifiers. The training sensitivity/specificity of the base classifiers are 93.18/94.66% and 90.34/95.58%.
These results suggest that the binding mechanism of ABF1 may have little non-linearity because its samples can be well classified by linear decision rules including the weight matrix and the base classifiers. The base classifier becomes a strong learner (i.e. it can explain most of the training data) in such a case. On the other hand, the mild performances of many other base classifiers suggest that the binding mechanisms of some other TFs could have relatively high non-linearity.

Finally, our approach initializes a base classifier using a seed matrix. The successive refining step may only explore a limited subspace around the seed matrix. The training of base classifiers can be improved by a sampling-based de novo motif finding algorithm that is capable of exploring a wider range of the solution space (e.g. by sampling at multiple temperature levels). Or we can replace the base learner with a simpler one, e.g. a simple decision tree that uses rules like whether a position should be C or not, etc. With the above modifications, the ensemble model could have more base classifiers and capture more comprehensive features that lead to better classification performance. Nonetheless, the resultant base classifiers could be very diverse. Some base classifiers could represent highly degenerate motifs. One potential drawback of this alternative is the loss of biological interpretability of the ensemble model. Although it is still not perfectly understood why the number of base classifiers is small, our approach provides a good balance between the interpretability and the performance of the boosted models. Another choice for improving the boosted models is to train each base classifier only on a randomly selected subset of the full training set, as suggested

by Friedman (2002). It was reported that this kind of randomness has advantages in the situations of small samples and powerful weak learners.

Fig. 4. (a) The training sensitivity (horizontal axis) versus specificity (vertical axis) plot of the base classifiers. Star, circle, diamond and pentagram denote the sensitivity/specificity of the base classifiers 1, 2, 3 and 4, respectively. (b) The cross-validation false positive rate (FPR) versus training sensitivity (horizontal axis) plot of the base classifier 1. (c) The cross-validation FPR versus training specificity (horizontal axis) plot of the base classifier 1.

Fig. 5. The result plots of the first base classifiers. (a) Training sensitivity (star) and specificity (circle) versus number of positive sequences (horizontal axis). (b) Cross-validation FPR versus number of positive sequences (horizontal axis).

6 CONCLUSION
We introduce a boosting-based method for modeling TF-DNA binding. By repeatedly fitting weight matrix based classifiers to weighted samples that focus on erroneous classifications, the boosting approach can build a more accurate TF-DNA binding model as a weighted combination of the base classifiers. The proposed approach was applied to the ChIP-chip data of S.cerevisiae and showed significant improvements in specificity in many cases. Like many recent studies that use mRNA microarray data to help refine regulatory binding motifs and infer combinatorial rules of transcription regulation (W. Wang et al., submitted for publication; Beer and

Tavazoie, 2004), we found that ChIP-chip data can be used to further refine motif models and reveal novel features of TF-DNA interactions. Currently, we use Motif Regressor to generate the seed motif for boosting. However, our algorithm is not limited to working with Motif Regressor and can be used to boost weight matrices reported by any motif finding algorithm.

ACKNOWLEDGEMENTS
The work of W.H.W. is supported by NIH-HG. The work of J.S.L. is supported by NIH-P20-CA96470 and NSF DMS. The work of P.H. is supported by NIH-GM. We thank the anonymous reviewers for constructive suggestions that helped us to unify the way to initialize and train base classifiers and inspired us to think hard on the overfitting issue of the ensemble models.

REFERENCES
Agarwal,P.K. and Bafna,V. (1998) Detecting non-adjoining correlations with signals in DNA. In Proceedings of the Second Annual International Conference on Research in Computational Molecular Biology, March 22-25, 1998, New York, USA. ACM Press.
Bailey,T.L. and Elkan,C. (1994) Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc. Int. Conf. Intell. Syst. Mol. Biol., 2.
Barash,Y. et al. (2001) A simple hyper-geometric approach for discovering putative transcription factor binding sites. In Algorithms in Bioinformatics: Proceedings of the 1st International Workshop, LNCS 2149.
Barash,Y. et al. (2003) Modeling dependences in protein-DNA binding sites. In Proceedings of the 7th Annual International Conference on Computational Molecular Biology (RECOMB 2003), Berlin, Germany, ACM Press, NY.
Beer,M.A. and Tavazoie,S. (2004) Predicting gene expression from sequence. Cell, 117.
Bulyk,M.L. et al. (2001) Exploring the DNA-binding specificities of zinc fingers with DNA microarrays. Proc. Natl Acad. Sci. USA, 98.
Bulyk,M.L. et al. (2002) Nucleotides of transcription factor binding sites exert interdependent effects on the binding affinities of transcription factors. Nucleic Acids Res., 30.
Bussemaker,H.J. et al. (2001) Regulatory element detection using correlation with expression. Nat. Genet., 27.
Conlon,E.M. et al. (2003) Integrating regulatory motif discovery and genome-wide expression analysis. Proc. Natl Acad. Sci. USA, 100.
Friedman,J.H. (2002) Stochastic gradient boosting. Comput. Stat. Data Anal., 38.
Friedman,J.H. et al. (2000) Additive logistic regression: a statistical view of boosting (With discussion and a rejoinder by the authors). Ann. Statist., 28.
Lawrence,C.E. et al. (1993) Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262.
Lawrence,C.E. and Reilly,A.A. (1990) An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins, 7.
Lee,T.I. et al. (2002) Transcriptional regulatory networks in Saccharomyces cerevisiae. Science, 298.
Liu,J.S. et al. (1995) Bayesian models for multiple local sequence alignment and Gibbs sampling strategies. J. Am. Stat. Assoc., 90.
Liu,X.S. et al. (2002) An algorithm for finding protein-DNA binding sites with applications to chromatin immunoprecipitation microarray experiments. Nat. Biotechnol., 20.
Man,T.K. and Stormo,G.D. (2001) Non-independence of Mnt repressor-operator interaction determined by a new quantitative multiple fluorescence relative affinity (QuMFRA) assay. Nucleic Acids Res., 29.
Schapire,R. and Singer,Y. (1999) Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37.
Schneider,T.D. and Stephens,R.M. (1990) Sequence logos: a new way to display consensus sequences. Nucleic Acids Res., 18.
Segal,E. et al. (2002) From promoter sequence to expression: a probabilistic framework. In Proceedings of the 6th International Conference on Research in Computational Molecular Biology (RECOMB 02), Washington, DC, ACM Press.
Sinha,S. (2002) Discriminative motifs. In Proceedings of the 6th International Conference on Research in Computational Molecular Biology (RECOMB 02), Washington, DC, ACM Press.
Stormo,G.D. and Hartzell,G.W.III (1989) Identifying protein-binding sites from unaligned DNA fragments. Proc. Natl Acad. Sci. USA, 86.
Stormo,G.D. et al. (1982) Use of the Perceptron algorithm to distinguish translational initiation sites in E.coli. Nucleic Acids Res., 10.
Takusagawa,K. and Gifford,D. (2004) Negative information for motif discovery. Pac. Symp. Biocomput.
Vilo,J. et al. (2000) Mining for putative regulatory elements in the yeast genome using gene expression data. Proc. Int. Conf. Intell. Syst. Mol. Biol., 8.
Workman,C.T. and Stormo,G.D. (2000) ANN-Spec: a method for discovering transcription factor binding sites with improved specificity. Pac. Symp. Biocomput.
Zhou,Q. and Liu,J. (2004) Modeling within-motif dependence for transcription factor binding site predictions. Bioinformatics, 20.
