Bayesian Classifier Combination


Zoubin Ghahramani and Hyun-Chul Kim*
Gatsby Computational Neuroscience Unit
University College London
London WC1N 3AR, UK
http://www.gatsby.ucl.ac.uk
{zoubin,hckim}@gatsby.ucl.ac.uk

September 8, 2003

Abstract

Bayesian model averaging linearly mixes the probabilistic predictions of multiple models, each weighted by its posterior probability. This is the coherent Bayesian way of combining multiple models only under very restrictive assumptions, which we outline. We explore a general framework for Bayesian model combination (which differs from model averaging) in the context of classification. This framework explicitly models the relationship between each model's output and the unknown true label. The framework does not require that the models be probabilistic (they can even be human assessors), that they share prior information or receive the same training data, or that they be independent in their errors. Finally, the Bayesian combiner does not need to believe any of the models is in fact correct. We test several variants of this classifier combination procedure, starting from a classic statistical model proposed by [2] and using MCMC to add more complex but important features to the model. Comparisons on several datasets to simpler methods like majority voting show that the Bayesian methods not only perform well but result in interpretable diagnostics on the data points and the models.

1 Introduction

There are many methods available for classification. When faced with a new problem, where one has little prior knowledge, it is tempting to try many different classifiers in the hope that combining their predictions would give good performance. This has led to the proliferation of classifier combination, a.k.a. ensemble learning, methods [3]. The Bayesian model averaging (BMA) framework appears to be ideally suited to combining the outputs of multiple classifiers. However, this is misleading. Before we discuss Bayesian classifier combination (BCC), the topic of this paper, let us review BMA and outline why it is not the right framework for combining classifiers.¹

* The work was done while H.-C. K. was a visiting student from POSTECH, South Korea.
¹ We have focused on classification, although many of the ideas carry forth to other modelling problems; we return to this in the discussion.

Assume there are K different classifiers. Bayesian model averaging starts with a prior over the classifiers, p(k) for the k-th classifier. This is meant to capture the (prior) belief in each classifier. Then we observe some data D, and we compute the marginal likelihood or model evidence p(D|k) for each k (which can involve integrating out the parameters of the classifier). Using Bayes' rule we compute the posterior p(k|D) = p(k)p(D|k)/p(D), and we use these posteriors to weight the classifiers' predictions:

    p(t | x, D) = \sum_{k=1}^{K} p(t, k | x, D) = \sum_{k=1}^{K} p(t | x, k, D) \, p(k | D)        (1)

where x denotes a new input data point and t the predicted class label associated with that data point. The key element of this well-known procedure is that the predictive distribution of each classifier is linearly weighted by its posterior probability. While this approach is appealing and well-motivated from a Bayesian framework, it suffers from three important limitations:

1) It is only valid if we believe that the K classifiers capture mutually exclusive and exhaustive possibilities about how the data was generated. In fact, we might not believe at all that any of the K classifiers reflects the true data generation. However, we may still want to be able to combine them to form a prediction.

2) For many classification methods available in the machine learning community, it is not possible to compute, or even define, the marginal likelihood (for example, C4.5, kNN, etc.). Moreover, one should in principle be able to include human experts in any classifier combination framework. The human expert would not naturally define a likelihood function from which marginal likelihoods can be computed.

3) Not all classifiers may have observed the same data or started with the same prior assumptions. The Bayesian framework described above would have difficulties dealing with such cases, since the posterior is computed by conditioning on the same data set.

Here we propose an approach to Bayesian classifier combination which does not assume that any of the classifiers is the true one. Moreover, it does not require that the classifiers be probabilistic; they can even be human experts. Finally, the classifiers can embody widely different prior assumptions about the data, and have observed different data sets.

There are well-known techniques for classifier combination, so-called ensemble methods [3, 9],² such as bagging, boosting, and dagging. These methods try to make individual classifiers different by training them with different training sets or by weighting data points differently. This is because it is important to make the individual classifiers as independent as possible for ensemble methods to work well. In this work, we do not restrict how the individual classifiers are trained, but instead assume they are given and fixed.

Another powerful and general method, called stacked generalisation, can be used to combine lower-level models [10]. Stacking methods for classifier combination use another classifier which has as inputs both the original inputs and the outputs of the individual classifiers. Stacking can be combined with bagging and dagging [9].

² Note that the term ensemble learning has also been used in the Bayesian literature in a different context, to refer to approximate Bayesian model averaging using variational methods.
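As a concrete illustration of the BMA procedure reviewed in equation (1), the short sketch below (ours, not part of the paper) forms the BMA prediction from each classifier's predictive distribution and log marginal likelihood; the function name and array shapes are illustrative assumptions.

```python
import numpy as np

def bma_predict(pred_dists, log_evidences, log_prior=None):
    """Bayesian model averaging as in eq. (1): weight each classifier's
    predictive distribution p(t | x, k, D) by its posterior p(k | D).

    pred_dists:    (K, J) array, row k is p(t | x, k, D) over the J classes
    log_evidences: (K,) array of log marginal likelihoods log p(D | k)
    log_prior:     (K,) array of log p(k); uniform if None
    """
    pred_dists = np.asarray(pred_dists, dtype=float)
    log_evidences = np.asarray(log_evidences, dtype=float)
    K = len(log_evidences)
    if log_prior is None:
        log_prior = np.full(K, -np.log(K))
    log_post = log_prior + log_evidences      # log p(k) + log p(D | k)
    log_post -= log_post.max()                # stabilise before exponentiating
    post = np.exp(log_post)
    post /= post.sum()                        # p(k | D) by Bayes' rule
    return post @ pred_dists                  # the linear mixture of eq. (1)

# Toy usage: three classifiers, three classes
print(bma_predict([[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.8, 0.1]],
                  log_evidences=[-120.0, -121.0, -135.0]))
```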

In this work, we do not use the input vectors and we explicitly model the errors and correlations between individual classifiers. Therefore, our work deals with a different problem from those which are usually handled using ensemble and stacking methods. It should be possible to extend our method to encompass a fully Bayesian generalisation of stacking, but we leave this for future work.

The method we propose for Bayesian classifier combination in a machine learning context is directly derived from the method proposed in [5] for modelling disagreement between human assessors, which in turn is an extension of [2]. This method assumes individual classifiers are independent, which is often unrealistic and results in limited performance. We therefore start with these models and propose three extensions for modelling the correlations between individual classifiers. The literature on combining probability distributions is quite extensive, and reviews of other methods, including linear, logarithmic and multivariate normal opinion pools, can be found in [4] and [6].

2 Independent Models for Bayesian Classifier Combination

2.1 Probabilistic Model for Classifier Combination

We describe the method proposed in [2] with the view of applying it to classifier combination. For the i-th data point, we assume the true label t_i is generated by a multinomial distribution with parameters p: p(t_i = j | p) = p_j. Then, we assume that the output c_i^{(k)} of classifier k is generated by a multinomial distribution with parameters \pi_j^{(k)}: p(c_i^{(k)} | t_i = j) = \pi^{(k)}_{j, c_i^{(k)}}. For simplicity we assume that the classifiers have discrete outputs, i.e. c_i^{(k)} \in \{1, ..., J\}, where J is the number of classes. The extension to individual classifiers which output probability distributions is obviously important and will be explored in the future. The matrix \pi^{(k)} captures the confusion matrix for classifier k.

If we assume that the classifier outputs are independent given the true label t_i, we get p(c_i, t_i | p, \pi) = p_{t_i} \prod_{k=1}^{K} \pi^{(k)}_{t_i, c_i^{(k)}}, where c_i denotes the vector of class labels over all classifiers. If we further assume that labels across data points are independent and identically distributed, we obtain the likelihood

    p(c, t | p, \pi) = \prod_{i=1}^{I} \left\{ p_{t_i} \prod_{k=1}^{K} \pi^{(k)}_{t_i, c_i^{(k)}} \right\}        (2)

Usually, c_i^{(k)} is known and the other variables and parameters are unknown. By considering the t_i as hidden variables, we can apply the EM algorithm to find ML estimates for p and \pi. This is the approach taken in [2], and we also provide further details in a longer version of this paper [7]. It should be noted that not only does this perform classifier combination, but it also provides estimates of interpretable quantities such as the confusion matrices.
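The EM procedure below is our sketch of how ML estimation for the likelihood in equation (2) could look, in the spirit of Dawid and Skene [2]; the initialisation, smoothing constant and variable names are our own assumptions rather than details taken from [2] or [7].

```python
import numpy as np

def em_classifier_combination(c, J, n_iter=100):
    """EM for the independent model of Section 2.1: c is an (I, K) integer
    array of observed classifier outputs in {0, ..., J-1}, one column per
    classifier; returns estimates of p, pi and the posterior q over t."""
    I, K = c.shape
    p = np.full(J, 1.0 / J)                      # class proportions
    pi = np.full((K, J, J), 1.0 / J)             # pi[k, j, l] = p(c^(k)=l | t=j)
    pi += 0.5 * np.eye(J)                        # bias towards the diagonal
    pi /= pi.sum(axis=2, keepdims=True)
    for _ in range(n_iter):
        # E-step: posterior over the hidden true label t_i, term by term as in eq. (2)
        log_q = np.tile(np.log(p), (I, 1))
        for k in range(K):
            log_q += np.log(pi[k][:, c[:, k]].T)  # adds log pi^(k)_{j, c_i^(k)}
        q = np.exp(log_q - log_q.max(axis=1, keepdims=True))
        q /= q.sum(axis=1, keepdims=True)
        # M-step: re-estimate the class proportions and confusion matrices
        p = q.mean(axis=0)
        for k in range(K):
            counts = np.zeros((J, J))
            for l in range(J):
                counts[:, l] = q[c[:, k] == l].sum(axis=0)
            counts += 1e-6                       # tiny smoothing to avoid empty rows
            pi[k] = counts / counts.sum(axis=1, keepdims=True)
    return p, pi, q
```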

2.2 Independent BCC Model

A Bayesian treatment of the probabilistic model in Section 2.1 was recently proposed in [5] for combining multiple human raters. They also considered multiple ratings (i.e. c_1^{(k)}, ..., c_M^{(k)}) of the same input vector by the same raters. Since artificial classifiers are not usually variable in how they respond to the same input, we do not consider replicates in the ratings.

The Bayesian model needs priors on the parameters; we used hierarchical conjugate priors. A row of the confusion matrix, \pi_j^{(k)} = [\pi^{(k)}_{j,1}, \pi^{(k)}_{j,2}, ..., \pi^{(k)}_{j,J}], is modeled to have a Dirichlet distribution with hyperparameters \alpha_j^{(k)} = [\alpha^{(k)}_{j,1}, \alpha^{(k)}_{j,2}, ..., \alpha^{(k)}_{j,J}]. The prior distribution of \alpha^{(k)}_{j,l} is modeled by an exponential distribution with parameter \lambda_{j,l}. All rows are assumed independent within and across classifiers; even so, it is easy to bias the prior to prefer diagonal confusion matrices. (Detailed expressions are provided in the longer version of the paper [7].) The prior for the class proportions p is also set to be Dirichlet, with hyperparameters \nu.

Based on the above prior, we can get the posterior for all random variables given the observed class labels. Since we assumed independence among classifiers (as in [5]), the posterior density is

    p(p, \pi, t, \alpha | c) \propto \left\{ \prod_{i=1}^{I} p_{t_i} \prod_{k=1}^{K} \pi^{(k)}_{t_i, c_i^{(k)}} \right\} p(p | \nu) \, p(\pi | \alpha) \, p(\alpha | \lambda)        (3)

We call this model the Independent Bayesian Classifier Combination (IBCC) model. The graphical model for IBCC is shown in Fig. 1.

Figure 1: The directed graphical model for IBCC, with plates over classifiers K and data points I.

Inference for the unknown random variables p, \pi, t, and \alpha can be done via Gibbs sampling. Since the conditional densities of p and \pi_j^{(k)} are both Dirichlet, they can be sampled easily; t_i can also be sampled, since its conditional is a multinomial distribution. However, the exact conditionals for \alpha^{(k)}_{j,l} are not easily obtained, so we use rejection sampling. The hyperparameters \nu are set so that classes are roughly balanced a priori; \lambda is set to have bigger values on the diagonal than on the off-diagonals. This encodes the prior belief that classifier outputs are better than random.
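The following sketch illustrates what one Gibbs sweep for IBCC might look like under the conjugate conditionals described above; it is our illustration, not the authors' code, and it holds the hyperparameters \alpha fixed rather than resampling them by rejection sampling as the paper does.

```python
import numpy as np

def ibcc_gibbs_sweep(c, t, p, pi, nu, alpha, rng):
    """One Gibbs sweep for IBCC. c: (I, K) observed labels in {0,...,J-1};
    t: (I,) current integer sample of the true labels; p: (J,) class
    proportions; pi: (K, J, J) confusion matrices; nu: (J,) Dirichlet
    hyperparameters for p; alpha: (K, J, J) Dirichlet hyperparameters for pi."""
    I, K = c.shape
    J = len(p)
    # 1. Sample each t_i from its multinomial conditional
    log_prob = np.tile(np.log(p), (I, 1))
    for k in range(K):
        log_prob += np.log(pi[k][:, c[:, k]].T)
    prob = np.exp(log_prob - log_prob.max(axis=1, keepdims=True))
    prob /= prob.sum(axis=1, keepdims=True)
    for i in range(I):
        t[i] = rng.choice(J, p=prob[i])
    # 2. Sample p from its Dirichlet conditional
    p = rng.dirichlet(nu + np.bincount(t, minlength=J))
    # 3. Sample each row of each confusion matrix from its Dirichlet conditional
    for k in range(K):
        for j in range(J):
            counts = np.bincount(c[t == j, k], minlength=J)
            pi[k, j] = rng.dirichlet(alpha[k, j] + counts)
    return t, p, pi
```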

3 Dependent Models for Bayesian Classifier Combination

One of the problems with the above model is the assumption that classifiers are independent, which is often not true in a real situation. Consider several poor classifiers that make highly correlated mistakes and one good classifier. Assuming independence results in performance biased toward majority voting, whereas accounting for the dependence would discount the poor classifiers by an amount related to their correlation. Modelling dependence therefore appears to be an essential element of Bayesian classifier combination. We propose three models to deal with correlation among classifier outputs. First, we insert a new hidden variable representing the difficulty of each data point; marginalising this out results in a weakly dependent model. Second, we explicitly model pairwise dependence between classifiers using a Markov network. Third, we combine the above two ideas.

3.1 Enhanced BCC Model

We enhance the IBCC model by using different confusion matrices according to the difficulty of each data point for classification. Easy data points are classified using a confusion matrix E which is fixed to have diagonal elements 1 - \epsilon and off-diagonal elements \epsilon / (J - 1) (we have also tried extensions where E is learned). For hard data points, each classifier uses its own confusion matrix, \pi^{(k)}, as before. Whether a data point is easy or hard is controlled by independent Bernoulli latent variables s_i (= 1 if hard) with mean d, which is given a Beta prior. The likelihood term is as follows:

    p(c, t | p, \pi, s) = \prod_{i=1}^{I} p_{t_i} \left\{ \prod_{k=1}^{K} \left( \pi^{(k)}_{t_i, c_i^{(k)}} \right)^{s_i} \right\} \left\{ \prod_{k=1}^{K} \left( E_{t_i, c_i^{(k)}} \right)^{1 - s_i} \right\}        (4)

We call this model the Enhanced Bayesian Classifier Combination (EBCC) model. The graphical model for the EBCC model is shown in Fig. 2. Inference is again performed using Gibbs and rejection sampling.

Figure 2: The graphical model for the EBCC model. Note that we have a different graphical model conditional on the setting of s_i for each point; the left graph is for hard data and the right graph is for easy data. (The usual DAG formalism does not represent such dependence of structure on variable setting elegantly.)
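To make the role of the latent variables s_i concrete, the sketch below shows one way the easy/hard indicators of equation (4) could be resampled inside a Gibbs sweep; it is our illustration, with assumed variable names, and it omits the Beta update for d.

```python
import numpy as np

def easy_confusion(J, eps=0.05):
    """The fixed matrix E of Section 3.1: 1 - eps on the diagonal and
    eps / (J - 1) off the diagonal."""
    E = np.full((J, J), eps / (J - 1))
    np.fill_diagonal(E, 1.0 - eps)
    return E

def sample_difficulty(c, t, pi, E, d, rng):
    """Resample the easy/hard indicators s_i: a point is 'hard' (s_i = 1) when
    its labels are better explained by the per-classifier confusion matrices pi
    than by the near-diagonal matrix E, weighted by the prior mean d."""
    I, K = c.shape
    log_hard = np.full(I, np.log(d))
    log_easy = np.full(I, np.log(1.0 - d))
    for k in range(K):
        log_hard += np.log(pi[k][t, c[:, k]])   # pi^(k)_{t_i, c_i^(k)}
        log_easy += np.log(E[t, c[:, k]])       # E_{t_i, c_i^(k)}
    p_hard = 1.0 / (1.0 + np.exp(log_easy - log_hard))
    return (rng.random(I) < p_hard).astype(int)
```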

3.2 Dependent BCC Model

To model correlations between classifiers more directly, we extend the IBCC model with a Markov network. The part related to confusion matrices is replaced with the following Markov network:

    p(c_i | V, W, t_i) = \frac{1}{Z(V, W, t_i)} \exp\left\{ \sum_{j<k} W_{j,k} \, \delta(c_i^{(j)}, c_i^{(k)}) + \sum_{k} V^{(k)}_{t_i, c_i^{(k)}} \right\}        (5)

In this Markov network, V relates t_i with c_i^{(k)}, and W relates c_i^{(j)} with c_i^{(k)}, which models correlations between classifiers; Z is a partition function (normaliser). The same priors p(t_i | p) p(p | \nu) as in IBCC are used. As priors for the elements of V and W, we use zero-mean independent Gaussians with variances \sigma_v^2 and \sigma_w^2. Sampling for most of the parameters of this model is again straightforward. However, sampling from V and W is more subtle due to the partition function, so we implemented it using a Metropolis sampling method. We call this model the Dependent Bayesian Classifier Combination (DBCC) model. Since it is a mix of directed and undirected conditional independence relations, it is most simply depicted as a factor graph (Fig. 3).

Figure 3: The factor graph for the DBCC model. Each dot represents a factor in the joint probability and connects the variables involved in that factor.

3.3 Enhanced Dependent BCC Model

The Enhanced Dependent BCC model (EDBCC) combines the easy/hard latent variable of the EBCC with the explicit model of correlation between classifiers of the DBCC. For easy data, the conditional probability of each class is given by:

    p_{easy}(c_i^{(:)} | U, t_i) = \frac{1}{Z_e(U, t_i)} \exp\left\{ \sum_{k} U_{t_i, c_i^{(k)}} \right\}        (6)

U relates t_i with c_i^{(k)} (playing a role analogous to the E matrix in EBCC). For easy data points it is assumed that the classifiers are independent; for hard data points the model is as in DBCC. The factor graph for the EDBCC model is shown in Fig. 4.

Figure 4: The factor graph for the EDBCC model. Again we have a different graph conditional on the setting of s_i. The left half shows the factor graph for hard data points (s_i = 1) and the right half for easy data points (s_i = 0).
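The sketch below is our illustration of the unnormalised potential of equation (5) and of computing Z(V, W, t_i) exactly by enumerating all J^K configurations, which is the exponential cost noted in the Discussion; the array shapes are assumptions, not the authors' implementation.

```python
import numpy as np
from itertools import product

def dbcc_log_potential(c_i, t_i, V, W):
    """Unnormalised log-potential of eq. (5) for one data point.
    V: (K, J, J) with V[k, t, c] the singleton term; W: (K, K), where only
    the entries with j < k are used for the pairwise agreement terms."""
    K = len(c_i)
    score = sum(V[k, t_i, c_i[k]] for k in range(K))
    score += sum(W[j, k] * (c_i[j] == c_i[k])
                 for j in range(K) for k in range(j + 1, K))
    return score

def dbcc_log_partition(t_i, V, W, J):
    """Exact log Z(V, W, t_i) by enumerating all J**K label configurations,
    which grows exponentially with the number of classifiers K."""
    K = V.shape[0]
    scores = [dbcc_log_potential(cfg, t_i, V, W)
              for cfg in product(range(J), repeat=K)]
    m = max(scores)
    return m + np.log(sum(np.exp(s - m) for s in scores))
```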

4 Experimental Results

We compared the Bayesian classifier combination methods on several data sets and using different component classifiers. We used the Satellite and DNA data sets from the Statlog project [8] and the UCI digit data set [1].³ Our goal was not to obtain the best classifier performance (for this we would have paid very careful attention to the component classifiers and chosen sophisticated models suited to the properties of each data set); rather, our goal was to compare the usefulness of different BCC methods even when component classifiers are poor, correlated or trained on partial data.

We compared the four variants of the BCC idea outlined above to two other methods: selecting the best classifier using validation data⁴ and majority voting. In all BCC models the validation data was used as known t_i to ground the estimates of the model parameters. In theory this grounding is not necessary: we can treat the labels in the observed data set as simply another classifier's outputs (perhaps those of the human who hand-labelled the data) and assume that no true labels t_i are ever observed. This variant did not seem to work as well in initial experiments but needs to be explored further. BCC results are based on comparing the posterior mode of t_i for data points in the test set to the true observed label.

We did two sets of experiments. In Experiment 1, we combined the outputs of the same type of classifier trained on disjoint training sets.⁵ In Experiment 2, we trained several different classifiers on the (same) whole training set.⁶

³ The DNA data set has a training set of 2000 and a test set of 1186, with 3 classes and 50 variables. Satellite has a training set of 4435 and a test set of 2000, with 6 classes and 36 variables. The UCI digit data set has a training set of 3823, a test set of 1797, 10 classes and 64 variables.
⁴ 500, 1000 and 797 data points were selected from the original test sets as validation sets for the DNA, Satellite and UCI digit data sets, respectively. The rest of the original test set was used to evaluate the performance.
⁵ For the DNA data set, we had 5 disjoint training sets and trained C4.5 on each of them. For the Satellite data set, we had 4 disjoint training sets and trained C4.5 on each of them. For the UCI digit data set, we had 3 disjoint training sets and trained an SVM with 2nd-order polynomial kernel and C = 100.0 on each of them.
⁶ For the DNA data set, we trained 5 classifiers: C4.5 (C1), SVM with 2nd-order polynomial kernel and C = 100.0 (C2), 1-nearest neighbour (C3), logistic regression (C4), and Fisher discriminant (C5). For the Satellite data set, we trained 4 classifiers: C4.5 (C1), SVM with 2nd-order polynomial kernel and C = 100.0 (C2), logistic regression (C3), and Fisher discriminant (C4). For UCI digits, we trained 3 classifiers: SVM with linear kernel (C1), SVM with 2nd-order polynomial kernel (C2), and SVM with Gaussian kernel (σ = 0.01) (C3), where all SVMs had C = 100.0.

For all BCC models we ran the MCMC sampler for at least 50,000 samples, averaging every 100th and discarding the first 10,000. The dependent models (DBCC and EDBCC) were generally slower to converge. Details of the sampling and hyperparameter settings are provided in the longer version of the paper.

                 Experiment 1                       Experiment 2
    Data set     Satellite  UCI digit  DNA          Satellite  UCI digit  DNA
    C1           0.1920     0.0320     0.1210       0.1420     0.0460     0.0714
    C2           0.1820     0.0320     0.1458       0.1450     0.0250     0.1137
    C3           0.1910     0.0390     0.1283       0.1760     0.0290     0.2551
    C4           0.1860     N/A        0.1254       0.2560     N/A        0.1020
    C5           N/A        N/A        0.1050       N/A        N/A        0.0598
    Val          0.1910     0.0390     0.1458       0.1450     0.0250     0.0598
    MV           0.1505     0.0263     0.0780       0.1460     0.0250     0.0415
    IBCC         0.1510     0.0260     0.0758       0.1240     0.0250     0.0408
    EBCC         0.1490     0.0260     0.0758       0.1250     0.0250     0.0408
    DBCC         0.1520     0.0240     0.0904       0.1300     0.0230     0.0423
    EDBCC        0.1410     0.0290     0.0889       0.1280     0.0230     0.0466

Table 1: The performances (test error rates) of the individual classifiers and the various combination schemes when using the same classifier with disjoint training sets (Experiment 1) and different classifiers with the same whole training set (Experiment 2).

Table 1 shows the performance of each classifier and BCC combination strategy for both experiments. Val and MV denote selecting the classifier with the smallest validation-set error, and majority voting, respectively. IBCC and EBCC have similar performance, and the EBCC model is always better than or as good as majority voting. Model selection by validation set is quite bad, especially in Experiment 1. BCC methods are always better than or as good as model selection by validation. The dependent factor graph models (DBCC and EDBCC) do not always work well. Especially on the DNA data set, they did not seem to learn reasonable parameters, perhaps because the DNA data set is relatively small and has a biased class distribution. For Satellite and UCI digits, they learned reasonable parameters and showed performance comparable to the other BCC methods.

We examined the V and W matrices inferred by the dependent methods and the difficulty assigned to each point by the enhanced methods. These have intuitive interpretations and may provide useful diagnostics, one of the strengths of the BCC approach. Due to space limitations we do not display these matrices or discuss them in this paper; see [7].
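As a small illustration of how the BCC entries in Table 1 are scored, the sketch below (ours, with assumed array shapes) takes retained MCMC samples of the true labels, forms the per-point posterior mode, and computes a test error rate.

```python
import numpy as np

def posterior_mode_labels(t_samples, J):
    """Per-point posterior mode of the true label from retained MCMC samples.
    t_samples: (S, I) integer array, one row per retained sample."""
    S, I = t_samples.shape
    votes = np.zeros((I, J), dtype=int)
    for s in range(S):
        votes[np.arange(I), t_samples[s]] += 1
    return votes.argmax(axis=1)

def error_rate(pred, truth):
    """Fraction of test points whose predicted label differs from the truth."""
    return float(np.mean(np.asarray(pred) != np.asarray(truth)))
```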

5 Discussion

We have shown several approaches to classifier combination which explicitly model the relation between true labels and classifier outputs. They worked reasonably well, and some of them were always better than or as good as majority voting or validation selection. The parameters in BCC models can be interpreted reasonably and give useful information such as confusion matrices, correlations between classifiers, and the difficulty of data points. We emphasised that Bayesian classifier combination is not the same as Bayesian model averaging. Our approach is closely related to supra-Bayesian methods for aggregating opinions [4, 6].

Other models and extensions are certainly possible; we outline some here. Clearly the model presented here needs to be generalised to combine classifiers that output probability distributions. In this case, instead of a matrix \pi^{(k)} we need, e.g., a model that relates t_i to class probability distributions. Conditional Dirichlet distributions seem a natural choice for this. Similarly, there is no reason to restrict this approach to combining classifiers. Combining different regressions is another important problem, which could be handled by an appropriate choice of the density of regressor outputs given the true target. A Bayesian generalisation of stacking methods is another important avenue for research. The combiner, in our setup, does not see the input data. If the combiner does see the input and the outputs of all the other classifiers, then it should model the full relation between true labels, inputs, and other classifier outputs.

One practical limitation of the DBCC approach is that the computation time for the exact partition function of the Markov network grows exponentially with the number of classifiers. Efficient approximations to the partition function, many of which have been developed recently, could be used here. Such approximate inference could also be a tractable replacement for all the MCMC computations.

References

[1] C. L. Blake and C. J. Merz. UCI Repository of machine learning databases. Irvine, CA: University of California, Department of Information and Computer Science, 1998.

[2] A. Dawid and A. Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, 28:20-28, 1979.

[3] T. G. Dietterich. Ensemble methods in machine learning. In First International Workshop on Multiple Classifier Systems, LNCS, pages 1-15, 2000.

[4] C. Genest and J. V. Zidek. Combining probability distributions: A critique and an annotated bibliography. Statistical Science, 1:114-118, 1986.

[5] Y. Haitovsky, A. Smith, and Y. Liu. Modelling disagreements among and within raters' assessments from the Bayesian point of view. Draft, presented at the Valencia meeting, 2002.

[6] R. Jacobs. Methods for combining experts' probability assessments. Neural Computation, 7:867-888, 1995.

[7] H. Kim and Z. Ghahramani. Graphical models for Bayesian classifier combination. GCNU Technical Report (in preparation), 2003.

[8] D. Michie, D. Spiegelhalter, and C. Taylor. Machine Learning, Neural and Statistical Classification. Ellis Horwood Limited, 1994.

[9] K. Ting and I. H. Witten. Stacking bagged and dagged models. In Proceedings of ICML '97, San Francisco, CA, 1997.

[10] D. H. Wolpert. Stacked generalization. Neural Networks, 5:241-259, 1992.