Fast Sparse Gaussian Processes Learning for Man-Made Structure Classification

Hang Zhou
Institute for Vision Systems Engineering, Dept. Elec. & Comp. Syst. Eng.
PO Box 35, Monash University, Clayton, VIC 3800, Australia
hang.zhou@eng.monash.edu.au

David Suter
Institute for Vision Systems Engineering, Dept. Elec. & Comp. Syst. Eng.
PO Box 35, Monash University, Clayton, VIC 3800, Australia
d.suter@eng.monash.edu.au

Abstract

The Informative Vector Machine (IVM) is an efficient fast sparse Gaussian process (GP) method previously suggested for active learning. It greatly reduces the computational cost of GP classification and makes GP learning close to real time. We apply IVM to man-made structure classification (a two-class problem). Our work includes an investigation of the performance of IVM with varied numbers of active data points, as well as the effects of different choices of GP kernel. Satisfactory results have been obtained, showing that the approach retains full GP classification performance and yet is significantly faster (by virtue of using a subset of the whole set of training data points).

1 Introduction

We aim to develop an efficient way of classifying man-made structures in natural scenes by applying a fast GP approximation as an online learning method. Gaussian Process (GP) classification models the posterior directly, thus relaxing the strong assumption of conditional independence of the observed data (generally used in a generative model). However, GP learning has O(N^3) computational complexity with respect to the number of training data points N. Different approaches have been proposed to deal with this problem. Csato and Opper developed a sparse representation by minimization of the KL divergence between the approximate posterior and a sparse representation [1]. Snelson and Ghahramani presented a sparse approximation method with M pseudo-input points which are learnt by gradient-based optimization [2].
Lawrence, Seeger and Herbrich proposed a sparse GP method that employs greedy selection to choose the subset of the training data points maximizing a differential entropy score [3] (which is simpler and more efficient to implement than other similar methods). Using this enables us to tackle the issue of kernel selection and to begin to tackle questions of on-line learning support (such as active data selection).

For man-made structure classification on 2D image data, typical approaches are based on Bayesian generative models as proposed in [4] and [5]. The generative model [4, 5] models the joint probability of the observed data and the related labels. Data is conditionally independent given the class labels [4], which is not true for man-made structures with their obvious neighbouring dependencies. The TSBN generative model described in [4] is aimed more at general outdoor scene segmentation than at man-made structure specifically. Kumar and Hebert [5] proposed a generative Multi-Scale Random Field (MSRF) model which extracts image block features that capture the general properties of man-made structures. Observed data dependency is modelled by a pseudo-likelihood approximation. It yields better results than most other approaches. We adopt a similar feature extraction procedure to [5], but we replace the generative model approach with a discriminative GP model approach, capturing the dependencies between the block features by directly modelling the posterior over labels. Moreover, its kernel-based non-parametric nature makes the GP more flexible than parametric models.

The paper is structured as follows. GP classification is introduced in Section 2 and a description of IVM is given in Section 3. In Section 4, experimental details and results are presented. Section 5 gives the main conclusions of the work.

2 Gaussian Processes for classification

2.1 GP

1-4244-1180-7/07/$25.00 ©2007 IEEE
Authorized licensed use limited to: University of Adelaide Library. Downloaded on February 2, 2010 at 18:30:48 EST from IEEE Xplore. Restrictions apply.
A GP is a collection of random variables, any finite number of which has a joint Gaussian distribution [6]. It is fully specified by its mean function m(x) and covariance function k(x, x'), expressed as:

f ~ GP(m, k)    (2.1)

which defines a distribution over functions. Inference can be cast directly into the GP framework by learning a covariance function from training data.

2.2 GP regression

GP regression aims to recover the underlying process from the observed training data. Following the exposition in [7]: we have a dataset D with n observations, D = {(x_i, y_i) | i = 1, ..., n}, where x_i is an input vector of dimension d and y_i is the scalar output. The input data are collected in a d x n matrix X and the targets/outputs in a vector y, so D = (X, y). Typically, given noisy observations D = (X, y) where y = f + ε with additive noise ε ~ N(0, σ_n^2 I), the conditional GP mean predictive distribution can be expressed as

f_* = K(X_*, X)[K(X, X) + σ_n^2 I]^{-1} y    (2.2)

where K(X_*, X) denotes the covariance matrix between test and training points and K(X, X) is the training data covariance. The GP mean prediction in Equation (2.2) can either be regarded as a linear combination of the observations y, or as a linear combination of kernel functions, each centred on a training point.

2.3 GP classification

In our application, we need a binary classifier to discriminate between man-made structure and non-structure, so our dataset is D = (X, y), where X are input training image features and y the class labels -1/+1. GP binary classification is done through a latent function. After calculating the distribution over the latent function, the output of regression is squashed through a sigmoid transformation to guarantee a valid probabilistic value within the range [0, 1]. Since class labels are discrete in binary classification, the Gaussian likelihood is no longer valid, and so an approximation is needed; the EP approximation is generally used (see Algorithms 3.5 and 3.6 in [7]).

2.4 GP kernels

The GP kernel is the crucial part of GP learning, since it incorporates the prior smoothness assumption.
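As an illustrative sketch (not code from the original experiments), the predictive mean of Eq. (2.2) can be computed directly in NumPy; the RBF kernel, toy data, and hyperparameter values below are assumptions chosen for the example:

```python
import numpy as np

def rbf(X1, X2, length_scale=0.2):
    """Squared-exponential kernel matrix K(X1, X2)."""
    # Squared Euclidean distance between every pair of rows.
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale ** 2)

def gp_mean(X_star, X, y, noise_var=1e-6, length_scale=0.2):
    """Predictive mean  K(X*, X) [K(X, X) + sigma_n^2 I]^{-1} y  (Eq. 2.2)."""
    K = rbf(X, X, length_scale)
    K_star = rbf(X_star, X, length_scale)
    alpha = np.linalg.solve(K + noise_var * np.eye(len(X)), y)
    return K_star @ alpha

# Toy 1-D regression: recover a smooth function from (nearly noise-free) samples.
X = np.linspace(0.0, 1.0, 20)[:, None]
y = np.sin(2 * np.pi * X[:, 0])
f_star = gp_mean(np.array([[0.25]]), X, y)
```

Here `f_star` is exactly the linear combination of kernel functions centred on the training points described after Eq. (2.2).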
The typical covariance functions studied in this paper include [7]:

1) Radial Basis Function (RBF), also known as the Squared Exponential (SE) or Gaussian function:

k_RBF(r) = exp(-r^2 / (2 l^2))    (2.3)

where r = |x - x'|, x and x' are input pairs, and l is the characteristic length-scale.

2) Matern class of covariance functions (ν = 5/2):

k_{ν=5/2}(r) = (1 + √5 r / l + 5 r^2 / (3 l^2)) exp(-√5 r / l)    (2.4)

3) Linear kernel:

k(x, x') = σ_0^2 + x · x'    (2.5)

where x and x' are input pairs.

3 Fast sparse Gaussian Process: the Informative Vector Machine (IVM)

IVM [8, 9] selects only a small subset of the dataset, the most informative d points out of the total N training points, thus reducing the computational complexity from O(N^3) to O(d^2 N). IVM greedily minimises the entropy of the posterior by including, in a sequential manner, only the most informative data points, i.e. those that most reduce the entropy. The selected d points form the so-called active set [3]. Following Eq (A.20) in [7], the entropy of a Gaussian N(μ, Σ) in D dimensions can be expressed as:

H[N(μ, Σ)] = (1/2) log |Σ| + (D/2) log(2πe)    (3.1)

For the greedy algorithm deciding which points are taken into the active set I, Lawrence et al. [3] proposed to choose as the next point for inclusion the one that maximizes the differential entropy score H[Q_i] - H[Q_{i+1}], where Q_i is the Gaussian approximation of the posterior p(f | X, y) at step i as described in Section 2.3, H[Q_i] is the entropy at step i, and H[Q_{i+1}] is the entropy once the observation at the candidate site is included. Using Eq (3.1), the differential entropy score can be written as:

H[Q_i] - H[Q_{i+1}] = (1/2) log |Σ_i| - (1/2) log |Σ_{i+1}|
= (1/2) log (|Σ_i| / |Σ_{i+1}|)    (3.2)

Thus, it is proportional to the log ratio between the variances of Q_i and Q_{i+1}. The change of the entropy after including a point is equivalent to the reduction in the level of uncertainty. Choosing the inclusions (d of them) forces the resulting model to be sparse. Moreover, IVM uses an EP-style approximation of the posterior and, as shown in Eq (3.51) of [7], a likelihood term can be ignored if its site values are very small. In this way a sparse model is obtained and computational efficiency is gained. Details of the IVM implementation can be found in [3, 10].

4 Experiments and results

4.1 Orientogram features

A feature vector is computed at each 16x16 block. These features are designed to capture the line and edge patterns of man-made structure [5] [11]. As described in [11], a 14-component feature vector is generated at different scales: 1x1, 2x2, and 4x4 blocks. These features are derived from "orientograms": histograms of gradient orientations in a region, weighted by gradient magnitudes. The 14 features include:

1) The first heaved central-shift moments (three scales)
2) The third heaved central-shift moments (three scales)
3) The absolute location of the highest bin (three scales)
4) The relationship of the two most dominant orientations at the three scales, expressed as

rd_intra = sin(δ_1 - δ_2)    (4.1)

where δ_1 and δ_2 are the two dominant orientations.
5) The relationship of the dominant orientations between adjacent scales, which is

rd_inter = cos(δ_i - δ_{i+1})    (4.2)

where δ_i and δ_{i+1} are the dominant orientations at adjacent scales i and i+1.

We only keep the eight features in 1), 4) and 5), which cover the more general properties of man-made structures.

4.2 Experiments and results

The proposed approach was trained and tested using the image set that Kumar [5] used¹.

¹ http://www.cs.cmu.edu/~skumar/manmadedata.tar

To increase the variation and to test the generalization ability, we also used some images, collected by the authors around our campus, for testing. All images are cut to the size of 256x256 and divided into non-overlapping 16x16 pixel blocks which are labelled as one of the two classes, i.e.
building or non-building blocks. We used a training set of 11 images, containing 407 structured blocks and 1768 non-structured blocks. Testing is implemented on 43 images, including 33 images and 10 self-collected images. None of the test images appears in the training set.

For IVM GP classification, we run Lawrence's program [9]². Rasmussen and Williams's GP classification program is applied for standard GP classification [7]³. The number of inclusion training points, d, is set to 660 in our tests. This is a compromise between speed and performance, considering that the time complexity is proportional to d^2.

The RBF kernel is the most frequently used in applications of GP learning. Kernels allow for the incorporation of prior knowledge, and it therefore makes little sense to apply the same kernel to every application. Thus we also investigated a variety of GP kernels, including RBF, RBF with ARD (Automatic Relevance Determination) [12], RBF with a linear function, etc. The Rational Quadratic (RQ) and Neural Network (NN) functions were also tested; however, these two functions yielded less satisfactory results and are not listed in the comparison figures.

Figure 1 shows some of the test results for IVM GP classification (using the Matern kernel and RBF kernel respectively) and standard GP classification, as well as Kumar's MSRF results. GP classification with the Matern kernel tends to cover more building blocks and have fewer false detections. Specifically, we have compared our results with Kumar's [5] on a group of test images in Table 1. Despite using only 1/20 the number of training data compared with his (and only 8 of the 14 feature types he used), the results on the images are almost equivalent to his. Moreover, we do not impose spatial coherence in image space (unlike the MSRF of Kumar). The results on the 10 images added by the authors have a relatively lower detection rate and similar false positives, which implies that our campus buildings may not be well represented by the buildings in the data set. Nevertheless, there is clearly significant generalisation to different architectural types.
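For reference, the two best-performing kernels in these experiments, RBF (Eq. 2.3) and Matern ν = 5/2 (Eq. 2.4), can be sketched as functions of the input distance r; this is a minimal illustration, not the toolbox implementations used in the experiments:

```python
import numpy as np

def k_rbf(r, l=1.0):
    """RBF / Squared Exponential:  exp(-r^2 / (2 l^2))  (Eq. 2.3)."""
    return np.exp(-r ** 2 / (2 * l ** 2))

def k_matern52(r, l=1.0):
    """Matern nu = 5/2:  (1 + sqrt(5) r/l + 5 r^2/(3 l^2)) exp(-sqrt(5) r/l)  (Eq. 2.4)."""
    s = np.sqrt(5.0) * r / l
    return (1.0 + s + s ** 2 / 3.0) * np.exp(-s)
```

Both kernels equal 1 at r = 0 and decay with distance, but the Matern kernel has a heavier tail and yields less smooth sample functions, which may suit the block features used here.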
The overall results on all 43 test images: with a detection rate of 70.65%, the false positive rate is 1.49 blocks/image. One can increase the detection rate at the cost of more false positives: the false positives go up to 2.53 blocks/image with a higher detection rate of 78.59%.

² http://www.cs.man.ac.uk/~neill/ivm/downloadfiles/
³ http://www.gaussianprocess.org/gpml/code/matlab/doc/classification.html
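The trade-off above comes from thresholding the per-block predictive probability. A small sketch (with synthetic scores and hypothetical threshold values, purely illustrative) shows how lowering the threshold raises both the detection rate and the false positives per image:

```python
import numpy as np

def rates(p, labels, threshold, n_images):
    """Detection rate and false positives/image when blocks with
    P(structure) >= threshold are declared man-made structure."""
    pred = p >= threshold
    detection = pred[labels == 1].mean()
    fp_per_image = pred[labels == 0].sum() / n_images
    return detection, fp_per_image

# Synthetic per-block probabilities: structure blocks score higher on average.
rng = np.random.default_rng(0)
labels = np.concatenate([np.ones(400, int), np.zeros(1600, int)])
p = np.where(labels == 1, rng.beta(5, 2, 2000), rng.beta(2, 5, 2000))
det_lo_thr, fp_lo_thr = rates(p, labels, 0.4, n_images=40)  # permissive threshold
det_hi_thr, fp_hi_thr = rates(p, labels, 0.6, n_images=40)  # conservative threshold
```

Because every block accepted at the higher threshold is also accepted at the lower one, both the detection rate and the false-positive rate are monotone in the threshold.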
Figure 1. Classification results: (a)(b) original images, (c)(d) Kumar's results, (e)(f) IVM Matern kernel results, (g)(h) IVM RBF kernel results, (i)(j) GP RBF kernel results.

Figure 2. Classification results: (a)(b) original images, (c)(d) IVM Matern kernel results, (e)(f) IVM RBF kernel results, (g)(h) GP RBF kernel results.

In Figure 2, the results on our campus images again show that IVM GP classification with the Matern kernel is better, even though no campus images were included in our training set. We focus on a comparison between the Matern kernel and the RBF kernel, since these had the best performance of the kernels tested in our application. The Matern kernel with IVM is compared with RBF with IVM (shown in Figure 3). The RBF kernel used in a standard GP is compared to the IVM Matern (shown in Figure 4). The results in Figure 3 show a clear advantage of the Matern kernel over the RBF kernel on detection rate (and a similar
rate of false positives). Compared with the RBF kernel, the Matern kernel with IVM has a similar detection rate with fewer false positives, as shown in Figure 4. In Figure 5, the Matern kernel is compared with several kernels in an IVM implementation. It has better performance, in that either the false positives are lower at a similar detection rate, or the detection rate is higher with similar false positives. In Figure 5(b), performance is compared on the Kumar test images only. Overall, the Matern kernel seems to yield the best performance.

Tests have also been done extending the IVM inclusion points from 660 to 1060, as well as enlarging the training data set up to 8000 points. The results are all similar to those of 2000 training points with 660 included. This implies that the IVM approach is not only efficient in terms of computation time but can also capture the information well with a limited number of inclusion points.

The results in Figure 6 were obtained on a computer with an Intel 1.66GHz+980MHz CPU. It can be seen that the computational time of the standard GP increases drastically with growth in the number of training data points. In the case of 8000 training points, the standard GP is almost prohibitive. The IVM times are consistent with O(d^2 N).

Figure 3. Comparison of IVM Matern kernel and IVM RBF kernel on test data: (a) detection rate of IVM Matern vs IVM RBF; (b) false positives of IVM Matern vs IVM RBF.

Figure 4. Comparison of IVM Matern kernel and GP RBF kernel on test data: (a) detection rate of IVM Matern vs GP RBF; (b) false positives of IVM Matern vs GP RBF.

Figure 5. (a) Detection rate and false positives comparison of different kernels on all test images. (b) Detection rate and false positives comparison of different kernels on the Kumar test images only.
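A minimal sketch of the greedy, entropy-based inclusion underlying IVM (Section 3) is given below for Gaussian (regression-style) sites; this is an assumption-laden illustration, not Lawrence's released implementation. Each step includes the point with the largest differential entropy score of Eq. (3.2), which for a Gaussian site is the point of largest current posterior variance, and then applies a rank-one posterior update:

```python
import numpy as np

def ivm_select(K, noise_var, d):
    """Greedily pick d active points from an N x N prior covariance K.

    At each step the chosen point maximizes the entropy reduction
    0.5 * log(|Sigma_i| / |Sigma_{i+1}|) (Eq. 3.2), which for Gaussian
    sites is the point with the largest posterior variance. The full
    posterior covariance is never formed; only its diagonal is tracked
    via rank-one updates, giving O(d^2 N) total cost.
    """
    N = K.shape[0]
    var = np.diag(K).astype(float).copy()  # current posterior variances
    M = np.zeros((d, N))                   # accumulated rank-one factors
    active = []
    for i in range(d):
        j = int(np.argmax(var))            # largest entropy reduction
        active.append(j)
        # Posterior covariance column for point j: prior minus updates so far.
        s = K[j] - M[:i].T @ M[:i, j]
        m_new = s / np.sqrt(s[j] + noise_var)
        var -= m_new ** 2                  # variance shrinks at every point
        var[j] = -np.inf                   # exclude j from reselection
        M[i] = m_new
    return active
```

On five collinear points with an RBF prior, the selection starts at one end and then jumps to the farthest (most uncertain) point, which is the qualitative behaviour that lets IVM keep full-GP performance with few inclusions.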
5 Conclusions

We have described the application of IVM (an efficient sparse approximation of GP classification) to man-made structure classification. With IVM GP classification, performance is maintained using only a fraction of the training data. Moreover, since this affords experimental kernel tuning, the resulting structure can be more accurately trained. Future work will involve the investigation of active data selection for semi-supervised learning (for example, seeking parts of the images that would improve the classification in regions where the GP indicates most uncertainty, and asking the user to verify the classification) and other facets that will facilitate on-line learning of building detection in image data.

6 References

[1] L. Csato and M. Opper, "Sparse representation for Gaussian process models," Advances in Neural Information Processing Systems, vol. 13, pp. 444-450, 2001.
[2] E. Snelson and Z. Ghahramani, "Sparse Gaussian Processes using Pseudo-inputs," in Neural Information Processing Systems, 2005.
[3] N. Lawrence, M. Seeger, and R. Herbrich, "Fast Sparse Gaussian Process Methods: The Informative Vector Machine," Advances in Neural Information Processing Systems, 2003.
[4] X. Feng, C. K. I. Williams, and S. N. Felderhof, "Combining Belief Networks and Neural Networks for Scene Segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, pp. 467-483, 2002.
[5] S. Kumar and M. Hebert, "Man-Made Structure Detection in Natural Images using a Causal Multiscale Random Field," in CVPR 2003, p. 119.
[6] C. E. Rasmussen, "Gaussian Processes in Machine Learning," Advanced Lectures on Machine Learning, 2003.
[7] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning: The MIT Press, 2006.
[8] N. D. Lawrence and J. C. Platt, "Learning to Learn with the Informative Vector Machine," in International Conference on Machine Learning, 2004.
[9] N. D. Lawrence, J. C. Platt, and M. I. Jordan, "Extensions of the Informative Vector Machine," in Deterministic and Statistical Methods in Machine Learning, 2004.
[10] M. Seeger, "Bayesian Gaussian Process Models: PAC-Bayesian Generalisation Error Bounds and Sparse Approximations," Doctor of Philosophy thesis, Institute for Adaptive and Neural Computation, Division of Informatics, University of Edinburgh, 2003.
[11] C. Pantofaru, R. Unnikrishnan, and M. Hebert, "Toward Generating Labeled Maps from Color and Range Data for Robot Navigation," in 2003 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2003.
[12] R. M. Neal, Bayesian Learning for Neural Networks: Springer-Verlag New York, Inc., 1996.

            1000 pts   2000 pts   4000 pts   8000 pts
IVM GP      8 min      14 min     22 min     35 min
GP          40 min     320 min    2560 min   20480 min*

Figure 6. Comparison of computational time between IVM GP and standard GP. (* Estimation only.)

                   Kumar's MSRF model              Our IVM GP model with Matern kernel
Training set       108 images (3004 structured     11 images (407 structured +
                   + 3669 non-structured blocks)   1768 non-structured blocks)
Testing set        Images from photo stock:        Images from photo stock plus varied
                   129 images                      pictures collected by the authors on
                                                   campus: 43 images (33 photo stock +
                                                   10 collected)
Detection rate     72.13%                          33 images: 71.69%; 10 collected: 61.11%;
                                                   all 43: 70.65%/78.59%
False positives    1.46                            33 images: 1.4; 10 collected: 1.54;
                                                   all 43: 1.49/2.53

Table 1. Comparison of our IVM GP approach with Kumar's MSRF model.