Selecting Shape Features Using Multi-class Relevance Vector Machine


Selecting Shape Features Using Multi-class Relevance Vector Machine

Hao Zhang    Jitendra Malik

Electrical Engineering and Computer Sciences, University of California at Berkeley

Technical Report No. UCB/EECS, October 2005

Copyright 2005, by the author(s). All rights reserved. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.

Selecting Shape Features Using Multi-class Relevance Vector Machine

Hao Zhang    Jitendra Malik
Computer Science Division, EECS Dept., UC Berkeley, CA 94720

Abstract

The task of visual object recognition benefits from feature selection: it reduces the amount of computation in recognizing a new instance of an object, and the selected features give insights into the classification process. We focus on a class of current feature selection methods known as embedded methods: because classification in object recognition is inherently multi-way, we derive an extension of the Relevance Vector Machine technique to multi-class. In experiments, we apply the Relevance Vector Machine to the problem of digit classification and study its effects. Experimental results show that our classifier enhances accuracy, yields a good interpretation for the selected subset of features, and costs only a constant factor of the baseline classifier.

1. Introduction

Figure 1: (a) Collection of digits. (b) Prototype digits with highlighted discriminative part.

When looking at a slate of digits (fig. 1), after compensating for the variation in writing style, one can see that there are distinctive parts of the shape which best tell each digit apart from the other classes (fig. 1). Being able to identify these discriminative parts is a remarkable ability for classifying the digits quickly and accurately. The aim of this work is to find those parts automatically using feature selection techniques. Even though we illustrate it with digits, this problem is representative and important to general object recognition because:

1. Importance of shape cue. Studies in biological vision have suggested that shape cues account for the majority of the necessary information for visual object recognition (among other cues such as texture and color) [1]. Therefore, it is worthwhile to study how to differentiate among shape cues and to derive methods to pick out the most useful shape cues from the data. Handwritten digits have natural variations in shape and thus provide a good setting for studying the selection of shape cues.

2. Fine discrimination. In object recognition, a common challenge is to classify visually similar objects. This demands a vision system that emphasizes the more discriminative parts of the instances rather than comparing them on the whole. Digits provide a good source of this problem because there are borderline instances (e.g. among 8s and 9s) that need detailed discrimination.

With currently available techniques, we choose to tackle the problem by phrasing it in terms of feature selection. Given the popularity of feature selection techniques on genomic data, it is important to note that our aim differs from theirs: the ground truth in genomics is that only a few features (genes) are relevant to the property under study, but in shape classification all features are in fact relevant and useful for classification. In genomics, feature selection has the potential to yield a near-perfect classifier. In shape classification, we do not expect a classifier built on feature selection to outperform those that use all features. Our aim is to limit the number of features and see which ones are best to use in a classifier.

Following the survey of [2], feature selection approaches fall into three main categories: filter, wrapper and embedded methods. We describe these categories with examples from the computer vision literature where possible:

1. Filter methods select features independently of the classifier, as a preprocessing step to prune features. Example in computer vision: in [3], scale-invariant image features are extracted and ranked by a likelihood or mutual information criterion, and the most relevant features are then used in a standard classifier. However, we find the separation of the selection stage and the classification stage unsatisfactory.

2. Wrapper methods treat the classifier as a black box and provide it with varying subsets of features so as to optimize for the best feature set. In practice, wrapper methods either search exhaustively over all possible subsets, which is exponentially slow, or use greedy search, which explores only a portion of all subsets.

3. Embedded methods, as an improvement on wrapper methods, incorporate feature selection into the training of the classifier. Example in computer vision: in [4], a pair-wise affinity matrix is computed on the training set and an optimization on the Laplacian spectrum of the affinity matrix is carried out. However, the construction of the affinity matrix and its subsequent learning would be inefficient even for a moderate-size training set (e.g. a few hundred examples).

In this paper, we focus on a new class of embedded methods for feature selection ([5] and [6]). Compared to other methods, they are distinctive in that they integrate feature selection into the training process in an efficient manner:

1. Roughly speaking, they optimize the training error as a sum of loss on the data plus a regularization penalty on the weights that promotes sparseness. They optimize a clear error criterion that is tuned for the task of classification (in contrast, filter methods are not tied to the classifier).

2. Without the regularization term (or with a slight change of the term), they reduce to well-known classifiers such as the SVM and logistic regression. Therefore, from a practical point of view, they can be thought of as improved (or alternative) versions of the baseline classifier.

3. Algorithmically, they cost only a constant factor of the time of the baseline classifier ([5] costs about the same as a usual SVM; [6] costs a small constant multiple of a logistic regression, since it usually converges in a small number of iterations).

The theoretical part of this work is to extend [6] to the multi-class setting, because in shape discrimination the task almost always involves more than two classes. On the empirical side, we study our classifier in both two-class (in comparison to [5]) and multi-class problems.
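To make the embedded idea concrete, here is a minimal sketch (ours, not the authors' code) that trains an L1-regularized logistic regression, a stand-in embedded method in the spirit of the objectives above, on a small two-class digit problem. The sparsity penalty sits inside the training objective, so fitting the classifier and selecting features happen in one step; the regularization strength C is an arbitrary illustrative choice.

```python
# Minimal sketch of an embedded feature selector: the L1 penalty inside
# the training objective drives many weights to exactly zero, so training
# and feature selection happen in one step. Illustrative only.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)        # 8x8 digit images, 64 raw features
mask = (y == 8) | (y == 9)                 # an easily confused pair: 8s vs 9s
X, y = X[mask], y[mask]

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
clf.fit(X, y)

selected = np.flatnonzero(clf.coef_[0])    # features with nonzero weight
print(f"kept {selected.size} of {X.shape[1]} features:", selected)
```

Shrinking C (i.e. increasing the penalty weight) keeps fewer features, which is the same trade-off the embedded methods below expose through their own regularization parameters.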
In light of the vision-for-learning motivation, we believe that shape cues provide an excellent data set for gaining insight into a particular feature selection technique. The shape features selected by a technique can be visualized and evaluated against intuition, thus providing insight into the classification process. This is an advantage that other data sets (e.g. genomics) do not have.

The paper is organized as follows: Section 2 discusses the shape features extracted from the image. Section 3 introduces two embedded methods: (1) the 1-norm Support Vector Machine (1-norm SVM) and (2) the Relevance Vector Machine (RVM). There we develop a multi-class extension to the RVM. Section 4 studies the experimental results on digits, and we conclude in section 5.

2 Shape Features

The most straightforward experiment on feature selection takes the raw image as the feature vector. This leaves the learning machinery with the job of modelling shape variation, which is often hard. Moreover, each pixel location is treated as a feature, which may actually correspond to different parts of the shape in different images. In order to obtain features in correspondence, we choose to use the shape context descriptor obtained from aligning the images of the digits, following that of [7]. We choose this set of features over other image-based features (e.g. intensity) to focus on the shape aspect of the image.

Figure 2: Anchor shapes for four digit classes; visualization of the process of finding correspondences.

From each class, we select an anchor digit to which all other digits of that class are matched. This digit is picked as the median in terms of shape context distance (so that the maximum of its distance to all digits of that class is minimum). Fig. 2 shows the anchor shapes. When each digit is matched to the anchor, the algorithm establishes correspondences between the points on the anchor and the points on the digit (the process is illustrated in fig. 2). This enables us to define a shape feature as the point on each digit that corresponds to a particular point on the anchor shape. At each shape feature location, we extract two numbers: the shape context distance to the corresponding shape context on the anchor digit, and the sum of squared differences to the corresponding image patch on the anchor digit. We use distances instead of the original descriptor values to cut down dimensionality, so that there are enough training examples for the feature selection stage. With a fixed set of shape feature locations on each shape, we thus obtain two numbers per location from each digit. When studying the importance of a single feature, we add the weight on its shape context distance and the weight on its image patch distance.
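As a concrete, simplified illustration of these per-point features, the sketch below computes a minimal log-polar shape context histogram plus a patch SSD at each location. This is our reconstruction under stated assumptions, not the authors' code: we assume correspondences are already given (index i on the digit matches index i on the anchor, as produced by a matcher like [7]), use a chi-squared histogram distance, take a small fixed patch half-width, and assume points lie away from the image border.

```python
# Sketch: two numbers per corresponded point, as described above,
# a shape-context distance and an image-patch SSD to the anchor.
import numpy as np

def shape_context(points, i, nr=5, ntheta=12):
    """Minimal log-polar histogram of the other points around points[i]."""
    d = np.delete(points, i, axis=0) - points[i]
    r = np.log1p(np.hypot(d[:, 0], d[:, 1]))
    t = np.arctan2(d[:, 1], d[:, 0])
    h, _, _ = np.histogram2d(r, t, bins=(nr, ntheta),
                             range=[[0.0, r.max() + 1e-9], [-np.pi, np.pi]])
    return h.ravel() / len(d)

def chi2(h1, h2, eps=1e-10):
    """Chi-squared histogram distance, commonly used for shape contexts."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def point_features(img, pts, anchor_img, anchor_pts, half=2):
    """pts[i] is assumed to already correspond to anchor_pts[i]."""
    feats = []
    for i in range(len(anchor_pts)):
        sc = chi2(shape_context(pts, i), shape_context(anchor_pts, i))
        (x, y), (u, v) = pts[i].astype(int), anchor_pts[i].astype(int)
        patch = img[y - half:y + half + 1, x - half:x + half + 1]
        ref = anchor_img[v - half:v + half + 1, u - half:u + half + 1]
        feats.extend([sc, float(np.sum((patch - ref) ** 2))])
    return np.array(feats)                 # 2 numbers per feature location
```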

3 Feature Selection: Embedded Methods

We study embedded methods that optimize the following objective function:

$$\frac{1}{n} \sum_{i=1}^{n} f_{loss}(w^T x_i) + \lambda f_{penalty}(w) \qquad (1)$$

where $x_i$ is the $i$th data point, $w$ is the weight vector, and $\lambda$ is a relative weighting of the two terms. The first term is the training error: an average of loss values at the data points. The second term is a regularizing penalty on the weights. By choices of the loss function ($f_{loss}$) and the penalty function ($f_{penalty}$), this general model reduces to well-known cases such as linear regression ($f_{loss}$ being square loss and $f_{penalty}$ being 0), logistic regression ($f_{loss}$ being logistic loss and $f_{penalty}$ being 0) or the SVM ($f_{loss}$ being hinge loss and $f_{penalty}$ being $w^T w$).

It is also well known that certain types of $f_{penalty}$ promote sparseness in the weights. An example is lasso regression [8], which is similar to linear regression but puts an $L_1$ penalty on the weights. It pushes many components of the weight vector $w$ to zero, effectively selecting features. To understand why those types of $f_{penalty}$ promote sparseness, we illustrate with the special case of linear regression in two dimensions (where $w = (w_1, w_2)$ and $f_{loss} = (w_1 x_1 + w_2 x_2)^2$). First, note that the optimization problem (1) is equivalent to:

$$\min \frac{1}{n} \sum_{i=1}^{n} f_{loss}(w^T x_i) \quad \text{s.t.} \quad f_{penalty}(w) = C \qquad (2)$$

where $C$ is a constant determined by $\lambda$. The optimization problem can then be thought of as finding the point on the penalty contour ($f_{penalty}(w) = C$) that touches the smallest training-error contour ($\sum_{i=1}^{n} f_{loss}(w^T x_i) = C_2$). Note that in the linear regression case, the training error contour is always an ellipse. We consider three different types of $f_{penalty}$:

1. $f_{penalty} = w^T w$ ($L_2$ norm): also known as ridge regression; the penalty contour is a circle. The point on the penalty contour that minimizes the training error is where the two contours are tangential. Usually, at that point none of the weights are zero (fig. 3(a)).

2. $f_{penalty} = \|w\|_1$ ($L_1$ norm): the special case of lasso regression; the penalty contour is a diamond. Because the vertices of the diamond tend to stick out, they are usually the spot on the penalty contour where the smallest training error contour is attained (fig. 3(b)). At those vertices, one of $w_1, w_2$ is pushed to zero.

3. $f_{penalty} = (\sum_k |w_k|^\epsilon)^{1/\epsilon}$ ($L_\epsilon$ norm, $\epsilon < 1$): the penalty contour is concave and its vertices stick out much more than in the $L_1$ case, so it is even more likely that one of the vertices minimizes the training error (fig. 3(c)).

Figure 3: The penalty contour (in solid line) and the training error ellipses (in dotted line): (a) $L_2$ norm (b) $L_1$ norm (c) $L_\epsilon$ norm.

This phenomenon is general for other types of $f_{loss}$ as well: by choosing those types of $f_{penalty}$, we obtain a classifier with feature selection capability. However, not all of these choices admit a tractable solution. We study two particular variants that admit efficient computation: the 1-norm SVM and the multi-class Relevance Vector Machine.

3.1 1-norm SVM

Here the loss function $f_{loss}$ is the SVM hinge loss and $f_{penalty}$ is the $L_1$ norm. In the optimal solution, the number of nonzero weights depends on the value of the relative weighting $\lambda$ in Eq. 1: a larger $\lambda$ drives more weights to zero. [5] proposes a solution to Eq. 1 in which each weight, as a function of $\lambda$, is computed incrementally as more weights are activated. Each weight $w_i$ follows a piecewise linear path, and the computational cost of following all the paths is only slightly more than that of the baseline SVM classifier. In our experiments, we visualize the weights as they are incrementally activated by the 1-norm SVM.

3.2 Multi-class Relevance Vector Machine

In the multi-class setting, the SVM loss does not extend naturally. For this reason, we turn to an alternative classification technique: multinomial logistic regression, which gives rise to the Relevance Vector Machine technique introduced in [6]. In this case, $f_{loss}$ is the multinomial logistic loss, $f_{penalty}$ is the negative log of the student-t distribution on $w$, and the weighting $\lambda$ is 1. The penalty contours, plotted in fig. 4, have the desirable property of being highly concave. (In practice, we use a limiting prior whose penalty contour is even more concave; see fig. 4.) Also, in this case the optimization problem Eq. 1 is equivalent (by taking the negative log of probability) to finding a maximum a posteriori (MAP) estimate given the student-t prior and the multinomial logit likelihood. Therefore it can be cast as an estimation problem on the following hierarchical model, as shown in fig. 4(c): a hyper-parameter $\alpha_i$ is introduced for each weight (given $\alpha_i$, $w_i$ is distributed as a zero-mean Gaussian with variance $1/\alpha_i$), and $\alpha_i$ itself has a Gamma prior. (When $\alpha_i$ is integrated out, the marginal distribution of $w_i$ is a student-t density, as in the original setup.) The parameter $\alpha_i$ is intuitively called the relevance of feature $i$, in the sense that the bigger the $\alpha_i$, the more likely the feature weight $w_i$ is driven to zero. This additional layer of hyper-parameters yields an optimization process in two stages, in a fashion similar to Expectation-Maximization: (1) optimize over $w$ with fixed $\alpha$; (2) optimize over $\alpha$ using the optimal $w$ from (1). This is the main iteration in the RVM technique. In our derivations below, we call these the inner loop and the outer loop. The details of the derivation and a discussion are in the appendix.

At the end of the iteration, we obtain a set of converged $\alpha$'s and $w$'s. Typically, many of the $w$'s are zero. However, even among the $w$'s that are not zero, the associated $\alpha$'s vary widely, which suggests ranking the features by their $\alpha$'s and thresholding at successive levels to select feature subsets of different sizes, as in the sketch below. (Note that the ranking is not independent for each feature, because the $\alpha$'s are obtained by considering all features jointly.) This way of studying the effect of feature selection makes it comparable to the successively larger subsets of features in the 1-norm SVM case.

The original RVM is derived and experimented on two-class problems. While mentioning an extension to multi-class, the original formulation essentially treats a multi-class problem as a series of n one-vs-rest binary classification problems. This would translate into training n binary classifiers independently.¹ Instead, to fully exploit the potential of this technique,
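As a concrete illustration of this thresholding scheme (our sketch, not the authors' code), the snippet below ranks features by ascending $\alpha$, since per the discussion above a large $\alpha$ suppresses its weight, and returns nested subsets of increasing size. The $\alpha$ values are made-up toy numbers.

```python
# Sketch: turn converged relevances (alphas) into nested feature subsets.
# A large alpha suppresses its weight, so small alpha = more relevant.
import numpy as np

def subsets_by_relevance(alphas, sizes):
    """Rank features by ascending alpha; return one index subset per size."""
    order = np.argsort(alphas)                 # most relevant features first
    return {m: order[:m] for m in sizes}

alphas = np.array([0.5, 3e4, 12.0, 0.9, 7e5, 4.2])   # toy converged values
print(subsets_by_relevance(alphas, sizes=(2, 4)))
# -> {2: array([0, 3]), 4: array([0, 3, 5, 2])}
```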

we derive a multi-class RVM based on the first principles of the multinomial logistic regression and the prior on $w$, in the hierarchical model (see the appendix).

¹ In [6], eqn. 8, the multiclass likelihood is defined as $P(t|w) = \prod_n \prod_k \sigma(y_k(x_n; w_k))^{t_{nk}}$, where $x$ and $w$ are input variables and weights, respectively; $t_{nk}$ is the indicator variable for observation $n$ being in class $k$; $y_k$ is the predictor for class $k$; and $\sigma(y)$ is the logit function $1/(1+e^{-y})$. The product $\prod_n \prod_k$ of binary logit functions treats the class indicator variables $y_k$ independently for each class. In contrast, a true multiclass likelihood is $P(t|w) = \prod_n \prod_k \sigma(y_k; y_1, y_2, \ldots, y_K)^{t_{nk}}$, where the predictors for the classes are coupled in the multinomial logit function (the softmax): $\sigma(y_k; y_1, \ldots, y_K) = e^{y_k} / (e^{y_1} + \cdots + e^{y_K})$.

Figure 4: (a) The equal-penalty contour for the student-t prior on $w$ when $a = b > 0$ (contours of $(1+x^2)(1+y^2) = C$). (b) The equal-penalty contour for Jeffreys prior (density $1/|x|$) on $w$ when $a = b = 0$ (contours of $x^2 y^2 = C$). (c) Graphical model for the RVM: the response $y_n$ is a multinomial logit function of the input data $\phi$ with weights $w_i$; each weight $w_i$ is associated with a hyper-parameter $\alpha_i$, with $p(w_i|\alpha_i) = N(w_i; 0, \alpha_i^{-1})$, and each $\alpha_i$ is distributed as Gamma(a, b).

4 Experimental results

4.1 Setup

We experiment on digits from the MNIST database [9]. To extract features, each query digit is aligned against a collection of prototype digits, one per class, resulting in a total number of shape context features proportional to the number of classes C. By restricting our attention to only one prototype per class, we use an overly simple classifier so that we can study the feature selection problem in isolation. Here, we do not address the issue of feature selection across more prototypes per class, but we believe that our findings are indicative of the general case. In our derivation of the RVM, the features are class-specific (i.e. the $\phi_i^{(p)}$ in section A.1), which agrees with the process of obtaining features by matching against each class. In the 1-norm SVM setting, we simply concatenate the features from all classes.

4.2 Two-class

We first study the effects in the two-class problem: 8s vs 9s. They are an easily confused pair and indeed yield a worse error rate than a classifier trained on most other pairs of digits. For this problem, we run both the RVM and the 1-norm SVM on the shape context features. An overview plot of the error rate, computed by 5-fold cross validation, is in fig. 5. The eventual error rate is not in the range of the state of the art (0.6% as in [7]), but it performs well for a single unit in a prototype-based method, which will perform much better with additional units from other prototypes [10]. The error rate also drops noticeably with only a few activated features, suggesting that a small portion of the features accounts for the overall classification. We look more closely at which features are activated in this range, shown in fig. 6. As a baseline, we also include results from the simple filter method of ranking features by their mutual information with the class label, also shown in fig. 6. A few things can be noticed:

1. Since the mutual information criterion picks features independently, it selects features from the lower half of the digit 9 repeatedly. Each of those features has high predictive power, but as a collection they are highly correlated with each other. In contrast, the RVM and the 1-norm SVM quickly exhaust the features there and move on to other parts of the shape.

2. The two embedded methods, RVM and 1-norm SVM, agree fairly well on the features they select, and agree well on their weights. One difference is that the RVM also tends to pick out another spot of discrimination, the upper right corner of the digit 9 (which is useful for telling against 8s that are not closed in that region). In comparison, the 1-norm SVM took longer to start assigning significant weights to that region.
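The footnote's distinction is easy to see numerically. The following small sketch (ours, with toy values) shows that independent one-vs-rest binary logits do not form a normalized distribution over classes, while the softmax couples the per-class predictors into a proper class posterior.

```python
# Sketch: one-vs-rest binary logits vs. the coupled multinomial logit
# (softmax) for a single observation with K = 3 class predictors y_k.
import numpy as np

y = np.array([2.0, 0.5, -1.0])            # per-class predictors (toy values)

binary = 1.0 / (1.0 + np.exp(-y))         # independent logits, as in [6] eqn. 8
softmax = np.exp(y) / np.exp(y).sum()     # true multiclass likelihood terms

print(binary, binary.sum())               # sums to ~1.77: not a distribution
print(softmax, softmax.sum())             # sums to 1: proper class posterior
```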

As an interesting side experiment, we also compare the shape context features against the raw intensity values from the original MNIST images, using our feature selection methods. We first run the 1-norm SVM on the two separate feature sets. Fig. 7 shows that shape context outperforms intensity when only a few features are activated, while their eventual performance is comparable. This suggests that shape contexts are the features of choice when only a limited number of features is allowed, which is confirmed in fig. 7: a run of the 1-norm SVM on the combined pool of features selects mostly the shape contexts first. This effect is also present in the RVM, though the contrast between the two types of feature is less pronounced.

Figure 5: Error rate on the two-class problem (RVM and 1-norm SVM on 8s vs 9s) as more features are added.

Figure 6: Feature weights as more features are activated: (a) mutual information (b) RVM (c) 1-norm SVM.

4.3 Four-class

We pick the classes 3, 6, 8, 9 for a similar reason as in the two-class case: they have visually similar shapes, and we can identify some parts of their shapes as being discriminative in this four-class classification task. We are interested in seeing whether this notion is reflected in the results of our method. Since the problem is multi-class, the 1-norm SVM is not applicable and we run only the RVM, with mutual information as a baseline. Fig. 8 plots the error rate as a function of the number of features. Similar to the two-class problem, peak performance is reached with a similar percentage of the features. The peak error rate is only slightly worse than that of the two-class case (6% from 4%), validating the multi-class extension. In the well-performing range of the number of features, we visualize the magnitude of the weights on each feature (fig. 9). As in the two-class case, we see that the features selected by the RVM tend to spread out, as it avoids selecting correlated features. It is interesting to reflect on the features selected in fig. 9 and notice that they are the features good for telling each digit class apart from the rest: the upper left portion of 3, the tip of the stem of 6, the lower circle of 8 (which has a different topology than the other classes), and the lower portion of 9 can each be examined alone to decide which class it is. If we then look at the magnitude of the weights on those features, for example, the bigger weights on the lower portion of the 9 suggest that, compared to other cues, the multinomial logistic classifier benefits the most from comparing the query digit to the prototype 9 and examining the lower portion of the matching score. We think it is important and interesting to interpret the learning mechanism in this way.

Figure 7: Comparing shape context and intensity features by 1-norm SVM: error rate as a function of the number of activated features, and the number of features selected from each category.

Figure 8: Error rate of the RVM on the four-class problem as more features are added.

Figure 9: Feature weights as more features are activated: (a) mutual information (b) RVM.

5 Conclusion

In this work, we have demonstrated how domain knowledge from shape analysis can be used to extract a good initial set of features suitable for selection algorithms. We extended one of the embedded feature selection methods to handle multi-class problems, studied the performance of two variants of the embedded method, and showed that the selected features have an intuitive interpretation for the classification task.

Acknowledgments

We thank Matthias Seeger and Ji Zhu for fruitful discussions, and Li Wang for providing the 1-norm SVM code.

References

[1] Stephen E. Palmer. Vision Science: Photons to Phenomenology. The MIT Press, 1999.

[2] Isabelle Guyon and André Elisseeff. An introduction to variable and feature selection. J. Mach. Learn. Res., 3:1157–1182, 2003.

[3] Gy. Dorkó and C. Schmid. Selection of scale-invariant neighborhoods for object class recognition. In Proceedings of the 9th International Conference on Computer Vision, pages 634–640, 2003.

[4] Lior Wolf and Amnon Shashua. Feature selection for unsupervised and supervised inference: the emergence of sparsity in a weighted-based approach. In Proceedings of the 9th International Conference on Computer Vision, 2003.

[5] Ji Zhu, Saharon Rosset, Trevor Hastie, and Rob Tibshirani. 1-norm support vector machines. In Sebastian Thrun, Lawrence Saul, and Bernhard Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.

[6] Michael E. Tipping. Sparse Bayesian learning and the relevance vector machine. J. Mach. Learn. Res., 1:211–244, 2001.

[7] S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using shape contexts. IEEE Trans. Pattern Anal. Mach. Intell., 24(4):509–522, 2002.

[8] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society B, 58:267–288, 1996.

[9] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.

[10] Hao Zhang and Jitendra Malik. Learning a discriminative classifier using shape context distances. In Proc. IEEE Conf. Comput. Vision and Pattern Recognition, pages 242–247, 2003.

[11] David J. C. MacKay. Bayesian interpolation. Neural Comput., 4(3):415–447, 1992.

A Appendix

A.1 Inner loop: L2-regularized logistic regression

This is the stage where the $\alpha$'s are fixed. Suppose the number of input data points is $n$ and the number of classes is $C$. The feature vector components are class-specific, i.e., an input $i$ is mapped into a series of feature vectors, one corresponding to each class $p$: $\phi_i^{(p)}$. Following multinomial logistic modelling, $u_i^{(p)} = \langle w^{(p)}, \phi_i^{(p)} \rangle$, and the output is a softmax over the $u$'s:

$$\mu_i^{(p)} = \frac{e^{u_i^{(p)}}}{e^{u_i^{(1)}} + \cdots + e^{u_i^{(C)}}}$$

Suppose each $w_j^{(p)}$ is regularized by hyper-parameter $\alpha_j^{(p)}$, i.e., $w_j^{(p)} \sim N(0, 1/\alpha_j^{(p)})$. Under the maximum a posteriori (MAP) principle, we minimize the negative log of the posterior:

$$-\log p(w|y) \doteq -\log p(y|w) - \log p(w|\alpha) := \Psi(w) \doteq -\sum_{i,p} y_i^{(p)} \log \mu_i^{(p)} + \sum_{j,p} \frac{1}{2}\left( \alpha_j^{(p)} (w_j^{(p)})^2 - \log \alpha_j^{(p)} \right)$$

where $\doteq$ denotes equality modulo constants (the first $\doteq$ is modulo $\log p(y|\alpha)$, a term which will become important in the outer loop but is a constant here since $\alpha$ is fixed), and $y_i^{(p)}$ is the binary indicator (i.e. $y_i^{(p)} = 1$ iff data point $i$ is of class $p$).

For a more intuitive derivation, we adopt a more succinct notation. Let $\phi^{(p)}$ be the design matrix for class $p$, namely $\phi^{(p)} = (\phi_1^{(p)} \ldots \phi_n^{(p)})^T$, and let $\phi = \mathrm{diag}(\phi^{(p)})_{p=1..C}$. Let $w$ be the concatenation of all the weights $w^{(p)}$ (grouped by class), and let $w_k$ denote its $k$th component. Similarly, let $\alpha$ and $\alpha_k$ be the analogous concatenation of the $\alpha^{(p)}$'s and its $k$th component. Let $K$ be the total number of features from all classes. Let $A = \mathrm{diag}(\alpha_k)_{k=1..K}$. Let $B_{p,q} = \mathrm{diag}\big( \mu_i^{(p)} (\delta_p^q - \mu_i^{(q)}) \big)_{i=1..n}$ and let $B$ be the block matrix consisting of the $B_{p,q}$. Then the derivatives of $\Psi(w)$ can be written in matrix form as:

$$\frac{\partial \Psi}{\partial w} = \phi^T (\mu - y) + A w, \qquad H(w) := \frac{\partial^2 \Psi}{\partial w \, \partial w^T} = \phi^T B \phi + A$$

These first and second derivatives are used in the iteratively reweighted least squares (IRLS) procedure, as in ordinary logistic regression, until convergence.

A.2 Outer loop: MAP estimate of α

We want to minimize the negative log posterior for $\alpha$: $f(\alpha) = -\log p(y|\alpha) - \log p(\alpha)$. To obtain $p(y|\alpha)$, we know by the definition of conditional probability that $p(y|\alpha) = p(y, w|\alpha) / p(w|y, \alpha)$. Taking the negative log of both sides at the optimum $\tilde{w}$ and recalling the definition of $\Psi(w)$ gives: $-\log p(y|\alpha) = \Psi(\tilde{w}) + \log p(\tilde{w}|y, \alpha)$. As justified in [6], we assume a saddle-point approximation for $p(w|y, \alpha)$, i.e., $p(w|y, \alpha) \approx N(w \,|\, \tilde{w}, H(\tilde{w})^{-1})$. Then $\log p(\tilde{w}|y, \alpha) \approx \log N(\tilde{w} \,|\, \tilde{w}, H(\tilde{w})^{-1}) = \frac{1}{2} \log \det H(\tilde{w}) - \frac{K}{2} \log(2\pi)$. Therefore the overall negative log posterior, dropping the constant $\frac{K}{2} \log(2\pi)$, is

$$f(\alpha) \doteq \frac{1}{2} \log \det H(\tilde{w}) + \Psi(\tilde{w}) - \log p(\alpha)$$
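Before differentiating $f(\alpha)$, here is a minimal numpy sketch (ours, on tiny random data, not the authors' implementation) of the A.1 inner loop: Newton/IRLS on the L2-regularized multinomial logistic objective $\Psi(w)$, using the gradient and Hessian in the matrix forms above. It also computes $H(\tilde{w})^{-1}$, which the outer loop needs.

```python
# Sketch of the inner loop (appendix A.1): Newton/IRLS for L2-regularized
# multinomial logistic regression with class-specific features. Toy sizes.
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(0)
n, d, C = 40, 5, 3                         # examples, features/class, classes
Phi = block_diag(*[rng.normal(size=(n, d)) for _ in range(C)])  # (nC) x (dC)
labels = rng.integers(0, C, size=n)
Y = np.zeros((C, n)); Y[labels, np.arange(n)] = 1.0
y = Y.ravel()                              # indicators, stacked class-major
A = np.diag(np.ones(d * C))                # fixed relevances alpha_k = 1

w = np.zeros(d * C)
for _ in range(50):
    U = (Phi @ w).reshape(C, n)            # u_i^(p)
    M = np.exp(U - U.max(axis=0))
    M /= M.sum(axis=0)                     # softmax mu_i^(p)
    grad = Phi.T @ (M.ravel() - y) + A @ w
    B = np.block([[np.diag(M[p] * ((p == q) - M[q])) for q in range(C)]
                  for p in range(C)])      # blocks B_{p,q}
    H = Phi.T @ B @ Phi + A                # Hessian of Psi
    step = np.linalg.solve(H, grad)
    w -= step                              # Newton / IRLS update
    if np.linalg.norm(step) < 1e-8:
        break

Sigma = np.linalg.inv(H)                   # posterior covariance, used below
```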

The $\alpha$'s tend to grow very large during estimation, hence we optimize with respect to $\log(\alpha)$. (This reparametrization affects $p(\alpha)$ slightly via a change of variables.) The derivative of $f(\alpha)$ has three parts. The first term is computed from the matrix calculus result $\frac{d \log \det X}{dX} = X^{-T}$ and the chain rule $\frac{d}{d\alpha} f(X) = \mathrm{tr}\big( \frac{\partial f}{\partial X}^T \frac{\partial X}{\partial \alpha} \big)$:

$$\frac{\partial}{\partial \alpha_k} \, \frac{1}{2} \log \det H(\tilde{w}) = \frac{1}{2} \, \mathrm{tr}\Big( H(\tilde{w})^{-1} \frac{\partial H(\tilde{w})}{\partial \alpha_k} \Big) = \frac{1}{2} \big[ H(\tilde{w})^{-1} \big]_{kk} := \frac{1}{2} \Sigma_{kk}$$

Here we have assumed that $B$ is constant with respect to $\alpha$. An exact derivative without assuming a constant $B$ can be obtained, but it is more complicated; in our experiments it produces a negligible difference in the converged answer. The second term $\Psi(\tilde{w})$ depends on $\alpha$ in two ways: directly, through the terms involving the prior on $w$, and indirectly, through the optimum $\tilde{w}$, which depends on the value of $\alpha$. However, we exploit the fact that $\tilde{w}$ is optimal, so the indirect part has derivative zero:

$$\frac{\partial \Psi(\tilde{w})}{\partial \alpha_k} = \frac{\partial \Psi(\tilde{w})}{\partial \alpha_k}\Big|_{\text{fixed } \tilde{w}} + \frac{\partial \Psi(\tilde{w})}{\partial \tilde{w}}\Big|_{\text{fixed } \alpha} \frac{\partial \tilde{w}}{\partial \alpha_k} = \frac{\partial \Psi(\tilde{w})}{\partial \alpha_k}\Big|_{\text{fixed } \tilde{w}} + 0 = \frac{1}{2}\Big( \tilde{w}_k^2 - \frac{1}{\alpha_k} \Big)$$

The third term is simply the negative log of the Gamma prior:

$$\frac{\partial (-\log p(\alpha))}{\partial \alpha_k} = b - \frac{a}{\alpha_k}$$

Setting these derivatives to zero yields a set of fixed-point iteration equations. This leads to a re-estimation rule for the $\alpha$'s similar in form to [11]: defining the degree-of-well-determinedness parameter $\gamma_k = 1 - \alpha_k \Sigma_{kk}$, the re-estimation update is:

$$\alpha_k = \frac{\gamma_k + 2a}{\tilde{w}_k^2 + 2b}$$

A.3 Discussion of the RVM

The choice of the values for $a$ and $b$: when $a = b > 0$, the equivalent prior on $w$ is a student-t distribution, which approximates a Gaussian near the origin. This is undesirable, as it puts an $L_2$-norm penalty on the weights when the weights become small. To avoid this, we set the parameters $a = b = 0$, which puts an improper, density $1/|x|$ prior that is independent of the scale of the weights and always has concave equal-penalty contours.

The algorithm is fast: in implementation, the inner loop returns, along with the optimum, the inverse of the Hessian at the optimum (as is needed in logistic regression anyway). The simple updates on $\alpha$ cost virtually no time. In experiments, we see that the re-estimation converges quickly. The $\alpha$'s that correspond to suppressed features tend to grow very large after a few iterations. As in [6], we prune those features, which also speeds up later iterations.
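Putting A.1 and A.2 together, the sketch below (ours; the function name rvm_outer and the inner-loop interface are hypothetical) wraps the re-estimation update and the pruning step around an inner loop like the one sketched after A.2's objective. With a = b = 0 it corresponds to the improper-prior case discussed above.

```python
# Sketch of the outer loop (appendix A.2): fixed-point re-estimation of
# the relevances, with pruning of features whose alpha diverges.
import numpy as np

def rvm_outer(inner_loop, K, a=0.0, b=0.0, iters=50, prune_at=1e6):
    """inner_loop(alpha, active) is assumed to solve appendix A.1 on the
    active features and return (w, Sigma) with Sigma = H(w)^-1."""
    alpha = np.ones(K)
    active = np.arange(K)                      # features not yet pruned
    for _ in range(iters):
        w, Sigma = inner_loop(alpha, active)
        gamma = 1.0 - alpha[active] * np.diag(Sigma)   # well-determinedness
        alpha[active] = (gamma + 2 * a) / (w ** 2 + 2 * b + 1e-30)
        active = active[alpha[active] < prune_at]      # drop suppressed feats
    return alpha, active
```

Pruning shrinks the active set, so later inner-loop solves work on progressively smaller Hessians, which is the speedup noted above.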


More information

A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS

A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS Proceedngs of the Wnter Smulaton Conference M E Kuhl, N M Steger, F B Armstrong, and J A Jones, eds A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS Mark W Brantley Chun-Hung

More information

Problem Set 3 Solutions

Problem Set 3 Solutions Introducton to Algorthms October 4, 2002 Massachusetts Insttute of Technology 6046J/18410J Professors Erk Demane and Shaf Goldwasser Handout 14 Problem Set 3 Solutons (Exercses were not to be turned n,

More information

FINDING IMPORTANT NODES IN SOCIAL NETWORKS BASED ON MODIFIED PAGERANK

FINDING IMPORTANT NODES IN SOCIAL NETWORKS BASED ON MODIFIED PAGERANK FINDING IMPORTANT NODES IN SOCIAL NETWORKS BASED ON MODIFIED PAGERANK L-qng Qu, Yong-quan Lang 2, Jng-Chen 3, 2 College of Informaton Scence and Technology, Shandong Unversty of Scence and Technology,

More information

Efficient Segmentation and Classification of Remote Sensing Image Using Local Self Similarity

Efficient Segmentation and Classification of Remote Sensing Image Using Local Self Similarity ISSN(Onlne): 2320-9801 ISSN (Prnt): 2320-9798 Internatonal Journal of Innovatve Research n Computer and Communcaton Engneerng (An ISO 3297: 2007 Certfed Organzaton) Vol.2, Specal Issue 1, March 2014 Proceedngs

More information

Fitting: Deformable contours April 26 th, 2018

Fitting: Deformable contours April 26 th, 2018 4/6/08 Fttng: Deformable contours Aprl 6 th, 08 Yong Jae Lee UC Davs Recap so far: Groupng and Fttng Goal: move from array of pxel values (or flter outputs) to a collecton of regons, objects, and shapes.

More information

Relational Lasso An Improved Method Using the Relations among Features

Relational Lasso An Improved Method Using the Relations among Features Relatonal Lasso An Improved Method Usng the Relatons among Features Kotaro Ktagawa Kumko Tanaka-Ish Graduate School of Informaton Scence and Technology, The Unversty of Tokyo ktagawa@cl.c..u-tokyo.ac.jp

More information

Incremental Learning with Support Vector Machines and Fuzzy Set Theory

Incremental Learning with Support Vector Machines and Fuzzy Set Theory The 25th Workshop on Combnatoral Mathematcs and Computaton Theory Incremental Learnng wth Support Vector Machnes and Fuzzy Set Theory Yu-Mng Chuang 1 and Cha-Hwa Ln 2* 1 Department of Computer Scence and

More information

An efficient method to build panoramic image mosaics

An efficient method to build panoramic image mosaics An effcent method to buld panoramc mage mosacs Pattern Recognton Letters vol. 4 003 Dae-Hyun Km Yong-In Yoon Jong-Soo Cho School of Electrcal Engneerng and Computer Scence Kyungpook Natonal Unv. Abstract

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS46: Mnng Massve Datasets Jure Leskovec, Stanford Unversty http://cs46.stanford.edu /19/013 Jure Leskovec, Stanford CS46: Mnng Massve Datasets, http://cs46.stanford.edu Perceptron: y = sgn( x Ho to fnd

More information

Course Introduction. Algorithm 8/31/2017. COSC 320 Advanced Data Structures and Algorithms. COSC 320 Advanced Data Structures and Algorithms

Course Introduction. Algorithm 8/31/2017. COSC 320 Advanced Data Structures and Algorithms. COSC 320 Advanced Data Structures and Algorithms Course Introducton Course Topcs Exams, abs, Proects A quc loo at a few algorthms 1 Advanced Data Structures and Algorthms Descrpton: We are gong to dscuss algorthm complexty analyss, algorthm desgn technques

More information

R s s f. m y s. SPH3UW Unit 7.3 Spherical Concave Mirrors Page 1 of 12. Notes

R s s f. m y s. SPH3UW Unit 7.3 Spherical Concave Mirrors Page 1 of 12. Notes SPH3UW Unt 7.3 Sphercal Concave Mrrors Page 1 of 1 Notes Physcs Tool box Concave Mrror If the reflectng surface takes place on the nner surface of the sphercal shape so that the centre of the mrror bulges

More information

Meta-heuristics for Multidimensional Knapsack Problems

Meta-heuristics for Multidimensional Knapsack Problems 2012 4th Internatonal Conference on Computer Research and Development IPCSIT vol.39 (2012) (2012) IACSIT Press, Sngapore Meta-heurstcs for Multdmensonal Knapsack Problems Zhbao Man + Computer Scence Department,

More information