IMAGE CLASSIFIER LEARNING FROM NOISY LABELS VIA GENERALIZED GRAPH SMOOTHNESS PRIORS


Yu Mao, Gene Cheung#, Chia-Wen Lin$, Yusheng Ji#
The Graduate University for Advanced Studies, #National Institute of Informatics, $National Tsing Hua University

ABSTRACT

When collecting samples via crowd-sourcing for semi-supervised learning, labels that designate events of interest are often assigned unreliably, resulting in label noise. In this paper, we propose a robust method for graph-based image classifier learning given noisy labels, leveraging recent advances in graph signal processing. In particular, we formulate a graph-signal restoration problem, where the objective includes a fidelity term that minimizes the ℓ0-norm between the observed labels and a reconstructed graph-signal, and generalized graph smoothness priors, where we assume that the reconstructed signal and its gradient are both smooth with respect to a graph. The optimization problem can be solved efficiently via an iterative reweighted least squares (IRLS) algorithm. Simulation results show that, for two image datasets with varying amounts of label noise, our proposed algorithm noticeably outperforms both regular SVM and a noisy-label learning approach in the literature.

Index Terms: Graph-based classifiers, label denoising, generalized smoothness priors

1. INTRODUCTION

The prevalence of social media sites like Facebook and Instagram means that user-generated content (UGC) like selfies is growing rapidly. Classification of this vast content into meaningful categories can greatly improve understanding and detect prevailing trends. However, the sheer size of UGC means that it is too costly to hire experts to assign labels (classification into different events of interest) to partial data for semi-supervised classifier learning. One approach to this big-data problem is crowd-sourcing [1]: employ many non-experts online to assign labels to a subset of data at very low cost. However, non-experts can often be unreliable (e.g., a non-expert is not competent in a label assignment task but pretends to be, or simply assigns labels randomly to minimize mental effort), leading to label errors, or noise.

In this paper, we propose a new method to robustly learn a graph-based image classifier given (partial) noisy labels, leveraging recent advances in graph signal processing (GSP) [2]. In particular, we formulate a graph-signal restoration problem, where the graph-signal is the desired labeling of all samples into two events. The optimization objective includes a fidelity term that measures the ℓ0-norm between the observed labels in the training samples and the reconstructed signal, and generalized graph smoothness priors that assume the desired signal and its gradient are smooth with respect to a graph, an extension of total generalized variation (TGV) [3] to the graph-signal domain. Because the notion of smoothness applies to the entire graph-signal, all available samples are considered during graph-signal reconstruction, unlike SVM, which considers only samples along the boundaries that divide the feature space into clusters of different events; this leads to a more robust classifier. The optimization is solved efficiently via an iterative reweighted least squares (IRLS) algorithm [4]. Simulation results for two image datasets with varying amounts of label noise show that our proposed algorithm noticeably outperforms both regular SVM and a noisy-label learning approach in the literature.

The outline of the paper is as follows. We first overview related works in Section 2. We then review basic GSP concepts and define graph smoothness notions in Section 3. In Section 4, we describe our graph construction using the available samples and formulate our noisy-label classifier learning problem. We present our proposed IRLS algorithm in Section 5. Finally, we present experimental results and conclusions in Sections 6 and 7, respectively.
2. RELATED WORK

Learning with label noise has garnered much interest, including a workshop at NIPS 2010 [1] (https://people.cs.umass.edu/~wallach/workshops/nips2010css/) and a journal special issue in Neurocomputing [5]. There exists a wide range of approaches, both theoretical (e.g., label propagation in [6]) and application-specific (e.g., emotion detection using an inference algorithm based on a multiplicative update rule [7]). In this paper, similar to previous works [8, 9, 10, 11], we choose to build a graph-based classifier, where each acquired sample is represented as a node in a high-dimensional feature space and connects to other sample nodes in its neighborhood. Our approach is novel in that we show how a generalized graph smoothness notion, extending TGV [3] to the graph-signal domain, can be used for robust graph-based classifier learning with label noise.

Graph-signal priors have been used for image restoration problems such as denoising [12, 13, 14], interpolation [15, 16], bit-depth enhancement [17] and JPEG de-quantization [18]. The common assumption is that the desired graph-signal is smooth or band-limited with respect to a properly chosen graph that reflects the structure of the signal. In contrast, we define a generalized notion of graph smoothness for signal restoration specifically for classifier learning.

3. SMOOTHNESS OF GRAPH-SIGNALS

3.1. Preliminaries

GSP is the study of signals on structured data kernels described by graphs [2]. We focus on undirected graphs with non-negative edge weights. A weighted undirected graph G = {V, E, W} consists of a finite set of vertices V with cardinality |V| = N, a set of edges E connecting vertices, and a weighted adjacency matrix W. W is a real N x N symmetric matrix, where w_{i,j} >= 0 is the weight assigned to the edge (i,j) connecting vertices i and j. Given G, the degree matrix D is a diagonal matrix whose i-th diagonal element is D_{i,i} = \sum_{j=1}^{N} w_{i,j}. The combinatorial graph Laplacian L (graph Laplacian for short) is then:

    L = D - W    (1)
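As a quick illustration of (1), the following minimal NumPy sketch (toy data of our own, not from the paper) builds D and L for a 3-node line graph with unit edge weights, the same toy graph used in Fig. 1 below:

```python
import numpy as np

# Toy 3-node line graph with unit edge weights (cf. Fig. 1); any
# symmetric, non-negative adjacency matrix W works the same way.
W = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])

D = np.diag(W.sum(axis=1))   # degree matrix: D_{i,i} = sum_j w_{i,j}
L = D - W                    # combinatorial graph Laplacian, Eq. (1)

print(L)
# [[ 1. -1.  0.]
#  [-1.  2. -1.]
#  [ 0. -1.  1.]]
```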

Because L is a real symmetric matrix, there exists a set of eigenvectors \phi_i with corresponding real eigenvalues \lambda_i that decompose L, i.e.,

    \Phi \Lambda \Phi^T = \sum_i \lambda_i \phi_i \phi_i^T = L    (2)

where \Lambda is a diagonal matrix with the eigenvalues \lambda_i on its diagonal, and \Phi is an eigenvector matrix with the corresponding eigenvectors \phi_i as its columns. L is positive semi-definite [2], i.e., x^T L x >= 0 for all x in R^N, which implies that the eigenvalues are non-negative, i.e., \lambda_i >= 0. The eigenvalues can be interpreted as frequencies of the graph. Hence any signal x can be decomposed into its graph frequency components via \Phi^T x, where \alpha_i = \phi_i^T x is the i-th frequency coefficient. \Phi^T is called the graph Fourier transform (GFT).

3.2. Generalized Graph Smoothness

We next define the notion of smoothness for graph-signals. x^T L x captures the total variation of signal x with respect to graph G in the ℓ2-norm:

    x^T L x = \frac{1}{2} \sum_{(i,j) \in E} w_{i,j} (x_i - x_j)^2    (3)

In words, x^T L x is small if connected vertices i and j have similar signal values x_i and x_j for each edge (i,j) in E, or if the edge weight w_{i,j} is small. x^T L x can also be expressed in terms of the graph frequencies \lambda_i:

    x^T L x = (x^T \Phi) \Lambda (\Phi^T x) = \sum_i \lambda_i \alpha_i^2    (4)

Thus a small x^T L x also means that the energy of signal x is mostly concentrated in the low graph frequencies.

Fig. 1. Example of a line graph with three nodes and edge weights 1, and the corresponding adjacency and degree matrices W = \begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix} and D = \mathrm{diag}(1, 2, 1).

Like TGV, we can also define a higher-order notion of smoothness. Specifically, L is related to the second derivative of continuous functions [2], and so Lx computes the second-order difference of the graph-signal x. As an illustrative example, the 3-node line graph with edge weights w_{i,j} = 1 shown in Fig. 1 has the following graph Laplacian:

    L = \begin{bmatrix} 1 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 1 \end{bmatrix}    (5)

Using the second row L_{2,:} of L, we can compute the second-order difference at node x_2:

    L_{2,:} x = -x_1 + 2 x_2 - x_3    (6)

On the other hand, the definition of the second derivative of a function f(x) is (https://en.wikipedia.org/wiki/Second_derivative):

    f''(x) = \lim_{h \to 0} \frac{f(x+h) - 2 f(x) + f(x-h)}{h^2}    (7)

We see that (6) and (7) compute the same quantity in the limit. Hence if Lx is small, then the second-order difference of x is small, i.e., the first-order difference of x is smooth or changing slowly. In other words, the gradient of the signal is smooth with respect to the graph. We express this notion by stating that the square of the ℓ2-norm of Lx is small:

    \|Lx\|_2^2 = x^T L^T L x = x^T L^2 x = \sum_i \lambda_i^2 \alpha_i^2    (8)

where (8) holds since L is symmetric by definition.

Fig. 2. Example of a constructed graph G for binary-event classification with two features h(1) and h(2). A linear SVM would dissect the space into two for classification.

4. PROBLEM FORMULATION

4.1. Graph Construction

In a semi-supervised learning scenario, we assume that a set of training samples of size N_1 with possibly noisy binary labels (i.e., two-event classification) is available, and we are tasked to classify N_2 additional test samples. Denote by x the length-N vector of ground-truth binary labels, where x_i \in \{-1, 1\} and N = N_1 + N_2. Similarly, denote by y the length-N_1 vector of observed training labels, where y_i \in \{-1, 1\}.

We first construct a graph to represent all N samples. Each sample is represented as a vertex on the graph G and has an associated set of M features h_i(m), 1 <= m <= M. Given the available features, we can measure the similarity between two samples i and j and compute the edge weight w_{i,j} between vertices i and j in the graph G as follows:

    w_{i,j} = \exp\left( -\frac{\sum_{m=1}^{M} c_m (h_i(m) - h_j(m))^2}{\sigma_h^2} \right)    (9)

where \sigma_h is a parameter and c_m is a correlation factor that evaluates the correlation between feature h(m) of the N_1 training samples and the samples' labels y.
In words, (9) states that two sample vertices i and j have edge weight w_{i,j} close to 1 if their associated relevant features are similar, and close to 0 if their relevant features are different. This method of graph construction is very similar to previous works on graph-based classifiers [8, 9, 10, 11]. For robustness, we connect all vertex pairs i and j in the vertex set V, resulting in a complete graph; empirical results show that a more connected graph is more robust to noise than a sparse graph. An example constructed graph is shown in Fig. 2, where each sample has two features h(1) and h(2). Two samples i and j with similar relevant features will have a small distance in the feature space and an edge weight w_{i,j} close to 1. A linear SVM would divide the feature space into two halves for two-event classification. Having defined the edge weights w_{i,j}, the graph Laplacian L can be computed as described in Section 3.1.
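To make the construction concrete, here is a minimal NumPy sketch of (9) under stated assumptions: the helper name edge_weights and the toy features are ours, and the correlation factors c_m are taken as given, since the paper does not specify their computation beyond correlating each feature with the training labels y.

```python
import numpy as np

def edge_weights(H, c, sigma_h):
    """Complete-graph edge weights per Eq. (9).

    H: N x M feature matrix (row i holds features h_i(1), ..., h_i(M)).
    c: length-M correlation factors c_m (assumed precomputed from the
       N_1 training labels y).
    sigma_h: kernel bandwidth parameter.
    """
    diff = H[:, None, :] - H[None, :, :]            # pairwise feature gaps
    dist2 = (c[None, None, :] * diff ** 2).sum(-1)  # weighted squared distances
    W = np.exp(-dist2 / sigma_h ** 2)               # Gaussian kernel, Eq. (9)
    np.fill_diagonal(W, 0.0)                        # no self-loops
    return W

# Hypothetical toy data: 4 samples with M = 2 features each.
H = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]])
c = np.ones(2)                                      # placeholder factors
W = edge_weights(H, c, sigma_h=0.5)
L = np.diag(W.sum(axis=1)) - W                      # Laplacian, Section 3.1
```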

4.2. Label Noise Model

To model label noise, we adopt a uniform noise model [1], where the probability of observing y_i = x_i, 1 <= i <= N_1, is 1 - p, and p otherwise; i.e.,

    Pr(y_i \mid x_i) = \begin{cases} 1 - p & \text{if } y_i = x_i \\ p & \text{otherwise} \end{cases}    (10)

Hence the probability of observing a noise-corrupted y given ground truth x is:

    Pr(y \mid x) = p^k (1-p)^{N_1 - k}, \quad k = \|y - Dx\|_0    (11)

where D is an N_1 x N binary matrix that selects the first N_1 entries from the length-N vector x. (11) serves as the likelihood, or fidelity term, in our MAP formulation.

4.3. Graph-Signal Prior

For the signal prior Pr(x), following the discussion in Section 3.2 we assume that the desired signal x and its gradient are smooth with respect to a graph G with graph Laplacian L. Mathematically, we use Gaussian kernels to define Pr(x):

    Pr(x) = \exp(-\sigma_0 \, x^T L x) \, \exp(-\sigma_1 \, x^T L^2 x)    (12)

where \sigma_0 and \sigma_1 are parameters. One can interpret (12) as an extension of TGV [3] to the graph-signal domain.

We interpret the two smoothness terms in the context of binary-event classification. We know that the ground-truth signal x is indeed piecewise smooth: each true label x_i is binary, and labels of the same event cluster together in the same feature-space area. The signal smoothness term in (12) promotes piecewise smoothness in the reconstructed graph-signal \hat{x}, as shown in previous graph-signal restoration works [13, 14, 18], and hence is an appropriate prior here.

Recall that the purpose of TGV [3] is to avoid over-smoothing a ramp (a linear increase / decrease in pixel intensity) in an image, which would happen if only a total variation (TV) prior were used. A ramp in the reconstructed signal \hat{x} in our classification context corresponds to an assigned label other than 1 and -1, which can reflect the confidence level in the estimated label; e.g., a computed label \hat{x}_i = 0.3 would mean the classifier has determined that event i is more likely to be 1 than -1, but the confidence level is not high. By using the gradient smoothness prior, one can promote the appropriate amount of ambiguity in the classification solution instead of forcing the classifier to make hard binary decisions. As a result, the mean squared error (MSE) of our solution with respect to the ground-truth labels is low.

4.4. Objective Function

We can now combine the likelihood and the signal prior to define an optimization objective. Instead of maximizing the posterior probability Pr(x|y) \propto Pr(y|x) Pr(x), we minimize the negative log of Pr(y|x) Pr(x):

    -\log [Pr(y|x) Pr(x)] = -\log Pr(y|x) - \log Pr(x)    (13)

The negative log of the likelihood Pr(y|x) in (11) can be rewritten as:

    -\log Pr(y|x) = k \underbrace{(\log(1-p) - \log p)}_{\gamma} - N_1 \log(1-p)    (14)

Because the second term is a constant for fixed N_1 and p, we can ignore it during minimization. Together with the negative log of the prior Pr(x) in (12), we can write our objective function as follows:

    \min_x \; \gamma \|y - Dx\|_0 + \sigma_0 \, x^T L x + \sigma_1 \, x^T L^2 x    (15)
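A small sketch evaluating (15) for a candidate binary label vector follows, assuming p < 0.5 so that \gamma > 0; the helper name and its signature are ours, not the paper's:

```python
import numpy as np

def objective(x, y, D, p, L, sigma0, sigma1):
    """Evaluate Eq. (15) for a candidate label vector x in {-1, +1}^N."""
    gamma = np.log(1.0 - p) - np.log(p)     # gamma from Eq. (14); > 0 for p < 0.5
    fidelity = np.count_nonzero(y - D @ x)  # l0-norm ||y - Dx||_0
    return (gamma * fidelity
            + sigma0 * (x @ L @ x)          # signal smoothness, Eq. (3)
            + sigma1 * (x @ L @ L @ x))     # gradient smoothness, Eq. (8)
```

This is the quantity the initialization procedure of Section 5.3 checks when deciding whether to accept a single-entry flip.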
5. ALGORITHM DEVELOPMENT

5.1. Iterative Reweighted Least Squares Algorithm

To solve (15), we employ the following optimization strategy. We first replace the ℓ0-norm in (15) with a weighted ℓ2-norm:

    \min_x \; \gamma (y - Dx)^T U (y - Dx) + \sigma_0 \, x^T L x + \sigma_1 \, x^T L^2 x    (16)

where U is an N_1 x N_1 diagonal matrix with weights u_1, ..., u_{N_1} on its diagonal. In other words, the fidelity term is now a weighted sum of label differences: (y - Dx)^T U (y - Dx) = \sum_{i=1}^{N_1} u_i (y_i - x_i)^2. The weights u_i should be set so that the weighted ℓ2-norm mimics the ℓ0-norm. To accomplish this, we employ the iterative reweighted least squares (IRLS) strategy [4], which has been proven to have superlinear local convergence, and solve (16) iteratively, where the weights u_i^{(t+1)} of iteration t+1 are computed using the solution x^{(t)} of the previous iteration t, i.e.,

    u_i^{(t+1)} = \frac{1}{(y_i - x_i^{(t)})^2 + \epsilon}    (17)

for a small \epsilon > 0 to maintain numerical stability. Using this weight update, the weighted quadratic term (y - Dx)^T U (y - Dx) mimics the original ℓ0-norm \|y - Dx\|_0 in the original objective (15) when the solution x converges.

5.2. Closed-Form Solution per Iteration

For a given weight matrix U, the objective (16) is an unconstrained quadratic programming problem with three quadratic terms. One can thus derive a closed-form solution by taking the derivative with respect to x and equating it to zero, resulting in:

    x^* = (\gamma D^T U D + \sigma_0 L + \sigma_1 L^T L)^{-1} \gamma D^T U y    (18)

5.3. Initialization

In general, the IRLS strategy converges only to a local minimum, and thus it is important to start the algorithm with a good initial solution x^{(0)}. To initialize x^{(0)} so that u^{(1)} can be computed using (17), we perform the following initialization procedure.

1. Initialize x by thresholding the solution x^* of (18) with observed y, using the identity matrix I as the weight matrix U:

    x_i = \begin{cases} 1, & x_i^* > 0 \\ -1, & x_i^* < 0 \end{cases}    (19)

2. Identify the entry whose flip minimizes the signal smoothness prior, i.e.,

    i^* = \arg\min_i \; \sigma_0 \, (x^i)^T L x^i + \sigma_1 \, (x^i)^T L^2 x^i    (20)

where x^i is the label vector x with entry i flipped (i.e., 1 converted into -1 or vice versa).

3. If x^{i^*} results in a smaller objective function (15), set x to x^{i^*} and go to step 2. Otherwise stop.

The above initialization procedure exploits the fact that the ground-truth signal x contains binary labels, and thus each entry x_i deviates from the noise-corrupted y_i by at most 2. In subsequent iterations, the computed x is not restricted to be a binary vector, so that it can reflect confidence levels, as discussed earlier.

5.4. Interpreting the Computed Solution \hat{x}

After the IRLS algorithm converges to a solution \hat{x}, we interpret the classification results as follows. We threshold \hat{x} by a predefined value \tau to divide its entries into three parts, including a rejection option for ambiguous items:

    x_i = \begin{cases} 1, & \hat{x}_i > \tau \\ \text{Rejection}, & -\tau \leq \hat{x}_i \leq \tau \\ -1, & \hat{x}_i < -\tau \end{cases}    (21)

Note that a multiclass classification problem can be reduced to multiple binary classification problems via the one-vs.-rest rule or the one-vs.-one rule [19]; it can then be solved by applying our proposed graph-based binary classifier successively.
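To make Section 5 concrete, a minimal NumPy sketch of the overall solver follows. It is our own reading of (16)-(18), the weight update (17), initialization step 1 via (19), and the thresholding (21), with hypothetical parameter defaults; the greedy single-entry flipping of initialization steps 2-3 is omitted for brevity.

```python
import numpy as np

def irls_graph_classifier(y, D, L, p, sigma0, sigma1,
                          eps=1e-6, n_iter=50, tau=0.5):
    """Sketch of the IRLS solver of Section 5 (names and defaults are ours).

    y: length-N_1 observed noisy labels in {-1, +1}
    D: N_1 x N binary selection matrix of Eq. (11)
    L: N x N graph Laplacian from Section 4.1
    Returns the relaxed solution and hard labels with rejection (NaN).
    """
    gamma = np.log(1.0 - p) - np.log(p)   # Eq. (14); assumes p < 0.5
    S = sigma0 * L + sigma1 * (L.T @ L)   # fixed smoothness part of Eq. (18)

    def solve(U):
        # closed-form minimizer of Eq. (16), cf. Eq. (18)
        A = gamma * (D.T @ U @ D) + S
        return np.linalg.solve(A, gamma * (D.T @ (U @ y)))

    # Initialization, Section 5.3 step 1: threshold the U = I solution.
    x = np.sign(solve(np.eye(len(y))))
    x[x == 0] = 1.0
    # (Steps 2-3, greedy single-entry flips per Eqs. (15) and (20), omitted.)

    # IRLS iterations: reweight so the weighted l2 term mimics the l0 norm.
    for _ in range(n_iter):
        u = 1.0 / ((y - D @ x) ** 2 + eps)   # weight update, Eq. (17)
        x = solve(np.diag(u))

    # Interpretation with rejection option, Eq. (21); tau is user-chosen.
    hard = np.where(x > tau, 1.0, np.where(x < -tau, -1.0, np.nan))
    return x, hard
```

Consistent with Section 6 below, a larger sigma1 makes the relaxed solution smoother in gradient, so more entries fall inside the rejection band [-tau, tau].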

6. EXPERIMENTATION

6.1. Experimental Setup

Fig. 3. Examples of images in the gender classification dataset: (a) female 1, (b) female 2, (c) male 1, (d) male 2.

Fig. 4. Examples of face and non-face images: (a) face 1, (b) face 2, (c) non-face 1.

We tested our proposed algorithm against two schemes: (i) a more robust version of the famed AdaBoost called RobustBoost (http://arxiv.org/pdf/0905.2138.pdf) that claims robustness against label noise, and (ii) SVM with an RBF kernel.

The first dataset is a gender classification dataset consisting of 5300 images of the frontal faces of celebrities from the FaceScrub dataset (http://vintage.winklerbros.net/facescrub.html), where half of them are male and the other half are female. Example images from the dataset are shown in Fig. 3. We normalized the face images to 400 x 400 pixels and extracted space LBP features with a cell size of 5 x 5 pixels. To test the robustness of the different classification schemes, we randomly selected a portion of images from the training set and reversed their labels. All the classifiers were then trained using the same set of features and labels. The test set was classified by each classifier and the results were compared with the ground-truth labels. We also tested the same classifiers on a face detection dataset, which consists of 400 face images from the ORL face database provided by AT&T Cambridge Labs and 800 non-face images. We used half of the dataset (200 face images and 400 non-face images) as the training set and the other half as the test set. See Fig. 4 for example images.

Table 1. Classification error / rejection rate in gender detection for competing schemes under different training label error rates (training set: 1000/1000). Entries for our scheme are error % / rejection %; RobustBoost and SVM-RBF make hard decisions, so only error % is shown.

    % label noise          5%           10%          15%          25%
    Proposed (σ1 = 0)      0.07/1.59%   0.31/1.86%   1.4/3.60%    5.39/7.4%
    Proposed (σ1 = 0.5)    0.00/1.86%   0.3/2.98%    0.80/5.67%   2.47/8.51%
    Proposed (σ1 = 0.9)    0.00/2.30%   0.1/3.54%    0.49/6.73%   0.67/11.6%
    RobustBoost            4.0%         6.06%        9.74%        24.57%
    SVM-RBF                4.30%        7.84%        17.3%        40.43%

Table 2. Classification error / rejection rate on the face / non-face dataset for competing schemes under different training label error rates (training set: 300/300).

    % label noise          5%           10%          15%          25%
    Proposed (σ1 = 0)      0.00/0.48%   0.46/0.59%   0.86/0.87%   1.69/2.63%
    Proposed (σ1 = 0.5)    0.00/0.8%    0.00/1.03%   0.00/1.85%   0.77/3.34%
    Proposed (σ1 = 0.9)    0.00/0.91%   0.00/1.3%    0.00/2.41%   0.00/3.93%
    RobustBoost            2.3%         3.13%        4.04%        13.76%
    SVM-RBF                3.31%        5.39%        7.55%        30.68%

6.2. Experimental Results

The resulting classification error and rejection rates for the different classifiers are presented in Table 1, where the percentage of randomly flipped training labels ranges from 5% to 25%. In the experiments, we kept σ0 constant and varied σ1 to induce different rejection rates.
We observe that our graph-signal recovery scheme achieved a lower classification error than RobustBoost and SVM-RBF at all training label error rates. In particular, at the 25% label error rate, our proposal achieves very low error rates of 5.39%, 2.47% and 0.67% at the cost of rejection rates of 7.4%, 8.51% and 11.6%, respectively. In comparison, RobustBoost and SVM-RBF suffer severe classification error rates of 24.57% and 40.43%, respectively, much higher than even the sum of the error and rejection rates observed for our proposal. The results also show that by assigning a larger σ1, we can induce a lower classification error rate at the cost of a slightly higher rejection rate. In different applications, a user may define the desired classifier performance as a weighted sum of classification error and rejection rate, as done in [20]. Using our algorithm, a user can thus tune σ1 to adjust the trade-off between classification error and rejection rate.

The results for the face detection dataset are shown in Table 2. We observe similar trends: our proposed algorithm significantly outperforms RobustBoost and SVM-RBF in classification error rate.

7. CONCLUSION

Due to the sheer size of user-generated content in social media, label noise is unavoidable in the training data in a semi-supervised learning scenario. In this paper, we propose a new method for robust graph-based classifier learning via a graph-signal restoration formulation, where the desired signal (the label assignments) and its gradient are assumed to be smooth with respect to a properly constructed graph. We describe an iterative reweighted least squares (IRLS) algorithm to solve the problem efficiently. Experimental results show that our proposed algorithm noticeably outperforms both regular SVM and a noisy-label learning scheme in the literature.

8. REFERENCES

[1] A. Brew, D. Greene, and P. Cunningham, "The interaction between supervised learning and crowdsourcing," in Computational Social Science and the Wisdom of Crowds Workshop at NIPS, Whistler, Canada, December 2010.
[2] D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst, "The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains," IEEE Signal Processing Magazine, vol. 30, no. 3, pp. 83-98, May 2013.
[3] K. Bredies and M. Holler, "A TGV-based framework for variational image decompression, zooming and reconstruction. Part I: Analytics," SIAM Journal on Imaging Sciences, vol. 8, no. 4, 2015.
[4] I. Daubechies, R. DeVore, M. Fornasier, and S. Gunturk, "Iteratively re-weighted least squares minimization for sparse recovery," Communications on Pure and Applied Mathematics, vol. 63, no. 1, pp. 1-38, January 2010.
[5] B. Frenay and A. Kaban, "Editorial: Special issue on advances in learning with label noise," Neurocomputing, vol. 160, July 2015.
[6] M. Speriosu, N. Sudan, S. Upadhyay, and J. Baldridge, "Twitter polarity classification with label propagation over lexical links and the follower graph," in Conference on Empirical Methods in Natural Language Processing, Edinburgh, Scotland, July 2011.
[7] Y. Wang and A. Pal, "Detecting emotions in social media: A constrained optimization approach," in Twenty-Fourth International Joint Conference on Artificial Intelligence, Buenos Aires, Argentina, July 2015.
[8] A. Guillory and J. Bilmes, "Label selection on graphs," in Twenty-Third Annual Conference on Neural Information Processing Systems, Vancouver, Canada, December 2009.
[9] L. Zhang, C. Cheng, J. Bu, D. Cai, X. He, and T. Huang, "Active learning based on locally linear reconstruction," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 10, pp. 2026-2038, October 2011.
[10] S. Chen, A. Sandryhaila, J. Moura, and J. Kovacevic, "Signal recovery on graphs: Variation minimization," IEEE Transactions on Signal Processing, vol. 63, no. 17, pp. 4609-4624, September 2015.
[11] A. Gadde, A. Anis, and A. Ortega, "Active semi-supervised learning using sampling theory for graph signals," in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, August 2014.
[12] W. Hu, X. Li, G. Cheung, and O. Au, "Depth map denoising using graph-based transform and group sparsity," in IEEE International Workshop on Multimedia Signal Processing, Pula, Italy, October 2013.
[13] J. Pang, G. Cheung, W. Hu, and O. C. Au, "Redefining self-similarity in natural images for denoising using graph signal gradient," in APSIPA ASC, Siem Reap, Cambodia, December 2014.
[14] J. Pang, G. Cheung, A. Ortega, and O. C. Au, "Optimal graph Laplacian regularization for natural image denoising," in IEEE International Conference on Acoustics, Speech and Signal Processing, Brisbane, Australia, April 2015.
[15] S. K. Narang, A. Gadde, E. Sanou, and A. Ortega, "Localized iterative methods for interpolation in graph structured data," in Symposium on Graph Signal Processing, IEEE Global Conference on Signal and Information Processing (GlobalSIP), Austin, TX, December 2013.
[16] S. K. Narang, A. Gadde, and A. Ortega, "Signal processing techniques for interpolation of graph structured data," in IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, Canada, May 2013.
[17] P. Wan, G. Cheung, D. Florencio, C. Zhang, and O. Au, "Image bit-depth enhancement via maximum-a-posteriori estimation of graph AC component," in IEEE International Conference on Image Processing, Paris, France, October 2014.
[18] X. Liu, G. Cheung, X. Wu, and D. Zhao, "Inter-block soft decoding of JPEG images with sparsity and graph-signal smoothness priors," in IEEE International Conference on Image Processing, Quebec City, Canada, September 2015.
[19] C. M. Bishop, Pattern Recognition and Machine Learning, Springer-Verlag New York, 2007.
[20] C. Chow, "On optimum recognition error and reject tradeoff," IEEE Transactions on Information Theory, vol. 16, pp. 41-46, January 1970.