A Bilinear Model for Sparse Coding

David B. Grimes and Rajesh P. N. Rao
Department of Computer Science and Engineering
University of Washington
Seattle, WA 98195-2350, U.S.A.
{grimes,rao}@cs.washington.edu

Abstract

Recent algorithms for sparse coding and independent component analysis (ICA) have demonstrated how localized features can be learned from natural images. However, these approaches do not take image transformations into account. As a result, they produce image codes that are redundant because the same feature is learned at multiple locations. We describe an algorithm for sparse coding based on a bilinear generative model of images. By explicitly modeling the interaction between image features and their transformations, the bilinear approach helps reduce redundancy in the image code and provides a basis for transformation-invariant vision. We present results demonstrating bilinear sparse coding of natural images. We also explore an extension of the model that can capture spatial relationships between the independent features of an object, thereby providing a new framework for parts-based object recognition.

1 Introduction

Algorithms for redundancy reduction and efficient coding have been the subject of considerable attention in recent years [6, 3, 4, 7, 9, 5, 11]. Although the basic ideas can be traced to the early work of Attneave [1] and Barlow [2], recent techniques such as independent component analysis (ICA) and sparse coding have helped formalize these ideas and have demonstrated the feasibility of efficient coding through redundancy reduction. These techniques produce an efficient code by attempting to minimize the dependencies between elements of the code using appropriate constraints. One of the most successful applications of ICA and sparse coding has been in the area of image coding. Olshausen and Field showed that sparse coding of natural images produces localized, oriented basis filters that resemble the receptive fields of simple cells in primary visual cortex [6, 7]. Bell and Sejnowski obtained similar results using their algorithm for ICA [3].
However, these approaches do not take image transformations into account. As a result, the same oriented feature is often learned at different locations, yielding a redundant code. Moreover, the presence of the same feature at multiple locations prevents more complex features from being learned and leads to a combinatorial explosion when one attempts to scale the approach to large image patches or hierarchical networks.

In this paper, we propose an approach to sparse coding that explicitly models the interaction between image features and their transformations. A bilinear generative model is used to learn both the independent features in an image as well as their transformations. Our approach extends Tenenbaum and Freeman's work on bilinear models for learning content and style [12] by casting the problem within a probabilistic sparse coding framework. Thus, whereas prior work on bilinear models used global decomposition methods such as SVD, the approach presented here emphasizes the extraction of local features by removing higher-order redundancies through sparseness constraints. We show that for natural images, this approach produces localized, oriented filters that can be translated by different amounts to account for image features at arbitrary locations. Our results demonstrate how an image can be factored into a set of basic local features and their transformations, providing a basis for transformation-invariant vision. We conclude by discussing how the approach can be extended to allow parts-based object recognition, wherein an object is modeled as a collection of local features (or "parts") and their relative transformations.

2 Bilinear Generative Models

We begin by considering the standard linear generative model used in algorithms for ICA and sparse coding [3, 7, 9]:

    z = \sum_{i=1}^{k} w_i x_i                                                (1)

where z is an n-dimensional input vector (e.g. an image), each w_i is an n-dimensional basis vector, and x_i is its scalar coefficient. Given the linear generative model above, the goal of ICA is to learn the basis vectors w_i such that the x_i are as independent as possible, while the goal in sparse coding is to make the distribution of the x_i highly kurtotic given Equation 1.

In the bilinear model of Equation 2, the coefficients x_i and y_j jointly modulate a set of basis vectors w_ij to produce an input vector z. For the present study, the coefficient x_i can be regarded as encoding the presence of object feature i in the image, while the y_j values determine the transformation present in the image. In the terminology of Tenenbaum and Freeman [12], x describes the "content" of the image while y encodes its "style."
The linear generative model in Equation 1 can be extended to the bilinear case by using two independent sets of coefficients x_i and y_j (or equivalently, two vectors x and y) [12]:

    z = \sum_{i=1}^{k} \sum_{j=1}^{m} w_{ij} x_i y_j                          (2)

Equation 2 can also be expressed as a linear equation in x for a fixed y:

    z = \sum_{i=1}^{k} \Big( \sum_{j=1}^{m} w_{ij} y_j \Big) x_i = \sum_{i=1}^{k} w_i^y x_i          (3)

Likewise, for a fixed x, one obtains a linear equation in y. Indeed, this is the definition of bilinear: given one fixed factor, the model is linear with respect to the other factor. The power of bilinear models stems from the rich non-linear interactions that can be represented by varying both x and y simultaneously.

3 Learning Sparse Bilinear Models

3.1 Learning Bilinear Models

Our goal is to learn from image data an appropriate set of basis vectors w_ij that effectively describe the interactions between the feature vector x and the transformation vector y.
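As a concrete illustration, the generative computation in Equations 2 and 3 can be sketched in a few lines of NumPy. This is our own minimal sketch, not the authors' implementation; the tensor shapes and function names are assumptions.

```python
import numpy as np

def bilinear_generate(w, x, y):
    """Generate an image z = sum_ij w_ij * x_i * y_j (Equation 2).

    w : array of shape (k, m, n) -- basis vectors w_ij, each n-dimensional
    x : array of shape (k,)      -- feature ("content") coefficients
    y : array of shape (m,)      -- transformation ("style") coefficients
    """
    # Contract over the feature index i and the transformation index j.
    return np.einsum('ijn,i,j->n', w, x, y)

def effective_basis(w, y):
    """For a fixed y the model is linear in x (Equation 3):
    z = sum_i w_i^y x_i, where w_i^y = sum_j w_ij y_j."""
    return np.einsum('ijn,j->in', w, y)   # shape (k, n)
```

Multiplying the effective basis by x reproduces the full bilinear sum, which is exactly the "linear for one fixed factor" property the text describes.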
A commonly used approach in unsupervised learning is to minimize the sum of squared pixel-wise errors over all images:

    E = \sum_{images} \| z - \hat{z} \|^2                                     (4)

    \hat{z} = \sum_{i=1}^{k} \sum_{j=1}^{m} w_{ij} x_i y_j                    (5)

where \|\cdot\| denotes the L2 norm of a vector. A standard approach to minimizing such a function is to use gradient descent and alternate between minimization with respect to (x, y) and minimization with respect to w. Unfortunately, the optimization problem as stated is underconstrained. The function has many local minima, and results from our simulations indicate that convergence is difficult in many cases. There are many different ways to represent an image, making it difficult for the method to converge to a basis set that can generalize effectively.

A related approach is presented by Tenenbaum and Freeman [12]. Rather than using gradient descent, their method estimates the parameters directly by computing the singular value decomposition (SVD) of a matrix containing input data corresponding to each content class in every style. Their approach can be regarded as an extension of methods based on principal component analysis (PCA) applied to the bilinear case. The SVD approach avoids the difficulties of convergence that plague the gradient descent method and is much faster in practice. Unfortunately, the learned features tend to be global and non-localized, similar to those obtained from PCA-based methods based on second-order statistics. As a result, the method is unsuitable for the problem of learning local features of objects and their transformations.

The underconstrained nature of the problem can be remedied by imposing constraints on x and y. In particular, we could cast the problem within a probabilistic framework and impose specific prior distributions on x and y with higher probabilities for values that achieve certain desirable properties.
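For contrast, the SVD-based estimation used by Tenenbaum and Freeman can be sketched as follows. This is an illustrative simplification of their asymmetric-model fit; the matrix layout and function names are our assumptions, not theirs.

```python
import numpy as np

def fit_bilinear_svd(Z, k):
    """Illustrative SVD fit in the spirit of Tenenbaum & Freeman's
    asymmetric bilinear model.

    Z : (n_styles * dim, n_contents) matrix; column c stacks the
        observation of content c rendered in every style.
    k : number of content dimensions to retain.
    Returns A (style-specific bases) and X (content codes), with Z ~= A @ X.
    """
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    A = U[:, :k] * s[:k]   # absorb singular values into the bases
    X = Vt[:k, :]          # one content vector per column
    return A, X
```

Like PCA, this fit uses only second-order statistics, which is why the recovered basis vectors tend to be global rather than localized.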
We focus here on the class of sparse prior distributions for several reasons: (a) by forcing most of the coefficients to be zero for any given input, sparse priors minimize redundancy and encourage statistical independence between the various x_i and between the various y_j [7]; (b) there is growing evidence for sparse representations in the brain: the distribution of neural responses in visual cortical areas is highly kurtotic, i.e. a cell exhibits little activity for most inputs but responds vigorously for a few inputs, yielding a distribution with a high peak near zero and long tails; (c) previous approaches based on sparseness constraints have obtained encouraging results [7]; and (d) enforcing sparseness on the x_i encourages the parts and local features shared across objects to be learned, while imposing sparseness on the y_j allows object transformations to be explained in terms of a small set of basic transformations.

3.2 Bilinear Sparse Coding

We assume the following priors for x and y:

    P(x_i) = \frac{1}{Z_\alpha} e^{-\alpha S(x_i)}                            (6)

    P(y_j) = \frac{1}{Z_\beta} e^{-\beta S(y_j)}                              (7)

where Z_\alpha and Z_\beta are normalization constants, \alpha and \beta are parameters that control the degree of sparseness, and S is a "sparseness function." For this study, we used S(a) = \log(1 + a^2).
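The effect of such a prior is easy to check numerically. The sketch below assumes the choice S(a) = log(1 + a^2); the function names are ours. It shows that the corresponding penalty favors a single large coefficient over many small ones.

```python
import numpy as np

def S(a):
    """Sparseness function S(a) = log(1 + a^2), giving a heavy-tailed prior."""
    return np.log1p(np.asarray(a, dtype=float) ** 2)

def neg_log_prior(coeffs, alpha):
    """-log P(coeffs), up to an additive constant, for P(a) ~ exp(-alpha * S(a))."""
    return alpha * np.sum(S(coeffs))
```

For example, a code with one active coefficient of magnitude 10 incurs a smaller penalty than a code spreading activity over ten coefficients of magnitude 1, which is exactly the bias toward sparse, kurtotic codes described above.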
Within a probabilistic framework, the squared error function summed over all images can be interpreted as the negative log likelihood of the data given the parameters, -\log P(z \mid w, x, y) (see, for example, [7]). The priors P(x) and P(y) can be used to marginalize this likelihood to obtain the new likelihood function P(z \mid w). The goal then is to find the w_{ij} that maximize P(z \mid w), or equivalently, minimize its negative log. Under certain reasonable assumptions (discussed in [7]), this is equivalent to minimizing the following optimization function over all input images:

    E_1 = \Big\| z - \sum_{i=1}^{k} \sum_{j=1}^{m} w_{ij} x_i y_j \Big\|^2 + \alpha \sum_i S(x_i) + \beta \sum_j S(y_j)          (8)

Gradient descent can be used to derive update rules for the components x_i and y_j of the feature vector and transformation vector respectively for any image, assuming a fixed basis w (with \hat{z} denoting the reconstruction \sum_{ij} w_{ij} x_i y_j and constant factors absorbed into the step size):

    \Delta x_i \propto (z - \hat{z})^T \Big( \sum_j w_{ij} y_j \Big) - \alpha S'(x_i)          (9)

    \Delta y_j \propto (z - \hat{z})^T \Big( \sum_i w_{ij} x_i \Big) - \beta S'(y_j)           (10)

Given a training set of inputs, the values for x and y for each image after convergence can be used to update the basis set w in batch mode according to:

    \Delta w_{ij} \propto (z - \hat{z}) x_i y_j                               (11)

As suggested by Olshausen and Field [7], in order to keep the basis vectors from growing without bound, we adapted the L2 norm of each basis vector in such a way that the variances of the x_i and y_j were maintained at a fixed desired level.

4 Results

4.1 Training Paradigm

We tested the algorithms for bilinear sparse coding on natural image data. The natural images we used are distributed by Olshausen and Field [7], along with the code for their algorithm. The training set consisted of patches randomly extracted from ten source images. The images are pre-whitened to equalize large variances in frequency and thus speed convergence. We chose to use a complete basis, and we let m be at least as large as the number of transformations (including the no-transformation case). The sparseness parameters \alpha and \beta were held at fixed values. In order to assist convergence, all learning occurs in batch mode over batches of image patches, with a fixed step size for gradient descent using Equation 11. The transformations were chosen to be 2D translations in the range [-3, 3] pixels along both axes.
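The alternating optimization described above (Equations 8 through 11) can be sketched as follows. This is a minimal illustration assuming S(a) = log(1 + a^2) and hypothetical learning rates, not the authors' implementation; constant factors from the gradients are absorbed into eta.

```python
import numpy as np

def s_prime(a):
    # Derivative of the sparseness function S(a) = log(1 + a**2).
    return 2.0 * a / (1.0 + a ** 2)

def reconstruct(w, x, y):
    # z_hat = sum_ij w_ij * x_i * y_j  (Equation 2); w has shape (k, m, n).
    return np.einsum('ijn,i,j->n', w, x, y)

def infer_step(z, w, x, y, alpha, beta, eta):
    """One gradient step on x and y for a single image (cf. Equations 9-10)."""
    r = z - reconstruct(w, x, y)                          # residual image
    gx = np.einsum('n,ijn,j->i', r, w, y) - alpha * s_prime(x)
    gy = np.einsum('n,ijn,i->j', r, w, x) - beta * s_prime(y)
    return x + eta * gx, y + eta * gy

def basis_step(z, w, x, y, eta):
    """Gradient step on the basis once x and y have settled (cf. Equation 11)."""
    r = z - reconstruct(w, x, y)
    return w + eta * np.einsum('n,i,j->ijn', r, x, y)

def objective(z, w, x, y, alpha, beta):
    # E_1 of Equation 8, with S(a) = log(1 + a^2).
    r = z - reconstruct(w, x, y)
    return float(r @ r + alpha * np.sum(np.log1p(x ** 2))
                       + beta * np.sum(np.log1p(y ** 2)))
```

In practice one would iterate infer_step to convergence for each patch, then apply basis_step over the batch, mirroring the alternating scheme in the text.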
The style/content separation was enforced by learning a single x vector to describe an image patch regardless of its translation, and likewise a single y vector to describe a particular style given any image patch content.

4.2 Bilinear Sparse Coding of Natural Images

Figure 1 shows the results of training on natural image data. A comparison between the learned features for the linear generative model (Equation 1) and the bilinear model is provided in Figure 1(a).

Figure 1: Representing natural images and their transformations with a sparse bilinear model. (a) A comparison of learned features between a standard linear model and a bilinear model, both trained with the same sparseness priors. The two rows for the bilinear case depict the translated object features w_i^y (see Equation 3) for translations of -3 to +3 pixels. (b) The representation of an example natural image patch, and of the same patch translated to the left. Note that the bar plot representing the x vector is indeed sparse, having only three significant coefficients. The code for the style vectors y for both the canonical patch and the translated one is likewise sparse. The basis images w_ij are shown for those dimensions which have non-zero coefficients for x_i or y_j.

Although both models show simple, localized, and oriented features, the bilinear method is able to model the same features under different transformations. In this case, horizontal translations in the range [-3, 3] pixels were used in the training of the bilinear model. Figure 1(b) provides an example of how the bilinear sparse coding model encodes a natural image patch and the same patch after it has been translated. Note that both the x and y vectors are sparse. Figure 2 shows how the model can account for a given localized feature at different locations by varying the y vector. As shown in the last column of the figure, the translated local feature is generated by linearly combining a sparse set of basis vectors w_ij.

4.3 Towards Parts-Based Object Recognition

The bilinear generative model in Equation 2 uses the same set of transformation values y_j for all the features i = 1, ..., k. Such a model is appropriate for global transformations
Figure 2: Translating a learned feature to multiple locations. The two rows of eight images represent the individual basis vectors w_ij for two values of i (features x_57 and x_32). The y values for two selected transformations for each feature are shown as bar plots. y(a, b) denotes a translation of (a, b) pixels in the Cartesian plane. The last column shows the resulting basis vectors after translation.

that apply to an entire image region, such as a shift of an image patch by a given number of pixels or a global illumination change.

Consider the problem of representing an object in terms of its constituent parts. In this case, we would like to be able to transform each part independently of the other parts in order to account for the location, orientation, and size of each part in the object image. The standard bilinear model can be extended to address this need as follows:

    z = \sum_{i=1}^{k} \sum_{j=1}^{m} w_{ij} x_i y_j^i                        (12)

Note that each object feature x_i now has its own set of transformation values y_j^i for all j. The double summation is thus no longer symmetric. Also note that the standard model (Equation 2) is a special case of Equation 12 where y^i = y for all i.

We have conducted preliminary experiments to test the feasibility of Equation 12 using a set of object features learned for the standard bilinear model. Figure 3 shows the results. These results suggest that allowing independent transformations for the different features provides a rich substrate for modeling images and objects in terms of a set of local features (or parts) and their individual transformations.

5 Summary and Conclusion

A fundamental problem in vision is to simultaneously recognize objects and their transformations [8, 10]. Bilinear generative models provide a tractable way of addressing this problem by factoring an image into object features and transformations using a bilinear equation. Previous approaches used unconstrained bilinear models and produced global basis vectors for image representation [12].
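The change from the standard model to the extended model of Equation 12 amounts to replacing the single transformation vector with a matrix holding one transformation vector per feature. A minimal sketch (our own shapes and names, not the authors' code):

```python
import numpy as np

def parts_generate(w, x, Y):
    """Extended bilinear model (Equation 12): each feature i carries its own
    transformation vector y^i.

    w : (k, m, n) basis tensor
    x : (k,)      feature coefficients
    Y : (k, m)    matrix whose row i is the transformation vector for feature i
    """
    return np.einsum('ijn,i,ij->n', w, x, Y)
```

When all rows of Y are identical, the double summation becomes symmetric again and the standard model of Equation 2 is recovered, consistent with the special case noted above.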
In contrast, recent research on image coding has stressed the importance of localized, independent features derived from metrics that emphasize the higher-order statistics of inputs [6, 3, 7, 5]. This paper introduces a new probabilistic framework for learning bilinear generative models based on the idea of sparse coding. Our results demonstrate that bilinear sparse coding of natural images produces localized, oriented basis vectors that can simultaneously represent features in an image and their transformation. We showed how the learned generative model can be used to translate a
Figure 3: Modeling independently transformed features. (a) The standard bilinear method of generating a translated feature by combining basis vectors using the same set of y values for two different features (x_57 and x_81). (b) Four examples of images generated by allowing different values of y for the two different features. Note the significant differences between the resulting images, which cannot be obtained using the standard bilinear model.

basis vector to different locations, thereby reducing the need to learn the same basis vector at multiple locations as in traditional sparse coding methods. We also proposed an extension of the bilinear model that allows each feature to be transformed independently of the other features. Our preliminary results suggest that such an approach could provide a flexible platform for adaptive parts-based object recognition, wherein objects are described by a set of independent, shared parts and their transformations. The importance of parts-based methods has long been recognized in object recognition in view of their ability to handle a combinatorially large number of objects by combining parts and their transformations. Few methods, if any, exist for learning representations of object parts and their transformations directly from images. Our ongoing efforts are therefore focused on deriving efficient algorithms for parts-based object recognition based on the combination of bilinear models and sparse coding.
Acknowledgments

This research is supported by NSF grant no. 133592 and a Sloan Research Fellowship to RPNR.

References

[1] F. Attneave. Some informational aspects of visual perception. Psychological Review, 61(3):183-193, 1954.

[2] H. B. Barlow. Possible principles underlying the transformation of sensory messages. In W. A. Rosenblith, editor, Sensory Communication, pages 217-234. Cambridge, MA: MIT Press, 1961.

[3] A. J. Bell and T. J. Sejnowski. The independent components of natural scenes are edge filters. Vision Research, 37(23):3327-3338, 1997.

[4] G. E. Hinton and Z. Ghahramani. Generative models for discovering sparse distributed representations. Philosophical Transactions of the Royal Society B, 352:1177-1190, 1997.

[5] M. S. Lewicki and T. J. Sejnowski. Learning overcomplete representations. Neural Computation, 12(2):337-365, 2000.

[6] B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381:607-609, 1996.

[7] B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37:3311-3325, 1997.

[8] R. P. N. Rao and D. H. Ballard. Development of localized oriented receptive fields by learning a translation-invariant code for natural images. Network: Computation in Neural Systems, 9(2):219-234, 1998.

[9] R. P. N. Rao and D. H. Ballard. Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2(1):79-87, 1999.

[10] R. P. N. Rao and D. L. Ruderman. Learning Lie groups for invariant visual perception. In Advances in Neural Information Processing Systems 11, pages 810-816. Cambridge, MA: MIT Press, 1999.

[11] O. Schwartz and E. P. Simoncelli. Natural signal statistics and sensory gain control. Nature Neuroscience, 4(8):819-825, August 2001.

[12] J. B. Tenenbaum and W. T. Freeman. Separating style and content with bilinear models. Neural Computation, 12(6):1247-1283, 2000.