The Rate Adapting Poisson Model for Information Retrieval and Object Recognition


Peter V. Gehler, Max Planck Institute for Biological Cybernetics, Spemannstrasse 38, Tübingen, Germany
Alex D. Holub, Department of Electrical Engineering, California Institute of Technology, MC, Pasadena, CA, USA
Max Welling, Bren School of Information and Computer Science, University of California Irvine, CA, USA

Abstract

Probabilistic modelling of text data in the bag-of-words representation has been dominated by directed graphical models such as PLSI, LDA, NMF, and discrete PCA. Recently, state of the art performance on visual object recognition has also been reported using variants of these models. We introduce an alternative undirected graphical model suitable for modelling count data. This Rate Adapting Poisson (RAP) model is shown to generate superior dimensionally reduced representations for subsequent retrieval or classification. Models are trained using contrastive divergence, while inference of latent topical representations is efficiently achieved through a simple matrix multiplication.

1. Introduction and Context

The dominant paradigm for modelling histogram data is the extraction of latent semantic structure, often referred to as topics. Text data, for example, can be represented as word counts for a given dictionary, a representation referred to as bag of words. For image data there exists an analogue, the so-called bag of features representation, which can be thought of as count data of visual words. Latent variable models determine a mapping from such count data to a compressed latent representation. This representation can subsequently be used to improve document retrieval and classification performance.

The simplest models assign each document to a single cluster a priori. However, it has been recognized that distributed latent representations are superior. For instance, a simple singular value decomposition of the count matrix, known as latent semantic indexing (LSI), is quite successful in extracting semantic structure (Deerwester et al., 1990). A probabilistic extension of this idea was introduced by Hofmann (1999) as probabilistic latent semantic indexing (PLSI). By realizing that PLSI is not a proper generative model at the level of documents, a further extension, latent Dirichlet allocation (LDA), was introduced by Blei et al. (2003). As pointed out by Buntine and Jakulin (2004), the basic architecture of LDA is known under various names such as admixtures, grade of membership model, multiple aspect model and multinomial PCA. These authors also extend LDA to a still broader class of models known as discrete PCA.

These models can be characterized along a number of dimensions. Firstly, they represent a subset of the class of directed graphical models, or approximations thereof. Directed models share certain properties, such as the phenomenon of explaining away (given an observation on a child node, its parents become dependent) and easy ancestral sampling. As shown in (Buntine, 2002; Girolami & Kaban, 2003), the growth of the number of parameters with the number of training documents for PLSI can be understood as variational EM learning of an LDA model, where for each training document the true posterior is replaced with a point estimate. This insight also relates non-negative matrix factorization (Lee & Seung, 1999) to the Gamma-Poisson model (Buntine & Jakulin, 2004) in a similar manner.
More sophisticated approximations to the intractable inference problem have also been studied in the literature: a structured mean field approximation (Blei & Jordan, 2004), expectation propagation (Minka & Lafferty, 2002) and a collapsed Gibbs sampler (Griffiths & Steyvers, 2002).

There is also another property that characterizes these models. Models such as LDA, PLSI, Gamma-Poisson models and NMF (in fact all discrete PCA models and their variational approximations) combine topics in the probability domain. For instance, in LDA we generate a probability vector $\theta$ with $\sum_j \theta_j = 1$ from a Dirichlet distribution and linearly combine these probabilities using a stochastic matrix $M$. Each column of $M$ represents a discrete distribution over words for topic $j$, and a document is modelled as $N_{doc}$ samples from the linear combination $p_i = \sum_j M_{ij}\theta_j$.

However, we can also take linear combinations in the log-probability domain. Exponential family PCA (EPCA) represents an example of this class of models. In fact, we can think of EPCA exactly as a variational approximation (again using point estimates) of a model with conditional distributions in the exponential family and a flat (constant) prior. Special cases include PCA as the variational approximation of factor analysis (or probabilistic PCA) (Roweis, 1997), and the sparse coding algorithm of Olshausen and Field (1997) is a variational approximation of ICA.

Exponential family harmoniums, introduced by Welling et al. (2004), can be understood as undirected probabilistic models which linearly combine topics in the log-probability domain. The undirected semantics of this model has interesting consequences. Most importantly, the latent variables are conditionally independent given the data, and vice versa. This is in stark contrast to the marginal independence of the latent variables in directed models. The implication is that the mapping from input space to latent space is given by a single matrix multiplication, possibly followed by a componentwise nonlinearity. For applications such as information retrieval and object recognition, where speed is of the essence, this is a very useful property. We note that harmoniums also generate distributed latent representations.

An interesting explanation for the improved retrieval and classification results using harmoniums was given in Xing et al. (2005). These authors observe that harmoniums mix their topics using a very different mechanism than LDA. This has important consequences, in particular for low count values. If a word appears only once in a document, LDA assumes a priori that this word is generated by a single topic, an assumption not made by harmoniums. In some sense, the simpler inference in harmoniums is traded off against more difficult learning due to the presence of an intractable normalization constant which depends on the parameters of the model. However, harmoniums are designed to take advantage of contrastive divergence learning (Hinton, 2002), which has been shown to be an efficient algorithm that scales well to large problems.

Figure 1. Markov random field representation of the RAP model. Top-layer nodes represent binomial hidden variables h while bottom-layer nodes represent Poisson visible variables x.

2. The Rate Adapting Poisson Model

The Rate Adapting Poisson (RAP) model follows the general architecture of an exponential family harmonium (Welling et al., 2004). The RAP model is different from the undirected probabilistic latent semantic indexing (UPLSI) model presented in Welling et al. (2004), which uses a multinomial conditional distribution over the observed variables. That choice results in a large array $W_{ija}$ of coupling parameters between topics and observed variables, with a separate entry for every count-level $a$, and renders the model practical only for observations with very few states (e.g. binary). The RAP model is more economical in its use of parameters, coupling topics to counts using a conditional Poisson distribution involving a single matrix $W_{ij}$.
This change has made the experiments in sections 3 and 4 possible.

2.1. RAP: Generative Model

A harmonium can be specified by writing down two consistent conditional distributions in the exponential family. For RAP, we use conditional Poisson distributions for the observed count data and conditional binomial distributions for the latent topic variables,

$$p(x_i \mid h) = \mathrm{Poisson}_{x_i}\Big[\log(\lambda_i) + \sum_j W_{ij} h_j\Big] \qquad (1)$$

$$p(h_j \mid x) = \mathrm{Binomial}_{h_j}\Big[\sigma\Big(\log\Big(\tfrac{p_j}{1-p_j}\Big) + \sum_i W_{ij} x_i\Big);\, M_j\Big] \qquad (2)$$

where $\sigma(x) = 1/(1+e^{-x})$ is the sigmoid function, $\lambda_i$ is the mean rate of the conditional Poisson distribution for word $i$, $p_j$ is the probability of success and $M_j$ the total number of samples for the conditional binomial distribution for topic $j$, $x$ is the count vector, $h$ a discrete topic vector and $W$ the interaction between topics and counts. From these equations it can be seen that the values of the variables in the opposite layer shift the canonical parameters of the variables in the layer under consideration. It is due to this behavior that we named the model "rate adapting". Note also that all variables are conditionally independent given values for the variables in the opposite layer.
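To make the layer-wise structure concrete, here is a minimal NumPy sketch of the two conditionals (1) and (2): all units in one layer are sampled in parallel given the other layer. Names, shapes and parameter values (W, log_lam, log_odds, M_j) are illustrative assumptions, not the released Matlab code.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_visible(h, log_lam, W):
    """Eq. (1): x_i ~ Poisson(exp(log(lambda_i) + sum_j W_ij h_j))."""
    return rng.poisson(np.exp(log_lam + W @ h))

def sample_hidden(x, log_odds, W, M_j):
    """Eq. (2): h_j ~ Binomial(M_j, sigma(log(p_j/(1-p_j)) + sum_i W_ij x_i))."""
    return rng.binomial(M_j, sigmoid(log_odds + W.T @ x))

V, K = 1000, 10                          # vocabulary size, number of topics
W = 0.01 * rng.standard_normal((V, K))   # topic-word interactions W_ij
log_lam = np.zeros(V)                    # log mean rates, one per word
log_odds = np.zeros(K)                   # log(p_j / (1 - p_j)), one per topic
M_j = 5                                  # binomial trials per topic

x = rng.poisson(1.0, size=V)             # a toy count vector
h = sample_hidden(x, log_odds, W, M_j)   # one parallel update of the top layer
x_new = sample_visible(h, log_lam, W)    # one parallel update of the bottom layer
```

Alternating these two calls is exactly the blockwise Gibbs sampler referred to below.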

These two conditional distributions are consistent with the joint distribution over $\{x, h\}$ defined through $p(x,h) = \exp[f(x,h)]/Z$ with

$$f(x,h) = \sum_i \Big(\log(\lambda_i)\, x_i - \log(x_i!)\Big) + \sum_j \Big(\log\Big(\tfrac{p_j}{1-p_j}\Big)\, h_j - \log(h_j!) - \log((M_j - h_j)!)\Big) + \sum_{ij} W_{ij}\, x_i\, h_j \qquad (3)$$

where we have not written any terms that do not explicitly depend on a random variable. The two-layer undirected architecture of this model is shown in figure 1. Samples from the model can be obtained efficiently by Gibbs sampling because all variables in a layer can be sampled in parallel given the values for the variables of the opposite layer, and vice versa. To find the most likely variable assignments one can locate modes of the distribution by iterating the equations,

$$x_i^{\mathrm{mode}} = \exp\Big(\log(\lambda_i) + \sum_j W_{ij}\, h_j^{\mathrm{mode}}\Big) \qquad (4)$$

$$h_j^{\mathrm{mode}} = (M_j + 1)\,\sigma\Big(\log\Big(\tfrac{p_j}{1-p_j}\Big) + \sum_i W_{ij}\, x_i^{\mathrm{mode}}\Big) \qquad (5)$$

The RAP model can also be represented as a factor graph (see figure 2) by marginalizing out the latent variables,

$$p(x) \propto \exp\Big[\sum_i \Big(\log(\lambda_i)\, x_i - \log(x_i!)\Big) + \sum_j M_j \log\Big(1 + \exp\Big(\sum_i W_{ij} x_i - \beta_j\Big)\Big)\Big] \qquad (6)$$

where we have abbreviated $\beta_j = -\log[p_j/(1-p_j)]$. We can read off the word-factors from this expression as $F_i(x_i) = \lambda_i^{x_i}/x_i!$ for each variable, and the topic-factors as $F_j(x) = \exp(M_j \log(1 + \exp(w_j^T x - \beta_j)))$. Note that the factors $F_j(x)$ are functions of all the variables $x$ jointly.

Figure 2. Factor graph representation of the marginalized RAP model. Square boxes indicate word and topic factors.

Figure 3. The hinge nonlinearity of a topic factor in the log domain.

The nonlinearity for a topic-factor in the log domain is precisely given by the hinge function (see figure 3). Hence, a topic factor does not contribute to the probability distribution (i.e. $F_j(x) = 1$) if the input count vector $x$ is not well aligned with the topic vector $w_j = \{W_{ij}\}_i$. A threshold $\beta_j$ determines what it means to be "well aligned": if $w_j^T x \ll \beta_j$ then the factor does not contribute. On the other hand, if $w_j^T x \gg \beta_j$ then the hinge function is linear, and hence those factors modulate the log Poisson rates as follows,

$$\log(\lambda_i) \rightarrow \log(\lambda_i) + \sum_j I(w_j^T x > \beta_j)\, M_j\, W_{ij} \qquad (7)$$

where $I(\cdot)$ is the indicator function. Clearly, this is an approximation, because there is in fact a soft transition between the two regimes of the hinge function¹. However, it clarifies the role of the weight matrix as a collection of topic vectors that form a new low dimensional basis for the latent representation. Count vectors get mapped into latent topic space by computing their coordinates in this basis as $h_j = w_j^T x$. The thresholds $\beta$ then decide on the necessary magnitude of these coordinates before they will have an impact on the Poisson rates. We note that there are in fact $2^K$ (with $K$ the total number of topics) different ways to modulate the log Poisson rates, because there are $2^K$ subsets of $\{1,\ldots,K\}$.

¹ This approximation is expected to be accurate when all parameter values {W, β} are large.

Empirically we have found that the best performance in terms of retrieval and classification is obtained when the angle between latent coordinates is used as a measure of similarity: $K(x_n, x_m) = \cos(h_n, h_m)$. This is not surprising, as we expect that the length of a document roughly scales the count vector linearly, assuming its topical content does not change.
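The fast inference claim above can be stated in a few lines: mapping a corpus into topic space is a single matrix product, and retrieval similarity is the cosine between latent coordinates. A minimal sketch, with illustrative names and shapes (documents as rows of a count matrix X):

```python
import numpy as np

def latent_map(X, W):
    """Map count vectors (rows of X, n_docs x V) to topic space, h_j = w_j^T x."""
    return X @ W                                      # n_docs x K coordinates

def cosine_kernel(H_query, H_corpus):
    """K(x_n, x_m) = cosine of the angle between latent coordinates h_n, h_m."""
    Hq = H_query / np.linalg.norm(H_query, axis=1, keepdims=True)
    Hc = H_corpus / np.linalg.norm(H_corpus, axis=1, keepdims=True)
    return Hq @ Hc.T                                  # n_query x n_corpus similarities
```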

2.2. RAP: Invariant Transformations

The marginal distribution $p(x)$ in equation (6) is given as a product of factors, where each factor follows the general form $\log(1+e^z)$. It is not hard to check that the identity $\log(1+e^z) = z + \log(1+e^{-z})$ holds, which has the consequence that we can change parameters without affecting the model. In other words, the parameters are not identifiable in the current parameterization. If we define an arbitrary subset $S$ of the integers $\{1,\ldots,K\}$, then the following transformations, when executed jointly, do not change the RAP model,

$$\log(\lambda_i) \rightarrow \log(\lambda_i) + \sum_{j \in S} M_j\, W_{ij} \qquad (8)$$

$$W_{ij} \rightarrow -W_{ij}, \qquad \beta_j \rightarrow -\beta_j \qquad \forall\, j \in S. \qquad (9)$$

Fortunately, it is easy to fix the spurious degrees of freedom by choosing for instance $w_j^T x > 0\ \forall j$, or alternatively, fixing the sign of $\beta_j\ \forall j$. Going one step further, we can apply the transformation to only half the hinge nonlinearity and obtain $\log(1+e^z) = \frac{1}{2} z + \frac{1}{2}\log(1+\cosh(z)) + \frac{1}{2}\log(2)$, where the constant term is absorbed in the normalization and the linear term is absorbed in the variable factors $F_i$. The new topic factors, $F_j = 1 + \cosh(z)$, are now symmetric around $z = 0$, implying that large inner products $w_j^T x$ of either sign (i.e. aligned or anti-aligned) result in a positive contribution of that factor to the probability of input $x$. We will call this type of basis vectors $\{w_j\}$ "prototypes", in contrast to "constraints", which contribute positive probability for an input $x$ when it is approximately orthogonal to them: $w_j^T x \approx 0$ (Welling et al., 2002).
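Since the invariance (8)-(9) is easy to get wrong by a sign, a small numerical check is worthwhile: the transformation changes the unnormalized log-probability of eq. (6) only by an x-independent constant, so log-probability differences between inputs are preserved. A sketch under illustrative assumptions (a shared scalar M_j = M for brevity):

```python
import numpy as np
from scipy.special import gammaln

def unnorm_logp(x, log_lam, W, beta, M):
    """Unnormalized log p(x) from eq. (6); gammaln(x+1) = log(x!)."""
    words = np.sum(log_lam * x - gammaln(x + 1))
    topics = np.sum(M * np.log1p(np.exp(W.T @ x - beta)))
    return words + topics

rng = np.random.default_rng(0)
V, K, M = 50, 5, 5
W = 0.1 * rng.standard_normal((V, K))
log_lam, beta = np.zeros(V), rng.standard_normal(K)
S = np.array([0, 2])                     # an arbitrary subset of topics

# Transformed parameters per eqs. (8)-(9).
log_lam2 = log_lam + (M * W[:, S]).sum(axis=1)
W2, beta2 = W.copy(), beta.copy()
W2[:, S], beta2[S] = -W2[:, S], -beta2[S]

x1, x2 = rng.poisson(1.0, V), rng.poisson(1.0, V)
d1 = unnorm_logp(x1, log_lam, W, beta, M) - unnorm_logp(x2, log_lam, W, beta, M)
d2 = unnorm_logp(x1, log_lam2, W2, beta2, M) - unnorm_logp(x2, log_lam2, W2, beta2, M)
assert np.allclose(d1, d2)               # the model is unchanged
```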
2.3. RAP: Learning

Parameter learning for the RAP model is performed by stochastic gradient ascent on the log-likelihood of the data. For large redundant datasets it is more efficient to estimate the required gradients on small batches rather than on the entire dataset. This is true in particular in the initial phase of learning, where there is consensus among the data on how to change the parameters. Towards convergence it is useful to either increase the batch-size or decrease the step-size in order to reduce the variance of the stochastic optimization. We have also included a momentum term to help speed up convergence. The derivatives of the log-likelihood of the RAP model are easy to write down (but hard to calculate in practice due to the intractable normalization constant),

$$\delta \log \lambda_i \propto \langle x_i \rangle_{\tilde{p}} - \langle x_i \rangle_{p_T}$$
$$\delta \beta_j \propto -M_j \big[\, \langle \sigma(w_j^T x - \beta_j) \rangle_{\tilde{p}} - \langle \sigma(w_j^T x - \beta_j) \rangle_{p_T} \,\big] \qquad (10)$$
$$\delta W_{ij} \propto M_j \big[\, \langle x_i\, \sigma(w_j^T x - \beta_j) \rangle_{\tilde{p}} - \langle x_i\, \sigma(w_j^T x - \beta_j) \rangle_{p_T} \,\big]$$

where $\tilde{p}$ denotes the empirical distribution² and $p_T$ the model distribution at the current values of the parameters. Note that our estimate of the gradients of $\beta$ and $W$ involves Rao-Blackwellisation over the latent variables: we replace a sample average $\frac{1}{N}\sum_{n=1}^N f(h_n)$ with $\frac{1}{N}\sum_{n=1}^N \langle f(h) \rangle_{p(h|x_n)}$. This is guaranteed to reduce the variance of our estimates (Casella & Robert, 1996).

It is in particular the negative terms in these equations that are hard to estimate. One approach is to run the Gibbs sampler defined by equations (1) and (2); note, however, that at every iteration of learning we would have to run this sampler to equilibrium. Instead, we follow the contrastive divergence (CD) paradigm, where for every data-case in the batch we initialize a separate Gibbs sampler at that data-case and run it for only a few steps. With $p_1$ (i.e. T = 1 in equation (10)) we denote the Gibbs chain³ that samples $h_n^0 \sim p(h \mid \text{data-case } n)$, then $x_n^1 \sim p(x \mid h_n^0)$. CD-learning simply boils down to obtaining samples from $p_1$ through this one-step Gibbs chain and computing noisy estimates of the gradient through equation (10). The averages in the first terms are computed as sample estimates over the data-cases in the batch, while the averages in the second terms are computed as sample averages over the samples from $p_1$. Although truncating the Markov chain will introduce a bias in the estimates of the gradients, the final bias of the parameter estimates has been shown empirically to be small for a number of applications (Carreira-Perpinan & Hinton, 2005). Moreover, the variance of the gradient estimates, and hence the variance of the final parameter estimates, is greatly reduced (albeit at the expense of introducing a bias).

Below is a summary of the CD-learning algorithm as described in the preceding text.

Algorithm 1: Contrastive Divergence Learning for RAP
Repeat until convergence:
1. For each data-case $x_n$ do:
   1a. Sample the hidden units given the data-case clamped to the visible units from $h_n^0 \sim \prod_j p(h_j \mid x_n)$ using Eqn. (2).
   1b. Resample the data-case given the sampled values of the hidden units from $x_n^1 \sim \prod_i p(x_i \mid h_n^0)$ given in Eqn. (1).
2. Compute the data averages and sample averages in Eqn. (10) with T = 1.
3. Perform gradient updates according to Eqn. (10) with T = 1.

We have also implemented a mean field learning algorithm where Gibbs sampling updates are replaced by mean field updates (Welling & Hinton, 2001), but we found the results to be significantly inferior to the sampling based algorithm.

² The average over the empirical distribution is simply given by the sample average over the data-cases.
³ Note that sampling from the equilibrium distribution would be denoted as $p_\infty$, i.e. T = ∞.
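A compact NumPy sketch of one pass of Algorithm 1 with the gradients of eq. (10), using the Rao-Blackwellised averages $E[h_j|x] = M_j\,\sigma(w_j^T x - \beta_j)$ in place of sampled hidden units when computing the averages. Learning rate, initialization and shapes are illustrative assumptions, not the released Matlab code.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(X, log_lam, W, beta, M, lr=1e-3):
    """X: minibatch of count vectors (n x V). Returns updated parameters."""
    # Step 1a: sample hidden units given the clamped data (positive phase).
    H0 = rng.binomial(M, sigmoid(X @ W - beta))
    # Step 1b: resample the visible units given the hidden samples.
    X1 = rng.poisson(np.exp(log_lam + H0 @ W.T))
    # Step 2: data averages vs one-step sample averages (T = 1 in eq. (10)),
    # using E[h|x] rather than sampled h (Rao-Blackwellisation).
    q0, q1 = sigmoid(X @ W - beta), sigmoid(X1 @ W - beta)
    g_lam = X.mean(0) - X1.mean(0)
    g_beta = -M * (q0.mean(0) - q1.mean(0))
    g_W = M * (X.T @ q0 - X1.T @ q1) / X.shape[0]
    # Step 3: gradient ascent on the log-likelihood.
    return log_lam + lr * g_lam, W + lr * g_W, beta + lr * g_beta

V, K, M = 100, 10, 5
X = rng.poisson(1.0, size=(20, V)).astype(float)
log_lam = np.log(X.mean(0) + 1e-3)       # initialize rates near the data means
W, beta = 0.01 * rng.standard_normal((V, K)), np.zeros(K)
log_lam, W, beta = cd1_step(X, log_lam, W, beta, M)
```

In the experiments below this step is repeated over mini-batches, with a momentum term added to the updates.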

3. Experiments: Document Retrieval

In this and the next section we describe how the latent structure of the RAP model can be used for two different tasks, namely document retrieval and object recognition⁴. We compare its performance against two other latent variable models: PLSI and LSI. The performance of LDA never significantly surpassed PLSI in our experiments (in fact we often found inferior results), which is the reason we left it out.

⁴ Matlab code for training the RAP model and the preprocessed text data can be obtained from pgehler.

In document retrieval the goal is to match a given query, represented by a word count vector, with a subset of a text corpus, where the retrieved subset should resemble the query as closely as possible. A latent variable model can be turned into a document retrieval algorithm through the following four steps: 1) estimate the parameters of the model on a training corpus, 2) map all training and query documents into the dimensionally reduced latent space, 3) compute similarities between queries and training documents based on the latent representation, 4) retrieve the k most similar training documents from the corpus for every query. We have used the cosine similarity measure in our experiments. One can also compute similarity in (tf-idf reweighted) word space directly, which we use in our experiments as a baseline.

3.1. Text Corpora

In these experiments we used three well known datasets: Reuters-21578, Ohsumed, and 20-Newsgroups⁵. We used the BOW package and its front-end RAINBOW to preprocess the data (McCallum, 1996). All documents were stemmed with the Porter stemmer, and a list of stop-words as well as all words with fewer than three characters were removed. Additionally, for the Reuters and Ohsumed datasets all words were removed which occur only once in the training data or in only a single training document. For the 20-Newsgroups dataset the words with the highest average mutual information with the class variable were extracted. This preprocessing left the Newsgroups dataset with 20 classes, and the Reuters dataset with 91 classes (we also used another split of the data with 115 classes but found very similar results). The Ohsumed dataset consists of 23 classes, where each data-point may belong to more than one class. Each corpus was split into a training set and a test set whose items are used as query documents during the performance evaluation. For the Reuters dataset the predefined ModApte split of the data into training and 4024 test documents was used. Ohsumed is split into 33% test and 67% training data, while in the newsgroup corpus we held out 10% for testing purposes.

⁵ These corpora are available online; the original sources and specifics concerning these sets can be found there and are omitted here for brevity.

Figure 4. RPC plot on a log-scale of the 20-Newsgroups dataset for various models (RAP, 100 dimensions; PLSI, 250 dimensions; LSI, 175 dimensions; TF-IDF). As a baseline the retrieval results with tf-idf reweighted word-counts are shown. The number of topics for each model was chosen by optimizing 1-NN classification performance on the test set, corresponding to the average precision for retrieving a single document (left-most marker).

3.2. Results

Learning of the RAP model was done with a small learning rate and a momentum term for 200k iterations, using mini-batches of 100 training samples per iteration. The latent representation of any document x is then computed by the matrix multiplication $W^T x$. LSI is computed by performing an SVD decomposition on the tf-idf⁶ reweighted word counts. PLSI models are trained using the tempered version of the EM algorithm (Hofmann, 1999): 10% of the training data was held out for validation purposes, the temperature parameter β was initialized at 1, and whenever the log-likelihood on the validation data decreased, β was decreased by 0.025 until no more improvement was observed. The latent representation is defined by the posterior distribution over the topics z: P(z|d).
For a query document q, P(z|q) was computed using 25 iterations of the folding-in heuristic (Hofmann, 1999). For comparison we also show the baseline results, which are obtained by computing the similarity of tf-idf reweighted documents in word space. As performance measure we use the recall precision curve (RPC), where

$$\mathrm{Recall} = \frac{\#(\text{correctly retrieved documents})}{\#(\text{relevant documents in the corpus})} \qquad (11)$$

$$\mathrm{Precision} = \frac{\#(\text{correctly retrieved documents})}{\#(\text{retrieved documents})} \qquad (12)$$

For a given test document, all training documents were ranked in terms of their cosine similarity. Then recall and precision values were computed for 1, 2, 4, 8, ... retrieved documents.

⁶ $\text{tf-idf}(d, w) = \frac{n(d,w)}{\sqrt{\sum_{w'} n(d,w')^2}} \log \frac{\#\text{docs in the corpus}}{\#\text{docs with word } w}$, where $n(d, w)$ is the number of occurrences of word $w$ in document $d$.
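For concreteness, one point of the RPC can be computed as follows: rank the training corpus by cosine similarity to the query's latent coordinates, then apply eqs. (11) and (12) at a cutoff k. The label handling is an illustrative assumption (a retrieved document counts as correct when it shares the query's class):

```python
import numpy as np

def precision_recall_at_k(h_query, H_train, labels_train, query_label, k):
    """One RPC point for a single query, eqs. (11)-(12)."""
    sims = H_train @ h_query / (
        np.linalg.norm(H_train, axis=1) * np.linalg.norm(h_query) + 1e-12)
    top_k = np.argsort(-sims)[:k]                     # k most similar documents
    n_correct = np.sum(labels_train[top_k] == query_label)
    n_relevant = np.sum(labels_train == query_label)  # relevant docs in corpus
    return n_correct / k, n_correct / n_relevant      # precision, recall
```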

Figure 5. Same as figure 4 for the Ohsumed dataset (RAP, 125 dimensions; PLSI, 225 dimensions; LSI, 125 dimensions; TF-IDF).

Figure 6. Same as figure 4 for the Reuters dataset (RAP, 200 dimensions; PLSI, 125 dimensions; LSI, 100 dimensions; TF-IDF).

Figure 7. Area under the RPC as a function of the latent dimensionality on the Reuters dataset (RAP, PLSI, LSI).

The RPC curves of all models are plotted in figures 4, 5 and 6, where the recall and precision values are averaged over the entire test-set. In figure 7 we show the area under the RPC (AUC) as a function of the number of topics. The leftmost point on an RPC, i.e. the average precision for retrieving a single document, corresponds to the 1-NN classification performance using the cosine distance. The latent dimensionalities of the models shown in the plots were selected to be the best according to this measure, where we scanned the number of topics from 25 to 250 in increments of 25. The RAP model yields the best retrieval performance on all datasets in terms of AUC, and scores only slightly worse than LSI on Ohsumed in terms of 1-NN classification performance. According to figure 7, the RAP model also seems to suffer less from overfitting as the number of topics increases.

4. Experiments: Object Recognition

Latent models have recently been applied to both object (Fergus et al., 2005) and scene (Li & Perona, 2005) recognition. In this section we compare the performance of the RAP model in the visual object recognition domain. We followed these steps in our visual experiments: (1) interest point detection and extraction, (2) vocabulary generation, (3) latent analysis, (4) kernel classification on latent representations. We briefly describe these steps below.

Images were initially normalized to be the same size. Interesting regions of images (interest points) were detected using three different feature detectors: multi-scale Harris, multi-scale Hessian, and entropy-based (Kadir & Brady, 2001). Grey-scale patches were extracted from images based on both the scale and location indicated by the different interest point detectors. All patches from all detectors were intensity normalized, resized to a fixed size and subsequently converted into vectors. We performed K-means clustering on the patches in order to discretize feature space and create a visual vocabulary of words; the number of clusters was left as a free parameter of the system and was varied across experiments. Each image contains a set of interest point detections. An interest point was assigned to the visual cluster (word) closest in a Euclidean sense to that feature, as sketched below. The cumulative counts over all clusters were used as feature vectors to represent each image, such that each image was represented by a vector of dimensionality equal to the size of the visual vocabulary. As in the document experiments described above, we do not utilize any spatial information between the extracted patches.
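The vector-quantization step just described amounts to a nearest-centroid assignment followed by a histogram. A minimal sketch, assuming patch descriptors have already been extracted and the K-means centroids already trained (names illustrative):

```python
import numpy as np

def bag_of_visual_words(patches, centroids):
    """patches: (n_patches x d) descriptors; centroids: (n_clusters x d)."""
    # Squared Euclidean distance from every patch to every centroid.
    d2 = ((patches[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)                        # nearest visual word per patch
    counts = np.bincount(words, minlength=len(centroids))
    return counts                                    # one count vector per image
```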

We compared the three different latent algorithms described above: LSI, PLSI and RAP. The latent representations for each image were used to train SVM classifiers using the LIBSVM⁷ package with a linear kernel. Ten-fold cross-validation was used to find the optimal values of the SVM hyper-parameters. We used a one-vs-one training paradigm for the multi-class datasets. Feature dimensions were normalized to zero mean and unit standard deviation.

⁷ Available at: cjlin/libsvm/.

We conducted experiments on both the Caltech4 and the challenging Caltech101 datasets (figure 8 illustrates representative examples of some categories); both datasets are publicly available. The Caltech4 contains a total of 4 object categories and is regarded as relatively easy to classify due to stereotyped poses and drastic visual dissimilarity between classes. The Caltech101 contains a total of 101 object categories and is more challenging due to the sheer number of object categories. 15 training images and a maximum of 50 testing images were used for all experiments. For the Caltech101, the class Faces-Easy was removed. Performance results reported correspond to the average classification performance across all categories.

Figure 8. Example images used for the object recognition experiments. (Top row) Example images from the Caltech4. Classes: Airplanes, Motorcycles, Faces, Leopards. (Bottom two rows) Example images of four random classes from the Caltech101, two images per class, to give an indication of the within-class variance. Classes: Buddha, Chair, Watch, Brain. Note that the Caltech101 includes the Caltech4 classes.

Figure 9. Caltech4 performance comparison. All experiments averaged 35 times. Baseline (chance) performance is 25%. Plotted is the test performance as a function of the number of latent dimensions, with 125 clusters and using a linear kernel. Performance differences between RAP and PLSI/LSI were significant for all numbers of latent dimensions (p < 0.05).

Figures 9, 10 and 11 show comparisons between RAP and LSI/PLSI. Error bars are not shown because the variation from one split of the data to another was larger than the variation between models. Instead we used the two-sided paired sign test to determine whether the median difference in performance is significantly different from zero at a level of α = 0.05. We conclude that RAP almost always significantly outperforms LSI and PLSI.

5. Discussion

The experiments provide clear evidence for the claim that harmonium models, and in particular RAP, can be efficiently and successfully trained on relatively large datasets. Relative to popular existing methods such as LSI and PLSI, the latent representations generated by RAP are superior in two application domains: document retrieval and object classification. Moreover, mapping test-data into latent space is orders of magnitude faster for RAP (through a simple matrix multiplication) than for PLSI (through the iterative folding-in heuristic).

A natural next step is to train hierarchical models, where dependencies between topics are modelled with a new layer of meta-topics. Initial experiments in this direction have not shown improved retrieval or classification performance. However, recent work by Hinton et al. (2006) indicates that deep hierarchies can be a promising direction for improvement.

The choice of a conditional Poisson distribution may not be optimal, due to the effect that words which have been used already become more likely to be used than others, i.e. their frequency grows with document length. This calls for distributions with longer tails, such as the negative binomial distribution (Airoldi et al., 2005). The Poisson distribution in RAP can easily be interchanged with a negative binomial incorporating this effect. A Bayesian approach for harmonium models seems an important topic for future investigation.
Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant No.

Figure 10. Same experiment as in figure 9, but plotting performance as a function of the size of the vocabulary using 35 latent dimensions. Performance differences between RAP and PLSI/LSI were significant for all vocabulary sizes (p < 0.05).

Figure 11. Caltech101 performance comparison using 250 clusters. All experiments averaged 7 times. Baseline (chance) performance is 1% for this task. Same plot as in figure 9. Performance differences between RAP and PLSI were significant for 75 and 125 latent dimensions (p < 0.05).

References

Airoldi, E., Cohen, W., & Fienberg, S. (2005). Bayesian methods for frequent terms in text. Proc. of the CSNA & INTERFACE Annual Meetings.

Blei, D. M., & Jordan, M. I. (2004). Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1.

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3.

Buntine, W. (2002). Variational extensions to EM and multinomial PCA. Lecture Notes in Computer Science. Helsinki, Finland: Springer.

Buntine, W., & Jakulin, A. (2004). Applying discrete PCA in data analysis. Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence. Banff, Canada.

Carreira-Perpinan, M., & Hinton, G. (2005). On contrastive divergence learning. Tenth International Workshop on Artificial Intelligence and Statistics. Barbados.

Casella, G., & Robert, C. (1996). Rao-Blackwellisation of sampling schemes. Biometrika, 83(1).

Deerwester, S., Dumais, S., Landauer, T., Furnas, G., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41.

Fergus, R., Fei-Fei, L., Perona, P., & Zisserman, A. (2005). Learning object categories from Google's image search. Proceedings of the International Conference on Computer Vision.

Girolami, M., & Kaban, A. (2003). On an equivalence between PLSI and LDA. Proceedings of SIGIR.

Griffiths, T., & Steyvers, M. (2002). A probabilistic approach to semantic representation. Proceedings of the 24th Annual Conference of the Cognitive Science Society.

Hinton, G. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14.

Hinton, G., Osindero, S., & Teh, Y. (2006). A fast learning algorithm for deep belief networks. Neural Computation, to appear.

Hofmann, T. (1999). Probabilistic latent semantic analysis. Proc. of Uncertainty in Artificial Intelligence, UAI'99. Stockholm.

Kadir, T., & Brady, M. (2001). Saliency, scale and image description. Int. J. Comput. Vision, 45.

Lee, D., & Seung, H. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401.

Li, F., & Perona, P. (2005). A Bayesian hierarchical model for learning natural scene categories. Proceedings of the Conference on Computer Vision and Pattern Recognition.

McCallum, A. (1996). Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. mccallum/bow.

Minka, T., & Lafferty, J. (2002). Expectation-propagation for the generative aspect model. Proc. of the 18th Annual Conference on Uncertainty in Artificial Intelligence.

Olshausen, B., & Field, D. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37.

Roweis, S. (1997). EM algorithms for PCA and SPCA. Neural Information Processing Systems.

Welling, M., & Hinton, G. (2001). A new learning algorithm for mean field Boltzmann machines. Proc. of the Int'l Conf. on Artificial Neural Networks. Madrid, Spain.
Welling, M., Hinton, G., & Osindero, S. (2002). Learning sparse topographic representations with products of Student-t distributions. Neural Information Processing Systems.

Welling, M., Rosen-Zvi, M., & Hinton, G. (2004). Exponential family harmoniums with an application to information retrieval. Neural Information Processing Systems.

Xing, E., Yan, R., & Hauptmann, A. (2005). Mining associated text and images with dual-wing harmoniums. Proc. of the Conf. on Uncertainty in Artificial Intelligence.


More information

A Robust Method for Estimating the Fundamental Matrix

A Robust Method for Estimating the Fundamental Matrix Proc. VIIth Dgtal Image Computng: Technques and Applcatons, Sun C., Talbot H., Ourseln S. and Adraansen T. (Eds.), 0- Dec. 003, Sydney A Robust Method for Estmatng the Fundamental Matrx C.L. Feng and Y.S.

More information

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS ARPN Journal of Engneerng and Appled Scences 006-017 Asan Research Publshng Network (ARPN). All rghts reserved. NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS Igor Grgoryev, Svetlana

More information

The Codesign Challenge

The Codesign Challenge ECE 4530 Codesgn Challenge Fall 2007 Hardware/Software Codesgn The Codesgn Challenge Objectves In the codesgn challenge, your task s to accelerate a gven software reference mplementaton as fast as possble.

More information

Learning a Class-Specific Dictionary for Facial Expression Recognition

Learning a Class-Specific Dictionary for Facial Expression Recognition BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 16, No 4 Sofa 016 Prnt ISSN: 1311-970; Onlne ISSN: 1314-4081 DOI: 10.1515/cat-016-0067 Learnng a Class-Specfc Dctonary for

More information

An Iterative Solution Approach to Process Plant Layout using Mixed Integer Optimisation

An Iterative Solution Approach to Process Plant Layout using Mixed Integer Optimisation 17 th European Symposum on Computer Aded Process Engneerng ESCAPE17 V. Plesu and P.S. Agach (Edtors) 2007 Elsever B.V. All rghts reserved. 1 An Iteratve Soluton Approach to Process Plant Layout usng Mxed

More information

Backpropagation: In Search of Performance Parameters

Backpropagation: In Search of Performance Parameters Bacpropagaton: In Search of Performance Parameters ANIL KUMAR ENUMULAPALLY, LINGGUO BU, and KHOSROW KAIKHAH, Ph.D. Computer Scence Department Texas State Unversty-San Marcos San Marcos, TX-78666 USA ae049@txstate.edu,

More information

Empirical Distributions of Parameter Estimates. in Binary Logistic Regression Using Bootstrap

Empirical Distributions of Parameter Estimates. in Binary Logistic Regression Using Bootstrap Int. Journal of Math. Analyss, Vol. 8, 4, no. 5, 7-7 HIKARI Ltd, www.m-hkar.com http://dx.do.org/.988/jma.4.494 Emprcal Dstrbutons of Parameter Estmates n Bnary Logstc Regresson Usng Bootstrap Anwar Ftranto*

More information

Mathematics 256 a course in differential equations for engineering students

Mathematics 256 a course in differential equations for engineering students Mathematcs 56 a course n dfferental equatons for engneerng students Chapter 5. More effcent methods of numercal soluton Euler s method s qute neffcent. Because the error s essentally proportonal to the

More information

Non-Negative Matrix Factorization and Support Vector Data Description Based One Class Classification

Non-Negative Matrix Factorization and Support Vector Data Description Based One Class Classification IJCSI Internatonal Journal of Computer Scence Issues, Vol. 9, Issue 5, No, September 01 ISSN (Onlne): 1694-0814 www.ijcsi.org 36 Non-Negatve Matrx Factorzaton and Support Vector Data Descrpton Based One

More information

Description of NTU Approach to NTCIR3 Multilingual Information Retrieval

Description of NTU Approach to NTCIR3 Multilingual Information Retrieval Proceedngs of the Thrd NTCIR Workshop Descrpton of NTU Approach to NTCIR3 Multlngual Informaton Retreval Wen-Cheng Ln and Hsn-Hs Chen Department of Computer Scence and Informaton Engneerng Natonal Tawan

More information

6.854 Advanced Algorithms Petar Maymounkov Problem Set 11 (November 23, 2005) With: Benjamin Rossman, Oren Weimann, and Pouya Kheradpour

6.854 Advanced Algorithms Petar Maymounkov Problem Set 11 (November 23, 2005) With: Benjamin Rossman, Oren Weimann, and Pouya Kheradpour 6.854 Advanced Algorthms Petar Maymounkov Problem Set 11 (November 23, 2005) Wth: Benjamn Rossman, Oren Wemann, and Pouya Kheradpour Problem 1. We reduce vertex cover to MAX-SAT wth weghts, such that the

More information

Meta-heuristics for Multidimensional Knapsack Problems

Meta-heuristics for Multidimensional Knapsack Problems 2012 4th Internatonal Conference on Computer Research and Development IPCSIT vol.39 (2012) (2012) IACSIT Press, Sngapore Meta-heurstcs for Multdmensonal Knapsack Problems Zhbao Man + Computer Scence Department,

More information

Modeling Waveform Shapes with Random Effects Segmental Hidden Markov Models

Modeling Waveform Shapes with Random Effects Segmental Hidden Markov Models Modelng Waveform Shapes wth Random Effects Segmental Hdden Markov Models Seyoung Km, Padhrac Smyth Department of Computer Scence Unversty of Calforna, Irvne CA 9697-345 {sykm,smyth}@cs.uc.edu Abstract

More information

Signature and Lexicon Pruning Techniques

Signature and Lexicon Pruning Techniques Sgnature and Lexcon Prunng Technques Srnvas Palla, Hansheng Le, Venu Govndaraju Centre for Unfed Bometrcs and Sensors Unversty at Buffalo {spalla2, hle, govnd}@cedar.buffalo.edu Abstract Handwrtten word

More information

Fitting & Matching. Lecture 4 Prof. Bregler. Slides from: S. Lazebnik, S. Seitz, M. Pollefeys, A. Effros.

Fitting & Matching. Lecture 4 Prof. Bregler. Slides from: S. Lazebnik, S. Seitz, M. Pollefeys, A. Effros. Fttng & Matchng Lecture 4 Prof. Bregler Sldes from: S. Lazebnk, S. Setz, M. Pollefeys, A. Effros. How do we buld panorama? We need to match (algn) mages Matchng wth Features Detect feature ponts n both

More information

Classification / Regression Support Vector Machines

Classification / Regression Support Vector Machines Classfcaton / Regresson Support Vector Machnes Jeff Howbert Introducton to Machne Learnng Wnter 04 Topcs SVM classfers for lnearly separable classes SVM classfers for non-lnearly separable classes SVM

More information

Data Mining: Model Evaluation

Data Mining: Model Evaluation Data Mnng: Model Evaluaton Aprl 16, 2013 1 Issues: Evaluatng Classfcaton Methods Accurac classfer accurac: predctng class label predctor accurac: guessng value of predcted attrbutes Speed tme to construct

More information

Expert Systems with Applications

Expert Systems with Applications Expert Systems wth Applcatons 37 2010) 4403 4412 Contents lsts avalable at ScenceDrect Expert Systems wth Applcatons ournal homepage: www.elsever.com/locate/eswa A novel dual wng harmonum model aded by

More information

A Background Subtraction for a Vision-based User Interface *

A Background Subtraction for a Vision-based User Interface * A Background Subtracton for a Vson-based User Interface * Dongpyo Hong and Woontack Woo KJIST U-VR Lab. {dhon wwoo}@kjst.ac.kr Abstract In ths paper, we propose a robust and effcent background subtracton

More information