Universität Augsburg. Institut für Informatik. PLSA on Large Scale Image Databases. Rainer Lienhart and Malcolm Slaney.


Universität Augsburg. PLSA on Large Scale Image Databases. Rainer Lienhart and Malcolm Slaney. Report, December 2006. Institut für Informatik, D Augsburg

Copyright © Rainer Lienhart and Malcolm Slaney, Institut für Informatik, Universität Augsburg, D Augsburg, Germany. All rights reserved.

PLSA ON LARGE SCALE IMAGE DATABASES

Rainer Lienhart, Multimedia Computing Lab, University of Augsburg, Augsburg, Germany
Malcolm Slaney, Yahoo! Research, Santa Clara, CA, USA

ABSTRACT

The web and image repositories such as Flickr are the largest image databases in the world. There are billions of images on the web, and hundreds of millions of high-quality images in image repositories. Currently, these images are indexed based on manually entered tags and on individual and group usage patterns. In this work we explore a third information dimension: image features. We explore probabilistic latent semantic analysis (pLSA) in order to infer which visual patterns describe each object. We build models that connect words and image features, and use content features and tags to find similar images. We demonstrate that image features using gray-scale salient points and an aspect model based on pLSA outperform a conventional word-frequency model as well as a refined color-histogram approach on an image-similarity task.

Index Terms: large scale image retrieval, probabilistic semantic analysis, color coherence vectors.

1. INTRODUCTION

The use of Probabilistic Latent Semantic Analysis (pLSA) [4], a statistical technique to derive hidden concepts from data, has recently become very popular in the image domain. So far, pLSA has only been applied to relatively small, carefully selected image databases ranging from a few hundred to a few thousand images [2][8]. In this paper we study pLSA on a large-scale, real-world image database for improving image retrieval based on image similarity as perceived by humans. Our work centers around finding visual words that are typical for the various kinds of aspects an image can show.

One of the largest image repositories on the web is Flickr. For this work we have downloaded 253,460 images that were tagged with at least one of the 23 tags listed in Table 1 (see footnote 1). These words were grouped into 12 categories for our image-retrieval task. The resulting image database was neither cleaned nor pre-processed in any way to increase consistency. Since these tags are provided by the creators of the pictures with unknown intentions, the techniques we investigate must be able to tolerate, from a pure visual similarity standpoint, a significant fraction of incorrect labels. A good example is the set of images tagged "Christmas" in Flickr. Only a very small fraction of these images depict a religious event (as one might expect). Instead the tag mostly denotes the time and date of creation. Thus thousands of vacation and party photos pop up with no real common theme. The ambiguity of tags makes image retrieval more difficult.

Footnote 1: These images were selected from all public Flickr images uploaded prior to 8 Sep and labeled with one of the following tags: sanfrancisco, beach, tokyo and geotagged.

On this real-world database we explore two questions:

(1) Does it matter how visual words are created? We compare three different techniques: (a) random selection, (b) clustering random subsets, and (c) clustering tag-based subsets.

(2) Does pLSA outperform a simple word-occurrence statistic? How does pLSA on grayscale SIFT [5] features compare to well-known global color-retrieval techniques such as color-coherence vectors [9]?

Table 1: The image database and its 12 categories
Columns: category #, OR list of tags, # of images
Tags: wildlife, animal, animals, cat, cats, dog, dogs, bird, birds, flower, flowers, graffiti, sign, signs, surf, surfing, night, food, building, buildings, goldengate, goldengatebridge, baseball
Total # of images (note: images may have multiple tags): 253,460

We evaluate these different retrieval configurations purely based on image similarity as perceived by a number of users without any special context knowledge.

2. DERIVING VISUAL WORDS

pLSA was originally derived in the context of document retrieval, where words are the elementary parts of a document. For images, our visual documents, we need comparable elementary parts, which we call visual words. In this work we use the popular SIFT features [5] to find salient visual parts in each image. SIFT features are calculated in a two-step process: First, a sparse set of salient areas in an image is determined and described by position, scale, and orientation. Then, for each salient point, we derive a 128-dimensional edge-based feature vector to describe the unique grayscale content of that salient area in a scale- and orientation-invariant manner. Since SIFT feature vectors can take on almost every value in R^128, we wish to find a small set of representative feature vectors to become our visual words. Thus the problem of deriving visual words is as follows:

Given
- a set of images I = {d_i} with N_I = |I| = # of images,
- a set of features F = {f_l} with N_F = |F| = # of features (here 128-dimensional SIFT features) derived from the N_I images,
- a set C = {c_r} of image categories (see Table 1) with N_C = |C| categories in total (here N_C = 12),
derive a vocabulary V = {v_j} of N_V = |V| visual words.

Finding the structure in such a large set of data (millions of images, thousands of salient points per image) is computationally expensive. We investigate three ways to determine the N_V visual words, and we evaluate their utility later in this paper:

(v1) Random: Select all N_V sample features randomly from the set F of all features.

(v2) K-means clustering (with subselection): Randomly select N_S sample features from the set F of all features. Apply K-means clustering to each set of N_S samples to derive (N_V / N_C) visual words. Perform this subselection N_C times. In total this results in N_C * (N_V / N_C) = N_V visual words.

(v3) Tag subselection: For each of the N_C categories derive (N_V / N_C) visual words by means of K-means clustering, randomly sampling N_S sample features from images in that category only. In total this results in N_C * (N_V / N_C) = N_V visual words.

Method (v2) is the approach commonly used in image retrieval [2][8]. Since K-means clustering is computationally expensive (quadratic in the number of samples and the number of clusters), it is more efficient to break up (N_C * N_S) samples into N_C subsets of N_S samples and find N_V / N_C clusters from each subset instead of determining all N_V clusters on the entire set of (N_C * N_S) samples directly. For our 12 categories (N_C = 12) the speedup is N_C * N_C = 144 times.

The rationale behind method (v3) is to explore whether the tags in the database provide useful information for deriving visual words. Within each category the images should be less diverse and thus make it easier for K-means clustering to find the dominant visual words. The better the visual words, the better pLSA should work and thus improve retrieval. Concepts that have no representative visual words cannot be learned.

The random method (v1) is added to answer the question of whether K-means clustering is necessary at all. The answer to this question has a few important implications: Firstly, clustering is often the slowest part of the learning algorithm. If it can be skipped without harm, it would greatly reduce the computational complexity. Secondly, if clustering is not necessary, the visual vocabulary can easily be extended at any time by additional random samples.

In each experiment we derived 12 * 200 = 2400 visual words that are used to describe each image in our database. In Section 4 we compare these three methods based on their similarity retrieval results in user studies.

3. MEASURING IMAGE SIMILARITY

3.1 Term-Document Matrix

Using the visual vocabulary V, each feature f_l of F can be quantized by its most similar feature vector in V. Thus we represent each image d_i as an image document consisting of L instances of the visual words {w_1, ..., w_L}, w_p ∈ V.
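The subselection scheme (v2) and the quantization step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes plain Lloyd-style K-means with a fixed iteration count, and the function names (`kmeans`, `build_vocabulary`, `word_histogram`) and all parameter names are hypothetical. The dense distance computation is only practical for small toy inputs.

```python
import numpy as np

def kmeans(samples, k, iters=20, rng=None):
    """Plain Lloyd iterations (an assumption); returns k cluster centers."""
    rng = np.random.default_rng() if rng is None else rng
    start = rng.choice(len(samples), size=k, replace=False)
    centers = samples[start].astype(float)
    for _ in range(iters):
        # Squared Euclidean distances, shape (n_samples, k).
        d2 = ((samples[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = samples[labels == j]
            if len(members):  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return centers

def build_vocabulary(features, n_words, n_subsets, n_samples, seed=0):
    """Method (v2): cluster n_subsets random subsets of n_samples features,
    deriving n_words / n_subsets visual words from each subset."""
    rng = np.random.default_rng(seed)
    per_subset = n_words // n_subsets
    parts = []
    for _ in range(n_subsets):
        idx = rng.choice(len(features), size=n_samples, replace=False)
        parts.append(kmeans(features[idx], per_subset, rng=rng))
    return np.vstack(parts)

def word_histogram(image_features, vocabulary):
    """Quantize each feature to its nearest visual word and count the
    occurrences; the result is one row of the term-document matrix."""
    diff = image_features[:, None, :] - vocabulary[None, :, :]
    words = (diff ** 2).sum(axis=2).argmin(axis=1)
    return np.bincount(words, minlength=len(vocabulary))
```

Method (v3) would differ only in drawing each subset from the features of one category's images instead of from all features.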
Given the collection of N_I image documents I = {d_i} with N_F visual words W = {w_l} from the vocabulary V, and given that we ignore the sequential ordering of the word occurrences in the images (the so-called bag-of-words model), the image data can be summarized by an N_I × N_V matrix of visual word occurrence counts N = (n(d_i, v_j)), where n(d_i, v_j) specifies the number of times the word v_j occurred in document d_i. The resulting table is called the term-document matrix (see Figure 1). Note that by normalizing each document vector to 1 using the L1-norm, the document vector of d_i becomes the estimated probability mass distribution P(v_j | d_i).

The similarity between two documents can be calculated using the cosine metric between the two document vectors a = d_i and b = d_p. The cosine metric between two vectors a and b is defined as

$$\mathrm{CosMetric}(a, b) = \frac{\langle a, b \rangle}{\|a\| \cdot \|b\|}$$

It is commonly used in text retrieval [1].

Figure 1: Term-document matrix

3.2 pLSA

Each L1-normalized row in the term-document matrix describes the distribution of the visual words in each document, i.e., P(v_j | d_i). The idea of pLSA is to introduce a mediator, known as aspects or concepts, between the documents and the words. Thus, every word occurring in a document is generated by an unobservable aspect variable z_k, leading to the following generative model for the document vector [4]:

(1) Pick a document d_i with prior probability P(d_i)
(2) Select a latent concept z_k with probability P(z_k | d_i)
(3) Generate a word v_j with probability P(v_j | z_k)

An important property of this model is that word occurrences are conditionally independent of the document given the unobservable aspects. Thus

$$P(v_j, d_i) = P(d_i) \sum_{k=1}^{K} P(v_j \mid z_k)\, P(z_k \mid d_i)$$

In addition, every document is modeled as consisting of one or more aspects. This is very natural since images consist of multiple objects and thus multiple aspects in different image areas. pLSA can model this fact very efficiently. For instance, an image with a lion and a jeep, each object described by a set of SIFT features, might be described by two hidden aspects "lion" and "jeep". Depending on the aspect, the probability of each visual word v_j is different.

We learn the unobservable probability distributions P(z_k | d_i) and P(v_j | z_k) from the data using the Expectation-Maximization algorithm (EM algorithm) [3][4]:

E-Step:

$$P(z_k \mid d_i, v_j) = \frac{P(v_j \mid z_k)\, P(z_k \mid d_i)}{\sum_{l=1}^{K} P(v_j \mid z_l)\, P(z_l \mid d_i)}$$

M-Step:

$$P(v_j \mid z_k) = \frac{\sum_{i=1}^{N_I} n(d_i, v_j)\, P(z_k \mid d_i, v_j)}{\sum_{m=1}^{N_V} \sum_{i=1}^{N_I} n(d_i, v_m)\, P(z_k \mid d_i, v_m)}$$

$$P(z_k \mid d_i) = \frac{\sum_{j=1}^{N_V} n(d_i, v_j)\, P(z_k \mid d_i, v_j)}{n(d_i)}$$

where n(d_i) is the total number of visual word occurrences in document d_i.

Given a new test image d_test, we estimate the aspect probabilities P(z_k | d_test), similar to above, from the observed words. The only difference is that the learned conditional word distributions P(v_j | z_k) are never updated. The similarity between two documents is calculated using the cosine metric between the two aspect vectors a = (P(z_k | d_i)) and b = (P(z_k | d_p)).
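The EM updates, the fold-in step for a new test image, and the cosine metric can be sketched in NumPy as follows. This is a hedged illustration rather than the authors' code: the function names, the random initialization, the fixed iteration count, and the small `eps` guard against division by zero are all assumptions, and the dense (N_I, N_V, K) posterior array is only feasible for small toy matrices.

```python
import numpy as np

def plsa_em(counts, n_aspects, iters=50, p_v_z=None, seed=0):
    """counts: (N_I, N_V) term-document matrix n(d_i, v_j).
    Returns P(z_k|d_i) with shape (N_I, K) and P(v_j|z_k) with shape (K, N_V).
    If p_v_z is passed in, it is held fixed (fold-in for new images);
    n_aspects must then equal p_v_z.shape[0]."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    fixed = p_v_z is not None
    if not fixed:
        p_v_z = rng.random((n_aspects, n_words))
        p_v_z /= p_v_z.sum(axis=1, keepdims=True)
    p_z_d = rng.random((n_docs, n_aspects))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    eps = 1e-12  # numerical guard, not part of the model
    for _ in range(iters):
        # E-step: P(z_k | d_i, v_j), shape (N_I, N_V, K).
        joint = p_z_d[:, None, :] * p_v_z.T[None, :, :]
        post = joint / (joint.sum(axis=2, keepdims=True) + eps)
        # n(d_i, v_j) * P(z_k | d_i, v_j)
        weighted = counts[:, :, None] * post
        # M-step: sum over documents for P(v_j|z_k), over words for P(z_k|d_i).
        if not fixed:
            p_v_z = weighted.sum(axis=0).T
            p_v_z /= p_v_z.sum(axis=1, keepdims=True) + eps
        p_z_d = weighted.sum(axis=1)
        p_z_d /= counts.sum(axis=1, keepdims=True) + eps
    return p_z_d, p_v_z

def cosine_metric(a, b):
    """<a, b> / (||a|| * ||b||)"""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Training would call `plsa_em(counts, n_aspects)` on the database, and a query image would be folded in with `plsa_em(query_counts, n_aspects, p_v_z=learned_p_v_z)`, after which rows of the two returned aspect matrices can be compared with `cosine_metric`.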
We model the collection of visual words with 48 aspects in total, analogous to a Gaussian mixture model with 48 mixture components.

3.3 Color Coherence Vectors

As a baseline for comparison we use one of the best traditional global color features: Color Coherence Vectors (CCVs) [9]. A CCV is computed by first quantizing each pixel's color using the 2 most significant bits per color channel, resulting in only 64 possible color values. Then, for each pixel, we measure the connected area (with an 8-neighborhood) of the same quantized color. If the area is above a threshold (usually 1% of the pixel count in the image), the pixel is added to the coherent histogram, otherwise to the incoherent color histogram. Combining both histograms results in a 128-dimensional vector. Dissimilarity between two CCV vectors a and b is computed based on the L1-norm.

4. EXPERIMENTAL RESULTS

Performance Metric: For evaluation we randomly selected 5 query images from each of our 12 categories, i.e., 60 query images in total. Then, for each query image, each retrieval technique was used to return the top 20 most similar images. In each of the three experiments below, three rival techniques were compared based on the judgments of a number of users: For each query, the retrieval results (top 20 images, tiled 5 by 4 on a sheet of paper) for the three techniques under comparison were shown to the user. The user had to order the results for each query image from best to worst retrieval result. The technique with the best retrieval result received 2 points, the second best 1 point, and the worst 0 points. We computed the average score for each technique over the 60 sample queries to assign a single performance number. The technique with the highest score performs best.
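The color coherence vector baseline of Section 3.3 can be sketched as below. This is a simplified illustration under stated assumptions: the input is an (H, W, 3) uint8 RGB array, the histograms hold raw pixel counts without normalization, no pre-smoothing is applied, and the names `color_coherence_vector` and `ccv_distance` are hypothetical. The pure-Python flood fill is slow and meant only to show the logic.

```python
import numpy as np
from collections import deque

def color_coherence_vector(image, threshold_frac=0.01):
    """Returns a 128-dim vector: 64 coherent pixel counts followed by
    64 incoherent pixel counts, one pair per quantized color."""
    h, w, _ = image.shape
    # Keep the 2 most significant bits per channel: 4 * 4 * 4 = 64 colors.
    q = (image >> 6).astype(np.int64)
    color = q[..., 0] * 16 + q[..., 1] * 4 + q[..., 2]
    threshold = threshold_frac * h * w
    visited = np.zeros((h, w), dtype=bool)
    coherent = np.zeros(64, dtype=np.int64)
    incoherent = np.zeros(64, dtype=np.int64)
    for y in range(h):
        for x in range(w):
            if visited[y, x]:
                continue
            c = color[y, x]
            # Flood-fill the 8-connected region of equal quantized color.
            size = 0
            queue = deque([(y, x)])
            visited[y, x] = True
            while queue:
                cy, cx = queue.popleft()
                size += 1
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and not visited[ny, nx]
                                and color[ny, nx] == c):
                            visited[ny, nx] = True
                            queue.append((ny, nx))
            # Every pixel of the region goes to one of the two histograms.
            if size > threshold:
                coherent[c] += size
            else:
                incoherent[c] += size
    return np.concatenate([coherent, incoherent])

def ccv_distance(a, b):
    """L1 dissimilarity between two CCV vectors."""
    return int(np.abs(a - b).sum())
```

Images of different sizes would need their vectors normalized by pixel count before comparison; the source does not state how this was handled.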

Figure 2: Results from the three experiments: (a) cosine similarity on word-histogram features, (b) cosine similarity on pLSA, and (c) comparing the best cosine methods with the CCV baseline algorithm.

Exp. 1: In this experiment we compared the three visual word extraction techniques (v1), (v2), and (v3) against each other by using the document vectors from the term-document matrix with the cosine metric for similarity retrieval. Figure 2a shows the average scores for 8 different subjects.

Exp. 2: In this experiment we compared the three visual word extraction techniques (v1), (v2), and (v3) against each other by using the aspect vectors of the image documents with the cosine metric for similarity retrieval. Figure 2b shows the average scores for 8 different subjects.

In both of these experiments, deriving visual words using plain clustering produced the best results. Selecting visual words completely at random is computationally cheap and should work well asymptotically, but evidently not at this level. We are surprised that deriving specific visual words based on category subsets did not produce an overall benefit, but an informal analysis suggests that these category-specific words helped for a category like dogs.

Exp. 3: In our final experiment we compared the random-subset visual word selection approaches that won in Exp. 1 (cosine metric of word histograms) and Exp. 2 (cosine metric of pLSA histograms) to a baseline using CCV features. This test is difficult for subjects because in such a large database the matches in a color space are at first glance identical to the query. It is only when the picture is studied that one realizes that the objects shown are so different. This is especially true when we look at color similarity with our full 2.5M image database. The results of this test are shown in Figure 2c. Seven subjects judged that images found by using a cosine metric in pLSA space are more similar to the query image than those found by a direct comparison in word space or by the baseline CCV approach.
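The rank-based score used in all three experiments (2 points for the best of three techniques, 1 for the second, 0 for the worst, averaged over all judgments) amounts to the following small computation. The function name and the input layout are hypothetical, chosen only for illustration.

```python
def average_scores(rankings, techniques):
    """rankings: one list per judged query, with the technique names
    ordered from best to worst. Returns the mean points per technique."""
    points = {t: 0 for t in techniques}
    for ranking in rankings:
        for place, t in enumerate(ranking):
            # With three techniques: best gets 2, second 1, worst 0.
            points[t] += len(ranking) - 1 - place
    return {t: points[t] / len(rankings) for t in techniques}
```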
Much like it does in text-based retrieval, calculating similarity in the subspace formed by the aspect model gives better results.

5. CONCLUSION

In this paper we have shown that the aspect model, using an approach like pLSA, is as important for image retrieval as it is for text retrieval [1]. The aspect model learns the probability of each visual word given an unobserved aspect. We have extended Bosch's work [2] by showing that pLSA improves performance on a similarity task. The dimensionality reduction due to an aspect model is important as we go to larger databases. In future work we want to verify our results with a larger number of subjects, and we want to test the similarity on the full 2.5M image database.

References

[1] Ricardo Baeza-Yates, Berthier Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley.
[2] A. Bosch, A. Zisserman and X. Munoz. Scene Classification via pLSA. Proceedings of the European Conference on Computer Vision (2006).
[3] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, B.39.
[4] Thomas Hofmann. Unsupervised Learning by Probabilistic Latent Semantic Analysis. Machine Learning, Vol. 42, Issue 1-2.
[5] D. Lowe. Distinctive image features from scale invariant keypoints. In IJCV 60(2):91-110.
[6] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir and L. Van Gool. A comparison of affine region detectors. In IJCV 65(1/2):43-72.
[7] K. Mikolajczyk, C. Schmid. A performance evaluation of local descriptors. In PAMI 27(10).
[8] P. Quelhas, F. Monay, J.-M. Odobez, D. Gatica-Perez, T. Tuytelaars, L. Van Gool. Modeling scenes with local descriptors and latent aspects. ICCV 2005, Vol. 1.
[9] G. Pass, R. Zabih, and J. Miller. Comparing Images Using Color Coherence Vectors. In Proc. of the 4th ACM Int. Conf. on Multimedia, Boston, MA, pages 65-73, 1996.
