A SCENE RECOGNITION ALGORITHM BASED ON MULTI-INSTANCE LEARNING

Size: px

Start display at page:

Download "A SCENE RECOGNITION ALGORITHM BASED ON MULTI-INSTANCE LEARNING"

Berenice Montgomery
5 years ago
Views:

A SCENE RECOGNITION ALGORITHM BASED ON MULTI-INSTANCE 2 1 Tao Wang, 2 Wenqng Chen and 3 Balng Wang 1 Department of Computer Scence and Technology,Shaoxng Unversty College of of Computer Informaton

1 A SCENE RECOGNITION ALGORITHM BASED ON MULTI-INSTANCE 2 1 Tao Wang, 2 Wenqng Chen and 3 Balng Wang 1 Department of Computer Scence and Technology,Shaoxng Unversty College of of Computer Informaton Scence Engneerng, and Technology, Shaoxng Vocatonal & Techncal College, 3 School of Computer Scence and Technology,Harbn Insttute of Technology S: Emals: TULANG4582@126.COM taoht@usx.edu.cn Submtted: May 17, 2014 Accepted: Oct. 12, 2014 Publshed: Dec. 1, 2014 Abstract- In Bag of Words mage presentaton model, vsual words are generated by unsupervsed clusterng, whch leaves out the spatal relatons between words and results n such shortng comngs as lmted semantc descrpton and weak dscrmnaton. To solve ths problem, we propose to substtute vsual words by vsual phrases n ths artcle. Vsual phrases bult accordng to spatal relatons between words are semantc dstranable, and they can mprove the accuracy of Bag of Words model. Consderng the tradtonal classfcaton method based on Bag of Words model s vulnerable to the background, block and scalar varance of an mage, we propose n ths artcle a multple vsual words learnng method for mage classfcaton, whch s based on the concept of vsual phrases combned wth Multple Instance Learnng. The fnal classfcaton model s able to show the spatal features of mage classes. Experments performed on standard mage testng sets, Caltech 101 and Scene 15, show the satsfyng performance of ths algorthm. Index terms: Image Classfcaton, Multple Kernel Learnng, Bag of Vsual Words, Spatal Pyramd Matchng 1470

2 Tao Wang, Wenqng Chen and Balng Wang,A SCENE RECOGNITION ALGORITHM BASED ON MULTI-INSTANCE INTRODUCTION The eyes are the wndows of the mnd, and humans acheve more nformaton from vson than through other approaches, thus vson s the most mportant sensaton of humans. It was shown by statstcs that 80% of all nformaton we obtan came from vson. Vsual nformaton s a knd of multmeda nformaton, and t s popular due to such advantages as vvdness n expresson and abundance n content. We can acheve a lot of mportant knowledge from vsual nformaton and our lfe s thus greatly mproved. Meanwhle, wth the development of IT and multmeda technology, large nformaton source lbrares such as the Internet come nto beng. In partcular, snce the late 1990s, the thrve of dgtal cameras and the development of the Internet have made more and more multmeda nformaton frequently appear n our daly lfe. Theren, nformaton manly conveyed by mages, audos and vdeos has become the manstream of nformaton servce and exchange and plays mportant roles n research, daly lfe, communcaton and so on. Images have descrptve ablty and abundant content beyond the capacty of texts, so people prefer to transmt and acheve nformaton va mages. An mage contans a large amount of nformaton, and the storage of mages n the Internet ncreases astonshngly. Related researches show that currently the amount of data computers all over the world generate and store doubles every other month. There was a research predcted from databases whch were fve years before that the amount of nformaton n the Internet would ncrease exponentally. In addton, certan materals show that ths pattern wll be followed for a long tme. Therefore, besdes convenence, the mass data has brought many abstruse problems as well: so much vsual nformaton has exceeded the receptve capacty of us and the large amount of data have made t harder and harder to process and acheve useful nformaton. In the nformaton explosve ocean, we fnd t dffcult to fnd requred nformaton effcently. Users may spend large amount of tme and energy to fnd related nformaton, whch only turns out unsatsfyng. Then, how to pck out the truly useful nformaton n mass data? Vsual Intellgence,.e. Computer Vson, s an mportant component of Artfcal Intellgence. It s a dscplne workng on how to make a machne see and can help us wth nformaton searchng n mass data. In the process of searchng, the frst step s to obtan the mage nformaton, then the content of the mage s analyzed and comprehended and related knowledge s obtan va the descrpton 1471

3 of the mage. Image analyss and comprehenson s the prmary task durng mage nformaton analyss n computers and an mportant part of Vsual Intellgence. Meanwhle, the classfcaton of vsual objects s crtcal n mage analyss and comprehenson. Thus, n the face of ncreasng mass of mages, the automatc classfcaton and extracton of vsual objects have become more and more urgent. Vsual object classfcaton s to automatcally perform object classfcaton on an mage or to judge whether an mage belongs to a certan class as well as locate and extract target objects from sequental mages. It s a hot and dffcult problem n Computer Vson and Pattern Recognton and crtcal to mage comprehenson and retreval [1]. For human bengs, vson s a learnng process of thnkng and percepton. Its basc approach s: vsual organs receve nformaton from outsde as sensors, and then they transmt the receved stmulus to the brans, whch generates a vvd semantc descrpton for vsual objects after processng the nformaton. Any vsual object has ts features and the descrpton of the features s an mportant prerequste of vsual object classfcaton [2]. Smlar to vsual nformaton process va human eyes and brans, vsual object classfcaton performs learnng and sortng on semantc concepts by the features of bottom-layer vsual objects and then bulds complcated object classfcaton model. The basc dea s: frst descrbe the vsual objects and buld a model, then adapt the model based on the class of vsual objects va Machne Learnng, and fnally use the adapted model to classfy unknown objects. The descrpton of vsual objects s to fnd the objects and descrbe them va ther features. Ths process manly faces two challenges. The frst one resdes n the features of vsual objects. Up to now there have been a lot of researches on mage features for classfcaton at home and abroad, n whch the contents of mages are presented by objectve vsual features such as colors, textures, shapes, spatal relatons and so on. However, computers are not capable of hgh-level semantc descrpton lke human beng, and the low-level features they can descrbe are laggng far behnd the hgh-level semantc features humans can understand,.e. there s a huge semantc gap between low-level vsual features n mages and semantc descrpton. The exstence of semantc gap dsables computers to descrbe vsual objects effectvely and brngs challenges to vsual object classfcaton. The second challenge s the dffculty n mage object extracton due to vsual angle varance, brghtness varance, scale varance, object transformaton, partal blockng, complcated background and varance wthn the same class of objects. In addton, the contnung ncrease of mage data makes tradtonal manual extracton extremely nfeasble. All 1472

4 Tao Wang, Wenqng Chen and Balng Wang,A SCENE RECOGNITION ALGORITHM BASED ON MULTI-INSTANCE the above nfluences also add to the semantc gap. Theren, brghtness varance s caused by the change of lght, scale varance by the rotaton of objects and change of shootng angle, and varance wthn the same class by dfferent postures and appearances of the same object. The tradtonal method of vsual object classfcaton s to manually label the objects n mages and extract features as well as performs classfcaton accordng to the labels. Its short comng s that t costs an enormous amount of tme and effort to work on large or even mass amount of mage data and t s also prone to generate errors that we cannot accept. Moreover, n ths nformaton exploson era, new mages emerge ceaselessly, whch makes the nstant update of manual labels mpossble. So t s nfeasble to label all mages once for all. In addton, manual labelng causes dvergence n labels. An mage contans abundant nformaton, and dfferent person have dfferent ponts of vew to the mage. For the same semantc nformaton, dfferent person gve out dfferent descrptve words. Recently researchers have proposed a seres of unsupervsed algorthms for vsual object recognton. All mages are not labeled, so these algorthms brng the hghest ambguty. Unsupervsed algorthms are effcent dealng wth mages sets wth sold-color scenes and small number of classes. But they fal to yeld deal results when the color of scenes changes a lot and there are many classes. Recently, Bag of Words mage presentaton model has drawn more and more attentons n the area of mage classfcaton. It s orgnally a smplfed postulated model to process natural language and retreve nformaton. In ths model, a text can be presented as an unordered set of words, regardless of grammar and word sequences. It was prmarly appled to text classfcaton (Lews, 1998). Now, more and more Bag of Words models are wdely employed n vsual object classfcaton. The basc dea s to extract local features of an mage set n the frst place, then quantfy these features, and fnally present the mages as a set of several vsual words. Essentally the model s dvded nto two parts: (1) codng process, whch substtutes descrptors wth expressons more sutable for classfcaton; (2) mergng process, whch summarzes features after codng n wder ranges. In the codng process, K-Means, GMM, Sparse Codng are generally used. In the mergng process, SPM and Spatal-LTM are frequently employed. K- Means s a dstance based clusterng algorthm,.e. the closer two objects are, the more smlar they are supposed to be, and fnally all data can be dvded nto k clusters. Sparse Codng s popular these years. It uses as few as possble data ponts for codng, provded that sparse effectveness s mantaned. However, some problems exst n these popular codng methods, 1473

5 manly: (1) they leave out the spatal relatons among local feature, whch makes the generated vsual words weak n dscrmnaton and spatal descrpton; (2) they can only judge the exstence of an object, but not ts locaton, whch causes weak labelng of mage. To address the above problems,.e. the tme-consumng effort n manual labelng and the unsutablty of unsupervsed methods n real mage classfcaton, Multple Instance Learnng (MIL) s appled to classfcaton of weakly labeled mages (.e. only the exstence of objects s known). In ths knd of learnng, every mage n a set s regarded as a bag. Extract the features n blocks and regons or local as samples of ths bag so that each bag corresponds to a sample set. Each bag has a tranng label, whch s postve when the mage contans the target object and negatve otherwse. Samples do not have tranng labels. In partcular, MIL algorthms are dscrmnatve classfers rather than generatve classfers, whch avod complcate deducton and have hgher classfcaton accuracy. Thus MIL has drawn wde attentons n the area of computer vson. For example, Maron et al [3] employed Dverse Densty (DD) algorthm to perform classfcaton on natural scenes. DD-SVM model uses DD algorthm to challenge the orgnal model and uses SVM n the orgnal model to classfy bags. In most MIL algorthms, t s assumed that there s at least one postve sample n a postve bag and all samples n negatve bags are negatve. Although ths assumpton s effectve n drug actvty predcton, t faces great lmt n other applcatons, especally n mages. On the other hand, Chen et al dd not make such assumpton n MILES algorthm; rather, they mapped bags to feature spaces and used 1-norm SVM as the bag classfer. Ths process transforms an MIL problem nto a standard supervsed learnng problem, whch s more robust aganst labelng uncertanty. Consderng the great achevement Bag of Words model has made n mage classfcaton, the research n ths artcle s apt to nclude ts dea n MIL. However, ths model leaves out the spatal relatons among local features and makes the generated vsual words weak n dscrmnaton and spatal descrpton. Many artcles have probed nto the spatal relatons of vsual words [4-6]. The current tendency of the development of ths model s manly represented by ntermedate features between mage features and bottom ones. For example, ntermedate presentaton generated va mage dssecton, Spatal Pyramd and Vsual Phrase or Vsual Synset proposed va the geometrc smlarty and spatal relatons of local features. Some recent works[7-9] bult hgh order mage features based on vsual words and showed good results n experments [7] selected mportant features and screened low order features for hgh order features usng dscrete AdaBoost 1474

6 Tao Wang, Wenqng Chen and Balng Wang,A SCENE RECOGNITION ALGORITHM BASED ON MULTI-INSTANCE algorthm. Compared to tradtonal method of enumeraton, ths method avods a large amount of calculaton. However, ths method s complcated that every tme new features are to be selected, AdaBoost algorthm s used to classfy tranng data. [9] proposed a method that searched for meanngful vsual phrase vocabulary. Frst of all, each vsual word are combned wth ts K- nearest neghbors to form vsual word group. Then the redundancy n vsual phrases s dealt wth va mode summary and top-to-bottom update method. Vsual Synset [8] conssts of vsual phrases. It s compared wth tradtonal Bag of Words model and proved to be better than the latter. However, smlar to the method proposed by [9], K-nearest Neghbor s also used to construct vsual phrases n ths method. MIL can address problems n weak labelng effectvely. It s a dscrmnatve tranng model and brngs hgh-precson classfcaton. But some problems exst n the large numbers of MIL algorthms, so they cannot be appled to vsual object classfcaton drectory. To overcome the shortcomngs n Bag of Words model that unsupervsed clusterng durng generaton of vsual words results n lack of spatal nformaton, lmted sematc descrpton and weak dscrmnaton, we substtute vsual words wth vsual phrases that s more semantcally dscrmnatve and contans spatal relatons. Ths mproves the accuracy of Bag of Words model. In addton, consderng the tradtonal classfcaton method based on Bag of Words model s vulnerable to the background, block and scalar varance of an mage, we propose n ths artcle a multple vsual words learnng method for mage classfcaton, whch s based on the concept of vsual phrases combned wth Multple Instance Learnng. The fnal classfcaton model s able to show the spatal features of mage classes. Experments are performed on standard mage testng sets, Caltech 101 and Scene 15 [10]. RELATED WORKS In ths chapter, we wll ntroduce the framework of vsual object classfcaton based on SPM algorthm [11]. Then we wll ntroduce SIFT descrptor [12], whch s a frequent used local feature n ths artcle. In the end, an mportant algorthm n vsual object classfcaton, Multple Instance Learnng (MIL), s presented. A. Framework of vsual object classfcaton based on SPM algorthm To address the problem mentoned n the frst chapter that Bag of Words model neglects spatal relatons, Lazebnk et al proposed Spatal Pyramd Matchng (SPM) method n 2006, whch s applcable to natural scene classfcaton. In ths method, an mage s ntersected nto 1475

7 several sub-blocks and the local feature hstogram of each sub-block s made. Frst of all, suppose X and Y are two sets of feature vectors n a d-dmensonal space. Construct grds n the level of 0,,L that there are 2l grds when the level s l. Assgn H l X and H l Y respectvely as the hstograms of X and Y n Level l, then H l X() and H l Y() are the numbers of X and Y n the th grd. The matchng functon of Level l s defned as: =å (1) D L( H, H ) mn( H ( ), H ( )) l l l l X Y X Y = 1 To combne all sub-blocks, the pyramd matchng kernel s defned as: 1 k ( X, Y ) I ( I I ) L-1 L L L L+ 1 = + å - L-l l= 0 2 (2) L L = I + L å I L- l+ 1 2 l= 1 2 Pyramd wth three levels s shown as fgure 1: Fgure 1 Spatal pyramd matchng wth three Levels SPM algorthm has an everlastng nfluence on object recognton. It s shown that ths algorthm n a way ncludes the spatal relatons of local features. As depcted n Fgure 1, the weght of every level s fxed when the number of levels s 3. Although fxed weghts smplfes the learnng process, t s not proper for the whole, because the weght of regon contanng target object s surely ncreased. B. A bref ntroducton of SIFT SIFT descrptor (Lowe, 2004) s a local features, based on scale space and nvarant to scalng, rotaton and even affne transformaton. It s forwarded by Lowe n 2004 on the bass of present feature detecton methods based on nvarant technques. Its full name s Scale Invarant Feature Transform. Due to ts robustness, SIFT s an mportant feature throughout ths artcle. In the last secton, the local feature used n SPM s a SIFT descrptor. Ths secton wll brefly ntroduce ts strong ponts and generaton. SIFT descrptor has the followng advantages: 1476

8 Tao Wang, Wenqng Chen and Balng Wang,A SCENE RECOGNITION ALGORITHM BASED ON MULTI-INSTANCE (1) SIFT s a local feature, whch s nvarant to rotaton, scalng and change of lght and stable n a certan extent of changes n vsual angle, affne transformaton and nose. (2) It contans abundant nformaton and has great specfcty, so t s applcable to fast and accurate matchng among mass feature data. (3) It has a large quantty that even a few objects can generate a number of SIFT features. (4) It s hgh-speed, whch makes extracton convenent and fast and satsfes the requrement of real-tme. (5) It s extensble and easy to combne wth other feature vectors. Before the extracton of SIFT, an mage s presented n multple scales. Gauss Kernel Convoluton s the only transformaton kernel to realze scalar changes of mages, and t s the only lnear kernel: ( )/2 (,, ) x + G x y s = y 2ps e (3) Therefore, the presentaton of a two-dmensonal mage n dfferent scales can be obtaned from the convoluton of the mage and Gauss Kernel: L( x, y, s ) = G( x, y, s )* I( x, y) (4) Theren, σ s called Scale Space Factor. The smaller σ s, the less smooth the mage s and the smaller the scale s. Large scale corresponds to general features of the mage, and small scale to detaled features. Frstly, SIFT algorthm detects features n the scale space and confrms the locaton and scale of key ponts. Then, t sets the drecton of gradent as the drecton of the pont. Thus the scale and drecton nvarance of the operator are realzed. Key ponts are the extreme values n both the two-dmensonal space of the mage and DoG scale space. DoG operator s calculated as: D( x, y, s ) = L( x, y, ks )- L( x, y, s ) (5) Fnd out all local extremes n DoG space and group them as canddate key ponts. Then delete the low-contrast key ponts and unstable skrt response ponts (for DoG algorthm wll generate strong skrt responses) to strengthen the stablty and resstance to nose of matchng. After obtanng the canddate key ponts, accurately determne the locatons of key ponts usng surroundng data ponts. Theren, the locatons and scales of key ponts are determned va fttng three dmensonal quadratc functons. 1477

9 Calculate the drecton of a key pont va the dstrbuton of the drecton of gradent of ts neghborng pxels and set the drecton parameters for each key pont to ensure the rotaton nvarance of the operator. The algorthm samples n the wndow centered at the key pont and calculate the drecton of gradent n the neghborng area va hstogram. The range of gradent hstogram s 0~360 degree, n whch each column spans 10 degrees and there are 36 columns. The peak value of the hstogram s the prncpal drecton of the neghborng gradent,.e. the drecton of ths key pont. In ths hstogram, f a peak value that s 80% of the prncpal peak exts, the correspondng drecton s the auxlary drecton of ths key pont. A key pont may be assgned to several drectons (one prncpal and more than one auxlary), whch can ncrease the robustness of SIFT feature. Up to now, the detecton of key ponts s completed. Each key pont has three parameters: locaton, scale and drecton. Thus an SIFT feature regon can be determned. A SIFT feature vector contans the nformaton of drecton n neghborhood and therefore enhances the antnose ablty as well as provdes a good fault tolerance for matchng wth devated locatons. Get an 8 8 wndow centered on the key pont. Frst of all, rotate the axs to the drecton of key pont. Then assgn a weght to each drecton usng Gauss Smooth Flter and calculate the drecton hstogram of every sub-wndow. Generally, each key pont s descrbed by 4 4 seed ponts. Thus, 128 data ponts are generated for one key pont. Now SIFT vector s free from the nfluence of geometrc transformatons such as scale changes and rotaton. Normalze the length of the feature vector, and the nfluence of lght s elmnated. C. Multple Instance Learnng Due to ts merts n processng weakly label data and performng hgh-precson classfcaton, Multple Instance Learnng (MIL) s extensvely used n Computer Vson. Snce the 1990s, MIL has been regarded as the most promsng machne learnng method. It was proposed by Detterch et al n the 1990s to judge whether a medcal molecule s a Musk molecule. MIL s the fourth learnng framework of machne learnng, the other three are Supervsed Learnng, Unsupervsed Learnng and Renforcement Learnng. Supervse Learnng s the most tradtonal one. It performs learnng on labeled tranng samples and predcts the non-tranng data as accurately as possble. In such learnng framework, all tranng samples are labeled to ensure the lowest ambguty. However, ths method costs a lot of tme and effort. Unsupervsed Learnng can perform learnng drectly on unlabeled tranng samples and dscover the hdden data structure durng the learnng process. None of the samples s labeled, so ths model brngs 1478

10 Tao Wang, Wenqng Chen and Balng Wang,A SCENE RECOGNITION ALGORITHM BASED ON MULTI-INSTANCE the hghest ambguty. Compared to Supervsed Learnng n whch all samples are labeled, MIL does not have the concept label. Compared to Unsupervsed Learnng where none of the samples s unlabeled, MIL has labeled bag. Unlke Renforcement Learnng, MIL does not have the concept of tme delay. More mportantly, n all learnng frameworks, a sample s an nstance,.e. there s a one-to-one relaton between samples and nstances. The ambguty n Supervsed Learnng, Unsupervsed Learnng and Renforcement Learnng s dfferent from that n MIL, whch makes the former three unsutable to deal wth such problems. MIL has unque propertes and promsng applcaton prospect and t s a green area n Machne Learnng, so t s popular at home and abroad and regarded as a new learnng framework. It has drawn the attenton of many researchers, and a lot of new theores and strkng applcaton results emerge n a short tme. After the proposal of MIL by Detterch et al, a lot of effort has been put to ths learnng framework n the area of Machne Learnng, so researches on MIL are very actve. In 1998, O. Maron et al put forward Dverse Densty (DD) algorthm. Generally, every nstance contaned n a canddate molecule s presented as a feature vector n an n-dmensonal space. The pont correspondng to the feature vector wll form a trajectory n the n-dmensonal space wth the change of the molecule s shape. In a postve bag, there s at least one postve nstance,.e. along the trajectory of a postve bag there s at least one pont whose correspondng molecular shape fts the shape of the target molecule. Whle n a negatve bag, there s no such postve nstance and thus there s no such pont along the trajectory. In ths artcle, we use an MIL algorthm developed from DD algorthm and MILES algorthm. In 1998, O. Maron and T. Lozano-Perez proposed DD based Mult-nstance algorthm. The algorthm ams to fnd the pont wth the largest dverse densty,.e. target feature t, whch s defned as the followng: for a pont n the feature space, the more postve bags appears around t and the farther negatve nstances are from t, the larger ts dverse densty s. Set B + as the th postve bag, B j + as the j th nstance n B + and B jk + the value of the k th feature of B j +. Smlarly, we can defne B -, B j - and B jk - n negatve bags. Let t present the pont wth the largest dverse densty, then t can be determned va maxmzaton of Pr(t B 1 +,,B l+ +,B 1 -,,B l- - ). Suppose the bags are ndependent on each other, then the formula can be expressed accordng to Bayes: Õ Õ l + + l x x t - - = 1 x= t = = 1 arg max Pr( B ) Pr( B ) (6) Nose-or model can be employed to specalze the product n the above formula: 1479

11 + Bj - t + Pr( x= t B ) µ max exp( - ) (7) 2 s - Bj - t - Pr( x= t B ) µ 1- max exp( - ) (8) 2 s 2 2 In ths formula, σ s the parameter to adjust the scale. Snce there are multple local extremes n the dverse densty space, the above formula can be solved by Expectaton Maxmzaton (EM). Each postve nstance s selected as the startng pont for searchng, thus when t s located va Gradent Descent algorthm, selecton can be performed on each property of nstance features. However, due to the problem n calculaton amount, DD algorthm several rounds of searchng. Y. X Chen et al descrbed DD algorthm from the aspect of feature selecton and further assumed every nstance n the tranng bag as canddate, thus they put forward MILES (Multple Instance Learnng va Embedded Instance Selecton) algorthm. In MILES, all nstances are selected as canddate feature,.e. C={x k }, k=1,2,,n, n s the number of nstances n the bag. Defne the feature vector of Bag B as: m B s x B s x B s x B (9) 1 2 n T ( ) = [ (, ), (, ),..., (, )] The smlarty between Instance x k and Bag B s measured by: k xj - x k s( x, B ) = max j exp( - ) (10) 2 s 2 Gven l + postve bags and l - negatve bags, all the smlarty measurements of between bags and nstances form the followng matrx: é s( x, B1 )... s( x, B - ) ù l ê ú s( x, 1 )... s( x, ) l [ m1,..., m, m1,..., m ] ê B B ú (11) + - = l l ê ú ê n + n - ú ês( x, 1 )... s( x, - ë B B ) l úû Each row of the matrx s the smlartes between an nstance and every bag. In ths nstanceto-bag mappng space, the MIL problem s converted to a Supervsed Learnng problem. Perform learnng on the above matrx va SVM or 1-Norm SVM as follows: 1480

12 Tao Wang, Wenqng Chen and Balng Wang,A SCENE RECOGNITION ALGORITHM BASED ON MULTI-INSTANCE mn x hl w + c x + c h w, b,, k 1 2 j k= 1 = 1 j= 1 T T s. t.( w m + b) + x ³ 1, T T -( w m + b) + h ³ 1, j + - n l l x, h ³ 0, = 1,..., l, j=,..., l j å å å j + - (12) c 1 and c 2 are penalty factors used to address the problem of unequal numbers of postve and negatve bags. For a new bag B, ts judgment can be determned by: å f ( B') = w s( x k, B ') + b (13) k= 1,2,..., n k MIL was orgnally used to predct drug actvty. Wth the on-gong of MIL researches, ts applcaton area s expandng ceaselessly. Maron et al were the frst to apply DD algorthm to mage classfcaton n In ths applcaton, mages contanng fgures were labeled postve, otherwse labeled negatve. Several mage blocks were sampled as nstances of a bag. Yang et al appled MIL to content based mage retreval n They ponted out that compared to natural scene mages, MIL were more sutable for object based mage retreval. In 2002, [13] compared DD wth EM-DD on ther performance n mage retreval. Andrews et al (2002) and Huang et al (2002) [14] also employed MIL n mage retreval. In 2002, Andrews et al employed SVM (Support Vector Machne) based MIL to text classfcaton. They presented each chunk of text as a bag consstng of overlappng segments, and the nstances were segments contanng 50 letters. Ther work was tested on TREC9 text classfcaton dataset. Zhou et al appled MIL to Web Index Page Recommendaton n They treated the unon of all lnks n an ndex page as a bag and lnks as nstances of the bag, thus the problem was converted to an MIL problem. Based on ths, they proposed Fretct-kNN algorthm to solve the page recommendaton problem. OBJECT RECOGNITION BASED ON MULTI VISUAL PHASES Image classfcaton s the prerequste of vsual object classfcaton. It ams at automatcally classfy a set of mages accordng to ther sematc contents and t s a hot pont and dffcult n Computer Vson. In the frst chapter, we mentoned that the complcated dstrbuton of mages and the nfluences of lght, vew angle as well as screenng put obstacles to model constructon. Therefore, mage classfcaton s always a challengng problem n Computer Vson. A large number of researches on the features used n vsual object classfcaton have been carred out at home and abroad. Objectve features such as colors, textures, shapes and spatal relatons of 1481

13 objects are used to descrbe the content of mages. However, a huge semantc gap exsts between the bottom-layer features and the semantc expresson. Therefore, Machne Learnng s needed to perform sortng and learnng on semantc concepts accordng to bottom-layer features and construct complcated classfcaton model. Consderng Bag of Words model and related methods such as plsa [15] (probablstc Latent Semantc Analyss), LDA (Latent Drchlet Allocaton) [16] n recent years, ts basc dea s as follows: frstly extract bottom-layer features and quantze them to present mages n Bag of Words model, then perform object recognton and mage classfcaton usng some technques n text analyss and retreval. Due to ts great success n text analyss and retreval, Bag of Words model has attracted profound attenton of researchers. Inspred by Bag of Words model, researchers have started to employ smlar models for mage descrpton and acheved some postve results. Cao et al (2007) proposed Spatally Coherent Latent Topc model (Spatal-LTM) [17] to make sure the coherence of latent topcs n the same block va spatal nterdependency. Lazebnket et al (2006) put forward a scene classfcaton algorthm based on approxmate global geometrc relatons, whch can dvde an mage nto a pyramd and summarze the local features of sub-blocks. Ths sort of methods usually deal wth local features of mages, such as SIFT (Scale Invarant Feature Transformaton) (Lowe, 2004) and HOG (Hstogram of Gradent) [18], and form vsual words va spatal dvson of features. For example, K-means can be employed to cluster SIFT features and the clusters centers are selected as vsual words. For a gven mage, ts local feature can be roughly represented by the nearest vsual word. Thus, smlar to the applcaton on text analyss, the dstrbuton of local features on dfferent vsual words can construct the Bag of Words model of mages. It should be notced that although the Bag of Words model of mages s smlar to that of texts, there are stll obvous dfferences. The man problem s that vsual words are generally quantfed drectly from bottom-layer features va unsupervsed learnng and the unsupervsed quantfcaton method always leads to lmted dscrmnatve ablty of vsual words. In addton, ths model leaves out the spatal relatons of local features, so the descrptve ablty s restrcted. There Bag of Words model suffers from weak semantc dscrmnaton and spatal descrpton and the generated vsual words face the phenomena of polysemy and synonyms. Due to polysemy and synonyms, features descrbng dfferent objects cannot be dstngushed and are matched. For another, mage classes are often than not spatal related but Bag of Words model neglects the 1482

14 Tao Wang, Wenqng Chen and Balng Wang,A SCENE RECOGNITION ALGORITHM BASED ON MULTI-INSTANCE spatal relatons of vsual words. Related studes have shown that local regons n an mage are nterdependent and constructng models usng spatal relatons of regons can mprove the accuracy of classfcaton [8]. For example, n Caltech 101 mage set, class-nformaton contans the exstence of specfc vsual objects or regons. Even f n scene related mage classfcaton, such as Scene 15 mage set, the smlarty among the same class s reflected on some local regons. Moreover, the dstrbuton of mage contents s complcated and vared, dfferent mages have large dvergence on the locaton, scale and orentaton of related regons. Lackng ths regon-related nformaton, present Bag of Words model s obtaned through statstcs of the whole mage. Therefore correspondng classfcaton model s not accurate enough because t s not capable of demonstratng the regonal propertes of dfferent mage classes and vulnerable to mage background, the poston of a regon and scale. To address ths problem, people proposed the methods of constructng vsual phrases or vsual synset. These methods manly use the spatal coherent relatons of vsual words to generate hgh-order vsual features. To address the problems of lmted dscrmnaton and descrpton n Bag of Words model as well as the nfluences of background and screenng, we propose Multple Vsual Phrase Learnng (MVPL) mage classfcaton method. We substtute vsual words wth vsual phrases that are more semantcally dscrmnatve and contan spatal relatons to mprove the accuracy of Bag of Words model. In order to depct regonal propertes n the fnal model, we propose n ths artcle a multple vsual words learnng method for mage classfcaton, whch s based on Multple Instance Learnng. The detaled constructon process s as follows: frst of all, construct semantc dscrmnatve and spatal related vsual phrases n place of vsual words to mprove the accuracy of mage presentaton. On ths bass, to depct regonal propertes of mage classes n the fnal model, we propose n ths artcle a multple vsual words learnng method for mage classfcaton, whch s based on MIL and uses the vsual phrase lbrary as new feature space. To test the functon of MVPL proposed n ths artcle, we conducted many experments on standard testng datasets. Frstly, we compare the nfluences of vsual words and phrases on mage classfcaton. Then we compare our method wth some present methods, such as SPM and Spatal-LTM, and the comparsons are performed on some standard testng sets such as Caltech 101 and Scene 15 to prove the effectveness of our method. A. Vsual phrases Vsual words are generated va Unsupervsed Learnng, so the related Bag of Words model suffers from lmted semantc dscrmnaton. In addton, spatal relatons of words are not 1483

15 consdered durng the generaton process. For a gven dataset, vsual words wth certan semantc dscrmnaton are generated n assocaton wth class-nformaton. Then vsually consstent regons are formed va mage over-segmentaton. Spatally confne vsual words va these regons and construct a vsual word set, thus the spatal relatons of vsual words are realzed. In the end, generate vsual phrases from the vsual word sets of all regons va Frequent Item Set Mnng. The detaled generaton process wll be gven as follows. Gven a labeled mage, we frst detect key ponts usng the method proposed by [19-21] and extract SIFT features of the key ponts. Then, cluster features of mages n dfferent classes separately va K-Means. In the end, group all clusterng centers to form the class-related vsual work vocabulary. Assgn every nstance accordng to these centers and gve t correspondng label. A vsually consstent regon means a regon wthn whch the color, texture et al are coherent on some level. Hereby we use Mult-Scale Ncuts proposed by T. Cour et al to ntersect mages, meanwhle a threshold value s used to flter out small regons. Please notce that the am to generate vsually consstent regons s to analyze the spatal relatons of vsual words, and t does not depend on the accuracy of mage ntersecton, the reason of whch s to ensure that each regon s a part of an object. We employ Mult-Scale Ncuts because t algorthm takes mage nformaton under dfferent scales nto consderaton, whch s possble for parallel calculaton to mprove the speed. In addton, the number of ntersected regons can be controlled manually. Durng our experments, each mage s dvded nto 30 blocks and 30 or so vsually consstent regons are generated eventually. Vsual phrases are the combnaton of vsual words whch appear frequently n vsually consstent regons. Here related technques n data mnng are used to generate vsual phrases. An mage X s dvded nto J vsually consstent regons, R ={r j }, j=1,,j, and the vsual words belongng to the j th regon r j s presented as a set G j ={w 1,,w Kj }, n whch K j s the total number of words n r j. For N mages, the set contanng all vsual words s G={G } (=1,2,,N), n whch G ={G j } (j=1,2,,j), and the set of vsual phrases s VP={vp k } (k=1,,k). A vsual phrase vp s generated from G accordng to the frequency of the coexstence of related vsual words. If regard G as a database, then the constructon of vsual phrases can be converted nto the problem of mnng frequent tems n data mnng,.e. searchng for coherent set from canddate sets consstng of vsual words. In ths work we use FP-growth (Frequent Pattern growth) 1484

16 Tao Wang, Wenqng Chen and Balng Wang,A SCENE RECOGNITION ALGORITHM BASED ON MULTI-INSTANCE algorthm to accomplsh ths task. FP-growth constructs a tree based on the frequency of unts after scannng the database, and frequent tem mnng s performed by traversng ths tree structure. The storage requrements and calculaton rate of FP-growth make t sutable for large datasets. B. Multple vsual phrases Ths secton wll propose a Multple Vsual Phrase Learnng (MVPL) method based on Multple Instance Learnng (MIL). In MIL, a tranng sample s a bag consstng of multple nstances. A bag has a label whle a nstance does not. It s generally assumed that there s at least one postve nstance n a postve bag and on postve nstance n a negatve bag. The process of MIL not only requres related model parameters, but also the confrmaton of postve nstances n postve bags. As s mentoned before, the classes of mages are regon related and present marks often only show the exstence of certan regon, so the mage classfcaton can be essentally converted nto an MIL problem. We proposed MVPL startng from vsual phrase based Bag of Words model. Gven a vsual phrase set C vp ={p k k=1,2,,k} n whch p k ={w p k p=1,,p k }, regard C vp as an nstance space. Suppose that an mage X j s dvded nto J regons R ={r 1, r 2,,r J } and vsual words n r j form the word set G j ={w 1,,W Kj }, n whch K j s the number of vsual words n r j k. Regard X as a bag and G j as an nstance n t: X ={G j j=1,2,,j}. Defne the smlarty between the Vsual Phrase p k and Bag X as:. ì1, f pkî X s( pk, X ) =í (14) î 0, otherwse In addton, consderng the hstogram of vsual phrases s better for mage presentaton, the smlarty can be defned as: =å s( p, X ) h ( p ) (15) k j k j In the above formula, h j (p k ) s the number of vsual phrases contaned n the j th regon of X and Σh j s the total number of vsual phrases contaned n the mage. Tradtonal method can be used to solve the problem. Consderng the mappng matrx s a sparse one, L1-norm SVM s stll used for mage classfcaton. MIL based on the above formula s MVPL, short for Multple Vsual Phrase Learnng. Hstogram of vsual phrases can better dstngush the descrptve nformaton of dfferent classes 1485

of mages. In addton, vsual phrases contan the relatons of vsual words extracted from mages, so they are more dscrmnatve and descrptve than vsual words.

17 of mages. In addton, vsual phrases contan the relatons of vsual words extracted from mages, so they are more dscrmnatve and descrptve than vsual words. EXPERIEMNTS AND ANALYSIS Image sets used n the experments are Caltech 101 and Scene 15, whch ths secton wll ntroduce brefly. As ntroduced n the last chapter, Caltech 101 contans mages related to vsual objects such as anmals, vehcles and flowers. There are 102 mage classes and each class contans 31 to 800 mages, whose szes are pxels or so. Scene 15 contans ndoor scenes (such as offces, ktchens and lvng room), outdoor scenes (such as the appearances of buldngs, ctes and streets) and natural scenes (such as dunes, mountans and forests). Each class contans 200 to 400 mages, whose average sze s pxels. Fgure 2 shows three randomly pcked mages from each class. Fgure 2 Examples of Scene-15 For a gven mage set, detect the key ponts of every mage va Harrs-Affne algorthm and extract the related SIFT descrptors. Perform clusterng on SIFT features belongng to the same class of mages va K-Means and generate related vsual words, meanwhle construct vsual phrases n combnaton of mage ntersecton. When vsual phrase set s gven, mages can be expressed by related Bag of Words model. In the meantme, we compare our method to some present algorthms such as SPM and Spatal-LTM to show the effectveness of our method. In the experments on Caltech 101, we use the same confguraton as n the artcle of Cao et al (2007),.e. 28 classes of objects are selected, wth each class contanng at least 60 mages. We randomly pck 30 mages as the tranng set and 30 as testng set. The experment s repeated for 10 tmes and the average accuracy and the devaton are used to measure the functons of algorthms. For Scene 15, we apply the expermental confguraton of SPM,.e. 100 mages of each class are randomly selected as the tranng set and the remanng as testng set. We not only compare our method wth some classcal ones, but also compare our method wth method usng vsual words, 1486

18 Tao Wang, Wenqng Chen and Balng Wang,A SCENE RECOGNITION ALGORITHM BASED ON MULTI-INSTANCE whch s wrtten as MVWL. In MVWL, each vsual word s an nstance and the related Bag of Words model s the bag B. The smlarty s defned as: ì1, f wk Î X s( wk, X ) =í (16) î 0, otherwse A Expermental results Caltech 101 contans 102 mage classes such as anmals, vehcles, flowers and backgrounds, and each class contans 31 to 800 mages, whose szes are pxels or so. To compare our method wth Spatal-LTM proposed by Cao et al (2007), we use the same expermental confguraton as n ther artcle,.e. select 28 classes whch contan more than 60 classes and randomly select 30 mages as tranng set and 30 as testng set n each class. We observed that these 28 classes contan mages wth ragged edges such as sunflower, narrow mages such as flamngo and mages wth surroundng such as cup. The dversty of mage contents bascally reflects the attrbute of Caltech 101. The experment s repeated for 10 tmes, each tme we record the average accuracy of the 28 classes and the average as well as devaton of these 10 experments are calculated. Quantze the SIFT features of the 28 classes of tranng mage nto vsual words wth the length of N1 va K-Means, and we obtan 28*N1 vsual words eventually. Dvde each mage nto 30 regons va Mult-Scale Ncut and group the vsual words from the same regon. All mages form Set G and extract vsual phrases effcently from G usng FP-growth. Frst of all, we experment on the length of vsual word set va VM-MIL and the result s shown n Table 1. It s shown that wth the ncrease of vsual word length, the recognton accuracy gradually ncreases, but the amount of calculaton also goes up. Therefore, we fx N1 to 200 n the followng experments, thus the length of vsual word set s set to Table 1 The average classfcaton accuraces of dfferent lengths of vsual word set Length Accuracy On the bass of vsual word set wth the length of 5600, we construct a vsual phrase vocabulary whose length s about 40,000 wth the vsual phrase generaton algorthm proposed n ths chapter. The comparsons among vsual phrases based MIL, Spatal-LTM and SPM are shown n Table 2. It s shown that VW-MIL s already obvously better than Spatal-LTM and SPM. The accuracy of MVWL s almost **% hgher than VW-MIL and nearly **% hgher than 1487

19 Spatal-LTM and SPM. The classfcaton performance of vsual phrases s even better than vsual words. Table 2 The average accuraces of dfferent methods Spatal-LTM 68.9% SPM 67.9% VW-MIL 70.1% MVWL 76.31% MVPL 76.20% The second experment s performed on 15 Scene. There are 15 classes of scenes n 15 Scene and each contans 200 to 400 mages, whose average sze s pxels. Scene 15 contans ndoor scenes (such as offces, ktchens and lvng room), outdoor scenes (such as the appearances of buldngs, ctes and streets), natural scenes (such as dunes, mountans and forests) and artfcal scenes (such as hghway and suburb), and t s the most thorough scene dataset. The experment s repeated for 10 tmes and ts confguraton s the same as n SPM algorthm (Lazebnk et al, 2006),.e. randomly pck 100 mages from each class as tranng set and the remanng as testng set. Quantze the SIFT features of the 15 classes of tranng mage nto vsual words wth the length of N2 va K-Means, and we obtan 15*N1 vsual words eventually. Dvde each mage nto 20 regons va Mult-Scale Ncut and group the vsual words from the same regon. All mages form Set G and extract vsual phrases effcently from G usng FP-growth. Frst of all, we experment on the length of vsual word set va VM-MIL. We quantze the SIFT features of each mage class nto vsual words va K-Means, and the number of vsual words n each class N2 s ncreased gradually from 200 to 1000, correspondngly the total number of vsual words ncreases from 3000 to The result s n Table 3. It s shown that when the total length of vsual words s we have the hghest accuracy. Consderng the calculaton effcency, we fx N2 to 1000 n the followng experments, thus the length of vsual word set s set to Table 3 The average classfcaton accuraces of dfferent lengths of vsual word set Length Accuracy On the bass of vsual word set wth the length of 15000, we construct a vsual phrase vocabulary whose length s about 25,000. The comparsons among the best results of VW-MIL, MVWL, MVPL and SPM are shown n Table

20 Tao Wang, Wenqng Chen and Balng Wang,A SCENE RECOGNITION ALGORITHM BASED ON MULTI-INSTANCE Table 4 The average accuraces of dfferent methods SPM 81.3% VW-MIL 81.5 % MVWL 86.1% MVPL 88.3% Next we concentrate on the complextes of SPM and MVPL. For the experment on Caltech 101, SPM follows the standard process as shown n lterature (Lazebnk et al, 2006),.e. an mage s descrbed by a 3-layer pyramd. Suppose the sze of vsual word group s 5600, then the dmenson of mage features s 5600*21= Meanwhle HIK (Hstogram Intersecton Kernel) s used to measure the smlarty of dfferent mages, whose calculaton complexty s much hgher than lnear kernel calculaton. In contrast, we select around 40,000 vsual phrases va FP-Tree n MVPL and perform learnng va L1-norm SVM. Thus the operaton complexty of MVPL s lower than SPM. CONCLUSIONS In ths artcle, we proposed a vsual phrase generaton method based on MIL framework and put forward MVPL based on ths. Vsual phrases are constructed from vsual word sets n the same regon and are capable to show the spatal relatons of vsual words. Thus vsual phrases are more dscrmnatve and descrptve. MVPL uses vsual phrase vocabulary as the feature space to make the classfcaton result more accurate. Experments on two mage sets have proved that our method s more effectve than present classcal algorthms.. REFERENCES [1] Ilads, M. ; Seunghwan Yoo ; Xn Xn ; Katsaggelos, A.K,Vrtual tourng: A Content Based Image Retreval applcaton, 2013 IEEE Internatonal Conference on Multmeda and Expo Workshops (ICMEW), pp.1-4, [2] Chuca Y ; YngL Tan,Localzng Text n Scene Images by Boundary Clusterng, Stroke Segmentaton, and Strng Fragment Classfcaton, IEEE Transactons on Image Processng,Volume: 21, Issue: 9,pp ,2012. [3] Maron O, Lozano-Perez T. A Framework for Multple-Instance Learnng [C]. Proceedngs of Neural Informaton Processng Systems, 10: ,

21 [4] Assam Bekkar, et al., SVM Classfcaton of Urban Hgh-Resoluton Imagery Usng Composte Kernels and Contour Informaton, Internatonal Journal of Advanced Computer Scence and Applcatons, vol. 4, no. 7, [5] Wu Z, Ke QF, Sun J Bundlng features for large-scale partal-duplcate web mage search[c]. In Proc. CVPR, [6] Hong Pan ; Yapng Zhu ; Qn, A.K. ; Langzheng Xa,Mnng heterogeneous class-specfc codebook for categorcal object detecton and classfcaton,image Processng (ICIP), th IEEE Internatonal Conference on,pp , [7] Lu D, Hua G, Vola P, Chen T Integrated Feature Selecton and Hgher-order Spatal Feature Extracton for Object Categorzaton[C]. Proceedng of the 26th IEEE Conference on Computer Vson and Pattern Recognton, Anchorage, AK. [8] M.Iwahara, S.C.Mukhopadhyay, S.Yamada and F.P.Dawson, "Development of Passve Fault Current Lmter n Parallel Basng Mode", IEEE Transactons on Magnetcs, Vol. 35, No. 5, pp , September [9] Zheng YT, Zhao M, Neo SY, Chua TS, Tan Q Vsual Synset: towards a Hgher-level Vsual Representaton[C]. In Proc.CVPR, Achorage, Alaska, U.S. [10] Yuan YS, Wu Y, Yang M Dscovery of Collocaton Patterns: from Vsual Words to Vsual Phrases[C]. Proc. of the 25th IEEE Conference on Computer Vson and Pattern Recognton, 1-8. [11] L FF, Perona P A Bayesan Herarchcal Model for Learnng Natural Scene Categores [C]. Proceedng of the 23rd IEEE Computer Socety Conference on Computer Vson and Pattern Recognton, San Dego, CA, USA, [12] Lazebnk S, Schmd C, Ponce J Beyond Bags of Features: Spatal Pyramd Matchng forrecognton Natural Scene Categores [C]. Proceedng of the 24th IEEE Computer Socety Conference on Computer Vson and Pattern Recognton, New York, USA, 2: [13] Lowe D G Dstnctve Image Features form Scale-nvarant Keyponts [J]. Internatonal Journal of Computer Vson, 60(2): [14] G. Sen Gupta, S.C. Mukhopadhyay, Mchael Sutherland and Serge Demdenko, Wreless Sensor Network for Selectve Actvty Montorng n a home for the Elderly, Proceedngs of 2007 IEEE IMTC conference, Warsaw, Poland, (6 pages). 1490

22 Tao Wang, Wenqng Chen and Balng Wang,A SCENE RECOGNITION ALGORITHM BASED ON MULTI-INSTANCE [15] Zhang Q, Goldman S A EM-DD: an mproved multple-nstance learnng technque. Advances n Neural Informaton Processng Systems, Cambrdge, CA: MIT Press, [16] Huang X,Chen SC,Shy M, et.al User concept pattern dscovery usng relevance feedback and multple-nstance learnng for content-based mage retreval [C].MDM/KDD 2002 Workshop Edmonton, [17] Fergus R, L Fefe, Perona P, Zsserman A Learnng Object Categores from Google simage Search [C]. Proceedng of the 10th Internatonal Conference on Computer Vson (ICCV), [18] Ble D, Ng A, Jordan M Latent Drchlet Allocaton [J]. Journal of Machne Learnng Research. 3: [19] Subhas Chandra Mukhopadhyay, and Chen-Hung Lu, Desgnng an Integrated Currculum Platform for Engneerng Educaton: A Hybrd Magnetc Bearng System, Internatonal Journal on Technology and Engneerng Educaton, Vol.7, No.1, pp July [20] Cao LL, L FF Spatally Coherent Latent Topc Model for Concurrent Object Segmentaton and Classfcaton[C]. Proceedng of the 11th IEEE Internatonal Conference on Computer Vson. Ro de Janero, Brazl, [21] Dalal N, Trggs B Hstograms of orented gradents for human detecton Computer [C]. Proceedngs of the 2005 IEEE Computer Socety Conference on Computer Vson and Pattern Recognton, 1(1): [22] S. C. Mukhopadhyay, G. Sen Gupta and S. Demdenko, Intellgent Method of Teachng Eletectromagnetcs Theory Measurement Under Vrtual Envronment, Internatonal Journal on Smart Sensng and Intellgent Systems, Vol. 1, No. 2, June 2008, pp [23] Kadr T, Brady M Scale, Salency and Image Descrpton [J]. Internatonal Journal of Computer Vson, 45(2): [24] N. K. Suryadevara, S. C. Mukhopadhyay. R.K. Rayudu and Y. M. Huang, Sensor Data Fuson to determne Wellness of an Elderly n Intellgent Home Montorng Envronment, Proceedngs of IEEE I2MTC 2012 conference, IEEE Catalog number CFP12MT-CDR, ISBN , May 13-16, 2012, Graz, Austra, pp [25] Yanmn LUO, Pezhong LIU and Mnghong LIAO, AN ARTIFICIAL IMMUNE NETWORK CLUSTERING ALGORITHM FOR MANGROVES REMOTE SENSING, Internatonal Journal on Smart Sensng and Intellgent Systems, VOL. 7, NO. 1, pp ,

Cluster Analysis of Electrical Behavior

Cluster Analysis of Electrical Behavior Journal of Computer and Communcatons, 205, 3, 88-93 Publshed Onlne May 205 n ScRes. http://www.scrp.org/ournal/cc http://dx.do.org/0.4236/cc.205.350 Cluster Analyss of Electrcal Behavor Ln Lu Ln Lu, School