
Categorizing objects: global and part-based models of appearance
UT Austin

Generic categorization problem

Challenges: robustness
Realistic scenes are crowded, cluttered, and have overlapping objects.

Generic category recognition: basic framework
Build/train object model: choose a representation; learn or fit parameters of model / classifier.
Generate candidates in new image.
Score the candidates.

Generic category recognition: representation choice
Window-based or part-based.

Window-based models: building an object model
Simple holistic descriptions of image content: grayscale / color histogram; vector of pixel intensities.

Window-based models: building an object model
Pixel-based representations are sensitive to small shifts.
Color- or grayscale-based appearance descriptions can be sensitive to illumination and intra-class appearance variation.

Window-based models: building an object model
Consider edges, contours, and (oriented) intensity gradients.

Window-based models: building an object model
Consider edges, contours, and (oriented) intensity gradients.
Summarize the local distribution of gradients with a histogram.
Locally orderless: offers invariance to small shifts and rotations.
Contrast normalization: tries to correct for variable illumination.

Window-based models: building an object model
Given the representation, train a binary classifier: a car/non-car classifier that answers "Yes, a car." or "No, not a car."

Discriminative classifier construction
Nearest neighbor (10^6 examples): Shakhnarovich, Viola, Darrell 2003; Berg, Berg, Malik 2005; ...
Neural networks: LeCun, Bottou, Bengio, Haffner 1998; Rowley, Baluja, Kanade 1998; ...
Support Vector Machines: Guyon, Vapnik; Heisele, Serre, Poggio, 2001; ...
Boosting: Viola, Jones 2001; Torralba et al. 2004; Opelt et al. 2006; ...
Conditional Random Fields: McCallum, Freitag, Pereira 2000; Kumar, Hebert 2003; ...
Slide adapted from Antonio Torralba

Window-based models: generating and scoring candidates
Car/non-car classifier

Window-based object detection: recap
Training: 1. Obtain training data. 2. Define features. 3. Define classifier.
Given new image: 1. Slide window. 2. Score by classifier.
(Figure: training examples, feature extraction, car/non-car classifier)
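The detection recap above can be sketched as a single-scale sliding-window loop. This is a minimal illustration under stated assumptions: the window size, the `step` stride, the threshold, and the `mean_intensity` scorer are hypothetical placeholders for a trained car/non-car classifier, and a real detector would also scan an image pyramid to handle multiple scales.

```python
from typing import Callable, List, Sequence, Tuple

Image = Sequence[Sequence[float]]  # grayscale image as rows of pixel values


def sliding_window_detect(
    image: Image,
    win_h: int,
    win_w: int,
    step: int,
    score_fn: Callable[[List[List[float]]], float],
    threshold: float,
) -> List[Tuple[int, int, float]]:
    """Slide a fixed-size window over the image (single scale) and keep the
    windows whose classifier score exceeds the threshold.
    Returns (top row, left column, score) for each kept window."""
    detections = []
    rows, cols = len(image), len(image[0])
    for r in range(0, rows - win_h + 1, step):
        for c in range(0, cols - win_w + 1, step):
            # crop the current window
            window = [list(image[y][c:c + win_w]) for y in range(r, r + win_h)]
            score = score_fn(window)
            if score > threshold:
                detections.append((r, c, score))
    return detections


def mean_intensity(window: List[List[float]]) -> float:
    """Hypothetical stand-in for a trained classifier: mean brightness."""
    total = sum(sum(row) for row in window)
    return total / (len(window) * len(window[0]))
```

On an 8 x 8 black image with a bright 2 x 2 patch at rows and columns 4-5, a 2 x 2 window with stride 1 and threshold 0.9 keeps only the window at (4, 4). Every window is scored independently, which is why the cost grows with the number of positions and scales.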

What classifier?
Factors in choosing:
Generative or discriminative model?
Data resources: how much training data?
How is the labeled data prepared?
Training time allowance.
Test time requirements: real-time?
Fit with the representation.

Issues
What classifier?
What features or representations?
How to make it affordable?
What categories are amenable?

Issues: what categories are amenable?
Similar to specific object matching, we expect spatial layout to be fairly rigidly preserved.
Unlike specific object matching, by training classifiers we attempt to capture intra-class variation or determine required discriminative features.
What categories are amenable to window-based representations?

Window-based models: three case studies
Boosting + face detection (e.g., Viola & Jones)
NN + scene Gist classification (e.g., Hays & Efros)
SVM + person detection (e.g., Dalal & Triggs)

Main idea: Viola-Jones face detector
Represent local texture with efficiently computable "rectangular" features within a window of interest.
Select discriminative features to be weak classifiers.
Use a boosted combination of them as the final classifier.
Form a cascade of such classifiers, rejecting clear negatives quickly.

Boosting intuition
Weak classifier 1
Slide credit: Paul Viola

Boosting illustration
Weights increased

Boosting illustration
Weak classifier 2

Boosting illustration
Weights increased

Boosting illustration
Weak classifier 3

Boosting illustration
The final classifier is a combination of weak classifiers.

Boosting: training
Initially, weight each training example equally.
In each boosting round:
Find the weak learner that achieves the lowest weighted training error.
Raise the weights of training examples misclassified by the current weak learner.
Compute the final classifier as a linear combination of all weak learners (the weight of each learner increases with its accuracy).
Exact formulas for re-weighting and combining weak learners depend on the particular boosting scheme (e.g., AdaBoost).
Slide credit: Lana Lazebnik

Boosting: pros and cons
Advantages of boosting: integrates classification with feature selection; complexity of training is linear in the number of training examples; flexibility in the choice of weak learners and boosting scheme; testing is fast; easy to implement.
Disadvantages: needs many training examples; often found not to work as well as an alternative discriminative classifier, the support vector machine (SVM), especially for many-class problems.
Slide credit: Lana Lazebnik
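The training loop above can be sketched with discrete AdaBoost, one particular boosting scheme, using one-feature threshold "stumps" as weak learners. Assumptions: labels are +1/-1, and the learner weight is the standard discrete-AdaBoost choice $\alpha = \frac{1}{2}\ln\frac{1-\epsilon}{\epsilon}$ for weighted error $\epsilon$. Viola and Jones use a closely related variant, so treat this as an illustration rather than their exact procedure.

```python
import math
from typing import List, Sequence, Tuple

Stump = Tuple[int, float, int]  # (feature index j, threshold theta, polarity p)


def stump_predict(stump: Stump, x: Sequence[float]) -> int:
    """Weak learner: +1 if p * x[j] < p * theta, else -1."""
    j, theta, p = stump
    return 1 if p * x[j] < p * theta else -1


def best_stump(X: List[Sequence[float]], y: List[int], w: List[float]) -> Tuple[Stump, float]:
    """Search every feature, candidate threshold, and polarity for the
    stump with the lowest weighted training error."""
    best: Stump = (0, 0.0, 1)
    best_err = float("inf")
    for j in range(len(X[0])):
        vals = sorted({x[j] for x in X})
        # thresholds below, between, and above the observed feature values
        cands = [vals[0] - 1.0] + [(a + b) / 2 for a, b in zip(vals, vals[1:])] + [vals[-1] + 1.0]
        for theta in cands:
            for p in (1, -1):
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if stump_predict((j, theta, p), xi) != yi)
                if err < best_err:
                    best, best_err = (j, theta, p), err
    return best, best_err


def adaboost(X: List[Sequence[float]], y: List[int], rounds: int) -> List[Tuple[float, Stump]]:
    n = len(X)
    w = [1.0 / n] * n  # initially, weight each training example equally
    ensemble = []
    for _ in range(rounds):
        stump, err = best_stump(X, y, w)
        err = min(max(err, 1e-10), 1.0 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1.0 - err) / err)  # larger for more accurate learners
        ensemble.append((alpha, stump))
        # misclassified examples (y * h = -1) are scaled up, correct ones down
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, xi))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble: List[Tuple[float, Stump]], x: Sequence[float]) -> int:
    """Final classifier: sign of the alpha-weighted vote of the weak learners."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1
```

For example, on the 1-D data `[[1.0], [2.0], [3.0], [4.0]]` with labels `[1, 1, -1, -1]`, the first round already selects the stump `(0, 2.5, 1)`, which separates the two classes. Because `best_stump` searches over features, each round doubles as a feature-selection step.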

Computing the integral image
Cumulative row sum: $s(x, y) = s(x-1, y) + i(x, y)$
Integral image: $ii(x, y) = ii(x, y-1) + s(x, y)$
Slide credit: Lana Lazebnik

Computing sum within a rectangle
Let A, B, C, D be the values of the integral image at the corners of a rectangle, with A at the bottom-right, B at the top-right, C at the bottom-left, and D at the top-left.
Then the sum of original image values within the rectangle can be computed as: sum = A - B - C + D.
Only 3 additions are required for any size of rectangle!
Slide credit: Lana Lazebnik
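The two recurrences and the four-corner rule translate directly into code. A minimal sketch, assuming a grayscale image stored as a list of rows, with y indexing rows and x indexing columns; `integral_image` and `rect_sum` are hypothetical helper names. For an inclusive rectangle, B, C, and D are read just above, just to the left of, and diagonally above-left of the rectangle.

```python
from typing import List

Grid = List[List[int]]


def integral_image(img: Grid) -> Grid:
    """ii[y][x] = sum of img over rows 0..y and columns 0..x (inclusive),
    built with the cumulative row sum s(x, y) from the slide."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        s = 0  # cumulative row sum s(x, y) along row y
        for x in range(w):
            s += img[y][x]                      # s(x, y) = s(x-1, y) + i(x, y)
            above = ii[y - 1][x] if y > 0 else 0
            ii[y][x] = above + s                # ii(x, y) = ii(x, y-1) + s(x, y)
    return ii


def rect_sum(ii: Grid, top: int, left: int, bottom: int, right: int) -> int:
    """Sum of img[top..bottom][left..right] (inclusive) from four lookups:
    A - B - C + D, whatever the rectangle size."""
    A = ii[bottom][right]
    B = ii[top - 1][right] if top > 0 else 0
    C = ii[bottom][left - 1] if left > 0 else 0
    D = ii[top - 1][left - 1] if top > 0 and left > 0 else 0
    return A - B - C + D
```

For `img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]`, `rect_sum(integral_image(img), 1, 1, 2, 2)` returns 28 (that is, 5 + 6 + 8 + 9) with the same four lookups a 1 x 1 rectangle would need.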

Viola-Jones detector: features
"Rectangular" filters: the feature output is the difference between adjacent regions.
Efficiently computable with the integral image: any sum can be computed in constant time.
Avoid scaling images: scale features directly for the same cost.
Integral image: the value at (x, y) is the sum of pixels above and to the left of (x, y).

Viola-Jones detector: features
Considering all possible filter parameters (position, scale, and type): 180,000+ possible features associated with each 24 x 24 window.
Which subset of these features should we use to determine if a window has a face?
Use AdaBoost both to select the informative features and to form the classifier.
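A two-rectangle feature of this kind can be sketched as the difference of two adjacent box sums, each costing four lookups regardless of size. This sketch is illustrative only: it uses a zero-padded integral image built by inclusion-exclusion (equivalent to the row-sum recurrence, but with no boundary checks), and `two_rect_horizontal` is a hypothetical name for one of the filter types.

```python
from typing import List

Grid = List[List[float]]


def padded_integral(img: Grid) -> Grid:
    """(h+1) x (w+1) integral image with a zero first row and column, so
    P[y][x] = sum of img[0..y-1][0..x-1] and no bounds checks are needed."""
    h, w = len(img), len(img[0])
    P = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            P[y + 1][x + 1] = img[y][x] + P[y][x + 1] + P[y + 1][x] - P[y][x]
    return P


def box(P: Grid, y: int, x: int, h: int, w: int) -> float:
    """Sum of the h x w box whose top-left pixel is (y, x): four lookups."""
    return P[y + h][x + w] - P[y][x + w] - P[y + h][x] + P[y][x]


def two_rect_horizontal(P: Grid, y: int, x: int, h: int, w: int) -> float:
    """Hypothetical two-rectangle feature: left h x w box minus the adjacent
    right h x w box. Scaling the feature only changes h and w, not the cost."""
    return box(P, y, x, h, w) - box(P, y, x + w, h, w)
```

On a patch that is bright on the left and dark on the right the feature responds strongly, while on a uniform patch it returns 0. These raw responses are what AdaBoost thresholds when choosing weak classifiers.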

Viola-Jones detector: AdaBoost
We want to select the single rectangle feature and threshold that best separates positive (faces) and negative (non-faces) training examples, in terms of weighted error.
Resulting weak classifier: $h_t(x) = 1$ if $p_t f_t(x) < p_t \theta_t$, and $0$ otherwise, where $f_t$ is the rectangle feature, $\theta_t$ the threshold, and $p_t$ the polarity.
(Figure: outputs of a possible rectangle feature on faces and non-faces.)
For the next round, reweight the examples according to their errors and choose another filter/threshold combination.

Viola-Jones face detector: results
First two features selected
Perceptual and Sensory Augmented Computing; Visual Object Recognition Tutorial

Even if the filters are fast to compute, each new image has a lot of possible windows to search. How can we make detection more efficient?

Cascading classifiers for detection
Form a cascade with low false negative rates early on.
Apply less accurate but faster classifiers first to immediately discard windows that clearly appear to be negative.
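The early-rejection logic of a cascade can be sketched as follows. The stages here are hypothetical (scoring function, threshold) pairs, assumed to be ordered from cheapest to most expensive; in Viola-Jones each stage would itself be a boosted classifier whose threshold is tuned for a low false negative rate.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical stage structure: (scoring function, acceptance threshold).
Stage = Tuple[Callable[[Sequence[float]], float], float]


def cascade_classify(stages: List[Stage], window: Sequence[float]) -> Tuple[bool, int]:
    """Run stages in order (cheapest first). Reject as soon as one stage's
    score falls below its threshold. Returns (is_positive, stages_evaluated)."""
    for i, (score_fn, threshold) in enumerate(stages, start=1):
        if score_fn(window) < threshold:
            return False, i  # early rejection: later, costlier stages never run
    return True, len(stages)
```

The second return value makes the efficiency argument concrete: a clearly negative window costs one stage evaluation, and only windows that survive every stage pay for the whole cascade.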

Viola-Jones detector: summary
Train a cascade of classifiers with AdaBoost on faces and non-faces, yielding the selected features, thresholds, and weights; apply the cascade to each new image.
Train with 5K positives, 350M negatives.
Real-time detector using a 38-layer cascade.
6061 features in all layers.
[Implementation available in OpenCV: http://www.intel.com/technology/computing/opencv/]

Viola-Jones detector: summary
A seminal approach to real-time object detection.
Training is slow, but detection is very fast.
Key ideas: integral images for fast feature evaluation; boosting for feature selection; attentional cascade of classifiers for fast rejection of non-face windows.
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.
P. Viola and M. Jones. Robust real-time face detection. IJCV 57(2), 2004.

Viola-Jones face detector: results

Detecting profile faces? Can we use the same detector?

Viola-Jones face detector: results (Paul Viola, ICCV tutorial)

Example using the Viola-Jones detector
Frontal faces detected and then tracked; character names inferred with alignment of script and subtitles.
Everingham, M., Sivic, J. and Zisserman, A. "Hello! My name is... Buffy" - Automatic naming of characters in TV video, BMVC 2006. http://www.robots.ox.ac.uk/~vgg/research/nface/index.html

Consumer application: iPhoto
http://www.apple.com/ilife/iphoto/
Slide credit: Lana Lazebnik

Consumer application: iPhoto
Things iPhoto thinks are faces
Slide credit: Lana Lazebnik

Consumer application: iPhoto
Can be trained to recognize pets!
http://www.maclife.com/article/news/iphotos_faces_recognizes_cats
Slide credit: Lana Lazebnik

Nearest neighbor classification
Assign the label of the nearest training data point to each test data point.
(Figure from Duda et al.: black = negative, red = positive. A novel test example is closest to a positive example from the training set, so we classify it as positive. Voronoi partitioning of feature space for 2-category 2D data.)

K-nearest neighbors classification
For a new point, find the k closest points from the training data.
Labels of the k points vote to classify.
(Figure: black = negative, red = positive.) With k = 5, if the query lands here, the 5 NN consist of 3 negatives and 2 positives, so we classify it as negative.
Source: D. Lowe

A nearest neighbor recognition example
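A minimal k-NN classifier can be sketched as below, assuming Euclidean distance and a simple majority vote (ties in the vote go to the label that appears first among the sorted neighbors); `knn_classify` is a hypothetical name.

```python
from collections import Counter
import math
from typing import List, Sequence


def knn_classify(train_X: List[Sequence[float]], train_y: List[str],
                 query: Sequence[float], k: int) -> str:
    """Label the query by majority vote among its k nearest training points
    (Euclidean distance); k = 1 is plain nearest neighbor."""
    dists = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]
```

For a query at the origin whose single nearest neighbor is positive but whose five nearest neighbors are three negatives and two positives, `k=1` returns the positive label and `k=5` the negative one, mirroring the two cases described above. Note that every query scans all training points, which is the "large search problem" listed among the cons.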

Where in the World?
[Hays and Efros. im2gps: Estimating Geographic Information from a Single Image. CVPR 2008.]

Where in the World?
6+ million geotagged photos by 109,788 photographers, annotated by Flickr users.

Which scene properties are relevant?

Spatial Envelope Theory of Scene Representation
Oliva & Torralba (2001)
A scene is a single surface that can be represented by global (statistical) descriptors.
Slide credit: Aude Oliva

Global texture: capturing the "Gist" of the scene
Capture global image properties while keeping some spatial information.
Oliva & Torralba IJCV 2001; Torralba et al. CVPR 2003
Gist descriptor

Which scene properties are relevant?
Gist scene descriptor
Color histograms: L*A*B*, 4x14x14 histograms
Texton histograms: 512 entries, filter bank based
Line features: histograms of straight line statistics

Scene Matches
[Hays and Efros. im2gps: Estimating Geographic Information from a Single Image. CVPR 2008.]
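The color-histogram cue can be sketched as a joint 4 x 14 x 14 histogram over L*, a*, and b*. Assumptions: the pixels have already been converted to L*a*b*, L* lies in [0, 100] and a*, b* in [-128, 128). These ranges, the uniform binning, and the name `lab_histogram` are illustrative choices, not details taken from Hays and Efros.

```python
from typing import List, Sequence, Tuple

Lab = Tuple[float, float, float]


def lab_histogram(pixels: Sequence[Lab], bins=(4, 14, 14),
                  ranges=((0.0, 100.0), (-128.0, 128.0), (-128.0, 128.0))) -> List[float]:
    """Joint L*a*b* histogram, flattened and normalized to sum to 1.
    Assumes pixels are already in L*a*b*; the value ranges are assumptions."""
    hist = [0.0] * (bins[0] * bins[1] * bins[2])
    for px in pixels:
        idx = []
        for value, n, (lo, hi) in zip(px, bins, ranges):
            b = int((value - lo) / (hi - lo) * n)
            idx.append(min(max(b, 0), n - 1))  # clamp values on or outside the range
        # flatten (L bin, a bin, b bin) into one index
        hist[(idx[0] * bins[1] + idx[1]) * bins[2] + idx[2]] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist
```

The result is a 784-dimensional vector per image, one of several cues that can be compared with a distance function when searching for nearest-neighbor scene matches.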

Scene Matches
[Hays and Efros. im2gps: Estimating Geographic Information from a Single Image. CVPR 2008.]

Scene Matches
[Hays and Efros. im2gps: Estimating Geographic Information from a Single Image. CVPR 2008.]

Quantitative Evaluation
Test set

The Importance of Data
[Hays and Efros. im2gps: Estimating Geographic Information from a Single Image. CVPR 2008.]

Nearest neighbors: pros and cons
Pros: simple to implement; flexible to feature / distance choices; naturally handles multi-class cases; can do well in practice with enough representative data.
Cons: large search problem to find nearest neighbors; storage of data; must know we have a meaningful distance function.

Linear classifiers

Linear classifiers
Find a linear function to separate positive and negative examples:
$\mathbf{x}_i$ positive: $\mathbf{x}_i \cdot \mathbf{w} + b \geq 0$
$\mathbf{x}_i$ negative: $\mathbf{x}_i \cdot \mathbf{w} + b < 0$
Which line is best?

Support Vector Machines (SVMs)
Discriminative classifier based on the optimal separating line (for the 2D case).
Maximize the margin between the positive and negative training examples.

Support vector machines
We want the line that maximizes the margin.
$\mathbf{x}_i$ positive ($y_i = 1$): $\mathbf{x}_i \cdot \mathbf{w} + b \geq 1$
$\mathbf{x}_i$ negative ($y_i = -1$): $\mathbf{x}_i \cdot \mathbf{w} + b \leq -1$
For support vectors: $\mathbf{x}_i \cdot \mathbf{w} + b = \pm 1$
Distance between a point $\mathbf{x}_i$ and the line: $\frac{|\mathbf{x}_i \cdot \mathbf{w} + b|}{\|\mathbf{w}\|}$
For support vectors: $\frac{\mathbf{w}^T \mathbf{x} + b}{\|\mathbf{w}\|} = \frac{\pm 1}{\|\mathbf{w}\|}$, so the margin is $M = \left|\frac{1}{\|\mathbf{w}\|} - \frac{-1}{\|\mathbf{w}\|}\right| = \frac{2}{\|\mathbf{w}\|}$
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998

Finding the maximum margin line
1. Maximize the margin $2/\|\mathbf{w}\|$.
2. Correctly classify all training data points:
$\mathbf{x}_i$ positive ($y_i = 1$): $\mathbf{x}_i \cdot \mathbf{w} + b \geq 1$
$\mathbf{x}_i$ negative ($y_i = -1$): $\mathbf{x}_i \cdot \mathbf{w} + b \leq -1$
Since maximizing $2/\|\mathbf{w}\|$ is equivalent to minimizing $\|\mathbf{w}\|^2$, this is a quadratic optimization problem:
minimize $\frac{1}{2}\mathbf{w}^T\mathbf{w}$ subject to $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1$ for all $i$.

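Finding the maximum margin line
Solution: $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$, where $\alpha_i$ is the learned weight and $\mathbf{x}_i$ a support vector.
$b = y_i - \mathbf{w} \cdot \mathbf{x}_i$ (for any support vector)
$\mathbf{w} \cdot \mathbf{x} + b = \sum_i \alpha_i y_i \mathbf{x}_i \cdot \mathbf{x} + b$
Classification function: $f(\mathbf{x}) = \operatorname{sign}(\mathbf{w} \cdot \mathbf{x} + b) = \operatorname{sign}\left(\sum_i \alpha_i y_i \mathbf{x}_i \cdot \mathbf{x} + b\right)$
If $f(\mathbf{x}) < 0$, classify as negative; if $f(\mathbf{x}) > 0$, classify as positive.
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998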
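The solution formulas can be checked numerically on a two-point toy problem. Assumed data: support vectors (1, 1) with y = +1 and (-1, -1) with y = -1, whose maximum-margin solution has $\alpha_1 = \alpha_2 = 0.25$, giving $\mathbf{w} = (0.5, 0.5)$ and $b = 0$. In practice the $\alpha_i$ come from a QP solver, which is omitted here; the function names are hypothetical.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def weight_vector(alphas: List[float], ys: List[int], svs: List[Sequence[float]]) -> List[float]:
    """w = sum_i alpha_i y_i x_i over the support vectors."""
    w = [0.0] * len(svs[0])
    for a, y, x in zip(alphas, ys, svs):
        for d in range(len(w)):
            w[d] += a * y * x[d]
    return w


def bias(w: Sequence[float], y_sv: int, x_sv: Sequence[float]) -> float:
    """b = y_i - w . x_i for any support vector (x_i, y_i)."""
    return y_sv - dot(w, x_sv)


def classify(alphas: List[float], ys: List[int], svs: List[Sequence[float]],
             b: float, x: Sequence[float]) -> int:
    """f(x) = sign(sum_i alpha_i y_i (x_i . x) + b). Only dot products with
    the support vectors are needed; points exactly on the boundary get -1 here."""
    s = sum(a * y * dot(sv, x) for a, y, sv in zip(alphas, ys, svs)) + b
    return 1 if s > 0 else -1
```

Here the margin $2/\|\mathbf{w}\| = 2\sqrt{2}$ equals the distance between the two support vectors, as expected. Writing `classify` purely in terms of dot products is what makes the kernel substitution below possible.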

Person detection with HoGs & linear SVMs
Dalal & Triggs, CVPR 2005
Map each grid cell in the input window to a histogram counting the gradients per orientation.
Train a linear SVM using a training set of pedestrian vs. non-pedestrian windows.
Code available: http://pascal.inrialpes.fr/soft/olt/
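The per-cell histogram step can be sketched as follows. This is a simplified illustration, not Dalal & Triggs' released code: it assumes a grayscale image given as nested lists, uses 8 x 8 pixel cells, 9 unsigned orientation bins (0 to 180 degrees), and centered [-1, 0, 1] differences, and it omits the vote interpolation and overlapping-block contrast normalization of the full descriptor.

```python
import math
from typing import List, Sequence


def cell_orientation_histograms(img: Sequence[Sequence[float]], cell: int = 8,
                                n_bins: int = 9) -> List[List[float]]:
    """Simplified HoG-style sketch: one histogram of unsigned gradient
    orientation per non-overlapping cell, each vote weighted by gradient
    magnitude. Cells are listed row by row."""
    h, w = len(img), len(img[0])
    hists = []
    for cy in range(0, h - cell + 1, cell):
        for cx in range(0, w - cell + 1, cell):
            hist = [0.0] * n_bins
            for y in range(cy, cy + cell):
                for x in range(cx, cx + cell):
                    # centered differences, clamped at the image border
                    gx = img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]
                    gy = img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]
                    mag = math.hypot(gx, gy)
                    angle = math.degrees(math.atan2(gy, gx)) % 180.0  # unsigned
                    hist[int(angle / (180.0 / n_bins)) % n_bins] += mag
            hists.append(hist)
    return hists
```

Concatenating the (normalized) cell histograms gives the window descriptor $\mathbf{x}$, which the linear SVM scores as $\mathbf{w} \cdot \mathbf{x} + b$. A vertical step edge, for instance, puts all of its gradient mass in the 0-degree bin.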

HoG descriptor
Dalal & Triggs, CVPR 2005

Person detection with HoGs & linear SVMs
Histograms of Oriented Gradients for Human Detection, Navneet Dalal, Bill Triggs, International Conference on Computer Vision & Pattern Recognition, June 2005
http://lear.inrialpes.fr/pubs/2005/dt05/

Questions
What if the data is not linearly separable?
What if we have more than just two categories?

Non-linear SVMs
Datasets that are linearly separable with some noise work out great.
But what are we going to do if the dataset is just too hard?
How about mapping the data to a higher-dimensional space, for example $x \mapsto (x, x^2)$?

Non-linear SVMs: feature spaces
General idea: the original input space can be mapped to some higher-dimensional feature space where the training set is separable: $\Phi: \mathbf{x} \rightarrow \varphi(\mathbf{x})$
Slide from Andrew Moore's tutorial: http://www.autonlab.org/tutorials/svm.html

The kernel trick
The linear classifier relies on the dot product between vectors: $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^T \mathbf{x}_j$.
If every data point is mapped into a high-dimensional space via some transformation $\Phi: \mathbf{x} \rightarrow \varphi(\mathbf{x})$, the dot product becomes $K(\mathbf{x}_i, \mathbf{x}_j) = \varphi(\mathbf{x}_i)^T \varphi(\mathbf{x}_j)$.
A kernel function is a similarity function that corresponds to an inner product in some expanded feature space.
Slide from Andrew Moore's tutorial: http://www.autonlab.org/tutorials/svm.html

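Example
2-dimensional vectors $\mathbf{x} = [x_1\ x_2]$; let $K(\mathbf{x}_i, \mathbf{x}_j) = (1 + \mathbf{x}_i^T \mathbf{x}_j)^2$.
We need to show that $K(\mathbf{x}_i, \mathbf{x}_j) = \varphi(\mathbf{x}_i)^T \varphi(\mathbf{x}_j)$:
$K(\mathbf{x}_i, \mathbf{x}_j) = (1 + \mathbf{x}_i^T \mathbf{x}_j)^2$
$= 1 + x_{i1}^2 x_{j1}^2 + 2 x_{i1} x_{j1} x_{i2} x_{j2} + x_{i2}^2 x_{j2}^2 + 2 x_{i1} x_{j1} + 2 x_{i2} x_{j2}$
$= [1\ \ x_{i1}^2\ \ \sqrt{2}\, x_{i1} x_{i2}\ \ x_{i2}^2\ \ \sqrt{2}\, x_{i1}\ \ \sqrt{2}\, x_{i2}]^T [1\ \ x_{j1}^2\ \ \sqrt{2}\, x_{j1} x_{j2}\ \ x_{j2}^2\ \ \sqrt{2}\, x_{j1}\ \ \sqrt{2}\, x_{j2}]$
$= \varphi(\mathbf{x}_i)^T \varphi(\mathbf{x}_j)$, where $\varphi(\mathbf{x}) = [1\ \ x_1^2\ \ \sqrt{2}\, x_1 x_2\ \ x_2^2\ \ \sqrt{2}\, x_1\ \ \sqrt{2}\, x_2]$

Nonlinear SVMs
The kernel trick: instead of explicitly computing the lifting transformation $\varphi(\mathbf{x})$, define a kernel function $K$ such that $K(\mathbf{x}_i, \mathbf{x}_j) = \varphi(\mathbf{x}_i) \cdot \varphi(\mathbf{x}_j)$.
This gives a nonlinear decision boundary in the original feature space: $\sum_i \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b$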
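The identity $K(\mathbf{x}_i, \mathbf{x}_j) = \varphi(\mathbf{x}_i)^T \varphi(\mathbf{x}_j)$ derived above can be checked numerically. A minimal sketch for 2-D inputs; the function names are hypothetical.

```python
import math
from typing import List, Sequence


def poly2_kernel(xi: Sequence[float], xj: Sequence[float]) -> float:
    """K(x_i, x_j) = (1 + x_i^T x_j)^2 for 2-D inputs."""
    return (1.0 + xi[0] * xj[0] + xi[1] * xj[1]) ** 2


def phi(x: Sequence[float]) -> List[float]:
    """Explicit lifting for the same kernel:
    [1, x1^2, sqrt(2) x1 x2, x2^2, sqrt(2) x1, sqrt(2) x2]."""
    r2 = math.sqrt(2.0)
    x1, x2 = x
    return [1.0, x1 * x1, r2 * x1 * x2, x2 * x2, r2 * x1, r2 * x2]


def lifted_dot(xi: Sequence[float], xj: Sequence[float]) -> float:
    """phi(x_i)^T phi(x_j), computed in the 6-D lifted space."""
    return sum(a * b for a, b in zip(phi(xi), phi(xj)))
```

For example, with $\mathbf{x}_i = (1, 2)$ and $\mathbf{x}_j = (3, -1)$ both sides equal 4 (up to floating-point rounding), but the kernel needs only one 2-D dot product while the lifted form works in 6 dimensions.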

Examples of kernel functions
Linear: $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^T \mathbf{x}_j$
Gaussian RBF: $K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2}\right)$
Histogram intersection: $K(\mathbf{x}_i, \mathbf{x}_j) = \sum_k \min(x_i(k), x_j(k))$

SVMs for recognition
1. Define your representation for each example.
2. Select a kernel function.
3. Compute pairwise kernel values between labeled examples.
4. Use this "kernel matrix" to solve for SVM support vectors & weights.
5. To classify a new example: compute kernel values between the new input and the support vectors, apply weights, and check the sign of the output.
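The three kernels, together with steps 3 and 5 of the recipe, can be sketched as follows. Step 4 (solving for the support vectors and weights) requires a quadratic-programming solver and is omitted; the `sigma` default and all function names are illustrative assumptions.

```python
import math
from typing import Callable, List, Sequence

Vec = Sequence[float]


def linear_kernel(xi: Vec, xj: Vec) -> float:
    return sum(a * b for a, b in zip(xi, xj))


def rbf_kernel(xi: Vec, xj: Vec, sigma: float = 1.0) -> float:
    sq = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-sq / (2.0 * sigma ** 2))


def hist_intersection_kernel(xi: Vec, xj: Vec) -> float:
    return sum(min(a, b) for a, b in zip(xi, xj))


def kernel_matrix(X: List[Vec], kernel: Callable[[Vec, Vec], float]) -> List[List[float]]:
    """Step 3: pairwise kernel values between all labeled examples."""
    return [[kernel(a, b) for b in X] for a in X]


def decision_value(x: Vec, svs: List[Vec], alphas: List[float], ys: List[int],
                   b: float, kernel: Callable[[Vec, Vec], float]) -> float:
    """Step 5: sum_i alpha_i y_i K(x_i, x) + b; its sign gives the label."""
    return sum(a * y * kernel(sv, x) for sv, a, y in zip(svs, alphas, ys)) + b
```

The kernel matrix is n x n for n training examples, which is the source of the training-time memory cost listed among the SVM cons.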

Multi-class SVMs
Achieve a multi-class classifier by combining a number of binary classifiers.
One vs. all. Training: learn an SVM for each class vs. the rest. Testing: apply each SVM to the test example and assign it the class of the SVM that returns the highest decision value.
One vs. one. Training: learn an SVM for each pair of classes. Testing: each learned SVM "votes" for a class to assign to the test example.
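Both combination rules can be sketched given already-trained binary decision functions. Assumptions: each model is any callable returning a real-valued decision value; for one vs. one, a positive value for the pair (a, b) counts as a vote for a, and ties go to the class that received a vote first.

```python
from collections import Counter
from typing import Callable, Dict, Sequence, Tuple

Vec = Sequence[float]
Decision = Callable[[Vec], float]  # a trained binary SVM's decision value


def one_vs_all_predict(models: Dict[str, Decision], x: Vec) -> str:
    """models[c] scores class c vs. the rest; pick the highest decision value."""
    return max(models, key=lambda c: models[c](x))


def one_vs_one_predict(models: Dict[Tuple[str, str], Decision], x: Vec) -> str:
    """models[(a, b)] > 0 votes for a, otherwise for b; the most votes win."""
    votes: Counter = Counter()
    for (a, b), f in models.items():
        votes[a if f(x) > 0 else b] += 1
    return votes.most_common(1)[0][0]
```

With C classes, one vs. all trains C models while one vs. one trains C(C-1)/2, each on a smaller two-class subset of the data.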

SVMs: pros and cons
Pros: the kernel-based framework is very powerful and flexible. Often a sparse set of support vectors, so compact at test time. Work very well in practice, even with very small training sample sizes.
Cons: no "direct" multi-class SVM; must combine two-class SVMs. Can be tricky to select the best kernel function for a problem. Computation and memory: during training, must compute the matrix of kernel values for every pair of examples; learning can take a very long time for large-scale problems.
Adapted from Lana Lazebnik

Scoring a sliding window detector
If prediction and ground truth are bounding boxes, when do we have a correct detection?

Scoring a sliding window detector
$a_o = \frac{\operatorname{area}(B_p \cap B_{gt})}{\operatorname{area}(B_p \cup B_{gt})}$; the detection is correct if $a_o > 0.5$.
We'll say the detection is correct (a "true positive") if the intersection of the bounding boxes, divided by their union, is > 50%.

Scoring an object detector
If the detector can produce a confidence score on the detections, then we can plot the rate of true vs. false positives as a threshold on the confidence is varied.
TPR = fraction of positive examples that are correctly labeled.
FPR = fraction of negative examples that are misclassified as positive.
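The overlap test and the TPR/FPR sweep can be sketched as below. Assumptions: boxes are axis-aligned (x_min, y_min, x_max, y_max) tuples, each scored example is already labeled positive or negative, there is at least one of each, and an example counts as predicted positive when its score is at least the current threshold.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(bp: Box, bgt: Box) -> float:
    """a_o: area of intersection over area of union of two boxes."""
    ix = max(0.0, min(bp[2], bgt[2]) - max(bp[0], bgt[0]))
    iy = max(0.0, min(bp[3], bgt[3]) - max(bp[1], bgt[1]))
    inter = ix * iy

    def area(b: Box) -> float:
        return (b[2] - b[0]) * (b[3] - b[1])

    union = area(bp) + area(bgt) - inter
    return inter / union if union > 0 else 0.0


def is_correct(bp: Box, bgt: Box, thresh: float = 0.5) -> bool:
    """True positive if the overlap strictly exceeds the threshold."""
    return iou(bp, bgt) > thresh


def roc_points(scores: List[float], labels: List[bool]) -> List[Tuple[float, float]]:
    """(FPR, TPR) at each distinct confidence threshold, from high to low."""
    P = sum(labels)
    N = len(labels) - P
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= t and l)
        fp = sum(1 for s, l in zip(scores, labels) if s >= t and not l)
        points.append((fp / N, tp / P))
    return points
```

Two 2 x 2 boxes offset by one unit in each direction overlap with $a_o = 1/7$, so that detection would not count; lowering the confidence threshold can only move the (FPR, TPR) point up and to the right.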

Window-based detection: strengths
Sliding window detection and global appearance descriptors: a simple detection protocol to implement; good feature choices are critical; past successes for certain classes.

Window-based detection: limitations
High computational complexity. For example: 250,000 locations x 30 orientations x 4 scales = 30,000,000 evaluations!
If binary detectors are trained independently, the cost increases linearly with the number of classes.
With so many windows, the false positive rate had better be low.

Limitations (continued)
Not all objects are "box" shaped.

Limitations (continued)
Non-rigid, deformable objects are not captured well with representations assuming a fixed 2D structure; otherwise we must assume a fixed viewpoint.
Objects with less-regular textures are not captured well with holistic appearance-based descriptions.

Limitations (continued)
If windows are considered in isolation, context is lost.
(Figure: sliding window vs. the detector's view. Figure credit: Derek Hoiem)

Limitations (continued)
In practice, often entails a large, cropped training set (expensive).
Requiring a good match to a global appearance description can lead to sensitivity to partial occlusions.
Image credit: Adam, Rivlin, & Shimshoni

Summary
Basic pipeline for window-based detection: model/representation/classifier choice; sliding window and classifier scoring.
Discriminative classifiers for window-based representations: boosting (Viola-Jones face detector example); nearest neighbors (scene recognition example); support vector machines (HOG person detection example).
Pros and cons of window-based detection.