Bag of Words or Bag of Features. Lecture-16 (Slides Credit: Cordelia Schmid LEAR INRIA Grenoble)

Size: px

Start display at page:

Download "Bag of Words or Bag of Features. Lecture-16 (Slides Credit: Cordelia Schmid LEAR INRIA Grenoble)"

Colin Charles
5 years ago
Views:

1 Bag of Words or Bag of Features Lecture-16 (Slides Credit: Cordelia Schmid LEAR INRIA Grenoble)

2 Contents Interest Point Detector Interest Point Descriptor K-means clustering Support Vector Machine (SVM) classifier Evaluation Metrics: Precision & Recall

3 Image classification Image classification: assigning a class label to the image Car: present Cow: present Bike: not present Horse: not present

4 Image Tasks classification Image classification: assigning a class label to the image Car: present Cow: present Bike: not present Horse: not present Object localization: define the location and the category Car Cow Location Category

5 Difficulties: within object variations Variability: Camera position, Illumination,Internal parameters Within-object variations

6 Difficulties: within class variations

7 Given Image classification Positive training images containing an object class Negative training images that don t Classify A test image as to whether it contains the object class or not?

8 Bag-of-features Origin: texture recognition Texture is characterized by the repetition of basic elements or textons Julesz, 1981; Cula & Dana, 2001; Leung & Malik 2001; Mori, Belongie & Malik, 2001; Schmid 2001; Varma & Zisserman, 2002, 2003; Lazebnik, Schmid & Ponce, 2003

9 Bag-of-features Origin: texture recognition histogram Universal texton dictionary Julesz, 1981; Cula & Dana, 2001; Leung & Malik 2001; Mori, Belongie & Malik, 2001; Schmid 2001; Varma & Zisserman, 2002, 2003; Lazebnik, Schmid & Ponce, 2003

10 West Bank water row The last Uncovering duel the trenches Palestinians have accused After quarrelling I know quite over a a lot bank about these Israel of Morning diverting sickness water away loan, two people, men took through part documentary in the from their Just towns a few in of order the opening to keep bars last fatal evidence duel staged relating on to them and Jewish settlements are enough to in transport the many Scottish contemporary soil. BBC News's accounts of their occupied unsuspecting territories fully souls back to the James Landale times. I can retraces identify the the ship my supplied. school Israel assembly denies the halls of their Documents Documents steps of Doc-1 ancestor his ancestor, Doc-2 served Doc-3 who on Doc-4 in the charge childhood. saying Palestinian Words Doc-1 Doc-2 Doc-3 Doc-4 made that Seven final Years' challenge. War, his father's farmers are to blame for using Bank house on Holy Island and illegal connections to irrigate 0 0 probably the Bank Bank : 1 : their fields. Loan Bank Bank : 1 0 : Words Bank Loan Water Doc-1 Doc-1 Loan Loan : 1 : Water Water : 0 : Farmer Farmer : 0 : Doc-2 Water Doc-2 Loan Loan : 0 0 : 0 Water Water : 2 2 : 2 Farmer Farmer : 0 0 : Farmer Farmer

11 Feature Detection 11 Image Dataset Local Patches

12 Local Patches Descriptors Clustering Words Generate the visual vocabulary 12

13 13 Image Dataset Local Patches Visual Words Represent an image as a histogram or bag of words

14 Bag-of-features Origin: bag-of-words (text) Orderless document representation: frequencies of words from a dictionary Classification to determine document categories Bag-of-words Common People Sculpture

15 Bag-of-features for image classification SVM Extract regions or Compute descriptors Find clusters and frequencies Compute distance matrix Classification Interest Points

16 Bag-of-features for image classification SVM Extract regions Compute descriptors Find clusters and frequencies Compute distance matrix Step 1 Step 2 Step 3 Classification

17 Step 1: feature extraction Detect Interest Points SIFT Harris Dense (take every nth pixel as interest point) Compute Descriptor around each interest point SIFT HOG

18 Dense features

19 Bag-of-features for image classification SVM Extract regions Compute descriptors Find clusters and frequencies Compute distance matrix Step 1 Step 2 Step 3 Classification

20 Step 2: Quantization Visual vocabulary Clustering

21 Step 2: Quantization Cluster descriptors K-means Assign each visual word to a cluster Build frequency histogram

22 K-Means

23 K-means Clustering: Step 1 Algorithm: k-means, Distance Metric: Euclidean Distance k 1 2 k k 3 From unknown source on internet

24 K-means Clustering: Step 2 Algorithm: k-means, Distance Metric: Euclidean Distance k 1 2 k k 3 From unknown source on internet

25 K-means Clustering: Step 3 Algorithm: k-means, Distance Metric: Euclidean Distance 5 4 k k 2 k From unknown source on internet

26 K-means Clustering: Step 4 Algorithm: k-means, Distance Metric: Euclidean Distance 5 4 k k 2 k From unknown source on internet

27 expression in condition 2 K-means Clustering: Step 5 Algorithm: k-means, Distance Metric: Euclidean Distance k 2 k expression in condition 1 k 1 From unknown source on internet

28 Example: 3-means Clustering Convergence in 3 steps from Duda et al.

29 Airplanes Motorbikes Faces Wild Cats Leaves People Bikes Examples for visual words

30 Image representation frequency.. codewords each image is represented by a vector, typically dimension,

31 Bag-of-features for image classification SVM Extract regions Compute descriptors Find clusters and frequencies Compute distance matrix Step 1 Step 2 Step 3 Classification

32 Step 3: Classification Learn a decision rule (classifier) assigning bag-offeatures representations of images to different classes Decision boundary Zebra Non-zebra

33 Training data Vectors are histograms, one from each training image positive negative Train classifier,e.g.svm

34 Support Vector Machines (SVM)

35 Application Pattern recognition Object classification/detection

36 Usage The classifier must be trained using a set of negative and positive examples. The classifier learns the regularities in the data If training was successful classifier is capable of classifying an unknown example with a high degree of accuracy.

37 Linear Classifier Binary classifier Task of separating classes in feature space w T x + b > 0 w T x + b = 0 w T x + b < 0 f(x) = sign(w T x + b)

38 Linear Classifier cont d Which of the linear separators is optimal?

39 Margin Distance from example to the separator is (Point to Plane Distance Equation) Examples closest to the hyperplane are support vectors. r w T x b w Margin 2γ of the separator is the width of separation between classes. r 2γ

40 Maximum Margin Classification Maximizing the margin is good according to intuition. Implies that only support vectors are important; other training examples are ignorable.

41 LibSVM SVM implementation

42 Kernels for bags of features Histogram intersection kernel: Generalized Gaussian kernel: D can be Euclidean distance RBF kernel D can be χ 2 distance Earth mover s distance N i i h i h h h I )) ( ), ( min( ), ( ), ( 1 exp ), ( h h D A h h K N i i h i h i h i h h h D ) ( ) ( ) ( ) ( ), (

43 Combining features SVM with multi-channel chi-square kernel Channel c is a combination of detector, descriptor D H, H ) is the chi-square distance between histograms A c c( i j D c 1 m 2 ( H1, H 2) [( h i i h i h i h ) ( 1 2i )] 2 is the mean value of the distances between all training sample Extension: learning of the weights, for example with Multiple Kernel Learning (MKL) J. Zhang, M. Marszalek, S. Lazebnik and C. Schmid. Local features and kernels for classification of texture and object categories: a comprehensive study, IJCV 2007.

44 Multi-class SVMs Various direct formulations exist, but they are not widely used in practice. It is more common to obtain multi-class SVMs by combining two-class SVMs in various ways. One versus all: Training: learn an SVM for each class versus the others Testing: apply each SVM to test example and assign to it the class of the SVM that returns the highest decision value One versus one: Training: learn an SVM for each pair of classes Testing: each learned SVM votes for a class to assign to the test example

45 Why does SVM learning work? Learns foreground and background visual words foreground words high weight background words low weight

Correct Image: 37 20 40 60 80 100 120 20

150 200 Correct Image: 38 Correct Image:

foreground word more probable background

46 Illustration Localization according to visual word probability Correct Image: 35 Correct Image: Correct Image: 38 Correct Image: foreground word more probable background word more probable

47 Illustration A linear SVM trained from positive and negative window descriptors A few of the highest weighted descriptor vector dimensions (= 'PAS + tile') + lie on object boundary (= local shape structures common to many training exemplars)

48 Bag-of-features for image classification Excellent results in the presence of background clutter bikes books building cars people phones trees

49 Examples for misclassified images Books- misclassified into faces, faces, buildings Buildings- misclassified into faces, trees, trees Cars- misclassified into buildings, phones, phones

50 Bag of visual words summary Advantages: largely unaffected by position and orientation of object in image fixed length vector irrespective of number of detections very successful in classifying images according to the objects they contain Disadvantages: no explicit use of configuration of visual word positions poor at localizing objects within an image

51 Evaluation of image classification PASCAL VOC [05-10] datasets PASCAL VOC 2007 Training and test dataset available Used to report state-of-the-art results Collected January 2007 from Flickr images downloaded and random subset selected 20 classes Class labels per image + bounding boxes 5011 training images, 4952 test images Evaluation measure: average precision

52 PASCAL 2007 dataset

53 PASCAL 2007 dataset

54 Evaluation Metrics GT RM precision RM GT RM recall GT TP RM TP GT GT RM TP precision RM FP TP GT RM TP recall GT TP FN Ground Truth (GT) Results of Method (RM) True Positives (TP) True Negatives (TN) False Negatives (FN) False Positives (FP)

55 Evaluation

56 Results for PASCAL 2007 Winner of PASCAL 2007 [Marszalek et al.] : map 59.4 Combination of several different channels (dense + interest points, SIFT + color descriptors, spatial grids) Non-linear SVM with Gaussian kernel Multiple kernel learning [Yang et al. 2009] : map 62.2 Combination of several features Group-based MKL approach Combining object localization and classification [Harzallah et al. 09] : map Use detection results to improve classification

57 Spatial pyramid matching Add spatial information to the bag-of-features Perform matching in 2D image space [Lazebnik, Schmid & Ponce, CVPR 2006]

58 Related work Similar approaches: Subblock description [Szummer & Picard, 1997] SIFT [Lowe, 1999] GIST [Torralba et al., 2003] SIFT Gist Szummer & Picard (1997) Lowe (1999, 2004) Torralba et al. (2003)

59 Spatial pyramid representation Locally orderless representation at several levels of spatial resolution level 0

60 Spatial pyramid representation Locally orderless representation at several levels of spatial resolution level 0 level 1

61 Spatial pyramid representation Locally orderless representation at several levels of spatial resolution level 0 level 1 level 2

62 Spatial pyramid matching Combination of spatial levels with pyramid match kernel [Grauman & Darell 05] Intersect histograms, more weight to finer grids

63 Scene dataset [Labzenik et al. 06] Coast Forest Mountain Open country Highway Inside city Tall building Street Suburb Bedroom Kitchen Living room Office Store Industrial 4385 images 15 categories

64 Scene classification L Single-level Pyramid 0(1x1) 72.2±0.6 1(2x2) 77.9± ±0.5 2(4x4) 79.4± ±0.3 3(8x8) 77.2± ±0.3

65 Retrieval examples

66 Category classification CalTech101 L Single-level Pyramid 0(1x1) 41.2±1.2 1(2x2) 55.9± ±0.8 2(4x4) 63.6± ±0.8 3(8x8) 60.3± ±0.7

67 Discussion Summary Spatial pyramid representation: appearance of local image patches + coarse global position information Substantial improvement over bag of features Depends on the similarity of image layout Extensions Flexible, object-centered grid

68 Human Action Recognition

69 Weizmann Action Dataset 10 actions 9 actors per action Bend Jack Wave 2 hands Wave 1 hand Jump in place Sidestep Hop Walk Skip Run

70 KTH Data Set Six Categories, 25 actors, 4 instances, 600. clips Boxi ng Hand Clapping Hand Waving Joggi Runn Walk

71 UCF Sports Action Dataset 9 actions, 142 videos. Bench Swing Div e Swing Run Kick Lift Ride Golf Swing Skate

72 IXMAS Multi view Data Set 13 action categories, 4 camera views, 10 actors, 3 instances. View 1 View 2 View 3 View 4 Check Watch Get Up

73 UCF YouTube Action Dataset (UCF 11) Cycling Diving Golf Swinging Riding Juggling Basketball Shooting Swinging Tennis Swinging Volleyball Spiking Trampoline Jumping Walking Dog

74 UCF50 Baseball Pitch Basketball Shooting Bench Press Biking Billiards Breaststroke Clean and Jerk Diving Drumming Fencing Golf Swing High Jump Horse Race Horse Riding Hula Hoop Javelin Throw Juggling Balls Jumping Jack Jump Rope Kayaking Lunges Military Parade Nun chucks Mixing Batter Pizza Tossing

75 UCF50 Playing Guitar Playing Piano Playing Tabla Playing Violin Pole Vault Pommel Horse Pull Ups Punch Push Ups Rock Climbing Indoor Rope Climbing Rowing Salsa Spins Skate Boarding Skiing Ski jet Soccer Juggling Swing TaiChi Tennis Swing ThrowDiscus Trampoline Jumping Walking with a dog Volleyball Spiking Yo Yo

76 Feature Detector 3D cuboids extraction Feature Clustering Visual Word B Visual Word A Video-words histogram Visual vocabulary

77 Bag of Visual Words model (II) e.g.3d Harris, 3.D. SIFT e.g.hof, 3DHOG Feature Detector 3D cuboids extraction Feature Clustering Visual Word B Visual Word A Video-words histogram Visual vocabulary

78 Histogram of Optical Flow (HOF)

79 Optical flow Optical flow provides very useful motion information for Action Recognition

80 Histogram of Optical flow (HOF) Very similar to HOG Optical flow Speed Optical flow Direction

81 Histogram of Optical flow (HOF) Quantize the orientation into 9 bins (0-180) without Interpolation The vote is magnitude

82 HOF Steps Compute optical flow over the whole video Detect interest point locations or use densely sampled locations Space Time Interest points Densely sampled points

83 HOF Steps Extract space time volumes in the neighborhood of interest points. Further divide neighborhood into small cuboids. Compute HOF in each small cuboids Concatenate histograms of all cuboids to make HOF descriptor of an interest point.

84 HOF Steps Video Interest Point Neighborhood Cuboids Make 1D HOF vector..

Bag-of-features. Cordelia Schmid

Bag-of-features. Cordelia Schmid Bag-of-features for category classification Cordelia Schmid Visual search Particular objects and scenes, large databases Category recognition Image classification: assigning a class label to the image