Classification Experiments
January 27, 2015
CS3710: Visual Recognition
Bhavin Modi
Bag of Features: An Object as a Bag of "Words"
Bag of Features: Outline
1. Extract features
2. Learn visual vocabulary
3. Quantize features using the visual vocabulary
4. Represent images by frequencies of visual words
Slide Credits: Li Fei-Fei
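Steps 2-4 above are compact enough to sketch directly. The following is a minimal illustrative sketch, not the code used in these experiments; it assumes SIFT descriptors have already been extracted, and uses scikit-learn's KMeans as a stand-in for whatever clustering the original pipeline used:

```python
# Minimal bag-of-features sketch: learn a k-means visual vocabulary
# (step 2), then represent an image by quantizing its descriptors
# (step 3) and counting visual-word frequencies (step 4).
import numpy as np
from sklearn.cluster import KMeans

def learn_vocabulary(train_descriptors, vocab_size=200, seed=0):
    """train_descriptors: (N, 128) array of SIFT descriptors."""
    return KMeans(n_clusters=vocab_size, random_state=seed,
                  n_init=10).fit(train_descriptors)

def bof_histogram(vocabulary, image_descriptors):
    """One image -> normalized histogram of visual-word frequencies."""
    words = vocabulary.predict(image_descriptors)
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)
```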
Bag of Features: Summary
What about Spatial Information? Slide Credits: Cordelia Schmid
Beyond Bag of features Slide Credits: Li Fei-Fei
Spatial Pyramid Matching
Image Representation Slide Credits: Li Fei-Fei
Kernel Function
Histogram intersection function: \( \mathcal{I}(H_X, H_Y) = \sum_{i=1}^{D} \min\big(H_X(i),\, H_Y(i)\big) \)
Pyramid match kernel (writing \( \mathcal{I}^l \) for the intersection of the level-\(l\) histograms):
\[ \kappa^L(X, Y) = \mathcal{I}^L + \sum_{l=0}^{L-1} \frac{1}{2^{L-l}} \left( \mathcal{I}^l - \mathcal{I}^{l+1} \right) = \frac{1}{2^L}\, \mathcal{I}^0 + \sum_{l=1}^{L} \frac{1}{2^{L-l+1}}\, \mathcal{I}^l \]
The final kernel is the sum of the separate channels:
\[ K^L(X, Y) = \sum_{m=1}^{M} \kappa^L(X_m, Y_m) \]
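These formulas translate almost line-for-line into code. A sketch, assuming each image is given as a list hists[0..L] of its concatenated per-level histograms, normalized as in the paper:

```python
# Histogram intersection and the weighted pyramid match kernel of
# Lazebnik et al.; hists_x[l] is the concatenated histogram of all
# cells at level l (so hists_x[0] is the plain bag-of-features).
import numpy as np

def intersection(h1, h2):
    """I(H_X, H_Y) = sum_i min(H_X(i), H_Y(i))."""
    return np.minimum(h1, h2).sum()

def pyramid_match(hists_x, hists_y):
    """kappa^L = (1/2^L) I^0 + sum_{l=1}^{L} (1/2^(L-l+1)) I^l."""
    L = len(hists_x) - 1
    k = intersection(hists_x[0], hists_y[0]) / 2.0 ** L
    for l in range(1, L + 1):
        k += intersection(hists_x[l], hists_y[l]) / 2.0 ** (L - l + 1)
    return k
```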
Spatial Pyramid Vector Dimensions
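The dimensions behind this slide follow directly from the pyramid construction in Lazebnik et al. [1]: each of the M channels contributes one bin per cell, with \(4^l\) cells at level \(l\), so the full spatial pyramid vector has

\[ D = M \sum_{l=0}^{L} 4^{l} = M\, \frac{4^{L+1} - 1}{3} \]

dimensions, e.g. \(200 \times (1 + 4 + 16) = 4200\) for M=200, L=2, and \(400 \times 85 = 34000\) for M=400, L=3.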
Weakness of the model
Experiments Conducted
Datasets used (3): 15 Scene, Caltech-101, and Graz.
Strong features: SIFT descriptors of 16x16 pixel patches computed over a grid with a spacing of 8 pixels (see the extraction sketch below).
Weak features: oriented edge points, i.e., points whose gradient magnitude in a given direction exceeds a minimum threshold.
Dictionary size and pyramid levels are tested at different values: M = 200, 400 and L = 0, 1, 2, 3 (not in all cases).
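For concreteness, the dense strong-feature extraction might look like the sketch below, with OpenCV's SIFT standing in for the descriptor implementation actually used in the paper:

```python
# Dense SIFT over a regular grid: 16x16 patches, 8-pixel spacing,
# mirroring the paper's strong-feature settings.
import cv2

def dense_sift(gray_image, step=8, patch_size=16):
    """gray_image: single-channel uint8 array; returns (n, 128) descriptors."""
    h, w = gray_image.shape
    half = patch_size // 2
    keypoints = [cv2.KeyPoint(float(x), float(y), float(patch_size))
                 for y in range(half, h - half, step)
                 for x in range(half, w - half, step)]
    _, descriptors = cv2.SIFT_create().compute(gray_image, keypoints)
    return descriptors
```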
15 Scene
One of the most complete scene-category datasets at the time; each category has 200 to 400 images.
Conclusions:
- Using all levels together confers a statistically significant benefit.
- For strong features, single-level performance drops from L=2 to L=3, while weak features improve.
- Performance at L=2 and L=3 is almost equivalent, and moving from M=200 to M=400 gives only a very small increase.
- Performs better with 13 classes (74.7%) than with 15 classes (72.2%) at L=0.
Caltech-101 Has geometric stability and lack of clutter. Contains 31 to 800 images per category. Slide Credits: Cordelia Schmid
Caltech-101 Conclusions:
- Prone to intra-class variation.
- Results shown for M=200; M=400 shows no significant improvement.
- Best performance: 64.6% with L=2, M=200, and strong features.
- The best classification rate was 72.2% for 15 Scene, versus 64.6% for Caltech-101.
Graz Dataset
Two object categories, Bikes and People, with heavy clutter and pose changes. M=200; L=0 and L=2 with strong features.
Conclusions:
- The improvement from L=0 to L=2 is relatively small, since it is difficult to find useful global features here.
- Performance, at 86.3%, is higher than on 15 Scene and Caltech-101.
New Experiments Conducted
1. Used the Caltech-256 dataset (256 categories) to check whether performance decreases as the number of classes increases.
2. Varied the dictionary size M to see the effect on accuracy. Values used: M = 10, 50, and 200 (200 is said to be optimal).
Control parameters present (defaults shown, collected as a config sketch below):
- Image size = 1000
- Grid spacing = 8
- Patch size = 16
- Dictionary size = 200
- Number of texton images = 50
- Pyramid levels = 3
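Collected as one configuration object, the defaults above might look like this (the names are illustrative placeholders, not any particular library's API):

```python
# Default control parameters for the Caltech-256 experiments, as
# listed on this slide; key names are hypothetical.
DEFAULT_PARAMS = {
    "max_image_size": 1000,    # longest image side, in pixels
    "grid_spacing": 8,         # pixels between dense-feature samples
    "patch_size": 16,          # descriptor patch side, in pixels
    "dictionary_size": 200,    # M, number of visual words
    "num_texton_images": 50,   # images used to build the vocabulary
    "pyramid_levels": 3,       # levels 0, 1, 2
}
```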
Why Caltech-256?
Caltech-101 weaknesses:
- The dataset is too clean: images are very uniform in presentation, aligned from left to right, and usually not occluded.
- Limited number of categories.
- Some categories contain few images and are not represented as well as others, with as few as 31 images; for example, binocular (33) and wild cat (34).
Caltech-256 is another image dataset, created at the California Institute of Technology in 2007 as a successor to Caltech-101 and intended to address some of the weaknesses inherent in it.
Slide Credits: Vision.Caltech.edu
Results
Experiment 1: Caltech-256 dataset, varying numbers of categories considered; training images = 30 per category; test images = 50 per category; L = 3 levels (0, 1, 2); M = 200.
Experiment 2: same as above, but with all 256 categories and 1. M = 10, 2. M = 50, 3. M = 200.
A sketch of the classification step follows.
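The classification step can be sketched with a precomputed kernel; scikit-learn's SVC wraps LIBSVM [5], which the experiments used directly, so this is an approximation of the setup rather than the exact code:

```python
# Train a one-vs-one SVM on a precomputed spatial-pyramid kernel and
# report test accuracy. K_train is (n_train, n_train); K_test is
# (n_test, n_train), i.e. kernel values against the training images.
import numpy as np
from sklearn.svm import SVC

def train_and_test(K_train, y_train, K_test, y_test):
    clf = SVC(kernel="precomputed")
    clf.fit(K_train, y_train)
    return float((clf.predict(K_test) == y_test).mean())
```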
[Plot: accuracy (%) vs. number of Caltech-256 categories]

Categories   Accuracy %
10           13.2
50           3.64
100          1.64
160          1.125
256          1.094
[Plot: accuracy (%) vs. dictionary size M, all 256 categories]

M     Accuracy %
10    0.3125
50    0.7109
200   1.094
Problems
As we can see, the accuracy is very low, which leads us to believe there is some error in the implementation, so we try to find the cause through three debugging steps.
All debugging is done on the Caltech-256 dataset, for 100 categories, M=200, L=3, 30 training images per category, and 50 test images per category.
Accuracy on test set = 1.64% (82/5000)
Accuracy on train set = 87.1667% (2615/3000)
1. Compute the big kernel
2. Use the built-in linear and RBF kernels
3. Calculate mean kernel values
Debugging Results
1. Calculating the big kernel: accuracy = 1.64%, no change.
2. Using a linear or RBF kernel on the test data and doing a sanity check on the training data:

                 Train Set   Test Set
  Linear Kernel  8.4%        0.92%
  RBF Kernel     8.267%      1%

3. Calculating the mean K(sample, other samples from the same class) and the mean K(sample, samples from different classes), for both the train and test kernels:

                        Train Set   Test Set
  Mean K, Same Class    0.5267      0.5335
  Mean K, Diff. Class   0.5270      1.1923
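Debugging step 3 can be expressed as a small diagnostic; a sketch, taking whatever kernel matrix and labels the experiment produced:

```python
# Compare the mean kernel value over same-class pairs with the mean
# over different-class pairs; a useful kernel should score same-class
# pairs clearly higher. (Self-similarity terms K(i, i) are included
# on the train kernel for simplicity.)
import numpy as np

def mean_kernel_by_class(K, labels_row, labels_col):
    """K[i, j] = kernel(row sample i, column sample j)."""
    same = labels_row[:, None] == labels_col[None, :]
    return K[same].mean(), K[~same].mean()
```

On the numbers above, the train-set means (0.5267 vs. 0.5270) are nearly identical, which is consistent with a kernel that carries almost no class information.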
Debugging II
We check the predicted labels on the test set to see which categories were assigned to the majority of the images. We see that categories 6 (basketball-hoop) and 59 (drinking-straw) together have more than 1000 test images assigned to them.
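This check is a one-liner over the predicted labels; a sketch:

```python
# Count how many test images land in each predicted category and
# report the most frequent ones (debugging II as code).
import numpy as np

def top_predicted_categories(predicted_labels, num_classes=256, top=5):
    counts = np.bincount(predicted_labels, minlength=num_classes)
    order = np.argsort(counts)[::-1][:top]
    return list(zip(order.tolist(), counts[order].tolist()))
```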
Evaluation on Other Datasets Slide Credits: Cordelia Schmid
Summary Discussion
- Spatial pyramid representation: appearance of local image patches + coarse global position information.
- Substantial improvement over bag of features.
- Performance depends on the similarity of image layout.
Follow-Up Work by Others
Packing more information into the pyramid:
1. Bosch et al. (2007) used the PHOW and PHOG descriptors.
2. van Gemert et al. (2008), Kernel Codebook: instead of a hard assignment, where a bin gets 1 only if descriptor r_i is nearest to its centroid w, a Gaussian kernel is placed over every centroid w, so every descriptor contributes some information to every bin (depending on σ). A minimal sketch of this follows.
3. Shengye Yan et al. (2012), Beyond Spatial Pyramid: a two-level feature extraction method that applies encoding and pooling procedures to window-based features to acquire new image features.
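A minimal sketch of the kernel-codebook soft assignment in item 2, under the assumption of a single Gaussian width σ shared by all centroids (van Gemert et al. describe several variants):

```python
# Kernel-codebook histogram: instead of adding 1 to the bin of the
# nearest centroid, each descriptor spreads a Gaussian-weighted vote
# across every bin, controlled by sigma.
import numpy as np

def soft_assign_histogram(descriptors, centroids, sigma=100.0):
    """descriptors: (N, d); centroids: (M, d); returns (M,) histogram."""
    d2 = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    weights = np.exp(-d2 / (2.0 * sigma ** 2))
    weights /= weights.sum(axis=1, keepdims=True)  # normalize per descriptor
    return weights.sum(axis=0)
```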
References
1. Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories.
2. Li Fei-Fei (Princeton). Part 1: Bag-of-Words Models (slides).
3. Costantino Grana and Giuseppe Serra. Recent Advancements on the Bag of Visual Words Model for Image Classification and Concept Detection.
4. Cordelia Schmid (INRIA). Bag-of-Features for Category Classification (slides).
5. Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1-27:27, 2011.
6. Caltech-256 Dataset: www.vision.caltech.edu
Thank You