Fast, Accurate Detection of 100,000 Object Classes on a Single Machine

1 Fast, Accurate Detection of 100,000 Object Classes on a Single Machine
Thomas Dean et al., Google, Mountain View, CA
CVPR 2013 Best Paper Award
Presented by: Zhenhua Wang

2 Outline Background This Work Experiments Conclusion

3 Outline Background Object Detection Overview Deformable Part Model This Work Experiments Conclusion

4 Object Detection

5 Detection Workflow

6 Progress of Research
Haar wavelet + linear SVM: Papageorgiou & Poggio, ICCV.
Haar wavelet + AdaBoost + cascade: Viola & Jones, CVPR 2001 (Longuet-Higgins Prize at CVPR 2011).
HOG + linear SVM: Dalal & Triggs, CVPR 2005.
HOG + DPM + latent SVM: Felzenszwalb et al., CVPR 2008 (lifetime achievement prize at PASCAL VOC 2010).

7 Outline Background Object Detection Overview Deformable Part Model HOG features Model Training This Work Experiments Conclusion

8 Outline Background Object Detection Overview Deformable Part Model HOG features Model Training This Work Experiments Conclusion

9 Histogram of Oriented Gradients (HOG) Features
The image is partitioned into 8x8-pixel cells.
In each cell we compute a histogram of gradient orientations.
The features are invariant to changes in lighting, small deformations, etc.
Features are computed at different resolutions (multi-scale).

10 HOG Filter
A filter $F$ is an array of weights for the features in a subwindow of the HOG pyramid.
It is learned by a linear SVM in the training stage.
In detection, the score is the dot product of the filter and the feature vector: $F \cdot \phi(H, p)$, where $\phi(H, p)$ is the concatenation of the HOG features from the subwindow specified by $p$.
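To make the dot-product score concrete, here is a minimal sketch in plain Python. The toy feature grid, the two-bin cells, and the helper names `window_features` and `filter_score` are illustrative assumptions, not part of the original slides.

```python
def window_features(hog, x, y, w, h):
    """Concatenate the cell vectors inside the w-by-h subwindow at (x, y)."""
    feats = []
    for row in range(y, y + h):
        for col in range(x, x + w):
            feats.extend(hog[row][col])
    return feats


def filter_score(filt, hog, x, y):
    """Score of a filter placed at (x, y): the dot product F . phi(H, p)."""
    h, w = len(filt), len(filt[0])
    phi = window_features(hog, x, y, w, h)
    flat = [v for row in filt for cell in row for v in cell]
    return sum(a * b for a, b in zip(flat, phi))


# Toy 3x3 grid of cells with 2 orientation bins per cell (assumed values).
hog = [[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
       [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]],
       [[1.0, 0.0], [0.0, 0.0], [3.0, 1.0]]]
filt = [[[1.0, -1.0], [0.5, 0.5]]]  # a filter covering 1 row x 2 columns of cells

# Slide the filter over every valid position to get a response map.
scores = [[filter_score(filt, hog, x, y) for x in range(2)] for y in range(3)]
print(scores)
```

Evaluating the filter at every position, as in the last lines, is the sliding-window step that produces one response map per pyramid level.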

11 Dalal & Triggs Detector: HOG + Linear SVM
Filter: $F$. A window $p$ is classified as object if $F \cdot \phi(H, p) > 0$ and as background if $F \cdot \phi(H, p) < 0$.
There is much more background than objects, so training starts with random negatives and repeats:
1) Train a model
2) Harvest false positives to define hard negatives

12 Outline Background Object Detection Overview Deformable Part Model HOG features Model Training This Work Experiments Conclusion

13 Deformable Part Model
A model consists of:
a root filter $F_0$;
$n$ deformable parts $P_i$, each with a filter $F_i$, an anchor position $v_i$, and a deformation cost weight $d_i$.

14 Object Hypothesis
A hypothesis gives the locations of the root filter and all part filters in the feature pyramid: $z = (p_0, p_1, \ldots, p_n)$, where $p_k = (x_k, y_k, l_k)$.
The multiscale model captures features at two resolutions.

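15 Score of a Hypothesis
To detect an object, we compute the score of a hypothesis as $\beta \cdot \Psi(H, z)$, where
$$\beta = (F_0, \ldots, F_n, d_1, \ldots, d_n)$$
$$\Psi(H, z) = (\phi(H, p_0), \ldots, \phi(H, p_n), -\phi_d(dx_1, dy_1), \ldots, -\phi_d(dx_n, dy_n))$$
Here $\phi_d(dx_i, dy_i)$ denotes the deformation features of part $i$'s displacement $(dx_i, dy_i)$ from its anchor, so the score is the sum of the filter responses minus the deformation costs.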

16 Matching
Sliding-window approach: each window position corresponds to a root position $p_0$.
Define an overall score for the root location of the current window, based on the best placement of the parts:
$$\text{overall score}(p_0) = \max_{p_1, \ldots, p_n} \text{score}(p_0, \ldots, p_n)$$
High-scoring root locations define detections.
Efficient computation: dynamic programming + generalized distance transforms.
Complexity drops from $O(nk^2)$ to $O(nk)$, where $n$ is the total number of parts and $k$ is the number of locations in the feature pyramid.

17 Efficient Computation of Score
Response of the $i$-th filter at the $l$-th pyramid level, computed in $O(k)$:
$$R_{i,l}(x, y) = F_i \cdot \phi(H, (x, y, l))$$
Transformed response, computed in $O(k)$:
$$D_{i,l}(x, y) = \max_{dx, dy} \left( R_{i,l}(x + dx, y + dy) - d_i \cdot \phi_d(dx, dy) \right)$$
Rewrite the overall score:
$$\text{overall score}(x, y, l) = R_{0,l}(x, y) + \sum_{i=1}^{n} D_{i,l-s}(2(x, y) + v_i)$$
where $v_i$ is the anchor position for part $i$ relative to the root position, and the parts are scored at level $l - s$, at twice the resolution of the root.
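The two formulas above can be sketched in a few lines of Python. This is a brute-force version for illustration only: it costs $O(k^2)$ per part, whereas the generalized distance transform mentioned on the Matching slide reaches $O(k)$. The quadratic deformation features $(dx^2, dy^2)$ and the toy response maps are assumptions made for this example.

```python
def transformed_response(R, d):
    """D(x, y) = max over (dx, dy) of R(x+dx, y+dy) - d . (dx^2, dy^2).

    Brute force over all displacements; the quadratic cost is an assumption.
    """
    rows, cols = len(R), len(R[0])
    D = [[float("-inf")] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            for yy in range(rows):
                for xx in range(cols):
                    dx, dy = xx - x, yy - y
                    cost = d[0] * dx * dx + d[1] * dy * dy
                    D[y][x] = max(D[y][x], R[yy][xx] - cost)
    return D


def overall_score(R0, parts, x, y):
    """Root response at (x, y) plus each part's D at 2*(x, y) + v_i."""
    total = R0[y][x]
    for D, (vx, vy) in parts:
        total += D[2 * y + vy][2 * x + vx]
    return total


R0 = [[0.5, 1.0]]                    # root responses, 1x2 grid
R1 = [[0.0, 0.0, 0.0, 4.0],
      [0.0, 0.0, 0.0, 0.0]]          # one part's responses at twice the resolution
D1 = transformed_response(R1, d=(1.0, 1.0))
score = overall_score(R0, [(D1, (0, 0))], x=1, y=0)
print(D1[0], score)
```

In this toy case the part's ideal location is one cell to the right of its anchor, so it contributes its response of 4 minus a deformation cost of 1.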

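18 For pyramid level $l$: the root response $R_{0,l}$ is combined with the transformed responses $D_{1,l-s}, \ldots, D_{n,l-s}$ of the part responses $R_{1,l-s}, \ldots, R_{n,l-s}$.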

19 Mixture of DPM
A mixture captures viewpoint variation.
It has several components, and each component has a root template + deformable parts.
The detection algorithm is run for each component independently.

20 Two component bicycle model

21 Outline Background Object Detection Overview Deformable Part Model HOG features Model Training This Work Experiments Conclusion

22 Training
Input: training images with labeled bounding boxes.
What to learn:
Model structures: #components, #parts, anchor locations, etc.
Model parameters: $\beta = (F_0, \ldots, F_n, d_1, \ldots, d_n)$

23 (Not) Learning Model Structures
Model structures are not learned; they are set by heuristics, cross-validation, and human insight (see Model Initialization for more details).

24 Learning Model Parameters
The training data is weakly labeled, so each example has several latent (unobserved) variables:
Part filter placement
Accurate location of the bounding box
Component label of each example

25 Recall the SVM Objective
$$\arg\min_{\beta} L_D(\beta) = \frac{1}{2}\|\beta\|^2 + C \sum_{i=1}^{n} \max(0, 1 - y_i(\beta \cdot \Psi(x_i)))$$
The second term is the hinge loss. The objective is convex:
$1 - y_i(\beta \cdot \Psi(x_i))$ is convex since it is a linear function of $\beta$.
The hinge loss is convex since the maximum of two convex functions is convex.
This objective is fully supervised. We'd like to extend it to handle latent variables: the latent SVM.

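26 Latent SVM
$$\arg\min_{\beta} L_D(\beta) = \frac{1}{2}\|\beta\|^2 + C \sum_{i=1}^{n} \max(0, 1 - y_i f_\beta(x_i))$$
where the second term is again the hinge loss and
$$f_\beta(x) = \max_{z \in Z(x)} \beta \cdot \Psi(x, z)$$
$\beta$ are the model parameters and $z$ are the latent values. For example, $f_\beta$ is the overall score and $z$ is the placement of the part filters.
This objective is not convex. Why?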
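As a small illustration of the latent SVM objective, the sketch below evaluates it on toy data. Representing each example as an explicit list of candidate feature vectors, one per latent value $z$, is an assumption made for the example; in a real detector the maximum over $z$ is found by the matching procedure rather than by enumeration.

```python
def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


def latent_score(beta, candidates):
    """f_beta(x): the best score over all latent values z of the example."""
    return max(dot(beta, psi) for psi in candidates)


def latent_svm_objective(beta, examples, C):
    """0.5 * ||beta||^2 + C * sum of hinge losses on the latent scores."""
    reg = 0.5 * dot(beta, beta)
    hinge = sum(max(0.0, 1.0 - y * latent_score(beta, cands))
                for cands, y in examples)
    return reg + C * hinge


beta = [1.0, -1.0]
examples = [
    ([[2.0, 0.0], [0.0, 1.0]], +1),   # positive: best candidate scores 2.0
    ([[0.0, 1.0], [0.5, 0.0]], -1),   # negative: best candidate scores 0.5
]
loss = latent_svm_objective(beta, examples, C=2.0)
print(loss)
```

The positive example is outside the margin and contributes no loss, while the negative example's best-scoring placement lands inside the margin and is penalized.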

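27 Semi-convexity
Property: the maximum of a set of convex functions is convex, e.g. $g(x) = \max(f_1(x), \ldots, f_n(x))$.
$f_\beta(x) = \max_{z \in Z(x)} \beta \cdot \Psi(x, z)$ is convex in $\beta$, since it is a maximum of functions that are linear in $\beta$.
Split the objective into negative examples $N$ ($y_i = -1$) and positive examples $P$ ($y_i = +1$):
$$\arg\min_{\beta} L_D(\beta) = \frac{1}{2}\|\beta\|^2 + C \sum_{i \in N} \max(0, 1 + f_\beta(x_i)) + C \sum_{i \in P} \max(0, 1 - f_\beta(x_i))$$
The hinge loss $\max(0, 1 - y_i f_\beta(x_i))$ is convex only for negative examples; for positive examples the term $-f_\beta(x_i)$ is concave. Thus the overall function is not convex.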

28 Latent SVM Training
$$\arg\min_{\beta} L_D(\beta) = \frac{1}{2}\|\beta\|^2 + C \sum_{i=1}^{n} \max(0, 1 - y_i f_\beta(x_i)), \quad f_\beta(x) = \max_{z \in Z(x)} \beta \cdot \Psi(x, z)$$
The objective is convex if we fix the latent values for the positive examples.
Optimization (local): initialize $\beta$ and iterate:
Relabel (determine the latent values of the positive examples): for the current $\beta$, pick the best $z$ for each positive example.
Optimize $\beta$: stochastic gradient descent with hard example mining.

29 Stochastic gradient descent

30 Stochastic gradient descent

31 Hard Example Mining
Hard examples: incorrectly classified or inside the margin.
Easy examples: correctly classified.

32 Hard Example Mining
$C_t$ is the cache of training examples at iteration $t$.
$D$ is the whole set of training examples.
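One way to picture the cache update is the sketch below. It assumes a plain linear score, defines a hard example as one with margin $y \cdot f_\beta(x) < 1$, and uses hypothetical helper names (`is_hard`, `update_cache`) and a `max_new` cap that are not from the slides. The training step between cache updates is omitted.

```python
def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


def is_hard(beta, example):
    """Hard: misclassified or inside the margin, i.e. y * score < 1."""
    x, y = example
    return y * dot(beta, x) < 1.0


def update_cache(beta, cache, dataset, max_new):
    """Drop easy examples from the cache, then add hard ones from the full set D."""
    kept = [i for i in cache if is_hard(beta, dataset[i])]
    kept_set = set(kept)
    new = [i for i in range(len(dataset))
           if i not in kept_set and is_hard(beta, dataset[i])]
    return kept + new[:max_new]


beta = [1.0, 0.0]
D = [([2.0, 0.0], +1),    # margin 2.0: easy
     ([0.5, 0.0], +1),    # margin 0.5: hard
     ([-3.0, 0.0], -1),   # margin 3.0: easy
     ([0.2, 1.0], -1),    # margin -0.2: hard (misclassified)
     ([-0.5, 0.0], -1)]   # margin 0.5: hard (inside the margin)
cache = update_cache(beta, cache=[0, 1], dataset=D, max_new=1)
print(cache)
```

Alternating a training pass on the cache with this update keeps the working set small while still covering the examples that currently violate the margin.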

33 Model Training Algorithm
Relabel (determine the latent values of the positive examples): $f_\beta(x) = \max_{z \in Z(x)} \beta \cdot \Psi(x, z)$.
Optimize $\beta$, mining hard examples among the negative examples and keeping caches of hard examples.
After the loop in the blue box, we get the best model $\beta$ under the latent values $z$ currently assigned to the positive examples.

36 Model Initialization
Phase 1: Initializing the root filters
Split the positive examples by aspect ratio into $m$ groups $P_k$.
The shape of the root filter $F^k_0$ for $P_k$ is set by the mean aspect ratio of the boxes in $P_k$ and the largest area that is not larger than 80% of the boxes in $P_k$.
Train $F^k_0$ for $P_k$ using a classical SVM.
Phase 2: Retraining the root filters as a mixture model
Train a mixture of models with no parts on the full dataset by latent SVM (latent variables: component label and root position).
Phase 3: Initializing the part models from the root filter
Part structure (number, shape, and anchor locations $v_i$): initialize 6 rectangles that cover the high-energy regions of the root filter by greedy search. Anchors lie on the central vertical axis or at positions symmetric about it.
Part filter $F^k_i$: interpolate the root filter to twice the spatial resolution.
Deformation parameter: $d_i = (1, 1)$

37 Outline Background This Work Observation WTA Hashing WTA with DPM Experiments Conclusion

38 Observation Time consuming!

39 Recent Approaches
Recent work considers various ways to reduce complexity:
Reduce #locations
Reduce #parts
Use hardware acceleration
But the overall complexity still scales with #classes.

40 Intuitive Idea
For each location, we want to directly obtain the correct part filters of the relevant class. How? Hash indexing.

41 Outline Background This Work Observation WTA Hashing WTA with DPM Experiments Conclusion

42 Hashing
Input vector $x$ (e.g. a flattened HOG window).
A hash function $h$ produces the hash code $h(x)$.
A locality-sensitive hash (LSH) gives similar hash codes for similar inputs.

43 Hash table

44 Winner-Takes-All Hashing
WTA provides a way to convert arbitrary feature vectors into compact binary codes [Yagnik et al., ICCV 2011].
It preserves rank correlation: the code is not sensitive to the absolute value of each dimension, only to the implicit ordering of the values (ordinal space).
The Hamming distance between WTA hashes approximates the rank correlation.
The dot product (correlation) is replaced by an ordinal dot product (Hamming distance).

45 Computation of WTA

46 Computation of WTA

47 Computation of WTA

48 Computation of WTA

49 Computation of WTA
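The figures for these slides are not reproduced here, so the following is a minimal sketch of WTA hashing as it is commonly described: apply a fixed random permutation to the vector, look at the first `k` permuted entries, and record the index of the largest. The function names, the seed, and the toy vector are assumptions for illustration.

```python
import random


def make_permutations(dim, n_hashes, seed=0):
    """Draw n_hashes fixed random permutations of range(dim)."""
    rng = random.Random(seed)
    perms = []
    for _ in range(n_hashes):
        p = list(range(dim))
        rng.shuffle(p)
        perms.append(p)
    return perms


def wta_hash(x, perms, k):
    """For each permutation, emit the index of the max among the first k entries."""
    code = []
    for p in perms:
        window = [x[j] for j in p[:k]]
        code.append(window.index(max(window)))
    return code


def matches(code_a, code_b):
    """Number of agreeing entries; higher means more similar orderings."""
    return sum(1 for u, v in zip(code_a, code_b) if u == v)


x = [0.3, 2.0, 1.1, 0.7, 5.0, 0.1]
scaled = [3.0 * v + 10.0 for v in x]        # same ordering, different magnitudes
perms = make_permutations(dim=len(x), n_hashes=8, seed=0)
code_a = wta_hash(x, perms, k=3)
code_b = wta_hash(scaled, perms, k=3)
print(matches(code_a, code_b))               # all 8 entries agree
```

Because only the ordering within each window matters, any strictly increasing transformation of the input leaves the code unchanged, which is the rank-correlation property stated on the WTA slide.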

50 Outline Background This Work Observation WTA Hashing WTA with DPM Experiments Conclusion

51 Key Idea
For positive examples of a class, the score of part filter $P_i$ of this class will be high at some pyramid level $l$:
$$\text{score}_l(P_i) = F_i \cdot \phi(H, (x, y, l))$$
$F_i$ and $\phi(H, (x, y, l))$ are highly correlated, implying that their rank correlation is high, so they have similar WTA hash codes.
We can use WTA to build a hash table of part filters after training, and use HOG feature windows to retrieve the correct part filters at detection.
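The retrieval idea can be sketched as a small hash table keyed on pieces of the WTA code. The banding scheme (splitting each code into fixed-size bands and voting per band), the helper names, and the toy filters are assumptions for this example, not a description of the authors' implementation.

```python
import random
from collections import defaultdict


def make_permutations(dim, n_hashes, seed=0):
    rng = random.Random(seed)
    perms = []
    for _ in range(n_hashes):
        p = list(range(dim))
        rng.shuffle(p)
        perms.append(p)
    return perms


def wta_hash(x, perms, k):
    code = []
    for p in perms:
        window = [x[j] for j in p[:k]]
        code.append(window.index(max(window)))
    return code


def bands(code, band_size):
    """Split a WTA code into consecutive bands used as hash-table keys."""
    return [tuple(code[i:i + band_size]) for i in range(0, len(code), band_size)]


def build_table(filters, perms, k, band_size):
    """After training: index every part filter under each of its band keys."""
    table = defaultdict(list)
    for fid, filt in filters.items():
        for b, key in enumerate(bands(wta_hash(filt, perms, k), band_size)):
            table[(b, key)].append(fid)
    return table


def vote(window, table, perms, k, band_size):
    """At detection: hash a feature window and count band matches per filter."""
    votes = defaultdict(int)
    for b, key in enumerate(bands(wta_hash(window, perms, k), band_size)):
        for fid in table.get((b, key), []):
            votes[fid] += 1
    return dict(votes)


part = [0.3, 2.0, 1.1, 0.7, 5.0, 0.1, 4.2, 3.3]
filters = {0: [-v for v in part], 1: part}   # filter 0 has the reverse ordering
perms = make_permutations(dim=8, n_hashes=12, seed=0)
table = build_table(filters, perms, k=4, band_size=3)
window = [2.0 * v + 1.0 for v in part]       # window strongly correlated with filter 1
votes = vote(window, table, perms, k=4, band_size=3)
print(votes)
```

Only filters that collect enough votes would then need an exact score, so the work per window depends on how many filters collide with it rather than on the total number of classes.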

52 WTA with DPM

53 Outline Background This Work Experiments Conclusion

54 Setups

55 PASCAL VOC 2007

56 Accuracy vs. Memory and Time

57 Accuracy vs. Number of Classes

58 100k Object Classes

59 100K Human Evaluation

60 Outline Background This Work Experiments Conclusion

61 Conclusion
Contributions
A scalable approach that replaces the dot product with LSH.
Demonstrates scaling of DPM to 100k object classes.
Applicable to a variety of recognition methods that use dot products.
Limitations
No root filter, which causes a loss in detection performance.
All part filters must be the same size.

62 Thanks!
