Multi-View Visual Recognition of Imperfect Testing Data

1 Multi-View Visual Recognition of Imperfect Testing Data
MM'15, October 26-30, 2015, Brisbane, Australia
Qilin Zhang, Stevens Institute of Technology, 1 Castle Point Terrace, Hoboken, NJ, USA
Gang Hua, Microsoft Research Asia, No. 5 Danling Street, Haidian District, Beijing, China

2 Multi-View Sensor & Data Contamination
Multi-modality cameras capture multi-view data:
  1st view: regular RGB image
  2nd view: infrared (IR)/depth (distance) image
Multi-view learning with test data contamination (RGB+Depth and RGB+IR images):
  Data loss due to interference, e.g., sunlight interference
  Data loss due to faulty transmission, e.g., distortion or bandwidth limits
  Backlog of single-view history data
Figure: saturated IR image due to direct sunlight

3 Related Problems
Multi-view learning: SVM-2K [Farquhar et al., 05], Convex MTFL [Argyriou et al., 07]
Transfer learning: ITML [Davis et al., 07], visual category adaptation [Saenko et al., 10]
Learning with privileged/side information: SVM+ [Vapnik, 09]
Diagram labels: Multisensory Training, Multisensory Testing; Source Domain, Domain Transfer, Target Domain; Training Data

4 Problem Comparison
Classical multi-view learning: paired, labeled training data; paired testing data
Multi-view learning with missing data: paired, labeled training data; single-view testing data
Legend: labeled training image, unlabeled test image, RGB-IR pairing

5 Challenges
Train-test mismatch: RGB+Depth training images, but RGB-only testing images
Conventional solution: trim the training data and disregard the 2nd view
Drawback: loss of discriminative information

6 Resolve Train-test Mismatch
Diagram labels: 1st View, Latent Space Model, 2nd View
Projections filter noise out of the common semantics.
In the common latent space, view-wise differences are eliminated.
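The latent-space idea on this slide can be illustrated with plain regularized linear CCA. This is only a minimal sketch on synthetic data, not the SLCCA method proposed in the talk; the function name `fit_linear_cca`, the regularizer `reg`, and the toy data are assumptions introduced here for illustration. It shows that once both projections are learned from paired training data, a test sample with only the 1st view can still be mapped into the common space.

```python
import numpy as np

def fit_linear_cca(X1, X2, dim, reg=1e-3):
    """Regularized linear CCA (illustrative helper, not from the slides).

    Returns (mean, projection) for each view and the top canonical correlations.
    """
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    Z1, Z2 = X1 - m1, X2 - m2
    n = X1.shape[0]
    S11 = Z1.T @ Z1 / n + reg * np.eye(Z1.shape[1])
    S22 = Z2.T @ Z2 / n + reg * np.eye(Z2.shape[1])
    S12 = Z1.T @ Z2 / n

    def inv_sqrt(S):
        # Inverse matrix square root of a symmetric positive definite matrix.
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T

    W1, W2 = inv_sqrt(S11), inv_sqrt(S22)
    U, s, Vt = np.linalg.svd(W1 @ S12 @ W2)
    A1 = W1 @ U[:, :dim]       # projection for view 1
    A2 = W2 @ Vt.T[:, :dim]    # projection for view 2
    return (m1, A1), (m2, A2), s[:dim]

# Synthetic paired views that share a 2-D latent signal z (toy data).
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 2))
X1 = z @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(500, 5))
X2 = z @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(500, 4))

# Train on paired data; at test time only the first view is used.
(m1, A1), (m2, A2), corr = fit_linear_cca(X1[:400], X2[:400], dim=2)
P1_test = (X1[400:] - m1) @ A1   # single-view test samples in the latent space
P2_test = (X2[400:] - m2) @ A2   # held-out second view, only used for checking
```

A classifier trained on latent projections of the training data could then be applied to `P1_test` directly, which is the point of resolving the mismatch in a common space.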

7 Similarity Learning: CCA overview

8 Differences: Explicitly incorporating labels

9 Similarity Criterion
Inner products in a normalized latent space (hypersphere); diagram points: a, b, c.
If points a and b share the same class label, $y_i = y_j$, their similarity should be high, i.e., the angle formed should be small: $\langle a, b \rangle \ge c_{ij}$. A large inner product ensures small distance and high similarity.
If points b and c differ in class labels, $y_i \ne y_j$, their similarity should be low, i.e., the angle formed should be large: $\langle b, c \rangle \le c_{ij}$. A small inner product ensures large distance and low similarity.
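The link between inner product and distance on the unit hypersphere follows from the identity ||u - v||^2 = 2 - 2<u, v> for unit vectors. A small sketch, with hypothetical 2-D points and a hypothetical threshold of 0.5 standing in for $c_{ij}$:

```python
import numpy as np

def normalize(x):
    """Scale a vector to unit length so it lies on the hypersphere."""
    return x / np.linalg.norm(x)

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return float(np.sum((u - v) ** 2))

# Hypothetical latent points: a and b share a class, c belongs to another.
a = normalize(np.array([1.0, 0.2]))
b = normalize(np.array([0.9, 0.4]))
c = normalize(np.array([-0.3, 1.0]))
threshold = 0.5  # assumed stand-in for c_ij

sim_ab = float(a @ b)  # large inner product -> small angle -> small distance
sim_bc = float(b @ c)  # small inner product -> large angle -> large distance

# For unit vectors: ||u - v||^2 = 2 - 2 <u, v>.
same_class_ok = sim_ab >= threshold
diff_class_ok = sim_bc <= threshold
```

Because of the identity, thresholding the inner product is equivalent to thresholding the Euclidean distance between normalized points.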

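10 Similarity Constraints
Inner product of $x_i^{(v)}$ and $x_j^{(v')}$ in the latent space:
$$R^{(v,v')}_{i,j} = \kappa_i^{(v)T} A^{(v)} A^{(v')T} \kappa_j^{(v')}$$
$A^{(v)}$: $v$-th view projection matrix
$K^{(v)} = [\kappa_1^{(v)}, \ldots, \kappa_n^{(v)}]$: $v$-th view Gram matrix
Thresholding based on similarity constraints:
$$R^{(v,v')}_{i,j} \ge c_{ij} \text{ if } y_i = y_j, \qquad R^{(v,v')}_{i,j} \le c_{ij} \text{ if } y_i \ne y_j$$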
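As a sketch of how these quantities could be evaluated numerically, the helper below computes the full matrix of latent inner products from two Gram matrices and two projection matrices, then flags the pairs that break the thresholds. The function names, the scalar threshold `c`, and the inequality directions (at least `c` for same-class pairs, at most `c` otherwise) are assumptions for illustration.

```python
import numpy as np

def latent_similarity(K_v, A_v, K_w, A_w):
    """R[i, j] = kappa_i^(v)T A^(v) A^(w)T kappa_j^(w).

    kappa_i^(v) is the i-th column of the Gram matrix K_v.
    """
    return (K_v.T @ A_v) @ (K_w.T @ A_w).T

def violated_constraints(R, y, c):
    """Boolean mask of pairs (i, j) whose similarity breaks its threshold."""
    same = y[:, None] == y[None, :]
    return (same & (R < c)) | (~same & (R > c))

# Tiny hand-checkable example: identity Gram matrices, 1-D latent space.
K = np.eye(3)
A = np.array([[1.0], [1.0], [-1.0]])
R = latent_similarity(K, A, K, A)   # equals A @ A.T here

labels_consistent = np.array([0, 0, 1])
labels_conflicting = np.array([0, 1, 1])
n_bad_consistent = int(violated_constraints(R, labels_consistent, 0.0).sum())
n_bad_conflicting = int(violated_constraints(R, labels_conflicting, 0.0).sum())
```

With the first labeling every pair satisfies its threshold; with the second, samples 0 and 1 are too similar and samples 1 and 2 too dissimilar, which gives four violated entries in the symmetric mask.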

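11 Ideal Optimization Target
$$\max_{G} \; \operatorname{tr}\left(K^{(1)} K^{(2)} L_2 G G^T L_1^T\right)$$
subject to the explicitly enforced similarity constraints
$$\operatorname{tr}\left(\kappa_i^{(1)T} L_1 G G^T L_2^T \kappa_j^{(2)}\right) \ge c_{ij} \text{ if } y_i = y_j, \quad \le c_{ij} \text{ if } y_i \ne y_j$$
$$\operatorname{tr}\left(\kappa_i^{(1)T} L_1 G G^T L_1^T \kappa_j^{(1)}\right) \ge c_{ij} \text{ if } y_i = y_j, \quad \le c_{ij} \text{ if } y_i \ne y_j$$
$$\operatorname{tr}\left(\kappa_i^{(2)T} L_2 G G^T L_2^T \kappa_j^{(2)}\right) \ge c_{ij} \text{ if } y_i = y_j, \quad \le c_{ij} \text{ if } y_i \ne y_j$$
and the normalization
$$G^T \Gamma_K G = I$$
where $L_1 = [I_n, 0_n]$, $L_2 = [0_n, I_n]$, $G = [A^{(1)T}, A^{(2)T}]^T$, and
$$\Gamma_K = \begin{bmatrix} K^{(1)} K^{(1)T} + \lambda I & 0 \\ 0 & K^{(2)} K^{(2)T} + \lambda I \end{bmatrix}$$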
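To see why the stacked variable $G$ reproduces the per-view quantities, it may help to expand the selector matrices. The following is a short derivation sketch, assuming $L_1 = [I_n, 0_n]$, $L_2 = [0_n, I_n]$ and $G = [A^{(1)T}, A^{(2)T}]^T$ as defined on this slide, and using only the cyclic property of the trace.

```latex
\begin{aligned}
L_1 G &= A^{(1)}, \qquad L_2 G = A^{(2)}, \\
\operatorname{tr}\!\left(K^{(1)} K^{(2)} L_2 G G^T L_1^T\right)
  &= \operatorname{tr}\!\left(K^{(1)} K^{(2)} A^{(2)} A^{(1)T}\right)
   = \operatorname{tr}\!\left(A^{(1)T} K^{(1)} K^{(2)} A^{(2)}\right), \\
\operatorname{tr}\!\left(\kappa_i^{(1)T} L_1 G G^T L_2^T \kappa_j^{(2)}\right)
  &= \kappa_i^{(1)T} A^{(1)} A^{(2)T} \kappa_j^{(2)}, \\
G^T \Gamma_K G
  &= A^{(1)T}\!\left(K^{(1)} K^{(1)T} + \lambda I\right) A^{(1)}
   + A^{(2)T}\!\left(K^{(2)} K^{(2)T} + \lambda I\right) A^{(2)}.
\end{aligned}
```

The second line is the familiar kernel CCA cross-view objective, and the third line is the latent inner product between a 1st-view sample and a 2nd-view sample, so the constraints act directly on the similarities defined on the previous slide.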

12 Solution
The generic Quadratically Constrained Quadratic Program (QCQP) problem is NP-hard.
Relaxation: alternating optimization between a linear programming step and an eigen-decomposition step.
Diagram of alternating optimization: with $G$ fixed, optimize w.r.t. the augmented variable $M_G$; with $M_G$ fixed, optimize w.r.t. the original variable $G$; repeat.

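13 Alternating Solution
With $G$ fixed, optimize w.r.t. the augmented variable $M_G$:
$$\max_{M_G} \; \operatorname{tr}\left(K^{(1)} K^{(2)} L_2 M_G L_1^T\right) + \mu \operatorname{tr}\left(G^T M_G G\right)$$
subject to
$$\operatorname{tr}\left(\kappa_i^{(1)T} L_1 M_G L_2^T \kappa_j^{(2)}\right) \ge c_{ij} \text{ if } y_i = y_j, \quad \le c_{ij} \text{ if } y_i \ne y_j$$
$$\operatorname{tr}\left(\kappa_i^{(1)T} L_1 M_G L_1^T \kappa_j^{(1)}\right) \ge c_{ij} \text{ if } y_i = y_j, \quad \le c_{ij} \text{ if } y_i \ne y_j$$
$$\operatorname{tr}\left(\kappa_i^{(2)T} L_2 M_G L_2^T \kappa_j^{(2)}\right) \ge c_{ij} \text{ if } y_i = y_j, \quad \le c_{ij} \text{ if } y_i \ne y_j$$
$$\operatorname{tr}\left(\Gamma_K M_G\right) = d, \qquad M_G = M_G^T, \qquad G^T M_G G = I$$
With $M_G$ fixed, optimize w.r.t. the original variable $G$:
$$\max_{G} \; \operatorname{tr}\left(G^T M_G G\right) \quad \text{s.t.} \quad G^T G = I$$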
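The second subproblem, maximizing tr(G^T M_G G) subject to G^T G = I, has a standard closed-form solution for symmetric M_G: the columns of G are the eigenvectors of the d largest eigenvalues. The sketch below covers only this eigen-decomposition step; the linear programming step is not implemented, and the function name `update_G` and the random test matrix are assumptions for illustration.

```python
import numpy as np

def update_G(M, d):
    """Maximize tr(G^T M G) subject to G^T G = I for a symmetric matrix M.

    The maximizer is formed by eigenvectors of the d largest eigenvalues.
    """
    M = (M + M.T) / 2            # guard against small asymmetries
    w, V = np.linalg.eigh(M)     # eigenvalues in ascending order
    return V[:, ::-1][:, :d]     # reorder so the largest come first

# Toy symmetric matrix standing in for the current M_G.
rng = np.random.default_rng(1)
B = rng.normal(size=(6, 6))
M_G = B + B.T
G = update_G(M_G, d=2)
objective = float(np.trace(G.T @ M_G @ G))
```

The attained objective equals the sum of the two largest eigenvalues of the toy matrix, and no other matrix with orthonormal columns can exceed it.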

14 Experimental Settings
Tasks:
  Missing 2nd view recognition
  Missing randomly recognition, with equal chance of missing either view
  Cross-modal verification w.r.t. expression, gender, race
Competing algorithms:
  SVM: disregards 2nd view data
  SVM-2K*: [J. Farquhar et al., 05]
  KCCA: standard kernel CCA
  RGCCA: [A. Tenenhaus et al., 11]
  DCCA: [Q. Zhang et al., 14]
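The "missing randomly" protocol can be simulated as follows. This is a sketch under one reading of the slide, namely that every test sample loses exactly one view and that each view is the lost one with probability 1/2; the function name `drop_view_randomly` and the use of NaN to mark a missing view are assumptions for illustration.

```python
import numpy as np

def drop_view_randomly(X1, X2, rng):
    """Hide exactly one view per test sample, each view with probability 1/2.

    Missing views are marked with NaN rows; returns the masked copies and
    a boolean array telling which samples lost the first view.
    """
    n = X1.shape[0]
    drop_first = rng.random(n) < 0.5
    X1m = X1.astype(float).copy()
    X2m = X2.astype(float).copy()
    X1m[drop_first] = np.nan     # first view missing for these samples
    X2m[~drop_first] = np.nan    # second view missing for the others
    return X1m, X2m, drop_first

# Toy test set: 10 samples, 3-D first view and 2-D second view.
rng = np.random.default_rng(0)
X1_test = np.ones((10, 3))
X2_test = np.ones((10, 2))
X1_masked, X2_masked, drop_first = drop_view_randomly(X1_test, X2_test, rng)
```

The "missing 2nd view" setting corresponds to the special case in which `drop_first` is false for every sample.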

15 UW RGBD Instance Recognition
RGBD Object Dataset [K. Lai et al., 11]
RGB and depth images captured by a Kinect sensor, from various viewing angles
Common household objects of 51 categories
Several instances for each category
Tasks: instance-level classification with missing 2nd view or missing randomly

16 UW RGBD Instance Recognition
Figure: Multi-view RGBD Object Instance Recognition
For both case (a) Missing 2nd View and case (b) Missing Randomly, the accuracies of the SVM baseline are plotted at the bottom, while the accuracy differences of the remaining algorithms relative to SVM (each algorithm's accuracy minus the SVM accuracy) are plotted at the top. The average accuracies across all 51 categories for both cases are shown furthest to the right.

17 NYU Indoor Scenes Recognition
NYU Depth V1 Indoor Scenes Dataset [N. Silberman, R. Fergus, 11]
Indoor scenes captured by Kinect
7 scene types such as bedroom, kitchen, living room, etc.
1st view: grayscale/RGB image; 2nd view: depth map
Features: GIST/SIFT, Spatial Pyramid

Settings    GIST, Missing 2nd View          GIST, Missing Randomly
Views       L+D            RGB+D            L+D            RGB+D
SVM         59.61 (3.42)   61.33 (3.32)     42.17 (4.72)   43.43 (5.87)
SVM2K       58.26 (3.71)   60.92 (3.80)     n/a (4.33)     n/a (4.82)
KCCA        n/a (6.92)     n/a (6.28)       n/a (5.44)     n/a (4.32)
RGCCA       n/a (5.82)     n/a (4.85)       n/a (4.52)     n/a (5.11)
DCCA        n/a (4.23)     n/a (4.37)       n/a (6.10)     n/a (4.08)
SLCCA       n/a (6.89)     n/a (5.92)       n/a (5.12)     n/a (4.71)
NYU Depth V1 Indoor Scenes Dataset classification, mean (standard deviation), in %; n/a: mean value not available

18 Multi-Spectral Scene Recognition
Multi-Spectral Scene Dataset [M. Brown, S. Susstrunk, 11]
Scenes captured by a modified DSLR, GIST feature
1st view: LAB color/grayscale image; 2nd view: near infrared image

Settings    Missing 2nd View              Missing Randomly
Views       LAB+I         L+I             LAB+I         L+I
SVM         n/a (5.25)    n/a (3.77)      n/a (5.56)    n/a (4.69)
SVM2K       n/a (5.58)    n/a (4.59)      n/a (5.42)    n/a (4.30)
KCCA        n/a (4.76)    n/a (4.81)      n/a (4.89)    n/a (5.61)
RGCCA       n/a (3.94)    n/a (3.96)      n/a (3.36)    n/a (5.28)
DCCA        n/a (2.37)    n/a (4.57)      n/a (5.40)    n/a (4.61)
SLCCA       n/a (3.26)    n/a (4.68)      n/a (4.92)    n/a (5.21)
Multi-Spectral Scene Dataset classification, mean (standard deviation), in %; n/a: mean value not available

19 Binghamton 3D facial expression
3D facial expression dataset [L. Yin et al., 06]
Human face models (image + 3D models): 1st view: frontal image; 2nd view: frontal depth image
Tasks:
  Missing data (missing 2nd view or missing randomly) classification with respect to expression/gender/race
  Cross-modal verification

20 Binghamton 3D facial expression

Recognition: Missing 2nd View
Algorithms   SVM          KCCA         DCCA         SLCCA
Expression   72.2 (4.1)   73.1 (3.7)   74.5 (4.1)   75.8 (3.2)
Gender       92.1 (3.2)   89.4 (4.1)   92.6 (5.3)   92.8 (3.8)
Race         72.0 (3.9)   74.1 (4.2)   75.2 (5.2)   78.1 (4.3)

Recognition: Missing Randomly
Algorithms   SVM          KCCA         DCCA         SLCCA
Expression   69.1 (3.9)   69.5 (3.3)   71.9 (5.9)   73.8 (4.2)
Gender       87.4 (4.2)   88.6 (3.5)   89.5 (4.4)   89.6 (3.5)
Race         64.0 (4.2)   66.2 (3.7)   66.5 (5.3)   68.4 (4.3)

Cross-modal Verification
Algorithms   Raw          KCCA         DCCA         SLCCA
Accuracy     81.1 (4.3)   82.3 (3.9)   82.6 (4.4)   86.4 (5.4)

Missing data recognition and cross-modal verification on the Binghamton 3D facial expression dataset, mean (standard deviation), in %

21 Conclusion
Adapted a recognition framework based on a common semantic latent space
Proposed SLCCA, which explicitly preserves the class information in the latent space
Proposed a solution to SLCCA: relaxation and alternating optimization
Verified performance advantages on four public datasets with various recognition tasks

22 Major References
D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor. Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16(12), 2004.
B. Kulis, K. Saenko, and T. Darrell. What you saw is not what you get: Domain adaptation using asymmetric kernel transforms. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE.
K. Q. Weinberger, J. Blitzer, and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. In Advances in Neural Information Processing Systems.
E. P. Xing, M. I. Jordan, S. Russell, and A. Y. Ng. Distance metric learning with application to clustering with side-information. In Advances in Neural Information Processing Systems.
A. Qualizza, P. Belotti, and F. Margot. Linear programming relaxations of quadratically constrained quadratic programs. In Mixed Integer Nonlinear Programming. Springer.
H. D. Sherali and B. M. Fraticelli. Enhancing RLT relaxations via a new class of semidefinite cuts. Journal of Global Optimization, 22(1-4).
K. Lai, L. Bo, X. Ren, and D. Fox. A large-scale hierarchical multi-view RGB-D object dataset. In Robotics and Automation (ICRA), 2011 IEEE International Conference on. IEEE.
N. Silberman and R. Fergus. Indoor scene segmentation using a structured light sensor. In Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on. IEEE.
M. Brown and S. Susstrunk. Multi-spectral SIFT for scene category recognition. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE.
L. Yin, X. Wei, Y. Sun, J. Wang, and M. J. Rosato. A 3D facial expression database for facial behavior research. In Automatic Face and Gesture Recognition (FGR), International Conference on. IEEE.

23 Thank You! Questions?

Multi-View Visual Recognition of Imperfect Testing Data

Multi-View Visual Recognition of Imperfect Testing Data Qilin Zhang 1 1 Stevens Institute of Technology 1 Castle Point Terrace Hoboken, NJ, USA, 07030 qzhang5@stevens.edu Gang Hua 1,2 2 Microsoft Research