CS231M Mobile Computer Vision Structure from motion

CS231M Mobile Computer Vision Structure from motion - Cameras - Epipolar geometry - Structure from motion

Pinhole camera Pinhole perspective projection f o f = focal length o = center of the camera

z y f y' z f ' y P z y P Pinhole camera Derived using similar triangles ) z y,f z (f z) y,, (

From retina plane to images Piels, bottom-left coordinate systems

From retina plane to images y c c

Converting to piels y y c 1. Off set C=[c, c y ] c y (, y, z) (f c, f c y ) z z

Converting to piels y y c c 1. Off set 2. From metric to piels y (, y, z) (f k c, f l c y ) z z C=[c, c y ] Units: k,l : piel/m f : m Non-square piels, : piel

Camera Matri y y c y (, y, z) ( c, c y z z ) c Matri form? C=[c, c y ]

Homogeneous coordinates For details see lecture on transformations in CS131A homogeneous image coordinates homogeneous scene coordinates Converting from homogeneous coordinates

Camera Matri ) c z y, c z ( z) y,, ( y 1 0 1 0 0 0 0 0 0 ' z y c c z z c y z c P y y y c y c C=[c, c y ] Camera matri K

Camera Skew 1 0 1 0 0 0 sin 0 0 cot ' z y c c P y y c y c C = [c, c y ] P P M ' P I K 0 K has 5 degrees of freedom!

World reference system R,T j w k w O w i w The mapping so far is defined within the camera reference system What if an object is represented in the world reference system

World reference system R,T j w P k w O w P i w In 4D homogeneous coordinates: P R T 1 0 4 4 P w I P P K 0 Internal parameters R 0 T 1 ' K I 0 Pw 44 K Eternal parameters R M T P w

Properties of Projection Points project to points Lines project to lines Distant objects look smaller

Properties of Projection Angles are not preserved Parallel lines meet! Vanishing point

Camera Calibration Calibration rig j C P' KI 0 P K R T P w P 1 P n with known positions in [O w,i w,j w,k w ] p 1, p n known positions in the image Goal: compute intrinsic and etrinsic parameters

Camera Calibration http://docs.opencv.org/_downloads/camera_calibration.cpp Calibration rig image j C P' KI 0 P K R T P w P 1 P n with known positions in [O w,i w,j w,k w ] p 1, p n known positions in the image Goal: compute intrinsic and etrinsic parameters

CS231M Mobile Computer Vision Structure from motion - Cameras - Epipolar geometry - Structure from motion

Can we recover the structure from a single view? Pinhole perspective projection P p O w Calibration rig Scene C Camera K Why is it so difficult? Intrinsic ambiguity of the mapping from 3D to image (2D)

Can we recover the structure from a single view? Intrinsic ambiguity of the mapping from 3D to image (2D) Courtesy slide S. Lazebnik

Two eyes help! X? K =known 2 1 R, T O 2 O 1 This is called triangulation K =known

Structure from motion problem X j 1j M m M 1 mj 2j M 2 Given m images of n fied 3D points ij = M i X j, i = 1,, m, j = 1,, n

Structure from motion problem X j 1j M m M 1 mj 2j M 2 From the mn correspondences ij, can we estimate: m projection matrices M i n 3D points X j motion structure

Study relationship between X, 1 and 2 X K =known 2 1 R, T O 2 O 1 Epipolar geometry! K =known

Epipolar geometry X 1 2 e 1 e 2 Epipolar Plane Epipoles e 1, e 2 Baseline Epipolar Lines O 1 O 2 = intersections of baseline with image planes = projections of the other camera center

Epipolar geometry X For details see CS131A Lecture 9 1 2 e 1 e 2 Epipolar Plane Epipoles e 1, e 2 Baseline Epipolar Lines O 1 O 2 = intersections of baseline with image planes = projections of the other camera center

Eample: Converging image planes e e

Eample: Parallel image planes X e 1 1 2 e 2 O 1 O 2 Baseline intersects the image plane at infinity Epipoles are at infinity Epipolar lines are parallel to ais

Eample: Parallel Image Planes e at infinity e at infinity

Why are epipolar constraints useful? - Two views of the same object - Suppose I know the camera positions and camera matrices - Given a point on left image, how can I find the corresponding point on right image?

Why are epipolar constraints useful? X O 1 O 2

Essential matri P p p O R, T O Assume camera matrices are known p T E E p T R 0 E = essential matri (Longuet-Higgins, 1981)

b a b a ] [ 0 0 0 z y y z y z b b b a a a a a a Cross product as matri multiplication

Epipolar Constraint P p p O R, T O p T F K F p 0 T 1 T R K F = Fundamental Matri (Faugeras and Luong, 1992)

Epipolar Constraint p T F p 1 2 0 P p 1 p 2 e 1 e 2 O 1 O 2 F p 2 is the epipolar line associated with p 2 (l 1 = F p 2 ) F T p 1 is the epipolar line associated with 1 (l 2 = F T p 1 ) F e 2 = 0 and F T e 1 = 0 F is 33 matri; 7 DOF F is singular (rank two)

Why F is useful? l = F T - Suppose F is known - No additional information about the scene and camera is given - Given a point on left image, how can I find the corresponding point on right image?

Why F is useful? F captures information about the epipolar geometry of 2 views + camera parameters MORE IMPORTANTLY: F gives constraints on how the scene changes under view point transformation (without reconstructing the scene!) Powerful tool in: 3D reconstruction Multi-view object/scene matching

O O p p P Estimating F The Eight-Point Algorithm 1 v u p P 1 v u p P 0 F p p T (Hartley, 1995) (Longuet-Higgins, 1981)

Estimating F OPENCV: findfundamentalmat

CS231M Mobile Computer Vision Structure from motion - Cameras - Epipolar geometry - Structure from motion

Structure from motion problem X j 1j M m M 1 mj 2j M 2 From the mn correspondences ij, can we estimate: m projection matrices M i n 3D points X j motion structure

Similarity Ambiguity The scene is determined by the images only up a similarity transformation (rotation, translation and scaling) This is called metric reconstruction Similarity The ambiguity eists even for (intrinsically) calibrated cameras For calibrated cameras, the similarity ambiguity is the only ambiguity [Longuet-Higgins 81]

Similarity Ambiguity It is impossible based on the images alone to estimate the absolute scale of the scene (i.e. house height) http://www.robots.o.ac.uk/~vgg/projects/singleview/models/hut/hutme.wrl

Structure from Motion Ambiguities Projective In the general case (nothing is known) the ambiguity is epressed by an arbitrary affine or projective transformation M j i X j M K i i R i T i H X j M H 1 j j M i X j -1 M H H X i j

Projective Ambiguity R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, 2nd edition, 2003

Metric reconstruction (upgrade) The problem of recovering the metric reconstruction from the perspective one is called self-calibration Stratified reconstruction: from perspective to affine from affine to metric

Mobile SFM Intrinsic camera parameters are known or can be calibrated. For calibrated cameras, the similarity ambiguity is the only ambiguity [Longuet-Higgins 81] No need for stratified solution or auto-calibration Similarity Metric reconstruction can be determined if a calibration pattern is used or the absolute size of an known object is given.

Structure-from-Motion Algorithms Algebraic approach (by fundamental matri) Factorization method (by SVD) Bundle adjustment

Algebraic approach (2-view case) 1j M 1 2j M i j i X j M 2 Apply a projective transformation H such that: M H 1 I 0 1 M H A b 1 Canonical perspective cameras 2

Algebraic approach (2-view case) 1. Compute the fundamental matri F from two views (eg. 8 point algorithm) 2. Compute b and A from F Compute b as least sq. solution of F b = 0, with b =1 using SVD; b is an epipole A= [b ] F 3. Use b and A to estimate projective cameras M 0 M [b ]F b 1 I 2 4. Use these cameras to triangulate and estimate points in 3D For details, see CS231A, lecture 7

Structure-from-Motion Algorithms Algebraic approach (by fundamental matri) Factorization method (by SVD) Bundle adjustment

Affine structure from motion (simpler problem) Image World Image From the mn correspondences ij, estimate: m projection matrices M i (affine cameras) n 3D points X j

Affine cameras X Camera matri M for the affine case u v M X 1 AX b; M A b

Centering the data X j ˆ ij Normalize points w.r.t. centroids of measurements from each image ij AX j b ˆ A ij i X j

mn m m n n D ˆ ˆ ˆ ˆ ˆ ˆ ˆ ˆ ˆ 2 1 2 22 21 1 12 11 cameras (2m ) points (n ) A factorization method - factorization Let s create a 2m n data (measurement) matri:

Let s create a 2m n data (measurement) matri: n m mn m m n n X X X A A A D 2 1 2 1 2 1 2 22 21 1 12 11 ˆ ˆ ˆ ˆ ˆ ˆ ˆ ˆ ˆ cameras (2m 3) points (3 n ) The measurement matri D = M S has rank 3 (it s a product of a 2m3 matri and 3n matri) A factorization method - factorization (2m n) M S

2m Factorizing the Measurement Matri = Measurements Motion Structure 3 n 3 n D = MS

2m Factorizing the Measurement Matri Singular value decomposition of D: = D U W V T n n n n n

Factorizing the Measurement Matri Since rank (D)=3, there are only 3 non-zero singular values

Factorizing the Measurement Matri S = structure M = Motion (cameras)

Factorizing the Measurement Matri What is the issue here? D has rank>3 because of: measurement noise affine approimation T Theorem: When D has a rank greater than p, U pwpvp is the best possible rank- p approimation of A in the sense of the Frobenius norm. T A0 U3 D U3W3V3 P W V T 0 3 3

Affine Ambiguity = D M C C -1 S The decomposition is not unique. We get the same D by using any 3 3 matri C and applying the transformations: M MC S C -1 S M S Additional constraints must be enforced to resolve this ambiguity

Reconstruction results C. Tomasi and T. Kanade. Shape and motion from image streams under orthography: A factorization method. IJCV, 9(2):137-154, November 1992.

Structure-from-Motion Algorithms Algebraic approach (by fundamental matri) Factorization method (by SVD) Bundle adjustment

Bundle adjustment Non-linear method for refining structure and motion Minimizing re-projection error E(M, X) m i1 n j1 D 2, X ij M i j X j? M 1 X j 1j 3j P 1 M 2 X j 2j M 3 X j P 2 P 3

Bundle adjustment Non-linear method for refining structure and motion Minimizing re-projection error E(M, X) m i1 n j1 D 2, X ij M i Advantages Handle large number of views Handle missing data Can leverage standard optimization packaged such as Levenberg-Marquardt Limitations Large minimization problem (parameters grow with number of views) Requires good initial condition Used as the final step of SFM j

Results and applications Courtesy of Oford Visual Geometry Group Lucas & Kanade, 81 Chen & Medioni, 92 Debevec et al., 96 Levoy & Hanrahan, 96 Fitzgibbon & Zisserman, 98 Triggs et al., 99 Pollefeys et al., 99 Kutulakos & Seitz, 99 Levoy et al., 00 Hartley & Zisserman, 00 Dellaert et al., 00 Rusinkiewic et al., 02 Nistér, 04 Brown & Lowe, 04 Schindler et al, 04 Lourakis & Argyros, 04 Colombo et al. 05 Golparvar-Fard, et al. JAEI 10 Pandey et al. IFAC, 2010 Pandey et al. ICRA 2011 Microsoft s PhotoSynth Snavely et al., 06-08 Schindler et al., 08 Agarwal et al., 09 Frahm et al., 10

CS231M Mobile Computer Vision Net lecture: Eample of a SFM pipeline for mobile devices: the VSLAM pipeline