CAMERAS AND GRAVITY: ESTIMATING PLANAR OBJECT ORIENTATION. Zhaoyin Jia, Andrew Gallagher, Tsuhan Chen


School of Electrical and Computer Engineering, Cornell University

ABSTRACT

Photography on a mobile camera provides access to additional sensors. In this paper, we estimate the absolute orientation of a planar object with respect to the ground, which can be a valuable prior for many vision tasks. To find the planar object orientation, our novel algorithm combines information from a gravity sensor with a planar homography that matches a region of an image to a training image (e.g., of a company logo). We demonstrate our approach with an iPhone application that records the gravity direction for each captured image. We find a homography that maps the training image to the test image, and propose a novel homography decomposition to extract the rotation matrix. We believe this is the first paper to estimate absolute planar object orientation by combining inertial sensor information with vision algorithms. Experiments show that our proposed algorithm performs reliably.

Index Terms — Image processing, Image motion analysis, Image detection

1. INTRODUCTION

Many mobile phones are equipped with inertial sensors, such as accelerometers and gyroscopes, that provide relatively accurate measurements of the phone's pose, such as its orientation with respect to the ground. This information is usually used to improve user interaction, for example, switching an application window to landscape when a user rotates the phone horizontally. In this paper, we combine the mobile device's camera with a gravity sensor and show that the gravity sensor data can estimate not only the pose of the phone itself, but also the orientation of a planar object captured by the camera. One example is shown in Fig. 1: given a planar target object to detect (Fig. 1 (a), the training view) and the gravity direction with respect to the camera (Fig. 1 (b)), we match the object in the test image (Fig. 1 (c), the test view), and estimate the orientation of the object with respect to the ground through a special homography decomposition, shown in Fig. 1 (d). To our knowledge, we are the first to combine a vision algorithm (planar matching) with an inertial sensor to determine the orientation of an object.

Fig. 1. (a) The training view (frontal, no distortion) of the target. (b) The testing setup. (c) The test image containing the target. (d) The homography from the training image to the test image. We compute the absolute orientation of the planar object, e.g., θ = 15°, indicating the angle between the target artwork and the ground plane.

The orientation of an object is a useful prior for many vision tasks. For example, even for the same image shown in Fig. 1 (a), if the image is displayed vertically (θ = 90°), it is more likely to be a painting hanging on a wall; if it is horizontal (θ = 0°), it may be a picture in a book lying on a table.

Related work: Hoiem et al. [1] show that estimating surface orientation helps occlusion boundary detection, depth estimation and object recognition. Further applications of estimating the camera position, vanishing lines and the horizon are presented in [2]. This prior work proposes a learning-based approach that finds surface orientations and roughly classifies surfaces as vertical or horizontal; our algorithm predicts the object orientation more precisely, in degrees, and incorporates a gravity sensor with pixel data. In addition, other applications with different goals combine inertial sensors with images for matching, photo enhancement and robotics, such as [3], [4] and [5], but none combine homography matching with gravity sensors to accurately measure the absolute orientation of a planar surface.

Our work is closely related to the homography decompositions presented in [6], [7], [8] and [9]. However, our setting and task are different: these algorithms use the intrinsic matrices of both cameras for the decomposition. In our case, although the mobile phone used for testing can be calibrated, the training images containing the targets to match carry no camera information: they can be images downloaded from the Internet, as shown in Fig. 1 (a). Our derivation shows that we can synthesize the camera matrix for the training view and still recover the target orientation, as long as the training image is frontal with no distortion, a usually valid assumption. We believe we are the first to combine this decomposition with a gravity sensor to estimate planar object orientation.

2. ALGORITHM

We introduce our definition of the variables in Fig. 2: two cameras (camera 0, which captures the training target image, and camera 1, which captures the test image) are related by a planar homography. The intrinsic matrices are K_0 for the training view and K_1 for the test image; the world coordinate frame is defined by camera 0's axes, i.e., camera 0 has identity rotation and zero translation. The planar object is placed at distance d with normal direction n. The extrinsic matrix of camera 1 is [R t]. The gravity vector g is provided in camera 1's coordinates during testing.

Fig. 2. Two views related by a homography, with the gravity vector.

The goal is to find the projection of the surface normal n in camera 1's coordinates, and then compute its angle with the gravity vector g. To do this, we must find the rotation matrix R from camera 0 to camera 1, despite the challenge that camera 0 has unknown internal parameters.

2.1. Two-view geometry with planar homography

For a homography H that maps points x_0 in the image of camera 0 to points x_1 in the image of camera 1, i.e., x_1 = H x_0, [6] shows that the induced planar homography H for the plane [n^T, d]^T between the two cameras is:

H = K_1 (R - t n^T / d) K_0^{-1}. (1)

We define \bar{H} as:

\bar{H} = K_1^{-1} H K_0 = R - t n^T / d. (2)

Unfortunately, we cannot directly decompose \bar{H} into R and t to get the pose of the target ([8] and [10]), because although we have the camera parameters K_1 for the testing view (camera 1), the training view's intrinsics K_0 are unknown.

2.2. Depth d and intrinsic matrix K_0

With the intrinsic matrix K_0 unknown, we make use of our assumptions about the training images: 1) they are frontal views of a planar object; 2) the camera has zero skew; 3) it also has square pixels. These assumptions hold for most planar targets, such as paintings, logos, and book or CD covers found online. Then K_0 becomes:

K_0 = \begin{bmatrix} f & 0 & c_x \\ 0 & f & c_y \\ 0 & 0 & 1 \end{bmatrix}. (3)

For any 3D point X that lies on the plane [n^T, d]^T, since camera 0 takes a frontal view of the planar object, X = [x, y, d, 1]^T. By definition, camera 0 has identity rotation and zero translation, so its extrinsic matrix is [I_{3x3} 0]. In homogeneous coordinates, the 2D projection x_0 of the 3D point X into camera 0 is:

x_0 = K_0 [I_{3x3} 0] [x, y, d, 1]^T \cong \bar{K}_0 [x, y, d/f]^T. (4)

We can rewrite K_0 as \bar{K}_0:

\bar{K}_0 = \begin{bmatrix} 1 & 0 & c_x \\ 0 & 1 & c_y \\ 0 & 0 & 1 \end{bmatrix}, (5)

where c = [c_x, c_y, 1]^T is the third column of \bar{K}_0, and we represent the depth as \bar{d}:

\bar{d} = d / f. (6)

For a frontal view of a planar object, the depth and the focal length are coupled: we get exactly the same image if we increase the focal length of the camera and move the object further away.
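To make the model concrete, here is a minimal numpy sketch (ours, not the authors') of the forward model in Eqs. (1)-(3): it builds a synthetic zero-skew, square-pixel training camera K_0 and a tilted test camera, then verifies Eq. (2). All numeric values are made up for illustration.

```python
import numpy as np

def induced_homography(K0, K1, R, t, n, d):
    """Planar homography of Eq. (1): H = K1 (R - t n^T / d) K0^{-1}."""
    return K1 @ (R - np.outer(t, n) / d) @ np.linalg.inv(K0)

# Hypothetical training camera (Eq. 3): zero skew, square pixels.
f, cx, cy = 800.0, 320.0, 240.0
K0 = np.array([[f, 0, cx], [0, f, cy], [0, 0, 1.0]])
# Hypothetical calibrated test camera K1.
K1 = np.array([[700.0, 0, 360.0], [0, 700.0, 300.0], [0, 0, 1.0]])

# Frontal plane in camera 0's frame: normal n = [0, 0, 1]^T at depth d.
n = np.array([0.0, 0.0, 1.0])
d = 2.0

# Test-camera pose: a 15-degree tilt about the x-axis and a small translation.
a = np.deg2rad(15.0)
R = np.array([[1, 0, 0],
              [0, np.cos(a), -np.sin(a)],
              [0, np.sin(a), np.cos(a)]])
t = np.array([0.1, -0.2, 0.3])

H = induced_homography(K0, K1, R, t, n, d)
H_bar = np.linalg.inv(K1) @ H @ K0                  # Eq. (2)
print(np.allclose(H_bar, R - np.outer(t, n) / d))   # True
```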

With this derivation, the homography \bar{H} in (2) becomes:

\bar{H} = K_1^{-1} H \bar{K}_0 = R - \frac{f}{d} t n^T. (7)

2.3. Decompose \bar{H} to R

Fig. 3. Sample frontal views of the planar object images tested.

To decompose \bar{H}, we define the vectors u and v as:

u = [1, 0, 0]^T, v = [0, 1, 0]^T, (8)

and multiply them by \bar{H} in (7). For u, on the left side of (7) we have (using \bar{K}_0 u = u):

\bar{H} u = K_1^{-1} H \bar{K}_0 u = K_1^{-1} H u. (9)

On the right side, since n^T u = 0 (recall that in camera 0 the frontal view of the planar object has n = [0, 0, 1]^T), we have:

(R - \frac{f}{d} t n^T) u = R u. (10)

Therefore, combining (9) and (10), we have:

K_1^{-1} H u = R u. (11)

Following the same derivation for the vector v, we also have:

K_1^{-1} H v = R v. (12)

Since R is a rotation matrix and [u, v, u \times v] = I_{3x3}, we have:

R = [R u, R v, (R u) \times (R v)] = [K_1^{-1} H u, K_1^{-1} H v, (K_1^{-1} H u) \times (K_1^{-1} H v)]. (13)

In practice, due to noise in computing H and K_1, the resulting matrix R may not be orthogonal. We approximate R by the nearest orthogonal matrix in the Frobenius norm, i.e., we take the SVD of R in Eq. (13), R = U \Sigma V^T, and set the final R = U V^T.

Eq. (13) shows that, under our assumptions on the training images (frontal view, no distortion), the rotation matrix R is independent of the camera center c and focal length f of camera 0, the depth d and the translation t. Once we retrieve the homography H between the two views and know the intrinsic parameters K_1 of the testing camera 1, we can determine the rotation matrix R from camera 0 to camera 1.

2.4. Planar object orientation θ

During testing, the gravity direction g is given in camera 1's coordinates. To compute the planar object orientation θ with respect to the ground, we project the plane's normal vector n into camera 1's coordinates, i.e., R n, and calculate its angle to the gravity vector g. More formally, applying Eq. (13), we define the angle α as:

\alpha = \arccos(R n \cdot g), (14)

\theta = \alpha \text{ for } 0 \le \alpha \le \pi/2, \qquad \theta = \pi - \alpha \text{ for } \pi/2 < \alpha < \pi. (15)

3. EXPERIMENTS

An iPhone 4 is used for the experiments. The camera is calibrated to find K_1, and the inertial sensor data is recorded while capturing images. We assume the coordinate axes of the gravity sensor are aligned with the camera axes, and directly use the gravity vector from the inertial sensor as g. We use several different types of planar images, all downloaded, for the experiments: five famous paintings (Mona Lisa, The Girl with a Pearl Earring, Marriage of Arnolfini, The Music Lesson and Birth of Venus), a logo (Starbucks), and a CD cover (Beatles). We render each image on an iPad as the testing target, and capture the scene with the calibrated iPhone 4 camera to estimate the absolute orientation of the iPad. The ground-truth angle of the planar object is measured manually with a protractor. The homography is computed through SIFT point matching [11] and the RANSAC algorithm with DLT [6] (see the sketch below).

Fig. 4. Experiment setting with (a) fixed object orientation (θ_gt = 72.5°) and different camera positions, and (b) gradually increasing object orientation from 0° to 75° (θ_gt = 0° and 30° shown).

Three types of experiments are performed (see Fig. 4 for the first two): 1) we keep the planar object at a fixed angle with respect to the ground, and take images from different camera positions; 2) we keep the camera at a fixed angle, and gradually rotate the planar object from 0° to 75° with respect to the ground; 3) we capture images in the real world to show qualitative results.
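Putting Sections 2.3 and 2.4 together, the following is a minimal numpy sketch of the decomposition (Eqs. (8)-(15)); it assumes H maps training-image points to test-image points and that g is the gravity vector in camera 1's coordinates. This is an illustrative sketch, not the authors' implementation.

```python
import numpy as np

def rotation_from_homography(H, K1):
    """Eq. (13): recover R from the homography H (training -> test) and the
    test-camera intrinsics K1, then project onto the nearest rotation matrix."""
    A = np.linalg.inv(K1) @ H
    r1 = A[:, 0]                      # K1^{-1} H u, with u = [1, 0, 0]^T
    r2 = A[:, 1]                      # K1^{-1} H v, with v = [0, 1, 0]^T
    R = np.column_stack([r1, r2, np.cross(r1, r2)])
    U, _, Vt = np.linalg.svd(R)       # nearest orthogonal matrix (Frobenius)
    return U @ Vt

def object_orientation(R, g):
    """Eqs. (14)-(15): angle between the rotated plane normal and gravity."""
    n = np.array([0.0, 0.0, 1.0])     # frontal training view => n = [0, 0, 1]^T
    g = g / np.linalg.norm(g)
    alpha = np.arccos(np.clip((R @ n) @ g, -1.0, 1.0))
    return alpha if alpha <= np.pi / 2 else np.pi - alpha
```

Note that the homography is estimated only up to scale, so A[:, 0] and A[:, 1] are merely proportional to Ru and Rv; the SVD projection absorbs that common scale. Feeding the synthetic H from the earlier sketch into rotation_from_homography recovers the rotation used there.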

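For the homography-estimation step used in the experiments (SIFT matching followed by RANSAC with DLT), a typical OpenCV sketch is shown below. The ratio-test threshold (0.75) and RANSAC reprojection threshold (5.0 pixels) are illustrative choices, not values from the paper.

```python
import cv2
import numpy as np

def estimate_homography(train_img, test_img):
    """SIFT matches + RANSAC-fitted homography mapping training-image
    points to test-image points (cf. Section 3)."""
    sift = cv2.SIFT_create()
    kp0, des0 = sift.detectAndCompute(train_img, None)
    kp1, des1 = sift.detectAndCompute(test_img, None)
    # Lowe's ratio test on 2-nearest-neighbor matches.
    matches = cv2.BFMatcher().knnMatch(des0, des1, k=2)
    good = [m for m, nn in matches if m.distance < 0.75 * nn.distance]
    src = np.float32([kp0[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp1[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return H
```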
Fixed object, different camera positions: In this setting we keep the object at a fixed angle and take images from different positions; one example is shown in Fig. 4 (a). We take 10 images of each planar object at a given angle, and for each image we compute the homography 30 times using RANSAC, computing the angle θ from each homography. The results are shown in Fig. 5: the dashed lines indicate the ground-truth angles, and different colors indicate different images at a specific angle. Because RANSAC produces a different homography each time, the error bar shows the standard deviation of θ. Our proposed algorithm predicts the angle of the planar object with small errors, which may be introduced by misalignment of the gravity vector axes, the calibration of K_1, and especially the quality of the homography. One example is shown in Fig. 6: a good homography gives a much smaller error in the estimated orientation of the planar object.

Fig. 5. Experiment results for different object orientations.

Fig. 6. Homography quality affects the estimated orientation θ. The ground-truth orientation in (a) is θ_gt = 42.5°; the good homography in (b) (θ = 43.0°) gives a better prediction than the worse one in (c) (θ = 55.1°).

Fixed camera, different object orientations: In another test we keep the camera roughly fixed, but gradually increase the orientation angle θ of the planar object from 0° to 75°; one example is shown in Fig. 4 (b). The results are shown in Table 1. Our proposed algorithm still robustly estimates the object orientation in this case. Over these two scenarios, 88% of the test cases are within 5° of the ground truth, and 98% are within 10°.

gt (°)       0     15    30    45    60    75
obj1 mean    4.6   17.1  29.8  44.3  61.2  76.8
obj1 var     2.4   2.7   3.0   3.2   4.3   3.5
obj2 mean    7.7   13.8  29.0  45.0  56.8  72.2
obj2 var     3.5   4.4   4.1   2.4   2.1   1.8

Table 1. Results with one object rotating from 0° to 75°. gt is the ground-truth angle; obj1 is the Starbucks logo and obj2 is the Beatles CD cover. The table shows the mean and variance (var) for each image, obtained by sampling homographies with RANSAC.

Real-world example: We also test on real-world images of a Starbucks logo for qualitative results, shown in Fig. 7. Our proposed algorithm predicts the logo orientation θ accurately in different situations.

Fig. 7. Predicting the angle of the Starbucks logo in real-world images: (a) a Starbucks logo outside a cafe, θ = 87.5° (ground truth θ_gt = 90°); (b) a bag of coffee held in a hand, θ = 30.7° (θ_gt should be between 0° and 90°); (c) a menu on a table, θ = 6.6° (θ_gt = 0°). The θ under each figure is the prediction from our algorithm.

4. CONCLUSION

This paper estimates the absolute orientation of a planar object using a gravity sensor together with a mobile camera. We made weak assumptions about the training image, e.g., frontal view, zero skew and no distortion. During testing we estimate the rotation through a special homography decomposition, compute the projection of the planar object's normal vector, and take its angle to the gravity vector as the object orientation. Experiments showed that our proposed algorithm robustly predicts the object orientation in different scenarios. Future applications can be built on this algorithm, e.g., correcting the homography or improving object detection. Since we find the rotation matrix of the testing camera, our algorithm can also estimate the in-plane rotation of the planar object, which can be used for other applications, e.g., to predict whether the object is upside down.

5. REFERENCES

[1] D. Hoiem, A. A. Efros, and M. Hebert, "Closing the loop in scene interpretation," in CVPR, 2008.
[2] D. Hoiem, A. A. Efros, and M. Hebert, "Putting objects in perspective," IJCV, vol. 80, 2008.
[3] D. Kurz and S. Benhimane, "Inertial sensor-aligned visual feature descriptors," in CVPR, 2011.
[4] H. Lee, E. Shechtman, J. Wang, and S. Lee, "Automatic upright adjustment of photographs," in CVPR, 2012.
[5] J. Lobo and J. Dias, "Vision and inertial sensor cooperation using gravity as a vertical reference," PAMI, vol. 25, no. 12, 2003.
[6] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision (2nd ed.), Cambridge University Press, 2006.
[7] Y. Ma, S. Soatto, J. Kosecka, and S. Sastry, An Invitation to 3-D Vision, Springer, 2003.
[8] O. D. Faugeras and F. Lustman, "Motion and structure from motion in a piecewise planar environment," International Journal of Pattern Recognition and Artificial Intelligence, 1988.
[9] Z. Zhang and A. R. Hanson, "3D reconstruction based on homography mapping," in Proc. ARPA Image Understanding Workshop, 1996.
[10] E. Malis and M. Vargas, "Deeper understanding of the homography decomposition for vision-based control," 2008.
[11] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," IJCV, 2004.