A Quantitative Comparison of 4 Algorithms for Recovering Dense Accurate Depth

Baozhong Tian and John L. Barron
Dept. of Computer Science, University of Western Ontario, London, Ontario, Canada
{btian,barron}@csd.uwo.ca

Abstract: We report on 4 algorithms for recovering dense depth maps from long image sequences, where the camera motion is known a priori. All methods use a Kalman filter to integrate intensity derivatives or optical flow over time to increase accuracy.

1 Introduction

This comparison work is motivated by a real-world application: we wish to play records (SP and LP) by computing dynamic depth maps of the groove walls of a record and then converting the derived time-varying wall orientations into sound. In this way, we can play old vinyl records in a non-contact manner, minimizing further deterioration of the records. We look at 4 prominent dense depth algorithms in the literature that appear to give good results and perform a quantitative error analysis on them using ray-traced textured objects. The first 2 algorithms use time-varying intensity derivative data, I_x, I_y and I_t, computed by Simoncelli's lowpass and highpass filters [8], while the last 2 algorithms use optical flow computed by Lucas and Kanade's least squares method [6]. The idea here is that our quantitative analysis of the algorithms may provide us with a framework to compute dense depth maps for optical record playing.

2 The Algorithms

We did a survey of recent algorithms for dense depth maps (from image velocity or intensity derivatives); the 4 algorithms we present below appeared to give the best results. A 5th algorithm, by Xiong and Shafer [11], is under implementation. All of these algorithms assume known camera translation and rotation (or can be made to have this assumption). We first present brief summaries of the 4 algorithms, by Heel [3], Matthies et al. [7], Hung and Ho [4] and Barron et al. [2], followed by experimental results and conclusions.
2.1 Heel's Algorithm

Heel [3] proposed the recovery of motion and dense depth maps using intensity gradients. In our work, we assume the 3D sensor translation U = (U_1, U_2, U_3) and 3D sensor rotation ω = (ω_1, ω_2, ω_3) are known. Heel assumed that Z is constant within a small neighbourhood around a given pixel and that the recovery of Z could be posed as a least squares problem:

  min_Z Σ_x Σ_y ( (s · U)/Z + q · ω + I_t )²,  (1)

where s = (−I_x, −I_y, xI_x + yI_y) and q = (xyI_x + (1 + y²)I_y, −xyI_y − (1 + x²)I_x, yI_x − xI_y), which yields the following Z:

  Z = − [Σ_x Σ_y (s · U)²] / [Σ_x Σ_y (q · ω + I_t)(s · U)].  (2)

The job of the update stage is to predict a new depth Z⁺_k and variance p⁺_k using a new measurement Z_k with variance p_k and the current (predicted) estimate Z⁻_k with variance p⁻_k. The update equation is:

  Z⁺_k = (p⁻_k Z_k + p_k Z⁻_k) / (p_k + p⁻_k),  (3)

which is a weighted average of the depth values using the variances as weights. The new estimated depth variance is:

  p⁺_k = p_k p⁻_k / (p_k + p⁻_k).  (4)
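The update stage in equations (3) and (4) is ordinary inverse-variance fusion of two scalar estimates. A minimal sketch (our own illustration, not Heel's code; the function name is ours):

```python
def fuse_depth(z_meas, p_meas, z_pred, p_pred):
    """Heel-style update (equations (3) and (4)): blend a measured depth
    and a predicted depth, weighting each by the *other* estimate's
    variance, then combine the two variances harmonically."""
    z_new = (p_pred * z_meas + p_meas * z_pred) / (p_meas + p_pred)
    p_new = (p_meas * p_pred) / (p_meas + p_pred)
    return z_new, p_new
```

With equal variances the fused depth is the plain average and the variance halves; with unequal variances the fused depth moves toward the lower-variance estimate, as expected.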
Heel computes the measured variance as:

  p = c / Σ_x Σ_y [(I_t + q · ω)(s · U)]²,  (5)

where c is a scaling constant.

2.2 Hung and Ho's Algorithm

Hung and Ho's approach [4] is a dense depth calculation from intensity derivatives with known sensor motion. They assume that the image intensity of corresponding points in the 3D scene is not changed by motion over the image sequence. The standard image velocity equation can be written as:

  (s · U)/Z + q · ω = −I_t.  (6)

Then Z can be computed as:

  Z = −(s · U) / (q · ω + I_t).  (7)

Over space and time, Hung and Ho assume Z varies as:

  Z_{k+1} = G_k Z_k + u_k + θ_k,  (8)

where G_k = 1 − ω_1 Δt y + ω_2 Δt x, u_k = −U_3 Δt − (∂Z/∂x)Δx − (∂Z/∂y)Δy, and θ_k is taken to include e(x, y, k+1), the error in the Taylor series expansion used in the derivation of this equation, as well as the error generated when estimating the terms ∂Z/∂x and ∂Z/∂y in u_k. The terms ∂Z/∂x and ∂Z/∂y can only be estimated after the depth map has attained some degree of smoothness. θ is approximately Gaussian random noise with zero mean and variance Q. By introducing a measurement noise n with variance R_1, equation (6) can be re-written as:

  Y_{1k} = H_{1k} Z_k + n_k,  (9)

where Y_{1k} = s · U and H_{1k} = −(q · ω + I_t).

2.2.1 Incorporating Surface Structure and Smoothing

Hung and Ho's approach assumes that the depth Z(x, y) for every particular point (x, y) in the image has some local structural property among its neighbouring pixels. They model depth as:

  Y_2 = Z(x, y) + n_2,  (10)

where n_2 is an error term. Hung and Ho compute Y_2(x, y) for a pixel (x, y) as follows:

  Y_2(x, y) = (1/2)[Z_e(x−1, y) + Z_e(x, y−1)],  (11)

where Z_e is an estimated Z value. They note this measurement is spatially biased and may produce propagation effects due to the diagonalization of Y_2 values.
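Equation (7) is a direct per-pixel computation. A sketch under the sign conventions above (array and function names are ours, not Hung and Ho's code):

```python
import numpy as np

def depth_from_derivatives(Ix, Iy, It, x, y, U, omega):
    """Pointwise depth via equation (7): Z = -(s.U)/(q.omega + I_t),
    with s = (-Ix, -Iy, x*Ix + y*Iy) and q as defined for equation (1).
    The image-derivative arrays and coordinate grids x, y share one shape;
    U and omega are length-3 vectors."""
    s = np.stack([-Ix, -Iy, x * Ix + y * Iy], axis=-1)
    q = np.stack([x * y * Ix + (1 + y**2) * Iy,
                  -x * y * Iy - (1 + x**2) * Ix,
                  y * Ix - x * Iy], axis=-1)
    return -(s @ U) / (q @ omega + It)
```

As a sanity check: at the image centre with pure x-translation U = (1, 0, 0), a pixel with I_x = 2 and I_t = 1 yields Z = 2, consistent with the flow u = −U_1/Z = −0.5 and I_t = −I_x u.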
To overcome this, each image frame is filtered four times, top to bottom or bottom to top and left to right or right to left, according to which of the four possible corners of the image the calculation is started at each time, to produce four different versions of Y_2: Y_2^1, Y_2^2, Y_2^3 and Y_2^4. The final smoothed estimate for Y_2 is then taken to be:

  Y_2 = (1/4)[Y_2^1 + Y_2^2 + Y_2^3 + Y_2^4].  (12)

2.2.2 The Kalman Filter Equations

With the extra measurement Y_2, equations (9) and (10) can be combined to give:

  Y = H Z + n,  (13)

where

  Y = [Y_1, Y_2]ᵀ, H = [H_1, 1]ᵀ and n = [n_1, n_2]ᵀ.  (14)

Based on equation (13), a set of standard Kalman filter equations for generating an estimate of the depth Z_k in Hung and Ho's approach can then be determined. Please refer to Barron, Ngai and Spies [2] for details.

2.3 Matthies, Kanade and Szeliski

Matthies, Kanade and Szeliski [7] proposed a pixel-based (iconic) algorithm that estimates depth and depth uncertainty at each pixel and incrementally refines these estimates over time using a Kalman filter. This algorithm has four main stages: disparity measuring, disparity updating, smoothing and disparity predicting.

2.3.1 Measuring Disparity

First we must compute a displacement vector (disparity) at each pixel. Matthies et al. suggested a simple correlation-based matching (sum of squared differences (SSD)) algorithm plus cubic interpolation of scan lines be used to compute disparities. However, since our image motions are small, we replaced correlation by Lucas and Kanade's optical flow to obtain more accurate displacements [1]. Indeed, since the sensor motion is left to right, we used the x component of the optical flow as the scalar value of the disparity d. Thus, for this setup, d
is the inverse depth 1/Z. The variance of the disparity measurement is computed as:

  var(e) = 2σ_n² / a,  (15)

where σ_n² is the variance of the image noise process. Since we used σ_n = 1 and a = 1, the measurement variance var(e) is always 1.

2.3.2 Updating the Disparity Map

As in the other algorithms, the updated disparity estimate is a linear combination of the predicted and measured values, inversely weighted by their respective variances. To update a disparity, the disparity variance is computed as:

  p⁺_k = [(p⁻_k)⁻¹ + (σ_d²)⁻¹]⁻¹ = p⁻_k σ_d² / (p⁻_k + σ_d²),  (16)

where σ_d² is the measured variance using equation (15) and p⁻_k is the previous estimated variance. The Kalman filter gain K is computed as:

  K = p⁺_k / σ_d² = p⁻_k / (p⁻_k + σ_d²).  (17)

Then the disparity is updated as:

  u⁺_k = u⁻_k + K(d − u⁻_k),  (18)

where u⁻_k and u⁺_k are the predicted and updated disparity estimates and d is the new disparity measurement.

2.3.3 Smoothing the Map

Matthies et al. use a generalized piecewise spline under tension in a finite element framework to smooth their correlation fields. Since we are using the x component of optical flow, we replace this smoothing by an application of a 5 × 5 median filter.

2.3.4 Predicting the Next Disparity Map

In the prediction stage of the Kalman filter, both disparity and its uncertainty must be predicted. Given the x component of optical flow, we can predict the new pixel location in the next frame as x_{k+1} = x_k + Δx_k and y_{k+1} = y_k + Δy_k. Combining this with computed depth and camera motion, we can predict the new disparity as:

  u⁻_{k+1} = u⁺_k / (α − U_1 u⁺_k),  (19)

where α = 1 − ω_x y_i + ω_y x_i, and ω_x, ω_y and U_1 are camera rotation (zero in our case) and translation speed (as U_2 = U_3 = 0 in our case). Since new positions normally fall at subpixel locations in the next image, we need to resample to obtain predicted disparities at integer pixel locations. The predicted variance is inflated by a multiplicative factor:

  p⁻_{k+1} = (1 + ε) p⁺_k.  (20)

We use ε = 0.0.
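The per-pixel update and variance-inflation steps above are scalar operations. A compact sketch (our naming, with the paper's default σ_d² = 1):

```python
def update_disparity(u_pred, p_pred, d, sigma_d2=1.0):
    """Equations (16)-(18): Kalman gain, blended disparity estimate and
    the harmonically combined variance for a single pixel."""
    K = p_pred / (p_pred + sigma_d2)                  # equation (17)
    u_new = u_pred + K * (d - u_pred)                 # equation (18)
    p_new = p_pred * sigma_d2 / (p_pred + sigma_d2)   # equation (16)
    return u_new, p_new

def predict_variance(p_new, eps=0.0):
    """Equation (20): inflate the updated variance for the next frame."""
    return (1.0 + eps) * p_new
```

With p⁻_k = σ_d² the gain is 1/2, so the update is a plain average and the variance halves, matching the inverse-variance fusion used by the other methods.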
2.4 Barron, Ngai and Spies

Barron, Ngai and Spies [2] proposed a Kalman filter framework for recovering dense depth maps from the time-varying optical flow fields generated by a camera translating over a scene by a known amount. They assumed local neighbourhood planarity to avoid having to compute non-pixel correspondences. That is, surface orientation (of a plane) is what is tracked over time. The standard image velocity equations [5] relate a velocity vector measured at image location Y = (y_1, y_2, 1) = P/Z to the 3D sensor translation U and 3D sensor rotation ω:

  v(Y, t) = v_T(Y, t) + v_R(Y, t),  (21)

where v_T and v_R are the translational and rotational components of image velocity:

  v_T(Y, t) = A_1(Y) U / X_3 and v_R(Y, t) = A_2(Y) ω(t),  (22)

with

  A_1(Y) = ( −1   0  y_1 )
           (  0  −1  y_2 )  and  (23)

  A_2(Y) = ( y_1 y_2   −(1 + y_1²)   y_2 )
           ( 1 + y_2²  −y_1 y_2     −y_1 ).

We define the depth-scaled camera translation as

  u(Y, t) = U(t) / ‖P(t)‖_2 = û μ(Y, t),  (24)

where û = U/‖U‖_2 = (u_1, u_2, u_3) is the normalized direction of translation and μ(Y, t) = ‖U‖_2/‖P‖_2 = ‖U‖_2/(X_3 ‖Y‖_2) is the depth-scaled sensor speed at Y at time t. The focal length is assumed to be known. If we define 2 vectors:

  r(Y) = (r_1, r_2) = v − A_2(Y) ω and  (25)

  d(Y) = (d_1, d_2) = |A_1(Y) û| ‖Y‖_2,  (26)
Figure 1: Synthetic test data: (a) a marble-texture cube, (b) a marble-texture cylinder and (c) a marble-texture sphere.

where |A| means each element in the vector is replaced by its absolute value. Then we can solve for μ as:

  μ = ( r_1 |v_1| / d_1 + r_2 |v_2| / d_2 ) / ( |v_1| + |v_2| ).  (27)

2.4.1 Planar Orientation from Relative Depth

We compute the local surface orientation as a unit normal vector α̂ = (α_1, α_2, α_3) from μ values as:

  α̂ · Y = c μ ‖Y‖_2 / ‖U‖_2.  (28)

We can solve for α̂/c by setting up a linear system of equations, one for each pixel in an n × n neighbourhood where planarity has been assumed, and using a standard least squares solution method.

2.4.2 The Overall Calculation

At the initial time, t = 1:

1. We compute all the μ's as described in equation (27).

2. In each n × n neighbourhood centered at a pixel (i, j), we compute (α̂/c)_{(i,j)} at that pixel using equation (28). We call these computed α̂/c's the measurements and denote them as g_{M(i,j)}.

3. Given these measurements, we use the g_{M(i,j)} to recompute the μ_{(i,j)}'s as:

  μ(i, j) = ( g_{M(i,j)} · Y_{(i,j)} ) ‖U‖_2 / ‖Y_{(i,j)}‖_2.  (29)

We apply a median filter to the μ(i, j) within 5 × 5 neighbourhoods to remove outliers. We repeat step 2 with these values.

At time t = 2:

1. We compute μ at each pixel location and then compute all g_{M(i,j)}'s in the same way described above for the new optical flow field. Using the image velocity measurements at time t = i, we use the best estimate of surface orientation at time t = i − 1 at location Y − v Δt (Δt = 1), plus the measurement at Y and its covariance matrix, to obtain a new best estimate at Y at time t = i. We do this at all Y locations (where possible), recompute the μ values via equation (29) and output these as the 3D shape of the scene.

At time t = i we proceed as for time t = 2, except we use the best μ estimates from time t = i − 1 instead of time t = 1 in the Kalman filter updating.
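Steps 1 and 2 above follow directly from equations (25)-(28). The sketch below is our own illustration, with focal length 1 and *signed* components of d (rather than the absolute values of equation (26)), so that r_i = μ d_i holds term by term:

```python
import numpy as np

def mu_from_flow(v, Y, omega, u_hat):
    """Equations (25)-(27): remove the rotational flow A2.omega from v,
    then combine the two componentwise estimates r_i/d_i of mu,
    weighted by the flow magnitudes |v_i|."""
    y1, y2, _ = Y
    A1 = np.array([[-1.0, 0.0, y1],
                   [0.0, -1.0, y2]])
    A2 = np.array([[y1 * y2, -(1.0 + y1**2), y2],
                   [1.0 + y2**2, -y1 * y2, -y1]])
    r = v - A2 @ omega                        # equation (25)
    d = A1 @ u_hat * np.linalg.norm(Y)        # equation (26), signed
    w = np.abs(v)
    return (w[0] * r[0] / d[0] + w[1] * r[1] / d[1]) / (w[0] + w[1])

def fit_plane_normal(Ys, mus, U_norm):
    """Equation (28): least-squares fit of beta = alpha_hat/c over one
    planar neighbourhood; each pixel contributes one linear equation
    Y . beta = mu ||Y||_2 / ||U||_2.  Ys is (m, 3), mus is (m,)."""
    b = mus * np.linalg.norm(Ys, axis=1) / U_norm
    beta, *_ = np.linalg.lstsq(Ys, b, rcond=None)
    return beta  # normalize to unit length to recover alpha_hat
```

For a fronto-parallel plane at X_3 = 5 with ‖U‖_2 = 1, every pixel satisfies μ‖Y‖_2 = 1/5, and the fit returns β = (0, 0, 0.2) = α̂/c with α̂ = (0, 0, 1) and c = 5.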
2.4.3 The Kalman Filter Equations

Note that the components of α̂/c in equation (28) are not independent; thus we have a covariance matrix with non-zero off-diagonal elements in the Kalman filter equations. We use a standard set of Kalman filter equations to integrate the surface orientations (and hence depth) over time. Please refer to Barron, Ngai and Spies [2] for details of the Kalman filter equations.
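For orientation, a textbook update for a 3-vector state with a full (non-diagonal) covariance looks as follows. This is a generic sketch, not the filter of [2]; it assumes for simplicity that the neighbourhood fit observes the state directly (H = I):

```python
import numpy as np

def kf_update(beta_pred, P_pred, beta_meas, R):
    """One standard Kalman update for the state beta = alpha_hat/c with
    3x3 covariances P_pred (prediction) and R (measurement).  With
    H = I the gain reduces to P (P + R)^-1."""
    K = P_pred @ np.linalg.inv(P_pred + R)
    beta_new = beta_pred + K @ (beta_meas - beta_pred)
    P_new = (np.eye(3) - K) @ P_pred
    return beta_new, P_new
```

With equal isotropic covariances this again reduces to averaging the prediction and measurement and halving the covariance, the vector analogue of equations (3) and (4).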
3 Experimental Technique

We generated ray-traced cube, cylinder and sphere image sequences with the camera translating to the left by (−1, 0, 0), as shown in Figure 1. We marble-textured this sequence so that optical flow could be used. The texture is kept fixed to the object. We also generated a second set of image sequences with the same objects but with sinusoidal patterns instead of marble texture. These sequences allowed the correct derivatives to be computed; we use these for the correct optical flow to confirm the correctness of our implementations. We compute error distributions for three relative depth error ranges (under 5%, between 5% and 15%, and over 15%) for 4 frames: the 7th frame at the beginning of the sequences, the 19th (just before Hung and Ho turn on their smoothing) and the 27th and 36th frames near the end of the sequences. We also compute the average error (as a percentage) and its standard deviation for the 4 frames. Finally, we show raw (un-textured, unsmoothed) depth maps computed for frame 27 for the 4 methods.

4 Error Distributions and Depth Maps

Frame  Obj  <5%    5%-15%  >15%   Mean ± σ
7th    cu   13.95  24.07   61.98  51.8 ± 32.33
19th   cu   33.91  38.80   27.29  14.97 ± 19.42
27th   cu    0.35  25.47   74.17  27.65 ± 20.12
36th   cu    0.00   1.80   98.20  38.9 ± 18.59
7th    cy   16.40  25.89   57.71  48.12 ± 33.42
19th   cy   38.64  36.47   24.89  14.0 ± 18.91
27th   cy    1.42  21.58   77.00  26.94 ± 18.15
36th   cy    0.01   3.18   96.81  37.23 ± 16.41
7th    sp   13.17  23.57   63.26  53.61 ± 31.9
19th   sp   25.47  38.43   36.09  16.94 ± 18.66
27th   sp    1.76  22.98   75.26  26.39 ± 18.5
36th   sp    0.01   1.85   98.14  37.23 ± 17.27

Table 2: The percentage of estimated depth values in each error range for the cube, cylinder and sphere experiments using Hung and Ho's algorithm. In the table, cu = cube, cy = cylinder and sp = sphere. The last column shows the mean error (as a percentage) and its standard deviation σ.
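The tabulated distributions can be reproduced from a ground-truth depth map in a few lines. A sketch of the scoring (the function name is ours):

```python
import numpy as np

def error_distribution(Z_est, Z_true):
    """Percentage of pixels whose relative depth error falls under 5%,
    between 5% and 15%, and over 15%, plus the mean error and its
    standard deviation (all in percent), as reported in Tables 1-5."""
    err = 100.0 * np.abs(Z_est - Z_true) / np.abs(Z_true)
    under5 = 100.0 * np.mean(err < 5.0)
    mid = 100.0 * np.mean((err >= 5.0) & (err <= 15.0))
    over15 = 100.0 * np.mean(err > 15.0)
    return (under5, mid, over15), float(err.mean()), float(err.std())
```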
Frame  Obj  <5%    5%-15%  >15%   Mean ± σ
7th    cu   34.47  40.48   25.05  30.85 ± 35.59
19th   cu   46.28  39.46   14.25   8.54 ± 10.71
27th   cu   47.80  39.24   12.96   8.0 ± 9.68
36th   cu   49.68  38.36   11.96   7.64 ± 9.34
7th    cy   38.14  36.88   24.98  34.79 ± 37.7
19th   cy   49.50  34.99   15.51   9.23 ± 13.23
27th   cy   50.84  34.65   14.51   8.97 ± 13.11
36th   cy   51.77  34.55   13.68   8.56 ± 12.43
7th    sp   27.54  37.07   35.39  40.25 ± 37.5
19th   sp   39.13  39.84   21.03  11.38 ± 14.72
27th   sp   40.33  39.89   19.77  10.89 ± 14.17
36th   sp   41.59  39.83   18.59  10.53 ± 13.92

Table 1: The percentage of estimated depth values in each error range for the cube, cylinder and sphere experiments using Heel's algorithm. In the table, cu = cube, cy = cylinder and sp = sphere. The last column shows the mean error (as a percentage) and its standard deviation σ.

Tables 1 to 4 show the error distributions for the 4 methods, while Table 5 shows the error distribution for Hung and Ho with their smoothing calculation for Y_2 turned off. Figures 2 to 4 show the raw depth maps for the 3 objects for the 4 methods.

Frame  Obj  <5%    5%-15%  >15%   Mean ± σ
7th    cu   22.80  35.94   41.26  22.8 ± 24.2
19th   cu   55.73  26.65   17.63  11.56 ± 19.37
27th   cu   51.59  27.84   20.57  11.7 ± 16.54
36th   cu   47.01  30.17   22.82  10.99 ± 14.44
7th    cy   25.56  35.26   39.18  22.23 ± 25.23
19th   cy   59.87  22.89   17.24  10.82 ± 18.71
27th   cy   55.48  24.39   20.14  10.42 ± 15.92
36th   cy   51.14  27.18   21.68  10.21 ± 13.83
7th    sp   19.02  32.32   48.66  27.93 ± 27.62
19th   sp   51.07  26.47   22.46  13.57 ± 20.7
27th   sp   51.61  24.99   23.40  11.66 ± 16.9
36th   sp   49.44  26.40   24.17  11.1 ± 14.71

Table 3: The percentage of estimated depth values in each error range for the cube, cylinder and sphere experiments using Matthies et al.'s algorithm. In the table, cu = cube, cy = cylinder and sp = sphere. The last column shows the mean error (as a percentage) and its standard deviation σ.
Figure 2: (a) Heel, (b) Barron et al., (c) Matthies et al., (d) Hung and Ho depth maps for the marble cube at the 27th step in the Kalman filtering.

Frame  Obj  <5%    5%-15%  >15%   Mean ± σ
7th    cu   33.50  41.27   25.23  12.16 ± 13.84
19th   cu   51.63  37.56   10.81   7.19 ± 8.51
27th   cu   52.18  38.16    9.66   6.88 ± 7.88
36th   cu   52.76  35.53   11.71   7.23 ± 8.67
7th    cy   36.71  37.86   25.43  12.55 ± 15.26
19th   cy   54.26  33.25   12.48   7.29 ± 8.87
27th   cy   52.91  33.58   13.51   7.52 ± 8.95
36th   cy   52.10  34.67   13.23   7.46 ± 8.5
7th    sp   26.03  36.38   37.59  16.91 ± 18.18
19th   sp   41.45  38.82   19.73   9.64 ± 10.49
27th   sp   43.99  38.33   17.68   9.1 ± 9.91
36th   sp   45.75  37.59   16.66   8.68 ± 9.75

Table 4: The percentage of estimated depth values in each error range for the cube, cylinder and sphere experiments using Barron et al.'s algorithm. In the table, cu = cube, cy = cylinder and sp = sphere. The last column shows the mean error (as a percentage) and its standard deviation σ.

Frame  Obj  <5%    5%-15%  >15%   Mean ± σ
7th    cu   13.95  24.07   61.98  51.8 ± 32.33
19th   cu   33.91  38.80   27.29  14.97 ± 19.42
27th   cu   39.71  40.23   20.05  14.34 ± 21.42
36th   cu   42.72  40.79   16.49  15.37 ± 24.44
7th    cy   16.40  25.89   57.71  48.12 ± 33.42
19th   cy   38.64  36.47   24.89  14.0 ± 18.91
27th   cy   44.47  36.90   18.63  12.95 ± 19.9
36th   cy   47.29  37.96   14.75  12.96 ± 21.51
7th    sp   13.17  23.57   63.26  53.61 ± 31.9
19th   sp   25.47  38.43   36.09  16.94 ± 18.66
27th   sp   31.11  41.99   26.90  14.41 ± 18.11
36th   sp   35.34  44.33   20.33  13.63 ± 19.39

Table 5: The percentage of estimated depth values in each error range for the cube, cylinder and sphere experiments using Hung and Ho's algorithm, with the smoothing in the Kalman filter turned off. In the table, cu = cube, cy = cylinder and sp = sphere. The last column shows the mean error (as a percentage) and its standard deviation σ.
Figure 3: (a) Heel, (b) Barron et al., (c) Matthies et al., (d) Hung and Ho depth maps for the marble cylinder at the 27th step in the Kalman filtering.

5 Discussion and Conclusion

Quantitative results in Tables 4 and 1 show that the methods of Barron et al. [2] and Heel [3] are about the same and the best overall. Interestingly, results for Hung and Ho with their smoothing in Table 2 are worse than those when their smoothing is turned off in Table 5. However, the smoothed depth maps always look better than the unsmoothed depth maps (not shown here due to space limitations): there is an obvious bias in the smoothed values. Overall, the recovered depth maps look quite good, with the possible exception of Heel's, where there are some outliers at the object boundaries (simple filtering could remove these). This leaves Barron et al.'s as the best algorithm. We are currently testing the better algorithms on synthetic record groove images and on real groove images, with encouraging results. Because the groove wall orientation can be described by 2 angles, one of which is constrained, and because the vertical component of image velocity is always very small, we anticipate these constraints will yield even better results. For example, Barron et al.'s method could be modified to use only horizontal velocities, like Matthies et al. [7], and effectively have one angle to track in the Kalman filter. We finally note that we are using a 1X microscope to obtain our images and that there is sufficient texture in the images to allow for a good optical flow calculation. The results in this paper are only preliminary, and analysis of time-varying synthetic and real imagery of a record groove wall is now our priority.

References

[1] Barron J.L., D.J. Fleet and S.S. Beauchemin (1994), Performance of Optical Flow Techniques, International
Journal of Computer Vision, 12(1), pp. 43-77.

[2] Barron J.L., W.K.J. Ngai and H. Spies (2003), Quantitative Depth Recovery from Time-Varying Optical Flow in a Kalman Filter Framework, LNCS 2616, Theoretical Foundations of Computer Vision: Geometry, Morphology, and Computational Imaging (Editors: T. Asano, R. Klette and C. Ronse), pp. 344-355.
Figure 4: (a) Heel, (b) Barron et al., (c) Matthies et al., (d) Hung and Ho depth maps for the marble sphere at the 27th step in the Kalman filtering.

[3] Heel J. (1990), Direct Dynamic Motion Vision, Proc. IEEE Conf. Robotics and Automation, pp. 1142-1147.

[4] Hung Y.S. and Ho H.T. (1999), A Kalman Filter Approach to Direct Depth Estimation Incorporating Surface Structure, IEEE PAMI, June, pp. 570-576.

[5] Longuet-Higgins H.C. and K. Prazdny (1980), The Interpretation of a Moving Retinal Image, Proceedings of the Royal Society of London, B208:385-397.

[6] Lucas, B. and Kanade, T. (1981), An Iterative Image Registration Technique with an Application to Stereo Vision, Proc. DARPA IU Workshop, pp. 121-130.

[7] Matthies L., R. Szeliski and T. Kanade (1989), Kalman Filter-Based Algorithms for Estimating Depth from Image Sequences, International Journal of Computer Vision, 3(3), pp. 209-238.

[8] Simoncelli E.P. (1994), Design of Multi-dimensional Derivative Filters, IEEE Int. Conf. Image Processing, Vol. 1, pp. 790-793.

[9] Tian B., J. Barron, W.K.J. Ngai and H. Spies (2003), A Comparison of 2 Methods for Recovering Dense Accurate Depth Using Known 3D Camera Motion, Vision Interface, pp. 229-236.

[10] Weng J., T.S. Huang and N. Ahuja (1989), Motion and Structure from Two Perspective Views: Algorithms, Error Analysis, and Error Estimation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(5):451-476.

[11] Xiong Y. and Shafer S. (1995), Dense Structure from a Dense Optical Flow Sequence, Int. Symposium on Computer Vision, Coral Gables, Florida, pp. 1-6.