3D object articulation and motion estimation in model-based stereoscopic videoconference image sequence analysis and coding


Signal Processing: Image Communication 14 (1999) 817–840

3D object articulation and motion estimation in model-based stereoscopic videoconference image sequence analysis and coding

Dimitrios Tzovaras*, Ioannis Kompatsiaris, Michael G. Strintzis
Information Processing Laboratory, Electrical and Computer Engineering Department, Aristotle University of Thessaloniki, Thessaloniki 54006, Greece

Received 29 November 1996

Abstract

This paper describes a procedure for model-based analysis and coding of both left and right channels of a stereoscopic image sequence. The proposed scheme starts with a hierarchical dynamic programming technique for matching across the epipolar line for efficient disparity/depth estimation. Foreground/background segmentation is initially based on depth estimation and is improved using motion and luminance information. The model is initialised by the adaptation of a wireframe model to the consistent depth information. Robust classification techniques are then used to obtain an articulated description of the foreground of the scene (head, neck, shoulders). The object articulation procedure is based on a novel scheme for the segmentation of the rigid 3D motion fields of the triangle patches of the 3D model object. Spatial neighbourhood constraints are used to improve the reliability of the original triangle motion estimation. The motion estimation and motion field segmentation procedures are repeated iteratively until a satisfactory object articulation emerges. The rigid 3D motion is then re-computed for each sub-object and, finally, a novel technique is used to estimate flexible motion of the nodes of the wireframe from the rigid 3D motion vectors computed for the wireframe triangles containing each specific node. The performance of the resulting analysis and compression method is evaluated experimentally. © 1999 Elsevier Science B.V. All rights reserved.
Keywords: Stereoscopic image sequence analysis; Model-based coding; Object articulation; Non-rigid 3D motion estimation

1. Introduction

The transmission of full-motion video through limited capacity channels is critically dependent on the ability of the compression schemes to reach target bit-rates while still maintaining acceptable visual quality

* Corresponding author. E-mail: tzovaras@dion.ee.auth.gr. This work was supported by the EU CEC Project ACTS PANORAMA (Package for New Autostereoscopic Multiview Systems and Applications, ACTS project 092).

[15]. In order to achieve this, motion estimation and motion compensated prediction are frequently used, so as to reduce temporal redundancy in image sequences [22]. Similarly, in the coding of stereo and multiview images, prediction may be based on disparity compensation [33] or the best of motion and disparity compensation [34]. Stereoscopic video processing has recently been the focus of considerable attention in the literature [3,6,10,13,16,24,26,28,32]. A stereoscopic pair of image sequences, recorded with a difference in the view angle, allows the three-dimensional (3D) perception of the scene by the human observer, by exposing to each eye the respective image sequence. This creates an enhanced 3D feeling and increased 'telepresence' in teleconferencing and several other (medical, entertainment, etc.) applications. In both monoscopic and stereoscopic vision, the ability of model-based techniques to describe a scene in a structural way has opened new areas of applications. Video production, realistic computer graphics, multimedia interfaces and medical visualisation are some of the applications that may benefit by exploiting the potential of model-based schemes. Object-based techniques have been extensively investigated for monoscopic image sequence coding [4,11,12,21]. Several object-oriented coding schemes have also been proposed for stereoscopic image sequence coding [7,24,25,29,31,32]. The advantages of using model-based techniques for stereo image sequence coding were reviewed in [25], where a feature-based 3D motion estimation scheme was presented. In [24], disparity is estimated using a dynamic programming scheme and is subsequently used for object segmentation. The segmentation algorithm is based on region growing, and the criterion used for the definition of each object is based on the homogeneity of the respective disparity fields.
In [10], the objects in the scene are identified using a segmentation method based on the homogeneity of the 2D motion field computed by a block matching procedure. Then the 3D motion of each object is modeled using the approach presented in [1], with depth estimated from disparity. Finally, an interframe coding scheme based on 3D motion compensation is evaluated. A disadvantage of the segmentation technique used in this procedure is its failure to guarantee high performance of the resulting 3D motion compensation method. Alternatively, 3D models of objects may be derived from stereo images. This usually requires estimation of dense disparity fields, postprocessing to remove erroneous estimates and fitting of a parametrised surface model to the calculated depth map [14]. In [17] an algorithm was presented which optimally models the scene using a hierarchically structured wire-frame model derived directly from intensity images. The wire-frame model consists of adjacent triangles that may be split into smaller ones over areas that need to be represented in higher detail. The motion of the model surface, under both rigid and non-rigid body assumptions, is estimated concurrently with the depth parameters. Knowledge-based image sequence coding has also attracted much interest recently, especially for the coding of facial image sequences in videophone applications. One such method, presented in [2], is based on the generation of a generic face model and the use of efficient techniques for rigid and flexible 3D motion estimation. In the present paper, a procedure for model-based analysis and coding of both left and right channels of a stereoscopic image sequence is proposed. The methodology used overcomes a major obstacle in stereoscopic video coding, caused by the difficult problem of determining and handling coherently corresponding objects in the left and right images.
This is achieved in this paper by defining segmentation and object articulation in the 3D space, thus ensuring that all ensuing operations remain coherent for both the left and the right aspects of the scene. Each object is described by a mesh consisting of a set of interconnected triangles. The 3D motion of each triangle is estimated using a robust algorithm for the minimisation of the least median of squares error and by imposing neighbourhood constraints, such as those introduced in [18,19], to guarantee the smoothness of the resulting vector field. A novel iterative object articulation technique for stereoscopic image sequences is then used to segment the 3D vector field and thus to derive a foreground object articulation. Triangle motion estimation and classification are repeated iteratively until satisfactory object articulation is achieved. Rigid 3D motion estimation is performed next for each resulting sub-object, using motion information from both left and right cameras. Finally, a procedure is proposed for the

estimation of the non-rigid motion of each wireframe node based on the 3D motion of the neighbouring wireframe triangles. The paper is organised as follows. In Section 2 the camera geometry of the stereoscopic system is described. Next, in Section 3, an overview of the proposed stereoscopic image sequence analysis system is presented. Section 4 presents the techniques used for disparity/depth estimation, foreground/background segmentation and model initialisation, while Section 5 describes the initial adaptation of the 3D model to the foreground object. The technique used for object articulation is examined in Section 6, while the 3D motion estimation procedure used is presented in Section 7. The rigid 3D motion estimation procedure for each articulated 3D object is discussed in Section 7.1. Finally, in Section 7.2, an approach is considered for non-rigid motion estimation based on the rigid 3D motion vectors of small surface patches, computed during the object articulation procedure. Experimental results given in Section 8 demonstrate the performance of the proposed methods. Conclusions are drawn in Section 9.

2. Camera geometry

The geometry of the stereoscopic camera arrangement used is shown in Fig. 1, where three reference coordinate frames are defined:
World reference frame, attached to the imaged scene.
Camera reference frame, attached to the camera system. Notice that the Z-axis is the optical axis, while the X and Y axes are parallel to the image plane. Here c refers to the respective camera, i.e. c = l, r for the left and right cameras, respectively.
Image reference frame, where the X and Y axes, respectively, define the horizontal and vertical directions on the digital image, where again c = l, r refer to the images produced, respectively, by the left and right cameras.

Fig. 1. Stereoscopic camera geometry.

The camera geometry is described by the following set of equations mapping the 3D world-coordinates (x, y, z) of a generic point P into the 2D coordinates (X_c, Y_c) of its projection on the image planes:

Change of reference frame from world-coordinates to camera-coordinates:

  P_c = [x_c, y_c, z_c]^T = R_c [x, y, z]^T + T_c,   (1)

where c = l, c = r for the left and right cameras, respectively, and R_c and T_c are, respectively, the rotation matrix and the translation vector.

Perspective projection of a scene point to the image plane (the centre of projection is the centre of the lens and the projection plane is the camera CCD sensor):

  [X_c, Y_c]^T = (f_c/z_c) [x_c, y_c]^T.   (2)

Change of coordinate frame from camera-coordinates (X_c, Y_c) to image coordinates. This operation simply consists of a 2D translation and scale change:

  X_im^c = C_x^c + X_c/d_x^c,   Y_im^c = C_y^c + Y_c/d_y^c,   (3)

where d_x^c and d_y^c are the horizontal and vertical size of an image pixel, respectively, and (C_x^c, C_y^c) are the image coordinates of the optical centre OC_c in camera c. As seen from the above description, the camera geometry is completely specified by a small set of parameters estimated during camera calibration.

3. Overview of the stereoscopic image sequence analysis and coding scheme

In the proposed model-based stereoscopic image sequence analysis and coding scheme (see Fig. 2), both left and right channels are coded using 3D rigid and non-rigid motion compensation. The approach taken is to define fully 3D models of objects composed of interconnecting wire-mesh triangles. In this way, complete left-to-right object correspondence is intrinsically established.

Fig. 2. The proposed stereoscopic image sequence coding scheme.
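The world-to-image mapping of Eqs. (1)–(3) above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation; the calibration values (rotation, focal length, pixel size, optical centre) are invented for the example:

```python
import numpy as np

def world_to_camera(p_world, R_c, T_c):
    """Eq. (1): change of frame from world to camera coordinates."""
    return R_c @ p_world + T_c

def perspective_project(p_cam, f_c):
    """Eq. (2): perspective projection onto the image plane."""
    x, y, z = p_cam
    return np.array([f_c * x / z, f_c * y / z])

def camera_to_image(XY, C, d):
    """Eq. (3): 2D translation and scale change to pixel coordinates."""
    return C + XY / d

# Illustrative calibration values (assumed, not taken from the paper).
R_l = np.eye(3)                    # left camera aligned with the world frame
T_l = np.zeros(3)
f_l = 50.0                         # focal length in mm
C_l = np.array([320.0, 240.0])     # optical centre in pixels
d_l = np.array([0.01, 0.01])       # pixel size in mm

P = np.array([10.0, 5.0, 1000.0])  # a world point in front of the camera
p_cam = world_to_camera(P, R_l, T_l)
p_img = camera_to_image(perspective_project(p_cam, f_l), C_l, d_l)
print(p_img)                       # -> [370. 265.]
```

Each calibrated camera (c = l, r) carries its own (R_c, T_c, f_c, C^c, d^c) set, so the same three functions project a world point into both views.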

The left-to-right (LR) and right-to-left (RL) disparity fields are estimated first, using a hierarchical dynamic programming disparity estimation procedure. The consistency of the computed disparity fields is then checked for the 'points of interest', which are provided by the model initialisation procedure. A reliable disparity estimate is obtained for those points of interest with inconsistent left–right disparities. An initial foreground/background segmentation procedure follows, leading to a 3D wireframe model adapted to the foreground object using the reliable disparity estimates. In order to improve rigid 3D motion estimation, the foreground object is articulated, producing sub-objects defined by the homogeneity of their 3D motion. This rigid 3D motion is estimated using least median of squares minimisation of a cost function taking into account the reliability of the projected rigid 3D motion in both the left and right channels. Neighbourhood constraints are also imposed to improve the reliability of the motion estimation procedure. Following articulation, the rigid 3D motion of each sub-object produced is estimated using the same motion estimation procedure, without this time imposing neighbourhood constraints. Finally, the non-rigid motion of each node of the wireframe is estimated from the rigid 3D motion of the wireframe triangles containing this node as a vertex. A block diagram of the proposed encoder is shown in Fig. 3. Its constituent components are described in detail in the ensuing sections.

4. Depth estimation, scene segmentation and model initialisation

4.1. Disparity/depth estimation

Since the stereo camera configuration is known, the depth estimation problem reduces to that of disparity estimation [3,28,30]. A dynamic programming algorithm, minimising a combined cost function for two corresponding lines of the stereoscopic image pair, is used for disparity estimation.
The basic algorithm adapts the results of [5,24] using blocks rather than pixels. Furthermore, a novel hierarchical version of this algorithm is implemented so as to speed up its execution. The cost function takes into consideration the displaced frame difference (DFD) as well as the smoothness of the resulting vector field, in the following way. Due to the epipolar line constraint [3], the search area for each pixel p_r = (i_r, j_r) of the right image is an interval S(p_r) on the epipolar line in the left image, determined by a minimum and maximum allowed disparity. If p_l = (i_l, j_l) ∈ S(p_r) is the pixel in the left image matching with pixel p_r of the right image and if d(p_r) is the disparity vector corresponding to this match, the following cumulative cost function is minimised with respect to d(p_r) for the path ending at the pixel p_r in each line of the right image:

  C(i_r) = min_{d(p_r)} [C(i_r − 1) + c(p_r, d(p_r))].   (4)

The cost function c(p_r, d(p_r)) is determined by

  c(p_r, d(p_r)) = R(p_r) DFD(p_r, d(p_r)) + SMF(p_r, d(p_r)).   (5)

The first term in Eq. (5) contains the absolute difference of two corresponding image intensity blocks, centered at the working pixels p_r and p_l in the right and left images, respectively,

  DFD(p_r, d(p_r)) = Σ_{(X,Y) ∈ W} |I_r(i_r + X, j_r + Y) − I_l(i_l + X, j_l + Y)|,   (6)

where W is a rectangular window. Multiplication with the reliability function R(p_r) relaxes the DFD weight, keeping only the second term active in homogeneous regions where the matching reliability is small. The

Fig. 3. A block diagram of the proposed encoder.

disparity vector is considered reliable whenever it corresponds to a pixel on an edge or in a highly textured area. For the detection of edges and textured areas, a variant of the technique in [8] was used, based on the observation that highly textured areas exhibit high local intensity variance in all directions, while on edges the intensity variance is higher across the direction of the edge. The second term in Eq. (5) is the smoothing function,

  SMF(d(p_r)) = Σ_{n=1}^{N} |d(p_r) − d_n| R(d_n),   (7)
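The dynamic programming matcher of Eqs. (4)–(7) can be sketched for a single scanline. This is a deliberately simplified, pixel-based version: the reliability weight R(·) is omitted and the smoothness term is a plain penalty λ|d − d′| between successive disparities, so it is a sketch of the idea rather than the paper's block-based algorithm:

```python
import numpy as np

def scanline_disparity(right, left, d_max, lam=0.5):
    """DP along one epipolar line: DFD data term (Eq. (6), window of one
    pixel) plus a smoothness transition cost standing in for SMF."""
    n = len(right)
    big = 1e9
    dfd = np.full((n, d_max + 1), big)
    for i in range(n):
        for d in range(d_max + 1):
            if i + d < n:                    # stay inside the left line
                dfd[i, d] = abs(right[i] - left[i + d])
    C = dfd[0].copy()                        # cumulative cost, Eq. (4)
    back = np.zeros((n, d_max + 1), dtype=int)
    for i in range(1, n):
        prev = C.copy()
        for d in range(d_max + 1):
            trans = prev + lam * np.abs(np.arange(d_max + 1) - d)
            back[i, d] = int(np.argmin(trans))   # best predecessor disparity
            C[d] = dfd[i, d] + trans[back[i, d]]
    # Backtrack the minimum-cost path to recover the disparity profile.
    d = int(np.argmin(C))
    path = [d]
    for i in range(n - 1, 0, -1):
        d = back[i, d]
        path.append(d)
    return path[::-1]
```

On a synthetic pair in which the left line is the right line shifted by three pixels, the recovered path is the constant disparity 3 over the overlapping region.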

where d_n, n = 1, 2, …, N, are the vectors neighbouring d(p_r). Multiplication by the factor R(d_n) aims to attenuate the contribution of unreliable vectors to the smoothing function. Finally, the dynamic programming algorithm selects as the best path up to that stage the one with the minimum cumulative cost (Eq. (4)). A hierarchical version of this approach was utilised in order to speed up the estimation process and to produce a smooth disparity field without discontinuities. In this version, the dynamic programming algorithm is applied at the coarse resolution level and an initial estimate for the disparity vectors is produced. The disparity information is then propagated to the next resolution level, where it is corrected so that the cost function is further minimised. This process is iterated until full resolution is achieved. Along with the dense disparity field, the variance of the disparity estimate for each pixel of the image is also computed, using

  σ²(p_r, d(p_r)) = (1/N_W) Σ_{(k,l) ∈ W} (I_r(i_r + k, j_r + l) − I_l(i_l + k, j_l + l))²,   (8)

where N_W = (2N + 1)(2N + 1) is the dimension of the rectangular window W. Finally, depth is estimated from disparity, using the camera geometry as in [32].

4.2. Foreground/background segmentation

The model is initialised by separating the body in the videoconference scene from the background using an initial foreground/background segmentation procedure. The depth map produced by the method in Section 4.1, applied to the full resolution image, may be used for this purpose. However, to reduce as much as possible the effects of errors in depth estimation, we propose instead the use of a hierarchical foreground/background segmentation, focused on the determination of only the largest disparity vectors. These vectors correspond to foreground objects (objects that lie very close to the camera). This information is propagated to the higher resolution level, where it is corrected in a coarse to fine manner.
Thus, by carefully selecting the search area of the disparity estimator at each resolution, an initial foreground/background segmentation mask is formed. The resulting segmentation map is then post-processed using a motion detection mask and the luminance edge information. The motion detection mask is defined by simple subtraction of consecutive frames of the same channel of the image sequence. Note that in this phase the aim is not to calculate motion accurately, but rather to identify regions with very high or very low motion. The motion detection mask contains important information for both inner and boundary areas of the foreground object, while luminance edge information carries important information about errors that occur mainly on the silhouette (border) of the foreground object. The foreground object boundary is found as the part of the image where both the depth gradient and the luminance gradient are high. Summarising, the following algorithm is used for foreground/background separation, as shown in Fig. 4: The disparity information in level l of the algorithm is segmented using a histogram-based segmentation algorithm, and areas corresponding to large disparity values are identified as objects close to the camera. The segmentation information is propagated to the finer resolution level, where it is corrected appropriately. At the full resolution level, the resulting segmentation mask is post-processed using motion and luminance information as follows: each portion of the scene designated as background by the disparity segmentation procedure is reexamined in view of its motion u and its depth and luminance gradients g (Fig. 4). If all these parameters exceed preselected thresholds, this portion of the scene is confirmed as being part of the foreground. Otherwise, it is relegated to the background.
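The full-resolution post-processing step above can be sketched as follows. This is a rough, single-level illustration under stated assumptions: the histogram-based split is approximated by a simple mean threshold, and the motion/gradient thresholds are invented for the example:

```python
import numpy as np

def fg_bg_mask(disparity, motion, grad, d_thresh=None, m_thresh=1.0, g_thresh=1.0):
    """Pixels with large disparity are marked foreground; background
    pixels whose motion AND gradient both exceed their thresholds are
    promoted to foreground, mirroring the reexamination step."""
    if d_thresh is None:
        # crude stand-in for the histogram-based segmentation threshold
        d_thresh = disparity.mean()
    fg = disparity > d_thresh
    promote = (~fg) & (motion > m_thresh) & (grad > g_thresh)
    return fg | promote
```

A real implementation would run this coarse-to-fine, restricting the disparity search area at each level as the text describes.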

Fig. 4. The proposed foreground/background estimation scheme.

A 3D wireframe is adapted to the foreground object produced by the described procedure. Then, using the reliable depth estimates as described in the following sections, the final 3D model is created.

4.3. Consistency checking and disparity evaluation for the points of interest

A set F of points of interest is first defined, composed of points in the 3D space with left or right image projections located on depth and luminance edges. The latter are extracted using the edge detection algorithm presented in Section 4.1. For each of these points, the disparity estimation algorithm produces left-to-right (LR) and right-to-left (RL) disparity fields. However, the LR and RL disparity fields may be

inconsistent because of occlusions and errors in the disparity estimation procedure. Thus, a consistency checking algorithm is used to indicate the correct matches, followed by an averaging procedure (Kalman estimate) which assigns a depth value to pixels with inconsistent matches. More specifically, the correspondence between pixels p_r = (i_r, j_r) of the right image and p_l = (i_l, j_l) of the left image is considered consistent if

  d(p_r) = −d(p_r + d(p_r)).

If the above relation is not valid, a more reliable depth estimate must be assigned to that pixel. The method in [9] is applied to this effect, using the reliability of the disparity estimates as a weighting function. Specifically, the disparity d_r = d(p_r) and the disparity d_l = d(p_l) satisfying

  p_l = p_r + d(p_r),   (9)

are averaged with respect to their disparity error variances as follows:

  d̂ = (d_r σ_l² − d_l σ_r²)/(σ_r² + σ_l²),   σ̂² = σ_r² σ_l²/(σ_r² + σ_l²),

where σ_r² and σ_l² are, respectively, the variances of the disparity estimates d_r and d_l, computed at the disparity/depth estimation stage using Eq. (8), and σ̂² is the variance of the averaged disparity. If more than one disparity vector d(p_l) satisfies Eq. (9), the one with the minimum estimation variance is selected. The consistency checking algorithm is applied to the set of all points of interest, selected as above so as to have projections on depth and luminance edges, and reliable depth estimates for pixels with either consistent or corrected disparity are obtained, to be used for model initialisation. The result of this procedure is a set F of points of interest (x_i, y_i, z_i) whose projections are located on the foreground depth map and luminance edges of either the left or right camera, where z_i is their estimated depth.

5. Initial 3D model adaptation

For the generation of the 3D model object, depth information must be modeled using a wire mesh.
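The per-point consistency check and variance-weighted fusion can be sketched as below. The consistency tolerance is an assumption (the paper does not state one), and the sign convention follows the consistency relation d(p_r) = −d(p_l):

```python
def fuse_disparities(d_r, var_r, d_l, var_l, tol=0.5):
    """Check d_r == -d_l (the LR/RL consistency relation); if it fails,
    return the variance-weighted (Kalman-like) average of the two
    estimates and the variance of the fused estimate."""
    if abs(d_r + d_l) <= tol:
        return d_r, var_r, True            # consistent match, keep d_r
    d_hat = (d_r * var_l - d_l * var_r) / (var_r + var_l)
    var_hat = var_r * var_l / (var_r + var_l)
    return d_hat, var_hat, False
```

Note that the fused variance is always smaller than either input variance, which is why the corrected estimates are considered reliable enough for model initialisation.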
We shall generate a surface model of the form [17]

  z = s(x, y, P̂),   (10)

where P̂ = {(x_i, y_i, z_i), i = 1, 2, …, N} is a set of 3D 'control' points or 'nodes' that determine the shape of the surface, and (x, y, z) are the coordinates of a 3D point. An initial choice for P̂ is the regular tessellation shown in Fig. 5(a). The consistency checking algorithm, described in the previous section, is applied to all control points to assign corrected depth values to every node of the 3D model. Automatic adaptation of the 3D model (Fig. 5(c) and (d)) to the foreground object is sought by forcing the 3D model to meet the boundary of the foreground/background segmentation mask (Fig. 5(b)). A set of reference image points G = {(x_i, y_i, z_i), i = 1, 2, …, Q} is defined as the aggregate of F and P̂:

  G = F ∪ P̂,   (11)

where F is the set of points of interest defined in the preceding section and P̂ are the nodes of the 3D model with the corrected depth values. Then, s can be modelled by a piecewise linear surface, consisting of adjoint triangular patches, which may be written in the form

  z = z_1 g_1(x, y) + z_2 g_2(x, y) + z_3 g_3(x, y),   (12)

Fig. 5. (a) Initial triangulation of the image plane. (b) Foreground/background segmentation. (c) Part of the initial triangulation corresponding to the foreground object. (d) Expanded wireframe adapted to the foreground object. (e) Barycentric coordinates.

if (x, y, z) is a point on the triangular patch with vertices P_1 = (x_1, y_1, z_1), P_2 = (x_2, y_2, z_2), P_3 = (x_3, y_3, z_3). The functions g_i(x, y) are the barycentric coordinates of (x, y, z) relative to the triangle, given by g_i(x, y) = area(a_i)/area(P_1 P_2 P_3) (Fig. 5(e)). The reconstruction of a surface from consistent sparse depth measurements may be effected by minimising a functional of the form

  E(P̂) = Σ_{i=1}^{Q} (s(x_i, y_i, P̂) − z_i)².   (13)

The value of the sum (13) expresses confidence in the reference points (x_i, y_i, z_i) ∈ G, i = 1, 2, …, Q. Note that no smoothness constraint is imposed on the surface of the 3D model, since the depth estimates for these points are considered very reliable. Replacing Eq. (12) in Eq. (13) yields

  E(P̂) = ‖AP̂ − B‖²,   (14)

where A is a Q×N matrix and B a Q×1 vector given by

  A_ij = g_j(x_i, y_i) if (x_i, y_i) lies inside a triangle having node j as a vertex, and A_ij = 0 otherwise, i = 1, 2, …, Q, j = 1, 2, …, N,
  B_i = z_i, i = 1, 2, …, Q.

The vector P̂ minimising Eq. (14) is

  P̂ = (AᵀA)⁻¹AᵀB,   (15)

which defines the nodes of the wire-mesh surface. Using Eq. (12), the depth z of any point on a patch can be expressed in terms of the depth information of the nodes of the wireframe and the X and Y coordinates of that point. Hence, full depth information will be available if only the depths of the nodes of the wireframe are transmitted.

6. Object articulation

A novel subdivision method, based on the rigid 3D motion parameters of each triangle and the error variance of the rigid 3D motion estimation, is proposed for the articulation of the foreground object (separation of the head and shoulders). The model initialisation procedure described above results in a set of interconnected triangles in the 3D space: T_k, k = 1, 2, …, K, where K is the number of triangles of the 3D model. In the following, S^i will denote an articulation of the 3D model at iteration i of the articulation algorithm, consisting of sub-objects s_k^i, k = 1, 2, …, M_i.
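The least-squares fit of Eqs. (13)–(15) can be sketched for a single triangular patch. This is a minimal illustration, not the full wireframe adaptation: one triangle, barycentric coordinates computed from signed areas, and the normal-equation solution obtained with a least-squares solver:

```python
import numpy as np

def barycentric(p, tri):
    """Barycentric coordinates g_1, g_2, g_3 (Eq. (12)) of 2D point p
    with respect to triangle tri, via ratios of signed areas."""
    (x1, y1), (x2, y2), (x3, y3) = tri
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    g1 = ((y2 - y3) * (p[0] - x3) + (x3 - x2) * (p[1] - y3)) / det
    g2 = ((y3 - y1) * (p[0] - x3) + (x1 - x3) * (p[1] - y3)) / det
    return np.array([g1, g2, 1.0 - g1 - g2])

def fit_node_depths(points, depths, tri):
    """Solve min ||A p - B||^2 (Eqs. (14)-(15)) for the node depths of a
    one-triangle patch; A holds the barycentric weights of each point."""
    A = np.array([barycentric(p, tri) for p in points])   # Q x 3 here
    B = np.asarray(depths)
    return np.linalg.lstsq(A, B, rcond=None)[0]
```

Because barycentric interpolation reproduces any plane exactly, fitting depth samples drawn from the plane z = 2 + 3x + 4y recovers the node depths (2, 5, 6) exactly.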
The proposed iterative object articulation procedure is composed of the following steps:

Step 1. Set i = 0. Let the initial segmentation be S⁰ = {s_k⁰, k = 1, 2, …, K}, with s_k⁰ = T_k. Let also the initial neighbourhood of each triangle be empty, i.e. T_k^S = ∅.
Step 2. Apply the 3D rigid motion estimation algorithm to each triangle T_k, taking into account the neighbourhood constraint imposed by the neighbourhood T_k^S. This constraint is described in detail in Section 6.1 below.
Step 3. Set i = i + 1. Execute the object segmentation procedure that subdivides the initial object into M_i sub-objects, i.e. S^i = {s_k^i, k = 1, 2, …, M_i}.
Step 4. Use the segmentation map S^i to define the new neighbourhood T_k^S of each triangle T_k.
Step 5. If S^i = S^{i−1} then stop. Else go to Step 2.

The proposed algorithm can also be explained by the example of Fig. 6(a)–(c). Fig. 6(a) illustrates the initial phase of the algorithm, where each triangle is treated as an object. The estimated rigid 3D motion

Fig. 6. (a) Initial phase of the object articulation algorithm. (b) The output of the rigid 3D motion estimation procedure for each triangle. (c) The output of the object segmentation procedure. (d) Non-rigid 3D motion estimation example. The light grey vector represents the rigid motion of the working node, while the black vectors represent estimates for the motion of the same node using the 3D motion parameters corresponding to each triangle containing the working node.

vectors of each triangle, computed at the second step of the proposed algorithm, are shown in Fig. 6(b), and the output of the object segmentation procedure of Step 3 is shown in Fig. 6(c). Based on this new object segmentation map, the rigid 3D motion estimation and object segmentation procedures are then further refined iteratively. The 3D motion estimation of each triangle and the object segmentation procedure are described in more detail below.

6.1. Rigid 3D motion estimation of small surface patches

The foreground object of a typical videophone scene is composed of more than one sub-object (head, neck, shoulders, etc.), each of which exhibits different rigid 3D motion. Thus, object articulation has to be completed and the rigid motion of each sub-object must be estimated. For rigid 3D motion estimation of each triangle T_k we use least median of squares minimisation. This procedure removes the outliers from the initial data set and finds the estimate that minimises the median of the square error. More specifically, the rigid motion of each triangle T_k, k = 1, 2, …, K, where K is the number of triangles in the foreground object, is modeled using a linear 3D model, with three rotation and three translation parameters [1]:

  [x(t+1)]   [  1   −w_z   w_y ] [x(t)]   [t_x]
  [y(t+1)] = [ w_z    1   −w_x ] [y(t)] + [t_y],   (16)
  [z(t+1)]   [−w_y   w_x    1  ] [z(t)]   [t_z]

where (x(t), y(t), z(t)) is a point on the plane defined by the coordinates of the vertices of triangle T_k. Since the triangle motion is to be used for object articulation, neighbourhood constraints are needed for the estimation of the model parameter vector a = (w_x, w_y, w_z, t_x, t_y, t_z), in order to guarantee a smooth estimated triangle motion vector field that can be successfully segmented. Let N_k be the ensemble of the triangles neighbouring the triangle T_k.
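The linearised motion model of Eq. (16) is simply P(t+1) = (I + [w]×) P(t) + T, which can be sketched as:

```python
import numpy as np

def rigid_motion(points, a):
    """Apply the small-angle rigid motion model of Eq. (16) to an
    (n, 3) array of points; a = (wx, wy, wz, tx, ty, tz)."""
    wx, wy, wz, tx, ty, tz = a
    R = np.array([[1.0, -wz,  wy],
                  [ wz, 1.0, -wx],
                  [-wy,  wx, 1.0]])   # I + [w]x, valid for small angles
    T = np.array([tx, ty, tz])
    return points @ R.T + T
```

With zero rotation the model reduces to a pure translation, and a small w_z rotates points about the z-axis to first order.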
If triangle T_k belongs to the region s_m^i of S^i at iteration i of the object articulation algorithm, we define as neighbourhood T_k^S of the triangle T_k the set of triangles

  T_k^S = {T_l ∈ N_k : T_l ∈ s_m^i}.

For example, in order to define the neighbourhood of triangle A in Fig. 6(c), we first consider all triangles that share at least one common vertex with triangle A (i.e. N_A = {B, C, D, E, H, I, J, K}). From the set N_A, only the triangles belonging to the same object as triangle A are finally defined as the neighbourhood of triangle A (i.e. T_A^S = {B, C, D, E}). Then, for each triangle T_k, the set of points belonging to T_k^S is input to the 3D rigid motion estimation procedure, so as to smooth the motion field produced.

6.2. The 3D motion estimation algorithm

For the estimation of the model parameter vector a = (w_x, w_y, w_z, t_x, t_y, t_z) for each neighbourhood T_k^S at iteration i of the object articulation procedure, the MLMS iterative algorithm [27] was used. The MLMS algorithm is based on median filtering and is very efficient in suppressing noise with a large amount of outliers (i.e. in situations where conventional least-squares techniques usually fail). As noted in the previous sections, the 3D motion of each extended neighbourhood T_k^S of a triangle T_k is modelled in the global coordinate system by

  P(t+1) = R_k P(t) + T_k,   (17)
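The neighbourhood construction just described (the vertex-sharing set N_k restricted to the same sub-object) can be sketched as follows, with triangles represented as vertex-index triples and a per-triangle segment label:

```python
def triangle_neighbourhoods(triangles, labels):
    """For each triangle T_k, return the set T_k^S: triangles sharing at
    least one vertex with T_k (the set N_k), restricted to triangles
    carrying the same sub-object label."""
    neigh = []
    for k, tk in enumerate(triangles):
        vk = set(tk)
        n_k = [l for l, tl in enumerate(triangles)
               if l != k and vk & set(tl)]        # share >= 1 vertex
        neigh.append([l for l in n_k if labels[l] == labels[k]])
    return neigh
```

On a tiny strip of four triangles split into two sub-objects, the last triangle is vertex-disjoint from the rest, so its neighbourhood is empty even before the label restriction.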

where the matrix R_k and the vector T_k are defined from Eq. (16). Since initial motion estimates are available in the left and right camera images, the rigid 3D motion must be projected on the left and right coordinate systems. Using Eqs. (17) and (1),

  P_c(t+1) = R_k^c P_c(t) + T_k^c,   (18)

where

  R_k^c = R_c R_k R_c⁻¹,   (19)

  T_k^c = −R_c R_k R_c⁻¹ T_c + R_c T_k + T_c,   (20)

and R_k^c and T_k^c are the 3D motion rotation and translation matrices corresponding to camera c and triangle k. Using the fact that the matrices R_k^c and T_k^c are of the form

  R_k^c = [  1    −w_z^c   w_y^c ;  w_z^c   1   −w_x^c ;  −w_y^c   w_x^c   1 ],   T_k^c = [t_x^c ; t_y^c ; t_z^c],   (21)

and also using Eqs. (2) and (3), the projected 2D motion vector in camera c, d_c(X, Y) = (d_X^c(X(t), Y(t)), d_Y^c(X(t), Y(t))), is given by

  d_X^c(X(t), Y(t)) = f_c [−w_x^c x_c y_c + w_y^c (x_c² + z_c²) − w_z^c y_c z_c + t_x^c z_c − t_z^c x_c] / [(−w_y^c x_c + w_x^c y_c + z_c + t_z^c) z_c d_x^c],   (22)

  d_Y^c(X(t), Y(t)) = f_c [w_x^c (y_c² + z_c²) − w_y^c x_c y_c − w_z^c x_c z_c − t_y^c z_c + t_z^c y_c] / [(−w_y^c x_c + w_x^c y_c + z_c + t_z^c) z_c d_y^c].   (23)

Using the initially estimated 2D motion vectors corresponding to the left and right cameras and Eqs. (22) and (23), along with Eqs. (19) and (20) evaluated for c = l and c = r, a linear system for the global motion parameter vector a_k for triangle T_k is formed. Note that the parameters of a_k are implicitly contained in Eqs. (22) and (23), since a_k^c = (w_x^c, w_y^c, w_z^c, t_x^c, t_y^c, t_z^c) and a_k are related by Eqs. (19) and (20). This is a system of 2(N_k^l + N_k^r) equations with six unknowns, where N_k^l and N_k^r are the numbers of reference points in the set G of Eq. (11) contained in the plane defined by the coordinates of the vertices of triangle k, in the left and right image planes, respectively. If 2(N_k^l + N_k^r) ≥ 6, the system can be solved using least-squares methods or, alternately, by the robust least median of squares motion estimation algorithm described in detail in [27]. The reference points initially chosen should be enough to guarantee a sufficient number of equations for each triangle. As explained in Section 5, this is ensured by choosing in Eq.
(11) as reference points all triangle vertices plus the points of interest on depth and luminance edges.

6.3. Object segmentation

At each iteration of the object articulation method, the rigidity constraint imposed on each rigid object component is exploited. This constraint requires that the distance between any pair of points of a rigid object component must remain constant at all times and configurations. Thus, the motion of a rigid model object component represented by a mesh of triangles can be completely described using the same six motion parameters. Therefore, to achieve object articulation, neighbouring triangles which exhibit similar 3D motion parameters are clustered into patches. In an ideal case, these patches will represent the complete visible surface of the moving object components of the articulated object.
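Returning briefly to the projection relations: the closed form of Eq. (22) can be verified numerically against a direct computation that moves the 3D point with the linear model of Eq. (16) and projects it before and after with Eqs. (2) and (3). A minimal sketch, with the camera index dropped and f = d_x = 1 assumed for brevity:

```python
def projected_dx(p, a, f=1.0, dx=1.0):
    """Horizontal image displacement, closed form of Eq. (22)."""
    x, y, z = p
    wx, wy, wz, tx, ty, tz = a
    num = -wx * x * y + wy * (x * x + z * z) - wz * y * z + tx * z - tz * x
    den = (-wy * x + wx * y + z + tz) * z * dx
    return f * num / den

def projected_dx_direct(p, a, f=1.0, dx=1.0):
    """Same displacement computed directly: apply the linear motion
    model of Eq. (16), then project before and after (Eqs. (2)-(3))."""
    x, y, z = p
    wx, wy, wz, tx, ty, tz = a
    x1 = x - wz * y + wy * z + tx      # first row of Eq. (16)
    z1 = -wy * x + wx * y + z + tz     # third row of Eq. (16)
    return (f * x1 / z1 - f * x / z) / dx
```

The two agree exactly (to floating-point precision), since Eq. (22) is an algebraic rearrangement of the projected displacement, not an approximation.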

More specifically, the following iterative algorithm is proposed:

Step 1. Set j = 0. Set $M_0 = K$. Set $S_0 = S$.

Step 2. For each patch $s_k$, k = 1, …, $M_j$, execute the following clustering algorithm.

Step 3. For all the patches $s_l$ that belong to the neighbourhood of $s_k$, if

$\left\| \dfrac{\sigma_l a_k + \sigma_k a_l}{\sigma_k + \sigma_l} - a_l \right\| \le th$,

cluster $s_l$ to $s_k$, i.e. set $s_k = s_k \cup s_l$ and $M_j = M_j - 1$. In the above, $a_m$, m = k, l, are the motion parameters and $\sigma_m$, m = k, l, is the variance of the 3D motion estimate, i.e. the displaced frame difference (DFD) of patch $s_m$ computed by compensating the projected 3D motion in the left and right cameras, and th is a threshold. Also,

$\sigma_k = \dfrac{1}{N_k} \sum_{P \in s_k} (I_l(P(t)) - I_l(P(t+1)))^2 + \dfrac{1}{N_k} \sum_{P \in s_k} (I_r(P(t)) - I_r(P(t+1)))^2$,

where $N_k$ is the number of points contained in patch $s_k$, and P(t) and P(t+1) are two corresponding points at time instants t and t+1, respectively.

Step 4. Set j = j + 1 and $M_j = M_{j-1}$. Set $S_j = \{s_k,\ k = 1, \ldots, M_j\}$. If $S_j = S_{j-1}$, stop. Else go to Step 2.

7. 3D motion estimation of each sub-object

7.1. Rigid 3D motion estimation of each sub-object

The object articulation procedure identifies a number of sub-objects of the 3D model object, as areas with homogeneous motion. A sub-object $s_k$ represents a surface patch of the 3D model object consisting of $N_k$ control points and $q_k$ triangles. A sub-object may consist of $q_k = 1$ triangle only. The motion of an arbitrary point P(t) on the sub-object $s_k$ to its new position P(t+1) is described by

$P(t+1) = R_k P(t) + T_k$,  (24)

where k = 1, …, M and M is the number of sub-objects, and where, as before [1],

$R_k = \begin{pmatrix} 1 & -w_z & w_y \\ w_z & 1 & -w_x \\ -w_y & w_x & 1 \end{pmatrix}, \qquad T_k = \begin{pmatrix} t_x \\ t_y \\ t_z \end{pmatrix}$.

For the estimation of the model parameter vector $a = (w_x, w_y, w_z, t_x, t_y, t_z)$ the MLMS iterative algorithm described earlier is used, this time without imposing neighbourhood constraints.

7.2. Non-rigid 3D motion estimation

The rigid motion of the articulated objects cannot compensate errors occurring due to local motion (such as the movement of eyes and lips).
These errors can only be compensated by appropriately deforming the nodes of the wireframe, so that they also follow the local motion. An analysis-by-synthesis approach is proposed for the computation of the non-rigid motion $\tilde{\Delta}_i$ at node i, which minimises the DFD between the image frame at time t+1 and the 3D non-rigid motion compensated estimate of frame t+1 from frame t.
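The candidate selection that this analysis-by-synthesis approach reduces to, namely evaluating the summed left-plus-right DFD for every candidate and keeping the minimiser, can be sketched as follows. This is a toy version operating on already-projected 2D displacements and grayscale frames; the function names and the rounding to integer pixel positions are assumptions:

```python
import numpy as np

def dfd(img_t, img_t1, pts, disp):
    """Mean squared displaced-frame difference over a set of 2D points.

    pts  : (N, 2) integer pixel coordinates (x, y) at time t
    disp : 2-vector, a candidate 2D projection of the node's motion
    """
    moved = pts + np.round(disp).astype(int)
    d = img_t[pts[:, 1], pts[:, 0]].astype(float) - \
        img_t1[moved[:, 1], moved[:, 0]].astype(float)
    return np.mean(d ** 2)

def pick_candidate(img_l_t, img_l_t1, img_r_t, img_r_t1,
                   pts_l, pts_r, candidates):
    """Keep the candidate minimising the summed left + right DFD."""
    costs = [dfd(img_l_t, img_l_t1, pts_l, c) + dfd(img_r_t, img_r_t1, pts_r, c)
             for c in candidates]
    return candidates[int(np.argmin(costs))]
```

With a synthetic frame pair where the content shifts one pixel to the right, the selection correctly prefers the matching displacement over the alternatives.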

More specifically, the 3D motion $r_i$, i = 1, …, N, of the wire-mesh nodes is governed by (24). Alternative estimates of the motion of the same node are provided by applying to (16) the 3D motion parameters originally estimated for a triangle containing node i (see Fig. 6(d)). Since the motion of each triangle reflects both global rigid motion and local deformations, the difference of these two estimates of the motion of each node may be assumed to approximate the non-rigid motion component of the node. If $r_i$ is the rigid 3D motion of node i and $\Delta_i^k$, k = 1, …, $N_i$, are the estimates for the motion of node i produced by rotating and translating this node with the rotation and translation parameters corresponding to each triangle $T_k$ containing node i, we define as candidates for the minimisation of the DFD of the reconstruction error

$\tilde{\Delta}_i^k = \Delta_i^k - r_i, \quad k = 1, \ldots, N_i$,  (25)

where $N_i$ is the number of neighbourhood triangles of node i. The final non-rigid motion vector $\tilde{\Delta}_i$ is chosen to be

$\tilde{\Delta}_i = \arg\min_{\tilde{\Delta}_i^k} \left( DFD_l(\tilde{\Delta}_i^k) + DFD_r(\tilde{\Delta}_i^k) \right)$,

where

$DFD_c(\tilde{\Delta}_i^k) = \dfrac{1}{N_R} \sum_{P(t) \in R} \left( I_c(P(t)) - I_c(P(t) + r_i + \tilde{\Delta}_i^k) \right)^2$.

In the above equation, P(t) are the 3D coordinates of node i at time instance t, and $P(t) + r_i + \tilde{\Delta}_i^k$ are the corresponding corrected coordinates at time instance t+1 corresponding to the non-rigid motion vector $\tilde{\Delta}_i^k$. The intensities $I_c$ at time instances t and t+1 are calculated for cameras c = l, r over a region R defined as the aggregate of the planes of all triangles containing node i, and $N_R$ is the number of points contained in region R.

8. Experimental results

The proposed model-based analysis and coding method was evaluated on the right and left channels of a stereoscopic image sequence. The first frame of both channels is transmitted using intra-frame coding techniques, as in H.263 [20].
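The six-parameter small-angle motion model of Eqs. (21) and (24), and its projection to 2D image displacements, can be illustrated numerically. The sketch below computes the displacement of the perspective projections directly, which agrees with the closed forms of Eqs. (22) and (23); the focal length and point coordinates are arbitrary:

```python
import numpy as np

def rigid_motion(points, a):
    """Apply the linearised rigid motion P(t+1) = R P(t) + T of Eq. (24).

    points : (N, 3) array of 3D points
    a      : (wx, wy, wz, tx, ty, tz), small-angle parameterisation of Eq. (21)
    """
    wx, wy, wz, tx, ty, tz = a
    R = np.array([[1.0, -wz,  wy],
                  [ wz, 1.0, -wx],
                  [-wy,  wx, 1.0]])
    T = np.array([tx, ty, tz])
    return points @ R.T + T

def project_displacement(points, a, f=1.0):
    """2D displacement of the perspective projections X = f*x/z, Y = f*y/z,
    i.e. the quantity given in closed form by Eqs. (22) and (23)."""
    p0, p1 = points, rigid_motion(points, a)
    proj = lambda p: f * p[:, :2] / p[:, 2:3]
    return proj(p1) - proj(p0)
```

For a pure translation $t_x$ the displacement reduces to $f\,t_x/z$, and for general parameters the result matches Eq. (22) evaluated term by term.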
The performance of the proposed methods was investigated in application to the compression of the interlaced stereoscopic videoconference sequence 'Claude'. The hierarchical dynamic programming procedure for matching across the epipolar line, described in Section 4.1, with two levels of hierarchy, was used for LR and RL disparity/depth estimation. The search area for disparity was chosen to be ±62 and ±2 half-pixels for the x and y coordinates, respectively. Fig. 7(b) and (d) show the computed left and right channel depth maps using the hierarchical dynamic programming approach. The depth map has the same resolution as the original image (since it is computed from a dense disparity field). Depth information is quantised to 256 levels. Darker areas represent objects closer to the cameras. The smoothing properties of the dynamic programming method are seen to result in more realistic depth-map estimates. Foreground/background separation is performed next, using the coarse-to-fine technique described in Section 4.2. The motion detection mask along with the luminance edge information are then used to improve the results of the initial segmentation. The resulting foreground/background mask of 'Claude' is shown in Fig. 5(b). This sequence was prepared by THOMPSON BROADCASTING SYSTEMS for use in the DISTIMA RACE project.
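A drastically simplified, hypothetical illustration of dynamic-programming matching along one epipolar scanline follows (integer disparities, absolute-difference data cost, L1 smoothness penalty; the actual matcher of Section 4.1 is hierarchical and works at half-pixel precision over a 2D search range):

```python
import numpy as np

def scanline_disparity(left, right, max_d=4, lam=0.1):
    """Viterbi-style DP over one epipolar scanline.

    Data cost: |left[x] - right[x-d]|; smoothness: lam * |d - d_prev|.
    Returns one integer disparity per pixel of `left`.
    """
    n, D = len(left), max_d + 1
    cost = np.full((n, D), np.inf)   # best cumulative cost ending at (x, d)
    back = np.zeros((n, D), dtype=int)
    for x in range(n):
        for d in range(D):
            if x - d < 0:            # right-image sample would fall off the border
                continue
            data = abs(float(left[x]) - float(right[x - d]))
            if x == 0:
                cost[x, d] = data
                continue
            prev = cost[x - 1] + lam * np.abs(np.arange(D) - d)
            best = int(np.argmin(prev))
            if np.isfinite(prev[best]):
                cost[x, d] = data + prev[best]
                back[x, d] = best
    disp = np.zeros(n, dtype=int)    # backtrack the minimum-cost path
    disp[-1] = int(np.argmin(cost[-1]))
    for x in range(n - 2, -1, -1):
        disp[x] = back[x + 1, disp[x + 1]]
    return disp
```

On a synthetic scanline pair shifted by a constant disparity of 2, the DP recovers that disparity wherever both samples are available, and the smoothness term keeps the path continuous.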

Fig. 7. (a) Original left channel image 'Claude' (frame 2). (b) Corresponding depth map estimated using dynamic programming. (c) Original right channel image 'Claude' (frame 2). (d) Corresponding depth map estimated using dynamic programming.

The LR and RL disparity estimates are then subjected to the consistency checking procedure, and inconsistent matches are corrected by reliably fusing LR and RL information as described in Section 4.3. On the basis of the consistent and the corrected depth information at all reference points, the wireframe model is adapted to the foreground object (Fig. 5(c) and (d)). The rigid 3D motion of each triangle of the foreground object is computed next, using the technique described in Section 6.1. The output of the proposed local 3D motion estimator is a set of 3D motion parameters assigned to each triangle of the wireframe 3D model. In order to show the resulting local 3D motion, we have produced a visualization of the rotation and translation parameters of the homogeneous motion matrix. For the rotation parameters, the direction of the vector assigned to each triangle shows the rotation axis, while the size of the vector as well as the color of the triangle show the magnitude of the angle of the rotation. For the translation parameters, the direction of the vector at each triangle shows the direction of the local 3D motion, while the size of the vector as well as the color of the triangle show the magnitude of the translation.

Fig. 8. (a) Visualization of the rotation parameters of the rigid 3D motion for each triangle of the 3D model of 'Claude'. (b) Visualization of the translation parameters of the rigid 3D motion for each triangle of the 3D model of 'Claude'.

The visualisation of the rotation and the translation parameters of the rigid 3D motion of each wireframe triangle for 'Claude' is shown in Fig. 8(a) and (b), respectively. It demonstrates that the head and the shoulders undergo different motion. The resulting articulation of the foreground object achieved by the methods of Section 6 is shown in Fig. 9(a). The accuracy of this object articulation is remarkable and is due to the fact that the foreground/background segmentation and the object articulation procedures are defined in 3D space in terms of triangle rather than pixel motion. In this way, complete correspondence is achieved between objects in the right and left channel images.

Following object articulation, the algorithm presented in Section 7.1 is used for rigid 3D motion estimation. The computed motion parameter vectors between frames 1 and 2 for the head and shoulders sub-objects are shown in Fig. 9(b). As seen, the 3D motion of the 'shoulders' sub-object is negligible, while the 3D motion of the 'head' has significant rotation and translation parameters (this can also be observed by examining the original frames 1 and 2 of 'Claude'). The performance of the algorithm in terms of PSNR is evaluated in Tables 1 and 2, where the quality of the reconstruction of the whole image, as well as of the 'head' and 'shoulders' sub-objects alone, is presented. Fig. 10(a) and (c) show the reconstructed left and right images, respectively, using rigid 3D motion compensation, while Fig. 10(b) and (d) show the corresponding prediction errors. As seen, rigid 3D motion is not sufficient for very accurate reconstruction of the 'head' sub-object, and thus non-rigid 3D motion must be used to improve the performance of the algorithm. The analysis-by-synthesis technique presented in Section 7.2 is then used for non-rigid 3D motion estimation. The quality of the reconstruction in terms of PSNR is reported in Tables 1 and 2, where an improvement of about 1 dB is seen to be achieved by non-rigid 3D motion compensation. Fig. 11(a) and (c) show details of the reconstruction error using rigid 3D motion compensation of the left and right images, respectively, while Fig. 11(b) and (d) show the corresponding prediction errors using non-rigid 3D motion compensation.
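The PSNR figures reported in Tables 1 and 2 and in Figs. 12 and 13 follow the standard definition. A small sketch, in which the optional boolean mask restricting the computation to a sub-object region (e.g. the 'head') is an assumption about how the per-sub-object numbers were obtained:

```python
import numpy as np

def psnr(orig, recon, mask=None, peak=255.0):
    """PSNR in dB between an original frame and its reconstruction.

    mask : optional boolean array selecting a sub-object region
    peak : peak signal value for 8-bit imagery
    """
    diff = orig.astype(float) - recon.astype(float)
    if mask is not None:
        diff = diff[mask]          # evaluate only inside the sub-object
    mse = np.mean(diff ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```

Masking out the erroneous region of a reconstruction raises the measured PSNR, which is exactly the mechanism behind reporting 'whole image', 'head' and 'shoulders' figures separately.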

Fig. 9. (a) Articulation of the foreground object. (b) The 3D motion parameter vectors corresponding to the head and shoulder sub-objects.

Table 1. PSNR (dB) of the reconstruction of the left channel of frame 2 of 'Claude' using rigid and non-rigid 3D motion compensation (rows: whole image, head, shoulders; columns: rigid, non-rigid).

Table 2. PSNR (dB) of the reconstruction of the right channel of frame 2 of 'Claude' using rigid and non-rigid 3D motion compensation (rows: whole image, head, shoulders; columns: rigid, non-rigid).

The proposed algorithm was also tested for the coding of a sequence of frames at 10 frames/s. The model adaptation, depth estimation and object articulation procedures were applied only at the beginning of each group of frames. Each group of frames consists of 10 frames. The first frame of each group of frames was transmitted using intra-frame coding techniques. In the intermediate frames, the model and articulation information are self-adapted using the rigid and non-rigid 3D motion information. The only parameters that need to be transmitted are the 6 parameters of the rigid 3D motion and the 3D non-rigid motion vector for each node of the wireframe. The methodology developed in this paper allows both left and right images to be reconstructed using the same 3D rigid motion vectors, thus achieving considerable bit-rate savings.

Fig. 10. (a) Reconstructed left channel of 'Claude' using rigid 3D motion compensation. (b) The corresponding prediction error. (c) Reconstructed right channel of 'Claude' using rigid 3D motion compensation. (d) The corresponding prediction error.

The coding algorithm requires a bit-rate of 24.4 kbps and produces better image quality than a comparably simple block matching motion estimation algorithm [23], as shown in Figs. 12 and 13. The simple block matching approach is identical to that used in H.263 and consists of absolute displaced frame difference minimisation, by searching exhaustively within a search area of −15, …, 15 half-pixels in the previous frame, centred at the position of the examined block. In both coders, only the first frame of each group of frames was transmitted using intra-frame coding. It was also assumed that each frame was predicted using the reconstructed previous frame, and that the prediction error was not transmitted. The bit-rate required by the block matching scheme with a 16×16 block size was 24.5 kbps.

Fig. 11. (a) Detail of the reconstruction error of the left channel of 'Claude' using rigid 3D motion compensation. (b) The corresponding prediction error using non-rigid 3D motion compensation. (c) Detail of the reconstruction error of the right channel of 'Claude' using rigid 3D motion compensation. (d) The corresponding prediction error using non-rigid 3D motion compensation.

Fig. 12. PSNR of each frame of the left channel for the proposed algorithm, compared with the block matching scheme with a block size of 16×16 pixels.

Fig. 13. PSNR of each frame of the right channel for the proposed algorithm, compared with the block matching scheme with a block size of 16×16 pixels.

9. Conclusions

In this paper we addressed the problem of rigid and non-rigid 3D motion estimation for model-based stereoscopic videoconference image sequence analysis and coding. On the basis of foreground/background segmentation using motion, depth and luminance information, the model was initialised by automatically adapting a wireframe model to the consistent depth information. Object articulation was then performed based on the rigid 3D motion of small surface patches. Spatial constraints were imposed to increase the reliability of the obtained 3D motion estimates for each triangle patch. A novel iterative classification technique was then used to obtain an articulated description of the scene (head, neck, shoulders). Finally, flexible motion of the nodes of the wireframe was estimated using a novel technique based on the rigid 3D motions of the triangles containing each specific node.

The results of the algorithm can be used in a number of applications. For video production and computer graphics applications, the 3D motion of a specific scene could be used to produce a scene with similar motion but different texture, as when producing a video with a model mimicking the motion of an actor.
The method also has useful applications in image analysis, since it provides an analytic representation of the motion of the object (at either the triangle or the wireframe-node level) that can be used for the segmentation or articulation of the object into uniformly moving rigid components. Finally, the method was experimentally shown to be efficient for very low bit-rate coding of stereoscopic image sequences.

References

[1] G. Adiv, Determining three-dimensional motion and structure from optical flow generated by several moving objects, IEEE Trans. Pattern Anal. Mach. Intell. 7 (July 1985) 384–401.
[2] K. Aizawa, H. Harashima, T. Saito, Model-based analysis synthesis image coding (MBASIC) system for a person's face, Signal Processing: Image Communication 1 (October 1989) 139–152.
[3] S. Barnard, W. Tompson, Disparity analysis of images, IEEE Trans. Pattern Anal. Mach. Intell. 2 (July 1980) 333–340.
[4] G. Bozdagi, A.M. Tekalp, L. Onural, 3-D motion estimation and wireframe adaptation including photometric effects for model-based coding of facial image sequences, IEEE Trans. Circuits Systems Video Technol. (June 1994) 246–256.
[5] I.J. Cox, S. Hingorani, B.M. Maggs, S.B. Rao, Stereo without regularization, tech. rep., NEC Research Institute, Princeton, USA, October.
[6] I. Cox, S. Hingorani, S. Rao, B. Maggs, A maximum likelihood stereo algorithm, Comput. Vision Graphics Image Process. (1995), to appear.
[7] J.L. Dugelay, D. Pele, Motion and disparity analysis of a stereoscopic sequence. Application to 3DTV coding, EUSIPCO '92, October 1992, pp. 1295–1298.
[8] O. Egger, W. Li, M. Kunt, High compression image coding using an adaptive morphological subband decomposition, Proc. IEEE 83 (February 1995) 272–287.
[9] L. Falkenhagen, 3D object-based depth estimation from stereoscopic image sequences, in: Proc. Internat. Workshop on Stereoscopic and 3D Imaging '95, Santorini, Greece, September 1995, pp. 81–86.
[10] N. Grammalidis, S. Malassiotis, D. Tzovaras, M.G. Strintzis, Stereo image sequence coding based on three-dimensional motion estimation and compensation, Signal Processing: Image Communication 7 (August 1995) 129–145.
[11] M. Hötter, Object-oriented analysis–synthesis coding based on moving two-dimensional objects, Signal Processing: Image Communication 2 (December 1990) 409–428.
[12] M. Hötter, Optimization and efficiency of an object-oriented analysis–synthesis coder, Signal Processing: Image Communication 4 (April 1994) 181–194.
[13] E. Izquierdo, M. Ernst, Motion/disparity analysis for 3D-video-conference applications, in: M.G.S. et al. (Eds.), Proc. Internat. Workshop Stereoscopic and 3D Imaging, Santorini, Greece, September 1995, pp. 180–186.
[14] R. Koch, Dynamic 3D scene analysis through synthesis feedback control, IEEE Trans. Pattern Anal. Mach. Intell. 15 (June 1993) 556–568.
[15] H. Li, A. Lundmark, R. Forchheimer, Image sequence coding at very low bitrates: a review, IEEE Trans. Image Process. 3 (September 1995) 589–609.
[16] J. Liu, R. Skerjanc, Stereo and motion correspondence in a sequence of stereo images, Signal Processing: Image Communication 5 (October 1993) 305–318.
[17] S. Malassiotis, M.G. Strintzis, Optimal 3D mesh object modeling for depth estimation from stereo images, in: Proc. 4th European Workshop on 3D Television, Rome, October.
[18] G. Martinez, Shape estimation of moving articulated 3D objects for object-based analysis-synthesis coding (OBASC), in: Internat. Workshop on Coding Techniques for Very Low Bit-rate Video, Tokyo, Japan, November.
[19] G. Martínez, Object articulation for model-based facial image coding, Signal Processing: Image Communication (September 1996).
[20] MPEG-2, Generic coding of moving pictures and associated audio information, tech. rep., ISO/IEC 13818.
[21] H.G. Mussman, M. Hötter, J. Ostermann, Object-oriented analysis–synthesis coding of moving images, Signal Processing: Image Communication 1 (October 1989) 117–138.
[22] H.G. Musmann, P. Pirsch, H.J. Grallert, Advances in picture coding, Proc. IEEE 73 (April 1985) 523–548.
[23] A.N. Netravali, B.G. Haskell, Digital Pictures: Representation and Compression, Plenum Press, New York and London.
[24] S. Panis, M. Ziegler, Object based coding using motion stereo information, in: Proc. Picture Coding Symposium (PCS '94), Sacramento, California, September 1994, pp. 308–312.
[25] D.V. Papadimitriou, Stereo in model-based image coding, in: Internat. Workshop on Coding Techniques for Very Low Bit-rate Video (VLBV 94), Colchester, April 1994.
[26] L. Robert, R. Deriche, Dense depth map reconstruction using a multiscale regularization approach with discontinuities preserving, in: M.G.S. et al. (Eds.), Proc. Internat. Workshop Stereoscopic and 3D Imaging, Santorini, Greece, September 1995, pp. 32–39.
[27] S.S. Sinha, B.G. Schunck, A two-stage algorithm for discontinuity-preserving surface reconstruction, IEEE Trans. Pattern Anal. Mach. Intell. 14 (January 1992).
[28] A. Tamtaoui, C. Labit, Constrained disparity motion estimators for 3DTV image sequence coding, Signal Processing: Image Communication 4 (November 1991) 45–54.
[29] A. Tamtaoui, C. Labit, Symmetrical stereo matching for 3DTV sequence coding, in: Picture Coding Symp. PCS '93, March.
[30] D. Tzovaras, N. Grammalidis, M.G. Strintzis, Depth map coding for stereo and multiview image sequence transmission, in: Internat. Workshop on Stereoscopic and 3D Imaging (IWS3DI'95), Santorini, Greece, September 1995, pp. 75–80.


A The left scanline The right scanline Dense Disparity Estimation via Global and Local Matching Chun-Jen Tsai and Aggelos K. Katsaggelos Electrical and Computer Engineering Northwestern University Evanston, IL 60208-3118, USA E-mail: tsai@ece.nwu.edu,

More information

Video Alignment. Final Report. Spring 2005 Prof. Brian Evans Multidimensional Digital Signal Processing Project The University of Texas at Austin

Video Alignment. Final Report. Spring 2005 Prof. Brian Evans Multidimensional Digital Signal Processing Project The University of Texas at Austin Final Report Spring 2005 Prof. Brian Evans Multidimensional Digital Signal Processing Project The University of Texas at Austin Omer Shakil Abstract This report describes a method to align two videos.

More information

Optimized Progressive Coding of Stereo Images Using Discrete Wavelet Transform

Optimized Progressive Coding of Stereo Images Using Discrete Wavelet Transform Optimized Progressive Coding of Stereo Images Using Discrete Wavelet Transform Torsten Palfner, Alexander Mali and Erika Müller Institute of Telecommunications and Information Technology, University of

More information

Stereo Wrap + Motion. Computer Vision I. CSE252A Lecture 17

Stereo Wrap + Motion. Computer Vision I. CSE252A Lecture 17 Stereo Wrap + Motion CSE252A Lecture 17 Some Issues Ambiguity Window size Window shape Lighting Half occluded regions Problem of Occlusion Stereo Constraints CONSTRAINT BRIEF DESCRIPTION 1-D Epipolar Search

More information

Motion Estimation for Video Coding Standards

Motion Estimation for Video Coding Standards Motion Estimation for Video Coding Standards Prof. Ja-Ling Wu Department of Computer Science and Information Engineering National Taiwan University Introduction of Motion Estimation The goal of video compression

More information

Facial Expression Analysis for Model-Based Coding of Video Sequences

Facial Expression Analysis for Model-Based Coding of Video Sequences Picture Coding Symposium, pp. 33-38, Berlin, September 1997. Facial Expression Analysis for Model-Based Coding of Video Sequences Peter Eisert and Bernd Girod Telecommunications Institute, University of

More information

INTERNATIONAL ORGANIZATION FOR STANDARDIZATION ORGANISATION INTERNATIONALE DE NORMALISATION ISO-IEC JTC1/SC29/WG11

INTERNATIONAL ORGANIZATION FOR STANDARDIZATION ORGANISATION INTERNATIONALE DE NORMALISATION ISO-IEC JTC1/SC29/WG11 INTERNATIONAL ORGANIZATION FOR STANDARDIZATION ORGANISATION INTERNATIONALE DE NORMALISATION ISO-IEC JTC1/SC29/WG11 CODING OF MOVING PICTRES AND ASSOCIATED ADIO ISO-IEC/JTC1/SC29/WG11 MPEG 95/ July 1995

More information

View Synthesis for Multiview Video Compression

View Synthesis for Multiview Video Compression View Synthesis for Multiview Video Compression Emin Martinian, Alexander Behrens, Jun Xin, and Anthony Vetro email:{martinian,jxin,avetro}@merl.com, behrens@tnt.uni-hannover.de Mitsubishi Electric Research

More information

Video Alignment. Literature Survey. Spring 2005 Prof. Brian Evans Multidimensional Digital Signal Processing Project The University of Texas at Austin

Video Alignment. Literature Survey. Spring 2005 Prof. Brian Evans Multidimensional Digital Signal Processing Project The University of Texas at Austin Literature Survey Spring 2005 Prof. Brian Evans Multidimensional Digital Signal Processing Project The University of Texas at Austin Omer Shakil Abstract This literature survey compares various methods

More information

THE GENERATION of a stereoscopic image sequence

THE GENERATION of a stereoscopic image sequence IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 15, NO. 8, AUGUST 2005 1065 Stereoscopic Video Generation Based on Efficient Layered Structure and Motion Estimation From a Monoscopic

More information

Automatic Reconstruction of 3D Objects Using a Mobile Monoscopic Camera

Automatic Reconstruction of 3D Objects Using a Mobile Monoscopic Camera Automatic Reconstruction of 3D Objects Using a Mobile Monoscopic Camera Wolfgang Niem, Jochen Wingbermühle Universität Hannover Institut für Theoretische Nachrichtentechnik und Informationsverarbeitung

More information

Colour Segmentation-based Computation of Dense Optical Flow with Application to Video Object Segmentation

Colour Segmentation-based Computation of Dense Optical Flow with Application to Video Object Segmentation ÖGAI Journal 24/1 11 Colour Segmentation-based Computation of Dense Optical Flow with Application to Video Object Segmentation Michael Bleyer, Margrit Gelautz, Christoph Rhemann Vienna University of Technology

More information

Department of Electrical Engineering, Keio University Hiyoshi Kouhoku-ku Yokohama 223, Japan

Department of Electrical Engineering, Keio University Hiyoshi Kouhoku-ku Yokohama 223, Japan Shape Modeling from Multiple View Images Using GAs Satoshi KIRIHARA and Hideo SAITO Department of Electrical Engineering, Keio University 3-14-1 Hiyoshi Kouhoku-ku Yokohama 223, Japan TEL +81-45-563-1141

More information

Local qualitative shape from stereo. without detailed correspondence. Extended Abstract. Shimon Edelman. Internet:

Local qualitative shape from stereo. without detailed correspondence. Extended Abstract. Shimon Edelman. Internet: Local qualitative shape from stereo without detailed correspondence Extended Abstract Shimon Edelman Center for Biological Information Processing MIT E25-201, Cambridge MA 02139 Internet: edelman@ai.mit.edu

More information

REDUCTION OF CODING ARTIFACTS IN LOW-BIT-RATE VIDEO CODING. Robert L. Stevenson. usually degrade edge information in the original image.

REDUCTION OF CODING ARTIFACTS IN LOW-BIT-RATE VIDEO CODING. Robert L. Stevenson. usually degrade edge information in the original image. REDUCTION OF CODING ARTIFACTS IN LOW-BIT-RATE VIDEO CODING Robert L. Stevenson Laboratory for Image and Signal Processing Department of Electrical Engineering University of Notre Dame Notre Dame, IN 46556

More information

RENDERING AND ANALYSIS OF FACES USING MULTIPLE IMAGES WITH 3D GEOMETRY. Peter Eisert and Jürgen Rurainsky

RENDERING AND ANALYSIS OF FACES USING MULTIPLE IMAGES WITH 3D GEOMETRY. Peter Eisert and Jürgen Rurainsky RENDERING AND ANALYSIS OF FACES USING MULTIPLE IMAGES WITH 3D GEOMETRY Peter Eisert and Jürgen Rurainsky Fraunhofer Institute for Telecommunications, Heinrich-Hertz-Institute Image Processing Department

More information

STEREOSCOPIC IMAGE PROCESSING

STEREOSCOPIC IMAGE PROCESSING STEREOSCOPIC IMAGE PROCESSING Reginald L. Lagendijk, Ruggero E.H. Franich 1 and Emile A. Hendriks 2 Delft University of Technology Department of Electrical Engineering 4 Mekelweg, 2628 CD Delft, The Netherlands

More information

Model-Aided Coding: A New Approach to Incorporate Facial Animation into Motion-Compensated Video Coding

Model-Aided Coding: A New Approach to Incorporate Facial Animation into Motion-Compensated Video Coding 344 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 10, NO. 3, APRIL 2000 Model-Aided Coding: A New Approach to Incorporate Facial Animation into Motion-Compensated Video Coding Peter

More information

CHAPTER 3 DISPARITY AND DEPTH MAP COMPUTATION

CHAPTER 3 DISPARITY AND DEPTH MAP COMPUTATION CHAPTER 3 DISPARITY AND DEPTH MAP COMPUTATION In this chapter we will discuss the process of disparity computation. It plays an important role in our caricature system because all 3D coordinates of nodes

More information

Rectification and Distortion Correction

Rectification and Distortion Correction Rectification and Distortion Correction Hagen Spies March 12, 2003 Computer Vision Laboratory Department of Electrical Engineering Linköping University, Sweden Contents Distortion Correction Rectification

More information

COMPARATIVE STUDY OF DIFFERENT APPROACHES FOR EFFICIENT RECTIFICATION UNDER GENERAL MOTION

COMPARATIVE STUDY OF DIFFERENT APPROACHES FOR EFFICIENT RECTIFICATION UNDER GENERAL MOTION COMPARATIVE STUDY OF DIFFERENT APPROACHES FOR EFFICIENT RECTIFICATION UNDER GENERAL MOTION Mr.V.SRINIVASA RAO 1 Prof.A.SATYA KALYAN 2 DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING PRASAD V POTLURI SIDDHARTHA

More information

Learning and Inferring Depth from Monocular Images. Jiyan Pan April 1, 2009

Learning and Inferring Depth from Monocular Images. Jiyan Pan April 1, 2009 Learning and Inferring Depth from Monocular Images Jiyan Pan April 1, 2009 Traditional ways of inferring depth Binocular disparity Structure from motion Defocus Given a single monocular image, how to infer

More information

Detecting Planar Homographies in an Image Pair. submission 335. all matches. identication as a rst step in an image analysis

Detecting Planar Homographies in an Image Pair. submission 335. all matches. identication as a rst step in an image analysis Detecting Planar Homographies in an Image Pair submission 335 Abstract This paper proposes an algorithm that detects the occurrence of planar homographies in an uncalibrated image pair. It then shows that

More information

Figure 1: Representation of moving images using layers Once a set of ane models has been found, similar models are grouped based in a mean-square dist

Figure 1: Representation of moving images using layers Once a set of ane models has been found, similar models are grouped based in a mean-square dist ON THE USE OF LAYERS FOR VIDEO CODING AND OBJECT MANIPULATION Luis Torres, David Garca and Anna Mates Dept. of Signal Theory and Communications Universitat Politecnica de Catalunya Gran Capita s/n, D5

More information

Phase2. Phase 1. Video Sequence. Frame Intensities. 1 Bi-ME Bi-ME Bi-ME. Motion Vectors. temporal training. Snake Images. Boundary Smoothing

Phase2. Phase 1. Video Sequence. Frame Intensities. 1 Bi-ME Bi-ME Bi-ME. Motion Vectors. temporal training. Snake Images. Boundary Smoothing CIRCULAR VITERBI BASED ADAPTIVE SYSTEM FOR AUTOMATIC VIDEO OBJECT SEGMENTATION I-Jong Lin, S.Y. Kung ijonglin@ee.princeton.edu Princeton University Abstract - Many future video standards such as MPEG-4

More information

A Low Bit-Rate Video Codec Based on Two-Dimensional Mesh Motion Compensation with Adaptive Interpolation

A Low Bit-Rate Video Codec Based on Two-Dimensional Mesh Motion Compensation with Adaptive Interpolation IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 11, NO. 1, JANUARY 2001 111 A Low Bit-Rate Video Codec Based on Two-Dimensional Mesh Motion Compensation with Adaptive Interpolation

More information

Prof. Fanny Ficuciello Robotics for Bioengineering Visual Servoing

Prof. Fanny Ficuciello Robotics for Bioengineering Visual Servoing Visual servoing vision allows a robotic system to obtain geometrical and qualitative information on the surrounding environment high level control motion planning (look-and-move visual grasping) low level

More information

View Synthesis for Multiview Video Compression

View Synthesis for Multiview Video Compression MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com View Synthesis for Multiview Video Compression Emin Martinian, Alexander Behrens, Jun Xin, and Anthony Vetro TR2006-035 April 2006 Abstract

More information

Accurate and Dense Wide-Baseline Stereo Matching Using SW-POC

Accurate and Dense Wide-Baseline Stereo Matching Using SW-POC Accurate and Dense Wide-Baseline Stereo Matching Using SW-POC Shuji Sakai, Koichi Ito, Takafumi Aoki Graduate School of Information Sciences, Tohoku University, Sendai, 980 8579, Japan Email: sakai@aoki.ecei.tohoku.ac.jp

More information

CHAPTER 5 MOTION DETECTION AND ANALYSIS

CHAPTER 5 MOTION DETECTION AND ANALYSIS CHAPTER 5 MOTION DETECTION AND ANALYSIS 5.1. Introduction: Motion processing is gaining an intense attention from the researchers with the progress in motion studies and processing competence. A series

More information

Region Segmentation for Facial Image Compression

Region Segmentation for Facial Image Compression Region Segmentation for Facial Image Compression Alexander Tropf and Douglas Chai Visual Information Processing Research Group School of Engineering and Mathematics, Edith Cowan University Perth, Australia

More information

Extensions of H.264/AVC for Multiview Video Compression

Extensions of H.264/AVC for Multiview Video Compression MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com Extensions of H.264/AVC for Multiview Video Compression Emin Martinian, Alexander Behrens, Jun Xin, Anthony Vetro, Huifang Sun TR2006-048 June

More information

The Video Z-buffer: A Concept for Facilitating Monoscopic Image Compression by exploiting the 3-D Stereoscopic Depth map

The Video Z-buffer: A Concept for Facilitating Monoscopic Image Compression by exploiting the 3-D Stereoscopic Depth map The Video Z-buffer: A Concept for Facilitating Monoscopic Image Compression by exploiting the 3-D Stereoscopic Depth map Sriram Sethuraman 1 and M. W. Siegel 2 1 David Sarnoff Research Center, Princeton,

More information

PRELIMINARY RESULTS ON REAL-TIME 3D FEATURE-BASED TRACKER 1. We present some preliminary results on a system for tracking 3D motion using

PRELIMINARY RESULTS ON REAL-TIME 3D FEATURE-BASED TRACKER 1. We present some preliminary results on a system for tracking 3D motion using PRELIMINARY RESULTS ON REAL-TIME 3D FEATURE-BASED TRACKER 1 Tak-keung CHENG derek@cs.mu.oz.au Leslie KITCHEN ljk@cs.mu.oz.au Computer Vision and Pattern Recognition Laboratory, Department of Computer Science,

More information

Stereo Vision. MAN-522 Computer Vision

Stereo Vision. MAN-522 Computer Vision Stereo Vision MAN-522 Computer Vision What is the goal of stereo vision? The recovery of the 3D structure of a scene using two or more images of the 3D scene, each acquired from a different viewpoint in

More information

Very Low Bit Rate Color Video

Very Low Bit Rate Color Video 1 Very Low Bit Rate Color Video Coding Using Adaptive Subband Vector Quantization with Dynamic Bit Allocation Stathis P. Voukelatos and John J. Soraghan This work was supported by the GEC-Marconi Hirst

More information

An Embedded Wavelet Video Coder. Using Three-Dimensional Set. Partitioning in Hierarchical Trees. Beong-Jo Kim and William A.

An Embedded Wavelet Video Coder. Using Three-Dimensional Set. Partitioning in Hierarchical Trees. Beong-Jo Kim and William A. An Embedded Wavelet Video Coder Using Three-Dimensional Set Partitioning in Hierarchical Trees (SPIHT) Beong-Jo Kim and William A. Pearlman Department of Electrical, Computer, and Systems Engineering Rensselaer

More information

Stereo imaging ideal geometry

Stereo imaging ideal geometry Stereo imaging ideal geometry (X,Y,Z) Z f (x L,y L ) f (x R,y R ) Optical axes are parallel Optical axes separated by baseline, b. Line connecting lens centers is perpendicular to the optical axis, and

More information

An Embedded Wavelet Video. Set Partitioning in Hierarchical. Beong-Jo Kim and William A. Pearlman

An Embedded Wavelet Video. Set Partitioning in Hierarchical. Beong-Jo Kim and William A. Pearlman An Embedded Wavelet Video Coder Using Three-Dimensional Set Partitioning in Hierarchical Trees (SPIHT) 1 Beong-Jo Kim and William A. Pearlman Department of Electrical, Computer, and Systems Engineering

More information

Efficient Stereo Image Rectification Method Using Horizontal Baseline

Efficient Stereo Image Rectification Method Using Horizontal Baseline Efficient Stereo Image Rectification Method Using Horizontal Baseline Yun-Suk Kang and Yo-Sung Ho School of Information and Communicatitions Gwangju Institute of Science and Technology (GIST) 261 Cheomdan-gwagiro,

More information

RESTORATION OF DEGRADED DOCUMENTS USING IMAGE BINARIZATION TECHNIQUE

RESTORATION OF DEGRADED DOCUMENTS USING IMAGE BINARIZATION TECHNIQUE RESTORATION OF DEGRADED DOCUMENTS USING IMAGE BINARIZATION TECHNIQUE K. Kaviya Selvi 1 and R. S. Sabeenian 2 1 Department of Electronics and Communication Engineering, Communication Systems, Sona College

More information

A Real Time System for Detecting and Tracking People. Ismail Haritaoglu, David Harwood and Larry S. Davis. University of Maryland

A Real Time System for Detecting and Tracking People. Ismail Haritaoglu, David Harwood and Larry S. Davis. University of Maryland W 4 : Who? When? Where? What? A Real Time System for Detecting and Tracking People Ismail Haritaoglu, David Harwood and Larry S. Davis Computer Vision Laboratory University of Maryland College Park, MD

More information

Visual Hulls from Single Uncalibrated Snapshots Using Two Planar Mirrors

Visual Hulls from Single Uncalibrated Snapshots Using Two Planar Mirrors Visual Hulls from Single Uncalibrated Snapshots Using Two Planar Mirrors Keith Forbes 1 Anthon Voigt 2 Ndimi Bodika 2 1 Digital Image Processing Group 2 Automation and Informatics Group Department of Electrical

More information

Scene Segmentation by Color and Depth Information and its Applications

Scene Segmentation by Color and Depth Information and its Applications Scene Segmentation by Color and Depth Information and its Applications Carlo Dal Mutto Pietro Zanuttigh Guido M. Cortelazzo Department of Information Engineering University of Padova Via Gradenigo 6/B,

More information

Occlusion Detection of Real Objects using Contour Based Stereo Matching

Occlusion Detection of Real Objects using Contour Based Stereo Matching Occlusion Detection of Real Objects using Contour Based Stereo Matching Kenichi Hayashi, Hirokazu Kato, Shogo Nishida Graduate School of Engineering Science, Osaka University,1-3 Machikaneyama-cho, Toyonaka,

More information

University of Bristol - Explore Bristol Research. Peer reviewed version. Link to published version (if available): /WIVC.1996.

University of Bristol - Explore Bristol Research. Peer reviewed version. Link to published version (if available): /WIVC.1996. Czerepinski, P. J., & Bull, D. R. (1996). Coderoriented matching criteria for motion estimation. In Proc. 1st Intl workshop on Wireless Image and Video Communications (pp. 38 42). Institute of Electrical

More information

Image Segmentation Techniques for Object-Based Coding

Image Segmentation Techniques for Object-Based Coding Image Techniques for Object-Based Coding Junaid Ahmed, Joseph Bosworth, and Scott T. Acton The Oklahoma Imaging Laboratory School of Electrical and Computer Engineering Oklahoma State University {ajunaid,bosworj,sacton}@okstate.edu

More information

VIDEO COMPRESSION STANDARDS

VIDEO COMPRESSION STANDARDS VIDEO COMPRESSION STANDARDS Family of standards: the evolution of the coding model state of the art (and implementation technology support): H.261: videoconference x64 (1988) MPEG-1: CD storage (up to

More information

Calibrating a Structured Light System Dr Alan M. McIvor Robert J. Valkenburg Machine Vision Team, Industrial Research Limited P.O. Box 2225, Auckland

Calibrating a Structured Light System Dr Alan M. McIvor Robert J. Valkenburg Machine Vision Team, Industrial Research Limited P.O. Box 2225, Auckland Calibrating a Structured Light System Dr Alan M. McIvor Robert J. Valkenburg Machine Vision Team, Industrial Research Limited P.O. Box 2225, Auckland New Zealand Tel: +64 9 3034116, Fax: +64 9 302 8106

More information

Chapter 3 Image Registration. Chapter 3 Image Registration

Chapter 3 Image Registration. Chapter 3 Image Registration Chapter 3 Image Registration Distributed Algorithms for Introduction (1) Definition: Image Registration Input: 2 images of the same scene but taken from different perspectives Goal: Identify transformation

More information

Geometric transform motion compensation for low bit. rate video coding. Sergio M. M. de Faria

Geometric transform motion compensation for low bit. rate video coding. Sergio M. M. de Faria Geometric transform motion compensation for low bit rate video coding Sergio M. M. de Faria Instituto de Telecomunicac~oes / Instituto Politecnico de Leiria Pinhal de Marrocos, Polo II-FCTUC 3000 Coimbra,

More information

3D Sensing and Reconstruction Readings: Ch 12: , Ch 13: ,

3D Sensing and Reconstruction Readings: Ch 12: , Ch 13: , 3D Sensing and Reconstruction Readings: Ch 12: 12.5-6, Ch 13: 13.1-3, 13.9.4 Perspective Geometry Camera Model Stereo Triangulation 3D Reconstruction by Space Carving 3D Shape from X means getting 3D coordinates

More information

Realtime View Adaptation of Video Objects in 3-Dimensional Virtual Environments

Realtime View Adaptation of Video Objects in 3-Dimensional Virtual Environments Contact Details of Presenting Author Edward Cooke (cooke@hhi.de) Tel: +49-30-31002 613 Fax: +49-30-3927200 Summation Abstract o Examination of the representation of time-critical, arbitrary-shaped, video

More information

MOTION. Feature Matching/Tracking. Control Signal Generation REFERENCE IMAGE

MOTION. Feature Matching/Tracking. Control Signal Generation REFERENCE IMAGE Head-Eye Coordination: A Closed-Form Solution M. Xie School of Mechanical & Production Engineering Nanyang Technological University, Singapore 639798 Email: mmxie@ntuix.ntu.ac.sg ABSTRACT In this paper,

More information

3. International Conference on Face and Gesture Recognition, April 14-16, 1998, Nara, Japan 1. A Real Time System for Detecting and Tracking People

3. International Conference on Face and Gesture Recognition, April 14-16, 1998, Nara, Japan 1. A Real Time System for Detecting and Tracking People 3. International Conference on Face and Gesture Recognition, April 14-16, 1998, Nara, Japan 1 W 4 : Who? When? Where? What? A Real Time System for Detecting and Tracking People Ismail Haritaoglu, David

More information

Dimensional Imaging IWSNHC3DI'99, Santorini, Greece, September SYNTHETIC HYBRID OR NATURAL FIT?

Dimensional Imaging IWSNHC3DI'99, Santorini, Greece, September SYNTHETIC HYBRID OR NATURAL FIT? International Workshop on Synthetic Natural Hybrid Coding and Three Dimensional Imaging IWSNHC3DI'99, Santorini, Greece, September 1999. 3-D IMAGING AND COMPRESSION { SYNTHETIC HYBRID OR NATURAL FIT? Bernd

More information

A Hierarchical Statistical Framework for the Segmentation of Deformable Objects in Image Sequences Charles Kervrann and Fabrice Heitz IRISA / INRIA -

A Hierarchical Statistical Framework for the Segmentation of Deformable Objects in Image Sequences Charles Kervrann and Fabrice Heitz IRISA / INRIA - A hierarchical statistical framework for the segmentation of deformable objects in image sequences Charles Kervrann and Fabrice Heitz IRISA/INRIA, Campus Universitaire de Beaulieu, 35042 Rennes Cedex,

More information

LOCALIZATION OF FACIAL REGIONS AND FEATURES IN COLOR IMAGES. Karin Sobottka Ioannis Pitas

LOCALIZATION OF FACIAL REGIONS AND FEATURES IN COLOR IMAGES. Karin Sobottka Ioannis Pitas LOCALIZATION OF FACIAL REGIONS AND FEATURES IN COLOR IMAGES Karin Sobottka Ioannis Pitas Department of Informatics, University of Thessaloniki 540 06, Greece e-mail:fsobottka, pitasg@zeus.csd.auth.gr Index

More information

EE795: Computer Vision and Intelligent Systems

EE795: Computer Vision and Intelligent Systems EE795: Computer Vision and Intelligent Systems Spring 2012 TTh 17:30-18:45 FDH 204 Lecture 14 130307 http://www.ee.unlv.edu/~b1morris/ecg795/ 2 Outline Review Stereo Dense Motion Estimation Translational

More information

Motion-Compensated Subband Coding. Patrick Waldemar, Michael Rauth and Tor A. Ramstad

Motion-Compensated Subband Coding. Patrick Waldemar, Michael Rauth and Tor A. Ramstad Video Compression by Three-dimensional Motion-Compensated Subband Coding Patrick Waldemar, Michael Rauth and Tor A. Ramstad Department of telecommunications, The Norwegian Institute of Technology, N-7034

More information

Platelet-based coding of depth maps for the transmission of multiview images

Platelet-based coding of depth maps for the transmission of multiview images Platelet-based coding of depth maps for the transmission of multiview images Yannick Morvan a, Peter H. N. de With a,b and Dirk Farin a a Eindhoven University of Technology, P.O. Box 513, The Netherlands;

More information

Multiple View Geometry

Multiple View Geometry Multiple View Geometry CS 6320, Spring 2013 Guest Lecture Marcel Prastawa adapted from Pollefeys, Shah, and Zisserman Single view computer vision Projective actions of cameras Camera callibration Photometric

More information

MOTION ESTIMATION IN MPEG-2 VIDEO ENCODING USING A PARALLEL BLOCK MATCHING ALGORITHM. Daniel Grosu, Honorius G^almeanu

MOTION ESTIMATION IN MPEG-2 VIDEO ENCODING USING A PARALLEL BLOCK MATCHING ALGORITHM. Daniel Grosu, Honorius G^almeanu MOTION ESTIMATION IN MPEG-2 VIDEO ENCODING USING A PARALLEL BLOCK MATCHING ALGORITHM Daniel Grosu, Honorius G^almeanu Multimedia Group - Department of Electronics and Computers Transilvania University

More information

Fast Lighting Independent Background Subtraction

Fast Lighting Independent Background Subtraction Fast Lighting Independent Background Subtraction Yuri Ivanov Aaron Bobick John Liu [yivanov bobick johnliu]@media.mit.edu MIT Media Laboratory February 2, 2001 Abstract This paper describes a new method

More information