
CIRCULAR VITERBI BASED ADAPTIVE SYSTEM FOR AUTOMATIC VIDEO OBJECT SEGMENTATION

I-Jong Lin, S.Y. Kung
Princeton University

Abstract - Many future video standards such as MPEG-4 are shifting focus from compression to content; video object segmentation is the key technology required by these standards. Unlike still images, video contains motion, which can be used to discriminate between objects. We present an automatic single-object video segmentation system whose core is an adaptive version of the Circular Viterbi algorithm. The Circular Viterbi algorithm fuses the information from motion analysis, edge analysis, active contours (dynamic snake), and temporal correlation into a unified system. Pixel-resolution boundaries produced by our system are shown. An analysis of the results, the integration of human-guided data and region-based analysis, and a future iterative scheme for multiple-object sequences are also discussed.

INTRODUCTION
The focus of the MPEG-4 standard [1] is the manipulation of video, creating cut-and-paste functionality for the video domain through the concept of Video Object Planes (VOPs). A key technology for MPEG-4 is video object segmentation; given the amount of video, both archived and acquired daily, this segmentation process must be automatic. This paper discusses a system based on the Circular Viterbi algorithm for automatic segmentation. We discuss the design and results of our system and future extensions for human guidance, region-based approaches and multiple object segmentation.

CIRCULAR VITERBI
The Circular Viterbi (CirVit, for short) algorithm [2] is the kernel of our system; CirVit links together visually important edges, integrates object tracking results, correlates estimates through time and even enables our motion edge analysis. The CirVit algorithm itself is a mapping of the classic Viterbi algorithm onto an image through polar coordinates.
Given a properly formulated score function, a center point and a point on the path, the Circular Viterbi algorithm finds the maximum-scoring path that encircles the center point. When paths are scored w.r.t. an edge image, the algorithm finds the optimal contour that encircles the center point, a prime candidate for a pixel-level object boundary. Furthermore, a CirVit pass has the computational complexity of a convolutional filter.

(Accepted to the IEEE Signal Processing Society 1998 Workshop on Multimedia Signal Processing, December 7-9, 1998, Los Angeles, California, USA.)

Figure 1: System Design and the 7 major steps (block labels: Video Sequence, Frame Intensities, Bi-ME, Motion Vectors, Snake Imaging, Snake Images, Dynamic Snake, Coarse Segmentation, Viterbi Rings, Boundary Analysis, Motion Enhanced Edge Images, Temporal Training, Pixel-Resolution Boundaries, Boundary Smoothing, Final Boundaries, MPEG-4 Encoder).

Temporal Training
Since video exists within a temporal continuum, our system extends the CirVit algorithm beyond iterative refinement to temporal training. In our temporal training, the application of CirVit reflects the temporal correlation of the object boundary through time by adapting its parameters (i.e., edge images, score function parameters and trellis structures). Through feedback, we not only iteratively improve the boundary estimates but also correlate each frame's estimation process through time. Edge images of one frame are enhanced with projections of boundary estimates from other past and future frames. Score function parameters that encode the expected object radius are adjusted to guide the dynamic programming search to more time-consistent results. The estimated object radius also warps the trellis structure to push boundary estimates away from the center. Through these different feedback mechanisms, the temporal training of our system parametrically links together the separate CirVit passes for each frame, extending the scope of the CirVit optimization from a single image to the whole object boundary through time.
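As a rough illustration of the Circular Viterbi search described in this section, the following sketch resamples an edge image onto an (angle, radius) grid and runs a Viterbi pass that starts at a given radius and must close near it. It is only a minimal sketch under stated assumptions: the names `to_polar`, `circular_viterbi` and `max_step`, the nearest-neighbour resampling, and the score (a plain sum of edge strengths with a bounded radius change per angle) are all hypothetical choices made here, not the score function or trellis of the actual system.

```python
import numpy as np


def to_polar(edge, center, n_angles=64, n_radii=32):
    """Nearest-neighbour resampling of an edge image onto an (angle, radius) grid.

    Column k of the result corresponds to a radius of k + 1 pixels.
    """
    cy, cx = center
    angles = 2.0 * np.pi * np.arange(n_angles) / n_angles
    radii = np.arange(1, n_radii + 1)
    ys = np.rint(cy + radii[None, :] * np.sin(angles[:, None])).astype(int)
    xs = np.rint(cx + radii[None, :] * np.cos(angles[:, None])).astype(int)
    inside = (ys >= 0) & (ys < edge.shape[0]) & (xs >= 0) & (xs < edge.shape[1])
    polar = np.zeros((n_angles, n_radii))
    polar[inside] = edge[ys[inside], xs[inside]]  # samples outside the image stay 0
    return polar


def circular_viterbi(polar, start_radius, max_step=1):
    """Best closed path r(angle) through `polar` that starts at `start_radius`.

    Assumed score: sum of polar[a, r(a)], with |r(a+1) - r(a)| <= max_step and
    the last radius within max_step of the first, so that the path closes.
    """
    n_angles, n_radii = polar.shape
    score = np.full(n_radii, -np.inf)
    score[start_radius] = polar[0, start_radius]  # force the given starting point
    back = np.zeros((n_angles, n_radii), dtype=int)
    for a in range(1, n_angles):
        new = np.full(n_radii, -np.inf)
        for r in range(n_radii):
            lo, hi = max(0, r - max_step), min(n_radii, r + max_step + 1)
            prev = lo + int(np.argmax(score[lo:hi]))  # best reachable predecessor
            new[r] = score[prev] + polar[a, r]
            back[a, r] = prev
        score = new
    # Closure: the final radius must be adjacent to the starting radius.
    lo = max(0, start_radius - max_step)
    hi = min(n_radii, start_radius + max_step + 1)
    end = lo + int(np.argmax(score[lo:hi]))
    path = np.empty(n_angles, dtype=int)
    path[-1] = end
    for a in range(n_angles - 1, 0, -1):  # backtrack along the stored pointers
        path[a - 1] = back[a, path[a]]
    return path, float(score[end])


if __name__ == "__main__":
    # Synthetic edge image: a ring of radius 10 around the pixel (20, 20).
    yy, xx = np.mgrid[0:41, 0:41]
    edge = (np.abs(np.hypot(yy - 20, xx - 20) - 10) <= 1.0).astype(float)
    polar = to_polar(edge, center=(20, 20), n_angles=64, n_radii=16)
    path, score = circular_viterbi(polar, start_radius=9)
    print(path + 1, score)  # radii in pixels along the contour, and the path score
```

Working on the polar grid is what turns the closed contour into an ordinary left-to-right trellis; the only circular part is the closure constraint applied at the last angle. In this sketch the expected object radius mentioned under temporal training would enter as an extra term in the per-cell score, which is left out here.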
SYSTEM DESIGN
Centered about the CirVit algorithm, our system has two distinct phases (see Fig. 1): the first phase provides rough center estimates and initial boundary parameters for temporal training; the second phase enhances and refines these rough estimates into pixel-level boundaries through CirVit-based temporal training.


Figure 2: Image processing for object tracking: a) the processed 66th frame of the hall monitor sequence (darker = a stronger $\vec{F}_{Image}$); the 66th frame of the hall monitor sequence b) without and c) with $\vec{F}_{Time}$. Note that $\vec{F}_{Time}$ allows the snake to overcome motion noise.

Figure 3: Three conventional intra-frame snake forces and the fourth force, $\vec{F}_{Time}$ (time regularization): a) $\vec{F}_{Image}$, attraction to edges in the image; b) $\vec{F}_{Int}$, repulsion from other snake points; c) $\vec{F}_{Reg}$, attraction to the expected midpoint (regularization); and d) $\vec{F}_{Time}$, attraction to the time-interpolated point.

Phase 1: Object Tracking and Coarse Estimation
The first phase of our system approximates regions of interest through changes in the motion field. At low (block-level) resolution, approximate boundary centers and radii can be found with a dynamic snake, a volumetric version of the snake algorithm [3]. Our second phase will refine these rough estimates to the pixel level.

Step 1: Bidirectional Motion Estimation
For our motion field, we exploit the bidirectionality of motion estimation techniques. For each block, we perform motion estimation twice, forward and backward in time, to derive a 4D motion field for each frame.

Step 2: Snake Imaging
From the 4D (combined forward and backward) motion field information, we derive a force image compatible with snakes: 1-D edge images. To reduce the 4D field to a 1D field (intensity map), we both reduce dimensionality and enhance data resilience to noise with Principal Component (PC) analysis [4]. After the PC image is processed with a noise filter and an edge operator (see Fig. 2a), this snake image is used as the force image for the dynamic snake.

Step 3: Dynamic Snake
After deriving the snake force image, object boundaries for each frame are found by linking conventional static snakes through time to form a dynamic snake (see Fig. 5).
At first, these static snakes are guided by three forces: two from the points within the frame ($\vec{F}_{Int}$, $\vec{F}_{Reg}$) and one from the image ($\vec{F}_{Image}$), as shown in Figure 3. After this isolated frame optimization, we add a fourth force, a time regularization force ($\vec{F}_{Time}$). $\vec{F}_{Time}$ links the series of static snakes into a single dynamic (volumetric) snake, defined by

$$(\vec{F}_{Time}\ \text{on}\ \vec{S}_n^{t}) = -c\,(\vec{S}_n^{t-1} - 2\vec{S}_n^{t} + \vec{S}_n^{t+1}) \qquad (1)$$

where $\vec{S}_n^{t}$ represents the $n$th snake point in frame $t$ and $c$ is an image constant. This dynamic snake tolerates motion errors by leveraging both past and future information (see Fig. 2b,c). For robustness, the dynamic snake has only 6 points per frame.

Figure 4: Edge Motion Analysis: a) Viterbi Rings (n=8) of the 93rd coastguard frame; b) $R_I$, $R_O$, and $R_E$ are the inside, outside and edge regions, respectively; $\vec{v}_i$, $\vec{v}_o$, $\vec{v}_e$, their respective motions; $C$, the object center; $B$, the boundary estimate; $\vec{N}_e$, the normal to the edge; and c) the boundary motion coherent subset (see Eq. 2) of Viterbi Rings.

Phase 2: Pixel-Level Boundary Refinement
Using block-level (16x16 pixel) estimates of object boundary centers and radii from the dynamic snake, the second phase derives pixel-resolution boundaries through the integration of motion information and temporal training based upon the CirVit algorithm.

Step 4: Viterbi Rings
CirVit also enables our motion edge analysis by finding a manageable and relevant subset of edges called Viterbi Rings. For a given single object, the vast majority of edge segments in the image are irrelevant to the boundary calculation. To avoid a costly analysis over all possible edge segments, we find a set of concentric and mutually exclusive boundary estimates (see Fig. 4a) by 1) using CirVit to find a contour around the estimated center, 2) removing the contour from the edge image, and 3) repeating the process.

Step 5: Boundary Motion Coherence
With these Viterbi Rings, we enhance edge segments whose surrounding motions imply boundary membership.
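The ring-peeling loop of Step 4, which produces the Viterbi Rings used here, might be sketched as follows. This is a minimal sketch under assumptions: the function names are hypothetical, the edge image is assumed to be already resampled onto an (angle, radius) grid, and the default contour finder `per_angle_argmax` is only a stand-in for a real CirVit pass (it picks the strongest radius at each angle and does not enforce a connected contour).

```python
import numpy as np


def per_angle_argmax(polar):
    """Stand-in contour finder (NOT CirVit): strongest radius at each angle."""
    return np.argmax(polar, axis=1)


def viterbi_rings(polar, n_rings, find_contour=per_angle_argmax):
    """Peel `n_rings` mutually exclusive contours off a polar edge image.

    `find_contour` maps an (angle, radius) array to one radius index per angle;
    in the system described here that role is played by a CirVit pass.
    """
    work = polar.astype(float).copy()  # leave the caller's edge image untouched
    angles = np.arange(work.shape[0])
    rings = []
    for _ in range(n_rings):
        contour = np.asarray(find_contour(work))  # 1) find a contour
        rings.append(contour)
        # 2) remove the contour from the edge image. Marking cells as -inf
        # (rather than 0) is a choice of this sketch so that a removed cell
        # can never be selected again.
        work[angles, contour] = -np.inf
    return rings  # 3) the loop repeats the process n_rings times


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    demo = rng.random((16, 8))  # toy polar edge image: 16 angles, 8 radii
    for ring in viterbi_rings(demo, n_rings=3):
        print(ring)
```

Passing the contour finder as a parameter keeps the peeling logic separate from the search itself, so a full Circular Viterbi pass can be substituted without changing the loop.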
By considering an edge segment and its immediate neighborhood, an edge is boundary motion coherent (see Fig. 4b) iff

$$(\vec{v}_i \neq \vec{v}_o) \wedge (\vec{v}_i \cdot \vec{N}_e = \vec{N}_e \cdot \vec{v}_e) \wedge (\vec{v}_o \cdot \vec{N}_e \neq \vec{N}_e \cdot \vec{v}_e) \qquad (2)$$

where $\vec{v}_i$, $\vec{v}_o$, $\vec{v}_e$ and $\vec{N}_e$ are the motions of the inside, outside and edge regions, and the normal to the edge, respectively. These boundary motion coherent edges are passed on to CirVit through enhancement of the edge images (see Fig. 4c).

Figure 5: Boundary estimates of the akiyo sequence (frames 60-69): a) after object tracking, b) after pixel-level refinement.

Step 6: Temporal Training
From a set of motion-enhanced edge images and object tracking estimates of boundary radius and center points, we temporally train the system to converge to pixel-level object boundaries. Each iteration of the temporal training has two parts: CirVit passes on each frame's edge image, followed by a parametric reestimation under the assumption of temporal smoothness. The reestimation results are then factored back into CirVit by adjustments to the edge images, smoothing of the temporal score function parameters and appropriate warping of the trellis diagram. The temporal training iterates until the boundary estimates converge.

Step 7: Boundary Smoothing
In its pursuit of the best path, CirVit often leaves jagged boundaries; the final step smooths the boundary estimate with a low-pass filter.

RESULTS/ANALYSIS
The system has been tested on parts of three MPEG-4 sequences (akiyo frames 10-19, hall monitor frames and coastguard frames 90-99) with promising results. Currently, we assume that 1) there exists only one moving object in the frame and 2) the object boundary is always within the frame. No human guidance was used and all parameters were the same for all three runs. Fig. 6 shows the subjectively best and worst case results for the three video subsequences.

In the akiyo sequence, our system focuses on the head movement and segments it out almost perfectly (see Fig. 5). By analyzing the boundary, the system avoids the motion complexity of the face while still finding a good solution.
In the coastguard sequence, in which the camera pans with the boat, the central region of the boat is consistently captured by the system; the forward and back ends are not, mostly due to strong internal edges and the lack of opposing motion analysis. In the hall monitor sequence, in which a man walks down the hall, the system has trouble with strong edges that have inconclusive motion estimates (the hallway lines).


Figure 6: Worst (a, c, e) and best (b, d, f) case results: a, b) akiyo, 11th and 18th frames; c, d) coastguard, 98th and 93rd frames; e, f) hall monitor, 69th and 62nd frames. Complete results are available in the HTML version.

CONCLUSION
Our system represents a new paradigm for object segmentation and has much room for improvement. More accurate true motion estimators and edge operators would improve performance. Integration of object region-based methods can compensate for or enhance the current edge-based motion analysis. For a semi-automatic system, temporal training can be guided by human input through either direct information or high-confidence edge images. The simple smoothing constraints can be upgraded to a more complete model of object motion. Analysis of the stability and convergence of temporal training is needed as well. Using this system as a basis, we are continuing this work toward an iterative method of multiple object boundary segmentation.

ACKNOWLEDGMENTS
This research was funded by Mitsubishi Electric ITA Adv. Television Lab. Thanks to Anthony Vetro and Huifang Sun for their enlightening discussions.

References
[1] ISO/IEC JTC1/SC29/WG11 Coding of Moving Pictures and Associated Audio MPEG98/W2194. MPEG-4 requirements doc., March
[2] I.-J. Lin, A. Vetro, H. Sun, and S.-Y. Kung. Circular Viterbi: Boundary detection with dynamic programming, MPEG98/doc. 3659, July
[3] M. Kass, A. Witkin, and D. Terzopoulos. Snakes: Active contour models. Int'l Journal of Computer Vision, 1(4):321-331,
[4] Y.-T. Lin, Y.-K. Chen, and S.Y. Kung. A principal component clustering approach to object-oriented motion segmentation and estimation. J. of VLSI Sig. Proc., 2(17):163-187, November 1997.
