Experts-Shift: Learning Active Spatial Classification Experts for Keyframe-based Video Segmentation

Yibiao Zhao 1,3, Yanbiao Duan 2,3, Xiaohan Nie 2,3, Yaping Huang 1, Siwei Luo 1
1 Beijing Jiaotong University, 2 Beijing Institute of Technology, 3 Lotus Hill Institute
{ybzhao.lhi,ybduan.lhi,xhnie.lhi}@gmail.com, {yphuang,swluo}@bjtu.edu.cn

Abstract

Experts-Shift is a novel statistical framework for keyframe-based video segmentation. In contrast to existing video segmentation techniques that rely on simple color models, our method proposes a probabilistic mixture model coupling strong image classifiers (experts) with a latent spatial configuration. To propagate image labels to successive frames, our algorithm tracks all experts jointly with an efficient MCMC sampler, modeling their relations with a Markov Random Field. The algorithm is capable of handling overlapping color distributions, ambiguous image boundaries, and large displacements in challenging scenarios, resting on a solid foundation of both generative modeling and discriminative learning. Experiments show that our algorithm achieves high-quality results and needs less supervision than previous work.

Figure 1. Probabilistic graphical model representing the mixture of classification experts.

1. Introduction

Keyframe-based video segmentation, the process of propagating segmentation labels from a manually labeled keyframe to successive frames, is an essential technique in many video applications. Several existing approaches treat video as an image sequence [9, 8] or as a 3D volume [2, ], and attempt to solve the problem by modeling the foreground vs. background distributions in color space. In some challenging video sequences, unfortunately, foreground and background regions contain similar colors or very ambiguous image patterns, which confuse algorithms relying only on global color statistics (Fig. 2(a)). Video Snapcut [3] first addressed this problem by learning foreground/background color models on localized features along the object boundary.
SIFT matching together with an optical flow algorithm is applied to build correspondences between successive frames.

Figure 2. The learned experts (colored ellipses) are meaningful in that they represent identifiable spatial areas and handle local ambiguities (e.g., the areas in the red and violet ellipses) quite well.

In this paper, we aim to model the local image pattern and the spatial position simultaneously. By introducing a location-sensitive variable, the latent relation between the target label and the spatial position is explicitly established. The mixture model thus acts as an ensemble of localized classifiers, namely experts, each of which is in charge of a Gaussian spatial domain (illustrated by the colored ellipses in Fig. 2(b)). It naturally embraces the generative nature of the spatial configuration and the discriminative power of local image features. The mixture model is estimated in an EM fashion: experts compete for the ownership of pixels in the E step;
and a Boosting algorithm is applied to distinguish the subtle differences within each expert's domain in the M step. Moreover, we design an efficient model-adapting scheme based on an MCMC sampler: a Markov Random Field captures pairwise relations between experts, which enables reliable multiple-target tracking and maintains a stable topological relationship between experts (illustrated in Fig. 2(b)).

2. Mixture Model of Classification Experts

In our method, instead of learning a complicated joint distribution over local image features x and spatial positions s, we introduce a location-sensitive latent variable z for each pixel n = 1, ..., N to infer the latent relation between the spatial position s and the target variable y on the keyframe t. z is a K-dimensional binary random variable with a 1-of-K representation, in which a particular element z_k equals 1 and all other elements equal 0. The target distribution p(y^t | x^t, s^t) (considering a single frame, we omit the superscript t in the rest of this section for simplicity) is obtained by marginalizing the joint distribution p(y, z | x, s) over the latent variable z, which allows complicated distributions to be formed from simpler components:

p(y | x, s) = Σ_{k=1}^{K} p(y, z_k | x, s) = Σ_{k=1}^{K} p(y | x, z_k) p(z_k | s)   (1)

The model is thus formulated as an ensemble of K discriminative classifiers p(y | x, z_k). Each classifier, known as an expert, is an independent discriminative model associated with the location-sensitive latent variable z_k. The latent relation between the observed variable s and the target variable y is explicitly established by z_k (the corresponding graphical model representation is depicted in Fig. 1). Each expert is learned by a Boosting classifier,

p(y | x, z_k = 1) = 1 / (1 + exp{-2 y F(x)})   (2)

where F(x) = Σ_i λ_i f_i(x) is a strong classifier, namely an ensemble of weak classifiers f_i(x) with coefficients λ_i.
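To make the prediction rule concrete, here is a minimal NumPy sketch of Eqs. (1)-(3). It is an illustration, not the paper's implementation: decision stumps stand in for the weak learners, and the function and parameter names (`gate_posteriors`, `expert_prob`, `mixture_prob`) are hypothetical.

```python
import numpy as np

def gate_posteriors(s, pis, mus, covs):
    """p(z_k = 1 | s): posterior of each expert's 2D Gaussian spatial
    domain at position s, i.e. the gating weights of Eq. (3)."""
    w = []
    for pi, mu, cov in zip(pis, mus, covs):
        d = s - mu
        norm = 1.0 / (2.0 * np.pi * np.sqrt(np.linalg.det(cov)))
        w.append(pi * norm * np.exp(-0.5 * d @ np.linalg.inv(cov) @ d))
    w = np.array(w)
    return w / w.sum()

def expert_prob(x, stumps):
    """p(y = +1 | x, z_k = 1) for one boosted expert (Eq. 2):
    F(x) = sum_i lambda_i f_i(x), with decision stumps as weak learners."""
    F = sum(lam * (1.0 if x[dim] > thresh else -1.0)
            for lam, dim, thresh in stumps)
    return 1.0 / (1.0 + np.exp(-2.0 * F))  # y in {-1, +1}

def mixture_prob(x, s, experts, pis, mus, covs):
    """p(y = +1 | x, s) = sum_k p(y | x, z_k) p(z_k | s)  (Eq. 1)."""
    gates = gate_posteriors(s, pis, mus, covs)
    return float(sum(g * expert_prob(x, stumps)
                     for g, stumps in zip(gates, experts)))
```

Because the gates are a normalized Gaussian mixture over position, experts far from s contribute almost nothing, which is exactly the localization behavior visualized by the ellipses in Fig. 2.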
Classifiers with strong discriminative power enable the use of rich image features without additional effort spent modeling the observed data distribution p(x) in a large feature space. To obtain good generalization ability, we explicitly formulate a generative distribution of the spatial position of each expert as a normal distribution, p(s | z_k = 1) = N(s | µ_k, Σ_k). With Bayes' rule, we get the posterior

p(z_k = 1 | s) = π_k N(s | µ_k, Σ_k) / Σ_{j=1}^{K} π_j N(s | µ_j, Σ_j)   (3)

where π_k = p(z_k = 1) is the mixing coefficient of each Gaussian component. An intuitive example is shown in Fig. 2. We visualize the seven experts by ellipses indicating their latent spatial distributions. The learned experts are meaningful in that they represent identifiable spatial areas and handle local ambiguities quite well.

Model Estimation with EM. The standard procedure for maximum likelihood estimation in latent variable models is the Expectation-Maximization (EM) algorithm, which alternates two steps. (a) The E-step computes posterior probabilities for the latent variables,

γ(z_k) = p(z_k | y, x, s) = p(y | x, z_k) p(z_k | s) / Σ_{j=1}^{K} p(y | x, z_j) p(z_j | s)   (4)

(b) The M-step updates the parameters by maximizing the expectation of the complete-data log likelihood with respect to the posterior distribution of the latent variables,

Θ_k = arg max Σ_{n=1}^{N} γ(z_{nk}) ln(p(y_n | z_k, x_n) p(z_k | s_n))   (5)

The parameters π_k, µ_k, Σ_k of the spatial model p(z_k | s) are estimated in the same way as in the original Gaussian Mixture Model. Meanwhile, the GentleBoost algorithm [6] is performed to optimize the local image feature model with the exponential criterion exp(-y F(x)), which to second order is equivalent to the binomial log-likelihood criterion [6].

3.
Joint Experts-Shift with MRF

Having learned the classification experts p(y^t | x^t, s^t; Θ^t) on the keyframe, in this section we focus on tracking the experts Θ^t to the next frame, Θ^{t+1}, according to the local image evidence x^t, x^{t+1} of two successive frames. Benefiting from the Gaussian model of spatial position N(s; µ_k, Σ_k) for each expert, the mixture model is flexible enough to be updated explicitly. Inspired by [7], we model the pairwise relations between experts with a Markov Random Field. The MRF defines a graph structure (V, E) (shown in Fig. 3(d)) with undirected edges E between nodes V, where the joint probability factors as a product of local potential functions on the cliques,

p(Θ^{t+1} | Θ^t) ∝ Π_i φ(Θ_i^{t+1}, Θ_i^t) · Π_{(i,j)∈E} ψ(Θ_i^{t+1}, Θ_j^{t+1})   (6)

where φ(Θ_i^{t+1}, Θ_i^t) is the transition model between successive frames capturing each individual expert's motion, and ψ(Θ_i^t, Θ_j^t) is a pairwise interaction potential proportional to

exp{ -∫ min(N(s | µ_i, Σ_i), N(s | µ_j, Σ_j)) dx dy }   (7)
Figure 3. Experts (top), probability map (bottom left), and error map (bottom right) under different learning strategies (a-c); (d) MRF graph structure.

where ∫ min(·, ·) dx dy is the intersection distance between the two distributions. An efficient MCMC sampling step of the Metropolis-Hastings (MH) algorithm is applied to optimize the MRF model, as introduced in [7]; we do not elaborate on it further in this paper.

Incorporating a motion prior. After determining the position of each expert, we can estimate how much the entire label image shifts between two frames. The intuition is simple: the probability that a pixel in frame t + 1 is a segmentation boundary pixel is high if there are boundary pixels nearby in the previous frame t:

p(y^{t+1} | y_p^{t+1}) = 1 / (1 + exp{ -y_p^{t+1} · min_{v ∈ B_p^{t+1}} d(s, v) }),   y_p^{t+1}(s) = y^t(τ(s) + s)   (8)
where we formulate the motion prior as a logistic function of the signed distance between a pixel s and its nearest point on the boundary B_p^{t+1}, and τ(s) is the total shift of all experts weighted at the position of the pixel. As shown in Fig. 3, shading confuses the single classifier (a) and is largely resolved by introducing the mixture of experts (b). The prior forcefully suppresses the ambiguity in the inner regions (c), making the experts focus on the local differences near the object boundary; this is implemented by simply adjusting the initial weights of the training data according to the prior. In short, the prior handles the main bodies of the foreground and background, leaving the area near the boundary to the experts, and improves performance greatly.

4. Experiments

We evaluate our algorithm on several popular experimental video sequences. The videos in Fig. 4 (1-4) and Fig. 5 are selected from [8, 9, 3] and the LHI dataset [11]. For all experiments, we use eight experts. We apply a recently proposed interactive image segmentation algorithm, CO3 [12], to segment each keyframe. Following CO3, we use Bregman iteration to refine the classification map obtained from the learned experts. The Bregman-iteration post-processing runs at roughly 6 frames per second, and the experts process 2.3 frames per second on a common PC. Fig. 4 shows representative results of our algorithm on several challenging video sequences; the keyframes with user interaction are annotated on the time axis. Fig. 5 further displays a comparison of methods on the fish sequence. On this challenging data, Geodesic Matting [2] and Video Cutout [], which are based only on simple global color models, require intensive user interaction.
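One plausible reading of the motion prior in Eq. (8), as a minimal NumPy sketch rather than the paper's implementation: it assumes a single constant shift τ for all pixels and computes the boundary distance by brute force; `motion_prior` and its arguments are illustrative names.

```python
import numpy as np

def motion_prior(prev_label, shift):
    """Boundary-distance motion prior (Eq. 8), minimal sketch.
    prev_label: HxW array in {-1, +1} from frame t.
    shift: (dy, dx), a constant stand-in for the expert shift tau."""
    # Propagate labels: y_p^{t+1}(s) = y^t(tau(s) + s)
    prop = np.roll(np.roll(prev_label, shift[0], axis=0), shift[1], axis=1)
    # Boundary pixels of the propagated label map (label changes)
    by, bx = np.nonzero(
        (np.abs(np.diff(prop, axis=0, prepend=prop[:1])) > 0) |
        (np.abs(np.diff(prop, axis=1, prepend=prop[:, :1])) > 0))
    ys, xs = np.mgrid[0:prop.shape[0], 0:prop.shape[1]]
    # Distance from every pixel to its nearest boundary pixel (brute force)
    d = np.sqrt((ys[..., None] - by) ** 2 + (xs[..., None] - bx) ** 2).min(-1)
    # Logistic of the signed distance: 0.5 on the boundary, confident far away
    return 1.0 / (1.0 + np.exp(-prop * d))
```

The resulting map is near 0.5 along the propagated boundary and saturates toward 0 or 1 in the inner regions, which is how the prior lets the experts concentrate on the boundary area.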
Video Snapcut [3], with its local color models, performs better than the former methods. Our method, without any additional user interaction, successfully propagates the desired segment labels to the last frame.

5. Conclusion

In this paper, we present a statistical framework, Experts-Shift, for handling complicated video sequences in the keyframe-based video segmentation task. We first propose a mixture model that learns local image classifiers (experts) and infers their latent spatial configuration in an EM fashion. Moreover, an efficient joint tracking scheme is designed based on an MRF model over all experts, which enables reliable multiple-target tracking and maintains a stable topological relationship. Experiments show that our algorithm achieves high-quality results and needs less user interaction than previous work.

6. Acknowledgments

This work at Beijing Jiaotong University is supported by China 863 Program 27AA1Z168, NSF China grants 697578, 69258, 68541, 687282, 677316, Beijing Natural Science Foundation 9233, and Doctoral Foundations of Ministry of Education of China 28449. The work at the Lotus Hill Institute is supported by China 863 Program 27AA1Z3, 29AA1Z331 and NSF China grants 697156, 672823.

References

[1] A. Agarwala, A. Hertzmann, D. Salesin, and S. Seitz. Keyframe-based tracking for rotoscoping and animation. ACM Transactions on Graphics, pages 584-591, 2004.
[2] X. Bai and G. Sapiro. Geodesic matting: A framework for fast interactive image and video segmentation and matting. International Journal of Computer Vision, 82(2):113-132, 2009.
[3] X. Bai, J. Wang, D. Simons, and G. Sapiro. Video SnapCut: robust video object cutout using localized classifiers. ACM Transactions on Graphics, pages 1-11, 2009.
[4] X. Bai, J. Wang, D. Simons, and G. Sapiro. Dynamic color flow: A motion-adaptive color model for object segmentation in video. In European Conference on Computer Vision (ECCV), 2010.
[5] C. M. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag New York, Inc., 2006.
[6] Y. Y.
Boykov and M. P. Jolly. Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images. In Proceedings of the International Conference on Computer Vision, volume 1, pages 105-112, 2001.
[7] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2), 2000.
[8] Z. Khan, T. Balch, and F. Dellaert. An MCMC-based particle filter for tracking multiple interacting targets. In European Conference on Computer Vision, pages 279-290, 2004.
[9] Y. Li, J. Sun, and H. Shum. Video object cut and paste. ACM Transactions on Graphics, pages 595-600, 2005.
[10] B. Price, B. Morse, and S. Cohen. LiveCut: Learning-based interactive video segmentation by evaluation of multiple propagated cues. In International Conference on Computer Vision, 2009.
[11] J. Wang, P. Bhat, R. Colburn, M. Agrawala, and M. Cohen. Interactive video cutout. ACM Transactions on Graphics, pages 585-594, 2005.
[12] B. Yao, X. Yang, and S.-C. Zhu. Introduction to a large-scale general purpose ground truth database: Methodology, annotation tool and benchmarks. In EMMCVPR, pages 169-183, 2007.
[13] Y. Zhao, S.-C. Zhu, and S. Luo. CO3 for ultra-fast and accurate interactive segmentation. In ACM Multimedia, 2010.
Figure 4. Representative results of the proposed algorithm. Red dots indicate keyframes with user interaction.
Figure 5. Results of different methods (Experts-Shift, Geodesic, Cutout, Snapcut). Red dots indicate keyframes with user interaction.