Experts-Shift: Learning Active Spatial Classification Experts for Keyframe-based Video Segmentation


Yibiao Zhao 1,3, Yanbiao Duan 2,3, Xiaohan Nie 2,3, Yaping Huang 1, Siwei Luo 1
1 Beijing Jiaotong University, 2 Beijing Institute of Technology, 3 Lotus Hill Institute
{ybzhao.lhi,ybduan.lhi,xhnie.lhi}@gmail.com, {yphuang,swluo}@bjtu.edu.cn

Abstract

Experts-Shift is a novel statistical framework for keyframe-based video segmentation. In contrast to existing video segmentation techniques built on simple color models, our method proposes a probabilistic mixture model that couples strong image classifiers (experts) with a latent spatial configuration. To propagate image labels to successive frames, our algorithm tracks all experts jointly with an efficient MCMC sampler, modeling their pairwise relations by a Markov Random Field (MRF). The algorithm handles overlapping color distributions, ambiguous image boundaries, and large displacements in challenging scenarios, resting on a solid foundation of both generative modeling and discriminative learning. Experiments show that our algorithm achieves high-quality results and needs less supervision than previous work.

Figure 1. Probabilistic graphical model representing the mixture of classification experts.

1. Introduction

Keyframe-based video segmentation, the process of propagating segmentation labels from a manually labeled keyframe to successive frames, is an essential technique in many video applications. Several existing approaches treat video as an image sequence [9, 8] or a 3D volume [2, ], and attempt to solve the problem by modeling the foreground and background distributions in color space. In some challenging video sequences, unfortunately, foreground and background regions contain similar colors or highly ambiguous image patterns, which confuse algorithms that rely only on global color statistics (Fig. 2(a)). Video SnapCut [3] first addressed this problem by learning foreground/background color models on localized features along the object boundary.

978-1-4244-9497-2//$26.00 ©IEEE
SIFT matching together with optical flow is applied to build correspondences between successive frames. In this paper, we aim to model the local image pattern and the spatial position simultaneously. By introducing a location-sensitive variable, the latent relation between the target label and the spatial position is established explicitly.

Figure 2. The learned experts (colored ellipses) appear to be very meaningful in that they represent identifiable spatial areas and handle the local ambiguities (e.g., the areas in the red and violet ellipses) quite well.

The mixture model thus acts as an ensemble of localized classifiers, namely experts, each in charge of a Gaussian spatial domain (illustrated by the colored ellipses in Fig. 2(b)). It naturally combines the generative nature of the spatial configuration with the discriminative power of local image features. The mixture model is estimated in an EM fashion: experts compete for ownership of the pixels at the E-step;

and a Boosting algorithm is applied at the M-step to distinguish the subtle differences within each expert's domain. Moreover, we design an efficient model-adapting scheme based on an MCMC sampler: a Markov Random Field captures the pairwise relations between experts, which enables reliable multiple-target tracking and maintains a stable topological relationship between the experts (illustrated in Fig. 2(b)).

2. Mixture Model of Classification Experts

In our method, instead of learning a complicated joint distribution over local image features x and spatial positions s, we introduce a location-sensitive latent variable z for each pixel n = 1, ..., N to infer the latent relation between the spatial position s and the target variable y on the keyframe t. Here z is a K-dimensional binary random variable with a 1-of-K representation, in which a particular element z_k equals 1 and all other elements equal 0. The target distribution p(y^t | x^t, s^t) (considering a single frame, we omit the superscript t in the rest of this section for simplicity) is obtained by marginalizing the joint distribution p(y, z | x, s) over the latent variable z, which allows complicated distributions to be formed from simpler components:

    p(y | x, s) = \sum_{k=1}^{K} p(y, z_k | x, s) = \sum_{k=1}^{K} p(y | x, z_k) p(z_k | s).    (1)

The model is thus formulated as an ensemble of K discriminative classifiers p(y | x, z_k). Each classifier, called an expert, is an independent discriminative model associated with the location-sensitive latent variable z_k. The latent relation between the observed variable s and the target variable y is established explicitly by z_k (the corresponding graphical model representation is depicted in Fig. 1). Each expert is learned as a discriminative Boosting classifier,

    p(y | x, z_k = 1) = \frac{1}{1 + \exp\{-2 y F(x)\}},    (2)

where F(x) = \sum_i \lambda_i f_i(x) is a strong classifier, namely an ensemble of weak classifiers f_i(x) with coefficients \lambda_i.
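As a concrete illustration, the prediction rule of Eqs. (1)-(3) can be sketched in a few lines of NumPy. This is a minimal sketch rather than the paper's implementation: the per-expert boosted scores F_k(x) are assumed precomputed, positions are 2-D, and all function names are ours.

```python
import numpy as np

def expert_prob(F, y):
    """Eq. (2): logistic link on a boosted score F(x) = sum_i lambda_i f_i(x)."""
    return 1.0 / (1.0 + np.exp(-2.0 * y * F))

def gaussian_gate(s, pis, mus, covs):
    """Eq. (3): gating posterior p(z_k = 1 | s) over spatial position s."""
    K = len(pis)
    w = np.empty(K)
    for k in range(K):
        d = s - mus[k]
        inv = np.linalg.inv(covs[k])
        norm = 1.0 / (2 * np.pi * np.sqrt(np.linalg.det(covs[k])))
        w[k] = pis[k] * norm * np.exp(-0.5 * d @ inv @ d)
    return w / w.sum()  # normalize over the K components

def mixture_predict(scores, s, y, pis, mus, covs):
    """Eq. (1): p(y | x, s) = sum_k p(y | x, z_k) p(z_k | s)."""
    gates = gaussian_gate(s, pis, mus, covs)
    return float(np.sum(gates * expert_prob(scores, y)))
```

Because each expert's logistic output over y = +1 and y = -1 sums to one, the gated mixture is automatically a proper distribution over the label.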
Such strongly discriminative classifiers make it possible to use rich image features without spending additional effort on modeling the observed data distribution p(x) in a large feature space. To obtain good generalization ability, we explicitly formulate a generative distribution of spatial position for each expert as a normal distribution, p(s | z_k = 1) = N(s | \mu_k, \Sigma_k). By Bayes' rule, we obtain the posterior

    p(z_k = 1 | s) = \frac{\pi_k N(s | \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j N(s | \mu_j, \Sigma_j)},    (3)

where \pi_k = p(z_k = 1) is the mixing coefficient of each Gaussian component. An intuitive example is shown in Fig. 2. We visualize the seven experts by ellipses indicating their latent spatial distributions. The learned experts appear to be very meaningful in that they represent identifiable spatial areas and handle the local ambiguities quite well.

Model Estimation with EM

The standard procedure for maximum likelihood estimation in latent variable models is the Expectation-Maximization (EM) algorithm, which alternates two steps. (a) The E-step computes posterior probabilities for the latent variables,

    \gamma(z_k) = p(z_k | y, x, s) = \frac{p(y | x, z_k) p(z_k | s)}{\sum_{j=1}^{K} p(y | x, z_j) p(z_j | s)}.    (4)

(b) The M-step updates the parameters by maximizing the expectation of the complete-data log-likelihood with respect to the posterior distribution of the latent variables,

    \Theta_k = \arg\max \sum_{n=1}^{N} \gamma(z_{nk}) \ln\left( p(y_n | z_k, x_n) p(z_k | s_n) \right).    (5)

The parameters \pi_k, \mu_k, \Sigma_k of the spatial model p(z_k | s) are estimated as in the original Gaussian Mixture Model. Meanwhile, the GentleBoost algorithm [6] optimizes the local image feature model with the exponential criterion \exp(-y F(x)), which to second order is equivalent to the binomial log-likelihood criterion [6].
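One EM round of Eqs. (4)-(5) for the spatial part of the mixture can be sketched as below. This is a hedged sketch under our own naming: the per-expert label likelihoods p(y_n | x_n, z_k) are taken as a given array, and the GentleBoost re-training of each expert on responsibility-weighted pixels is indicated only in a comment, since the paper's feature set is not specified here.

```python
import numpy as np

def em_round(S, py_given_xk, pis, mus, covs):
    """One EM round for the experts mixture (Eqs. (4)-(5)), spatial part only.

    S           : (N, 2) pixel positions
    py_given_xk : (N, K) per-expert likelihoods p(y_n | x_n, z_k),
                  e.g. boosted classifiers evaluated at the known labels y_n
    """
    N, K = py_given_xk.shape
    # E-step: responsibilities gamma(z_nk) -- Eq. (4); using pi_k N(s|mu_k,Sigma_k)
    # in place of p(z_k|s) is equivalent, since the normalizer over k cancels.
    lik = np.empty((N, K))
    for k in range(K):
        d = S - mus[k]
        inv = np.linalg.inv(covs[k])
        norm = 1.0 / (2 * np.pi * np.sqrt(np.linalg.det(covs[k])))
        p_s = norm * np.exp(-0.5 * np.einsum('ni,ij,nj->n', d, inv, d))
        lik[:, k] = py_given_xk[:, k] * pis[k] * p_s
    gamma = lik / lik.sum(axis=1, keepdims=True)
    # M-step: standard GMM updates for the spatial Gaussians -- Eq. (5).
    # In the paper, each expert's classifier is also re-trained here with
    # GentleBoost on pixels weighted by gamma[:, k].
    Nk = gamma.sum(axis=0)
    pis = Nk / N
    mus = [(gamma[:, k] @ S) / Nk[k] for k in range(K)]
    covs = []
    for k in range(K):
        d = S - mus[k]
        covs.append((gamma[:, k][:, None] * d).T @ d / Nk[k] + 1e-6 * np.eye(2))
    return gamma, pis, mus, covs
```

The small diagonal term added to each covariance is a standard regularizer against degenerate components, not part of the paper's formulation.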
3. Joint Experts-Shift with MRF

Having learned the classification experts p(y^t | x^t, s^t; \Theta^t) on the keyframe, in this section we focus on tracking the experts \Theta^t to the next frame, \Theta^{t+1}, according to the local image evidence x^t, x^{t+1} of the two successive frames. Thanks to the Gaussian model N(s | \mu_k, \Sigma_k) of each expert's spatial position, the mixture model can be updated explicitly in a very flexible way. Inspired by [7], we model the pairwise relations between experts with a Markov Random Field. The MRF defines a graph structure (V, E) (shown in Fig. 3(d)) with undirected edges E between nodes V, where the joint probability factors as a product of local potential functions over the cliques,

    p(\Theta^{t+1} | \Theta^t) \propto \prod_i \phi(\Theta_i^{t+1}, \Theta_i^t) \prod_{(i,j) \in E} \psi(\Theta_i^{t+1}, \Theta_j^{t+1}),    (6)

where \phi(\Theta_i^{t+1}, \Theta_i^t) are transition models between successive frames capturing each individual expert's motion, and \psi(\Theta_i^{t+1}, \Theta_j^{t+1}) are pairwise interaction potentials proportional to

    \exp\left\{ \int \min\left( N(s | \mu_i, \Sigma_i), N(s | \mu_j, \Sigma_j) \right) dx \, dy \right\},    (7)
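The joint update of Eq. (6) can be approximated by a simple single-site Metropolis-Hastings sampler. The sketch below is illustrative rather than the paper's algorithm: it perturbs one expert mean at a time, replaces the density-intersection potential of Eq. (7) with a cheaper surrogate that keeps neighboring experts at a stable relative distance, and uses a quadratic transition model; all of these choices and names are our assumptions.

```python
import numpy as np

def mh_track(mus_prev, edges, log_evidence, n_iter=500, step=1.0, seed=0):
    """Metropolis-Hastings sketch for jointly shifting expert centers.

    mus_prev     : (K, 2) expert means on frame t
    edges        : list of (i, j) MRF edges between neighboring experts
    log_evidence : callable (K, 2) -> float, image-likelihood term on frame t+1
    """
    rng = np.random.default_rng(seed)
    K = len(mus_prev)
    d_prev = {(i, j): np.linalg.norm(mus_prev[i] - mus_prev[j]) for i, j in edges}

    def log_post(mus):
        # phi surrogate: each expert stays close to its previous position
        lp = -0.5 * np.sum((mus - mus_prev) ** 2)
        # psi surrogate: neighboring experts keep a stable relative distance
        for i, j in edges:
            lp -= 0.5 * (np.linalg.norm(mus[i] - mus[j]) - d_prev[(i, j)]) ** 2
        return lp + log_evidence(mus)

    mus = mus_prev.copy()
    lp = log_post(mus)
    for _ in range(n_iter):
        k = rng.integers(K)                      # perturb one expert at a time
        prop = mus.copy()
        prop[k] += step * rng.normal(size=2)
        lp_new = log_post(prop)
        if np.log(rng.random()) < lp_new - lp:   # MH accept/reject
            mus, lp = prop, lp_new
    return mus
```

With symmetric Gaussian proposals, the acceptance ratio reduces to the posterior ratio, so only the unnormalized log-posterior is needed, which is what makes the MRF factorization of Eq. (6) convenient to sample from.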

Figure 3. Experts (top), probability map (bottom left), and error map (bottom right) under different learning strategies (a-c); (d) MRF graph structure.

where \int \min(\cdot, \cdot) \, dx \, dy is the intersection distance between the two distributions. An efficient MCMC sampling step of the Metropolis-Hastings (MH) algorithm is applied to optimize the MRF model, as introduced in [7]; we do not elaborate on it further in this paper.

Incorporating a motion prior. After determining the position of each expert, we can estimate how much the entire label image shifts between two frames. The intuition is simple: a pixel on frame t+1 can plausibly lie on a segmentation boundary only if there are boundary pixels nearby on the previous frame t, so away from the predicted boundary the propagated label can be trusted:

    p(y^{t+1} | y^t) = \frac{1}{1 + \exp\left\{ -y_p^{t+1} \min_{v \in B_p^{t+1}} d(s, v) \right\}}, \qquad y_p^{t+1}(s) = y^t(\tau(s) + s),    (8)

where we formulate the motion prior as a logistic function of the signed distance between the pixel s and its nearest point on the boundary B_p^{t+1}, and \tau(s) is the total shift of all experts weighted by their influence at the position of the pixel. As shown in Fig. 3, we compare the learning strategies: the shading in both foreground and background confuses the single classifier (a), and is resolved to a great extent by introducing the mixture of experts (b). The prior forcefully suppresses the ambiguity in the inner regions (c), handling the main bodies of the foreground and background and leaving the area near the boundary to the experts, which makes them focus on the local differences near the object boundary and improves performance greatly. This is implemented simply by adjusting the initial data weights according to the prior.

4. Experiments

We evaluate our algorithm on several popular experimental video sequences. The videos in Fig. 4 (1-4) and Fig. 5 are selected from [8, 9, 3] and the LHI dataset [11]. For all experiments we use eight experts. We apply a recently proposed interactive image segmentation algorithm, CO3 [12], to segment each keyframe. Following CO3, we use Bregman iteration to refine the classification map obtained from the learned experts. The Bregman-iteration post-processing runs at roughly 6 frames per second, and the experts can test about 2.3 frames per second on a common PC. Fig. 4 shows representative results of our algorithm on several challenging video sequences; the keyframes with user interaction are annotated on the time axis. Fig. 5 further compares algorithms on the fish sequence. On this challenging data, Geodesic Matting [2] and Video Cutout [], which rely only on a simple global color model, require intensive user interaction.
Video SnapCut [3], with its local color model, performs better than the former two. Our method, without any user interaction, successively propagates the desired segment labels to the last frame.

5. Conclusion

In this paper, we present a statistical framework, Experts-Shift, to handle complicated video sequences in the keyframe-based video segmentation task. We first propose a mixture model that learns local image classifiers (experts) and infers their latent spatial configuration in an EM fashion. Moreover, an efficient joint tracking scheme is designed based on an MRF model over all experts, which enables reliable multiple-target tracking and maintains a stable topological relationship. Experiments show that our algorithm achieves high-quality results and needs less user interaction than previous work.

6. Acknowledgments

This work at Beijing Jiaotong University is supported by China 863 Program 27AA1Z168, NSF China grants 697578, 69258, 68541, 687282, 677316, Beijing Natural Science Foundation 9233, and Doctoral Foundations of the Ministry of Education of China 28449. The work at the Lotus Hill Institute is supported by China 863 Program 27AA1Z3, 29AA1Z331 and NSF China grants 697156, 672823.

References

[1] A. Agarwala, A. Hertzmann, D. Salesin, and S. Seitz. Keyframe-based tracking for rotoscoping and animation. ACM Transactions on Graphics, pages 584-591, 2004.
[2] X. Bai and G. Sapiro. Geodesic matting: A framework for fast interactive image and video segmentation and matting. International Journal of Computer Vision, 82(2):113-132, 2009.
[3] X. Bai, J. Wang, D. Simons, and G. Sapiro. Video SnapCut: robust video object cutout using localized classifiers. ACM Transactions on Graphics, pages 1-11, 2009.
[4] X. Bai, J. Wang, D. Simons, and G. Sapiro. Dynamic color flow: A motion-adaptive color model for object segmentation in video. In European Conference on Computer Vision (ECCV), 2010.
[5] C. M. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag New York, Inc., 2006.
[6] Y. Y. Boykov and M. P. Jolly. Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images. In Proceedings of the International Conference on Computer Vision, volume 1, pages 105-112, 2001.
[7] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2), 2000.
[8] Z. Khan, T. Balch, and F. Dellaert. An MCMC-based particle filter for tracking multiple interacting targets. In European Conference on Computer Vision, pages 279-290, 2004.
[9] Y. Li, J. Sun, and H. Shum. Video object cut and paste. ACM Transactions on Graphics, pages 595-600, 2005.
[10] B. Price, B. Morse, and S. Cohen. LiveCut: Learning-based interactive video segmentation by evaluation of multiple propagated cues. In International Conference on Computer Vision, 2009.
[11] J. Wang, P. Bhat, R. Colburn, M. Agrawala, and M. Cohen. Interactive video cutout. ACM Transactions on Graphics, pages 585-594, 2005.
[12] B. Yao, X. Yang, and S.-C. Zhu. Introduction to a large-scale general purpose ground truth database: Methodology, annotation tool and benchmarks. In EMMCVPR, pages 169-183, 2007.
[13] Y. Zhao, S.-C. Zhu, and S. Luo. CO3 for ultra-fast and accurate interactive segmentation. In ACM Multimedia, 2010.

Figure 4. Representative results of the proposed algorithm. Red dots indicate keyframes with user interaction.

Figure 5. Results of different methods (Experts-Shift, Geodesic Matting, Video Cutout, Video SnapCut). Red dots indicate keyframes with user interaction.