Intelligent Scissors & Erasers: Interactive Video Object Extraction & Inpainting
Chia-Wen Lin (林嘉文), Department of Electrical Engineering, National Tsing Hua University
cwlin@ee.nthu.edu.tw
Image/Video Completion
- The purpose of image/video completion: remove objects and automatically replace them with content that is visually indistinguishable from the background.
- The completed video must look natural to human eyes.
- Maintaining spatio-temporal coherence is very important to avoid annoying visual artifacts.
Texture Synthesis
- Texture synthesis generates image regions from sample textures.
- Applications: removing unwanted objects from an image; restoring damaged images.
Texture Synthesis
- The patch priority is very important: patches on the boundary of the missing region receive high priority (see the sketch below).
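The slide only states that boundary patches get high priority. The following is a minimal sketch assuming a Criminisi-style priority P(p) = C(p)·D(p) (confidence term times data term); the function name, patch size, and normalization constant are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def patch_priority(p, confidence, hole_mask, grad, normal, patch=9, alpha=255.0):
    """Priority of a fill-front pixel p = (row, col): P(p) = C(p) * D(p).

    confidence: float map, ~1 in the known region, 0 inside the hole.
    hole_mask : boolean mask, True where data is missing.
    grad      : (gx, gy) image gradient at p; rotated below to the isophote.
    normal    : (nx, ny) unit normal of the fill front at p.
    """
    y, x = p
    half = patch // 2
    win_conf = confidence[y - half:y + half + 1, x - half:x + half + 1]
    win_known = ~hole_mask[y - half:y + half + 1, x - half:x + half + 1]
    C = win_conf[win_known].sum() / float(patch * patch)       # confidence term
    isophote = np.array([-grad[1], grad[0]], dtype=float)      # perpendicular to gradient
    D = abs(float(isophote @ np.asarray(normal, dtype=float))) / alpha  # data term
    return C * D
```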
Texture Synthesis for Object Removal
Texture Synthesis May Not Be Good for Video Inpainting
Video Inpainting: Space-Time Completion
Y. Wexler, E. Shechtman, and M. Irani, IEEE T-PAMI, Mar. 2007
Video Inpainting: Space-Time Completion
Video Inpainting under Constrained Camera Motion
K. A. Patwardhan, G. Sapiro, and M. Bertalmío, IEEE T-IP, Feb. 2007
Video Inpainting under Constrained Camera Motion
Interactive Object Extraction as a Digital Scissor
Chia-Wen Lin (林嘉文), Department of Electrical Engineering, National Tsing Hua University
cwlin@ee.nthu.edu.tw
Our Interactive Video Inpainting System
System flow (for surveillance video and video forgery applications): input video, object extraction & removal guided by human interaction, scene classification, background mosaic modeling, video inpainting, and finally the inpainted video.
Video Object Extraction
- Object extraction has been widely studied: object segmentation in a single frame; object tracking and segmentation in a video.
- The target is represented in many forms: the centroid of the object or a set of points; geometric shapes; object contours (Yilmaz et al. 2004).
Video Object Extraction
- Major issues in object tracking: partial or full occlusion; changes in the characteristics of an object; changes in the environment (background, lighting, etc.); update of foreground/background models.
- The selected features greatly impact performance, and every feature has its own limits: color values, edges, histograms, motion, hybrid features.
Proposed Interactive Object Extraction Scheme
Flow: each incoming frame is processed by a pixel-wise tracker and a region-wise tracker, followed by a MAP decision; if the result is not acceptable, a manual refinement tool is used, and the refined result is fed back to update the trackers.
Initialization
- Foreground and background samples are manually assigned in the first frame, and the samples of the opposite class are defined accordingly.
Seed Features
- The set of seed features is composed of linear combinations of color channels:
  F = { w0·R + w1·G + w2·B | wi ∈ {-2, -1, 0, 1, 2} }
  F = { w0·H + w1·S + w2·V | wi ∈ {-2, -1, 0, 1, 2} }
  F = { w0·L + w1·a + w2·b | wi ∈ {-2, -1, 0, 1, 2} }
- In total there are 49 features for each color space (the 49 weight triples, e.g., {2, -2, -1}, {2, -2, 1}, ..., are enumerated on the slide).
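As a sanity check on the "49 features per color space" count, here is a minimal sketch assuming, as in Collins et al., that weight triples which are scalar or sign multiples of one another are pruned as redundant:

```python
from itertools import product
from math import gcd

def seed_feature_weights():
    """Enumerate canonical weight triples (w0, w1, w2) with wi in {-2, ..., 2},
    pruning the all-zero triple and triples that are scalar/sign multiples of a
    triple already kept. The result has exactly 49 entries per color space."""
    seen, weights = set(), []
    for w in product(range(-2, 3), repeat=3):
        if w == (0, 0, 0):
            continue
        g = gcd(gcd(abs(w[0]), abs(w[1])), abs(w[2]))
        canon = tuple(v // g for v in w)                 # remove common factor
        canon = max(canon, tuple(-v for v in canon))     # fix an overall sign
        if canon not in seen:
            seen.add(canon)
            weights.append(canon)
    return weights

print(len(seed_feature_weights()))  # -> 49
```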
Feature Extraction
(Figure) A seed feature, e.g., w1·H + w2·S + w3·V, is evaluated over the RGB/HSV color cubes of the foreground and background samples, producing the foreground histogram p(x) and background histogram q(x), which are separated by a threshold.
Tuned Features
- For each seed feature, we build the foreground histogram p(i) and the background histogram q(i) over the feature bins.
- We create the tuned feature as the log-likelihood ratio
  L(i) = log( max(p(i), δ) / max(q(i), δ) )
R. T. Collins et al., "Online selection of discriminative tracking features," IEEE T-PAMI, vol. 27, no. 10, pp. 1631-1643, Oct. 2005.
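A minimal sketch of the tuned-feature computation, assuming 1-D histograms over a fixed number of bins (the bin count and δ value are assumptions):

```python
import numpy as np

def tuned_feature(fg_values, bg_values, n_bins=32, delta=1e-3):
    """Log-likelihood-ratio feature L(i) = log(max(p(i), delta) / max(q(i), delta)).

    fg_values / bg_values: 1-D arrays of one seed feature's responses sampled on
    foreground and background pixels; p and q are their normalized histograms.
    """
    lo = min(fg_values.min(), bg_values.min())
    hi = max(fg_values.max(), bg_values.max())
    p, edges = np.histogram(fg_values, bins=n_bins, range=(lo, hi))
    q, _ = np.histogram(bg_values, bins=n_bins, range=(lo, hi))
    p = p / p.sum()
    q = q / q.sum()
    return np.log(np.maximum(p, delta) / np.maximum(q, delta)), edges
```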
AdaBoost-Based Feature Selection
- We use AdaBoost to combine all the seed features to achieve more accurate segmentation.
- Each seed feature is treated as a weak classifier.
- Through AdaBoost, we generate a strong classifier that separates foreground objects from the background.
AdaBoost Basic Concept
(Figure) Two weak classifiers are combined into a stronger decision boundary.
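A minimal discrete-AdaBoost sketch in which each seed feature drives a decision stump; the stump/threshold choice and round count are illustrative assumptions, not the authors' exact training procedure.

```python
import numpy as np

def adaboost_train(X, y, n_rounds=20):
    """X: (n_samples, n_features) seed-feature responses; y: labels in {-1, +1}."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)                    # sample weights
    stumps = []                                # (feature, threshold, polarity, alpha)
    for _ in range(n_rounds):
        best = None
        for j in range(d):
            thr = np.median(X[:, j])           # crude threshold choice
            for polarity in (+1, -1):
                pred = polarity * np.sign(X[:, j] - thr)
                pred[pred == 0] = 1
                err = np.sum(w[pred != y])
                if best is None or err < best[0]:
                    best = (err, j, thr, polarity, pred)
        err, j, thr, polarity, pred = best
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)  # weak-classifier weight
        w *= np.exp(-alpha * y * pred)         # re-weight the samples
        w /= w.sum()
        stumps.append((j, thr, polarity, alpha))
    return stumps

def adaboost_predict(stumps, X):
    """Strong classifier: sign of the weighted sum of weak decisions."""
    score = np.zeros(X.shape[0])
    for j, thr, polarity, alpha in stumps:
        pred = polarity * np.sign(X[:, j] - thr)
        pred[pred == 0] = 1
        score += alpha * pred
    return np.sign(score)
```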
Result of Pixel-Wise Tracker: Demo
Region-Wise Tracker
- Morphological pre-processing
- Regionalization
- Backward region tracking
Backward Region Tracking
Each region R_i^t in frame t is matched backward to frame t-1 and inherits its label:
  label(R_i^t) = label(R_j^{t-1}),  where  D(R_i^t, R_j^{t-1}) = min_k D(R_i^t, R_k^{t-1}).
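A minimal sketch of the backward assignment, assuming region descriptors (e.g., histograms) and a distance function D are already available; the descriptor type is an assumption here.

```python
import numpy as np

def backward_region_tracking(regions_t, regions_t1, distance):
    """Assign each region at frame t the label of its closest region at frame t-1.

    regions_t  : list of region descriptors at frame t.
    regions_t1 : list of (descriptor, label) pairs from frame t-1.
    distance   : callable D(desc_a, desc_b) -> float.
    Implements label(R_i^t) = label(R_j^{t-1}) with j = argmin_k D(R_i^t, R_k^{t-1}).
    """
    labels = []
    for desc_i in regions_t:
        dists = [distance(desc_i, desc_k) for desc_k, _ in regions_t1]
        j = int(np.argmin(dists))
        labels.append(regions_t1[j][1])
    return labels
```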
Maximum A Posteriori (MAP) Based Spatio-Temporal Tracking
- Confidence measurement: pixel-wise and region-wise
- MAP estimation
- Spatial coherence: uncertain region relabeling
MAP-Based Spatio-Temporal Tracking: Confidence Measurement
- Use a maximum a posteriori formulation:
  P(R_i^t) = (1 - λ)·φ_region(R_i^t, R_j^{t-1}) + λ·φ_pixel(R_i^t),  λ = 0.5
- (1) Likelihood (region-wise), where R_j^{t-1} is the backward-matched region (foreground, background, or uncertain) and h denotes the region histograms:
  φ_region(R_i^t, R_j^{t-1}) = Σ_x sqrt( h_i^t(x)·h_j^{t-1}(x) )  if R_j^{t-1} is a foreground region,
  φ_region(R_i^t, R_j^{t-1}) = 1 - Σ_x sqrt( h_i^t(x)·h_j^{t-1}(x) )  if R_j^{t-1} is a background region.
- (2) Prior (pixel-wise): φ_pixel(R_i^t) = (1/N) Σ_{y ∈ R_i^t} P(y), where P(y) is the pixel-wise foreground probability and N is the number of pixels in R_i^t.
- Decision: foreground if P(R_i^t) > 0.5 and R_j^{t-1} is foreground; background if P(R_i^t) < 0.5 and R_j^{t-1} is background; uncertain otherwise.
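A minimal sketch of the confidence measure above, assuming normalized region histograms, per-pixel foreground probabilities from the pixel-wise tracker, and a matched region labeled either foreground or background:

```python
import numpy as np

def region_confidence(hist_t, hist_t1, label_t1, pixel_probs, lam=0.5):
    """Combine the region-wise likelihood and pixel-wise prior into P(R_i^t).

    hist_t, hist_t1 : normalized histograms of R_i^t and its match R_j^{t-1}.
    label_t1        : 'foreground' or 'background' label of R_j^{t-1}.
    pixel_probs     : per-pixel foreground probabilities inside R_i^t.
    """
    bhatt = float(np.sum(np.sqrt(hist_t * hist_t1)))        # region similarity
    phi_region = bhatt if label_t1 == "foreground" else 1.0 - bhatt
    phi_pixel = float(np.mean(pixel_probs))                 # prior from pixel tracker
    conf = (1 - lam) * phi_region + lam * phi_pixel

    if conf > 0.5 and label_t1 == "foreground":
        label = "foreground"
    elif conf < 0.5 and label_t1 == "background":
        label = "background"
    else:
        label = "uncertain"
    return conf, label
```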
Final Combination: Uncertain Region Relabeling
- Spatial coherence: region growing starts from boundary markers and is driven by the gradient magnitude.
- Black: background marker; white: foreground marker; gray: region to be flooded.
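The slide describes region growing from boundary markers driven by gradient magnitude. A rough approximation using OpenCV's marker-based watershed (an assumed substitute, not the authors' exact flooding scheme) could look like:

```python
import cv2
import numpy as np

def relabel_uncertain(frame_bgr, fg_marker, bg_marker):
    """Flood the uncertain pixels from foreground/background markers.

    frame_bgr            : 8-bit 3-channel frame.
    fg_marker, bg_marker : boolean masks of confident foreground / background.
    Returns a boolean foreground mask covering the previously uncertain area.
    """
    markers = np.zeros(frame_bgr.shape[:2], dtype=np.int32)
    markers[bg_marker] = 1              # background seed
    markers[fg_marker] = 2              # foreground seed
    cv2.watershed(frame_bgr, markers)   # floods unlabeled (0) pixels in place
    return markers == 2
```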
Final Combination Result: Demo
Object Extraction Results: Demo Human interaction is only performed for the first frame
Object Extraction Results: Demo Human interaction is only performed for the first frame
Object Extraction Results: Demo
Human interaction is only performed for the first frame
Object Extraction Results: Demo
- Human interaction only in the first frame
- Human interaction in the first frame and frame 350
Manual Refinement Tools
- We provide brush-like tools to refine the object labels.
- Regions with more than 50% of their area marked by the brush are relabeled (see the sketch below).
- The result after refinement is used to update the models of the trackers.
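A minimal sketch of the 50% rule, assuming an integer region map and a boolean brush mask (names are illustrative):

```python
import numpy as np

def regions_to_relabel(region_map, brush_mask, min_fraction=0.5):
    """Return IDs of regions whose area is more than `min_fraction` covered by
    the brush stroke; those regions are then assigned the brush's label."""
    ids = []
    for r in np.unique(region_map):
        region = region_map == r
        covered = np.count_nonzero(region & brush_mask) / np.count_nonzero(region)
        if covered > min_fraction:
            ids.append(int(r))
    return ids
```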
Computational Complexity
Sequence            | Resolution | Avg. time per regular iteration | Avg. time per update
Bream               | 176×144    | 0.3 s | 1 s
Akiyo               | 352×288    | 0.8 s | 3.8 s
Mother and daughter | 352×288    | 1.4 s | 4.1 s
Jumping             | 352×240    | 0.4 s | 1 s
Flower              | 352×240    | 0.8 s | 2.5 s
Airplane            | 352×240    | 0.4 s | 1.8 s
Man walking         | 720×480    | 0.5 s | 1.5 s
Video Inpainting as a Digital Eraser
Chia-Wen Lin (林嘉文), Department of Electrical Engineering, National Tsing Hua University
cwlin@ee.nthu.edu.tw
Background Inpainting Flowchart
Input frames → object extraction → if the camera is static, merge the entire set of past foreground masks; if the camera is moving, merge the foreground masks in each sub-sequence and build the correspondence → dynamic texture synthesis → exponential weighting and blurring at spatially incoherent boundaries → for a static camera, linear weighting and blurring in temporally incoherent regions.
Background Mosaics
Mosaic-Based Video Inpainting
- Texture synthesis tools are not suitable for video inpainting due to the difficulty of maintaining temporal coherence.
- Our method uses background mosaics to model a video captured by a moving camera.
- A video scene is classified into the following types of regions, and a different inpainting scheme is applied to each:
  - Static background: background mosaics
  - Dynamic background (e.g., river, moving clouds): dynamic texture synthesis
  - Occluded objects: spatio-temporal slices
Static Background Inpainting (Mosaic-Based Copy-Paste)
(Demo: original video vs. inpainted video)
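A minimal sketch of mosaic-based copy-paste, assuming the frame-to-mosaic homography is already known from registration (the function and parameter names are illustrative, not the authors' code):

```python
import cv2
import numpy as np

def fill_from_mosaic(frame, hole_mask, mosaic, H_frame_to_mosaic):
    """Warp the background mosaic into the frame's coordinates and copy its
    pixels into the hole left by the removed object.

    frame, mosaic      : 8-bit images; hole_mask: boolean mask of removed pixels.
    H_frame_to_mosaic  : 3x3 homography mapping frame coordinates to the mosaic.
    """
    H_inv = np.linalg.inv(H_frame_to_mosaic)            # mosaic -> frame mapping
    warped = cv2.warpPerspective(mosaic, H_inv, (frame.shape[1], frame.shape[0]))
    out = frame.copy()
    out[hole_mask] = warped[hole_mask]                   # copy-paste inside the hole
    return out
```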
Dynamic Texture Synthesis
- Linear Dynamic System (LDS):
  x(t+1) = A·x(t) + B·v(t)  (state equation)
  y(t) = C·x(t)  (observation/mapping equation)
  where y(t) are the observation vectors, x(t) the hidden state vectors, v(t) the noise, and A, B, C the model parameters.
- (a) Training: from the input images {y(0), y(1), ..., y(n)}, estimate the hidden state vectors {x(0), x(1), ..., x(n)} and the parameters Â, B̂, Ĉ.
- (b) Synthesis: starting from the initial state x(0), sample noise v(t) = B̂·S with S ~ N(0, 1), iterate the state equation x(t+1) = Â·x(t) + v(t) to obtain new hidden states {x(0), x(1), ..., x(m), ...}, and map them back through y(t) = Ĉ·x(t) to the output images {y(0), y(1), ..., y(m), ...}.
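A minimal sketch of LDS identification and synthesis in the standard closed-form (SVD-based) style; the state dimension and noise handling are assumptions for illustration, not the authors' exact implementation.

```python
import numpy as np

def train_lds(frames, n_states=20):
    """Identify y(t) = C x(t), x(t+1) = A x(t) + B v(t) from a (T, H, W) sequence."""
    T, H, W = frames.shape
    Y = frames.reshape(T, -1).T.astype(np.float64)      # (H*W, T) data matrix
    mean = Y.mean(axis=1, keepdims=True)
    Y = Y - mean
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    C = U[:, :n_states]                                 # observation matrix C
    X = np.diag(s[:n_states]) @ Vt[:n_states]           # hidden states, (n_states, T)
    A = X[:, 1:] @ np.linalg.pinv(X[:, :-1])            # least-squares transition A
    resid = X[:, 1:] - A @ X[:, :-1]                    # one-step residuals
    Ur, sr, _ = np.linalg.svd(resid / np.sqrt(resid.shape[1]), full_matrices=False)
    B = Ur @ np.diag(sr)                                # B B^T approximates residual covariance
    return A, B, C, X[:, 0], mean[:, 0]

def synthesize_lds(A, B, C, x0, mean, n_frames, shape):
    """Run the state equation forward with sampled noise and map states to frames."""
    H, W = shape
    x, frames = x0.copy(), []
    for _ in range(n_frames):
        x = A @ x + B @ np.random.randn(B.shape[1])     # x(t+1) = A x(t) + B v(t)
        frames.append((C @ x + mean).reshape(H, W))     # y(t) = C x(t)
    return np.stack(frames)
```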
Issues with Dynamic Texture Synthesis
- Temporal coherence: inconsistent transitions between the training and synthesized sequences (e.g., 20 training frames vs. 100 synthesized frames); inconsistent transitions in corresponding regions.
- Spatial coherence: incoherence at the boundaries between synthesized and original data.
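As a rough illustration of smoothing such spatial boundaries (the flowchart names exponential weighting; here a simple blur over a band around the hole boundary is assumed, with band width and blur strength chosen arbitrarily):

```python
import cv2
import numpy as np

def blur_seam(image, hole_mask, band=5, sigma=2.0):
    """Blur a thin band around the hole boundary to hide spatial seams between
    the synthesized region and the surrounding original content."""
    mask8 = hole_mask.astype(np.uint8)
    kernel = np.ones((band, band), np.uint8)
    seam = (cv2.dilate(mask8, kernel) - cv2.erode(mask8, kernel)).astype(bool)
    blurred = cv2.GaussianBlur(image, (0, 0), sigma)    # kernel size derived from sigma
    out = image.copy()
    out[seam] = blurred[seam]
    return out
```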
Swimming Pool Sequence 1: original video
Swimming Pool Sequence 1 (Cont.): completed video by the proposed method
Swimming Pool Sequence 2: original video
Swimming Pool Sequence 2 (Cont.): completed video by the proposed method
Lawn Sequence: original video
Lawn Sequence (Cont.): completed video by the proposed method
Lawn Sequence (Cont.): completed video by temporal copy-paste
Playground Sequence: original video
Playground Sequence (Cont.): video completion without ghost shadow compensation
Playground Sequence (Cont.): video completion with ghost shadow compensation
Issues with Dynamic Background Inpainting
- How to maintain spatio-temporal coherence across the boundaries of the original and synthesized videos?
- How to classify regions and select training data from a video captured by a moving camera? Data registration and alignment; effects of alignment inaccuracy.
- Static background as a special case of dynamic background.
- Complexity vs. quality.
Occluded Object Inpainting Using Spatio-Temporal Slices
Chia-Wen Lin (林嘉文), Department of Electrical Engineering, National Tsing Hua University
cwlin@ee.nthu.edu.tw
Occluded Object Inpainting Using Spatio-Temporal Slices
Proposed Object Inpainting Method
Occluded Object Inpainting Using Spatio-Temporal Slices (Cont.)
(Figure) An x-t spatio-temporal slice is extracted from the video volume (axes x, y, t); image inpainting is applied to the slice, yielding the inpainted result.
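A minimal sketch of extracting an x-t slice from the video volume (one fixed image row stacked over time), which is the structure the slice-based inpainting operates on; the array layout is an assumption.

```python
import numpy as np

def xt_slice(video, row):
    """Return the x-t spatio-temporal slice at a fixed image row.

    video: (T, H, W) array. The (T, W) result shows a horizontally moving
    object as an oriented stripe; an occlusion appears as a hole in that
    stripe, which can be repaired with ordinary image inpainting."""
    return np.asarray(video)[:, row, :]
```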
Occluded Object Inpainting Using Spatio-Temporal Slices (Cont.)
- Construct the virtual contour.
- Detect the edges of the spatio-temporal slice.
- Recover the spatio-temporal slices back into the video frames.
Post-Processing for S-T Slices
Posture Mapping
Synthetic Postures
- Posture matching may find no good match when only a small number of postures is available.
- Separate each available posture into three parts, then combine parts from different postures to synthesize more postures (see the sketch below).
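A minimal sketch of the recombination idea, assuming all posture images share the same size and that the three parts are equal-height horizontal bands (the actual part boundaries are not specified on the slide):

```python
import itertools
import numpy as np

def synthesize_postures(postures):
    """Split each posture image into three horizontal bands and recombine
    bands from different postures to enlarge the posture set."""
    bands = []
    for img in postures:
        h = img.shape[0] // 3
        bands.append((img[:h], img[h:2 * h], img[2 * h:3 * h]))
    combos = []
    for top, mid, bot in itertools.product(bands, repeat=3):
        combos.append(np.vstack([top[0], mid[1], bot[2]]))  # top/middle/bottom mix
    return combos
```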
Occluded Object Inpainting: Demo
Occluded Object Inpainting: Demo