Armand Joulin, Francis Bach and Jean Ponce
INRIA - École Normale Supérieure
April 25, 2012
Segmentation

Segmentation is a classical and fundamental vision problem.
Problem: many possible solutions.
Existing solutions

Supervised segmentation: needs ground truth for every object class; cannot deal with an unknown object (P. Krähenbühl and V. Koltun, NIPS 11).
Interactive segmentation (scribbles or a bounding box, e.g. GrabCut): needs human interaction for each image.
Cosegmentation

Dividing a set of images by using shared information.
No prior information, but: a common foreground and different backgrounds.
Previous methods (Rother et al. 2006, Singh and Hochbaum 2009, ...) only work with two images of the exact same object. The first method presented here works on multiple images and on an object class. The second extends it to multiple images and multiple object classes.
Cosegmentation is also an ill-posed problem: in natural images, objects are linked with their environment, so the background is also common to all the images.
Solutions:
Use user interaction on some images,
Segment the background into meaningful regions.
The goals of our approach

Our method should:
Handle multiple images.
Work on any kind of object/stuff.
Segment the background into meaningful regions.
Use no prior information, but be easily extendable to interactive cosegmentation.
Method goals

Local consistency: maximizing spatial consistency within a particular image (Figure: Image space).
Separation of the classes: maximizing the separability of the K classes between different images (Figure: Feature space).
Our framework: unsupervised discriminative clustering.
Notations

Each image i is reduced to a subsampled grid of pixels. For the n-th pixel, we denote by:
$x_n$ its d-dimensional feature vector,
$y_n$ the K-vector such that $y_{nk} = 1$ if the n-th pixel is in class k and 0 otherwise.
Normalized Cut (Shi and Malik, 2000) (Figure: Image space):
The similarity between two pixels is measured by an RBF function of the distance between their positions $p_n$ and their colors $c_n$. For an image i, our similarity matrix is:
$W^i_{nm} = \exp(-\lambda_p \|p_n - p_m\|_2^2 - \lambda_c \|c_n - c_m\|^2)$.
The normalized Laplacian matrix is $L = I - D^{-1/2} W D^{-1/2}$, where $D$ is the diagonal matrix of the row sums of $W$.
We thus have the following binary term in our cost function:
$E_B(y) = \frac{\mu}{N} \mathrm{tr}(y^T L y)$.
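The similarity matrix and normalized Laplacian above can be sketched in a few lines of numpy. This is an illustrative reconstruction, not the authors' MATLAB code; the bandwidths `lam_p` and `lam_c` are placeholder values, not the ones used in the experiments.

```python
import numpy as np

def normalized_laplacian(positions, colors, lam_p=0.1, lam_c=0.05):
    """Build the RBF similarity matrix W and L = I - D^{-1/2} W D^{-1/2}.

    positions: (N, 2) pixel coordinates; colors: (N, 3) pixel colors.
    lam_p, lam_c are illustrative bandwidths (assumptions, not paper values).
    """
    # Pairwise squared distances in position and color space.
    dp = ((positions[:, None, :] - positions[None, :, :]) ** 2).sum(-1)
    dc = ((colors[:, None, :] - colors[None, :, :]) ** 2).sum(-1)
    W = np.exp(-lam_p * dp - lam_c * dc)
    # Normalized Laplacian of Shi and Malik (2000).
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt
```

The binary term is then `E_B = (mu / N) * np.trace(y.T @ L @ y)` for a relaxed label matrix `y`.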
Formulation

Discriminative classifier (Figure: Feature space): given the labels y, we solve the following problem:
$E_U(y) = \min_{A \in \mathbb{R}^{K \times d},\, b \in \mathbb{R}^K} \frac{1}{N} \sum_{n=1}^N \ell(y_n, A\phi(x_n) + b) + \frac{\lambda}{2K} \|A\|_F^2$,
where $\phi$ is a non-linear mapping of the features and $\ell$ is a loss function.
Mapping approximation

Our discriminative clustering framework works with positive definite kernels. We use the $\chi^2$ kernel matrix $K$:
$K_{nm} = \exp\left(-\lambda_h \sum_{d=1}^D \frac{(x_{nd} - x_{md})^2}{x_{nd} + x_{md}}\right)$.
This is equivalent to applying a mapping $\phi$ from the feature space to a high-dimensional Hilbert space $\mathcal{F}$ such that $K_{nm} = \langle \phi(x_n), \phi(x_m) \rangle$.
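The $\chi^2$ kernel matrix can be sketched as follows for non-negative histogram features; a minimal illustration, where `lam_h` is a placeholder bandwidth and `eps` is a small constant (our addition) guarding against empty bins.

```python
import numpy as np

def chi2_kernel(X, lam_h=1.0, eps=1e-12):
    """Chi-squared RBF kernel for non-negative histogram features X (N, D).

    K[n, m] = exp(-lam_h * sum_d (x_nd - x_md)^2 / (x_nd + x_md)).
    """
    num = (X[:, None, :] - X[None, :, :]) ** 2
    den = X[:, None, :] + X[None, :, :] + eps  # eps avoids division by zero
    return np.exp(-lam_h * (num / den).sum(-1))
```

Since the exponent is zero on the diagonal and non-positive elsewhere, all entries lie in (0, 1] and the diagonal is exactly 1.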
Loss function

We choose the soft-max loss function because it is suited to the multiclass setting and is related to probabilistic models:
$\ell(y_n, A\phi(x_n) + b) = -\sum_{k=1}^K y_{nk} \log\left(\frac{\exp(a_k^T \phi(x_n) + b_k)}{\sum_{l=1}^K \exp(a_l^T \phi(x_n) + b_l)}\right)$.
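The soft-max loss for one pixel can be sketched as follows (an illustrative snippet, with the usual max-subtraction trick for numerical stability):

```python
import numpy as np

def softmax_loss(y, scores):
    """Soft-max (multinomial logistic) loss for one pixel.

    y: one-hot (or soft) K-vector of labels.
    scores: K-vector of classifier scores A phi(x_n) + b.
    """
    scores = scores - scores.max()                      # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())   # log soft-max
    return -(y * log_probs).sum()
```

For example, a one-hot label against uniform scores over K classes gives a loss of log K.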
Find the set of labels y which leads to the best separation of the data into K classes:
$\min_{y \in \{0,1\}^{N \times K},\, y 1_K = 1_N} \; \min_{A \in \mathbb{R}^{K \times d},\, b \in \mathbb{R}^K} E_U(y, A, b)$.
Problem: giving the same label to all the pixels yields a perfect separation.
Cluster size balancing

Two solutions:
Adding linear constraints on the number of elements per class,
Encouraging the proportion of points per class to be uniform.
We choose the second: it adds no parameters and has a probabilistic interpretation:
$H(y) = -\sum_{i \in I} \sum_{k=1}^K \left(\frac{1}{N} \sum_{n \in N_i} y_{nk}\right) \log\left(\frac{1}{N} \sum_{n \in N_i} y_{nk}\right)$,
where i is an image and $N_i$ the set of its pixels.
Note: in a weakly supervised setting (e.g., interactive segmentation), this term can be modified to take prior knowledge into account.
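The balancing entropy can be sketched as below; a minimal illustration that mirrors the slide's formula (with its 1/N normalization), where the small constant inside the log is our addition to handle empty classes.

```python
import numpy as np

def balancing_entropy(y, image_index, N):
    """Entropy H(y) of the per-image class proportions.

    y: (N, K) soft label assignments.
    image_index: length-N array giving the image id of each pixel.
    N: total number of pixels.
    """
    H = 0.0
    for i in np.unique(image_index):
        # Proportion of each class among the pixels of image i.
        p = y[image_index == i].sum(axis=0) / N
        H -= (p * np.log(p + 1e-12)).sum()  # small eps guards log(0)
    return H
```

Maximizing H (i.e. subtracting it from the cost) pushes the class proportions toward uniform in every image, ruling out the trivial all-one-class solution.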
Overall problem

Combining the unary and binary terms with the class balancing term, we obtain the following problem:
$\min_{y \in \{0,1\}^{N \times K},\, y 1_K = 1_N} \left[\min_{A \in \mathbb{R}^{d \times K},\, b \in \mathbb{R}^K} E_U(y, A, b)\right] + E_B(y) - H(y)$.
Probabilistic interpretation

We introduce $t_n \in \{0,1\}^I$, indicating to which image pixel n belongs, and $z_n \in \{1,\dots,M\}$, giving for each pixel n some observable information. The label y is a latent variable of the observable information z given x ($x \to y \to z \leftarrow t$), inducing an explaining-away phenomenon: the label $y_n$ and the variable $t_n$ compete to explain the observable information $z_n$.
More precisely, we suppose a bilinear model:
$P(z_{nm} = 1 \mid t_{ni} = 1, y_{nk} = 1) = y_{nk}\, G^{ik}_m\, t_{ni}$, where $\sum_{m=1}^N G^{ik}_m = 1$,
and an exponential family model for $Y = (y_1, \dots, y_N)$ given $X = (x_1, \dots, x_N)$, with unary parameters $(A, b)$ and binary parameter $L$.
Our cost function is the mean-field variational approximation of the following (regularized) negative conditional log-likelihood of $Z = (z_1, \dots, z_N)$ given $X$ and $T = (t_1, \dots, t_N)$ for our model:
$\min_{A \in \mathbb{R}^{d \times K},\, b \in \mathbb{R}^K,\, G \in \mathbb{R}^{N \times K \times I},\, G^T 1_N = 1,\, G \geq 0} \; -\frac{1}{N} \sum_{n=1}^N \log p(z_n \mid x_n, t_n) + \frac{\lambda}{2K} \|A\|_2^2$.
Z can encode must-link and must-not-link constraints between pixels (e.g., superpixels).
EM procedure

$\min_{y \in \{0,1\}^{N \times K},\, y 1_K = 1_N} \left[\min_{A \in \mathbb{R}^{d \times K},\, b \in \mathbb{R}^K} E_U(y, A, b)\right] + E_B(y) - H(y)$.
This cost function is not jointly convex in y and (A, b). However, it is convex in each of them separately. We alternately optimize over each variable while fixing the other:
L-BFGS for (A, b),
Projected gradient descent for y.
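The projected gradient step on the relaxed labels y needs a projection of each pixel's K-vector onto the probability simplex (non-negative entries summing to 1). A minimal sketch of the standard sort-based projection, offered as an illustrative helper rather than the authors' actual implementation:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of each row of v onto the probability simplex.

    Sort-based algorithm (Held et al.; see also Duchi et al. 2008).
    Each output row is non-negative and sums to 1.
    """
    v = np.atleast_2d(v)
    u = -np.sort(-v, axis=1)                 # sort each row descending
    css = np.cumsum(u, axis=1) - 1.0         # cumulative sums minus target mass
    ind = np.arange(1, v.shape[1] + 1)
    cond = u - css / ind > 0                 # prefix of valid support sizes
    rho = cond.sum(axis=1)                   # size of the support
    theta = css[np.arange(len(v)), rho - 1] / rho
    return np.maximum(v - theta[:, None], 0.0)
```

After each gradient step on y, applying this projection row by row keeps the relaxed labels feasible.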
The initialization

Since our problem is not convex, a good initialization is crucial. We propose a convex quadratic approximation related to Joulin et al. (CVPR 10). Since a quadratic function may lead to poor solutions, we also use random initializations.
Initialization: quadratic approximation

The second-order Taylor expansion of our cost function is:
$J(y) = \frac{K}{2}\left[\mathrm{tr}(yy^T C) + \frac{2\mu}{NK} \mathrm{tr}(yy^T L) - \frac{1}{N} \mathrm{tr}(yy^T \Pi_I)\right]$,
where $C = \frac{1}{N} \Pi_N \left(I - \Phi(N\lambda I_K + \Phi^T \Pi_N \Phi)^{-1} \Phi^T\right) \Pi_N$ is related to the reweighted ridge regression classifier (Joulin et al., CVPR 10).
This is not convex because of the last term, which can be replaced by the following linear constraints:
$\sum_{n \in N_i} y_{nk} \leq 0.9\, N_i, \qquad \sum_{j \in I \setminus i} \sum_{n \in N_j} y_{nk} \geq 0.1\,(N - N_i)$.
We then obtain a formulation similar to Joulin et al. (CVPR 10).
Experiments

Binary segmentation (foreground/background) on MSRC: high variability in foreground and background, around 30 images per class; we use SIFT features.
Multiclass cosegmentation on iCoseg: low variability within the images (same illumination, ...), around 10 images per class; we use color histograms.
Some extensions: GrabCut, weakly supervised problems, video key frames.
Binary cosegmentation

class     Ours   Kim et al. (ICCV 11)   Joulin et al. (CVPR 10)
Bike      43.3   29.9                   42.3
Bird      47.7   29.9                   33.2
Car       59.7   37.1                   59.0
Cat       31.9   24.4                   30.1
Chair     39.6   28.7                   37.6
Cow       52.7   33.5                   45.0
Dog       41.8   33.0                   41.3
Face      70.0   33.2                   66.2
Flower    51.9   40.2                   50.9
House     51.0   32.2                   50.5
Plane     21.6   25.1                   21.7
Sheep     66.3   60.8                   60.4
Sign      58.9   43.2                   55.2
Tree      67.0   61.2                   60.0
Average   50.2   36.6                   46.7
Multiclass cosegmentation

class             K   Ours   Joulin et al. (CVPR 10)   Kim et al. (ICCV 11)
Baseball player   5   62.2   53.5                      51.1
Brown bear        3   75.6   78.5                      40.4
Elephant          4   65.5   51.2                      43.5
Ferrari           4   65.2   63.2                      60.5
Football player   5   51.1   38.8                      38.3
Helicopter        3   43.3   67.8                      7.3
Kite Panda        2   57.8   58.0                      66.2
Monk              2   77.6   76.9                      71.3
Panda             3   55.9   49.1                      39.4
Skating           2   64.0   47.2                      51.1
Stonehenge        3   86.3   85.4                      64.6
Plane             3   45.8   39.2                      25.2
Face              3   70.5   56.4                      33.2
Average               64.8   58.1                      48.7
Extensions

GrabCut.
Weakly supervised learning with image tags ({plane, sheep, sky, grass}).
Video shot segmentation.
Limitations

Number of classes: each class must be present in each image (because of the entropy term).
Running time: about half an hour to one hour (MATLAB implementation).
Thank you.