Segmentation: Clustering, Graph Cut and EM

Ying Wu
Electrical Engineering and Computer Science, Northwestern University, Evanston, IL 60208
yingwu@northwestern.edu
http://www.eecs.northwestern.edu/~yingwu

Outline
  Motivations and Applications
  Image Segmentation by Clustering
    K-Means Algorithm
    Self-Organizing Map
  Image Segmentation by Graph Cut
    Basic Idea
    Block-diagonalization
  Segmentation by Expectation-Maximization
    Missing Data Problem
    E-M iteration
  Issues Remained

Segmentation is a Fundamental Problem
Segmentation groups similar components, such as image pixels, image regions, or even video clips.
It is an ill-posed problem: how do we define the similarity measure?

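Background Subtraction
Used in video surveillance applications to separate the foreground from the background.
Assume a fixed camera and subtract the background from each image.
Adaptive scheme: update the background estimate by selecting the weights $w_a$, $w_i$ and $w_c$:
$$B^{(n+1)} = \frac{w_a F + \sum_i w_i B^{(n-i)}}{w_c}$$
where $F$ is the current frame and $B^{(n-i)}$ are the previous background estimates.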
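
The adaptive update can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the update is taken to be the weighted average $B^{(n+1)} = (w_a F + \sum_i w_i B^{(n-i)}) / w_c$, and the function names `update_background` and `foreground_mask`, as well as the fixed threshold, are hypothetical choices that do not come from the slides.

```python
import numpy as np

def update_background(frame, history, w_a, w_i, w_c):
    """Assumed update: B_next = (w_a * frame + sum_i w_i[i] * history[i]) / w_c.

    frame   : current image F (2-D array)
    history : list of previous background estimates B^(n-i)
    w_i     : list of weights, one per entry in history
    """
    weighted = w_a * frame.astype(float)
    for w, b in zip(w_i, history):
        weighted = weighted + w * b
    return weighted / w_c

def foreground_mask(frame, background, threshold):
    """Mark pixels whose absolute difference from the background exceeds
    a threshold (the thresholding rule is an illustrative assumption)."""
    return np.abs(frame.astype(float) - background) > threshold
```

With `w_c` equal to the sum of the other weights, the update is a convex combination of the current frame and the earlier estimates, so a larger `w_a` makes the background adapt faster.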

Object Modeling
Represent an object by regions: a first step for recognition.
Represent a scene by a set of layers: motion segmentation.
Image segmentation vs. motion segmentation.

Basic Approaches
  Segmentation by clustering
  Segmentation by graph cut
  Segmentation by EM algorithm

K-Means Clustering
Assume the number of clusters, $K$, is given.
Use the center $C_i$ of each cluster to represent that cluster.
How do we determine the identity of a data point? We need to define a distance measure $D(x, y)$, e.g., $D(x, y) = \|x - y\|^2$.
Winner takes all:
$$l_k(x_k) = \arg\min_i D(x_k, C_i) = \arg\min_i \|x_k - C_i\|^2$$
where $l_k$ is the label for the data point $x_k$.
K-means finds the clusters that minimize the total distortion:
$$\phi(x, C) = \sum_{i} \sum_{x_j \in i\text{-th cluster}} \|x_j - C_i\|^2$$

K-Means Clustering
To minimize $\phi$, the K-means algorithm iterates between two steps:
Labelling: assume the $p$-th iteration ends with a set of cluster centers $C_i^{(p)}$, $i = 1, \ldots, K$. We label each data point based on this set of centers, i.e., for every $x_k$ find
$$l_k^{(p+1)}(x_k) = \arg\min_i \|x_k - C_i^{(p)}\|^2$$
and group the data points that belong to the same cluster: $\Omega_j = \{x_k : l_k(x_k) = j\}$.
Re-centering: re-calculate the centers as the cluster means:
$$C_i^{(p+1)} = \frac{\sum_{x_k \in \Omega_i} x_k}{|\Omega_i|}$$
Iterate between labelling and re-centering until convergence.
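
The two steps above can be sketched as follows. This is a minimal NumPy illustration, not a reference implementation: the function name `kmeans`, the caller-supplied initial centers, the stopping rule (labels unchanged) and the handling of empty clusters (center left as is) are all assumptions made for the example.

```python
import numpy as np

def kmeans(X, init_centers, max_iter=100):
    """Alternate labelling and re-centering until the labels stop changing.

    X            : (N, d) array of data points
    init_centers : (K, d) array of starting centers (assumed given by the caller)
    Returns the centers, the labels, and the total distortion phi.
    """
    centers = np.asarray(init_centers, dtype=float).copy()
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Labelling: assign each point to the nearest center (squared distance).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Re-centering: each center becomes the mean of its members.
        for i in range(len(centers)):
            members = X[labels == i]
            if len(members) > 0:  # assumption: keep an empty cluster's center
                centers[i] = members.mean(axis=0)
    distortion = ((X - centers[labels]) ** 2).sum()
    return centers, labels, distortion
```

For image segmentation, `X` would typically hold one feature vector per pixel (for example intensity or color), and the returned labels are reshaped back to the image grid.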

Self-Organizing Map (SOM)
SOM can be used for visualizing high-dimensional data.
It maps the data to a low-dimensional space based on competitive learning.
It is a two-layer neural network.
[Figure: inputs $x_1, x_2, x_3$ connected through weights to outputs $\xi_1, \xi_2, \ldots, \xi_{m-1}, \xi_m$]
The number of neurons in the input layer equals the dimension of the input vector.
Each output neuron $k$ has a connection weight vector $W_k$.

Competitive Learning
For an input $x_k$, all neurons compete against each other.
The winner is the neuron whose weight is closest to the input:
$$y_k = \arg\min_i D(x_k, W_i)$$
The index of the winner is taken as the output of the SOM.
Adjust the weight of the winner, train the neurons nearby, and counter-train those far away.
This uses a window function $\Lambda(|y - y_k|)$ and the Hebbian learning rule, applied to each output neuron $y$:
$$W_y(t + 1) = W_y(t) + \eta(t)\,\Lambda(|y - y_k|)\,(x_k - W_y(t))$$
Intuition: the input data point attracts the neurons inside the window toward its location, but pushes the neurons outside the window away.
Relation to vector quantization (VQ) and K-means clustering?
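
One competitive-learning update can be sketched as below. The sketch assumes a one-dimensional arrangement of output neurons, a squared Euclidean distance for $D$, and a simple window that is $+1$ inside a radius and a small negative constant outside it; the names `window` and `som_step` and the `push` value are hypothetical and only serve the illustration.

```python
import numpy as np

def window(dist, radius, push=0.1):
    """Assumed window function: +1 within `radius` of the winner, -push outside."""
    return np.where(dist <= radius, 1.0, -push)

def som_step(W, x, eta, radius):
    """One update for a 1-D chain of output neurons.

    W : (m, d) float array, row i is the weight vector of output neuron i
    x : (d,) input vector
    Returns the updated weights and the index of the winning neuron.
    """
    # Winner: the neuron whose weight is closest to the input.
    winner = int(np.argmin(((W - x) ** 2).sum(axis=1)))
    # Distance of every neuron to the winner along the chain.
    dist = np.abs(np.arange(len(W)) - winner)
    lam = window(dist, radius)
    # Neurons inside the window move toward x, the others move away.
    W_new = W + eta * lam[:, None] * (x - W)
    return W_new, winner
```

In practice the learning rate `eta` and the window radius are usually decreased over time, which corresponds to the time-dependent $\eta(t)$ in the rule above.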

Adjacency Graph and Affinity Matrix
We can represent the set of data $\{x_1, \ldots, x_N\}$ by a graph $G = \{V, E\}$.
Each vertex represents an individual data point.
Each edge represents the adjacency of two data points.
The weight of the edge represents the affinity, i.e., the similarity, of the two points. For example:
$$A_{ij} = \exp\left\{-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right\}$$
Thus, the data set can be viewed as a weighted adjacency graph.
More importantly, it can also be viewed as an affinity matrix $A$.
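
The Gaussian affinity above is straightforward to compute for a whole data set at once. A minimal sketch, assuming the data are stored as rows of an array; the function name `affinity_matrix` is hypothetical.

```python
import numpy as np

def affinity_matrix(X, sigma):
    """A[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)) for all pairs of rows of X."""
    # Pairwise squared Euclidean distances via broadcasting.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))
```

The scale `sigma` controls how quickly affinity decays with distance: points much farther apart than `sigma` get an affinity close to zero, so well-separated groups produce near-zero entries between them.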

Block-diagonalization: Idea
If the data are grouped, the affinity matrix is approximately block-diagonal.
Clustering can then be treated as finding the permutation of the data that best block-diagonalizes $A$.
More specifically, the sum of the affinity values in the off-diagonal blocks is minimized, or equivalently the sum over the diagonal blocks is maximized.

Block-diagonalization: Formulation
Introduce an association vector (i.e., a projection) $w_k$ for each cluster component:
$$w_k = [w_{k1}, w_{k2}, \ldots, w_{kN}]^T$$
where $w_{ki}$ is the association of $x_i$ to cluster $k$.
A positive $w_{ki}$ indicates that $x_i$ is in cluster $k$ to some extent, and a negative value indicates otherwise.
Usually, such a projection vector is normalized, i.e., $w_k^T w_k = 1$.
Now we can formulate the problem, for $k = 1, \ldots, K$, as
$$w_k^* = \arg\max_{w_k} w_k^T A w_k \quad \text{s.t.} \quad w_k^T w_k = 1$$

Spectral Analysis
The solution is easy. The Lagrangian is
$$L = w_k^T A w_k + \lambda (1 - w_k^T w_k)$$
Setting the derivative to zero gives
$$\frac{\partial L}{\partial w_k} = 2 A w_k - 2 \lambda w_k = 0 \;\Longrightarrow\; A w_k = \lambda w_k$$
This is an eigenvalue problem.
$w_k$, an eigenvector of $A$, indicates the association of the data with cluster $k$.
The size of the cluster is given by the eigenvalue $\lambda$.
More significantly, we don't need to know $K$ in advance: the number of significant eigenvalues indicates $K$.
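
The eigenvector view can be checked numerically on an ideal block-diagonal affinity matrix. The sketch below uses `numpy.linalg.eigh` for the symmetric matrix $A$; the helper name `leading_eigenvectors` is hypothetical.

```python
import numpy as np

def leading_eigenvectors(A, k):
    """Return the k largest eigenvalues of the symmetric matrix A
    and the corresponding eigenvectors (as columns)."""
    vals, vecs = np.linalg.eigh(A)          # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:k]      # indices of the k largest
    return vals[order], vecs[:, order]

# Ideal affinity for two clusters of sizes 3 and 2 (affinity 1 within, 0 across).
A = np.zeros((5, 5))
A[:3, :3] = 1.0
A[3:, 3:] = 1.0
vals, vecs = leading_eigenvectors(A, 2)
# The nonzero entries of each leading eigenvector pick out one cluster,
# and the eigenvalues (3 and 2) equal the cluster sizes in this ideal case.
```

With real data the off-diagonal blocks are not exactly zero, so the eigenvector entries outside a cluster are small rather than zero and some thresholding is needed.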

A Problem
Ideally, we can check the values of $w_{ki}$ for grouping. But life is always complicated.
Suppose $A$ has two identical eigenvalues: $A w_1 = \lambda w_1$ and $A w_2 = \lambda w_2$.
It is easy to see that any linear combination of $w_1$ and $w_2$ is also a valid eigenvector:
$$A(a_1 w_1 + a_2 w_2) = \lambda (a_1 w_1 + a_2 w_2)$$
This means that we can no longer simply use the values of $w = a_1 w_1 + a_2 w_2$ for grouping.
Instead of using a 1-D subspace, we need to go to the 2-D subspace spanned by $\{w_1, w_2\}$.
If all $K$ clusters are more or less of the same size, we'll have $K$ similar eigenvalues, and we have to go to a $K$-dimensional subspace. This is the worst case.

Graph Cut
We may view the problem from another point of view: graph cut.
We still represent the data set by the affinity graph.
Suppose we want to divide the data set into two clusters. We need to find the set of weakest links between the two subgraphs, each of which corresponds to one cluster.
A set of edges whose removal separates the graph into two parts is called a cut.
So we need to find a minimum cut, i.e., the cut through the weakest links.
But there is a degenerate solution: separating a single isolated vertex gives the minimum cut.
In other words, the minimum cut does not balance the sizes of the clusters.

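Normalized Cut
So, the cut needs to be normalized.
Suppose we partition $V$ into $A$ and $B$ (here $A$ also names one side of the partition; $A_{ij}$ remains the affinity).
$z \in \{1, -1\}^N$ is the indicator: $z_i = 1$ if $x_i$ is in $A$, and $-1$ otherwise. In the formulas below, following Shi and Malik's notation, this indicator vector is written as $x$.
Let $d_i = \sum_j A_{ij}$ be the total connection from $x_i$ to all other points.
With $\mathrm{cut}(A, B)$ the total affinity between $A$ and $B$, and $\mathrm{asso}(A, V)$ the total affinity from $A$ to all vertices, define the normalized cut:
$$\mathrm{NCut}(A, B) = \frac{\mathrm{cut}(A, B)}{\mathrm{asso}(A, V)} + \frac{\mathrm{cut}(B, A)}{\mathrm{asso}(B, V)} = \frac{\sum_{x_i > 0,\, x_j < 0} -A_{ij} x_i x_j}{\sum_{x_i > 0} d_i} + \frac{\sum_{x_i < 0,\, x_j > 0} -A_{ij} x_i x_j}{\sum_{x_i < 0} d_i}$$
Denote
$$D = \mathrm{diag}\{d_1, \ldots, d_N\}, \qquad k = \frac{\sum_{x_i > 0} d_i}{\sum_i d_i}, \qquad b = \frac{k}{1 - k}$$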

Normalized Cut
Define $y = (1 + x) - b(1 - x)$, where $x$ is the $\pm 1$ indicator vector.
Shi and Malik [1] gave a nice formulation:
$$\min_x \mathrm{NCut}(x) = \min_y \frac{y^T (D - A) y}{y^T D y} \quad \text{s.t.} \quad y_i \in \{1, -b\}, \quad y^T D \mathbf{1} = 0$$
This amounts to solving a generalized eigenvalue problem under constraints:
$$(D - A) y = \lambda D y$$
They showed that the eigenvector associated with the second smallest eigenvalue can be used to bipartition the graph.
[1] J. Shi and J. Malik, Normalized Cuts and Image Segmentation, CVPR'97.
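
A relaxed version of this problem can be sketched with a standard symmetric eigen-solver. The sketch rewrites $(D - A) y = \lambda D y$ as the symmetric problem $D^{-1/2} (D - A) D^{-1/2} v = \lambda v$ with $y = D^{-1/2} v$, drops the discrete constraint on $y_i$, and splits the points by the sign of $y$. The sign threshold and the name `ncut_bipartition` are assumptions made for illustration; other splitting rules are possible.

```python
import numpy as np

def ncut_bipartition(A):
    """Split a graph in two using the eigenvector of the second smallest
    generalized eigenvalue of (D - A) y = lambda * D y.

    A : (N, N) symmetric affinity matrix with positive row sums
    Returns a boolean membership mask and the second smallest eigenvalue.
    """
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # Symmetric form: D^{-1/2} (D - A) D^{-1/2}
    L_sym = (np.diag(d) - A) * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L_sym)      # ascending eigenvalues
    y = d_inv_sqrt * vecs[:, 1]             # back to the generalized eigenvector
    return y > 0, vals[1]                   # assumption: threshold at zero
```

Which side is labeled `True` is arbitrary, since an eigenvector is only defined up to sign.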

Generative Model and Missing Data
Assume each image pixel is produced by a probability density associated with one of the $g$ image segments.
The data generation process: we first choose an image segment, and then generate the pixel from that segment's density, so that
$$p(x) = \sum_i p(x \mid \theta_i)\, \pi_i$$
where $\pi_i$ is the prior for the $i$-th image segment and $\theta_i$ is its parameter.
We can use a Gaussian for each component: $p(x \mid \theta_i) \sim G(\mu_i, \Sigma_i)$.
Associate a label $l_k$ with each $x_k$ to indicate its identity.
This mixture model is a generative model. The data labels are missing.

Formulation
So, our task is to do the inverse. Given a set of data points (image pixels) $X = \{x_k, k = 1, \ldots, N\}$, we need to estimate the parameters $\theta_i$, $\pi_i$, and estimate the labels of all the data points by
$$l_j = \arg\max_k p(l_j = k \mid x_j, \Theta), \quad \forall x_j$$
i.e., the label with the maximum posterior probability given $x_j$.
Maximum Likelihood Estimation: the likelihood of the data set can be written as
$$p(X \mid \Theta) = \prod_j \left( \sum_{i=1}^{g} p(x_j \mid \theta_i)\, \pi_i \right)$$
Usually, we use the log likelihood:
$$\log p(X \mid \Theta) = \sum_j \log\left( \sum_{i=1}^{g} p(x_j \mid \theta_i)\, \pi_i \right)$$
But this is very ugly (why?) and intractable: the summation sits inside the logarithm.

Missing Data and Indicator Variable
Introduce an indicator variable $z$:
$$z = [z_1, z_2, \ldots, z_g]^T$$
If a data point $x$ is drawn from the $k$-th component, then $z_k = 1$ and $z_i = 0$ for all $i \neq k$.
This indicator variable tells the identity of a data point.
It is the missing part! Why do we need it?

Good News!
Let's form the complete data:
$$y_k = \begin{bmatrix} x_k \\ z_k \end{bmatrix}$$
The complete data set is $Y = \{y_k, k = 1, \ldots, N\}$.
The likelihood of the complete data point $y_k$:
$$p(y_k \mid \Theta) = \sum_{i=1}^{g} z_{ki}\, p(x_k \mid \theta_i)$$
$$\log p(y_k \mid \Theta) = \sum_{i=1}^{g} z_{ki} \log p(x_k \mid \theta_i)$$
So, for the whole data set, we have
$$p(Y \mid \Theta) = \prod_{k=1}^{N} \sum_{i=1}^{g} z_{ki}\, p(x_k \mid \theta_i)$$

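Good News and Bad News
And thus:
$$\log p(Y \mid \Theta) = \sum_{k=1}^{N} \log\left( \sum_{i=1}^{g} z_{ki}\, p(x_k \mid \theta_i) \right) = \sum_{k=1}^{N} \sum_{i=1}^{g} z_{ki} \log p(x_k \mid \theta_i)$$
The second equality holds because exactly one $z_{ki}$ is 1 for each $k$.
Because we eliminate the summation inside the log, the ML estimation becomes easier:
$$\Theta^* = \arg\max_{\Theta} \log p(Y \mid \Theta)$$
However, the bad news is that the indicator variable $z_k$ makes the ML estimation difficult, since we do not know $z_k$.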

Expectation-Maximization Iteration
Fortunately, life won't be too bad.
An interesting observation: if we know $z_k$, i.e., the identity of each data point, we can easily estimate the density parameters $\Theta$ by ML.
At the same time, if we know the density parameters, we can easily solve for the indicator variables $z_k$ by MAP.
This suggests an iterative procedure:
E-step: compute the expected value of the complete data, here only $E[z_k]$;
M-step: maximize the log likelihood of the complete data to estimate $\Theta$.
The iteration converges to a local maximum of the likelihood.

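EM for Image Segmentation
Let's apply EM to image segmentation.
E-step:
$$E[z_{ki}] = 1 \cdot p(k\text{th pixel comes from } i\text{th component}) + 0 \cdot p(k\text{th pixel doesn't come from } i\text{th component})$$
$$= p(k\text{th pixel comes from } i\text{th component}) = \frac{\pi_i\, p(x_k \mid \theta_i)}{\sum_{j=1}^{g} \pi_j\, p(x_k \mid \theta_j)}$$
M-step, with $r$ the number of pixels:
$$\pi_i = \frac{1}{r} \sum_{l=1}^{r} p(i \mid x_l, \Theta)$$
$$\mu_i = \frac{\sum_{l=1}^{r} x_l\, p(i \mid x_l, \Theta)}{\sum_{l=1}^{r} p(i \mid x_l, \Theta)}$$
$$\Sigma_i = \frac{\sum_{l=1}^{r} p(i \mid x_l, \Theta)\, (x_l - \mu_i)(x_l - \mu_i)^T}{\sum_{l=1}^{r} p(i \mid x_l, \Theta)}$$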
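
The E-step and M-step above can be sketched for a Gaussian mixture as follows. This is an illustrative NumPy version under assumptions: the initial parameters are supplied by the caller, a fixed number of iterations replaces a convergence test, and a small diagonal term `reg` is added to each covariance to keep it invertible. The names `gaussian_pdf`, `em_gmm` and `reg` are hypothetical.

```python
import numpy as np

def gaussian_pdf(X, mu, cov):
    """Multivariate Gaussian density evaluated at every row of X."""
    d = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(cov)
    norm = np.sqrt(((2.0 * np.pi) ** d) * np.linalg.det(cov))
    expo = -0.5 * np.einsum("ni,ij,nj->n", diff, inv, diff)
    return np.exp(expo) / norm

def em_gmm(X, mus, covs, pis, n_iter=50, reg=1e-6):
    """EM for a mixture of g Gaussians on the (r, d) data array X."""
    mus = np.array(mus, dtype=float)
    covs = np.array(covs, dtype=float)
    pis = np.array(pis, dtype=float)
    r, d = X.shape
    g = len(pis)
    for _ in range(n_iter):
        # E-step: resp[l, i] = p(i | x_l, Theta)
        weighted = np.stack(
            [pis[i] * gaussian_pdf(X, mus[i], covs[i]) for i in range(g)], axis=1)
        resp = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: re-estimate priors, means and covariances.
        mass = resp.sum(axis=0)
        pis = mass / r
        mus = (resp.T @ X) / mass[:, None]
        for i in range(g):
            diff = X - mus[i]
            # reg * I is an added assumption to avoid singular covariances.
            covs[i] = (resp[:, i, None] * diff).T @ diff / mass[i] + reg * np.eye(d)
    return pis, mus, covs, resp
```

A hard segmentation follows by taking `resp.argmax(axis=1)`, i.e., assigning each pixel to the component with the largest posterior. For data far from every component the densities can underflow; working with log densities is a common remedy that this sketch omits.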

Issues Remained
Structural parameters
  EM assumes a known number of components. This is a common problem in clustering.
  What if we don't know it?
  Minimum Description Length (MDL) principle in theory
  Cross-validation in practice
Curse of dimensionality
  What if the dimensionality of $x$ is very high?
  Too many parameters to estimate
  Requires a huge amount of training data
  Otherwise, the estimation is heavily biased