In this lecture, we are going to talk about image segmentation, essentially defined as methods for grouping pixels together. We will first define the segmentation problem, overview some basic ideas of grouping in human vision, and go over several segmentation methods, including mean-shift, graph cuts, and energy-based methods, and conclude with a review of a few semantic segmentation works.

What is segmentation? In its most basic definition, segmentation groups some pixels in an image and forms bigger entities/components out of them. These entities/components/groups are called segments.

All pixels in one group share some property. What that property is depends on the grouping criterion (i.e., which pixels should be grouped together and which ones should go into different groups). There are numerous options: grouping based on color, texture, semantic meaning, etc.

This criterion can be based on different levels of abstraction. For instance, you may want to group pixels that simply share the same texture (a low level of abstraction). Alternatively, one may want to group pixels that share the same semantic meaning (e.g., all pixels on a dog) regardless of their diverse texture or color (a higher level of abstraction).

As an extreme example, let's look at this image. One person may want to segment out the big face, while another may want to segment the image into, say, 8 regions that just share a similar color. In that case, the big face would be decomposed into 3 or 4 regions.

Therefore, segmentation can happen at different levels of abstraction, and it doesn't even have to be on a 2D image. For instance, what you see on the right is a LiDAR scan where the object instances are segmented. In this case, the input is actually a 3D scan and not a 2D image. The bottom example shows a case where a surface model is segmented into its parts, defined based on the body parts of the four-legged animal.

Now we look at the general approaches to segmentation. Generally, there are two main categories of methods. The first one, bottom-up segmentation, groups pixels together because they are locally coherent (based on a defined coherency metric). The segments acquired from this approach usually show a lower level of abstraction and are commonly just coherent image regions.

Top-down segmentation is the opposite end of the spectrum. It groups pixels together based on their semantics. The resulting segments could actually show a diverse appearance and may even be in conflict with what bottom-up segmentation would yield (e.g., the guy in the picture who is wearing a green-ish shirt might be segmented together with the background by bottom-up segmentation, while top-down segmentation would be expected to give him a separate segment).

Why would we want to segment an image at all? There are several reasons, but here are a few. Segments are usually building blocks of many vision systems. We usually prefer to apply an operation on a segment rather than a bunch of independent pixels. It's also beneficial in terms of complexity: if one didn't have segments, the desired operation would have to be applied on pixels or some (sliding) windows, which are blind with respect to the image content. Segments are essentially pixels that are similar to each other, so without losing virtually anything, the complexity is reduced by roughly the number of pixels included in the segment. Also, segments can capture the precise boundary of an object, so they provide more information compared to bounding boxes or similar coarse crops.

Let's have a brief discussion on how our brain performs visual grouping.

Gestaltism is a theory in psychology. Gestalt is a German word meaning whole or form. Gestaltism advocates that the brain is holistic and performs grouping to extract a whole out of pieces.

Gestalt theory includes a number of properties that may result in grouping some pieces together. The examples shown here (each row is one example) demonstrate cases that are identical except in one property, and that property leads to grouping some of these pieces together. For instance, the second row would have 3 groups, since three pairs of pieces lie in close proximity to one another. As another example, the 5th case shows 6 pieces that are identical except for the direction of their motion; therefore, each motion direction would form one group, since those pieces share a common fate. The other two main properties shown here are similarity (3rd and 4th examples) and falling in a common region (the last two examples).

Our brain can also perform grouping based on occlusion. For instance, here we clearly see two things: one (gray) is occluding something else (black), though neither of them looks like any particular object we have seen before.

When we remove the occluder, the black object looks quite meaningless, and it becomes less likely that our brain would group its pieces together. Therefore, the occluder plays a key role in leading our brain to think there is one black object in the back.

Completion can be induced by invisible objects too. For instance, each of these cases looks like a plausible object, even though it is formed of disjoint pieces. What glues them together is basically an object that we don't even see, yet our brain performs the grouping.

Grouping can also happen by assigning some pieces to the figure and the rest to the ground. Which pieces go to the figure and which to the ground can be decided based on low-level bottom-up cues (just local coherency) or high-level recognition (e.g., seeing a semantic object).

Now let s look at this case of emergence. What do you see here? A tree? A road intersection? A dog? Nothing?

You may have seen different things, but what matters is that we all saw something. Most people see a dog (shown on the right), yet if we look at the pieces forming the dog locally, they're actually quite meaningless. What this shows is that our brain essentially aggregates meaningless pieces and perceives a bigger, meaningful whole in them, according to the things it often sees.

Now, let's see how we can segment an image algorithmically. How would you do it? Take a minute and think... Using clustering? Using edges/boundaries in the image? Using other cues/modalities like depth?

Clustering is likely to be the first idea. As a very basic and fast overview, what clustering does is: given a set of data points and a feature vector associated with each, it groups the data points so that each one belongs to one cluster/group. In the example shown here, the feature is 2-dimensional and there are 4 clusters among the data points.

We can cast the problem of segmentation as clustering, since they essentially do something very similar: if each data point is one pixel, clustering the data points would give us coherent pixel groups.

You should be familiar with k-means clustering, a common clustering method, from previous courses. Here we will discuss mean-shift. The main properties of k-means are that it requires specifying the number of clusters and assumes the clusters have a certain shape (spherical). The primary characteristics of mean-shift are that it doesn't require specifying the number of clusters and makes no assumption about cluster shapes.

Now let's see how mean-shift works.

Mean-shift is a generic clustering technique. It can be easily used for image segmentation. Here you can see a few examples.

The key idea of mean-shift is that it looks for modes of density in a given distribution. Let's see that using an example.

Here we see a set of data points (let's say pixels) in a 2-dimensional space. An obviously dense center can be seen in this distribution.

Mean-shift first starts at an arbitrary data point and looks at its neighborhood.

Then it finds the center of mass (mean) of the data points in this neighborhood.

Then the center is moved from the previous arbitrary data point to the center of mass.

This process continues till it converges.

The equation on the right shows how the center of mass (m) is found around a given data point (x). N(x) is the neighborhood around x, and K(.) is a weighting function or kernel that adjusts the contribution of each data point according to its distance from x. This equation is basically just a weighted mean. The update x ← m(x) says that at each iteration, x is moved to m(x), the new center of mass.
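To make this concrete, here is a minimal sketch of the update in Python, assuming a Gaussian kernel plays the role of K(.); the function name and parameter values are illustrative, not from the lecture.

```python
import numpy as np

def mean_shift_mode(x, points, bandwidth=1.0, n_iters=100, tol=1e-5):
    """Move one starting point x toward the nearest density mode.

    points: (N, d) array of data points; x: (d,) starting location.
    A Gaussian kernel acts as K(.) and also defines a soft neighborhood N(x)."""
    x = np.asarray(x, dtype=float)
    for _ in range(n_iters):
        d2 = np.sum((points - x) ** 2, axis=1)
        w = np.exp(-d2 / (2.0 * bandwidth ** 2))          # kernel weights K(x_i - x)
        m = (w[:, None] * points).sum(axis=0) / w.sum()   # weighted mean m(x)
        if np.linalg.norm(m - x) < tol:                   # converged at a mode
            break
        x = m                                             # the update x <- m(x)
    return x
```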

This procedure is again shown in these animated slides. We start from one point, find the center of mass, and move to it.

Again, we find the center of mass, and move to it.

And again.

And again... Now we've converged and found the mode of density.

This process will be initiated at a large number of arbitrary locations (or could be initiated at each data point). All windows that end up converging to the same peak will be merged.

Think of starting this mean-shift search process at each data point. Many of them obviously end up over the same peak. Those points are all in the same attraction basin. Each mode has its own attraction basin, and all data points in one attraction basin form one cluster.

Here you can see the input data points to mean-shift (left), the found clusters (right), and the attraction basins and the trajectories leading to them on the 3D plot (bottom), where the z dimension shows the density. Observe how the attraction basins form clusters that are consistent with the density of the data points.

Now, if we want to do image segmentation using mean-shift, we follow very similar steps. First we define some features (color, texture, etc.), extract the features for each pixel, and follow the described mean-shift process to find the segments (i.e., clusters).
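A rough sketch of this pipeline, using color plus (down-weighted) pixel position as the per-pixel feature and scikit-learn's MeanShift as the clustering step; the weighting scheme and parameter values are illustrative.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

def segment_with_mean_shift(image, spatial_weight=0.3, quantile=0.1):
    """image: (H, W, 3) float array with values in [0, 1].
    Returns an (H, W) array of segment ids."""
    h, w, _ = image.shape
    yy, xx = np.mgrid[0:h, 0:w]
    # Per-pixel feature: color plus down-weighted normalized position,
    # so nearby pixels with similar color end up in the same cluster.
    position = np.stack([yy, xx], axis=-1).reshape(-1, 2) / max(h, w)
    features = np.hstack([image.reshape(-1, 3), spatial_weight * position])
    bandwidth = estimate_bandwidth(features, quantile=quantile, n_samples=500)
    labels = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit(features).labels_
    return labels.reshape(h, w)
```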

Here are some results. It works fairly well as a bottom-up segmentation method.

Some more results.

The advantages of mean-shift are: not requiring the number of clusters a priori, having few parameters to set, being robust to outliers, and not being limited to spherical clusters. The main disadvantages are: requiring kernel parameters and a window size (though this is natural: since mean-shift doesn't take the number of clusters as input, the user has to specify some other criterion for grouping), being computationally expensive, and not scaling well when the input feature is high-dimensional.

Now, let's look at another way of performing clustering: graph-based methods.

We represent the image as a graph and cut the graph into subgraphs in a way that the subgraphs correspond to image segments. The input graph should capture the similarity among pixels, so by cutting the low-similarity edges and preserving the high-similarity ones, proper segments can be found.

In the input graph, each node represents one pixel, and there is an edge between every pair of pixels (though sometimes we only connect pixels that are spatially nearby, mostly to reduce complexity). The weight of each edge captures how similar the nodes/pixels are. We should mention that in the next few slides, we will discuss binary graph partitioning (i.e., the graph is cut into two disjoint pieces), which therefore yields two segments. There are ways to go from a binary partitioning to a multi-segment segmentation; the simplest is to recursively apply the same binary segmentation method to the segments found in the previous iteration. You'll try this in the homework.

There are different ways the similarity metric used in the edge weights can be defined. Here are some common ones. Spatial distance: the closer the pixels, the smaller the distance between them. Intensity: the more similar the intensities, the smaller the distance. Color: the more similar the colors, the smaller the distance. The total distance can be a mixture (often linear) of these components, and the edge weight (affinity) is high when this distance is small.
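A minimal sketch of one common choice of edge weight, assuming a Gaussian falloff in both spatial and feature distance; the parameter names and values are illustrative.

```python
import numpy as np

def edge_weight(pos_i, pos_j, feat_i, feat_j, sigma_x=4.0, sigma_f=0.1):
    """Affinity between two pixels: high when they are spatially close
    and their features (e.g., intensity or color) are similar.
    sigma_x and sigma_f control how quickly the weight falls off."""
    d_space = np.sum((np.asarray(pos_i, float) - np.asarray(pos_j, float)) ** 2)
    d_feat = np.sum((np.asarray(feat_i, float) - np.asarray(feat_j, float)) ** 2)
    return np.exp(-d_space / (2 * sigma_x ** 2)) * np.exp(-d_feat / (2 * sigma_f ** 2))
```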

Now we cut this graph to find the sub-graphs. Each sub-graph represents one segment. Cutting means breaking the graph into disjoint components; cutting happens by removing edges. Obviously, we need to remove the edges with low affinity and preserve the ones with high affinity to get proper segments.

Graph-cut is a technique for finding the cut given an input graph.

We need to assign a cost to a feasible cut in order to find the best cut. The cost of a cut is defined as the sum of the weights of the edges the cut removes. Based on this definition, what's the cost of the blue cut? ... It's 2 + 3 = 5.
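A small sketch of this cost given an affinity matrix; the 3-node graph in the comment is made up to mirror the 2 + 3 = 5 case above.

```python
import numpy as np

def cut_cost(W, in_A):
    """W: symmetric affinity matrix (N x N); in_A: boolean mask of the nodes in A.
    The cost is the total weight of the edges crossing from A to its complement."""
    in_A = np.asarray(in_A, dtype=bool)
    return W[np.ix_(in_A, ~in_A)].sum()

# Hypothetical example: node 0 connects to nodes 1 and 2 with weights 2 and 3.
# W = np.array([[0, 2, 3], [2, 0, 4], [3, 4, 0]])
# cut_cost(W, [True, False, False])  -> 2 + 3 = 5
```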

As mentioned before, the problem of segmenting the image into two segments is now equivalent to finding the optimal cut on the input graph. There are good methods for solving this efficiently. However, this has a fatal bias: the cut with the lowest cost usually just cuts out one or a few isolated nodes. That's because those nodes are weakly connected and the partitions they would form are small, so the cut associated with them has a small cost, while the desired cut that properly segments the image may require cutting out a significant part of it.

Normalized Cut was introduced in 2000 to solve this issue. The key idea is that the cost of the cut should be normalized based on the component size. That way, isolated or small groups of nodes won't be cheap to cut anymore, since after normalization by their small size, their relative cost increases. The idea is implemented easily by updating the cost of a feasible cut to this equation. A and B are the two sub-graphs the cut would yield. cut(A,B) is the traditional cost of the cut defined before (i.e., the sum of the weights of the edges being removed). V is the set of all nodes in the graph. assoc(A,V) is the sum of the weights of all edges in the graph that touch the sub-graph A. Therefore, if a sub-graph the cut produces is small, its cost will be proportionally higher.
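For reference, the normalized cut cost from Shi and Malik (2000) can be written as Ncut(A, B) = cut(A, B) / assoc(A, V) + cut(A, B) / assoc(B, V), where assoc(A, V) is the total weight of the edges touching A (and similarly for B).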

This would be a low-cost (2) unnormalized cut. But after normalization, the cost increases, since there is only one node in the resulting sub-graph.

On the other hand, this cut (which is actually the desired cut) will be low cost after normalization, due to the sub-graph sizes. This cut wouldn't be preferred over the undesirable cut in the previous slide by the unnormalized cost (cost 2 vs. 3).

Now, let's see how we can implement normalized cut. G(V,E,W) defines a graph where V are the nodes, E are the edges, and W are the edge weights. Each pixel is one node in V. x is a vector that represents a feasible cut. Each element in x represents one node in V; if the node is included in the cut, the corresponding element is +1, and -1 otherwise. D is a diagonal matrix; D_ii (the element on the main diagonal for the i-th node) is equal to the sum of the weights of all edges connected to the i-th node. Now, k is the sum of the weights of all edges touching A over the total weight of the graph, and b is therefore the proportion of the weights of A versus the weights of B (remember that A and B are disjoint and their union is the entire graph, i.e., V). y is a reformulation of x based on this proportion: the elements included in A are +1 while the rest are -b. The two equations in the bottom left are direct facts that can be derived from the definitions explained above.
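Collecting these definitions, roughly in the paper's notation: d_i = sum_j w_ij is the total edge weight touching node i; D = diag(d_1, ..., d_N); x_i is +1 if node i belongs to A and -1 otherwise; k = (sum of d_i over nodes with x_i > 0) / (sum of d_i over all nodes); b = k / (1 - k); and y = (1 + x) - b(1 - x), so that y_i is +1 for nodes in A and -b for the rest (see the normalized cut paper for the exact statement).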

Now it can be shown that finding the optimal cut (i.e., x) is equivalent to solving the optimization problem shown in the top right. The full derivation can be found in the normalized cut paper (referenced in the homework). To make sure the x satisfying this optimization is a feasible cut, the constraints shown below it (e.g., y^T D 1 = 0) should be satisfied; therefore, the optimization should be solved subject to these constraints. The paper also shows that this problem can be approximately solved through a generalized eigensystem, (D - W) y = lambda D y. The eigenvector associated with the second smallest eigenvalue is an approximate solution for x. However, solving the eigensystem may yield an infeasible cut (i.e., it may not satisfy the constraints exactly). In those cases, a subsequent quantization step is usually employed to snap the found solution to the closest feasible cut.
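A minimal dense-matrix sketch of this procedure in Python, assuming the affinity matrix W is small enough to handle densely; thresholding the eigenvector at zero is a simplified version of the quantization step.

```python
import numpy as np
from scipy.linalg import eigh

def ncut_bipartition(W):
    """Approximate normalized cut on a small graph.

    W: dense symmetric affinity matrix (N x N). Solves the generalized
    eigenproblem (D - W) y = lambda * D y and thresholds the eigenvector
    of the second smallest eigenvalue."""
    d = W.sum(axis=1)
    D = np.diag(d)
    L = D - W                      # graph Laplacian
    eigvals, eigvecs = eigh(L, D)  # generalized eigenvalues in ascending order
    y = eigvecs[:, 1]              # eigenvector of the second smallest eigenvalue
    return y > 0                   # boolean mask: True = segment A, False = segment B
```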

We can intuitively interpret the idea of graph cuts (or segmentation in general) by looking at it as a dynamical system. Let's say we have a large number of particles connected to each other by springs. The springs have different elasticities and form local regions. If we shake this system, the particles that are connected with strong springs will vibrate together, while the ones with looser springs will have more dissimilar vibration patterns. The vibration modes we would observe in this system would correspond to the segments, if each particle were a pixel and the spring strength were proportional to the similarity between pixels.

Normalized cut is a generic and flexible framework that can be adopted for many problems, including segmentation, with decent results. However, it requires a high amount of storage and computation. It also has a bias towards partitioning the graph into near-equal-sized partitions (due to the size normalization).

Let's briefly look at another way of performing segmentation based on graphs.

This time, let's assume we want to segment an image into its foreground and background (2 clusters) and we can utilize some quick help from the user. The user draws a few lines/markers on the image specifying regions that belong to the foreground or background (white and red strokes, respectively). How could we solve this problem now? How do we incorporate the user input and encourage the segmentation results to be consistent with it and yield the foreground and background?

We can solve this by defining a labeling problem (L) in which we label each pixel with either 0 or 1, representing background and foreground, respectively. We then define an energy function that assigns an energy to each feasible labeling instance (L). The labeling instance L with the lowest energy corresponds to the desired segmentation. At a high level, there are two terms in this energy function, and they're combined linearly; lambda is the mixing constant specifying the contribution of each term in the final energy. The first term is the match cost, which specifies how likely it is for a pixel to be labeled as either foreground or background. This is usually obtained by matching the similarity of the pixel to the user-specified regions. The second term is the smoothness cost. This is similar to the coherency cues we discussed in the previous segmentation methods: it encourages the segments to be smooth and coherent.
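A typical form of such an energy (the notation here is generic, not the exact one from the slides) is E(L) = sum over pixels p of D_p(L_p) + lambda * sum over neighboring pixel pairs (p, q) of S(L_p, L_q), where D_p is the match cost of assigning label L_p to pixel p and S penalizes neighboring pixels that receive different labels.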

For a given image with sample user inputs, here we show the corresponding energies for each pixel to belong to either foreground or background (i.e., the match cost in the energy function). The local inconsistencies and spurious outliers are expected to be corrected by the smoothness cost.

One popular example of this class of segmentation methods is GrabCut, introduced in 2004 by Rother et al. It receives a rough boundary (a bounding box) from the user around the foreground and segments it out. It works quite well and is implemented in Microsoft PowerPoint: the Remove Background feature is in fact GrabCut.
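A short sketch of how this can be run with OpenCV's cv2.grabCut, assuming the user-provided box is given as (x, y, w, h); the helper name and iteration count are illustrative.

```python
import cv2
import numpy as np

def grabcut_foreground(image_bgr, rect):
    """Run OpenCV's GrabCut given a user box rect = (x, y, w, h) around the foreground."""
    mask = np.zeros(image_bgr.shape[:2], np.uint8)
    bgd_model = np.zeros((1, 65), np.float64)   # internal GMM parameters (background)
    fgd_model = np.zeros((1, 65), np.float64)   # internal GMM parameters (foreground)
    cv2.grabCut(image_bgr, mask, rect, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_RECT)
    # Keep pixels labeled definite or probable foreground.
    fg = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0).astype(np.uint8)
    return image_bgr * fg[:, :, None]
```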

What we discussed so far was almost entirely bottom-up segmentation. As you saw, the resulting segments might not have any high-level, semantic meaning. Let's look briefly at a couple of semantic segmentation methods. You'll see more about this topic and scene understanding in the next lectures.

Semantic segmentation seeks the segments in the image that have a semantic meaning, e.g., all pixels on one object, even though the appearance of those pixels may not be that coherent. In order to solve this problem, we need to define semantics and develop an algorithm that has a notion of it. This is usually where machine learning comes into the picture. One recent and popular way of solving this problem is using fully convolutional neural networks. The neural network receives an image as its input and returns a per-pixel mask as its output, where each pixel is labeled with its semantic class. This is done through a series of consecutive convolution operations. The parameters of the convolutions are entirely learned using a fully supervised dataset; that's why they're expected to have a notion of semantics, objects, and whatever high-level labels they've seen in their training data.
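As a rough illustration (not any specific architecture from the lecture), here is a minimal PyTorch sketch of a fully convolutional network that maps an image to per-pixel class logits; the layer sizes are arbitrary.

```python
import torch.nn as nn
import torch.nn.functional as F

class TinyFCN(nn.Module):
    """A toy fully convolutional network: image in, per-pixel class logits out."""
    def __init__(self, num_classes):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                              # 1/2 resolution
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                              # 1/4 resolution
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
        )
        self.classifier = nn.Conv2d(128, num_classes, 1)  # 1x1 conv -> class scores

    def forward(self, x):
        h, w = x.shape[-2:]
        logits = self.classifier(self.encoder(x))
        # Upsample the coarse score map so every input pixel gets a label.
        return F.interpolate(logits, size=(h, w), mode='bilinear', align_corners=False)

# Usage: labels = TinyFCN(num_classes=21)(images).argmax(dim=1)  # (B, H, W) class ids
```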

Duygulu et al. also proposed a notable method in this context back in 2002. The overall goal is similar to what we have discussed so far in top-down segmentation. Their method centers on finding a mapping between a set of regions in the image and keywords supplied with the image, using EM. The result is the regions in the image along with their corresponding semantic keywords.

Another popular method for semantic segmentation belongs to Ladicky et al. While their overall goal is the same as that of the previously overviewed methods, their main distinction is using a hierarchical random field model that enables integration of features from various levels of quantization of the image. There are a large number of other well-known semantic segmentation methods out there that we don't have time to overview in this lecture, but the few that we showed should establish what they seek and what their conceptual requirements are.