An Introduction to PDF Estimation and Clustering
David Corrigan (corrigad@tcd.ie)
Electrical and Electronic Engineering Dept., University of Dublin, Trinity College.
See www.sigmedia.tv for more information.
PDF Estimation
Quantify the characteristics of a signal, x[n], by measuring its PDF, p(x_n = x).
Ubiquitous in signal processing applications: image segmentation, restoration, texture synthesis.
[Figure: an example PDF, probability against intensity (0-255).]
PDF Estimation
Estimators fall into two categories:
1. Parametric Estimation
A known model for the PDF is fitted to the data (e.g. a Gaussian distribution for a noise signal).
The PDF is then represented by the parameters (mean, variance, etc.).
2. Non-Parametric Estimation
No assumed model for the PDF.
The PDF is estimated by measuring the signal directly.
A correct parametric model gives a better estimate from less data than non-parametric techniques. A sketch of the parametric route follows below.
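As a concrete illustration of parametric estimation, here is a minimal NumPy sketch that fits a Gaussian model to a signal by estimating its mean and variance; the synthetic data and all names are purely illustrative, not part of the slides.

```python
# A minimal sketch of parametric PDF estimation: assume a Gaussian model
# and represent the PDF by its two parameters (mean and variance).
import numpy as np

x = np.random.default_rng(0).normal(loc=2.0, scale=1.5, size=10000)  # synthetic signal

mu = x.mean()   # maximum-likelihood estimate of the mean
var = x.var()   # maximum-likelihood estimate of the variance

def gaussian_pdf(t, mu, var):
    """Evaluate the fitted Gaussian PDF at t."""
    return np.exp(-(t - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

print(mu, var, gaussian_pdf(2.0, mu, var))
```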
Non-Parametric Estimators
The best known estimator is the histogram.
It finds the frequency (and hence probability) of a signal value lying in a range.
[Figure: two histograms of the same intensity data, with bin widths of 1 and 5.]
Histograms are poor if they are not adequately populated.
One can increase the bin width or smooth the histogram.
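A short sketch of histogram-based estimation with the two bin widths mentioned above; the synthetic "intensity" data is illustrative only.

```python
# Histogram PDF estimation with two bin widths, using np.histogram with
# density=True so the estimate integrates to 1 over the intensity range.
import numpy as np

x = np.random.default_rng(0).normal(128, 30, size=5000)  # stand-in intensity data

for width in (1, 5):
    bins = np.arange(0, 256 + width, width)
    density, edges = np.histogram(x, bins=bins, density=True)
    # probability of a value landing in a bin = density * bin width
    print(f"bin width {width}: largest bin probability = {(density * width).max():.4f}")
```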
Kernel Density Estimation
Another non-parametric estimator. Given a signal x[n], the PDF is

p(x_n = x) = \frac{1}{Nh} \sum_{i=1}^{N} K\!\left(\frac{x - x[i]}{h}\right) \quad (1)

K(x) is the kernel and h is the bandwidth. Common kernels include Gaussian kernels and the Epanechnikov kernel

K(x) = \begin{cases} k(1 - x^2) & x^2 < 1 \\ 0 & \text{otherwise} \end{cases} \quad (2)

[Figure: the Epanechnikov kernel, K(x) plotted against x for -2 < x < 2.]
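A direct implementation of equation (1) with the Epanechnikov kernel of equation (2); the normalising constant k = 3/4 (which makes the 1-D kernel integrate to 1) and the sample data are assumptions of this sketch.

```python
# KDE of equation (1) using the Epanechnikov kernel of equation (2).
import numpy as np

def epanechnikov(u):
    # k = 3/4 so that the kernel integrates to 1 in one dimension
    return np.where(np.abs(u) < 1, 0.75 * (1 - u ** 2), 0.0)

def kde(x, samples, h):
    """Estimate the PDF at x from the signal samples x[i] with bandwidth h."""
    u = (x - samples) / h
    return epanechnikov(u).sum() / (len(samples) * h)

samples = np.random.default_rng(0).normal(0, 1, size=1000)
print(kde(0.0, samples, h=0.5))   # estimate of the PDF at x = 0
```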
Kernel Density Estimation
A kernel density estimate is visually comparable to a smoothed histogram (but quite different in concept).
The bandwidth controls the smoothness of the KDE.
The PDF can be estimated at any signal value (real or complex). We don't need to worry about quantising the signal or choosing bin widths.
It is slow: O(N) to estimate the PDF at a single signal value.
Gaussian Mixture Models (GMMs)
The PDF is a weighted sum of Gaussian distributions (which can be multivariate for vector-valued signals). The model has K components:

p(x = x) = \sum_{k=1}^{K} \pi(k) \, \mathcal{N}(x; \mu(k), R(k)) \quad (3)

\pi(k) is the weight of each Gaussian, such that \sum_k \pi(k) = 1.
\mu(k) and R(k) are the mean and variance (or covariance) of the k-th component.
To create the GMM, certain questions need to be answered:
1. How many clusters do we choose?
2. What are the initial estimates for the weights, means and variances?
3. How do we make sure that we have the optimum model for our data?
To answer these questions we need to talk a bit about clustering. A sketch of evaluating equation (3) follows below.
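A minimal sketch of evaluating the GMM density of equation (3) for a 1-D signal; the particular weights, means and variances chosen here are illustrative values, not from the slides.

```python
# Evaluate the GMM density of equation (3) at a point x (1-D case).
import numpy as np

pi = np.array([0.3, 0.7])   # component weights, summing to 1
mu = np.array([0.0, 5.0])   # component means
R = np.array([1.0, 2.0])    # component variances

def gmm_pdf(x, pi, mu, R):
    comps = np.exp(-(x - mu) ** 2 / (2 * R)) / np.sqrt(2 * np.pi * R)
    return np.sum(pi * comps)   # weighted sum over the K components

print(gmm_pdf(1.0, pi, mu, R))
```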
Clustering
Clustering involves partitioning the set of signal values into subsets of similar values.
Used in signal modelling, segmentation, vector quantisation, image compression, ...
Consider the following 2D vector-valued signal:
[Figure: a vector-valued signal d(x) = (dx(x), dy(x)), and the scatter plot of d(x) in the (dx, dy) plane.]
k-means Clustering
An algorithm that divides data into an arbitrary number of clusters, K.
The algorithm attempts to minimise the total distance, V, between each data value x_j and its cluster centroid:

V = \sum_{k=1}^{K} \sum_{x_j \in C_k} \lVert x_j - c_k \rVert^2, \text{ where } C_k \text{ is the } k\text{-th cluster} \quad (4)

k-means operates as follows (a sketch of this loop follows below):
1. The user selects the number of clusters and assigns a value to each cluster centroid.
2. Every data point is assigned to the cluster of the nearest centroid. This partitions the data into clusters.
3. The centroids of the new clusters are estimated (by computing the mean data value of each cluster).
4. Steps 2 and 3 are iterated until the centroid values have suitably converged.
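A minimal NumPy sketch of steps 1-4 above; initialising the centroids from K random data points is one common convention assumed here, not prescribed by the slide.

```python
# k-means following steps 1-4 above, for (N, d) data.
import numpy as np

def kmeans(data, K, n_iter=100, seed=0):
    """Cluster (N, d) data into K clusters; returns centroids and labels."""
    rng = np.random.default_rng(seed)
    # step 1: initialise the centroids as K distinct random data points
    centroids = data[rng.choice(len(data), size=K, replace=False)]
    for _ in range(n_iter):
        # step 2: assign each point to the cluster of the nearest centroid
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: re-estimate each centroid as the mean of its cluster
        # (an empty cluster keeps its old centroid)
        new = np.array([data[labels == k].mean(axis=0) if np.any(labels == k)
                        else centroids[k] for k in range(K)])
        # step 4: stop once the centroids have converged
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, labels

centroids, labels = kmeans(np.random.default_rng(1).normal(size=(200, 2)), K=3)
```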
k-means Clustering
A nice demo: http://home.dei.polimi.it/matteucc/clustering/tutorial_html/appletkm.html
Matlab has its own kmeans function.
[Figure: the scatter plot of d(x) before and after partitioning into clusters.]
The related Fuzzy C-Means algorithm allows data points to belong to more than one cluster.
Back to GMMs
How many components to pick? Usually an arbitrary number, but we could use the mean shift or watershed algorithms (later).
How do we estimate the initial GMM parameter values? By using a clustering algorithm like k-means to get a rough clustering of the data set. The mean and covariance of each component are estimated on the corresponding cluster. The component weight is the fraction of the overall number of data points in the cluster (see the sketch below).
How do we get the model parameters to best fit the data? By using the Expectation Maximisation (EM) algorithm.
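A sketch of the initialisation just described: given a k-means partition, each cluster yields one component's weight, mean and covariance. The function and argument names are illustrative; `labels` is assumed to come from a clustering run such as the k-means sketch above.

```python
# Initialise GMM parameters from a k-means partition of (N, d) data.
import numpy as np

def init_gmm(data, labels, K):
    weights, means, covs = [], [], []
    for k in range(K):
        members = data[labels == k]
        weights.append(len(members) / len(data))     # fraction of points in cluster
        means.append(members.mean(axis=0))           # cluster mean
        covs.append(np.cov(members, rowvar=False))   # cluster covariance
    return np.array(weights), np.array(means), np.array(covs)
```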
Expectation Maximisation
EM finds the maximum likelihood estimates for the model parameters. The algorithm has two steps:
1. The E-Step: the current model parameters are used to cluster the data, via the maximum likelihood assignment

\hat{k}(x) = \arg\max_k \; \pi(k) \, \mathcal{N}(x; \mu(k), R(k)) \quad (5)

2. The M-Step: from the data set and the clustered data, the new parameter values are estimated. Given the data, find the model parameters that best fit the clustering obtained from the E-Step. For a GMM this simplifies to estimating the mean and covariance of the data points in each cluster. Again, the weights are the fraction of points in each cluster.
The parameters are optimised by alternating between the two steps until they converge. A sketch of this loop follows below.
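A sketch of the two-step loop above for 1-D data. Note this follows the slide's hard-assignment formulation of equation (5); textbook EM for GMMs uses soft responsibilities instead, so treat this as the variant described here rather than the general algorithm.

```python
# Hard-assignment EM for a 1-D GMM, alternating equation (5) with
# per-cluster re-estimation of the weights, means and variances.
import numpy as np

def gaussian(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_gmm_1d(data, pi, mu, R, n_iter=50):
    for _ in range(n_iter):
        # E-step: equation (5), assign each point to its most likely component
        lik = pi[None, :] * gaussian(data[:, None], mu[None, :], R[None, :])
        labels = lik.argmax(axis=1)
        # M-step: re-estimate weight, mean and variance of each component
        for k in range(len(pi)):
            members = data[labels == k]
            if len(members) == 0:
                continue                      # skip an emptied component
            pi[k] = len(members) / len(data)
            mu[k] = members.mean()
            R[k] = members.var() + 1e-9       # small floor avoids zero variance
    return pi, mu, R

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 500), rng.normal(5, 1, 500)])
print(em_gmm_1d(data, np.array([0.5, 0.5]), np.array([0.0, 1.0]), np.array([1.0, 1.0])))
```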
Expectation Maximisation
The algorithm is broadly similar to k-means clustering. Both algorithms have a clustering stage followed by a parameter estimation stage.
In k-means the Euclidean distance from the centroid is used. In EM, the Euclidean distance from the centroid (i.e. the mean) is normalised by the component covariance and weight.
A nice demo applet: http://lcn.epfl.ch/tutorial/english/gaussian/html/
Mean Shift
Mean shift clusters the data by finding the peaks of the kernel density estimate. The number of clusters is automatically determined.
Remember the equation for the KDE:

f(x) = \frac{1}{Nh} \sum_{i=1}^{N} K\!\left(\frac{x - x[i]}{h}\right) \quad (6)

At a peak, the gradient of the density is 0. The gradient is

\nabla f(x) = \frac{1}{Nh} \sum_{i=1}^{N} \nabla K\!\left(\frac{x - x[i]}{h}\right) \quad (7)
Mean Shift
If we use the Epanechnikov kernel for K(x), its gradient is linear inside the bandwidth. Therefore

\nabla f(x) \propto \frac{1}{N_x} \sum_{x[i] \in S_h(x)} (x[i] - x) \quad (8)

S_h(x) is an N-dimensional sphere of radius h centred on x, and N_x is the number of data points inside it.
The right-hand side of equation (8) is the difference between the current point x and the mean of all the data points in the sphere centred on x. It is known as the mean shift vector. A sketch of computing it follows below.
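A minimal sketch of computing the mean shift vector of equation (8); the function name is illustrative, and the sphere is assumed non-empty (true whenever x itself is a data point).

```python
# Mean shift vector of equation (8): the mean of the points inside the
# sphere S_h(x), minus the point x itself.
import numpy as np

def mean_shift_vector(x, data, h):
    # indices of the data points within radius h of x (assumed non-empty)
    inside = np.linalg.norm(data - x, axis=1) < h
    return data[inside].mean(axis=0) - x
```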
Mean Shift
There are some important things to consider:
The direction of the mean shift vector is the direction of the gradient.
The gradient vector points in the direction of maximum change.
The gradient vector at a peak is 0. Therefore the mean shift vector is also 0.
The peak can be found by following the mean shift vector to regions of higher density until the mean shift vector is 0.
Mean Shift
This can be implemented as follows (see the sketch after this list):
1. Pick a data point at random.
2. Find the mean of all points in the sphere centred on the data point.
3. Repeat by searching the sphere centred on the mean from step 2.
4. Stop when successive means are the same. The final mean is the value of the peak.
[Figure: mode seeking on the scatter data using mean shift; the chosen bandwidth is 2.5.]
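A sketch of steps 1-4 above for a single starting point; the convergence tolerance is an assumption (the slide stops when successive means are exactly equal), and the sphere is assumed non-empty, which holds when the start is a data point.

```python
# Mode seeking by mean shift: follow the mean of the sphere until it
# stops moving; the final mean is the peak for the starting point.
import numpy as np

def find_peak(x, data, h, tol=1e-3):
    """Follow the mean shift vector from x to the nearest density peak."""
    while True:
        # steps 2-3: mean of all data points in the sphere of radius h
        inside = np.linalg.norm(data - x, axis=1) < h
        mean = data[inside].mean(axis=0)
        # step 4: stop when successive means (almost) coincide
        if np.linalg.norm(mean - x) < tol:
            return mean
        x = mean
```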
Mean Shift
To cluster the data, this procedure is applied to every point in the data set. Every data point will have a characteristic peak. All data points with the same peak are assigned to the same cluster (see the sketch below).
[Figure: clustering of the scatter data using mean shift; the chosen bandwidth is 2.5.]
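A sketch of the full clustering, building on the `find_peak` function from the previous sketch; merging peaks that lie within h/2 of one another is an assumption used here to decide that two points share "the same" peak.

```python
# Mean shift clustering: find the peak of every data point and group
# points whose peaks coincide into the same cluster.
import numpy as np

def mean_shift_cluster(data, h):
    # one mode-seeking run per point gives the O(n^2) overall cost
    peaks = np.array([find_peak(x, data, h) for x in data])
    labels = -np.ones(len(data), dtype=int)
    centres = []
    for i, p in enumerate(peaks):
        for j, c in enumerate(centres):
            if np.linalg.norm(p - c) < h / 2:   # same mode, same cluster
                labels[i] = j
                break
        else:
            centres.append(p)                   # a new mode starts a new cluster
            labels[i] = len(centres) - 1
    return labels, np.array(centres)
```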
Mean Shift
It's good:
It tells us something about the complexity of the signal. There is no need to guess the number of clusters.
The bandwidth parameter gives some degree of control.
It's bad:
The algorithm is very slow (O(n^2)): the distance between every pair of points in the data set must be known. Several publications have attempted to address this, including Akash's.
There is a tendency to get small clusters in regions of low density, so post-processing may be necessary.
Reading
For more information on KDEs and mean shift:
Comaniciu and Meer, Mean Shift Analysis and Applications, http://www.caip.rutgers.edu/~comanici/papers/msanalysis.pdf
Akash's paper from ICIP 2008 on the Path-Assigned Mean-Shift algorithm.
For k-means clustering:
Some lecture notes: http://home.dei.polimi.it/matteucc/clustering/tutorial_html/kmeans.html
For training GMMs using EM:
The Wikipedia entry for Expectation Maximisation gives more detail on EM and shows how it applies to GMMs.