Review Lecture 14
PRINCIPAL COMPONENT ANALYSIS
Eigenvectors of the covariance matrix are the principal components; the top K principal components are the eigenvectors with the K largest eigenvalues.
1. Σ = cov(X)
2. W = eigs(Σ, K)
3. Y = W^T (X − µ)   (Projection = Data × Top K eigenvectors)
Reconstruction = Projection × Transpose of top K eigenvectors
Independently discovered by Pearson in 1901 and Hotelling in 1933.
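A minimal numpy sketch of these three steps (my own illustration, not the lecture's code; the helper name and return values are my choices):

```python
import numpy as np

def pca(X, K):
    """Project the n x d data matrix X onto its top-K principal components."""
    mu = X.mean(axis=0)                          # center of the data
    Sigma = np.cov(X, rowvar=False)              # 1. Sigma = cov(X), a d x d matrix
    evals, evecs = np.linalg.eigh(Sigma)         # 2. eigendecomposition (eigenvalues ascending)
    W = evecs[:, np.argsort(evals)[::-1][:K]]    #    keep the K eigenvectors with largest eigenvalues
    Y = (X - mu) @ W                             # 3. projection onto the top-K directions
    X_rec = Y @ W.T + mu                         #    reconstruction back in d dimensions
    return Y, X_rec, W
```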
When to use PCA
- Great when the data is truly low-dimensional (lies on a linear hyperplane), or approximately low-dimensional (points almost lie on a plane, e.g. a very flat ellipsoid).
- E.g. dimensionality reduction for face images, or as preprocessing for multiple biometric applications.
When to use PCA
Warning: scale matters (if some features are in centimeters and others are in meters, the PCA directions can change).
Problem 1 of Homework 1
Problem 1 of Homework 1
Part 1: draw points on a 45-degree line, e.g. all coordinates are identical and their common value is drawn from a uniform distribution.
Part 2 (scale matters): say the first coordinate has a variance many orders of magnitude larger than the other coordinates, and the coordinates are all drawn independently.
When to use PCA
Warning: the direction of importance need not be the one with the largest spread.
Problem 2 of Homework 1
CCA ALGORITHM
X = [X^1, X^2]: write x_t = [x_t^1; x_t^2], the (d_1 + d_2)-dimensional concatenation of the two views.
1. Σ = cov(X): calculate the covariance matrix of the joint data points, with blocks Σ_{1,1}, Σ_{1,2}, Σ_{2,1}, Σ_{2,2}.
2. Calculate Σ_{1,1}^{-1} Σ_{1,2} Σ_{2,2}^{-1} Σ_{2,1}. The top K eigenvectors of this matrix give the projection matrix for view I: W^1 = eigs(Σ_{11}^{-1} Σ_{12} Σ_{22}^{-1} Σ_{21}, K).
3. Calculate Σ_{2,2}^{-1} Σ_{2,1} Σ_{1,1}^{-1} Σ_{1,2}. The top K eigenvectors of this matrix give the projection matrix W^2 for view II.
4. Y^1 = (W^1)^T (X^1 − µ^1), and similarly Y^2 = (W^2)^T (X^2 − µ^2).
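A rough numpy sketch of these steps (my own illustration; the small ridge term reg is an assumption added so the block covariances are invertible):

```python
import numpy as np

def cca(X1, X2, K, reg=1e-8):
    """CCA following the steps above. X1 is n x d1, X2 is n x d2."""
    d1, d2 = X1.shape[1], X2.shape[1]
    S = np.cov(np.hstack([X1, X2]), rowvar=False)               # 1. joint covariance matrix
    S11 = S[:d1, :d1] + reg * np.eye(d1)
    S12, S21 = S[:d1, d1:], S[d1:, :d1]
    S22 = S[d1:, d1:] + reg * np.eye(d2)
    M1 = np.linalg.solve(S11, S12) @ np.linalg.solve(S22, S21)  # 2. S11^-1 S12 S22^-1 S21
    M2 = np.linalg.solve(S22, S21) @ np.linalg.solve(S11, S12)  # 3. S22^-1 S21 S11^-1 S12
    def top_k_eigvecs(M):
        evals, evecs = np.linalg.eig(M)                         # M is not symmetric in general
        return np.real(evecs[:, np.argsort(-np.real(evals))[:K]])
    W1, W2 = top_k_eigvecs(M1), top_k_eigvecs(M2)               # projection matrices, views I and II
    Y1 = (X1 - X1.mean(axis=0)) @ W1                            # 4. projections of each view
    Y2 = (X2 - X2.mean(axis=0)) @ W2
    return Y1, Y2, W1, W2
```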
When to use CCA?
- CCA applies to problems where the data can be split into two views, X = [X1, X2].
- CCA picks directions of projection (one per view) in which the data is maximally correlated.
- It maximizes the correlation coefficient and not just the covariance, so it is scale-free.
When to use CCA
Scenario 1: You have two feature extraction techniques.
- One provides excellent features for dogs vs. cats and noise on the other classes.
- The other provides excellent features for cars vs. bikes and noise on the other classes.
What do we do?
A. Use CCA to find one common representation
B. Concatenate the two extracted feature vectors
When to use CCA
Scenario 2: You have two cameras capturing images of the same objects from different angles, and a feature extraction technique that provides a feature vector from each camera. You want to extract good features for recognizing the object from the two cameras.
What do we do?
A. Use CCA to find one common representation
B. Concatenate the two extracted feature vectors
PICK A RANDOM W
Y = (1/√K) X W, where W is a d × K matrix whose entries are drawn uniformly at random from {+1, −1}.
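A minimal sketch of this projection in numpy (my own helper name and seed handling):

```python
import numpy as np

def random_projection(X, K, seed=0):
    """Project the n x d data X down to K dimensions with a random +/-1 matrix."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.choice([-1.0, 1.0], size=(d, K))     # entries drawn uniformly from {+1, -1}
    return X @ W / np.sqrt(K)                    # Y = (1 / sqrt(K)) X W
```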
When to use RP?
- When the data is huge and very high-dimensional.
- For PCA and CCA you typically think of K (the number of dimensions we reduce to) in the double digits; for RP think of K in the 3-4 digit range.
- RP guarantees preservation of inter-point distances.
- RP, unlike PCA and CCA, does not project using unit vectors. (What does this mean?)
Problem 1 of Homework 2
How do we choose K? For PCA? For CCA? For Random Projection?
KERNEL PCA
1. Compute the n × n kernel matrix K with K_{i,j} = k(x_i, x_j).
2. (A, Λ) = eigs(K, K): the top K eigenvectors a_1,...,a_K of K, with eigenvalues λ_1,...,λ_K.
3. P = [a_1/√λ_1, ..., a_K/√λ_K].
4. Y = K P.
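A rough numpy sketch of these four steps, using an RBF kernel as one possible choice of k (my own illustration; it follows the slide's steps and does not center the kernel matrix):

```python
import numpy as np

def kernel_pca(X, K, gamma=1.0):
    """Kernel PCA following steps 1-4 above, with an RBF kernel."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    Kmat = np.exp(-gamma * sq_dists)             # 1. n x n kernel matrix, K_ij = k(x_i, x_j)
    evals, evecs = np.linalg.eigh(Kmat)          # 2. eigendecomposition (eigenvalues ascending)
    idx = np.argsort(evals)[::-1][:K]            #    indices of the K largest eigenvalues
    A, lam = evecs[:, idx], evals[idx]
    P = A / np.sqrt(lam)                         # 3. columns a_k / sqrt(lambda_k)
    return Kmat @ P                              # 4. Y = K P, the n x K embedding
```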
When to use Kernel PCA
- When the data lies on some non-linear, low-dimensional surface.
- The kernel function matters. (E.g. with the RBF kernel, only points close to a given point have non-negligible kernel evaluations.)
CLUSTERING
(Figure: the n × d data matrix X is to be grouped into clusters.)
K-means
K-means algorithm (wishful thinking, EM):
- Fix the parameters (the k means) and compute new cluster assignments (or probabilities) for every point.
- Fix the cluster assignments of all data points and re-estimate the parameters (the k means).
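A minimal numpy sketch of this hard-assignment alternation (my own helper; initializing with K randomly chosen points is one common choice, not specified in the slide):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Hard-assignment K-means, alternating the two 'wishful thinking' steps above."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]    # initialize the k means
    for _ in range(n_iters):
        # fix the means, compute new cluster assignments for every point
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        assign = dists.argmin(axis=1)
        # fix the assignments, re-estimate the means
        new_centers = np.array([X[assign == k].mean(axis=0) if np.any(assign == k)
                                else centers[k] for k in range(K)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign, centers
```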
Single-Link Clustering
- Start with every point being its own cluster.
- Until we have K clusters, merge the closest two clusters.
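A naive numpy sketch of this merging procedure (my own quadratic-memory illustration, not an efficient implementation):

```python
import numpy as np

def single_link(X, K):
    """Merge the two closest clusters until only K remain (naive version)."""
    clusters = [[i] for i in range(len(X))]                       # every point is its own cluster
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)    # pairwise distances
    while len(clusters) > K:
        best, pair = np.inf, (0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = D[np.ix_(clusters[a], clusters[b])].min()     # single-link cluster distance
                if d < best:
                    best, pair = d, (a, b)
        a, b = pair
        clusters[a] += clusters[b]                                # merge the closest two clusters
        del clusters[b]
    return clusters
```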
When to Use Single Link
- When we have dense sampling of points within each cluster.
- When not to use: when we might have outliers.
When to use K-means
- When we have nice, spherical, roughly equal-size clusters, or the cluster masses are far apart.
- Handles outliers better than single-link.
Homework 3
Spectral Clustering
- You want to cluster the nodes of a graph into groups based on connectivity.
- Unnormalized spectral clustering: divide into groups so that as few edges as possible are cut between groups.
SPECTRAL CLUSTERING ALGORITHM (UNNORMALIZED)
1. Given the adjacency matrix A, calculate the diagonal matrix D with D_{i,i} = Σ_{j=1}^n A_{i,j}.
2. Calculate the Laplacian matrix L = D − A.
3. Find the eigenvectors v_1,...,v_n of L (in ascending order of eigenvalue).
4. Pick the K eigenvectors with the smallest eigenvalues to get y_1,...,y_n ∈ R^K.
5. Use the K-means clustering algorithm on y_1,...,y_n.
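A rough numpy sketch of steps 1-5 (my own illustration; it reuses the kmeans helper sketched in the K-means section above):

```python
import numpy as np

def spectral_clustering(A, K, seed=0):
    """Unnormalized spectral clustering of the graph with adjacency matrix A."""
    D = np.diag(A.sum(axis=1))            # 1. D_ii = sum_j A_ij
    L = D - A                             # 2. Laplacian matrix
    evals, evecs = np.linalg.eigh(L)      # 3. eigenvectors, eigenvalues in ascending order
    Y = evecs[:, :K]                      # 4. rows y_1,...,y_n in R^K
    assign, _ = kmeans(Y, K, seed=seed)   # 5. K-means on the embedding
    return assign
```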
Spectral Embedding
(Figure: the n × n adjacency matrix A is mapped to an n × K embedding Y; then use K-means on Y.)
Normalized Spectral Clustering
- The unnormalized spectral embedding encourages loner nodes to be pushed far away from the rest; cutting off loners is indeed the minimum-cut solution.
- Instead, form clusters that minimize the ratio of edges cut to the number of edges each cluster has (busy, well-connected groups tend to form clusters).
- Algorithm: replace the Laplacian matrix by the normalized one.
SPECTRAL CLUSTERING ALGORITHM (NORMALIZED)
1. Given the adjacency matrix A, calculate the diagonal matrix D with D_{i,i} = Σ_{j=1}^n A_{i,j}.
2. Calculate the normalized Laplacian matrix L = I − D^{−1/2} A D^{−1/2}.
3. Find the eigenvectors v_1,...,v_n of L (in ascending order of eigenvalue).
4. Pick the K eigenvectors with the smallest eigenvalues to get y_1,...,y_n ∈ R^K.
5. Use the K-means clustering algorithm on y_1,...,y_n.
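The only change from the unnormalized sketch above is step 2 (this assumes no isolated nodes, so that D is invertible):

```python
# step 2 with the normalized Laplacian: L = I - D^{-1/2} A D^{-1/2}
d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
L = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
```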
When to use Spectral Clustering
- It even works with weighted graphs, where the weight of an edge represents similarity.
- Use it when how clusters should be formed is decided solely by the similarity between points and there is no other underlying prior knowledge.
PROBABILISTIC MODELS
More generally, a probabilistic model consists of a set Θ of possible parameters.
- Each θ ∈ Θ induces a distribution P_θ over the data.
- The data is generated by one of the θ ∈ Θ.
- Learning: estimate the value of (or a distribution over) θ given the data.
EXAMPLES: Gaussian Mixture Models
Each θ is a model. For a Gaussian Mixture Model, each θ consists of a mixture distribution π = (π_1,...,π_K), means µ_1,...,µ_K ∈ R^d, and covariance matrices Σ_1,...,Σ_K.
At each time t, independently generate a new point as follows: c_t ∼ π, then x_t ∼ N(µ_{c_t}, Σ_{c_t}).
(Figure: a three-component example with π_1 = 0.5, π_2 = 0.25, π_3 = 0.25 and means µ_1, µ_2, µ_3.)
GMM: MAXIMUM LIKELIHOOD, THE PRINCIPLE OF WISHFUL THINKING
Maximum likelihood: pick θ that maximizes the probability of the observations,
θ_MLE = argmax_θ log P_θ(x_1,...,x_n)   (the likelihood).
1. Initialize model parameters π^(0), µ_1^(0),...,µ_K^(0) and Σ_1^(0),...,Σ_K^(0).
2. For i = 1 until convergence (or bored):
   a. Under the current model parameters θ^(i−1), compute the probability Q_t^(i)(k) of each point x_t belonging to cluster k.
   b. Given the probabilities of each point belonging to the various clusters, compute the optimal parameters θ^(i).
3. End For
MIXTURE MODELS
- π is a mixture distribution over the K types.
- θ_1,...,θ_K are the parameters of the K distributions.
Generative process: draw a type c_t ∼ π; next, given c_t, draw x_t ∼ Distribution(θ_{c_t}).
EM ALGORITHM FOR MIXTURE MODELS
For i = 1 until convergence:
(E step) For every t, define a distribution Q_t over the latent variable c_t as
  Q_t^(i)(c_t) ∝ PDF(x_t; θ_{c_t}^(i−1)) · π^(i−1)[c_t]
(M step) For every k ∈ {1,...,K},
  π^(i)[k] = (1/n) Σ_{t=1}^n Q_t^(i)[k]
  θ_k^(i) = argmax_θ Σ_{t=1}^n Q_t^(i)[k] log PDF(x_t; θ)
x_t: observation; c_t: latent variable.
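A rough numpy/scipy sketch of these E and M steps for the Gaussian case (my own illustration; the small ridge added to each covariance is an assumption for numerical stability):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, seed=0):
    """EM for a Gaussian mixture: the E and M steps above, with PDF = multivariate normal."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)                              # mixture weights
    mu = X[rng.choice(n, size=K, replace=False)]          # initial means: K random points
    Sigma = np.array([np.cov(X, rowvar=False) + 1e-6 * np.eye(d) for _ in range(K)])
    for _ in range(n_iters):
        # E step: Q_t(k) proportional to pi[k] * PDF(x_t; mu_k, Sigma_k)
        Q = np.stack([pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
                      for k in range(K)], axis=1)
        Q /= Q.sum(axis=1, keepdims=True)
        # M step: re-estimate pi, mu, Sigma from the soft assignments
        Nk = Q.sum(axis=0)
        pi = Nk / n
        mu = (Q.T @ X) / Nk[:, None]
        Sigma = np.array([(Q[:, k, None] * (X - mu[k])).T @ (X - mu[k]) / Nk[k]
                          + 1e-6 * np.eye(d) for k in range(K)])
    return pi, mu, Sigma, Q
```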