Unsupervised Learning
Learning without class labels (or correct outputs):
- Density Estimation: learn P(X) given training data for X
- Clustering: partition the data into clusters
- Dimensionality Reduction: discover a low-dimensional representation of the data
- Blind Source Separation: unmix multiple signals
Density Estimation
Given: S = {x_1, x_2, ..., x_N}
Find: P(x)
Search problem: $\arg\max_h P(S \mid h) = \arg\max_h \sum_i \log P(x_i \mid h)$
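To make the search problem concrete, here is a minimal sketch (not from the slides) that fits a single 1-D Gaussian to data by maximum likelihood and evaluates the objective Σ_i log P(x_i | h); the hypothesis class, the helper name log_likelihood, and the toy data are illustrative assumptions.

```python
import numpy as np

# Density estimation as maximizing sum_i log P(x_i | h).
# Here the hypothesis h is a single 1-D Gaussian (mean, variance);
# the maximum-likelihood fit is just the sample mean and variance.
rng = np.random.default_rng(0)
S = rng.normal(loc=2.0, scale=1.5, size=1000)   # training data for X

mu_hat = S.mean()        # ML estimate of the mean
var_hat = S.var()        # ML estimate of the variance

def log_likelihood(x, mu, var):
    """sum_i log P(x_i | mu, var) under a Gaussian density."""
    return np.sum(-0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var))

print(mu_hat, var_hat, log_likelihood(S, mu_hat, var_hat))
```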
Unsupervised Fitting of the Naïve Bayes Model
[Graphical model: class variable y with children x_1, x_2, x_3, ..., x_n]
y is discrete with K values
$P(x) = \sum_k P(y = k) \prod_j P(x_j \mid y = k)$ (a finite mixture model)
We can think of each y = k as a separate cluster of data points
The Expectation-Maximization Algorithm (1): Hard EM
Learning would be easy if we knew y_i for each x_i
Idea: guess them and then iteratively revise our guesses to maximize P(S | h)
[Table: examples x_1, ..., x_N paired with guessed labels y_1, ..., y_N]
Hard EM (2)
1. Guess initial y values to get complete data
2. M step: compute probabilities for the hypothesis (model) from the complete data [maximum-likelihood estimate of the model parameters]
3. E step: classify each example using the current model to get a new y value [most likely class ŷ of each example]
4. Repeat steps 2-3 until convergence
[Table: examples x_1, ..., x_N with current guessed labels y_1, ..., y_N]
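As a rough illustration of these four steps (a sketch, not the course's code), the following assumes binary features and a Bernoulli naive Bayes model; the function name and the Laplace-smoothing count alpha are my own choices, added so that no probability becomes exactly 0 or 1.

```python
import numpy as np

def hard_em_naive_bayes(X, K, n_iters=50, alpha=1.0, seed=0):
    """Hard EM for a naive Bayes mixture with binary features X (N x d).
    alpha is a Laplace smoothing count (an added assumption, not in the slide)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    y = rng.integers(K, size=N)                      # 1. random initial labels
    for _ in range(n_iters):
        # 2. M step: (smoothed) maximum-likelihood estimates from the "complete" data
        prior = np.array([(y == k).sum() + alpha for k in range(K)], dtype=float)
        prior /= prior.sum()
        theta = np.vstack([(X[y == k].sum(axis=0) + alpha) /
                           ((y == k).sum() + 2 * alpha) for k in range(K)])
        # 3. E step: assign each example to its most likely class under the model
        log_post = (np.log(prior)[None, :]
                    + X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T)
        y_new = log_post.argmax(axis=1)
        if np.array_equal(y_new, y):                 # 4. stop at convergence
            break
        y = y_new
    return y, prior, theta
```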
Special Case: k-Means Clustering
1. Assign an initial y_i to each data point x_i at random
2. M step: for each class k = 1, ..., K compute the mean: $\mu_k = \frac{1}{N_k}\sum_i x_i\, I[y_i = k]$
3. E step: assign each example x_i to the class k with the nearest mean: $y_i = \arg\min_k \|x_i - \mu_k\|$
4. Repeat steps 2 and 3 to convergence
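A minimal NumPy sketch of this loop (illustrative, not the lecture's implementation); re-seeding an empty cluster with a random point is an added heuristic, not part of the slide.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Plain k-means: alternate the M step (means) and the E step (nearest mean)."""
    rng = np.random.default_rng(seed)
    y = rng.integers(K, size=len(X))                 # 1. random initial assignment
    for _ in range(n_iters):
        # 2. M step: mean of the points currently assigned to each class
        #    (an empty class is re-seeded with a random data point)
        mu = np.vstack([X[y == k].mean(axis=0) if np.any(y == k)
                        else X[rng.integers(len(X))] for k in range(K)])
        # 3. E step: reassign each point to the class with the nearest mean
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        y_new = dists.argmin(axis=1)
        if np.array_equal(y_new, y):                 # converged
            break
        y = y_new
    return y, mu
```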
Gaussian Interpretation of k-Means
Each feature x_j in class k is Gaussian-distributed with mean µ_kj and constant variance σ²:
$$P(x_j \mid y = k) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_j - \mu_{kj})^2}{2\sigma^2}\right)$$
$$\log P(x_j \mid y = k) = -\frac{(x_j - \mu_{kj})^2}{2\sigma^2} + C$$
$$\arg\max_y P(x \mid y) = \arg\max_y \log P(x \mid y) = \arg\min_y \sum_j (x_j - \mu_{yj})^2 = \arg\min_y \|x - \mu_y\|$$
This could easily be extended to a general covariance matrix Σ or a class-specific Σ_k
The EM Algorithm
The true EM algorithm augments the incomplete data with a probability distribution over the possible y values
1. Start with an initial naive Bayes hypothesis
2. E step: for each example, compute P(y_i | x_i) and add it to the table
3. M step: compute updated estimates of the parameters
4. Repeat steps 2-3 to convergence
[Table: examples x_1, ..., x_N paired with distributions P(y_1), ..., P(y_N)]
Details of the M Step
Each example x_i is treated as if y_i = k with probability P(y_i = k | x_i):
$$P(y = k) := \frac{1}{N} \sum_{i=1}^{N} P(y_i = k \mid x_i)$$
$$P(x_j = v \mid y = k) := \frac{\sum_{i=1}^{N} P(y_i = k \mid x_i)\, I(x_{ij} = v)}{\sum_{i=1}^{N} P(y_i = k \mid x_i)}$$
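A small sketch of these two updates for binary features, assuming the responsibilities P(y_i = k | x_i) have already been collected into an N x K matrix R by the E step; the tiny smoothing constant alpha is an assumption added to avoid zero probabilities.

```python
import numpy as np

def m_step(X, R, alpha=1e-6):
    """Soft-count M step for naive Bayes with binary features.
    X: N x d binary data; R: N x K responsibilities, R[i, k] = P(y_i = k | x_i).
    Returns class priors P(y = k) and per-class feature probabilities P(x_j = 1 | y = k)."""
    Nk = R.sum(axis=0)                  # expected number of examples in each class
    prior = Nk / Nk.sum()               # P(y = k) = (1/N) * sum_i P(y_i = k | x_i)
    theta = (R.T @ X + alpha) / (Nk[:, None] + 2 * alpha)   # P(x_j = 1 | y = k)
    return prior, theta
```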
Example: Mixture of 2 Gaussians
Initial distributions: means at -0.5 and +0.5
Example: Mixture of 2 Gaussians Iteration 1
Example: Mixture of 2 Gaussians Iteration 2
Example: Mixture of 2 Gaussians Iteration 3
Example: Mixture of 2 Gaussians Iteration 10
Example: Mixture of 2 Gaussians Iteration 20
Evaluation: Test-Set Likelihood
Overfitting is also a problem in unsupervised learning
Potential Problems
If σ_k is allowed to vary, it may go to zero, which leads to infinite likelihood
Fix by placing an overfitting penalty on 1/σ
Choosing K
Internal holdout likelihood: fit models for several values of K and pick the K that gives the highest likelihood on held-out data
Unsupervised Learning for Sequences
Suppose each training example X_i is a sequence of objects: X_i = (x_i1, x_i2, ..., x_i,T_i)
Fit an HMM by unsupervised learning:
1. Initialize model parameters
2. E step: apply the forward-backward algorithm to estimate P(y_it | X_i) at each point t
3. M step: estimate model parameters
4. Repeat steps 2-3 to convergence
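For illustration only, this sketch uses the third-party hmmlearn package (not mentioned in the slides), whose fit method runs exactly this Baum-Welch loop: a forward-backward E step followed by an M step. The toy sequences and the choice of a Gaussian-emission HMM are assumptions.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM   # outside library, not part of the lecture

# Two training sequences of 1-D observations, concatenated with a lengths array
# (hmmlearn's convention for passing multiple sequences).
seq1 = np.random.randn(100, 1)
seq2 = np.random.randn(80, 1) + 3.0
X = np.vstack([seq1, seq2])
lengths = [len(seq1), len(seq2)]

# fit() repeats the E step (forward-backward) and M step until convergence.
model = GaussianHMM(n_components=3, n_iter=20).fit(X, lengths)
posteriors = model.predict_proba(X, lengths)   # P(y_it | X_i) at each time step
```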
Agglomerative Clustering
- Initialize each data point to be its own cluster
- Repeat: merge the two clusters that are most similar
- Build a dendrogram with height = distance between the most similar clusters
- Apply various intuitive methods to choose the number of clusters (equivalent to choosing where to slice the dendrogram)
Source: Charity Morgan http://www.people.fas.harvard.edu/~rizem/teach/stat325/charitycluster.ppt
Agglomerative Clustering
- Each cluster is defined only by the points it contains (not by a parameterized model)
- Very fast (using a priority queue)
- No objective measure of correctness
- Distance measures: distance between the nearest pair of points (single linkage); distance between cluster centers (centroid linkage)
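A short sketch using SciPy's hierarchical-clustering routines (an outside library, not the lecture's code); the toy data and the choice of slicing into 2 clusters are assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])

# Bottom-up merging; method='single' merges by the nearest pair of points,
# method='centroid' would merge by distance between cluster centers.
Z = linkage(X, method='single')                    # the dendrogram is encoded in Z
labels = fcluster(Z, t=2, criterion='maxclust')    # "slice" the dendrogram into 2 clusters
```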
Probabilistic Agglomerative Clustering = Bottom-Up Model Merging
- Each data point is an initial cluster, but with a penalized σ_k
- Repeat: merge the two clusters that would most increase the penalized log likelihood
- Until no merger would further improve the likelihood
Note that without the penalty on σ_k, the algorithm would never merge anything
Dimensionality Reduction
Often, raw data have very high dimension (example: images of human faces)
Dimensionality reduction: construct a lower-dimensional space that preserves information important for the task
Examples: preserve distances, preserve separation between classes, etc.
Principal Component Analysis
Given:
- Data: n-dimensional vectors {x_1, x_2, ..., x_N}
- Desired dimensionality m
Find an m x n orthogonal matrix A to minimize $\sum_i \|A^{-1}(A x_i) - x_i\|^2$
Explanation:
- A x_i maps x_i into an m-dimensional vector
- A^{-1}(A x_i) maps it back into the n-dimensional space
- We minimize the squared reconstruction error between the reconstructed vectors and the original vectors
Conceptual Algorithm Find a line such that when the data is projected onto that line, it has the maximum variance:
Conceptual Algorithm Find a new line, orthogonal to the first, that has maximum projected variance:
Repeat Until m Lines Have Been Found
The projected position of a point on these lines gives its coordinates in the m-dimensional reduced space
A Better Method: Numerical
- Compute the covariance matrix $\Sigma = \sum_i (x_i - \mu)(x_i - \mu)^T$
- Compute the singular value decomposition $\Sigma = U D V^T$, where the columns of U are the eigenvectors of Σ and D is a diagonal matrix whose elements are the eigenvalues of Σ in descending order (Σ is symmetric positive semi-definite, so its singular values equal its eigenvalues)
- Replace all but the m largest elements of D by zeros; the corresponding columns of U span the reduced space, and projecting the centered data onto them gives the reduced coordinates
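A compact NumPy sketch of this procedure, using the eigendecomposition of the covariance matrix (equivalent to its SVD, since Σ is symmetric); the function name pca and the toy data are assumptions.

```python
import numpy as np

def pca(X, m):
    """PCA by eigendecomposition of the covariance matrix.
    Returns the m x n projection matrix A, the data mean, and the projected data."""
    mu = X.mean(axis=0)
    Xc = X - mu
    Sigma = Xc.T @ Xc                       # covariance matrix (up to a 1/N factor)
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    order = np.argsort(eigvals)[::-1]       # eigenvalues in descending order
    A = eigvecs[:, order[:m]].T             # top-m eigenvectors as rows (m x n)
    Z = Xc @ A.T                            # projected (reduced) coordinates
    return A, mu, Z

# The reconstruction error sum_i ||A^T(A x_i) - x_i||^2 (on centered data)
# is minimized over all rank-m orthogonal projections.
X = np.random.randn(200, 5) @ np.random.randn(5, 5)
A, mu, Z = pca(X, m=2)
X_rec = Z @ A + mu
err = np.sum((X_rec - X) ** 2)
```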
Example: Eigenfaces
Database of 128 carefully-aligned faces
Here are the first 15 eigenvectors:
Face Classification in Eigenspace Is Easier
Nearest-mean classifier: $\hat{y} = \arg\min_k \|A x - A \mu_k\|$
Accuracy:
- variation in lighting: 96%
- variation in orientation: 85%
- variation in size: 64%
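A possible sketch of this nearest-mean rule in the reduced space, reusing the A and mu produced by the pca() sketch above; the function name and argument layout are assumptions.

```python
import numpy as np

def nearest_mean_in_eigenspace(X_train, y_train, X_test, A, mu):
    """Nearest-mean classification in the PCA (eigenface) space:
    y_hat = argmin_k || A x - A mu_k ||, with A and mu from the pca() sketch above."""
    Z_train = (X_train - mu) @ A.T          # project training faces
    Z_test = (X_test - mu) @ A.T            # project test faces
    classes = np.unique(y_train)
    means = np.vstack([Z_train[y_train == k].mean(axis=0) for k in classes])
    dists = np.linalg.norm(Z_test[:, None, :] - means[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]
```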
PCA Is a Useful Preprocessing Step
- Helps all LTU algorithms by making the features more independent
- Helps decision tree algorithms
- Helps nearest neighbor algorithms by discovering the distance metric
- Fails when the data consist of multiple separate clusters (mixtures of PCAs can be learned too)
Non-Linear Dimensionality Reduction: ISOMAP
Replace Euclidean distance by geodesic distance:
- Construct a graph where each point is connected to its k nearest neighbors by an edge, and any pair of points are connected if they are less than ε apart
- Construct an N x N matrix D in which D[i,j] is the length of the shortest path in the graph connecting x_i to x_j
- Apply SVD to D and keep the m most important dimensions
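For a hands-on version, scikit-learn's Isomap (an outside library, not part of the lecture) builds the k-nearest-neighbor graph, computes the shortest-path geodesic distances, and extracts an m-dimensional embedding; the toy curve below is an assumption.

```python
import numpy as np
from sklearn.manifold import Isomap   # third-party ISOMAP implementation

# A 1-D curve embedded in 3-D: Euclidean distance is misleading,
# geodesic (along-the-curve) distance is not.
t = np.linspace(0, 4 * np.pi, 500)
X = np.column_stack([t * np.cos(t), t * np.sin(t), np.random.randn(500)])

# k-nearest-neighbor graph, shortest-path distances, then m embedding dimensions.
Z = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
```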
Two more ISOMAP examples
Linear Interpolation Between Points
The algorithm generates new poses in ISOMAP space, as well as new "2"s
Blind Source Separation
Suppose we have two sound sources that have been linearly mixed and recorded by two microphones. Given the two microphone signals, we want to recover the two sound sources.
[Diagram: sources y_1, y_2 are mixed with weights α, 1-α and β, 1-β to give the microphone signals x_1, x_2; a "Magic Box" produces the recovered sources ŷ_1, ŷ_2]
Minimizing Mutual Information
If the input sources are independent, then they should have zero mutual information.
Idea: minimize the mutual information between the outputs while maximizing the information (entropy) of each output separately:
$$\max_W\; H(\hat{y}_1) + H(\hat{y}_2) - I(\hat{y}_1; \hat{y}_2)$$
where $(\hat{y}_1, \hat{y}_2) = F_W(x_1, x_2)$ and F_W is a sigmoid neural network
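As a runnable illustration (with the caveat that it uses FastICA, a different ICA algorithm from the Infomax network described above, solving the same unmixing problem), scikit-learn can unmix two synthetic sources; the sources, mixing matrix, and parameter choices are assumptions.

```python
import numpy as np
from sklearn.decomposition import FastICA   # not the Infomax algorithm from the slide,
                                            # but another ICA method for the same task

t = np.linspace(0, 8, 2000)
s1, s2 = np.sin(3 * t), np.sign(np.sin(5 * t))     # two independent sources
S = np.column_stack([s1, s2])
A = np.array([[0.7, 0.3], [0.2, 0.8]])              # unknown mixing matrix
X = S @ A.T                                         # two "microphone" signals

S_hat = FastICA(n_components=2, random_state=0).fit_transform(X)   # recovered sources
# Recovery is only up to permutation and scaling of the sources.
```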
Independent Component Analysis (ICA)
[Audio demo: the two microphone recordings and the two reconstructed sources]
Source: http://www.cnl.salk.edu/~tewon/blind/blind_audio.html
Unsupervised Learning Summary
- Density estimation: learn P(X) given training data for X (mixture models and EM)
- Clustering: partition data into clusters (bottom-up agglomerative clustering)
- Dimensionality reduction: discover a low-dimensional representation of the data (Principal Component Analysis, ISOMAP)
- Blind source separation: unmix multiple signals (many algorithms)
Objective Functions
- Density estimation: log likelihood on the training data
- Clustering: ????
- Dimensionality reduction: minimum reconstruction error; maximum likelihood (Gaussian interpretation of PCA)
- Blind source separation: information maximization; maximum likelihood (assuming models of the sources)