Unsupervised Learning


Unsupervised Learning: Learning without Class Labels (or correct outputs)
- Density Estimation: Learn P(X) given training data for X
- Clustering: Partition the data into clusters
- Dimensionality Reduction: Discover a low-dimensional representation of the data
- Blind Source Separation: Unmix multiple signals

Density Estimation
- Given: S = {x_1, x_2, ..., x_N}
- Find: P(x)
- Search problem: argmax_h P(S | h) = argmax_h Σ_i log P(x_i | h)
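As a concrete illustration (not from the slides), consider the hypothesis class of single one-dimensional Gaussians, where the maximizer of this log likelihood has a closed form (the sample mean and standard deviation). The sample below is made up for the example:

```python
import numpy as np

def log_likelihood(S, mu, sigma):
    """Sum of log N(x_i | mu, sigma^2) over the sample S."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (S - mu)**2 / (2 * sigma**2))

# Placeholder sample; in practice S is the given training data.
rng = np.random.default_rng(0)
S = rng.normal(loc=1.5, scale=0.7, size=500)

# For a single Gaussian, argmax_h sum_i log P(x_i | h) is the sample mean/std.
mu_hat, sigma_hat = S.mean(), S.std()
print(mu_hat, sigma_hat, log_likelihood(S, mu_hat, sigma_hat))
```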

Unsupervised Fitting of the Naïve Bayes Model
- Graphical model: y → x_1, x_2, x_3, ..., x_n
- y is discrete with K values
- P(x) = Σ_k P(y=k) Π_j P(x_j | y=k), a finite mixture model
- We can think of each y=k as a separate cluster of data points

The Expectation-Maximization Algorithm (1): Hard EM
- Learning would be easy if we knew y_i for each x_i
- Idea: guess them, then iteratively revise the guesses to maximize P(S | h)

  x_i:  x_1  x_2  ...  x_N
  y_i:  y_1  y_2  ...  y_N

Hard EM (2)
1. Guess initial y values to get complete data
2. M step: Compute probabilities for hypotheses (the model) from the complete data [maximum likelihood estimate of the model parameters]
3. E step: Classify each example using the current model to get a new y value [most likely class ŷ of each example]
4. Repeat steps 2-3 until convergence

Special Case: K-means Clustering
1. Assign an initial y_i to each data point x_i at random
2. M step: For each class k = 1, ..., K compute the mean: µ_k = (1/N_k) Σ_i x_i I[y_i = k]
3. E step: Assign each example x_i to the class k with the nearest mean: y_i = argmin_k ‖x_i − µ_k‖
4. Repeat steps 2 and 3 to convergence
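A minimal NumPy sketch of these four steps follows; the data matrix X, the number of clusters K, and the empty-cluster handling are illustrative choices, not part of the slides:

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Hard-EM k-means: alternate the M step (means) and E step (assignments)."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, K, size=len(X))            # step 1: random initial labels
    for _ in range(n_iter):
        # M step: mean of the points currently assigned to each class
        # (an empty class is reseeded with a random data point; illustrative choice)
        mu = np.array([X[y == k].mean(axis=0) if np.any(y == k)
                       else X[rng.integers(len(X))] for k in range(K)])
        # E step: reassign each point to the class with the nearest mean
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        y_new = dists.argmin(axis=1)
        if np.array_equal(y_new, y):               # converged
            break
        y = y_new
    return y, mu
```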

Gaussian Interpretation of K-means
- Each feature x_j in class k is Gaussian distributed with mean µ_kj and constant variance σ²:
  P(x_j | y=k) = (1/√(2πσ²)) exp(−‖x_j − µ_kj‖² / (2σ²))
  log P(x_j | y=k) = −(1/2) ‖x_j − µ_kj‖² / σ² + C
- Therefore
  argmax_y P(x | y) = argmax_y log P(x | y) = argmin_y Σ_j ‖x_j − µ_yj‖² = argmin_y ‖x − µ_y‖²
- This could easily be extended to a general covariance matrix Σ or a class-specific Σ_k

The EM Algorithm
The true EM algorithm augments the incomplete data with a probability distribution over the possible y values:
1. Start with an initial naive Bayes hypothesis
2. E step: For each example, compute P(y_i) and add it to the table
3. M step: Compute updated estimates of the parameters
4. Repeat steps 2-3 to convergence

  x_i:  x_1     x_2     ...  x_N
  y_i:  P(y_1)  P(y_2)  ...  P(y_N)

Details of the M Step
Each example x_i is treated as if y_i = k with probability P(y_i = k | x_i):
  P(y = k) := (1/N) Σ_{i=1}^N P(y_i = k | x_i)
  P(x_j = v | y = k) := [Σ_i P(y_i = k | x_i) I(x_ij = v)] / [Σ_{i=1}^N P(y_i = k | x_i)]
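For concreteness, here is a hedged sketch of this soft EM scheme for the two-component Gaussian mixture used in the next few slides (the initial means of -0.5 and +0.5 come from that example; the M step below uses weighted maximum-likelihood updates for Gaussian features rather than the discrete-feature counts shown above, and the rest of the setup is illustrative):

```python
import numpy as np
from scipy.stats import norm

def em_two_gaussians(x, n_iter=20):
    """Soft EM for a mixture of two 1-D Gaussians."""
    pi = np.array([0.5, 0.5])            # mixing weights P(y=k)
    mu = np.array([-0.5, 0.5])           # initial means, as in the example slides
    sigma = np.array([1.0, 1.0])
    for _ in range(n_iter):
        # E step: responsibilities P(y_i = k | x_i)
        dens = np.stack([pi[k] * norm.pdf(x, mu[k], sigma[k]) for k in range(2)], axis=1)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M step: weighted maximum-likelihood re-estimates of the parameters
        Nk = resp.sum(axis=0)
        pi = Nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / Nk
        sigma = np.sqrt((resp * (x[:, None] - mu)**2).sum(axis=0) / Nk)
    return pi, mu, sigma
```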

Example: Mixture of 2 Gaussians
Initial distributions: means at -0.5 and +0.5

Example: Mixture of 2 Gaussians Iteration 1

Example: Mixture of 2 Gaussians Iteration 2

Example: Mixture of 2 Gaussians Iteration 3

Example: Mixture of 2 Gaussians Iteration 10

Example: Mixture of 2 Gaussians Iteration 20

Evaluation: Test-Set Likelihood
Overfitting is also a problem in unsupervised learning.

Potential Problems
- If σ_k is allowed to vary, it may go to zero, which leads to infinite likelihood
- Fix: place an overfitting penalty on 1/σ

Choosing K: internal holdout likelihood

Unsupervised Learning for Sequences
Suppose each training example X_i is a sequence of objects: X_i = (x_i1, x_i2, ..., x_i,Ti). Fit an HMM by unsupervised learning:
1. Initialize the model parameters
2. E step: apply the forward-backward algorithm to estimate P(y_it | X_i) at each point t
3. M step: estimate the model parameters
4. Repeat steps 2-3 to convergence
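In practice one rarely codes this EM loop (Baum-Welch) by hand. Assuming the third-party hmmlearn package is available, a usage sketch might look as follows; the data, sequence lengths, and number of hidden states are placeholders:

```python
import numpy as np
from hmmlearn import hmm   # third-party package; assumed available

# Concatenate the training sequences X_i and record their lengths.
sequences = [np.random.randn(50, 1), np.random.randn(80, 1)]   # placeholder data
X = np.concatenate(sequences)
lengths = [len(s) for s in sequences]

# fit() runs EM: the E step uses the forward-backward algorithm to estimate the
# posterior over hidden states, and the M step re-estimates the parameters.
model = hmm.GaussianHMM(n_components=3, n_iter=50)
model.fit(X, lengths)
posteriors = model.predict_proba(X, lengths)   # P(y_it | X_i) at each time step
```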

Agglomerative Clustering
- Initialize each data point to be its own cluster
- Repeat: merge the two clusters that are most similar
- Build a dendrogram with height = distance between the most similar clusters
- Apply various intuitive methods to choose the number of clusters (equivalent to choosing where to slice the dendrogram)
Source: Charity Morgan, http://www.people.fas.harvard.edu/~rizem/teach/stat325/charitycluster.ppt

Agglomerative Clustering (continued)
- Each cluster is defined only by the points it contains (not by a parameterized model)
- Very fast (using a priority queue)
- No objective measure of correctness
- Distance measures: distance between the nearest pair of points; distance between cluster centers
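Assuming SciPy (not mentioned in the slides), a short sketch of this procedure with the two distance measures above, plus "slicing the dendrogram" to pick the clusters; the data and the cut height are illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.random.randn(30, 2)                 # placeholder data

# Repeatedly merge the two most similar clusters; 'single' uses the nearest
# pair of points, 'centroid' uses the distance between cluster centers.
Z = linkage(X, method='single')            # or method='centroid'

# "Slicing the dendrogram": keep the clusters present at a chosen distance.
labels = fcluster(Z, t=1.0, criterion='distance')
# dendrogram(Z) would draw the merge tree for visual inspection.
```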

Probabilistic Agglomerative Clustering = Bottom-Up Model Merging
- Each data point is an initial cluster, but with a penalized σ_k
- Repeat: merge the two clusters that would most increase the penalized log likelihood
- Until no merger would further improve the likelihood
- Note: without the penalty on σ_k, the algorithm would never merge anything

Dimensionality Reduction
- Often, raw data have very high dimension (example: images of human faces)
- Dimensionality reduction: construct a lower-dimensional space that preserves information important for the task
- Examples: preserve distances; preserve separation between classes; etc.

Principal Component Analysis
- Given:
  - Data: n-dimensional vectors {x_1, x_2, ..., x_N}
  - Desired dimensionality m
- Find an m x n orthogonal matrix A to minimize Σ_i ‖A⁻¹(Ax_i) − x_i‖²
- Explanation:
  - Ax_i maps x_i into an m-dimensional vector
  - A⁻¹(Ax_i) maps x_i back to n-dimensional space
  - We minimize the squared reconstruction error between the reconstructed vectors and the original vectors

Conceptual Algorithm Find a line such that when the data is projected onto that line, it has the maximum variance:

Conceptual Algorithm Find a new line, orthogonal to the first, that has maximum projected variance:

Repeat Until m Lines Have Been Found
The projected position of a point on these lines gives its coordinates in the m-dimensional reduced space.

A Better Method: Numerical
- Compute the covariance matrix Σ = Σ_i (x_i − µ)(x_i − µ)ᵀ
- Compute the singular value decomposition Σ = U D Vᵀ, where:
  - the columns of U are the eigenvectors of Σ
  - D is a diagonal matrix whose elements are the eigenvalues of Σ in descending order
  - projecting the centered data onto the columns of U gives the projected data points
- Replace all but the m largest elements of D by zeros (i.e., keep only the first m eigenvectors)
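A minimal NumPy sketch of this numerical recipe; the function name and return values are illustrative:

```python
import numpy as np

def pca(X, m):
    """PCA via SVD of the covariance matrix of the (centered) data."""
    mu = X.mean(axis=0)
    Xc = X - mu
    cov = Xc.T @ Xc / len(X)            # covariance matrix Sigma
    U, d, Vt = np.linalg.svd(cov)       # columns of U: eigenvectors; d: eigenvalues (descending)
    A = U[:, :m].T                      # m x n projection matrix (top m eigenvectors)
    Z = Xc @ A.T                        # projected (m-dimensional) points
    X_rec = Z @ A + mu                  # reconstruction back in n dimensions
    return A, Z, X_rec
```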

Example: Eigenfaces
Database of 128 carefully aligned faces. Here are the first 15 eigenvectors:

Face Classification in Eigenspace Is Easier
Nearest-mean classifier: ŷ = argmin_k ‖Ax − Aµ_k‖
Accuracy:
- variation in lighting: 96%
- variation in orientation: 85%
- variation in size: 64%

PCA Is a Useful Preprocessing Step
- Helps all LTU algorithms by making the features more independent
- Helps decision tree algorithms
- Helps nearest neighbor algorithms by discovering the distance metric
- Fails when the data consists of multiple separate clusters (mixtures of PCAs can be learned too)

Non-Linear Dimensionality Reduction: ISOMAP
Replace Euclidean distance by geodesic distance:
1. Construct a graph where each point is connected to its k nearest neighbors by an edge, AND any pair of points is connected if they are less than ε apart
2. Construct an N x N matrix D in which D[i,j] is the length of the shortest path in the graph connecting x_i to x_j
3. Apply SVD to D and keep the m most important dimensions
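Assuming scikit-learn, whose Isomap implementation follows essentially these steps (neighborhood graph, graph shortest-path distances, then an eigendecomposition of the resulting distance kernel rather than a plain SVD of D), a usage sketch with placeholder data:

```python
import numpy as np
from sklearn.manifold import Isomap

X = np.random.randn(200, 10)              # placeholder high-dimensional data

# k-nearest-neighbor graph -> geodesic (graph shortest-path) distances ->
# low-dimensional embedding that preserves those distances.
embedding = Isomap(n_neighbors=5, n_components=2)
Z = embedding.fit_transform(X)             # N x m reduced coordinates
```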

Two more ISOMAP examples

Linear Interpolation Between Points
The algorithm generates new poses by interpolating in ISOMAP space.

Blind Source Separation
Suppose we have two sound sources that have been linearly mixed and recorded by two microphones. Given the two microphone signals, we want to recover the two sound sources.
[Diagram: sources x_1 and x_2 are mixed with weights α, 1−α and β, 1−β to give microphone signals y_1 and y_2, which a "Magic Box" unmixes into estimates ŷ_1 and ŷ_2.]

Minimizing Mutual Information
- If the input sources are independent, then they should have zero mutual information.
- Idea: minimize the mutual information between the outputs while maximizing the information (entropy) of each output separately:
  max_W H(ŷ_1) + H(ŷ_2) − I(ŷ_1; ŷ_2)
  where [ŷ_1, ŷ_2] = F_W(x_1, x_2) and F_W is a sigmoid neural network
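As a practical aside, here is a sketch using scikit-learn's FastICA, a different ICA estimator from the Infomax/sigmoid-network formulation above but aimed at the same unmixing problem; the synthetic sources and mixing matrix are made up for illustration:

```python
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
sources = np.c_[np.sin(2 * t), np.sign(np.cos(3 * t))]   # two independent sources
A = np.array([[0.6, 0.4], [0.3, 0.7]])                   # unknown mixing matrix
mics = sources @ A.T                                      # two "microphone" signals

ica = FastICA(n_components=2, random_state=0)
recovered = ica.fit_transform(mics)   # source estimates (up to scale and ordering)
```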

Independent Component Analysis Microphone 1 Microphone 2 Reconstructed source 1 Reconstructed source 2 (ICA) source: http://www.cnl.salk.edu/~tewon/blind/blind_audio.html

Unsupervised Learning Summary
- Density Estimation: learn P(X) given training data for X (mixture models and EM)
- Clustering: partition the data into clusters (bottom-up agglomerative clustering)
- Dimensionality Reduction: discover a low-dimensional representation of the data (Principal Component Analysis, ISOMAP)
- Blind Source Separation: unmix multiple signals (many algorithms)

Objective Functions
- Density Estimation: log likelihood on training data
- Clustering: ????
- Dimensionality Reduction: minimum reconstruction error; maximum likelihood (Gaussian interpretation of PCA)
- Blind Source Separation: information maximization; maximum likelihood (assuming models of the sources)