Review Lecture 14

PRINCIPAL COMPONENT ANALYSIS
Eigenvectors of the covariance matrix are the principal components; the top K principal components are the eigenvectors with the K largest eigenvalues. Independently discovered by Pearson in 1901 and Hotelling in 1933.
1. Σ = cov(X)
2. W = eigs(Σ, K)
3. Y = W^T (X - µ)
(Projection = Data × top K eigenvectors; Reconstruction = Projection × transpose of the top K eigenvectors.)
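As a rough illustration of these three steps, here is a minimal numpy sketch (the helper name pca_project and the toy data are my own, not from the lecture): center the data, eigendecompose the covariance matrix, project onto the top K eigenvectors, and reconstruct.

import numpy as np

def pca_project(X, K):
    """Minimal PCA sketch: X is an (n, d) data matrix, K the target dimension."""
    mu = X.mean(axis=0)                        # center the data
    Sigma = np.cov(X - mu, rowvar=False)       # 1. covariance matrix (d x d)
    eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigh returns ascending eigenvalues for symmetric matrices
    W = eigvecs[:, ::-1][:, :K]                # 2. top-K eigenvectors (d x K)
    Y = (X - mu) @ W                           # 3. projection (n x K)
    X_rec = Y @ W.T + mu                       #    reconstruction from the projection
    return Y, X_rec, W

# Usage: project 5-dimensional points that lie close to a 2-dimensional plane.
X = np.random.randn(200, 2) @ np.random.randn(2, 5) + 0.01 * np.random.randn(200, 5)
Y, X_rec, W = pca_project(X, K=2)
print(Y.shape, np.mean((X - X_rec) ** 2))      # (200, 2) and a small reconstruction error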

When to use PCA: great when the data is truly low dimensional (lies on a hyperplane, i.e. a linear subspace), or approximately low dimensional (almost lies on a plane, e.g. a very flat ellipsoid). Examples: dimensionality reduction for face images, or as preprocessing for biometric applications.

When to use PCA. Warning: scale matters (if one feature is in centimeters and another is in meters, the PCA directions can change). Problem 1 of Homework 1.

Problem 1 of Homework 1. Part 1: draw points on a 45-degree line, e.g. all coordinates are identical and their common value is drawn from a uniform distribution. Part 2 (scale matters): say the first coordinate has a variance many orders of magnitude larger than the other coordinates, and all coordinates are drawn independently. A small sketch contrasting the two setups follows.
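To see the effect concretely, here is a small hedged sketch (my own code, not the homework's reference solution; the helper top_direction and the specific variance scale of 100 are assumptions for illustration):

import numpy as np

rng = np.random.default_rng(0)

def top_direction(X):
    """Top principal component of X (a unit vector), via the covariance eigendecomposition."""
    Sigma = np.cov(X - X.mean(axis=0), rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    return eigvecs[:, -1]                     # eigenvector with the largest eigenvalue

# Part 1: all coordinates identical -> points lie on the 45-degree line,
# and the top principal component is (1, 1)/sqrt(2) up to sign.
u = rng.uniform(size=(500, 1))
print("Part 1:", top_direction(np.hstack([u, u])))

# Part 2: coordinates drawn independently, but the first has a much larger
# variance, so the top principal component aligns with the first axis instead.
X2 = np.hstack([100.0 * rng.standard_normal((500, 1)), rng.standard_normal((500, 1))])
print("Part 2:", top_direction(X2))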

When to use PCA. Warning: the direction of importance need not always be the one with the largest spread. Problem 2 of Homework 1.

CCA ALGORITHM
1. Write x_t = [x_t^1; x_t^2], the (d_1 + d_2)-dimensional concatenation of the two views, so that X = [X^1, X^2].
2. Calculate the covariance matrix of the joint data points:
   Σ = cov(X) = [ Σ_{1,1}  Σ_{1,2} ; Σ_{2,1}  Σ_{2,2} ]
3. Calculate Σ_{1,1}^{-1} Σ_{1,2} Σ_{2,2}^{-1} Σ_{2,1}. The top K eigenvectors of this matrix give the projection matrix for view I:
   W^1 = eigs(Σ_{1,1}^{-1} Σ_{1,2} Σ_{2,2}^{-1} Σ_{2,1}, K).
   Similarly, the top K eigenvectors of Σ_{2,2}^{-1} Σ_{2,1} Σ_{1,1}^{-1} Σ_{1,2} give the projection matrix W^2 for view II.
4. Y^1 = (W^1)^T (X^1 - µ^1)   (and similarly Y^2 = (W^2)^T (X^2 - µ^2))
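A rough numpy sketch of these steps (my own illustrative code, not the course's implementation; the small ridge term reg that keeps the within-view covariances invertible is an added assumption the slide does not mention):

import numpy as np

def cca_project(X1, X2, K, reg=1e-6):
    """Minimal CCA sketch: X1 is (n, d1), X2 is (n, d2); returns K-dim projections of each view."""
    X1c = X1 - X1.mean(axis=0)
    X2c = X2 - X2.mean(axis=0)
    n = X1.shape[0]
    S11 = X1c.T @ X1c / n + reg * np.eye(X1.shape[1])              # within-view covariances (regularized)
    S22 = X2c.T @ X2c / n + reg * np.eye(X2.shape[1])
    S12 = X1c.T @ X2c / n                                          # cross-view covariance, S21 = S12.T
    M1 = np.linalg.solve(S11, S12) @ np.linalg.solve(S22, S12.T)   # S11^-1 S12 S22^-1 S21  (view I)
    M2 = np.linalg.solve(S22, S12.T) @ np.linalg.solve(S11, S12)   # S22^-1 S21 S11^-1 S12  (view II)

    def top_eigvecs(M, K):
        vals, vecs = np.linalg.eig(M)                              # M need not be symmetric
        return vecs[:, np.argsort(-vals.real)[:K]].real

    W1, W2 = top_eigvecs(M1, K), top_eigvecs(M2, K)
    return X1c @ W1, X2c @ W2

# Usage: two noisy views of the same 2-dimensional latent signal.
Z = np.random.randn(300, 2)
X1 = Z @ np.random.randn(2, 6) + 0.1 * np.random.randn(300, 6)
X2 = Z @ np.random.randn(2, 8) + 0.1 * np.random.randn(300, 8)
Y1, Y2 = cca_project(X1, X2, K=2)
print(abs(np.corrcoef(Y1[:, 0], Y2[:, 0])[0, 1]))                  # close to 1: the projected views are highly correlated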

When to use CCA? CCA applies to problems where the data can be split into two views, X = [X1, X2]. CCA picks directions of projection (one per view) in which the data is maximally correlated. It maximizes the correlation coefficient and not just the covariance, so it is scale free.

When to use CCA. Scenario 1: you have two feature extraction techniques. One provides excellent features for dogs vs. cats and noise on the other classes; the other provides excellent features for cars vs. bikes and noise on the other classes. What do we do? A. Use CCA to find one common representation. B. Concatenate the two extracted feature vectors.

When to use CCA. Scenario 2: you have two cameras capturing images of the same objects from different angles, and a feature extraction technique that provides feature vectors from each camera. You want to extract good features for recognizing the object from the two cameras. What do we do? A. Use CCA to find one common representation. B. Concatenate the features from the two cameras.

RANDOM PROJECTION: PICK A RANDOM W
Y = X W / √K, where W is a d × K matrix whose entries are each +1 or -1, chosen independently and uniformly at random.
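A minimal sketch of this projection in numpy (illustrative code, not from the lecture), followed by a quick check that inter-point distances are roughly preserved, which is the property the next slide appeals to:

import numpy as np

def random_project(X, K, seed=0):
    """Project (n, d) data to K dimensions with a random +/-1 matrix scaled by 1/sqrt(K)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.choice([-1.0, 1.0], size=(d, K))   # random sign matrix; its columns are NOT unit vectors
    return X @ W / np.sqrt(K)

# Quick check: pairwise distances before and after projecting 5000-dim data to 500 dims.
X = np.random.randn(30, 5000)
Y = random_project(X, K=500)
orig = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
proj = np.linalg.norm(Y[:, None] - Y[None, :], axis=-1)
ratio = proj[orig > 0] / orig[orig > 0]
print(ratio.min(), ratio.max())                # ratios close to 1: distances roughly preserved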

When to use RP? When the data is huge and very high dimensional. For PCA and CCA you typically think of K (the number of dimensions we reduce to) in the double digits; for RP, think of K typically in the 3-4 digit range. RP approximately preserves inter-point distances. Unlike PCA and CCA, RP does not project using unit vectors. (What does this mean?) Problem 1 of Homework 2.

How do we choose K? For PCA? For CCA? For Random Projection?

KERNEL PCA
1. Compute the n × n kernel matrix K with entries K_{s,t} = k(x_s, x_t).
2. A = eigs(K, K): the top K eigenvectors a^1, ..., a^K of the kernel matrix, with eigenvalues λ_1, ..., λ_K.
3. P = [ a^1/√λ_1, ..., a^K/√λ_K ]
4. Y = K P
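A rough numpy sketch of these four steps with an RBF kernel (my own illustrative code; centering the kernel matrix is a standard detail the slide does not spell out, so treat it as an added assumption):

import numpy as np

def rbf_kernel(X, gamma=1.0):
    """RBF kernel matrix: K[s, t] = exp(-gamma * ||x_s - x_t||^2)."""
    sq = np.sum(X ** 2, axis=1)
    return np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))

def kernel_pca(X, K, gamma=1.0):
    """Minimal kernel PCA sketch: returns the (n, K) embedding Y = K P."""
    Kmat = rbf_kernel(X, gamma)
    n = Kmat.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # center the kernel matrix (assumed step)
    Kc = J @ Kmat @ J
    eigvals, eigvecs = np.linalg.eigh(Kc)        # ascending eigenvalues
    a = eigvecs[:, ::-1][:, :K]                  # top-K eigenvectors a^1, ..., a^K
    lam = eigvals[::-1][:K]
    P = a / np.sqrt(np.maximum(lam, 1e-12))      # P = [a^k / sqrt(lambda_k)]
    return Kc @ P                                # Y = K P

Y = kernel_pca(np.random.randn(100, 5), K=2, gamma=0.5)
print(Y.shape)                                   # (100, 2)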

When to use Kernel PCA: when the data lies on some non-linear, low-dimensional manifold. The choice of kernel function matters (e.g. with an RBF kernel, only points close to a given point have non-negligible kernel values).

CLUSTERING (input: an n × d data matrix X)

K-means. The K-means algorithm (wishful thinking, EM-style): fix the parameters (the K means) and compute new cluster assignments (or probabilities) for every point; then fix the cluster assignments for all data points and re-estimate the parameters (the K means). Repeat until convergence.
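A compact sketch of that alternation in numpy (illustrative only; initializing the means with K randomly chosen points is my assumption, not something the slide specifies):

import numpy as np

def kmeans(X, K, iters=100, seed=0):
    """Minimal K-means sketch: alternate between assignments and mean updates."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=K, replace=False)]     # assumed init: K random data points
    for _ in range(iters):
        # Step 1: fix the means, assign every point to its closest mean.
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=-1)
        assign = dists.argmin(axis=1)
        # Step 2: fix the assignments, re-estimate each mean.
        new_means = np.array([X[assign == k].mean(axis=0) if np.any(assign == k) else means[k]
                              for k in range(K)])
        if np.allclose(new_means, means):                    # converged
            break
        means = new_means
    return assign, means

labels, centers = kmeans(np.random.randn(300, 2), K=3)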

Single-Link Clustering: start with every point as its own cluster; until we have K clusters, merge the two closest clusters.
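One way to run this in practice is scipy's agglomerative clustering with single linkage; a short sketch (the choice of library is mine, not the lecture's):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.randn(200, 2)

# Single linkage: the distance between two clusters is the distance between
# their two closest points; the closest pair of clusters is merged repeatedly.
Z = linkage(X, method="single")

# Cut the merge tree so that exactly K clusters remain.
K = 3
labels = fcluster(Z, t=K, criterion="maxclust")
print(labels[:10])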

When to use single link: when we have dense sampling of points within each cluster. When not to use it: when we might have outliers.

When to use K-means: when we have nice, spherical, roughly equal-size clusters, or when the cluster masses are far apart. It handles outliers better than single link.

Homework 3

Spectral Clustering: you want to cluster the nodes of a graph into groups based on connectivity. Unnormalized spectral clustering: divide the nodes into groups so that as few edges as possible are cut between groups.

SPECTRAL CLUSTERING ALGORITHM (UNNORMALIZED)
1. Given the adjacency matrix A, calculate the diagonal degree matrix D with D_{i,i} = Σ_{j=1}^{n} A_{i,j}.
2. Calculate the Laplacian matrix L = D - A.
3. Find the eigenvectors v_1, ..., v_n of L (in ascending order of eigenvalues).
4. Pick the K eigenvectors with the smallest eigenvalues to get y_1, ..., y_n ∈ R^K.
5. Use the K-means clustering algorithm on y_1, ..., y_n.

Spectral Embedding: the n × n adjacency matrix A is mapped to an n × K embedding matrix Y; use K-means on Y.

Normalized Spectral Clustering. The unnormalized spectral embedding encourages loner nodes to be pushed far away from the rest; cutting off the loners is indeed the minimum-cut solution. Instead, form clusters that minimize the ratio of edges cut to the number of edges each cluster has (busy groups tend to form clusters). Algorithm: replace the Laplacian matrix by the normalized one.

SPECTRAL CLUSTERING ALGORITHM (NORMALIZED)
1. Given the adjacency matrix A, calculate the diagonal degree matrix D with D_{i,i} = Σ_{j=1}^{n} A_{i,j}.
2. Calculate the normalized Laplacian matrix L = I - D^{-1/2} A D^{-1/2}.
3. Find the eigenvectors v_1, ..., v_n of L (in ascending order of eigenvalues).
4. Pick the K eigenvectors with the smallest eigenvalues to get y_1, ..., y_n ∈ R^K.
5. Use the K-means clustering algorithm on y_1, ..., y_n.
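A small numpy/scipy sketch covering both the unnormalized and the normalized variant (my own illustrative code; the toy two-block graph in the usage is invented):

import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_clustering(A, K, normalized=True):
    """Minimal spectral clustering sketch on an adjacency (or similarity) matrix A."""
    deg = A.sum(axis=1)
    if normalized:
        Dinv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
        L = np.eye(len(A)) - Dinv_sqrt @ A @ Dinv_sqrt    # L = I - D^(-1/2) A D^(-1/2)
    else:
        L = np.diag(deg) - A                              # L = D - A
    eigvals, eigvecs = np.linalg.eigh(L)                  # ascending eigenvalues
    Y = eigvecs[:, :K]                                    # embedding: K eigenvectors with smallest eigenvalues
    _, labels = kmeans2(Y, K, minit="points")             # K-means on the rows of Y
    return labels

# Usage: a graph with two blocks that are dense inside and sparse across.
rng = np.random.default_rng(0)
A = (rng.uniform(size=(40, 40)) < 0.05).astype(float)
A[:20, :20] = rng.uniform(size=(20, 20)) < 0.6
A[20:, 20:] = rng.uniform(size=(20, 20)) < 0.6
A = np.triu(A, 1); A = A + A.T                            # symmetric adjacency, zero diagonal
print(spectral_clustering(A, K=2))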

When to use Spectral Clustering. First, it even works with weighted graphs, where the weight of an edge represents similarity. Use it when how clusters should be formed is decided solely by the similarity between points, and there is no other underlying prior knowledge.

PROBABILISTIC MODELS. More generally: Θ consists of the set of possible parameters. We have a distribution P_θ over the data induced by each θ ∈ Θ. The data is generated by one of the θ ∈ Θ. Learning: estimate the value of (or a distribution over) θ given the data.

GAUSSIAN MIXTURE MODELS (EXAMPLE). Each θ ∈ Θ is a model. A Gaussian mixture model θ consists of a mixture distribution π = (π_1, ..., π_K), means µ_1, ..., µ_K ∈ R^d, and covariance matrices Σ_1, ..., Σ_K. For each t independently, we generate a new point as follows: draw c_t ~ π, then x_t ~ N(µ_{c_t}, Σ_{c_t}). (Figure: three Gaussian components with mixture weights π_1 = 0.5, π_2 = 0.25, π_3 = 0.25 and means µ_1, µ_2, µ_3.)
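To make the generative process concrete, here is a tiny sampling sketch in numpy (my own illustration: it uses the figure's weights 0.5/0.25/0.25, but the means and identity covariances are made up):

import numpy as np

rng = np.random.default_rng(0)

pi = np.array([0.5, 0.25, 0.25])                          # mixture weights from the figure
mus = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])      # made-up means mu_1, mu_2, mu_3
Sigmas = np.stack([np.eye(2)] * 3)                        # identity covariances for simplicity

def sample_gmm(n):
    """Draw n points: c_t ~ pi, then x_t ~ N(mu_{c_t}, Sigma_{c_t})."""
    c = rng.choice(len(pi), size=n, p=pi)                 # latent component for each point
    x = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in c])
    return c, x

c, x = sample_gmm(1000)
print(np.bincount(c) / len(c))                            # empirical weights, roughly (0.5, 0.25, 0.25)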

GMM: MAXIMUM LIKELIHOOD AND THE POWER OF WISHFUL THINKING
Maximum likelihood principle: pick the θ that maximizes the probability of the observations,
θ_MLE = argmax_θ log P_θ(x_1, ..., x_n)   (the log-likelihood).
1. Initialize the model parameters π^(0), µ_1^(0), ..., µ_K^(0) and Σ_1^(0), ..., Σ_K^(0).
2. For i = 1 until convergence (or bored):
   a. Under the current model parameters θ^(i-1), compute the probability Q_t^(i)(k) of each point x_t belonging to cluster k.
   b. Given the probabilities of each point belonging to the various clusters, compute the optimal parameters θ^(i).
3. End For.

MIXTURE MODELS. π is a mixture distribution over the K types; θ_1, ..., θ_K are the parameters of the K component distributions. Generative process: draw a type c_t ~ π; next, given c_t, draw x_t ~ Distribution(θ_{c_t}).

EM ALGORITHM FOR MIXTURE MODELS
For i = 1 until convergence:
(E step) For every t, define a distribution Q_t^(i) over the latent variable c_t:
   Q_t^(i)(c_t) ∝ PDF(x_t; θ_{c_t}^(i-1)) · π^(i-1)[c_t]
(M step) For every k ∈ {1, ..., K}:
   π^(i)[k] = (1/n) Σ_{t=1}^{n} Q_t^(i)[k]
   θ_k^(i) = argmax_θ Σ_{t=1}^{n} Q_t^(i)[k] log PDF(x_t; θ)
Here x_t is the observation and c_t the latent variable.
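Finally, a compact sketch of these E and M steps specialized to a Gaussian mixture (my own illustrative code; the random initialization and the small covariance regularizer are added assumptions, not part of the slide):

import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Density of N(mu, Sigma) evaluated at each row of X."""
    d = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1)) / norm

def em_gmm(X, K, iters=50, seed=0):
    """Minimal EM sketch for a Gaussian mixture model on (n, d) data X."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(K, 1.0 / K)
    mus = X[rng.choice(n, size=K, replace=False)]          # assumed init: K random data points
    Sigmas = np.stack([np.cov(X, rowvar=False) + 1e-6 * np.eye(d)] * K)
    for _ in range(iters):
        # E step: Q[t, k] proportional to PDF(x_t; theta_k) * pi[k], normalized over k.
        Q = np.stack([pi[k] * gaussian_pdf(X, mus[k], Sigmas[k]) for k in range(K)], axis=1)
        Q /= Q.sum(axis=1, keepdims=True)
        # M step: for Gaussians the argmax has a closed form (weighted means and covariances).
        Nk = Q.sum(axis=0)
        pi = Nk / n
        mus = (Q.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (Q[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return pi, mus, Sigmas, Q

pi, mus, Sigmas, Q = em_gmm(np.random.randn(400, 2), K=2)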