Clustering K-means Machine Learning CSEP546 Carlos Guestrin University of Washington February 18, 2014 Carlos Guestrin 2005-2014 1

Clustering images Set of Images [Goldberger et al.] Carlos Guestrin 2005-2014 2

Clustering web search results Carlos Guestrin 2005-2014 3

Example (Taken from Kevin Murphy's ML textbook) Data: gene expression levels Goal: cluster genes with similar expression trajectories Carlos Guestrin 2005-2014 4

Some Data Carlos Guestrin 2005-2014 5

K-means 1. Ask user how many clusters they'd like. (e.g., k=5) Carlos Guestrin 2005-2014 6

K-means 1. Ask user how many clusters they'd like. (e.g., k=5) 2. Randomly guess k cluster Center locations Carlos Guestrin 2005-2014 7

K-means 1. Ask user how many clusters they'd like. (e.g., k=5) 2. Randomly guess k cluster Center locations 3. Each datapoint finds out which Center it's closest to. (Thus each Center owns a set of datapoints) Carlos Guestrin 2005-2014 8

K-means 1. Ask user how many clusters they'd like. (e.g., k=5) 2. Randomly guess k cluster Center locations 3. Each datapoint finds out which Center it's closest to. 4. Each Center finds the centroid of the points it owns Carlos Guestrin 2005-2014 9

K-means 1. Ask user how many clusters they'd like. (e.g., k=5) 2. Randomly guess k cluster Center locations 3. Each datapoint finds out which Center it's closest to. 4. Each Center finds the centroid of the points it owns 5. and jumps there 6. Repeat until terminated! Carlos Guestrin 2005-2014 10

K-means Randomly initialize k centers µ^(0) = µ_1^(0), …, µ_k^(0) Classify: Assign each point j ∈ {1, …, N} to the nearest center: z_j ← argmin_i ||µ_i − x_j||² Recenter: µ_i becomes the centroid of its points: µ_i ← argmin_µ ∑_{j : z_j = i} ||x_j − µ||² Equivalent to µ_i ← average of its points! Carlos Guestrin 2005-2014 11
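
Not part of the slides: a minimal NumPy sketch of this Classify/Recenter loop, i.e., K-means as alternating minimization (the function name kmeans and the synthetic data are illustrative choices, not the lecture's code):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-means: alternate the Classify and Recenter steps."""
    rng = np.random.default_rng(seed)
    # Randomly initialize k centers by picking k data points
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Classify: assign each point to its nearest center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)   # shape (N, k)
        assign = dists.argmin(axis=1)
        # Recenter: each center moves to the centroid of the points it owns
        new_centers = np.array([
            X[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):   # no center moved: converged
            break
        centers = new_centers
    return centers, assign

# Example usage on synthetic 2-D data with two well-separated clusters
X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + [5.0, 5.0]])
centers, assign = kmeans(X, k=2)
```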

What is K-means optimizing? Potential function F(µ,C) of centers µ and point allocations C: F(µ,C) = ∑_{j=1}^N ||µ_{C(j)} − x_j||² Optimal K-means: min_µ min_C F(µ,C) Carlos Guestrin 2005-2014 12

Does K-means converge? Part 1 Optimize potential function: Fix µ, optimize C: assigning each point to its closest center can only decrease (or leave unchanged) F. Carlos Guestrin 2005-2014 13

Does K-means converge? Part 2 Optimize potential function: Fix C, optimize µ: the centroid of each cluster minimizes the sum of squared distances to its points, so F again cannot increase. Carlos Guestrin 2005-2014 14

Coordinate descent algorithms Want: min_a min_b F(a,b) Coordinate descent: fix a, minimize over b; fix b, minimize over a; repeat Converges (if F is bounded) to a (often good) local optimum, as we saw in the applet (play with it!) (For LASSO it converged to the global optimum, because of convexity) K-means is a coordinate descent algorithm! Carlos Guestrin 2005-2014 15

Mixtures of Gaussians Machine Learning CSEP546 Carlos Guestrin University of Washington February 18, 2014 Carlos Guestrin 2005-2014 16

(One) bad case for k-means Clusters may overlap Some clusters may be wider than others Carlos Guestrin 2005-2014 17

Nonspherical data Carlos Guestrin 2005-2014 18

Quick Review of Gaussians Univariate and multivariate Gaussians Carlos Guestrin 2005-2014 19

Two-Dimensional Gaussians [Figure: contour and surface plots of 2-D Gaussians with spherical, diagonal, and full covariance matrices] Carlos Guestrin 2005-2014 20

Gaussians in d Dimensions P(x) = 1 / ((2π)^{d/2} |Σ|^{1/2}) exp( −½ (x − µ)^T Σ^{−1} (x − µ) ) Carlos Guestrin 2005-2014 21

Learning Gaussians P(x) = 1 / ((2π)^{d/2} |Σ|^{1/2}) exp( −½ (x − µ)^T Σ^{−1} (x − µ) ) Given data x_1, …, x_N: MLE for mean: µ̂ = (1/N) ∑_i x_i MLE for covariance: Σ̂ = (1/N) ∑_i (x_i − µ̂)(x_i − µ̂)^T Carlos Guestrin 2005-2014 22
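
A brief NumPy sketch of these closed-form MLE formulas and of evaluating the Gaussian log-density above (helper names gaussian_mle and gaussian_logpdf are mine, not from the slides):

```python
import numpy as np

def gaussian_mle(X):
    """MLE for a d-dimensional Gaussian: sample mean and (biased) sample covariance."""
    N, d = X.shape
    mu = X.mean(axis=0)              # (1/N) sum_i x_i
    Xc = X - mu
    Sigma = (Xc.T @ Xc) / N          # (1/N) sum_i (x_i - mu)(x_i - mu)^T
    return mu, Sigma

def gaussian_logpdf(x, mu, Sigma):
    """log P(x) for the d-dimensional Gaussian density above."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(Sigma, diff))
```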

When the world is not Gaussian Distribution of male heights in US Distribution of male heights in Sweden What if we mix these together? Carlos Guestrin 2005-2014 23

Gaussian Mixture Model Most commonly used mixture model Observations: x_1, …, x_N Parameters: mixture weights π = (π_1, …, π_K), means µ_k, covariances Σ_k Cluster indicator: z_i ∈ {1, …, K} Per-cluster likelihood: p(x_i | z_i = k) = N(x_i | µ_k, Σ_k) Ex. z_i = country of origin, x_i = height of the ith person; the kth mixture component = distribution of heights in country k Carlos Guestrin 2005-2014 24

Generative Model We can think of sampling observations from the model For each observation i: Sample a cluster assignment z_i ~ Categorical(π) Sample the observation from the selected Gaussian: x_i ~ N(µ_{z_i}, Σ_{z_i}) Carlos Guestrin 2005-2014 25
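
A sketch of this two-stage sampling process in NumPy, assuming mixture weights pi, means mus, and covariances Sigmas are given (the example parameters below are made up for illustration):

```python
import numpy as np

def sample_gmm(N, pi, mus, Sigmas, seed=0):
    """Sample N observations from a Gaussian mixture: first z_i, then x_i given z_i."""
    rng = np.random.default_rng(seed)
    K = len(pi)
    z = rng.choice(K, size=N, p=pi)                                  # z_i ~ Categorical(pi)
    x = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])   # x_i ~ N(mu_{z_i}, Sigma_{z_i})
    return x, z

# Example: two clusters in 2-D
pi = np.array([0.3, 0.7])
mus = [np.zeros(2), np.array([4.0, 4.0])]
Sigmas = [np.eye(2), np.diag([2.0, 0.5])]
x, z = sample_gmm(500, pi, mus, Sigmas)
```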

Density Estimation Estimate a density based on x_1, …, x_N Carlos Guestrin 2005-2014 26

Density Estimation [Figure: contour plot of the joint density] Carlos Guestrin 2005-2014 27

Density as Mixture of Gaussians Approximate density with a mixture of Gaussians [Figures: mixture of 3 Gaussians; contour plot of the joint density] Carlos Guestrin 2005-2014 28

Density as Mixture of Gaussians Approximate density with a mixture of Gaussians [Figure: mixture of 3 Gaussians] p(x_i | π, µ, Σ) = ∑_{k=1}^K π_k N(x_i | µ_k, Σ_k) Carlos Guestrin 2005-2014 29

Density as Mixture of Gaussians Approximate density with a mixture of Gaussians [Figures: mixture of 3 Gaussians; our actual observations] C. Bishop, Pattern Recognition & Machine Learning Carlos Guestrin 2005-2014 30

Clustering our Observations Imagine we have an assignment of each x_i to a Gaussian [Figures: (a) our actual observations; (b) complete data labeled by true cluster assignments] C. Bishop, Pattern Recognition & Machine Learning Carlos Guestrin 2005-2014 31

Clustering our Observations Imagine we have an assignment of each x_i to a Gaussian Introduce latent cluster indicator variable z_i Then we have p(x_i | z_i, π, µ, Σ) = N(x_i | µ_{z_i}, Σ_{z_i}) [Figure: complete data labeled by true cluster assignments] C. Bishop, Pattern Recognition & Machine Learning Carlos Guestrin 2005-2014 32

Clustering our Observations We must infer the cluster assignments from the observations Posterior probabilities of assignments to each cluster *given* model parameters: r_ik = p(z_i = k | x_i, π, µ, Σ) = π_k N(x_i | µ_k, Σ_k) / ∑_{j=1}^K π_j N(x_i | µ_j, Σ_j) [Figure: soft assignments to clusters] C. Bishop, Pattern Recognition & Machine Learning Carlos Guestrin 2005-2014 33
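
A short sketch of computing these soft assignments with SciPy's Gaussian density (parameters pi, mus, Sigmas assumed given; not the lecture's code):

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pi, mus, Sigmas):
    """Soft assignments r_ik = pi_k N(x_i|mu_k,Sigma_k) / sum_j pi_j N(x_i|mu_j,Sigma_j)."""
    K = len(pi)
    # Unnormalized: pi_k times the per-cluster Gaussian likelihood, shape (N, K)
    lik = np.column_stack([pi[k] * multivariate_normal.pdf(X, mus[k], Sigmas[k])
                           for k in range(K)])
    return lik / lik.sum(axis=1, keepdims=True)
```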

Unsupervised Learning: not as hard as it looks Sometimes easy, sometimes impossible, and sometimes in between Carlos Guestrin 2005-2014 34

Summary of GMM Concept Estimate a density based on x_1, …, x_N: p(x_i | π, µ, Σ) = ∑_{z_i=1}^K π_{z_i} N(x_i | µ_{z_i}, Σ_{z_i}) [Figures: complete data labeled by true cluster assignments; surface plot of the joint density, marginalizing cluster assignments] Carlos Guestrin 2005-2014 35

Summary of GMM Components Observations: x_i ∈ R^d, i = 1, 2, …, N Hidden cluster labels: z_i ∈ {1, 2, …, K}, i = 1, 2, …, N Hidden mixture means: µ_k ∈ R^d, k = 1, 2, …, K Hidden mixture covariances: Σ_k ∈ R^{d×d}, k = 1, 2, …, K Hidden mixture probabilities: π_k, with ∑_{k=1}^K π_k = 1 Gaussian mixture marginal and conditional likelihood: p(x_i | π, µ, Σ) = ∑_{z_i=1}^K π_{z_i} p(x_i | z_i, µ, Σ), where p(x_i | z_i, µ, Σ) = N(x_i | µ_{z_i}, Σ_{z_i}) Carlos Guestrin 2005-2014 36

Application to Document Modeling Machine Learning CSEP546 Carlos Guestrin University of Washington February 18, 2014 Carlos Guestrin 2005-2014 37

Cluster Documents Cluster documents based on topic Carlos Guestrin 2005-2014 38

Document Representation Bag of words model Carlos Guestrin 2005-2014 39

Issues with Document Representation Word counts are bad for standard similarity metrics Term Frequency Inverse Document Frequency (tf-idf) increases the importance of rare words Carlos Guestrin 2005-2014 40

TF-IDF Term frequency: tf(t, d) = number of times term t appears in document d (could also use {0, 1}, 1 + log tf(t, d), …) Inverse document frequency: idf(t, D) = log( |D| / |{d ∈ D : t ∈ d}| ) tf-idf: tfidf(t, d, D) = tf(t, d) · idf(t, D) High for a document d with a high frequency of term t (high term frequency) and few documents containing term t in the corpus (high inverse document frequency) Carlos Guestrin 2005-2014 41
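
A minimal sketch of these tf-idf formulas; the raw-count tf and log-ratio idf follow the definitions above, and the variable names are mine:

```python
import math
from collections import Counter

def tf(term, doc):
    """Term frequency: raw count of the term in the document (a list of tokens)."""
    return Counter(doc)[term]

def idf(term, corpus):
    """Inverse document frequency: log(#docs / #docs containing the term)."""
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing) if n_containing else 0.0

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# Example
corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran", "fast"]]
print(tfidf("cat", corpus[0], corpus))   # appears in 2 of 3 docs: small positive weight
print(tfidf("the", corpus[0], corpus))   # appears in every doc: idf = log(1) = 0
```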

A Generative Model Documents: x_i (e.g., tf-idf vectors) Associated topics: z_i Parameters: as in a simple mixture of Gaussians (π, µ, Σ) Carlos Guestrin 2005-2014 42

What you get from a mixture model for documents Words give topic: Topic proportions: Topic distribution of each document: Carlos Guestrin 2005-2014 43

Results from Wikipedia data using a similar model (LDA) Carlos Guestrin 2005-2014 44

Expectation Maximization Machine Learning CSEP546 Carlos Guestrin University of Washington February 18, 2014 Carlos Guestrin 2005-2014 45

Next, back to Density Estimation What if we want to do density estimation with multimodal or clumpy data? Carlos Guestrin 2005-2014 46

Learning Model Parameters Want to learn model parameters [Figures: mixture of 3 Gaussians; our actual observations] C. Bishop, Pattern Recognition & Machine Learning Carlos Guestrin 2005-2014 47

ML Estimate of Mixture Model Params Log likelihood: L_x(θ) ≜ log p({x_i} | θ) = ∑_i log ∑_{z_i} p(x_i, z_i | θ) Want the ML estimate: θ̂_ML = argmax_θ L_x(θ) Neither convex nor concave, and has local optima Carlos Guestrin 2005-2014 48
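
A sketch of evaluating this log likelihood for a Gaussian mixture; the log-sum-exp trick is added here purely for numerical stability and is not something the slide prescribes:

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_log_likelihood(X, pi, mus, Sigmas):
    """L_x(theta) = sum_i log sum_k pi_k N(x_i | mu_k, Sigma_k)."""
    K = len(pi)
    # log pi_k + log N(x_i | mu_k, Sigma_k), shape (N, K)
    log_terms = np.column_stack([
        np.log(pi[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
        for k in range(K)])
    return logsumexp(log_terms, axis=1).sum()
```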

Complete Data Imagine we have an assignment of each x_i to a cluster [Figures: (a) our actual observations; (b) complete data labeled by true cluster assignments] C. Bishop, Pattern Recognition & Machine Learning Carlos Guestrin 2005-2014 49

If complete data were observed Assume class labels z_i were observed in addition to x_i: L_{x,z}(θ) = ∑_i log p(x_i, z_i | θ) Compute ML estimates: the objective separates over clusters k! Example: mixture of Gaussians (MoG), θ = {π_k, µ_k, Σ_k}_{k=1}^K Carlos Guestrin 2005-2014 50

Cluster Responsibilities We must infer the cluster assignments from the observations Posterior probabilities of assignments to each cluster *given* model parameters: r_ik = p(z_i = k | x_i, θ) = π_k N(x_i | µ_k, Σ_k) / ∑_{j=1}^K π_j N(x_i | µ_j, Σ_j) [Figure: soft assignments to clusters] C. Bishop, Pattern Recognition & Machine Learning Carlos Guestrin 2005-2014 51

Iterative Algorithm Motivates a coordinate-ascent-like algorithm: 1. Infer missing values z_i given the current estimate θ̂ of the parameters 2. Optimize parameters θ̂ to produce a new estimate given the filled-in data z_i 3. Repeat Example: MoG 1. Infer responsibilities: r_ik = p(z_i = k | x_i, θ̂^(t-1)) = π̂_k N(x_i | µ̂_k, Σ̂_k) / ∑_j π̂_j N(x_i | µ̂_j, Σ̂_j) 2. Optimize parameters: max w.r.t. π_k: π̂_k = (1/N) ∑_i r_ik; max w.r.t. µ_k, Σ_k: responsibility-weighted mean and covariance of the points Carlos Guestrin 2005-2014 52
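
Putting the E-step and M-step together, a minimal EM sketch for a mixture of Gaussians (initialization and stopping rules are simplified; this illustrates the updates above rather than reproducing the course's code):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, seed=0):
    N, d = X.shape
    rng = np.random.default_rng(seed)
    # Crude initialization: random means from the data, shared covariance, uniform weights
    mus = X[rng.choice(N, size=K, replace=False)]
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iters):
        # E-step: responsibilities r_ik = p(z_i = k | x_i, theta)
        lik = np.column_stack([pi[k] * multivariate_normal.pdf(X, mus[k], Sigmas[k])
                               for k in range(K)])
        r = lik / lik.sum(axis=1, keepdims=True)
        # M-step: weighted ML updates for pi_k, mu_k, Sigma_k
        Nk = r.sum(axis=0)
        pi = Nk / N
        mus = (r.T @ X) / Nk[:, None]
        for k in range(K):
            Xc = X - mus[k]
            Sigmas[k] = (r[:, k][:, None] * Xc).T @ Xc / Nk[k] + 1e-6 * np.eye(d)
    return pi, mus, Sigmas, r
```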

E.M. Convergence EM is coordinate ascent on an interesting potential function Coord. ascent for bounded pot. func. → convergence to a local optimum guaranteed This algorithm is REALLY USED. And in high-dimensional state spaces, too. E.g., Vector Quantization for Speech Data Carlos Guestrin 2005-2014 53

Gaussian Mixture Example: Start Carlos Guestrin 2005-2014 54

After first iteration Carlos Guestrin 2005-2014 55

After 2nd iteration Carlos Guestrin 2005-2014 56

After 3rd iteration Carlos Guestrin 2005-2014 57

After 4th iteration Carlos Guestrin 2005-2014 58

After 5th iteration Carlos Guestrin 2005-2014 59

After 6th iteration Carlos Guestrin 2005-2014 60

After 20th iteration Carlos Guestrin 2005-2014 61

Some Bio Assay data Carlos Guestrin 2005-2014 62

GMM clustering of the assay data Carlos Guestrin 2005-2014 63

Resulting Density Estimator Carlos Guestrin 2005-2014 64

Initialization In the mixture model case there are many ways to initialize the EM algorithm. Examples: Choose K observations at random to define each cluster; assign the other observations to the nearest centroid to form initial parameter estimates Pick the centers sequentially to provide good coverage of the data Grow the mixture model by splitting (and sometimes removing) clusters until K clusters are formed Initialization can be quite important to convergence rates in practice Carlos Guestrin 2005-2014 65
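
One way to read "pick the centers sequentially to provide good coverage of the data" is distance-proportional seeding in the spirit of k-means++; the sketch below is my interpretation (the function name seed_centers is illustrative), not necessarily the exact scheme the slide intends:

```python
import numpy as np

def seed_centers(X, K, seed=0):
    """Pick K centers sequentially, preferring points far from the centers chosen so far."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]          # first center: uniformly at random
    for _ in range(K - 1):
        # Squared distance from each point to its nearest chosen center
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        # Sample the next center with probability proportional to that distance (good coverage)
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```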

Label switching Color = label does not matter Can switch labels and likelihood is unchanged Carlos Guestrin 2005-2014 66

What you should know K-means for clustering: the algorithm converges because it's coordinate descent on the potential F EM for mixture of Gaussians: how to learn maximum likelihood parameters (locally maximizing the likelihood) in the case of unlabeled data Remember, E.M. can get stuck in local optima, and empirically it DOES EM is coordinate ascent Carlos Guestrin 2005-2014 67

Dimensionality Reduction PCA Machine Learning CSEP546 Carlos Guestrin University of Washington February 18, 2014 Carlos Guestrin 2005-2014 68

Dimensionality reduction Input data may have thousands or millions of dimensions! e.g., text data has one dimension per vocabulary word Dimensionality reduction: represent data with fewer dimensions easier learning: fewer parameters visualization: hard to visualize more than 3D or 4D discover intrinsic dimensionality of data: high-dimensional data that is truly lower dimensional Carlos Guestrin 2005-2014 69

Lower dimensional projections Rather than picking a subset of the features, we can construct new features that are combinations of existing features Let's see this in the unsupervised setting: just X, but no Y Carlos Guestrin 2005-2014 70

Linear projection and reconstruction [Figure: 2-D points (x_1, x_2) projected into 1 dimension, coordinate z_1] Reconstruction: knowing only z_1, what was (x_1, x_2)? Carlos Guestrin 2005-2014 71

Principal component analysis basic idea Project d-dimensional data into a k-dimensional space while preserving information: e.g., project a space of 10,000 words into 3 dimensions; e.g., project 3-D into 2-D Choose the projection with minimum reconstruction error Carlos Guestrin 2005-2014 72

Linear projections, a review Project a point into a (lower dimensional) space: point: x = (x_1, …, x_d) select a basis: set of basis vectors (u_1, …, u_k); we consider an orthonormal basis: u_i · u_i = 1, and u_i · u_j = 0 for i ≠ j select a center x̄, which defines the offset of the space best coordinates in the lower dimensional space are defined by dot products: (z_1, …, z_k), z_i = (x − x̄) · u_i (minimum squared error) Carlos Guestrin 2005-2014 73
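
Not from the slides: a small NumPy sketch of these dot-product coordinates and the corresponding reconstruction, assuming an orthonormal basis U (basis vectors as columns) and a center x_bar are given:

```python
import numpy as np

def project(x, x_bar, U):
    """Coordinates z_i = (x - x_bar) . u_i for an orthonormal basis U (columns u_1..u_k)."""
    return U.T @ (x - x_bar)

def reconstruct(z, x_bar, U):
    """Reconstruction from the k coordinates: x_bar + sum_i z_i u_i."""
    return x_bar + U @ z

# Example: project a 3-D point onto a 2-D orthonormal basis
U = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])   # columns are u_1, u_2
x_bar = np.zeros(3)
x = np.array([2.0, -1.0, 5.0])
z = project(x, x_bar, U)          # -> [ 2., -1.]
x_hat = reconstruct(z, x_bar, U)  # -> [ 2., -1.,  0.]  (minimum squared error within span(U))
```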

PCA finds projection that minimizes reconstruction error Given N data points: x_i = (x_1i, …, x_di), i = 1 … N Will represent each point as a projection: x̂_i = x̄ + ∑_{j=1}^k z_ij u_j, where z_ij = (x_i − x̄) · u_j and x̄ = (1/N) ∑_{i=1}^N x_i PCA: Given k << d, find (u_1, …, u_k) minimizing the reconstruction error: error_k = ∑_{i=1}^N ||x_i − x̂_i||² Carlos Guestrin 2005-2014 74

Understanding the reconstruction error Note that x_i can be represented exactly by a d-dimensional projection: x_i = x̄ + ∑_{j=1}^d z_ij u_j Given k << d, find (u_1, …, u_k) minimizing the reconstruction error: error_k = ∑_{i=1}^N ||x_i − x̂_i||² Rewriting the error: error_k = ∑_{i=1}^N ∑_{j=k+1}^d ((x_i − x̄) · u_j)² Carlos Guestrin 2005-2014 75

Reconstruction error and covariance matrix error_k = ∑_{j=k+1}^d ∑_{i=1}^N u_j^T (x_i − x̄)(x_i − x̄)^T u_j = N ∑_{j=k+1}^d u_j^T Σ u_j, where Σ = (1/N) ∑_{i=1}^N (x_i − x̄)(x_i − x̄)^T is the empirical covariance matrix Carlos Guestrin 2005-2014 76

Minimizing reconstruction error and eigenvectors Minimizing the reconstruction error is equivalent to picking an orthonormal basis (u_1, …, u_d) minimizing: N ∑_{j=k+1}^d u_j^T Σ u_j Eigenvector: Σ u_j = λ_j u_j Minimizing the reconstruction error is equivalent to picking (u_{k+1}, …, u_d) to be the eigenvectors with the smallest eigenvalues Carlos Guestrin 2005-2014 77

Basic PCA algorithm Start from the m by n data matrix X Recenter: subtract the mean from each row of X: X_c ← X − X̄ Compute the covariance matrix: Σ ← (1/N) X_c^T X_c Find the eigenvectors and eigenvalues of Σ Principal components: the k eigenvectors with the highest eigenvalues Carlos Guestrin 2005-2014 78
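
A sketch of this basic algorithm using NumPy's eigendecomposition (np.linalg.eigh); the function name pca_eig and the return values are my choices, not the lecture's code:

```python
import numpy as np

def pca_eig(X, k):
    """Basic PCA: recenter, form the covariance, keep eigenvectors with the largest eigenvalues."""
    x_bar = X.mean(axis=0)
    Xc = X - x_bar                       # recenter: subtract the mean from each row
    Sigma = Xc.T @ Xc / len(X)           # covariance matrix (d x d)
    vals, vecs = np.linalg.eigh(Sigma)   # eigh: ascending eigenvalues, orthonormal eigenvectors
    order = np.argsort(vals)[::-1][:k]   # indices of the top-k eigenvalues
    U = vecs[:, order]                   # principal components as columns
    Z = Xc @ U                           # coordinates of each point in the k-D space
    return U, Z, x_bar
```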

PCA example Carlos Guestrin 2005-2014 79

PCA example: reconstruction using only the first principal component Carlos Guestrin 2005-2014 80

Eigenfaces [Turk, Pentland 91] Input images: Principal components: Carlos Guestrin 2005-2014 81

Eigenfaces reconstruction Each image corresponds to adding 8 principal components: Carlos Guestrin 2005-2014 82

Scaling up Covariance matrix can be really big! Σ is d by d Say, only 10,000 features: finding eigenvectors is very slow Use singular value decomposition (SVD): finds the top k eigenvectors great implementations available, e.g., GraphLab, Python, R, Matlab svd Carlos Guestrin 2005-2014 83

SVD Write X = W S V^T X: data matrix, one row per datapoint W: weight matrix, one row per datapoint (the coordinates of x_i in eigenspace) S: singular value matrix, diagonal; in our setting each entry is an eigenvalue λ_j V^T: singular vector matrix; in our setting each row is an eigenvector v_j Carlos Guestrin 2005-2014 84

PCA using SVD algorithm Start from the m by n data matrix X Recenter: subtract the mean from each row of X: X_c ← X − X̄ Call the SVD algorithm on X_c, asking for the top k singular vectors Principal components: the k singular vectors with the highest singular values (rows of V^T) Coefficients become: z_ij = (x_i − x̄) · v_j, i.e., the rows of W S restricted to the top k components Carlos Guestrin 2005-2014 85
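
A corresponding sketch using np.linalg.svd (again an illustrative implementation, not the course's reference code):

```python
import numpy as np

def pca_svd(X, k):
    """PCA via SVD: X_c = W S V^T; principal components are the top-k rows of V^T."""
    x_bar = X.mean(axis=0)
    Xc = X - x_bar                                    # recenter
    W, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    V_k = Vt[:k].T                                    # top-k singular vectors as columns
    Z = Xc @ V_k                                      # coefficients; equivalently W[:, :k] * s[:k]
    return V_k, Z, x_bar

# Example usage: reduce to 2 components, then reconstruct
X = np.random.randn(200, 5)
V_k, Z, x_bar = pca_svd(X, 2)
X_hat = x_bar + Z @ V_k.T      # reconstruction from the top-2 components
```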

What you need to know Dimensionality reduction: why and when it's important Simple feature selection Principal component analysis: minimizing reconstruction error; relationship to the covariance matrix and eigenvectors; using SVD Carlos Guestrin 2005-2014 86