CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2016

Similar documents
CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2017

CPSC 340: Machine Learning and Data Mining. Logistic Regression Fall 2016

CPSC 340: Machine Learning and Data Mining. Kernel Trick Fall 2017

CPSC 340: Machine Learning and Data Mining. Robust Regression Fall 2015

CPSC 340: Machine Learning and Data Mining. Multi-Class Classification Fall 2017

CPSC 340: Machine Learning and Data Mining. Multi-Dimensional Scaling Fall 2017

CPSC 340: Machine Learning and Data Mining. Deep Learning Fall 2018

CPSC 340: Machine Learning and Data Mining. More Linear Classifiers Fall 2017

1 Training/Validation/Testing

CPSC 340: Machine Learning and Data Mining. Recommender Systems Fall 2017

1 Case study of SVM (Rob)

Machine Learning. Topic 5: Linear Discriminants. Bryan Pardo, EECS 349 Machine Learning, 2013

CPSC 340: Machine Learning and Data Mining. Feature Selection Fall 2016

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet.

Machine Learning and Data Mining. Clustering (1): Basics. Kalev Kask

FMA901F: Machine Learning Lecture 3: Linear Models for Regression. Cristian Sminchisescu

CPSC 340: Machine Learning and Data Mining. Feature Selection Fall 2017

Machine Learning / Jan 27, 2010

Contents Machine Learning concepts 4 Learning Algorithm 4 Predictive Model (Model) 4 Model, Classification 4 Model, Regression 4 Representation

Gradient Descent. Wed Sept 20th, James McInerney Adapted from slides by Francisco J. R. Ruiz

Generative and discriminative classification techniques

COMPUTATIONAL INTELLIGENCE (INTRODUCTION TO MACHINE LEARNING) SS18. Lecture 2: Linear Regression Gradient Descent Non-linear basis functions

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet.

The exam is closed book, closed notes except your one-page cheat sheet.

CPSC 340: Machine Learning and Data Mining. Deep Learning Fall 2016

Clustering K-means. Machine Learning CSEP546 Carlos Guestrin University of Washington February 18, Carlos Guestrin

What is machine learning?

CPSC 340: Machine Learning and Data Mining. Density-Based Clustering Fall 2016

COMPUTATIONAL INTELLIGENCE (CS) (INTRODUCTION TO MACHINE LEARNING) SS16. Lecture 2: Linear Regression Gradient Descent Non-linear basis functions

CPSC 340: Machine Learning and Data Mining. Non-Parametric Models Fall 2016

Slides adapted from Marshall Tappen and Bryan Russell. Algorithms in Nature. Non-negative matrix factorization

REGRESSION ANALYSIS : LINEAR BY MAUAJAMA FIRDAUS & TULIKA SAHA

Machine Learning for OR & FE

CS 229 Midterm Review

CSC 411: Lecture 12: Clustering

Clustering and The Expectation-Maximization Algorithm

Unsupervised learning in Vision

10-701/15-781, Fall 2006, Final

CPSC 340: Machine Learning and Data Mining. Ranking Fall 2016

Logistic Regression

Grundlagen der Künstlichen Intelligenz

scikit-learn (Machine Learning in Python)

CPSC 340: Machine Learning and Data Mining. More Regularization Fall 2017

Natural Language Processing CS 6320 Lecture 6 Neural Language Models. Instructor: Sanda Harabagiu

Introduction to Machine Learning CMU-10701

CPSC 340: Machine Learning and Data Mining

Machine Learning. Supervised Learning. Manfred Huber

CPSC 340: Machine Learning and Data Mining. Finding Similar Items Fall 2017

Machine Learning: Think Big and Parallel

Machine Learning : Clustering, Self-Organizing Maps

Pattern Recognition. Kjell Elenius. Speech, Music and Hearing KTH. March 29, 2007 Speech recognition

4.12 Generalization. In back-propagation learning, as many training examples as possible are typically used.

06: Logistic Regression

ECS289: Scalable Machine Learning

CPSC 340: Machine Learning and Data Mining. Non-Linear Regression Fall 2016

Unsupervised Learning

Neural Networks and Deep Learning

Weka ( )

Lecture 27: Review. Reading: All chapters in ISLR. STATS 202: Data mining and analysis. December 6, 2017

Semi-supervised Learning

Lecture 4. Convexity Robust cost functions Optimizing non-convex functions. 3B1B Optimization Michaelmas 2017 A. Zisserman

CSE 158. Web Mining and Recommender Systems. Midterm recap

Preface to the Second Edition. Preface to the First Edition. 1 Introduction 1

ECE 5424: Introduction to Machine Learning

Expectation Maximization (EM) and Gaussian Mixture Models

Divide and Conquer Kernel Ridge Regression

node2vec: Scalable Feature Learning for Networks

CSE 6242 A / CS 4803 DVA. Feb 12, Dimension Reduction. Guest Lecturer: Jaegul Choo

Algebraic Iterative Methods for Computed Tomography

Supervised vs unsupervised clustering

Regularization and model selection

Announcements. CS 188: Artificial Intelligence Spring Classification: Feature Vectors. Classification: Weights. Learning: Binary Perceptron

Supervised Learning with Neural Networks. We now look at how an agent might learn to solve a general problem by seeing examples.

Mixture Models and the EM Algorithm

Classification: Feature Vectors

CS535 Big Data Fall 2017 Colorado State University 10/10/2017 Sangmi Lee Pallickara Week 8- A.

Network Traffic Measurements and Analysis

CPSC 340: Machine Learning and Data Mining

Evaluating Classifiers

CPSC Coding Project (due December 17)

LSML19: Introduction to Large Scale Machine Learning

Generative and discriminative classification

Scalable Data Analysis

K-Means Clustering. Sargur Srihari

Machine Learning Techniques

Knowledge Discovery and Data Mining. Neural Nets. A simple NN as a Mathematical Formula. Notes. Lecture 13 - Neural Nets. Tom Kelsey.

Data Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Knowledge Discovery and Data Mining

CSE 5526: Introduction to Neural Networks Radial Basis Function (RBF) Networks

CSE 573: Artificial Intelligence Autumn 2010

Clustering Lecture 5: Mixture Model

Function Algorithms: Linear Regression, Logistic Regression

Support vector machines

Perceptron as a graph

Clustering and Visualisation of Data

Classification of Subject Motion for Improved Reconstruction of Dynamic Magnetic Resonance Imaging

Facial Expression Classification with Random Filters Feature Extraction

Multinomial Regression and the Softmax Activation Function. Gary Cottrell!

Evaluating Classifiers

Kernels + K-Means Introduction to Machine Learning. Matt Gormley Lecture 29 April 25, 2018

Transcription:

CPSC 340: Machine Learning and Data Mining Principal Component Analysis Fall 2016

Admin: A2/midterm grades and solutions will be posted after class. Assignment 4: posted, due November 14. Extra office hours: Thursdays from 4:30-5:30 in ICICS X836.

Last Time: MAP Estimation. MAP estimation maximizes the posterior: the likelihood measures the probability of the labels y given the parameters w, and the prior measures the probability of the parameters w before we see the data. For IID training data and an independent prior, this is equivalent to minimizing the negative log-posterior shown below. So the negative log-likelihood acts as an error function, and the negative log-prior acts as a regularizer. Squared error comes from a Gaussian likelihood, and L2-regularization comes from a Gaussian prior.
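The equations on this slide did not survive the transcription; written out in standard notation (a reconstruction, not the original slide), MAP estimation and the resulting objective are

\hat{w} \in \arg\max_w p(w \mid X, y) = \arg\max_w p(y \mid X, w)\, p(w)
\hat{w} \in \arg\min_w \big[ -\sum_{i=1}^n \log p(y_i \mid x_i, w) - \log p(w) \big].

With a Gaussian likelihood y_i \sim N(w^T x_i, 1) and a Gaussian prior w_j \sim N(0, 1/\lambda), this reduces to L2-regularized least squares: f(w) = \frac{1}{2} \sum_{i=1}^n (w^T x_i - y_i)^2 + \frac{\lambda}{2} \|w\|^2.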

Multi-Class Classification. For binary classification with linear models we predict using the sign of w^T x_i. For multi-class classification we use a separate vector w_c for each class c and predict the class with the largest w_c^T x_i. To jointly estimate the w_c, we can use the softmax likelihood; taking the negative log and adding a regularizer gives the softmax loss, written out below.
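The formulas themselves are not in the transcription; in standard notation (a reconstruction) the softmax likelihood and the regularized negative log-likelihood are

p(y_i = c \mid x_i, W) = \frac{\exp(w_c^T x_i)}{\sum_{c'=1}^k \exp(w_{c'}^T x_i)}

f(W) = \sum_{i=1}^n \big[ -w_{y_i}^T x_i + \log \sum_{c=1}^k \exp(w_c^T x_i) \big] + \frac{\lambda}{2} \|W\|_F^2.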

Digression: Frobenius Matrix Norm
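The slide's formula is not in the transcription; the Frobenius norm of the k-by-d matrix W (with the w_c stacked as rows) is just the L2-norm of all its entries:

\|W\|_F = \sqrt{ \sum_{c=1}^k \sum_{j=1}^d w_{cj}^2 } = \sqrt{ \sum_{c=1}^k \|w_c\|^2 }.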

End of Part 3: Key Concepts. Linear models base predictions on linear combinations of features, \hat{y}_i = w^T x_i. We model non-linear effects using a change of basis: replace x_i with z_i and use w^T z_i; examples include the polynomial basis and (non-parametric) RBFs. Regression is supervised learning with continuous labels. A logical error measure for regression is the squared error, which can be minimized by solving a system of linear equations (a sketch follows).
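As a quick illustration of the last point (a minimal NumPy sketch, not the course's own code; the function name least_squares is made up here), the squared error is minimized by solving the normal equations (X^T X) w = X^T y:

import numpy as np

def least_squares(X, y):
    # Solve the normal equations (X^T X) w = X^T y; assumes X^T X is invertible.
    return np.linalg.solve(X.T @ X, X.T @ y)

# For non-linear regression, replace X by the basis-expanded features Z: least_squares(Z, y).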

End of Part 3: Key Concepts. We can reduce overfitting by using regularization. Squared error is not always the right measure: absolute error is less sensitive to outliers, logistic loss and hinge loss are better for binary y_i, and the softmax loss is better for multi-class y_i. MLE/MAP perspective: we can view the loss as a negative log-likelihood and the regularizer as a negative log-prior, which lets us define losses based on probabilities.

End of Part 3: Key Concepts. Gradient descent finds a local minimum of smooth objectives, and converges to a global optimum for convex functions; for non-smooth objectives we can use smooth approximations (Huber, log-sum-exp). Stochastic gradient methods allow huge or infinite n, though they are very sensitive to the step size. Kernels let us use similarity between examples instead of features, which lets us use some exponential- or even infinite-dimensional feature spaces. Feature selection is a messy topic: classic methods are hypothesis testing and search-and-score, and L1-regularization simultaneously regularizes and selects features.

The Story So Far. Supervised Learning Part 1: methods based on counting and distances. Unsupervised Learning Part 1: methods based on counting and distances. Supervised Learning Part 2: methods based on linear models and gradient descent. Unsupervised Learning Part 2: methods based on linear models and gradient descent.

Unsupervised Learning Part 2. Unsupervised learning: we only have x_i values, but no explicit target labels, and you want to do something with them. Some unsupervised learning tasks: Clustering: what types of x_i are there? Outlier detection: is this a normal x_i? Association rules: which x_ij occur together? Latent-factors: what parts are the x_i made from? Data visualization: what does the high-dimensional X look like? Ranking: which are the most important x_i?

Motivation: Vector Quantization. Recall using k-means for vector quantization: run k-means to find a set of means w_c; this gives a cluster c_i for each object i; then replace the features x_i by the mean of their cluster, w_{c_i} (see the code sketch below).
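A minimal sketch of this vector-quantization step (an illustration, not the course's own code; using scikit-learn's KMeans and the made-up function name vector_quantize):

from sklearn.cluster import KMeans

def vector_quantize(X, k):
    # Fit k-means to find the means w_c and a cluster c_i for each object i.
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    W = km.cluster_centers_          # k-by-d matrix whose rows are the means w_c
    c = km.labels_                   # cluster assignment c_i for each row of X
    X_hat = W[c]                     # replace each x_i by the mean of its cluster
    return X_hat, W, c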

Motivation: Vector Quantization. We can write vector quantization as a linear model: define z_i as a binary vector that is zero except for a 1 in position c_i, and use our (slightly unusual) notation for the mean matrix W, the k-by-d matrix whose row c is w_c^T, so that W^T z_i = w_{c_i}.

Regression View of K-Means. Recall that we said k-means minimizes the sum of squared distances from each x_i to its cluster mean. In our new notation, we can write k-means as minimizing the same objective over W and Z (written out below). We can view this as solving d regression problems: each w_j (column j of W) is trying to predict column j of X from the basis z_i. But we're also trying to learn the basis z_i.
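The objective on the slide did not survive the transcription; in the notation above it is

f(W, Z) = \sum_{i=1}^n \|x_i - w_{c_i}\|^2 = \sum_{i=1}^n \|W^T z_i - x_i\|^2 = \|ZW - X\|_F^2,

since z_i is zero except for a 1 in position c_i, so W^T z_i picks out row c_i of W (the mean w_{c_i}). Here Z is the n-by-k matrix with rows z_i^T, and predicting column j of X is the regression problem of fitting X_{:,j} \approx Z w_j, with w_j the j-th column of W.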

Principal Component Analysis (PCA). Principal component analysis (PCA) minimizes the same objective, but instead of a 1-of-k binary z_i we allow a continuous basis z_i. This is called a latent-factor model: instead of means, the w_c are called factors or principal components; the z_i are called factor loadings or the low-dimensional basis; the z_i say how to mix the means/factors to approximate example i.

Principal Component Analysis (PCA). In matrix notation, PCA approximates X by the product ZW, so it is also called a matrix factorization model (see below).
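Written out (the slide's matrices are not in the transcription), the PCA/matrix-factorization objective with Z \in R^{n \times k} and W \in R^{k \times d} is

f(W, Z) = \sum_{i=1}^n \sum_{j=1}^d (w_j^T z_i - x_{ij})^2 = \|ZW - X\|_F^2, \quad \text{i.e.,} \quad X \approx ZW,

the same objective as k-means but with continuous z_i \in R^k.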

PCA Applications PCA has been reinvented many times: https://en.wikipedia.org/wiki/principal_component_analysis

PCA Applications. Dimensionality reduction: replace X with the lower-dimensional Z; if k << d, this compresses the data, and it gives a much better approximation than vector quantization. Outlier detection: if PCA gives a poor approximation of x_i, it could be an outlier (though, due to the squared error, PCA itself is sensitive to outliers). Partial least squares: uses the PCA features as the basis for a linear model.
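A minimal sketch of the dimensionality-reduction application (an illustration only, using scikit-learn's PCA rather than the course's own implementation; the function name compress is made up):

import numpy as np
from sklearn.decomposition import PCA

def compress(X, k):
    pca = PCA(n_components=k)
    Z = pca.fit_transform(X)              # n-by-k low-dimensional representation
    X_hat = pca.inverse_transform(Z)      # reconstruction (roughly Z W plus the mean)
    error = np.linalg.norm(X - X_hat)**2  # squared reconstruction error
    return Z, X_hat, error

# With k = 2, plotting the two columns of Z gives the data-visualization use of PCA.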

PCA Applications. Data visualization: plot z_i with k = 2 to visualize high-dimensional objects. http://infoproc.blogspot.ca/2008/11/european-genetic-substructure.html

PCA Applications. Data interpretation: we can try to assign meaning to the latent factors w_c, as hidden factors that influence all the variables. https://new.edu/resources/big-5-personality-traits

PCA with d=1. Consider the case of PCA when d=1: there is an obvious solution, w = 1 and z_i = x_i. PCA is only interesting when k < d, since otherwise we can just set z_i = x_i. PCA is also not unique: w = 1/α and z_i = αx_i is a solution for any α ≠ 0. We can enforce |w| = 1 to avoid this problem.

PCA with d=2 and k=1. So the simplest interesting case is d=2 and k=1. This is very similar to a least squares problem, but note that we have no y_i: we are trying to predict each feature x_ij from the single z_i. And the features z_i are themselves variables; we are learning the z_i too. Side note: in PCA we assume the features have a mean of 0; you can subtract the mean or add a bias variable if this is not true.

PCA with d=2 and k=1 (figure-only slides).

PCA Computation. The PCA objective with general d and k is f(W, Z) = ||ZW - X||_F^2, as above. Three common ways to solve this problem: Singular value decomposition: the classic non-iterative approach (bonus slide). Alternating minimization: 1. Start with a random initialization. 2. Optimize W with Z fixed (set the gradient with respect to W to zero and solve). 3. Optimize Z with W fixed (set the gradient with respect to Z to zero and solve). 4. Go back to 2. Stochastic gradient: gradient descent based on a random i and j. A sketch of alternating minimization follows.
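A minimal sketch of the alternating-minimization approach (not the course's own code; assumes X is an already-centered n-by-d NumPy array, and the function name pca_alternating is made up):

import numpy as np

def pca_alternating(X, k, iters=100, seed=0):
    # Minimize ||ZW - X||_F^2 by alternating least squares.
    n, d = X.shape
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n, k))                 # random initialization
    for _ in range(iters):
        W = np.linalg.solve(Z.T @ Z, Z.T @ X)       # optimize W with Z fixed (k-by-d)
        Z = np.linalg.solve(W @ W.T, W @ X.T).T     # optimize Z with W fixed (n-by-k)
    return W, Z

# Usage: W, Z = pca_alternating(X - X.mean(axis=0), k=2)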

PCA Non-Convexity. The PCA objective with general d and k (above) is not jointly convex in W and Z. This is why iterative methods need random initialization: if you initialize with z_1 = z_2, then they stay the same. But it's possible to show that all stable local optima are global optima, so alternating minimization and stochastic gradient give global optima in practice.

Summary. Latent-factor models compress data as a linear combination of factors. Principal component analysis is the most common variant, based on squared reconstruction error. Next time: modifying PCA so it splits faces into eyes, mouths, etc.

Bonus Slide: PCA with Singular Value Decomposition. Under the constraints that w_c^T w_c = 1 and w_c^T w_{c'} = 0 for c ≠ c', we can take the rows of W to be the top k right singular vectors of X, and then Z = XW^T. You can also quickly get a compressed version of new data via Z̃ = X̃W^T. If W were not orthogonal, we could get Z by least squares. A sketch follows.
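A minimal sketch of this SVD approach (not the course's own code; assumes X is already centered, and the function name pca_svd is made up):

import numpy as np

def pca_svd(X, k):
    # The rows of Vt are orthonormal right singular vectors of X.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    W = Vt[:k]                # k-by-d, satisfies W W^T = I
    Z = X @ W.T               # n-by-k compressed representation
    return W, Z

# Compressing new (centered) data Xtilde: Ztilde = Xtilde @ W.T.
# If W were not orthonormal, Z would come from least squares:
# Z = np.linalg.solve(W @ W.T, W @ X.T).T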