CSC 411 Lecture 18: Matrix Factorizations

CSC 411 Lecture 18: Matrix Factorizations Roger Grosse, Amir-massoud Farahmand, and Juan Carrasquilla University of Toronto UofT CSC 411: 18-Matrix Factorizations 1 / 27

Overview Recall PCA: project data onto a low-dimensional subspace spanned by the top eigenvectors of the data covariance. We saw that PCA could be viewed as a linear autoencoder, which let us generalize to nonlinear autoencoders. Today we consider another generalization, matrix factorizations: view PCA as a matrix factorization problem; extend to matrix completion, where the data matrix is only partially observed; and extend to other matrix factorization models, which place different kinds of structure on the factors. UofT CSC 411: 18-Matrix Factorizations 2 / 27

PCA as Matrix Factorization Recall: each input vector $x^{(i)}$ is approximated as $Uz^{(i)}$, where $U$ is the orthogonal basis for the principal subspace and $z^{(i)}$ is the code vector. Write this in matrix form: $X$ and $Z$ are matrices with one column per data point (i.e., for this lecture, we transpose our usual convention for data matrices). Writing the squared error in matrix form: $\sum_{i=1}^N \|x^{(i)} - Uz^{(i)}\|^2 = \|X - UZ\|_F^2$. Recall that the Frobenius norm is defined as $\|A\|_F^2 = \sum_{i,j} a_{ij}^2$. UofT CSC 411: 18-Matrix Factorizations 3 / 27
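
A quick numerical check of this identity (a minimal NumPy sketch with an arbitrary orthonormal basis U, not the lecture's code): the per-example squared errors sum to the squared Frobenius norm of the residual matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, K = 5, 100, 2

X = rng.normal(size=(D, N))                    # one column per data point
U, _ = np.linalg.qr(rng.normal(size=(D, K)))   # D x K orthonormal basis
Z = U.T @ X                                    # K x N matrix of code vectors

per_example = sum(np.sum((X[:, i] - U @ Z[:, i]) ** 2) for i in range(N))
matrix_form = np.linalg.norm(X - U @ Z, "fro") ** 2
print(np.allclose(per_example, matrix_form))   # True
```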

PCA as Matrix Factorization So PCA approximates $X \approx UZ$. Based on the sizes of the matrices, this is a rank-$k$ approximation. Since $U$ was chosen to minimize reconstruction error, this is the optimal rank-$k$ approximation in terms of $\|X - UZ\|_F^2$. UofT CSC 411: 18-Matrix Factorizations 4 / 27

PCA vs. SVD (optional) This has a close relationship to the Singular Value Decomposition (SVD) of X. This is a factorization $X = USV^\top$. Properties: $U$, $S$, and $V$ provide a real-valued matrix factorization of $X$. $U$ is an $n \times k$ matrix with orthonormal columns, $U^\top U = I_k$, where $I_k$ is the $k \times k$ identity matrix. $V$ is an orthonormal $k \times k$ matrix, $V^\top = V^{-1}$. $S$ is a $k \times k$ diagonal matrix with non-negative singular values $s_1, s_2, \dots, s_k$ on the diagonal, where the singular values are conventionally ordered from largest to smallest. It's possible to show that the first $k$ singular vectors correspond to the first $k$ principal components; more precisely, $Z = SV^\top$. UofT CSC 411: 18-Matrix Factorizations 5 / 27
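
The connection can be checked numerically. The following NumPy sketch (my own illustration, assuming a centred data matrix with one column per point) shows that projecting onto the top-k eigenvectors of the covariance gives the same rank-k reconstruction as the truncated SVD.

```python
import numpy as np

rng = np.random.default_rng(1)
D, N, k = 6, 200, 3

X = rng.normal(size=(D, N))
X = X - X.mean(axis=1, keepdims=True)        # centre the data (columns are points)

# Rank-k PCA: project onto the top-k eigenvectors of the covariance matrix
evals, evecs = np.linalg.eigh(X @ X.T / N)
U_pca = evecs[:, np.argsort(evals)[::-1][:k]]
X_pca = U_pca @ (U_pca.T @ X)

# Rank-k truncated SVD of X
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_svd = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(np.allclose(X_pca, X_svd))             # True: same rank-k approximation
```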

Matrix Completion We just saw that PCA gives the optimal low-rank matrix factorization. Two ways to generalize this: (1) Consider the case where X is only partially observed. E.g., consider a sparse 1000 × 1000 matrix with 50,000 observations (only 5% observed). A rank-5 approximation requires only 10,000 parameters, so it's reasonable to fit this. Unfortunately, there is no closed-form solution. (2) Impose structure on the factors. We can get lots of interesting models this way. UofT CSC 411: 18-Matrix Factorizations 6 / 27

Recommender systems: Why? 400 hours of video are uploaded to YouTube every minute. 353 million products and 310 million users. 83 million paying subscribers and streams of about 35 million songs. Who cares about all these videos, products, and songs? People may care only about a few. Personalization: connect users with content they may use/enjoy. Recommender systems suggest items of interest and enjoyment to people based on their preferences. UofT CSC 411: 18-Matrix Factorizations 7 / 27

Some recommender systems in action UofT CSC 411: 18-Matrix Factorizations 8 / 27

Some recommender systems in action Ideally, recommendations should combine global and session interests, use your history if available, adapt over time, be coherent and diverse, and so on. UofT CSC 411: 18-Matrix Factorizations 9 / 27

The Netflix problem Movie recommendation: users watch movies and rate them as good or bad. [Table: example (user, movie, rating) triples for movies such as Thor, Chained, Frozen, Bambi, Titanic, Goodfellas, Dumbo, Twilight, and Tangled.] Because users only rate a few items, one would like to infer their preference for unrated items. UofT CSC 411: 18-Matrix Factorizations 10 / 27

Matrix completion problem Matrix completion problem: transform the table into a big users-by-movies rating matrix.

          Chained  Frozen  Bambi  Titanic  Goodfellas  Dumbo  Twilight  Thor  Tangled
Ninja        2        3      ?       ?         ?         ?       ?       1       ?
Cat          4        ?      5       ?         ?         ?       ?       ?       ?
Angel        ?        ?      ?       3         5         5       ?       ?       ?
Nursey       ?        ?      ?       ?         ?         ?       2       ?       ?
Tongey       ?        5      ?       ?         ?         ?       ?       ?       ?
Neutral      ?        ?      ?       ?         ?         ?       ?       ?       1

Data: users rate some movies, giving entries $R_{\text{user},\text{movie}}$. Very sparse. Task: fill in the missing data (the question marks), e.g. for recommending new movies to users. Algorithms: alternating least squares, gradient descent, non-negative matrix factorization, low-rank matrix completion, etc. UofT CSC 411: 18-Matrix Factorizations 11 / 27
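
To make the setup concrete, here is a small NumPy sketch (my own illustration, not course code) encoding this rating matrix with NaN for missing entries and a boolean mask of the observed ratings:

```python
import numpy as np

movies = ["Chained", "Frozen", "Bambi", "Titanic", "Goodfellas",
          "Dumbo", "Twilight", "Thor", "Tangled"]
users = ["Ninja", "Cat", "Angel", "Nursey", "Tongey", "Neutral"]

nan = np.nan
R = np.array([
    [2,   3,   nan, nan, nan, nan, nan, 1,   nan],
    [4,   nan, 5,   nan, nan, nan, nan, nan, nan],
    [nan, nan, nan, 3,   5,   5,   nan, nan, nan],
    [nan, nan, nan, nan, nan, nan, 2,   nan, nan],
    [nan, 5,   nan, nan, nan, nan, nan, nan, nan],
    [nan, nan, nan, nan, nan, nan, nan, nan, 1],
])

observed = ~np.isnan(R)       # boolean mask of rated entries
print(observed.mean())        # fraction of entries observed (10 of 54)
```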

Latent factor models In our current setting, latent factor models attempt to explain the ratings by characterizing both items and users in terms of K factors inferred from the ratings patterns. For simplicity, we can associate these factors with idealized concepts like comedy, drama, action, children's movies, or quirkiness, but there may also be uninterpretable dimensions. Can we decompose the ratings matrix R so that these (or similar) latent factors are automatically discovered? UofT CSC 411: 18-Matrix Factorizations 12 / 27

[Figure: the ratings matrix R factorized as the product of two low-rank matrices U and Z.] Approach: matrix factorization methods. UofT CSC 411: 18-Matrix Factorizations 13 / 27

Alternating least squares Assume that the matrix R is low rank. One can attempt to factorize $R \approx UZ$ in terms of small matrices $U = \begin{bmatrix} u_1^\top \\ \vdots \\ u_D^\top \end{bmatrix}$ and $Z = \begin{bmatrix} z_1 & \dots & z_N \end{bmatrix}$. Using the squared error loss, a matrix factorization corresponds to solving $\min_{U,Z} f(U, Z)$, with $f(U, Z) = \frac{1}{2} \sum_{r_{ij}\ \text{observed}} (r_{ij} - u_i^\top z_j)^2$. The objective is non-convex, and in fact it's NP-hard to optimize. (See "Low-Rank Matrix Approximation with Weights or Missing Data is NP-hard" by Gillis and Glineur, 2011.) As a function of either U or Z individually, the problem is convex. But we have a chicken-and-egg problem, just like with K-means and mixture models! Alternating Least Squares (ALS): fix Z and optimize U, then fix U and optimize Z, and so on until convergence. UofT CSC 411: 18-Matrix Factorizations 14 / 27

Alternating least squares ALS for Matrix Completion algorithm:
1. Initialize U and Z randomly
2. repeat
3.   for i = 1, ..., D do
4.     $u_i = \big(\sum_{j : r_{ij} \neq 0} z_j z_j^\top\big)^{-1} \sum_{j : r_{ij} \neq 0} r_{ij} z_j$
5.   for j = 1, ..., N do
6.     $z_j = \big(\sum_{i : r_{ij} \neq 0} u_i u_i^\top\big)^{-1} \sum_{i : r_{ij} \neq 0} r_{ij} u_i$
7. until convergence
See also the paper Probabilistic Matrix Factorization in the course readings. UofT CSC 411: 18-Matrix Factorizations 15 / 27
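
The updates above translate directly into code. Here is a minimal NumPy sketch of ALS for matrix completion (my own illustration, not the course's reference implementation); the small ridge term `reg` is an addition of mine for numerical stability and is not part of the updates on the slide.

```python
import numpy as np

def als_complete(R, observed, k=5, n_iters=50, reg=1e-6, seed=0):
    """Alternating least squares for R ~= U @ Z over the observed entries.

    R        : (D, N) ratings matrix (values at unobserved entries are ignored)
    observed : (D, N) boolean mask of observed entries
    """
    D, N = R.shape
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(D, k))
    Z = rng.normal(scale=0.1, size=(k, N))
    ridge = reg * np.eye(k)

    for _ in range(n_iters):
        # Update each row u_i of U with Z fixed
        for i in range(D):
            j = observed[i]                        # movies rated by user i
            U[i] = np.linalg.solve(Z[:, j] @ Z[:, j].T + ridge, Z[:, j] @ R[i, j])
        # Update each column z_j of Z with U fixed
        for j in range(N):
            i = observed[:, j]                     # users who rated movie j
            Z[:, j] = np.linalg.solve(U[i].T @ U[i] + ridge, U[i].T @ R[i, j])
    return U, Z
```

For example, with the toy matrix R and mask observed from the earlier sketch, `U, Z = als_complete(np.nan_to_num(R), observed, k=2)` gives a rank-2 completion U @ Z whose entries at the question marks can serve as predicted ratings.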

More matrix factorizations UofT CSC 411: 18-Matrix Factorizations 16 / 27

K-Means It's even possible to view K-means as a matrix factorization! Stack the indicator vectors $r^{(n)}$ for the cluster assignments into a matrix R, and stack the cluster centers $m_k$ into a matrix M. The reconstruction of the data is then given by RM. K-means distortion function in matrix form: $\sum_{n=1}^N \sum_{k=1}^K r_k^{(n)} \|m_k - x^{(n)}\|^2 = \|X - RM\|_F^2$. UofT CSC 411: 18-Matrix Factorizations 17 / 27
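
A quick NumPy check of this identity (my own sketch, with random data and arbitrary hard assignments rather than a fitted K-means model):

```python
import numpy as np

rng = np.random.default_rng(2)
N, D, K = 50, 4, 3

X = rng.normal(size=(N, D))                      # one row per data point
M = rng.normal(size=(K, D))                      # one row per cluster center
assign = rng.integers(K, size=N)                 # arbitrary hard assignments
R = np.eye(K)[assign]                            # (N, K) one-hot indicator matrix

distortion = sum(np.sum((X[n] - M[assign[n]]) ** 2) for n in range(N))
matrix_form = np.linalg.norm(X - R @ M, "fro") ** 2
print(np.allclose(distortion, matrix_form))      # True
```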

K-Means Can sort by cluster for visualization: UofT CSC 411: 18-Matrix Factorizations 18 / 27

Co-clustering (optional) We can take this a step further. Co-clustering clusters both the rows and columns of a data matrix, giving a block structure. We can represent this as the indicator matrix for the rows, times the matrix of means for each block, times the transpose of the indicator matrix for the columns. UofT CSC 411: 18-Matrix Factorizations 19 / 27
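
A small NumPy sketch of that three-factor form (my own illustration, with made-up row/column clusters and block means):

```python
import numpy as np

# 2 row clusters over 4 rows, 3 column clusters over 6 columns
row_assign = np.array([0, 0, 1, 1])
col_assign = np.array([0, 0, 1, 1, 2, 2])
R = np.eye(2)[row_assign]            # (4, 2) row-indicator matrix
C = np.eye(3)[col_assign]            # (6, 3) column-indicator matrix
B = np.array([[1.0, 2.0, 3.0],       # (2, 3) matrix of block means
              [4.0, 5.0, 6.0]])

X_blocks = R @ B @ C.T               # (4, 6) block-structured reconstruction
print(X_blocks)
```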

Sparse Coding Efficient coding hypothesis: the structure of our visual system is adapted to represent the visual world in an efficient way, e.g. to be able to represent sensory signals with only a small fraction of neurons having to fire (e.g. to save energy). Olshausen and Field fit a sparse coding model to natural images to try to determine what the most efficient representation is. They didn't encode anything specific about the brain into their model, but the learned representations bore a striking resemblance to the representations in the primary visual cortex. UofT CSC 411: 18-Matrix Factorizations 20 / 27

Sparse Coding This algorithm works on small (e.g. 20 × 20) image patches, which we reshape into vectors (i.e. we ignore the spatial structure). Suppose we have a dictionary of basis functions $\{a_k\}_{k=1}^K$ which can be combined to model each patch. Each patch is approximated as a linear combination of a small number of basis functions. This is an overcomplete representation, in that typically K > D (i.e. more basis functions than pixels): $x \approx \sum_{k=1}^K s_k a_k = As$. Since we use only a few basis functions, s is a sparse vector. UofT CSC 411: 18-Matrix Factorizations 21 / 27

Sparse Coding We'd like to choose s to accurately reconstruct the image, but also to encourage sparsity in s. What cost function should we use? Inference in the sparse coding model: $\min_s \|x - As\|^2 + \beta \|s\|_1$. Here, β is a hyperparameter that trades off reconstruction error vs. sparsity. There are efficient algorithms for minimizing this cost function (beyond the scope of this class). UofT CSC 411: 18-Matrix Factorizations 22 / 27
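
One simple option is iterative soft thresholding (ISTA). The lecture doesn't specify which algorithm it has in mind, so the following NumPy sketch is just one illustrative choice, not the course's method:

```python
import numpy as np

def sparse_code(x, A, beta, n_iters=200):
    """Approximately solve min_s ||x - A s||^2 + beta * ||s||_1 via ISTA."""
    s = np.zeros(A.shape[1])
    # Step size from the Lipschitz constant of the gradient of ||x - A s||^2
    step = 1.0 / (2.0 * np.linalg.norm(A, 2) ** 2)
    for _ in range(n_iters):
        grad = 2.0 * A.T @ (A @ s - x)                    # gradient of the quadratic term
        u = s - step * grad                               # gradient step
        s = np.sign(u) * np.maximum(np.abs(u) - step * beta, 0.0)  # soft threshold
    return s

# Tiny example with an overcomplete dictionary (K > D)
rng = np.random.default_rng(3)
D, K = 16, 32
A = rng.normal(size=(D, K))
A /= np.linalg.norm(A, axis=0)                            # unit-norm basis functions
x = 2.0 * A[:, 3] - 1.5 * A[:, 17]                        # signal built from two atoms
s = sparse_code(x, A, beta=0.1)
print(np.sum(np.abs(s) > 1e-3), "active basis functions")
```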

Sparse Coding: Learning the Dictionary We can learn a dictionary by optimizing both A and $\{s_i\}_{i=1}^N$ to trade off reconstruction error and sparsity: $\min_{\{s_i\},A} \sum_{i=1}^N \|x_i - As_i\|^2 + \beta \|s_i\|_1$ subject to $\|a_k\|^2 \le 1$ for all k. Why is the normalization constraint on $a_k$ needed? The reconstruction term can be written in matrix form as $\|X - AS\|_F^2$, where S combines the $s_i$. Can fit using an alternating minimization scheme over A and S, just like K-means, EM, low-rank matrix completion, etc. UofT CSC 411: 18-Matrix Factorizations 23 / 27
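
One such alternating scheme, sketched below in NumPy and reusing the sparse_code ISTA routine from the previous sketch (again my own illustration, not the Olshausen-Field procedure): hold A fixed and sparse-code each column of X, then hold S fixed, solve a least-squares problem for A, and project A's columns back onto the norm constraint.

```python
import numpy as np

def learn_dictionary(X, K, beta, n_outer=20, seed=4):
    """Alternating minimization for min ||X - A S||_F^2 + beta * sum_i ||s_i||_1,
    subject to ||a_k||^2 <= 1. Requires sparse_code() from the previous sketch."""
    D, N = X.shape
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(D, K))
    A /= np.linalg.norm(A, axis=0)                     # start with unit-norm columns

    for _ in range(n_outer):
        # Sparse coding step: update each code s_i with A fixed
        S = np.column_stack([sparse_code(X[:, i], A, beta) for i in range(N)])
        # Dictionary step: least-squares solution for A with S fixed
        A = X @ S.T @ np.linalg.pinv(S @ S.T)
        # Project columns back onto the constraint set ||a_k||^2 <= 1
        norms = np.maximum(np.linalg.norm(A, axis=0), 1.0)
        A = A / norms
    return A, S
```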

Sparse Coding: Learning the Dictionary Basis functions learned from natural images: UofT CSC 411: 18-Matrix Factorizations 24 / 27

Sparse Coding: Learning the Dictionary The sparse components are oriented edges, similar to what a conv net learns But the learned dictionary is much more diverse than the first-layer conv net representations: tiles the space of location, frequency, and orientation in an efficient way Each basis function has similar response properties to cells in the primary visual cortex (the first stage of visual processing in the brain) UofT CSC 411: 18-Matrix Factorizations 25 / 27

Sparse Coding Applying sparse coding to speech signals: (Grosse et al., 2007, "Shift-invariant sparse coding for audio classification") UofT CSC 411: 18-Matrix Factorizations 26 / 27

Summary PCA can be viewed as fitting the optimal low-rank approximation to a data matrix. Matrix completion is the setting where the data matrix is only partially observed; solve it using ALS, an alternating procedure analogous to EM. PCA, K-means, co-clustering, sparse coding, and lots of other interesting models can be viewed as matrix factorizations, with different kinds of structure imposed on the factors. UofT CSC 411: 18-Matrix Factorizations 27 / 27