Locally Linear Landmarks for large-scale manifold learning

Locally Linear Landmarks for large-scale manifold learning

Max Vladymyrov and Miguel Á. Carreira-Perpiñán
Electrical Engineering and Computer Science, University of California, Merced
http://eecs.ucmerced.edu

Spectral dimensionality reduction methods

Given high-dimensional data points $Y_{D \times N} = (y_1, \dots, y_N)$, find low-dimensional points $X_{d \times N} = (x_1, \dots, x_N)$, with $d \ll D$, as the solution of the following optimization problem:

    $\min_X \operatorname{tr}(X A X^T)$   s.t.   $X B X^T = I$.

- $A$ ($N \times N$): symmetric psd, contains information about the similarity between pairs of data points $(y_n, y_m)$. User parameters: number of neighbours $k$, bandwidth $\sigma$, etc.
- $B$ ($N \times N$): symmetric pd (usually diagonal), sets the scale of $X$.

Examples:
- Laplacian eigenmaps: $A$ = graph Laplacian (also: spectral clustering).
- Isomap: $A$ built from shortest-path distances.
- Kernel PCA, multidimensional scaling, LLE, etc.
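
To make the formulation concrete, here is a minimal sketch (not part of the original slides) that solves this problem directly with dense linear algebra; the function name `spectral_embedding_dense` and the NumPy/SciPy setting are my own choices, and this direct route is exactly what becomes infeasible for large N.

```python
import numpy as np
from scipy.linalg import eigh

def spectral_embedding_dense(A, B, d):
    """Solve min_X tr(X A X^T) s.t. X B X^T = I for the d x N embedding X.

    A: N x N symmetric psd matrix; B: N x N symmetric pd matrix.
    eigh solves the generalized problem A u = lambda B u with eigenvalues in
    ascending order and eigenvectors satisfying U^T B U = I.
    """
    evals, U = eigh(A, B)
    # Keep the d eigenvectors with the smallest eigenvalues (for Laplacian
    # eigenmaps one would additionally discard the trivial constant eigenvector).
    X = U[:, :d].T          # d x N embedding, satisfies X B X^T = I
    return X, evals[:d]
```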

Computational solution with large-scale problems

Solution: $X = U^T B^{-1/2}$, where $U = (u_1, \dots, u_d)$ are the $d$ trailing eigenvectors of the $N \times N$ matrix $C = B^{-1/2} A B^{-1/2}$.

With large $N$, solving this eigenproblem is infeasible even if $A$ and $B$ are sparse.

Goal of this work: a fast, approximate solution for the embedding $X$. Applications:
- When $N$ is so large that the direct solution is infeasible.
- To select hyperparameters ($k$, $\sigma$, ...) efficiently even if $N$ is not large, since a grid search over these requires solving the eigenproblem many times.
- As an out-of-sample extension to spectral methods.
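
A minimal sketch of this solution route, assuming a sparse $A$ and a diagonal $B$ (passed as a 1-D array of its diagonal), which is the common case for Laplacian eigenmaps; the function name is hypothetical. Note the eigenproblem is still $N \times N$, which is the bottleneck the rest of the talk addresses.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def spectral_embedding_via_C(A, B_diag, d):
    """Compute X = U^T B^{-1/2}, with U the d trailing eigenvectors of
    C = B^{-1/2} A B^{-1/2}. Assumes A sparse symmetric and B diagonal."""
    Binv_sqrt = sp.diags(1.0 / np.sqrt(B_diag))
    C = Binv_sqrt @ A @ Binv_sqrt
    evals, U = eigsh(C, k=d, which='SA')   # d smallest-eigenvalue eigenvectors
    X = (Binv_sqrt @ U).T                  # d x N; the eigenproblem is still N x N
    return X
```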

Computational solution with large-scale problems (cont.)

The Nyström method is the standard way to approximate large-scale eigenproblems. Essentially, it is an out-of-sample formula:
1. Solve the eigenproblem for a subset of points (landmarks) $\tilde{Y} = (\tilde{y}_1, \dots, \tilde{y}_L)$, where $L \ll N$.
2. Predict $x$ for any other point $y$ through an interpolation formula:

    $x_k = \frac{1}{\lambda_k} \sum_{l=1}^{L} K(y, \tilde{y}_l)\, u_{lk}, \quad k = 1, \dots, d.$

Problems:
- Needs to know the interpolation kernel $K(y, y')$ (sometimes tricky).
- It only uses the information in $A$ about the landmarks, ignoring the non-landmarks. This requires using many landmarks to represent the data manifold well.
- If too few landmarks are used: bad solution for the landmarks $\tilde{X} = (\tilde{x}_1, \dots, \tilde{x}_L)$, and bad prediction for the non-landmarks.
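
For illustration only, a sketch of the Nyström out-of-sample formula above, assuming a Gaussian interpolation kernel; in general the correct kernel depends on the spectral method and may be tricky to derive, which is precisely the first problem listed.

```python
import numpy as np

def nystrom_predict(y, landmarks, U, lam, sigma):
    """Nystrom prediction x for one test point y (sketch; Gaussian kernel assumed).

    landmarks: L x D landmark points; U: L x d eigenvectors of the landmark
    eigenproblem; lam: the corresponding d eigenvalues; sigma: kernel bandwidth.
    """
    K = np.exp(-np.sum((landmarks - y) ** 2, axis=1) / sigma ** 2)  # K(y, y_l), length L
    return (K @ U) / lam   # x_k = (1/lambda_k) * sum_l K(y, y_l) u_{lk}
```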

Locally Linear Landmarks (LLL)

Assume each projection is a locally linear function of the landmark projections:

    $x_n = \sum_{l=1}^{L} z_{ln} \tilde{x}_l, \quad n = 1, \dots, N \qquad \Longleftrightarrow \qquad X = \tilde{X} Z.$

Solving the original $N \times N$ eigenproblem under this constraint results in a reduced eigenproblem of the same form, but of size $L \times L$, on $\tilde{X}$:

    $\min_{\tilde{X}} \operatorname{tr}(\tilde{X} \tilde{A} \tilde{X}^T)$   s.t.   $\tilde{X} \tilde{B} \tilde{X}^T = I$

with reduced affinities $\tilde{A} = Z A Z^T$, $\tilde{B} = Z B Z^T$. After $\tilde{X}$ is found, the non-landmarks are predicted as $X = \tilde{X} Z$ (out-of-sample mapping).

Advantages over Nyström's method:
- The reduced affinities $\tilde{A} = Z A Z^T$ involve the entire dataset and contain much more information about the manifold than the landmark-landmark affinities, so fewer landmarks are needed.
- Solving this smaller eigenproblem is faster.
- The out-of-sample mapping requires less memory and is faster.
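
A minimal sketch of this reduction, given the weight matrix $Z$ (whose construction is described on the next slide); the dense solver and the function name `lll_reduced_embedding` are my assumptions.

```python
import numpy as np
from scipy.linalg import eigh

def lll_reduced_embedding(A, B, Z, d):
    """LLL reduced eigenproblem (sketch). Z: L x N local reconstruction weights;
    A, B: the N x N matrices of the original spectral problem."""
    A_red = Z @ A @ Z.T                  # reduced affinities, L x L
    B_red = Z @ B @ Z.T
    evals, U = eigh(A_red, B_red)        # eigenvalues ascending, U^T B_red U = I
    X_land = U[:, :d].T                  # d x L projections of the landmarks
    X = X_land @ Z                       # out-of-sample mapping X = X_tilde Z, d x N
    return X, X_land
```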

LLL: reduced affinities

Affinities between landmarks:
- Nyström (original affinities): $A$, where $a_{ij}$ reflects only the direct path $i \to j$.
- LLL (reduced affinities): $\tilde{A} = Z A Z^T$, where $\tilde{a}_{ij} = \sum_{n,m=1}^{N} z_{in} a_{nm} z_{jm}$, i.e. paths $i \to n \to m \to j$ over all $n, m$.

So landmarks $i$ and $j$ can be farther apart and still be connected along the manifold.

[Figure: a toy dataset with all points and the landmarks, and the resulting affinity matrices between landmarks for Nyström (original affinities) vs. LLL (reduced affinities).]

LLL: construction of the weight matrix Z

Most embedding methods seek to preserve local neighbourhoods between the high- and low-dimensional spaces. Hence, if we assume that a point may be approximately linearly reconstructed from its nearest landmarks in high-dimensional space:

    $y_n \approx \sum_{l=1}^{L} z_{ln} \tilde{y}_l, \quad n = 1, \dots, N \qquad \Longleftrightarrow \qquad Y \approx \tilde{Y} Z,$

the same will happen in low-dimensional space: $X \approx \tilde{X} Z$. We consider only the $K_Z$ nearest landmarks, with $d + 1 \le K_Z \ll L$. So:
1. Find the $K_Z$ nearest landmarks of each data point.
2. Find their weights as $Z = \arg\min_Z \|Y - \tilde{Y} Z\|^2$ s.t. $Z^T \mathbf{1} = \mathbf{1}$.

These are the same weights used by Locally Linear Embedding (LLE) (Roweis & Saul, 2000). This implies that the out-of-sample mapping (projection for a test point) is globally nonlinear but locally linear: $x = M(y)\, y$, where the $d \times D$ matrix $M(y)$ depends only on the set of nearest landmarks of $y$.
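
Below is a sketch of this weight computation: the standard LLE-style constrained least-squares step, solved per point via the local Gram matrix. The brute-force nearest-landmark search and the small regularization constant are simplifications of my own.

```python
import numpy as np

def lll_weights(Y, Y_land, K_Z):
    """Local reconstruction weights Z (L x N), LLE-style (sketch).

    Y: D x N data; Y_land: D x L landmarks. Each column of Z reconstructs y_n
    from its K_Z nearest landmarks, with weights summing to one."""
    L, N = Y_land.shape[1], Y.shape[1]
    Z = np.zeros((L, N))
    for n in range(N):
        y = Y[:, n]
        # brute-force nearest landmarks (use a k-d tree / ANN search in practice)
        idx = np.argsort(np.sum((Y_land - y[:, None]) ** 2, axis=0))[:K_Z]
        G = Y_land[:, idx] - y[:, None]            # D x K_Z centred landmarks
        C = G.T @ G                                # local Gram matrix
        C += 1e-8 * np.trace(C) * np.eye(K_Z)      # guard against collinear landmarks
        w = np.linalg.solve(C, np.ones(K_Z))
        Z[idx, n] = w / w.sum()                    # enforce the constraint Z^T 1 = 1
    return Z
```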

LLL: computational complexity

We assume the affinity matrix is given. If not, use approximate nearest neighbours to compute it.

Time: the exact runtimes depend on the sparsity structure of the affinity matrix $A$ and the weight matrix $Z$, but in general the time is dominated by:
- LLL: finding the nearest landmarks for each data point;
- Nyström: computing the out-of-sample mapping for each data point;

and this is $\mathcal{O}(NLD)$ in both cases. Note that LLL uses fewer landmarks to achieve the same error.

Memory: LLL and Nyström are both $\mathcal{O}(Ld)$.

LLL: user parameters

- Location of the landmarks: a random subset of the data works well. Refinements such as k-means improve the result a little for small $L$ but add runtime.
- Total number of landmarks $L$: as large as possible. The more landmarks, the better the result.
- Number of neighbouring landmarks $K_Z$ for the projection matrix $Z$: $K_Z \ge d + 1$, where $d$ is the dimension of the latent space. Each point should be a locally linear reconstruction of its $K_Z$ nearest landmarks: $K_Z$ landmarks span a space of dimension $K_Z - 1$, hence $K_Z \ge d + 1$. Using more landmarks protects against occasional collinearities, but decreases locality.

LLL: algorithm

Given the spectral problem $\min_X \operatorname{tr}(X A X^T)$ s.t. $X B X^T = I$ for a dataset $Y$:
1. Choose the number of landmarks $L$, as high as your computer can support, and $K_Z \ge d + 1$.
2. Pick $L$ landmarks $\tilde{y}_1, \dots, \tilde{y}_L$ at random from the dataset.
3. Compute local reconstruction weights $Z_{L \times N}$ for each data point wrt its $K_Z$ nearest landmarks: $Z = \arg\min_Z \|Y - \tilde{Y} Z\|^2$ s.t. $Z^T \mathbf{1} = \mathbf{1}$.
4. Solve the reduced eigenproblem $\min_{\tilde{X}} \operatorname{tr}(\tilde{X} \tilde{A} \tilde{X}^T)$ s.t. $\tilde{X} \tilde{B} \tilde{X}^T = I$, with $\tilde{A} = Z A Z^T$, $\tilde{B} = Z B Z^T$, for the landmark projections $\tilde{X}$.
5. Predict the non-landmarks with the out-of-sample mapping $X = \tilde{X} Z$.
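
Putting the earlier sketches together, an end-to-end driver might look as follows; it reuses the hypothetical `lll_weights` and `lll_reduced_embedding` helpers sketched above, with random landmark selection as in step 2.

```python
import numpy as np

def lll(Y, A, B, d, L, K_Z, seed=0):
    """End-to-end LLL sketch. Y: D x N data; A, B: N x N spectral matrices;
    d: target dimension; L: number of landmarks; K_Z: landmarks per point."""
    rng = np.random.default_rng(seed)
    N = Y.shape[1]
    land_idx = rng.choice(N, size=L, replace=False)     # step 2: random landmarks
    Z = lll_weights(Y, Y[:, land_idx], K_Z)             # step 3: reconstruction weights
    X, X_land = lll_reduced_embedding(A, B, Z, d)       # steps 4-5: reduced problem, X = X_tilde Z
    return X, X_land, land_idx
```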

Experiments: Laplacian eigenmaps

We apply LLL to Laplacian eigenmaps (LE) (Belkin & Niyogi, 2003):
- $A$: graph Laplacian matrix $L = D - W$, for a Gaussian affinity matrix $W = \big(\exp(-\|y_n - y_m\|^2 / \sigma^2)\big)_{nm}$ on a k-nearest-neighbour graph.
- $B$: degree matrix $D = \operatorname{diag}\big(\sum_{m=1}^{N} w_{nm}\big)$.

    $\min_X \operatorname{tr}(X L X^T)$   s.t.   $X D X^T = I$, $X D \mathbf{1} = \mathbf{0}$.

LLL's reduced eigenproblem has $\tilde{A} = Z L Z^T$, $\tilde{B} = Z D Z^T$.

We compare LLL with 3 baselines:
1. Exact LE runs LE on the full dataset: the ground-truth embedding, but the runtime is large.
Landmark LE runs LE only on a set of landmark points. Once their projection is found, the rest of the points are embedded using:
2. LE(Nys.): out-of-sample mapping using Nyström's method.
3. LE(Z): out-of-sample mapping using the reconstruction weights.
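
A sketch of how the LE matrices could be assembled (Gaussian affinities on a symmetrized k-nearest-neighbour graph); using scikit-learn's `kneighbors_graph` here is a convenience of mine, not part of the original setup.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.neighbors import kneighbors_graph

def laplacian_eigenmaps_matrices(Y, k, sigma):
    """Build A = L = D - W and B = D for Laplacian eigenmaps (sketch).

    Y: N x D data; W holds Gaussian affinities exp(-||y_n - y_m||^2 / sigma^2)
    on a symmetrized k-nearest-neighbour graph."""
    W = kneighbors_graph(Y, k, mode='distance', include_self=False)
    W = W.maximum(W.T)                           # symmetrize the k-NN graph
    W.data = np.exp(-(W.data / sigma) ** 2)      # distances -> Gaussian affinities
    deg = np.asarray(W.sum(axis=1)).ravel()
    D = sp.diags(deg)
    return D - W, D                              # graph Laplacian L, degree matrix D
```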

Experiments: effect of the number of landmarks

N = 60000 MNIST digits, projected to $d = 50$, with $K_Z = 50$ nearest landmarks per point. Landmarks are chosen randomly, from $L = 50$ up to $L = N$.

LLL produces an embedding with considerably lower error than Nyström's method for the same number of landmarks $L$.

[Figure: runtime (s) and error wrt Exact LE as a function of the number of landmarks $L$, for Exact LE, LLL, LE(Z) and LE(Nys.).]

Experiments: effect of the number of landmarks (cont.)

Embeddings after 5 s of runtime (Exact LE takes 80 s):

[Figure: embeddings for Exact LE (80 s), LLL (5 s), LE(Z) (5 s) and LE(Nys.) (5 s).]

Experiments: model selection on the Swiss roll dataset

Vary the hyperparameters of Laplacian eigenmaps (affinity bandwidth $\sigma$, number of neighbours $k$ in the k-nearest-neighbour graph) and compute, for each combination, the relative error of the embedding $X$ wrt the ground truth, on N = 4000 points using $L = 300$ landmarks. The matrix $Z$ need only be computed once.

The minima of the model selection error curves for LLL and Exact LE align well.

[Figure: runtime and error vs. bandwidth $\sigma$ (for $k = 150$) and vs. number of neighbours $k$ (for $\sigma = 1.6$), for Exact LE, LLL, LE(Z) and LE(Nys.).]

Experiments: model selection in a classification task

Find hyperparameters that achieve low 1-nn classification error on MNIST: 50000 points for training, 10000 for testing, 10000 held out as out-of-sample. Project to $d = 500$ using LLL ($K_Z = 50$, $L = 1000$).

In runtime, LLL is 15 to 40 times faster than Exact LE. The model selection curves align well, except that eigs in Exact LE fails to converge for small $k$.

[Figure: test error (%) of Exact LE and LLL over a grid of bandwidth $\sigma$ and number of neighbours $k$.]

Experiments: large-scale dataset

N = 1020000 points from infiniteMNIST. $L = 10^4$ random landmarks (1% of the data), $K_Z = 5$.

[Figure: embeddings produced by LLL (18 runtime) and by LE(Z).]

Experiments: large-scale dataset (cont.)

The reason for the improved result with LLL is that it uses better affinities, so the landmarks are better projected.

[Figure: landmark embeddings obtained with the LLL reduced affinities vs. with the original affinities.]

Conclusions

The basic reason why LLL improves over Nyström's method is that, by using the entire dataset, it constructs affinities that better represent the manifold for the same number of landmarks. Hence it requires fewer landmarks, and is faster at training and test time. It applies to any spectral method: there is no need to work out a special kernel as in Nyström's method.

LLL can be used:
- to find a fast, approximate embedding of a large dataset;
- to do fast model selection;
- as an out-of-sample extension to spectral methods.

Matlab code: http://eecs.ucmerced.edu

Partially supported by NSF CAREER award IIS 0754089.