
Nonlinear projections
Université catholique de Louvain (Belgium), Machine Learning Group
http://www.dice.ucl.ac.be/mlg/

Motivation
High-dimensional data are difficult to represent, difficult to understand and difficult to analyze.
Example: an MLP (Multi-Layer Perceptron) or an RBFN (Radial-Basis Function Network) with many inputs suffers from difficult convergence, local minima, etc.
We need to reduce the dimension of the data while keeping its information content!

Motivation: example
Supervised learning with an MLP: instead of feeding all d inputs (x_1, ..., x_d) to the network, a dimension reduction step first maps them to p inputs (x_1, ..., x_p), with p < d, before the network computes the output y.

What we have
High-dimensional numerical data coming from sensors, pictures, biomedical measurements (EEG/ECG), etc.

What we would like to have
A low-dimensional representation of the data, in order to visualize, compress, preprocess, etc.

Why?
Empty space phenomenon: the number of points necessary for learning grows exponentially with the space dimension (curse of dimensionality).
Consequences in high dimension: a "spiky" hypercube, an almost empty inscribed hypersphere, and a narrow spectrum of pairwise distances.
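The last point can be checked numerically. Below is a minimal numpy sketch (not from the slides; the sample size and dimensions are arbitrary choices) showing that the relative spread of pairwise distances shrinks as the dimension grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# As the dimension grows, pairwise Euclidean distances between uniformly
# drawn points concentrate around their mean ("narrow spectrum of distances").
for d in (2, 10, 100, 1000):
    X = rng.uniform(size=(n, d))                   # n points in the unit hypercube
    sq = (X ** 2).sum(axis=1)
    D2 = sq[:, None] + sq[None, :] - 2 * X @ X.T   # squared pairwise distances
    dist = np.sqrt(np.maximum(D2, 0))[np.triu_indices(n, k=1)]
    print(f"d = {d:4d}   mean = {dist.mean():.2f}   std/mean = {dist.std() / dist.mean():.3f}")
```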

How?
Build a (bijective) relation between the data in the original space and the data in the projected space.
If the relation is a bijection, it is possible to switch between representation spaces ("information" rather than "measure").
Problems to consider: noise, twists and folds, and cases where no bijection can be built.

Content
Vector quantization and nonlinear projections
Limitations of linear methods: Principal Component Analysis (PCA), metric Multi-Dimensional Scaling (MDS), limitations
Nonlinear algorithms: variance preservation, distance preservation (like MDS), neighborhood preservation (like SOM), minimal reconstruction error
Comparisons
Conclusions

NLP vs. VQ
Non-Linear Projection: keeps the number of vectors, reduces the space dimension (from d to p).
Vector Quantization: keeps the space dimension, reduces the number of data points (from N to M).
Warning: the "lines and columns" convention adopted here is the one of linear algebra, contrary to some neural-network courses and books.

Principal Component Analysis (PCA)
Goal: project linearly while keeping the variance of the data.
Computation:
1. Covariance matrix of the data: C = E{X_i X_i^T} = (1/N) X X^T
2. Eigenvectors and eigenvalues of C: the eigenvectors V_i are the main directions, the eigenvalues λ_i give the variance along each direction.
3. Projection and reconstruction: Y = V_{1..p}^T X (projection), X̂ = V_{1..p} Y ("unprojection").
PCA is also called the Karhunen-Loève transform.
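A minimal numpy sketch of these three steps, assuming the slide's convention of one observation per column; the explicit centering step is an addition.

```python
import numpy as np

def pca_project(X, p):
    """PCA as on the slide: covariance, eigendecomposition, projection.
    X is d x N (one observation per column); returns the p x N projection Y
    and the reconstruction X_hat."""
    mean = X.mean(axis=1, keepdims=True)
    Xc = X - mean                                  # center the data
    C = Xc @ Xc.T / Xc.shape[1]                    # covariance matrix C = (1/N) X X^T
    eigval, eigvec = np.linalg.eigh(C)
    V = eigvec[:, np.argsort(eigval)[::-1][:p]]    # main directions V_1..V_p
    Y = V.T @ Xc                                   # projection
    X_hat = V @ Y + mean                           # "unprojection" (reconstruction)
    return Y, X_hat
```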

Metric Multi-Dimensional Scaling (MDS)
Goal: project linearly while keeping the N(N-1)/2 pairwise distances.
Computation:
1. Matrix D of the squared distances: D = [d²_{i,j}] = [(X_i - X_j)^T (X_i - X_j)]
2. Eigenvectors and eigenvalues of D after (double) centering, which recovers the inner-product matrix X^T X: the eigenvectors V_i give the coordinates along the main directions, the eigenvalues λ_i the variance along each direction.
3. Projection: Y = sqrt(diag(λ_{1..p})) V_{1..p}^T
The result of PCA equals the result of metric MDS! Only distances are needed, which makes MDS more independent from the data representation.

Limitations of linear projections
Linear projections detect linear dependencies only. What happens with nonlinear dependencies?
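A minimal numpy sketch of classical metric MDS from a matrix of squared distances; returning one row of coordinates per point is a convention chosen here.

```python
import numpy as np

def classical_mds(D2, p):
    """Classical (metric) MDS from the N x N matrix of squared distances D2."""
    N = D2.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N            # centering matrix
    G = -0.5 * J @ D2 @ J                          # double centering: inner products of centered data
    eigval, eigvec = np.linalg.eigh(G)
    order = np.argsort(eigval)[::-1][:p]           # keep the p largest eigenvalues
    scale = np.sqrt(np.maximum(eigval[order], 0))  # sqrt(diag(lambda)), guarding tiny negatives
    return eigvec[:, order] * scale                # N x p coordinates (one row per point)
```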

Nonlinear algorithms, variance preservation: Local PCA, Kernel PCA.

Local PCA (1/2)
Criterion: preserve variance (like PCA), but locally.
Calculation:
1. Vector quantization: the prototypes C_r are representative points of the data X_i.
2. Tessellation: the Voronoi zones are the sets of X_i sharing the same best-matching-unit index r(i).
3. PCA on each zone: the model is locally linear and globally nonlinear.
4. Encoding: X_i (dimension d) is transformed into the pair r(i) and Y_i (dimension p).

Local PCA (2/2)
Example (figure).
Shortcomings: no "continuous" representation; the result is a mosaic of "disconnected" coordinate systems.
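A hedged sketch of this pipeline, with k-means standing in for the generic vector quantization of the slide; the number of prototypes and the target dimension are arbitrary defaults.

```python
import numpy as np
from sklearn.cluster import KMeans

def local_pca(X, n_prototypes=10, p=2):
    """Local PCA sketch: vector quantization (k-means), then an ordinary PCA
    inside each Voronoi zone. X is N x d (one observation per row); returns
    the zone index r(i) and the local p-dimensional coordinates Y_i."""
    km = KMeans(n_clusters=n_prototypes, n_init=10, random_state=0).fit(X)
    r = km.labels_                                        # zone / BMU index r(i)
    Y = np.zeros((X.shape[0], p))
    for zone in range(n_prototypes):
        mask = (r == zone)
        if not mask.any():
            continue
        Xz = X[mask] - km.cluster_centers_[zone]          # center the zone on its prototype
        C = Xz.T @ Xz / Xz.shape[0]                       # local covariance matrix
        eigval, eigvec = np.linalg.eigh(C)
        V = eigvec[:, np.argsort(eigval)[::-1][:p]]       # local main directions
        Y[mask] = Xz @ V                                  # local coordinates
    return r, Y
```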

Kernel PCA (1/3)
Criterion: preserve the variance (like PCA) of nonlinearly transformed data.
How? Transform the data nonlinearly (in fact, transform nonlinearly the MDS inner-product matrix).
The transformation allows giving more weight to small distances; the kernel used is often Gaussian.
Interesting theoretical properties: nonlinear mapping to high-dimensional spaces, Mercer's condition on Gaussian kernels.

Kernel PCA (2/3)
Calculation:
1. Dual problem (cf. PCA <-> MDS): instead of C = X X^T, work with D = X^T X = [X_i^T X_j].
2. Nonlinear transformation of the data: D = [Φ(X_i, X_j)], with Φ such that Φ(u, v) = φ(u) · φ(v) (Mercer condition).
3. Centering of D.
4. Eigenvalues and eigenvectors of D: the eigenvectors V_i give the coordinates along the main directions.
5. Projection: Y = V_{1..p}^T
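A minimal sketch of these five steps with a Gaussian kernel; the kernel width and the scaling of the eigenvectors by sqrt(eigenvalue) are assumed conventions.

```python
import numpy as np

def kernel_pca(X, p=2, sigma=1.0):
    """Kernel PCA sketch with a Gaussian kernel (width sigma is assumed).
    X is N x d (one observation per row); returns N x p coordinates."""
    sq = (X ** 2).sum(axis=1)
    D2 = sq[:, None] + sq[None, :] - 2 * X @ X.T        # squared Euclidean distances
    K = np.exp(-D2 / (2 * sigma ** 2))                  # Gaussian kernel matrix
    N = K.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N
    Kc = J @ K @ J                                      # centering of the kernel matrix
    eigval, eigvec = np.linalg.eigh(Kc)
    order = np.argsort(eigval)[::-1][:p]
    return eigvec[:, order] * np.sqrt(np.maximum(eigval[order], 0))
```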

Kernel PCA (3/3)
Example (figure).
Shortcomings: the eigenvalue spectrum decreases slowly (e.g. 0.138, 0.136, 0.099, ..., 0.029, ...), so dimensionality reduction is not guaranteed.

Nonlinear algorithms, distance preservation: Sammon's NLM, CCA/CDA, Isomap.

Sammon's Non-Linear Mapping (NLM) (1/2)
Criterion to be optimized: distance preservation (cf. metric MDS). Sammon's stress is
E_NLM = (1 / Σ_{i<j} δ_{i,j}) Σ_{i<j} (δ_{i,j} - d_{i,j})² / δ_{i,j}
where δ_{i,j} are the distances in the original space and d_{i,j} the distances in the projection space; the division by δ_{i,j} favours the preservation of small distances first.
Calculation: minimization by gradient descent.

Sammon's Non-Linear Mapping (NLM) (2/2)
Example (figure).
Shortcomings: the global gradient "compacts" the lateral faces; high computational load (preprocess with VQ); Euclidean distance (use the curvilinear distance instead).
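A plain gradient-descent sketch of Sammon's stress; the original algorithm uses a pseudo-Newton step, and the learning rate, iteration count and random initialization used here are assumptions.

```python
import numpy as np

def sammon_map(X, p=2, n_iter=500, lr=0.1, eps=1e-9, seed=0):
    """Minimize Sammon's stress by plain gradient descent. X is N x d."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    sq = (X ** 2).sum(axis=1)
    delta = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0))  # original distances
    c = delta[np.triu_indices(N, 1)].sum()                                   # normalization constant
    Y = rng.normal(scale=1e-2, size=(N, p))                                  # random initial projection
    for _ in range(n_iter):
        diff = Y[:, None, :] - Y[None, :, :]
        d = np.sqrt((diff ** 2).sum(-1))                                     # projected distances
        coef = (d - delta) / (np.maximum(delta, eps) * np.maximum(d, eps))   # per-pair gradient weight
        np.fill_diagonal(coef, 0.0)
        Y -= lr * (2.0 / c) * (coef[:, :, None] * diff).sum(axis=1)          # gradient step on each Y_i
    return Y
```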

Curvilinear Component Analysis (1/2)
Criterion to be optimized: distance preservation, with preservation of small distances first (but "tears" are allowed):
E_CCA = Σ_{r<s} (δ_{r,s} - d_{r,s})² F(d_{r,s})
where F is a decreasing weighting function of the projected distance d_{r,s}.
Calculation:
1. Vector quantization as preprocessing.
2. Minimization by stochastic gradient descent (±).
3. Interpolation.

Curvilinear Component Analysis (2/2)
Example (figure).
Shortcomings: convergence of the gradient descent can leave "torn" faces; Euclidean distance (use the curvilinear distance instead).
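A hedged sketch of the stochastic update commonly attributed to CCA (Demartines and Hérault): one unit r is drawn and every other unit s is pulled so that its projected distance to r approaches δ_{r,s}, weighted by F. The step-function F and the decreasing schedules are assumptions.

```python
import numpy as np

def cca_project(delta, p=2, n_epochs=50, alpha=0.5, lam=None, seed=0):
    """CCA-style stochastic projection. delta: N x N original-space distances."""
    rng = np.random.default_rng(seed)
    N = delta.shape[0]
    lam = delta.max() if lam is None else lam          # initial neighbourhood radius
    Y = rng.normal(size=(N, p))                        # random initial projection
    for epoch in range(n_epochs):
        a = alpha * (1 - epoch / n_epochs)             # decreasing step size
        l = lam * (1 - epoch / n_epochs) + 1e-3        # shrinking neighbourhood radius
        for r in rng.permutation(N):
            diff = Y - Y[r]                            # y_s - y_r for every s
            d = np.linalg.norm(diff, axis=1) + 1e-9    # projected distances d_{r,s}
            F = (d <= l).astype(float)                 # weight only small projected distances
            F[r] = 0.0                                 # do not move the selected unit onto itself
            Y += (a * F * (delta[r] - d) / d)[:, None] * diff
    return Y
```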

NLP: use of the curvilinear distance (1/4)
Principle: the curvilinear (or geodesic) distance is the length of the shortest path from one node to another in a weighted graph built on the data.

NLP: use of the curvilinear distance (2/4)
Useful for NLP: curvilinear distances are easier to preserve than Euclidean ones.
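A minimal sketch of this principle: build a k-nearest-neighbour graph weighted by Euclidean edge lengths and take all-pairs shortest paths. The slides build the graph on VQ prototypes; here it is built on the raw points, and n_neighbors is an assumed parameter.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def geodesic_distances(X, n_neighbors=8):
    """Approximate curvilinear (geodesic) distances on a k-NN graph."""
    G = kneighbors_graph(X, n_neighbors=n_neighbors, mode="distance")  # sparse weighted graph
    D = shortest_path(G, method="D", directed=False)                   # Dijkstra from every node
    return D    # D[i, j] = length of the shortest path between points i and j
```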

NLP: use of the curvilinear distance (3/4)
Integration in the projection algorithms: in
E_NLM = (1 / Σ_{i<j} δ_{i,j}) Σ_{i<j} (δ_{i,j} - d_{i,j})² / δ_{i,j}
and
E_CCA = Σ_{r<s} (δ_{r,s} - d_{r,s})² F(d_{r,s})
the original-space distances δ are computed as curvilinear distances instead of Euclidean ones.

NLP: use of the curvilinear distance (4/4)
Projected open box, Sammon's NLM with the Euclidean distance: faces are "compacted". With the curvilinear distance: "perfect"!

Isomap (1/2)
Published in Science 290 (December 2000): "A global geometric framework for nonlinear dimensionality reduction".
Criterion: preservation of geodesic distances.
Calculation:
1. Choice of some representative points (randomly, without VQ!).
2. Classical MDS, but applied to the matrix of geodesic distances.

Isomap (2/2)
Example (figure).
Shortcomings: no weighting of distances, so faces are heavily "compacted"; no vector quantization.
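A short usage sketch with scikit-learn's off-the-shelf Isomap, which follows the slide's two-step recipe (geodesic distances on a k-NN graph, then classical MDS); the synthetic S-curve data set and n_neighbors=8 are assumptions, and landmark selection is omitted.

```python
from sklearn.datasets import make_s_curve
from sklearn.manifold import Isomap

# Isomap on a synthetic 3-D manifold: geodesic distances on a k-NN graph,
# then classical MDS on that distance matrix.
X, _ = make_s_curve(n_samples=1000, random_state=0)
Y = Isomap(n_neighbors=8, n_components=2).fit_transform(X)
print(Y.shape)   # (1000, 2)
```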

Nonlinear algorithms, neighborhood preservation: SOM, Isotop.

Self-Organizing Map (SOM) (1/2)
Criterion to be optimized: quantization error and neighborhood preservation (there is no unique mathematical formulation of the neighborhood criterion).
Calculation: a pre-established 1D or 2D grid defines a distance d(r, s) between units; learning rule:
r(i) = argmin_r ||X_i - C_r||
ΔC_r = α exp( -d²(r, r(i)) / (2λ²) ) (X_i - C_r)
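A minimal on-line SOM sketch implementing this learning rule; the grid size and the decreasing schedules for α and λ are assumptions.

```python
import numpy as np

def train_som(X, grid=(10, 10), n_epochs=20, alpha=0.5, lam=3.0, seed=0):
    """On-line SOM with the slide's update rule. X is N x d; returns the prototypes C."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    coords = np.array([(i, j) for i in range(rows) for j in range(cols)], dtype=float)  # grid positions
    C = X[rng.choice(len(X), rows * cols, replace=True)].astype(float)                  # initial prototypes
    for epoch in range(n_epochs):
        a = alpha * (1 - epoch / n_epochs)                    # learning-rate schedule
        l = lam * (1 - epoch / n_epochs) + 0.5                # neighbourhood-width schedule
        for i in rng.permutation(len(X)):
            bmu = np.argmin(((X[i] - C) ** 2).sum(axis=1))    # r(i) = argmin_r ||X_i - C_r||
            d2 = ((coords - coords[bmu]) ** 2).sum(axis=1)    # squared grid distances d²(r, r(i))
            h = np.exp(-d2 / (2 * l ** 2))                    # neighbourhood function
            C += a * h[:, None] * (X[i] - C)                  # ΔC_r as on the slide
    return C
```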

Self-Organizing Map (SOM) (2/2)
Example (figure).
Shortcomings: an inadequate grid shape "cracks" the faces; only 1D or 2D grids.

Isotop (1/3)
Inspired by SOM and CCA/CDA.
Criterion: neighborhood preservation (no known mathematical formula).
Calculation in 4 steps:
1. Vector quantization
2. Linking of the prototypes C_r
3. Mapping (between the d-dimensional and p-dimensional spaces)
4. Linear interpolation

Isotop (2/3)
Steps 1 and 2 (vector quantization, linking of all prototypes): no pre-established shape; the links between prototypes in the 3D space define "data-driven neighborhoods".

Isotop (3/3)
Steps 3 and 4 (mapping, interpolation): the mapping is obtained by a 2D VQ (similar to a SOM) of a Gaussian pdf, followed by local linear interpolations.

Nonlinear algorithms, minimal reconstruction error: autoassociative MLP.

Autoassociative MLP (1/2)
Criterion to be minimized: reconstruction error (MSE) after coding and decoding of the data with an autoassociative neural network (MLP).
The autoassociative MLP is trained unsupervised (input = output), i.e. as an auto-encoder; the central coding layer gives the low-dimensional representation.
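A hedged sketch of a 5-layer autoassociative MLP using scikit-learn's MLPRegressor (the layer sizes, activation and training settings are assumptions); the bottleneck codes are recovered with a manual forward pass.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def autoencoder_codes(X, p=2):
    """Train an autoassociative MLP (d-20-p-20-d) with input = output and
    return the activations of the central coding layer. X is N x d."""
    net = MLPRegressor(hidden_layer_sizes=(20, p, 20), activation="tanh",
                       max_iter=5000, random_state=0)
    net.fit(X, X)                                              # unsupervised: reconstruct the input
    # Manual forward pass up to the bottleneck layer of size p.
    h = np.tanh(X @ net.coefs_[0] + net.intercepts_[0])        # first hidden layer
    codes = np.tanh(h @ net.coefs_[1] + net.intercepts_[1])    # central coding layer
    return codes
```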

Autoassociative MLP (2/2)
Example (figure).
Shortcomings: a "non-geometric" method; slow and hazardous convergence (5 layers!).

Comparisons: dataset
Abalone (UCI Machine Learning Repository): 4177 shells, 8 features (+ sex): length, diameter, height, whole weight, shucked weight (weight of the meat), viscera weight, shell weight, and age (number of rings).
VQ with 200 prototypes; reduction from dimension 7 to 2 and visualization of the age (colors).

Comparisons: results (1/4)
Sammon's nonlinear mapping (figure).

Comparisons: results (2/4)
Curvilinear Component Analysis (figure).

Comparisons: results (3/4)
Self-Organizing Map (figure).

Comparisons: results (4/4)
Isotop (figure).

Comparisons: summary
"Rigid" methods: Sammon's mapping (distance preservation, fixed weighting) and the Self-Organizing Map (neighborhood preservation, fixed neighborhood).
"Flexible" methods: Curvilinear Component Analysis (distance preservation, adaptive weighting) and Isotop (neighborhood preservation, adaptive neighborhoods).
Warning: mind the model complexity!

Research directions
NLP methods: neighborhood decrease in CCA/CDA.
Curvilinear (geodesic) distance: study and implementation; integration in SOM, CCA, Sammon's NLM and Isotop.
Non-Euclidean distances: alternative metrics are considered (L_inf, L_1, L_0.5, etc.); integration in the curvilinear distance, VQ and NLP.
Piecewise linear interpolation: study and implementation; integration in Sammon's NLM, CCA and Isotop.
New algorithm: Isotop.

Sources and references
Most of the content of these slides is based on the work of my colleague John Lee.
References:
P. Demartines and J. Hérault. Curvilinear Component Analysis: a self-organizing neural network for nonlinear mapping of data sets. IEEE Transactions on Neural Networks, 8(1):148-154, January 1997.
J. W. Sammon. A nonlinear mapping algorithm for data structure analysis. IEEE Transactions on Computers, C-18(5):401-409, 1969.
J. A. Lee et al. Nonlinear Projection with the Isotop Method. ICANN 2002, International Conference on Artificial Neural Networks, Madrid (Spain), 28-30 August 2002. In: Artificial Neural Networks, J. R. Dorronsoro (ed.), Springer-Verlag, Lecture Notes in Computer Science 2415, 2002, pp. 933-938.
J. A. Lee, A. Lendasse, et al. Curvilinear Distance Analysis versus Isomap. ESANN 2002, European Symposium on Artificial Neural Networks, Bruges (Belgium), April 2002, pp. 185-192.