RDRToolbox
A package for nonlinear dimension reduction with Isomap and LLE

Christoph Bartenhagen
October 30, 2017

Contents
1 Introduction
  1.1 Loading the package
2 Datasets
  2.1 Swiss Roll
  2.2 Simulating microarray gene expression data
3 Dimension Reduction
  3.1 Locally Linear Embedding
  3.2 Isomap
4 Plotting
5 Cluster distances
6 Example

1 Introduction

High-dimensional data, such as microarray gene expression data, can often be reduced to only a few significant features without losing important information. Dimension reduction methods perform a mapping between an original high-dimensional input space and a target space of lower dimensionality. A good dimension reduction technique should preserve most of the significant information and generate data with characteristics similar to those of the high-dimensional original. For example, clusters should also be found within the reduced data, preferably more distinct.

This package provides the Isomap and Locally Linear Embedding algorithms for nonlinear dimension reduction. Nonlinear means in this case that the methods were designed for data lying on or near a nonlinear submanifold in the higher-dimensional input space, and that they perform a nonlinear mapping. Further, both algorithms belong to the so-called feature extraction methods, which, in contrast to feature selection methods, combine the information from all features. These approaches are often best suited for low-dimensional representations of the whole data.

For cluster validation purposes, the package also includes a routine for computing the Davies-Bouldin index.

Further, a plotting tool visualizes two- and three-dimensional data and, where appropriate, its clusters. For testing, the well-known Swiss Roll dataset can be computed, and a data generator simulates microarray gene expression data of a given (high) dimensionality.

1.1 Loading the package

The package can be loaded into R by typing

> library(RDRToolbox)

into the R console. To create three-dimensional plots, the toolbox requires the package rgl. Otherwise, only two-dimensional plots will be available. Furthermore, the package MASS has to be installed for gene expression data simulation (see section 2.2).

2 Datasets

In general, the dimension reduction, cluster validation and plot functions in this toolbox expect the data as an N × D matrix, where N is the number of samples and D the dimension of the input data (number of features). After processing, the data is returned as an N × d matrix, for a given target dimensionality d (d < D). But before describing the dimension reduction methods, this section shortly covers two ways of generating a dataset.

2.1 Swiss Roll

The function SwissRoll computes and plots the three-dimensional Swiss Roll dataset of a given size N and Height. If desired, it uses the package rgl to visualize the Swiss Roll as a rotatable 3D scatterplot. The following example computes and plots a Swiss Roll dataset containing 1000 samples (argument Height set to 30 by default):

> swissData = SwissRoll(N=1000, Plot=TRUE)

See figure 1 below for the plot.

Figure 1: The three dimensional Swiss Roll dataset.
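The returned matrix follows the N × D convention described above; since the Swiss Roll lives in three dimensions, D = 3. The output below is what one would expect (shown as an assumption, not captured output):

> dim(swissData)
[1] 1000    3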

2.2 Simulating microarray gene expression data

The function generateData is a simulator for gene expression data, whose values are normally distributed with zero mean. The covariance structure is given by a configurable block-diagonal matrix. To simulate differential gene expression between the samples, the expression of a given number of features (parameter diffgenes) of some of the samples will be increased by adding a given factor diff (0.6 by default). The parameter diffsamples controls how many samples have higher expression values compared to the rest (by default, the values of half of the total number of samples will be increased). The simulator generates two labelled classes:

label 1: samples with differentially expressed genes.
label -1: samples without differentially expressed genes.

Finally, generateData returns a list containing the data and its class labels. The following example computes a 1000-dimensional dataset (genes=1000) where 10 of 20 samples (diffsamples=10) contain diffgenes=100 differential features:

> sim = generateData(samples=20, genes=1000, diffgenes=100, diffsamples=10)
> simData = sim[[1]]
> simLabels = sim[[2]]

The covariance can be modified using the arguments cov1 and cov2, which set the covariance within and between the blocks of size blocksize × blocksize of the block-diagonal matrix:

> sim = generateData(samples=20, genes=1000, diffgenes=100, cov1=0.2, cov2=0, blocksize=10)
> simData = sim[[1]]
> simLabels = sim[[2]]

3 Dimension Reduction

3.1 Locally Linear Embedding

Locally Linear Embedding (LLE) was introduced in 2000 by Roweis and Saul [1, 2]. It preserves local properties of the data by representing each sample as a linear combination of its k nearest neighbours, with each neighbour weighted independently. LLE finally chooses the low-dimensional representation that best preserves the weights in the target space. The function LLE performs this dimension reduction for a given dimension dim and number of neighbours k. The following examples compute a three- and a two-dimensional LLE embedding of the simulated 1000-dimensional dataset from section 2.2, using k=10 and k=5 neighbours, respectively. First, the three-dimensional embedding:

> simData_dim3_lle = LLE(data=simData, dim=3, k=10)
> head(simData_dim3_lle)
          [,1]       [,2]       [,3]
[1,] 1.5939207  0.7123963  0.2035166
[2,] 0.5784794  1.6184846 -0.3607617
[3,] 0.6741447 -0.5194532  1.2543364
[4,] 0.5045147 -0.6605937 -0.3861041
[5,] 0.3647202 -0.7275116  0.5004380
[6,] 1.2738958  0.1934067 -1.7392043
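To make the neighbourhood-weight idea concrete, here is a minimal sketch of the reconstruction-weight step for a single sample, following the description above. It is illustrative only: the function name lleWeights is hypothetical, and this is not the code used internally by LLE.

# Minimal sketch: reconstruction weights for one sample i from its k nearest
# neighbours (the weights LLE then tries to preserve in the target space).
lleWeights <- function(X, i, k) {
    d2 <- colSums((t(X) - X[i, ])^2)              # squared distances to sample i
    d2[i] <- Inf                                  # exclude the sample itself
    nn <- order(d2)[1:k]                          # its k nearest neighbours
    Z <- sweep(X[nn, , drop=FALSE], 2, X[i, ])    # neighbours centred on x_i
    G <- Z %*% t(Z)                               # local k x k Gram matrix
    G <- G + diag(k) * 1e-3 * sum(diag(G))        # regularise (G may be singular)
    w <- solve(G, rep(1, k))                      # solve G w = 1
    w / sum(w)                                    # weights sum to one
}

Applied to every sample, these weights define a sparse reconstruction matrix; the bottom eigenvectors of the associated cost matrix then yield the low-dimensional coordinates.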

The two-dimensional embedding with k=5 is computed analogously:

> simData_dim2_lle = LLE(data=simData, dim=2, k=5)
> head(simData_dim2_lle)
           [,1]       [,2]
[1,] -0.4297359 -1.0071471
[2,] -1.6467228 -0.9210467
[3,] -0.3568195  1.9721247
[4,] -1.3865762  0.3570472
[5,] -1.0673264  1.2239923
[6,] -1.2717538 -1.0972511

3.2 Isomap

Isomap (IM) is a nonlinear dimension reduction technique presented by Tenenbaum, de Silva and Langford in 2000 [3, 4]. In contrast to LLE, it preserves global properties of the data; that means the geodesic distances between all samples are captured best in the low-dimensional embedding. When calculating the geodesic distances, this implementation uses Floyd's algorithm to compute the shortest paths through the neighbourhood graph. The function Isomap performs this dimension reduction for a given vector of dimensions dims and number of neighbours k. It returns a list of low-dimensional datasets according to the given dimensions. The following example computes a two-dimensional Isomap embedding of the simulated 1000-dimensional dataset from section 2.2, using k=10 neighbours:

> simData_dim2_im = Isomap(data=simData, dims=2, k=10)
> head(simData_dim2_im$dim2)
          [,1]       [,2]
[1,] -30.17137 -44.613450
[2,]  22.55894 -30.423895
[3,] -31.51850  32.039460
[4,] -31.25418  25.213477
[5,] -22.55836   4.800841
[6,] -21.56143  -1.979045

Setting the argument plotResiduals to TRUE, Isomap shows a plot with the residuals between the high- and the low-dimensional data (here, for target dimensions 1-10). It can help estimating the intrinsic dimension of the data:

> simData_dim1to10_im = Isomap(data=simData, dims=1:10, k=10, plotResiduals=TRUE)

[Figure: residual variance (roughly 0.7 down to 0.2) plotted against target dimensions 1-10.]
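As a sketch of the geodesic-distance step mentioned at the start of this section: Euclidean distances are kept only between k-nearest neighbours, and all remaining distances are filled in as shortest paths through this graph. The following minimal implementation is illustrative only; the function name geodesicDistances is hypothetical, and the package's internal code may differ.

# Minimal sketch: build a symmetric k-nearest-neighbour graph, then compute
# all shortest paths with Floyd's algorithm (O(N^3), fine for small N).
geodesicDistances <- function(X, k) {
    N <- nrow(X)
    D <- as.matrix(dist(X))                  # all pairwise Euclidean distances
    G <- matrix(Inf, N, N)                   # Inf = no edge in the graph
    diag(G) <- 0
    for (i in 1:N) {
        nn <- order(D[i, ])[2:(k + 1)]       # k nearest neighbours (skip self)
        G[i, nn] <- D[i, nn]                 # keep these edges...
        G[nn, i] <- D[i, nn]                 # ...and keep the graph symmetric
    }
    for (m in 1:N)                           # Floyd: allow paths through sample m
        G <- pmin(G, outer(G[, m], G[m, ], `+`))
    G
}

Classical multidimensional scaling applied to such a geodesic distance matrix gives the Isomap embedding.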

This implementation further includes a modified version of the original Isomap algorithm, which respects nearest and farthest neighbours. The next call of Isomap varies the example above by setting the argument mod:

> Isomap(data=simData, dims=2, mod=TRUE, k=10)

4 Plotting

The function plotDR creates two- and three-dimensional plots of (labelled) data. It uses the library rgl for rotatable 3D scatterplots. The data points are coloured according to given class labels (at most six classes when using the default colours). A legend will be printed in the R console by default. The parameter legend or the R command legend can be used to add a legend to a two-dimensional plot (a legend for three-dimensional plots is not supported). The first example plots the two-dimensional embedding of the artificial dataset from section 2.2:

> plotDR(data=simData_dim2_lle, labels=simLabels)
  class colour
1    -1  black
2     1    red

[Figure: two-dimensional LLE embedding of the simulated data, classes coloured black and red.]
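For instance, a legend can be added either via the parameter legend (used in section 6) or manually with the base R command legend after plotting; the placement and plotting symbol below are assumptions:

> plotDR(data=simData_dim2_lle, labels=simLabels, legend=TRUE)

or

> plotDR(data=simData_dim2_lle, labels=simLabels)
> legend("topright", legend=c("-1", "1"), col=c("black", "red"), pch=1)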

By specifying the arguments axeslabels and text, labels for the axes and for each data point (sample), respectively, can be added to the plot:

> samples = c(rep("", 10), rep("", 10))   #letters[1:20]
> labels = c("first component", "second component")
> plotDR(data=simData_dim2_lle, labels=simLabels, axeslabels=labels, text=samples)
  class colour
1    -1  black
2     1    red

[Figure: the same embedding with axes labelled "first component" and "second component".]

A plot of a three-dimensional LLE embedding of a 1000-dimensional dataset is given by

> plotDR(data=simData_dim3_lle, labels=simLabels)

5 Cluster distances

The function DBIndex computes the Davies-Bouldin index (DB index) for cluster validation purposes. The index relates the compactness of each cluster to the distance between the clusters: to compute a cluster's compactness, this version uses the Euclidean distance to determine the mean distance between the samples and the cluster centre. The distance of two clusters is given by the distance of their centres. The smaller the index, the better. Values close to 1 or smaller indicate well separated clusters. The following example computes the DB index of a 50-dimensional dataset with 20 samples separated into two classes/clusters:

> d = generateData(samples=20, genes=50, diffgenes=10, blocksize=5)
> DBIndex(data=d[[1]], labels=d[[2]])
[1] 3.64442

As the two-dimensional plots in section 4 anticipated, the low-dimensional LLE dataset has quite well separated clusters. Accordingly, the DB index is low:

> DBIndex(data=simData_dim2_lle, labels=simLabels)
[1] 2.073152
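The computation described above can be written down compactly. The following is a minimal sketch under exactly those definitions; the function name dbIndex is hypothetical, it is not the package's DBIndex implementation, and its results may differ in details:

# Minimal sketch: S[i] is the mean distance of cluster i's samples to its
# centre; the index averages, over all clusters, the worst ratio
# (S[i] + S[j]) / distance(centre i, centre j).
dbIndex <- function(X, labels) {
    cls <- unique(labels)
    centres <- t(sapply(cls, function(cl) colMeans(X[labels == cl, , drop=FALSE])))
    S <- sapply(seq_along(cls), function(i) {
        Xi <- X[labels == cls[i], , drop=FALSE]
        mean(sqrt(rowSums(sweep(Xi, 2, centres[i, ])^2)))   # compactness of cluster i
    })
    M <- as.matrix(dist(centres))            # distances between cluster centres
    R <- outer(S, S, `+`) / M                # pairwise compactness/distance ratios
    diag(R) <- -Inf                          # ignore a cluster paired with itself
    mean(apply(R, 1, max))
}

It could be called, for example, as dbIndex(simData_dim2_lle, simLabels).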

Figure 2: Three dimensional simulated microarray dataset (reduced from formerly 1000 dimensions by LLE).

6 Example

This section demonstrates the dimension reduction workflow for the publicly available Golub et al. leukemia dataset. The data is available as an R package and can be loaded via

> library(golubEsets)
> data(Golub_Merge)

The dataset consists of 72 samples, divided into 47 ALL and 25 AML patients, and 7129 expression values. In this example, we compute a two-dimensional LLE and Isomap embedding and plot the results. At first, we extract the features and class labels:

> golubExprs = t(exprs(Golub_Merge))
> labels = pData(Golub_Merge)$ALL.AML
> dim(golubExprs)
[1]   72 7129
> show(labels)
 [1] ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL
[15] ALL ALL ALL ALL ALL ALL AML AML AML AML AML AML AML AML
[29] AML AML AML AML AML AML ALL ALL ALL ALL ALL ALL ALL ALL
[43] ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL
[57] ALL ALL ALL ALL ALL AML AML AML AML AML AML AML AML AML
[71] AML AML
Levels: ALL AML

The residual variance of Isomap can be used to estimate the intrinsic dimension of the dataset:

> Isomap(data=golubExprs, dims=1:10, plotResiduals=TRUE, k=5)

[Figure: residual variance (roughly 0.6 down to 0.1) plotted against target dimensions 1-10.]

Judging by the dimensions at which the residual variances stop decreasing significantly, we can expect a low intrinsic dimension of two or three and, therefore, a visualization true to the structure of the original data. Next, we compute the LLE and Isomap embeddings for two target dimensions:

> golubIsomap = Isomap(data=golubExprs, dims=2, k=5)
> golubLLE = LLE(data=golubExprs, dim=2, k=5)

The Davies-Bouldin index shows that the ALL and AML patients are well separated into two clusters:

> DBIndex(data=golubIsomap$dim2, labels=labels)
[1] 1.131335
> DBIndex(data=golubLLE, labels=labels)
[1] 1.172901

Finally, we use plotDR to plot the two-dimensional data:

> plotDR(data=golubIsomap$dim2, labels=labels, axeslabels=c("", ""), legend=TRUE)
> title(main="Isomap")
> plotDR(data=golubLLE, labels=labels, axeslabels=c("", ""), legend=TRUE)
> title(main="LLE")

Figure 3: Two dimensional embedding of the Golub et al. leukemia dataset (Left: Isomap; Right: LLE).

Both visualizations, using either Isomap or LLE, show distinct clusters of ALL and AML patients, although the clusters overlap less in the Isomap embedding. This is consistent with the DB index, which is very low for both methods, but slightly higher for LLE. A three-dimensional visualization can be generated in the same manner and is best analyzed interactively within R.

References

[1] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323-2326, 2000.

[2] L. K. Saul and S. T. Roweis. Think globally, fit locally: unsupervised learning of low dimensional manifolds. Journal of Machine Learning Research, 4:119-155, June 2003.

[3] V. de Silva and J. B. Tenenbaum. Global versus local methods in nonlinear dimensionality reduction. In Advances in Neural Information Processing Systems 15, pages 705-712. MIT Press, 2003.

[4] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319-2323, December 2000.