On the Effect of Clustering on Quality Assessment Measures for Dimensionality Reduction


Bassam Mokbel, Andrej Gisbrecht, Barbara Hammer
CITEC Center of Excellence, Bielefeld University, D-33501 Bielefeld
{bmokbel|agisbrec|bhammer}@techfak.uni-bielefeld.de

Abstract

Visualization techniques constitute important interfaces between the rapidly increasing amount of digital information available today and the human user. In this context, numerous dimensionality reduction models have been proposed in the literature, see, e.g., [1, 2]. This fact recently motivated the development of frameworks for a reliable and automatic quality assessment of the produced results [3, 2]. These specialized measures provide a means to objectively assess the overall qualitative change caused by the spatial transformation. The rapidly increasing size of data sets, however, creates the need not only to map given data points to low dimensionality, but to compress the available information beforehand, e.g., by means of clustering techniques. While a standard evaluation measure for clustering is given by the quantization error, its counterpart for dimensionality reduction, the reconstruction error, is usually infeasible, so one has to rely on alternatives. In this paper, we investigate to what extent two existing quality measures for dimensionality reduction are appropriate to evaluate the quality of clustering outputs, such that a single quality measure can be used for the full visualization process in the context of large data sets. We present empirical results from different basic clustering scenarios as a first study of how structural characteristics of the data are reflected in the quality assessment.

1 Introduction

In order to visualize high-dimensional data, numerous dimensionality reduction (DR) techniques are available to map the data points to a low-dimensional space, e.g., the two-dimensional Euclidean plane; see [1, 2] for an overview. As a general setting, original data are given as a set of N vectors x_i ∈ X ⊆ S^n, i ∈ {1, ..., N}. Using DR, each data vector is mapped to a low-dimensional counterpart for visualization, called its target y_k ∈ Y ⊆ R^v, k ∈ {1, ..., N}, where typically n ≫ v and v = 2. While linear DR methods such as principal component analysis (PCA) are well known, many nonlinear DR approaches became popular later on, such as isomap, t-distributed stochastic neighbor embedding (t-SNE), maximum variance unfolding (MVU), Laplacian eigenmaps, etc.; see [2, 1] for an overview. Since spatial distortion is usually unavoidable in the mapping process, the strategies to achieve the most meaningful low-dimensional representation differ strongly among the available methods. Some methods focus on the preservation of distances, while others rely on locally linear relationships, local probability distributions, or local rankings. With an increasing number of such methods, a reliable assessment of the quality of produced visualizations becomes more and more important in order to achieve comparability.

One objective of DR is to preserve the available information as much as possible. In this sense, the reconstruction error

E_reconstr := Σ_i ||x_i − f⁻¹(f(x_i))||²

where f denotes the DR mapping of the data and f⁻¹ its approximate inverse, could serve as a general quality measure. This has the drawback that, for most DR methods, no explicit mapping f is available and an approximate inverse f⁻¹ is also not known.
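As a concrete illustration, here is a minimal Python sketch (assuming only numpy) of the reconstruction error for PCA, one of the few DR methods where the mapping f and an approximate inverse f⁻¹ are both explicit; the nonlinear methods listed above provide no such closed form, which is what makes E_reconstr infeasible in general.

```python
import numpy as np

def pca_reconstruction_error(X, v=2):
    """E_reconstr = sum_i ||x_i - f^{-1}(f(x_i))||^2 for linear PCA,
    where f projects onto the top v principal directions and f^{-1}
    maps back into the original space."""
    mean = X.mean(axis=0)
    Xc = X - mean                                  # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:v].T                                   # n x v projection matrix
    Y = Xc @ W                                     # f: low-dimensional targets
    X_hat = Y @ W.T + mean                         # f^{-1}: approximate inverse
    return np.sum((X - X_hat) ** 2)

rng = np.random.default_rng(0)
print(pca_reconstruction_error(rng.normal(size=(1000, 10)), v=2))
```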

As an alternative, existing quality assessment (QA) measures rely on statistics over input-versus-output discrepancies, which can be evaluated based solely on the given data points and their projections. Different QA approaches have been proposed in recent years, such as mean relative rank errors (MRREs), the local continuity meta-criterion (LCMC), and trustworthiness and continuity (T&C); see [3] for an overview. Recently, two generalized approaches have been introduced that can serve as unifying frameworks, including some earlier QA concepts as special cases:

CRM: The coranking matrix and its derived quality measure, presented in [3].
IR: An information retrieval perspective measuring precision & recall for visualization, see [2].

These frameworks have been evaluated extensively in the context of DR tools, which map given data points to low-dimensional coordinates. The rapidly increasing size of modern data sets, however, creates the need not only to project single points, but also to compress the available information beforehand. Hence, further steps such as clustering become necessary. While the QA approaches can be used to evaluate DR methods, it is not clear to what extent they can also reliably measure the quality of clustering. Conversely, typical QA measures for clustering such as the quantization error cannot be extended to DR methods, since this would lead to the (usually infeasible) reconstruction error. Hence, it is interesting to investigate whether QA approaches for DR methods can be transferred to the clustering domain. This would open the way towards an integrated evaluation of the two steps. In the following, we briefly review QA for DR methods, discuss to what extent they can be applied or adapted to evaluate prototype-based clustering algorithms, and experimentally evaluate their suitability to assess clustering results on two benchmarks.

2 Quality Assessment for Dimensionality Reduction

Coranking matrix. The CRM framework, presented in [3], offers a general approach to QA for DR. Let δ_ij denote the distance of data points x_i, x_j in the input space, and d_ij the corresponding distance of the projections y_i, y_j in the output space. The quantity

ρ_ij := |{t : δ_it < δ_ij}|

denotes the rank of x_j with respect to x_i in the input space; the corresponding rank in the output space is

r_ij := |{t : d_it < d_ij}|.

The coranking matrix itself is defined as

Q = [q_kl] with q_kl = |{(i, j) : ρ_ij = k ∧ r_ij = l}|.

So every element q_kl gives the number of pairs with a rank equal to k in the input space and equal to l in the output space. In the framework as proposed in [3], rank ties are broken deterministically, such that no two equal ranks occur. This has the advantage that several properties of the coranking matrix (such as constant row and column sums) hold; these properties are, however, not necessary for the evaluation measure. For our purposes, it is more suitable to allow equal ranks, e.g., if distances are identical. Reflexive ranks are zero: r_ii = ρ_ii = 0. For a pair (i, j), the event r_ij < ρ_ij, i.e., a positive rank error, is called an intrusion, while the opposite event ρ_ij < r_ij, i.e., a negative rank error, is called an extrusion, since data points intrude into (extrude from) their neighborhood in the projection as compared to the original setting.
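These definitions translate directly into code. The following sketch (assuming numpy and precomputed distance matrices D_in, D_out) counts strictly smaller distances, so identical distances share a rank, as permitted above:

```python
import numpy as np

def rank_matrix(D):
    """rho_ij = |{t : D_it < D_ij}|; reflexive ranks are 0, ties share a rank."""
    N = D.shape[0]
    R = np.zeros((N, N), dtype=int)
    for i in range(N):
        for j in range(N):
            R[i, j] = np.sum(D[i] < D[i, j])
    return R

def coranking_matrix(D_in, D_out):
    """q_kl = |{(i, j), i != j : rho_ij = k and r_ij = l}|."""
    N = D_in.shape[0]
    rho, r = rank_matrix(D_in), rank_matrix(D_out)
    Q = np.zeros((N, N), dtype=int)
    for i in range(N):
        for j in range(N):
            if i != j:
                Q[rho[i, j], r[i, j]] += 1
    return Q
```

Mass below the diagonal of Q (r_ij < ρ_ij) then corresponds to intrusions, mass above it to extrusions.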
Based on the coranking matrix, various quality measures can be computed, which offer different ways to weight intrusions and extrusions. In [3], the overall quality is proposed as a reasonable objective, which takes into account weighted averages of all intrusions and extrusions; see [3] for details.

Information retrieval. In the IR framework, presented in [2], visualization is viewed as an information retrieval task. A data point x_i ∈ X is seen as a query of a user, which has a certain neighborhood A_i in the original data space, called the input neighborhood. The retrieval result for this query is then based solely on the visualization that is presented to the user. There, the neighborhood of the respective target y_i ∈ Y is denoted by B_i; we call it the output neighborhood. If both neighborhoods are defined over corresponding notions of proximity, it becomes possible to evaluate how truthful the query result is with respect to the given query. One can define the neighborhoods by a fixed distance radius α_d, valid in the input as well as in the output space, so that A_i and B_i consist of all data points (other than i itself) which have a distance of at most α_d to x_i and y_i, respectively:

A_i = {j : δ_ij ≤ α_d, j ≠ i},  B_i = {j : d_ij ≤ α_d, j ≠ i}

Analogously, the neighborhoods can be defined by a fixed rank radius α_r, so that the neighborhood sets A_i and B_i contain the α_r nearest neighbors. Note that A_i and B_i usually differ from each other due to the projection of the data to low dimensionality. Following the information retrieval perspective, for a pair (x_i, y_i), points in A_i ∩ B_i are called true positives, points in B_i \ A_i are called false positives, and points in A_i \ B_i are called misses. To evaluate a mapping, the false positives and misses can be counted for every pair, and costs can be assigned. For the pair (x_i, y_i), the numbers of true positives, false positives, and misses are E_TP,i, E_FP,i, and E_MI,i, respectively. Then, the information retrieval measures

precision(i) = E_TP,i / |B_i| and recall(i) = E_TP,i / |A_i|

can be defined. For a whole set X of data points, one can calculate the mean precision and mean recall by averaging over all data points x_i.
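For the rank-radius variant, mean precision and recall take only a few lines. Below is a sketch, again assuming numpy and precomputed distance matrices; alpha_r is the rank radius α_r, and each point is assumed to sort first in its own row of the distance matrix (self-distance zero):

```python
import numpy as np

def mean_precision_recall(D_in, D_out, alpha_r):
    """Mean precision/recall over all queries; neighborhoods are the
    alpha_r nearest neighbors in input and output space."""
    N = D_in.shape[0]
    precision, recall = 0.0, 0.0
    for i in range(N):
        A = set(np.argsort(D_in[i])[1:alpha_r + 1])   # input neighborhood A_i
        B = set(np.argsort(D_out[i])[1:alpha_r + 1])  # output neighborhood B_i
        tp = len(A & B)                               # true positives E_TP,i
        precision += tp / len(B)
        recall += tp / len(A)
    return precision / N, recall / N
```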

3 Quality Assessment for Clustering

Clustering aims at decomposing the given data x_i into homogeneous clusters; see [4]. Prototype-based clustering achieves this goal by specifying M prototypes p_u ∈ P ⊆ S^n, u ∈ {1, ..., M}, which decompose the data by means of their receptive fields R_u, given by the points x_i that are closer to p_u than to any other prototype, breaking ties deterministically. Many prototype-based clustering algorithms exist, including, for example, neural gas (NG), affinity propagation (AP), k-means, and fuzzy extensions thereof. Prototype-based clustering has the benefit that, after clustering, the prototypes can serve as a compression of the original data, and visualization of very large data sets becomes possible by visualizing only the smaller number of representative prototypes. Typically, prototype-based clustering is evaluated by means of the quantization error

E_qe := Σ_u Σ_{x_i ∈ R_u} ||x_i − p_u||²

which evaluates the distances within clusters.

If a large data set has to be visualized, a typical procedure is to first represent the data set by means of representative prototypes and to visualize these prototypes in low dimensions afterwards. In consequence, a formal evaluation of this procedure has to take into account both the clustering step and the dimensionality reduction. To treat the two steps, clustering and visualization, within one common framework, we interpret clustering as a visualization which maps every data point to its closest prototype: x_i ↦ y_i := p_u such that x_i ∈ R_u. In this case, the visualization space R^v coincides with the data space S^n. Obviously, by further projecting the prototypes, a proper visualization could be obtained. The usual error measure for clustering, the quantization error, obviously coincides with the reconstruction error of visualization as introduced above. Hence, since the latter can usually not be evaluated for standard DR methods, the quantization error cannot serve as an evaluation for simultaneous clustering and visualization. As an alternative, one can investigate whether the QA tools for DR introduced above give meaningful results for clustering algorithms. There exist some general properties of these measures which indicate that this leads to reasonable results: for a fixed neighborhood radius α_r, an intrusion occurs only if distances between clusters are smaller than α_r; an extrusion occurs only if the diameter of a cluster is larger than α_r. Hence, the QA measures for DR punish small between-cluster distances and large within-cluster distances. Unlike the global quantization error, they take local relationships into account and are parameterized by the considered neighborhood sizes.
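This interpretation is straightforward to implement; a minimal sketch (assuming numpy, with X holding the data and P the prototypes):

```python
import numpy as np

def cluster_as_visualization(X, P):
    """Map every data point to its closest prototype, x_i -> y_i := p_u
    with x_i in R_u, and return the mapped points plus E_qe."""
    d = np.linalg.norm(X[:, None, :] - P[None, :, :], axis=2)  # N x M distances
    Y = P[d.argmin(axis=1)]                                    # receptive fields
    e_qe = np.sum((X - Y) ** 2)                                # quantization error
    return Y, e_qe
```

The distance matrices of X and of the mapped points Y can then be fed directly into the coranking and precision/recall routines sketched in Section 2, which is precisely the transfer investigated in the experiments below.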
In the following, we experimentally test to what extent the QA measures for DR introduced above lead to reasonable evaluations for typical clustering scenarios.

4 Experiments

We use two artificial 2-dimensional scenarios with randomly generated data points, where the data are either arranged in clusters (10 clouds) or distributed uniformly (random square). We use the batch neural gas (BNG) algorithm [5] for clustering, as a robust classical prototype-based clustering algorithm, with different numbers of prototypes per scenario, covering various resolutions of the data.

10 clouds data. This data set consists of 1000 random data vectors, as shown in Fig. 1. We used 100, 10, and 5 prototypes, which leads to three respective situations: (I) M = 100: each cloud is covered by 10 prototypes, none of them located in between the clouds, so one cluster consists of 10 data points; (II) M = 10: there is one prototype approximately in the center of each cloud, so one cluster consists of 100 data points; (III) M = 5: there are not enough prototypes to cover each cloud separately, so only two of them are situated near cloud centers and three are spread in between the clouds. Cluster sizes then vary between 100 and 300 data points, which on average includes more than one cloud.
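To give a flavor of the setup, the sketch below generates a comparable 10 clouds scenario and clusters it at the three resolutions. The exact data generator is not specified here, and scikit-learn's k-means is used only as a readily available stand-in for batch neural gas:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# 10 Gaussian clouds of 100 points each (illustrative parameters only)
centers = rng.uniform(-8, 8, size=(10, 2))
X = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])

for M in (100, 10, 5):
    km = KMeans(n_clusters=M, n_init=10, random_state=0).fit(X)
    Y = km.cluster_centers_[km.labels_]       # map each point to its prototype
    print(M, round(np.sum((X - Y) ** 2), 1))  # quantization error E_qe
```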

The resulting prototypes are depicted in Fig. 1; the corresponding QA results are shown in Fig. 3(a) to 3(f). The graphs show the different quality values, sampled over neighborhood radii from the smallest to the largest possible. The left column refers to distance-based neighborhoods, the right column to rank-based neighborhoods. In several graphs, especially in Fig. 3(c) and 3(d), the grouped structure of the 10 clouds data is reflected in wave or sawtooth patterns of the QA curves, showing that the total amount of rank or distance errors changes rapidly as the neighborhood range coincides with cluster boundaries. Similarly, in all graphs there is a first peak in quality at the neighborhood radius where each neighborhood approximately contains a single cloud. Within such neighborhoods, rank or distance errors are rare, even under the mapping of points to their closest prototypes. This effect is visible, e.g., in Fig. 3(e) and 3(f). Interestingly, in Fig. 3(e) and 3(f), there is a first peak where both precision and recall are close to 1, corresponding to the perfect cluster structure displayed in the model, while Fig. 3(a) and 3(b) do not reach such values for small neighborhoods, corresponding to the structural mismatch caused by the small number of prototypes. Unlike the IR measures, the CRM measure leads to smaller quality values for smaller numbers of prototypes in all cases. The structural match in the case of 10 and 100 prototypes can be observed in a comparably large increase of the absolute value, but the situation is less clear than for the IR measures. For the IR measures, the main difference between the situations where a structural match can be observed (10 and 100 prototypes, respectively) is the smoothness of the curves, not their absolute value.

Random square data. Data and prototype locations for 10 and 100 prototypes are depicted, respectively, in Fig. 2. For M = 100, each cluster consists of roughly 10 data points, and with M = 10 the sizes were roughly 100. As expected, the QA graphs in Fig. 3(g) and 3(h) rise continuously for the setting M = 100, whereas the curves are less stable but still follow an upward trend for M = 10. This shows how the sparse quantization of the whole uniform data distribution leads to more topological mismatches over various neighborhood sizes.

5 Conclusions

In this contribution, we investigated the suitability of recent QA measures for DR to also evaluate clustering, such that the visualization of large data sets, which commonly requires both clustering and dimensionality reduction, could be evaluated based on one quality criterion only. While a formal transfer of the QA measures to clustering is possible, there exist qualitative differences between the IR and CRM evaluation criteria. It seems that the IR-based evaluation criteria can also detect appropriate cluster structures, corresponding to high precision and recall, while the situation is less pronounced for the CRM measure, where a smooth transition of the values corresponding to the cluster sizes can be observed. One drawback of the IR evaluation is its dependency on the local scale as specified by the radii used for the evaluation. This makes the graphs difficult to interpret due to the resulting sawtooth patterns, even for data which possess only one global scale. If the scaling or density of the data varies locally, the results are probably difficult to interpret. This question is the subject of ongoing work.
References

[1] L.J.P. van der Maaten and G.E. Hinton. Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research, 9:2579-2605, 2008.
[2] Jarkko Venna, Jaakko Peltonen, Kristian Nybo, Helena Aidos, and Samuel Kaski. Information retrieval perspective to nonlinear dimensionality reduction for data visualization. Journal of Machine Learning Research, 11:451-490, 2010.
[3] John A. Lee and Michel Verleysen. Quality assessment of dimensionality reduction: Rank-based criteria. Neurocomputing, 72(7-9):1431-1443, 2009.
[4] Shi Zhong and Joydeep Ghosh. A unified framework for model-based clustering. Journal of Machine Learning Research, 4:1001-1037, 2003.
[5] Marie Cottrell, Barbara Hammer, Alexander Hasenfuß, and Thomas Villmann. Batch and median neural gas. Neural Networks, 19:762-771, 2006.

Figure 1: The 10 clouds dataset and three independent prototype distributions with 100, 10, and 5 prototypes.

Figure 2: The random square dataset with two independent prototype distributions with 10 and 100 prototypes.

Figure 3: QA results from the IR and CRM frameworks for the two artificial clustering scenarios (10 clouds & random square). Panels (a)/(b), (c)/(d), and (e)/(f) show the results for the 10 clouds data clustered with 5, 10, and 100 prototypes, respectively; panels (g)/(h) show the results for the random square data clustered with 10 and 100 prototypes. In the left column, the neighborhoods are defined over distance radii; in the right column, over rank radii.