Fundamental Journal of Mathematics and Mathematical Sciences
Vol. 4, Issue 1, 2015, Pages 23-34
Published online November 29, 2015; available at http://www.frdint.com/
© 2015 Fundamental Research and Development International

REPRESENTATION OF BIG DATA BY DIMENSION REDUCTION

A. G. RAMM* and CONG TUAN SON VAN
Mathematics Department, Kansas State University, Manhattan, KS 66506, USA
e-mail: ramm@math.ksu.edu, congvan@math.ksu.edu
* Corresponding author
Received August 21, 2015; Accepted October 30, 2015

Abstract. Suppose the data consist of a set S of points x_j, 1 ≤ j ≤ J, distributed in a bounded domain D ⊂ R^N, where N and J are large numbers. In this paper, an algorithm is proposed for checking whether there exists a manifold M of low dimension near which many of the points of S lie, and for finding such an M if it exists. There are many dimension reduction algorithms, both linear and non-linear. Our algorithm is simple to implement and has some advantages compared with the known algorithms. If there is a manifold of low dimension near which most of the data points lie, the proposed algorithm will find it. Some numerical results are presented illustrating the algorithm and analyzing its performance compared to the classical PCA (principal component analysis) and to Isomap.

Keywords and phrases: dimension reduction, data representation.
2010 Mathematics Subject Classification: 68P10, 68P30, 94A15.

1. Introduction

There is a large literature on dimension reduction. Without trying to describe the published results, we refer the reader to the reviews [3] and [7] and the references therein. Our algorithm is simple, easy to implement, and has some advantages over the known algorithms. We compare its performance with that of the PCA and Isomap algorithms. The main point of this short paper is a new algorithm for computing a manifold M of low dimension in a neighborhood of which most of the data points lie, or for finding out that there is no such manifold. In Section 2, we describe the details of the proposed algorithm. In Section 3, we analyze the performance of the algorithm and compare it with the performance of the PCA and Isomap algorithms. In Section 4, we show some numerical results.

2. Description of the Algorithm

Let S be the set of the data points x_j, 1 ≤ j ≤ J, x_j ∈ D ⊂ R^N, where D is the unit cube and J and N are very large. Divide the unit cube D into a grid with step size r, 0 < r ≤ 1. Let a = 1/r be the number of intervals on each side of the unit cube D. Let V be the upper limit of the total volume of the small cubes C_m, let L be the lower limit of the number of the data points near the manifold M, and let p be the lower limit of the number of the data points in a small cube C_m with side r. By a manifold in this paper, a union of piecewise-smooth manifolds is meant. By a smooth manifold is meant a smooth lower-dimensional domain in R^N, for example, a line in R^3 or a plane in R^3. The one-dimensional manifold that we construct is a union of piecewise-linear curves; a linear curve is a line.

Let us describe the algorithm, which is a development of an idea in [8]. The steps of this algorithm are:

1. Take a = a_1, with a_1 = 2.

2. Take M = a^N, where M is the number of the small cubes with side r. Denote these small cubes by C_m, 1 ≤ m ≤ M. Scan the data set S by moving the small cube with step size r and calculating the number of points in this cube for each position of the cube in D. Each of the points of S lies in some cube C_m. Let µ_m be the number of the points of S in C_m. Neglect the small cubes with µ_m < p, where p is the chosen lower limit of the number of the data points in a small cube C_m. One chooses p << |S|, for example p = 0.005|S|, where |S| is the number of the data points in S. Let V_t be the total volume of the cubes that we keep.

3. If V_t = 0, we conclude that no manifold is found.

4. If V_t > V, set a = a_2 = 2 a_1. If a_2 > |S|^{1/N}, we conclude that no manifold is found. Otherwise, repeat steps 2 and 3.

5. If V_t < V, denote by C_k, 1 ≤ k ≤ K << M, the small cubes that we keep (those with µ_k ≥ p). Denote the centers of the C_k by c_k and let C := ∪_{k=1}^{K} C_k.

6. If the total number of the points in C is less than L, where L is the chosen lower limit of the number of the data points near the manifold M, then we conclude that no manifold is found.

7. Otherwise, we have a set of small cubes C_k, 1 ≤ k ≤ K << M, with side r, whose total volume is less than V and which contains at least L of the data points.

Given the set of small cubes C_k, 1 ≤ k ≤ K, one can build sets L_s of dimension s << N in a neighborhood of which a maximal amount of the points x_j ∈ S lie. For example, to build L_1, one can start with the point y_1 := c_1, where C_1 is the small cube closest to the origin, and join y_1 with y_2 := c_2, where C_2 is the small cube closest to C_1. Continuing joining the K small cubes, one gets a one-dimensional piecewise-linear manifold L_1. To build L_2, one considers the triangles T_k with vertices y_k, y_{k+1}, y_{k+2}, 1 ≤ k ≤ K - 2. The union of the T_k forms a two-dimensional manifold L_2. Similarly, one can build an s-dimensional manifold L_s from the set of the C_k.
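A minimal computational sketch of the procedure just described is given below. It is an illustration only: the function names find_kept_cubes and build_L1, the use of NumPy, and the default parameter values are our assumptions, not part of the paper. The sketch counts the points of S in each small cube, discards the cubes with fewer than p points, doubles a until the kept volume does not exceed V (or the bound a ≤ |S|^{1/N} is violated), and then chains the centers c_k of the kept cubes into a piecewise-linear curve L_1 by repeatedly joining the nearest remaining center.

```python
# Sketch of the proposed algorithm, assuming the data are scaled to the unit
# cube [0, 1]^N.  Function names and the use of NumPy are illustrative
# assumptions, not the authors' implementation.
import numpy as np

def find_kept_cubes(S, V=0.4, L_frac=0.9, p_frac=0.005):
    """Return (a, centers of kept cubes) or None if no manifold is found."""
    J, N = S.shape
    p = max(1, int(np.ceil(p_frac * J)))            # lower limit of points per cube
    L = L_frac * J                                  # lower limit of points near M
    a = 2                                           # step 1: a_1 = 2
    while a <= J ** (1.0 / N):                      # step 4: give up once a > |S|^(1/N)
        r = 1.0 / a
        # step 2: count the points of S in each small cube C_m
        idx = np.minimum((S // r).astype(int), a - 1)       # grid index of each point
        cells, counts = np.unique(idx, axis=0, return_counts=True)
        kept = cells[counts >= p]                           # neglect cubes with mu_m < p
        V_t = kept.shape[0] * r ** N                        # total volume of kept cubes
        if V_t == 0:
            return None                                     # step 3: no manifold found
        if V_t > V:
            a *= 2                                          # step 4: refine the grid
            continue
        # steps 5-6: keep the cubes and check that they contain at least L points
        if counts[counts >= p].sum() < L:
            return None
        centers = (kept + 0.5) * r                          # centers c_k of the kept cubes
        return a, centers
    return None

def build_L1(centers):
    """Greedily chain the cube centers into a piecewise-linear curve L_1."""
    remaining = list(range(len(centers)))
    current = min(remaining, key=lambda k: np.linalg.norm(centers[k]))  # cube closest to the origin
    path = [current]
    remaining.remove(current)
    while remaining:
        # join the current center with the nearest remaining center
        nxt = min(remaining, key=lambda k: np.linalg.norm(centers[k] - centers[current]))
        path.append(nxt)
        remaining.remove(nxt)
        current = nxt
    return centers[path]        # ordered vertices of the piecewise-linear manifold L_1
```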

While building the manifolds L_s, one can construct several such manifolds, because the cube C_i closest to C_{i-1} may be non-unique. However, each of the constructed manifolds contains the small cubes C_k, which together contain at least L data points. For example, for L = 0.9|S| each of these manifolds contains at least 0.9|S| data points. Of course, there could be one manifold containing 0.99|S| points and another one containing 0.9|S| points. If the goal is to include as many data points as possible, the experimenter can increase L, for example, to 0.95|S|.

Choice of V, L, and p

The idea of the algorithm is to find a region that contains most of the data points (at least L) and has small volume (at most V) by neglecting the small cubes C_m that contain fewer than p data points. Depending on how sparse the data set is, one can choose an appropriate L. If the data set has a small number of outliers (an outlier is an observation point that is distant from the other observations), one can choose L = 0.9|S|, as in the first numerical experiment in Section 4. If the data set is sparse, one can choose L = 0.8|S|, as in the second numerical experiment in Section 4. In general, when the experimenter does not know how sparse the data are, one can try L equal to 0.95|S|, 0.9|S| or 0.8|S|.

One should not choose V too big, for example 0.5, because it does not make sense to have a manifold with big volume. One should not choose V too small, because then either one cannot find a manifold or the number of kept small cubes C_k is comparable to the number of data points. So, the authors suggest taking V between 0.3 and 0.4 of the volume of the original unit cube in which the data points lie.

The value p is used to decide whether one should neglect a small cube C_m, so it is recommended to choose p << |S|. Numerical experiments show that p = 0.005|S| works well. The experimenter can also try smaller values of p, but then more computation time is needed.
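As a usage illustration of these recommendations, the hypothetical driver below runs the sketch functions find_kept_cubes and build_L1 from Section 2 on a synthetic noisy curve. The data set, the random seed, and V = 0.4 are our own choices; p = 0.005|S| and L = 0.9|S| follow the text.

```python
# Hypothetical driver for the sketch above; the synthetic data set is ours and
# the parameter values follow the recommendations of this section.
import numpy as np

rng = np.random.default_rng(0)
t = rng.uniform(0.0, 1.0, 500)
# noisy one-dimensional curve inside the unit square
S = np.column_stack([t, 0.5 + 0.3 * np.sin(2 * np.pi * t)])
S += rng.normal(scale=0.01, size=S.shape)
S = np.clip(S, 0.0, 1.0)

result = find_kept_cubes(S, V=0.4, L_frac=0.9, p_frac=0.005)
if result is None:
    print("no low-dimensional manifold found")
else:
    a, centers = result
    L1 = build_L1(centers)
    print(f"a = {a}, kept cubes = {len(centers)}, L1 vertices = {len(L1)}")
```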

3. Performance Analysis

In the algorithm one doubles a each time, and 2 ≤ a ≤ |S|^{1/N}. So the main step runs at most log_2(|S|^{1/N}) = (1/N) log_2|S| times. For each a, one calculates the number of points of the data set S in each of the M_a small cubes. The total computational time is therefore (1/N)|S| M_a log|S|. In N-dimensional space, calculating the distance between two points takes N operations. Thus, the asymptotic speed of the algorithm is O((1/N) N |S| M_a log|S|) = O(|S| M_a log|S|). Since M_a ≤ |S|, in the worst case the asymptotic speed is O(|S|^2 log|S|).

If one compares this algorithm with the principal component analysis (PCA) algorithm, which has asymptotic speed O(|S|^3), one sees that our algorithm is much faster than the PCA algorithm. PCA theory is presented in [5]. According to paper [1], the fastest algorithm for finding a manifold of lower dimension containing many points of S, called Isomap, has asymptotic speed O(s l |S| log|S|), where s is the assumed dimension of the manifold and l is the number of landmarks. However, in the Isomap algorithm one has to use an a priori assumed dimension of the manifold, which is not known in advance, while our algorithm finds both the manifold and its dimension. Also, in the Isomap algorithm one has to specify the landmarks, which can be arbitrarily located in the domain D. Our algorithm is simpler to implement than the Isomap algorithm.

For large |S|, one can make our algorithm faster by lowering the upper limit for a: for example, instead of 1 ≤ a ≤ |S|^{1/N}, one can require 1 ≤ a ≤ |S|^{1/(2N)}.
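The operation count above can be collected into a single display. This is only our reading of the estimate, under the assumption (as in the text) that each pass at grid size a performs |S| M_a point-in-cube tests, each costing N operations in R^N:

\[
T \;\le\; \underbrace{\tfrac{1}{N}\log_2|S|}_{\text{grid refinements}}
\;\cdot\; \underbrace{|S|\,M_a}_{\text{tests per refinement}}
\;\cdot\; \underbrace{N}_{\text{operations per test}}
\;=\; O\!\bigl(|S|\,M_a\log|S|\bigr)
\;\le\; O\!\bigl(|S|^{2}\log|S|\bigr),
\]

since M_a ≤ |S| for every grid that is actually scanned.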

4. Numerical Results

In the following numerical experiments, we use data sets from papers [2], [4] and [6]. We apply the proposed algorithm to each data set and show the results in the figures. The first figure for each data set shows the original data points (blue) and the found small cubes (red); the second figure shows the found manifold.

1. Consider the data set from paper [2]. This is a 2-dimensional set containing |S| = 308 points in a 2-dimensional Cartesian coordinate system. Below, when p = 0.005|S| is not an integer, we take the smallest integer larger than p.

Figure 1. A data set from paper [2] with |S| = 308 and p = 0.005|S| = 1.56.

Figure 2. A data set from paper [2] with |S| = 308 and p = 0.005|S| = 1.56.

Run the algorithm with V = 0.5 (i.e., the final set of cubes C_k has to have total volume less than 0.5), L = 0.9|S| (the manifold has to contain at least 90% of the initial data points), and p = 0.005|S| = 1.56 (we take p = 2 and neglect the small cubes that contain only 1 point).

One finds the manifold L_1 when a = 16, r = 1/16; the number of kept small cubes C_k is K = 89, which have total volume 0.35 and contain 0.92|S| of the data points.

2. Consider another data set from paper [2]. This set has |S| = 296 data points.

Figure 3. Another data set from paper [2] with |S| = 296 and p = 0.005|S| = 1.5.

Figure 4. Another data set from paper [2] with |S| = 296 and p = 0.005|S| = 1.5.

Run the algorithm with V = 0.5 (i.e., the final set of cubes C_k has to have total volume less than 0.5), L = 0.8|S| (since some part of the data is uniformly distributed), and p = 0.005|S| = 1.5 (we take p = 2 and neglect the small cubes that contain only 1 point).

One finds the manifold L_1 when a = 16, r = 1/16; the number of kept small cubes C_k is K = 76, which have total volume 0.3 and contain 0.83|S| of the data points.

3. Use the same data set from paper [2] as in the second experiment. In this experiment, p = 4 was used. This is to show that one can try values of p other than the suggested p = 0.005|S| and may find a manifold with a smaller number of small cubes.

Figure 5. Another data set from paper [2] with |S| = 296 and p = 4.

Figure 6. Another data set from paper [2] with |S| = 296 and p = 4.

Choose p = 4, V = 0.5 and L = 0.8|S|. One finds the manifold L_1 when a = 8, r = 1/8; the number of kept small cubes C_k is K = 30 (less than the 76 cubes in experiment 2), which have total volume 0.47 and contain 0.89|S| of the data points. The conclusion is: one can try different values of p and can obtain different manifolds.

4. Consider a data set from paper [4] (Figure 7).

Figure 7. A data set from paper [4] with |S| = 235.

Run the algorithm with V = 0.5, L = 0.8|S|, and let p run from 1 to 10. For no p can one find a set of cubes C_k which has total volume at most V and contains at least 0.8|S| of the data points. This data set does not have a manifold of lower dimension in a neighborhood of which most of the data points lie.
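For comparison with this negative outcome, the illustrative snippet below (again using the hypothetical find_kept_cubes sketch, with synthetic data of our own rather than the data set of paper [4]) runs the search on points spread uniformly over the unit square; one expects it to report that no low-dimensional manifold is found.

```python
# Illustrative only: for uniformly distributed points the kept cubes either
# cover too much volume or contain too few points, so the search should fail,
# mirroring the outcome of the fourth experiment.
import numpy as np

rng = np.random.default_rng(1)
S_uniform = rng.uniform(0.0, 1.0, size=(500, 2))

result = find_kept_cubes(S_uniform, V=0.4, L_frac=0.8, p_frac=0.005)
print("manifold found" if result is not None else "no low-dimensional manifold found")
```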

5. Consider a data set from paper [6]. This set has |S| = 373 data points.

Figure 8. A data set from paper [6] with |S| = 373 and p = 4.

Figure 9. A data set from paper [6] with |S| = 373 and p = 4.

Run the algorithm with V = 0.5 (i.e., the final set of cubes C_k has to have total volume less than 0.5), L = 0.8|S| (since some part of the data is uniformly distributed), and p = 4 (we neglect the small cubes that contain at most 3 points). One finds the manifold L_1 when a = 8, r = 1/8; the number of kept small cubes C_k is K = 26.

6. Use the same data set from paper [6] as in the fifth experiment. In this experiment, p = 2 was used. This is to show that one can try values of p other than the suggested p = 4 and may find a different manifold.

Figure 10. A data set from paper [6] with |S| = 373 and p = 2.

Figure 11. A data set from paper [6] with |S| = 373 and p = 2.

Choose p = 2, V = 0.5 and L = 0.8|S|. One finds the manifold L_1 when a = 16, r = 1/16; the number of kept small cubes C_k is K = 72. The conclusion is: one can try different values of p and can obtain different manifolds.

5. Conclusion

We conclude that our proposed algorithm can compute a manifold M of low dimension in a neighborhood of which most of the data points lie, or find out that there is no such manifold. Our algorithm is simple, easy to implement, and has some advantages over the known algorithms (PCA and Isomap).

References

[1] L. Cayton, Algorithms for manifold learning, 2005.
[2] H. Chang and D. Yeung, Robust path-based spectral clustering, Pattern Recognition 41(1) (2008), 191-203.
[3] I. Fodor, A survey of dimension reduction techniques, 2002, www.researchgate.net/publication/2860580.
[4] L. Fu and E. Medico, FLAME, a novel fuzzy clustering method for the analysis of DNA microarray data, BMC Bioinformatics 8(1) (2007), 3.
[5] I. Jolliffe, Principal Component Analysis, Springer, New York, 1986.
[6] A. Jain and M. Law, Data clustering: a user's dilemma, Lecture Notes in Computer Science 3776 (2005), 1-10.
[7] L. van der Maaten, E. Postma and J. van den Herik, Dimensionality reduction: a comparative review, 2009, http://www.uvt.nl/ticc.
[8] A. G. Ramm, 2009, arXiv:0902.4389.