Spherical Randomized Gravitational Clustering

Similar documents
The Parameter-less Randomized Gravitational Clustering algorithm with online clusters structure characterization

Modelling Data Segmentation for Image Retrieval Systems

Tilings of the Euclidean plane

Chapter 11 Arc Extraction and Segmentation

Supplementary Material

[8] that this cannot happen on the projective plane (cf. also [2]) and the results of Robertson, Seymour, and Thomas [5] on linkless embeddings of gra

the number of states must be set in advance, i.e. the structure of the model is not t to the data, but given a priori the algorithm converges to a loc

Geodesic and parallel models for leaf shape

The Global Standard for Mobility (GSM) (see, e.g., [6], [4], [5]) yields a

A GENTLE INTRODUCTION TO THE BASIC CONCEPTS OF SHAPE SPACE AND SHAPE STATISTICS

REPRESENTATION OF BIG DATA BY DIMENSION REDUCTION

Analysis of Planar Anisotropy of Fibre Systems by Using 2D Fourier Transform

R Package CircSpatial for the Imaging - Kriging - Simulation. of Circular-Spatial Data

An Automatic 3D Face Model Segmentation for Acquiring Weight Motion Area

On Geodesics of 3D Surfaces of Rotations in Euclidean and Minkowskian Spaces

DETC APPROXIMATE MOTION SYNTHESIS OF SPHERICAL KINEMATIC CHAINS

Pre AP Geometry. Mathematics Standards of Learning Curriculum Framework 2009: Pre AP Geometry

Self-Organizing Feature Map. Kazuhiro MINAMIMOTO Kazushi IKEDA Kenji NAKAYAMA

Mineral Exploation Using Neural Netowrks

Novel Intuitionistic Fuzzy C-Means Clustering for Linearly and Nonlinearly Separable Data

Properties of Bertrand curves in dual space

Unsupervised Learning

Algorithm That Mimics Human Perceptual Grouping of Dot Patterns

Hierarchical 3-D von Mises-Fisher Mixture Model

P ^ 2π 3 2π 3. 2π 3 P 2 P 1. a. b. c.

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York

Review 1. Richard Koch. April 23, 2005

Minoru SASAKI and Kenji KITA. Department of Information Science & Intelligent Systems. Faculty of Engineering, Tokushima University

MATH 209, Lab 5. Richard M. Slevinsky

DETERMINATION OF THE OPTIMUM PATH ON THE EARTH S SURFACE. (extended abstract) Abstract

Data Informatics. Seon Ho Kim, Ph.D.

Rowena Cole and Luigi Barone. Department of Computer Science, The University of Western Australia, Western Australia, 6907

Texture Image Segmentation using FCM

Stochastic Models for Gravity Flow: Numerical Considerations

Globally Stabilized 3L Curve Fitting

CFD modelling of thickened tailings Final project report

Color-Based Classification of Natural Rock Images Using Classifier Combinations

A B. A: sigmoid B: EBA (x0=0.03) C: EBA (x0=0.05) U

IMPLEMENTATION OF RBF TYPE NETWORKS BY SIGMOIDAL FEEDFORWARD NEURAL NETWORKS

Rule extraction from support vector machines

Tensor Based Approaches for LVA Field Inference

Data Mining 4. Cluster Analysis

TENTH WORLD CONGRESS ON THE THEORY OF MACHINES AND MECHANISMS Oulu, Finland, June 20{24, 1999 THE EFFECT OF DATA-SET CARDINALITY ON THE DESIGN AND STR

Enhancing Clustering Results In Hierarchical Approach By Mvs Measures

Motivation. Technical Background

6-DOF Model Based Tracking via Object Coordinate Regression Supplemental Note

Tutorial 3. Jun Xu, Teaching Asistant csjunxu/ February 16, COMP4134 Biometrics Authentication

Keyword Extraction by KNN considering Similarity among Features

Two-Dimensional Visualization for Internet Resource Discovery. Shih-Hao Li and Peter B. Danzig. University of Southern California

Rearrangement of DNA fragments: a branch-and-cut algorithm Abstract. In this paper we consider a problem that arises in the process of reconstruction

Comparison of Reconstruction Methods for Computed Tomography with Industrial Robots using Automatic Object Position Recognition

Images Reconstruction using an iterative SOM based algorithm.

Parameterization. Michael S. Floater. November 10, 2011

Ray Trace Notes. Charles Rino. September 2010

Design of Tactile Fixtures for Robotics and Manufacturing. Walter Willem Nederbragt and Bahram Ravani. University of California, Davis.

Introduction to Computer Science

Copyright 2009 Pearson Education, Inc. Chapter 9 Section 7 - Slide 1 AND

PPGEE Robot Dynamics I

Gaussian elds: a new criterion for 3Drigid registration

m Environment Output Activation 0.8 Output Activation Input Value

A Weighted Majority Voting based on Normalized Mutual Information for Cluster Analysis

AN IMPROVED HYBRIDIZED K- MEANS CLUSTERING ALGORITHM (IHKMCA) FOR HIGHDIMENSIONAL DATASET & IT S PERFORMANCE ANALYSIS

APPLICATION OF ALGORITHMS FOR AUTOMATIC GENERATION OF HEXAHEDRAL FINITE ELEMENT MESHES

DEPARTMENT - Mathematics. Coding: N Number. A Algebra. G&M Geometry and Measure. S Statistics. P - Probability. R&P Ratio and Proportion

Low-stress data embedding in the hyperbolic plane using multidimensional scaling

WETTING PROPERTIES OF STRUCTURED INTERFACES COMPOSED OF SURFACE-ATTACHED SPHERICAL NANOPARTICLES

PROJECTION MODELING SIMPLIFICATION MARKER EXTRACTION DECISION. Image #k Partition #k

Cluster Analysis. Ying Shen, SSE, Tongji University

Fuzzy Voronoi Diagram

MODELLING AND MOTION ANALYSIS OF FIVE-BAR 5R MECHANISM

Goals: Course Unit: Describing Moving Objects Different Ways of Representing Functions Vector-valued Functions, or Parametric Curves

Texture Segmentation by Windowed Projection

k-means Clustering Todd W. Neller Gettysburg College Laura E. Brown Michigan Technological University

Fast 3D Mean Shift Filter for CT Images

Improved Performance of Unsupervised Method by Renovated K-Means

Using Classical Mechanism Concepts to Motivate Modern Mechanism Analysis and Synthesis Methods

SOMSN: An Effective Self Organizing Map for Clustering of Social Networks

[Gidhane* et al., 5(7): July, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116

Numerical Treatment of Geodesic Differential. Equations on a Surface in

Proceedings of the 6th Int. Conf. on Computer Analysis of Images and Patterns. Direct Obstacle Detection and Motion. from Spatio-Temporal Derivatives

Subpixel Corner Detection Using Spatial Moment 1)

Visualizing the Life and Anatomy of Dark Matter

Online Calibration of Path Loss Exponent in Wireless Sensor Networks

2. Background. 2.1 Clustering

A Summary of Projective Geometry

Approximate Smoothing Spline Methods for Large Data Sets in the Binary Case Dong Xiang, SAS Institute Inc. Grace Wahba, University of Wisconsin at Mad

A Singular Example for the Averaged Mean Curvature Flow

CSE 5243 INTRO. TO DATA MINING

Validation of Heat Conduction 2D Analytical Model in Spherical Geometries using infrared Thermography.*

STATISTICS AND ANALYSIS OF SHAPE

Geodetic Iterative Methods for Nonlinear Acoustic Sources Localization: Application to Cracks Eects Detected with Acoustic Emission

In this lecture we introduce the Gauss-Bonnet theorem. The required section is The optional sections are

Geometric and Solid Modeling. Problems

Survey of the Mathematics of Big Data

Issues with Curve Detection Grouping (e.g., the Canny hysteresis thresholding procedure) Model tting They can be performed sequentially or simultaneou

Based on Raymond J. Mooney s slides

A Geometric Approach to the Bisection Method

USING STATISTICAL METHODS TO CORRECT LIDAR INTENSITIES FROM GEOLOGICAL OUTCROPS INTRODUCTION

International Journal of Advanced Research in Computer Science and Software Engineering

IMAGE ANALYSIS DEDICATED TO POLYMER INJECTION MOLDING

Transcription:

Spherical Randomized Gravitational Clustering Jonatan Gomez and Elizabeth Leon 2 ALIFE Research Group, Computer Systems, Universidad Nacional de Colombia jgomezpe@unal.edu.co 2 MIDAS Research Group, Computer Systems, Universidad Nacional de Colombia eleonguz@unal.edu.co Abstract. Circular data, i.e., data in the form of 'natural' directions or angles are very common in a number of dierent areas such as biological, meteorological, geological, and political sciences. Clustering circular data is not an easy task due to the circular geometry of the data space. Some clustering approaches, such as the spherical k-means, use the cosine distance instead of the euclidean distance in order to measure the dierence between points. In this paper, we propose a variation of the randomized gravitational clustering algorithm in order to deal with circular data. Basically, we use the cosine distance, we modify the gravitational law in order to use the cosine distance and we use geodesics ('straight' lines in curved spaces) in order to move points according to the gravitational dynamic. Our initial experiments indicate that the spherical gravitational clustering algorithm is able to nd clusters in noisy circular data. Introduction Circular data, i.e., data in the form of 'natural' directions or angles, are observations taken on compact Riemannian manifolds (curved spaces following a Riemannian geometry) []. Circular data are seen in dierent scientic areas: wave directions in oceanography, directions of animal movement in biology, wind directions in meteorology, rock fracture orientations in geology, periodic time in economy and conformational angles obtained from the 3D coordinates of the backbone chain of a protein in bio-informatics [,2,3]. Modeling circular data (in particular using machine learning or statistical approaches) is not an easy task due to the curved geometry of the data space (Riemannian geometry). In particular, some approaches in the directional statistics eld (the eld of statistics that deals with circular data) consider the circular data as data drawn from a set of distributions of von Mises-Fisher (vmf) [4]. Some clustering approaches (approaches that try to nd groups of similar points) in the data mining and machine learning elds replace the euclidean distance with the cosine distance (angular distance between points) in order to measure the dierence between points in the Riemannian space. Such is the case of the spherical k-means [5]. Gomez et al. in [6,7] proposed a clustering algorithm (the randomized gravitational clustering algorithm, Rgc), which is robust to noise and unsupervised in the number of cluster, based on concepts of eld theory in physics. Basically,

the gravitational dynamics is determined by moving points (in an euclidean space) according to the gravitational eld generated by other point randomly selected. In this paper, we propose a variation of such randomized gravitational clustering algorithm in order to deal with circular data. In this way, we replace the euclidean distance with the cosine distance, we modify the gravitational law in order to use the cosine distance and we use geodesics, the 'straight' lines in curved spaces, in order to move points according to the gravitational dynamic. This paper is divided in 5 Sections. Section 2 sumarizes the Rgc algorithm proposed by Gomez et al in [6,7]. Section 3 explains the changes introduced to the Rgc algorithm in order to deal with circular datasets using a cosine distance. Section 4 analyzes preliminar results obtained with Rgc. Finally, Section 5 outlines some conclusions. 2 Randomized Gravitational Clustering (Rgc) Gravitational clustering (Gc) algorithms are considered agglomerative hierarchical algorithms based on concepts of eld theory in physics [8,9]. The Gc algorithm simulates the gravitational system obtained of considering data points as initial particles with mass equal one in a space exposed to gravitational elds. Gomez et al. in [6,7] proposed a Gc algorithm which is robust to noise and unsupervised in the number of clusters (see Algorithm ). Basically, points do not change in mass and are not removed during the simulation of the system dynamic. Two points are merged into virtual clusters using a union-disjoint set structure [] (lines -2, 7 and, i.e., functions Make, Find and Union), when they are close enough (line 7). Gomez et al. estimated the greatest minimal separation ( ˆd) between N uniformly separated points 3, (here N is the size of the data set), using the 2-dimensional hexagonal packing of circles approach [2] and used it as re-normalization factor to reduce the eect of the data set size in the system dynamic. Moreover, instead of considering all points to move a point, just another point is randomly selected and both points are moved (line 6) according to an oversimplied Universal Gravitational ( F x,y = G d x,y 3 d d ) x,y 3 and Second Newton's Motion Laws (y t+ = y t + G d x,y d d ), here d x,y x,y is the vector ('straight' line between points y and x). These equations are the vectorial representation of the Newton Gravitational and Second Laws[3,4]. The big crunch eect (one single big cluster at the end) is eliminated by introducing a cooling mechanism similar to the one used in simulated annealing (line 3 The problem of determining the optimal arrangement of points in such a way that the greatest minimal separation between points is obtained, is an open problem in Geometry [].

Algorithm Randomized Gravitational Clustering Rgc( x, G, (G), M, ε) x: Data set, G: Gravity strength, (G): Cooling factor, M: Iterations, ε: Fusion distance. for i= to N do //Creates the Union-Disjoint cluster set 2. Make(i) //Each data points is initially a cluster 3. for i= to M do 4. for j= to N do 5. k = random point index such that k j 6. Move( x j, x k ) //Move both points using gravity motion equation. 7. if d xj,x k ε then Union( j, k) //Merges (virtually) to clusters 8. G = (- (G))*G //Reduces the gravity strength 9. for i= to N do //Canonical Union-Disjoint clusters set. Find(i). return disjoint-sets 8). When the simulation is terminated, clusters are extracted if they have a minimum number of points (α). Finally, Gomez et al. determine an appropriated value of G by using an extended bisection search algorithm []: the number of clustered points q M (points that were assigned to some cluster with two or more points), after some checking iterations of the Rgc algorithm M, is used as indicator of the quality G, by comparing it against an expected value Q ± τ. 3 Spherical Randomized Gravitational Clustering (Sgc) Two elements should be considered to apply the Rgc algorithm to circular data: (i) the system dynamic (gravitational eld denition and movement of points in hyper-sphere surface) and (ii) the greatest minimal separation between 'uniformly' separated points in the surface of the hyper-sphere. 3. Spherical Gravity Law (Gravity Law in the n sphere surface) Although classic Newton Gravitational Law is dened for Euclidean spaces (spaces described by Euclidean geometry), it has been generalized to Curved spaces (spaces described by Riemannian geometry) such as the surface of an hyper sphere (in short n sphere surface) [3,4]. In such curved spaces (the n sphere surface), a 'straight' line becomes an arc (segment of circle) and is called geodesic. Since there is a one to one correspondence between the chord (the Euclidean straight line between points in a n sphere surface) and the geodesic between them, and any Curved space behaves locally as an Euclidean space (it is a manifold), it is possible to approximate the motion of a point due to Gravitational Law and Newton motion law by using a cosine distance, the Euclidean straight line vector and projecting the obtained point to the n sphere surface (renormalizing). In this way, the moving function of a point y due to the gravitational eld of other point x (line 6 in Algorithm ), is computed using Equation.

( y t+ = normalize yt + G d x y cd x,y ) 3 () where, cd x,y is the cosine distance between points x and y, x y is the Euclidean straight line between points x and y, and normalize ( z ) = Z z ( z is the Euclidean norm of vector z ). 3.2 Greatest Minimal Separation between Points ( ˆd) We analize the behavior of ˆd in the 2 sphere surface (circle in two dimensions) to nd a better estimation of ˆd when dealing with a n sphere surface. Notice that, the maximum distance between closest points is the cosine distance dened by the angle 2π N. Similar behavior is observed when working in a 3 sphere surface, where, it is close to 2π N. In this way, a rough approximation 4 of the angle dened by two closest points (of a data set of N points) in a n sphere surface is provided 2π by n. Therefore, an estimation of ˆd in circular data is given by Equation 2. N ( ) 2π d = k cd (2) n N Here N is the number of data points in the n-sphere surface, cd is the cosine distance and k is a correction factor due to high dimensionality of the data set (in our case we set it to 2). 4 Experiments We use the Sgc algorithm for nding clusters in dierent real and synthetic circular data sets. Due to the lack of space, we show (as proof of concept) the results obtained by Sgc on two 3D synthetic data sets with six clusters, each cluster following a vmf distribution with dierent concentration parameter and number of samples. The rst data set is free of noise N while the second one contains 2% of noise. In this paper, we xed Q = 2, M = and τ = 2Q for estimating the value of G, since these values are good approximations to the ones proposed by Gomez et al in [7] and reduce the time complexity of this estimation algorithm to lineal O (N) respect to the number of data points. Figure shows the evolution of the Srg on the clean synthetic data set after 25, 5 and iterations. When noisy points are not part of the data set, while points are moved to their clusters centers (lower row Figure ), clusters are formed quickly and points are not asigned to incorrect clusters (upper row Figure ). The algorithm stops at iteration 9 when no more clusters can be formed due to the cooling factor. Clearly, the Sgc algorithm is able to nd cluster in clean circular data sets. 4 By using the manifold property of the hyper-sphere surface, i.e. considering local vecinities of the hyper-sphere surface as hyper-cubes.

.8.6.4.2.2.4.6.8.8.6.4.2.2.4.6.8.8.6.4.2.2.4.6.8.5.5.5.5.5.5.5.5.5.5.5.5.8.6.4.2.2.4.6.8.6.4.2.2.4.6.8.6.4.2.2.4.6.5.5.5.5.5.5.5.5.5.5.5.5 Fig.. Evolution of the Sgc algorithm on the clean 3D circular data set: Set of clusters (upper row) obtained by Sgc and position of the moving points (lower row) after 25 iterations (left), after 5 iterations (middle), and after iterations (right). Figure 2 shows the evolution of the Srg on the noisy synthetic data set after 25, 5 and iterations. When noisy points are part of the data set, while noisy points are maintained at their original positions, 'good' points are moved to their clusters centers (lower row Figure 2), clusters are formed quickly (without including noisy points) and points are not asigned to incorrect clusters (upper row Figure 2). The algorithm stops at iteration 27 when no more clusters can be formed due to the cooling factor, so noisy points can be removed since they form cluster with size equal one. Clearly, the Sgc algorithm is able to nd cluster in noisy circular data sets. 5 Conclusions Mining circular data (data in the form of 'natural' directions or angles) is a challenging task due to the curve geometry of the data space. However, it is possible to accomplish this task by considering the clustering process as the result of a dynamic system, in particular, the result of a gravitational dynamic system. In this direction, we were able to generalize the Newton gravitational law and Newton Second Motion law to curved space (like the surface of an hypersphere) in order to use the cosine distance instead of the euclidean distance for simulating such dynamic system in a curved space. Our results indicate that the Spherical Gravitational Clustering algorithm is able to nd clusters in curved spaces and in the presence of noise. Our future work will concentrate in using the Sgc on analysis of protein structure and in higher dimensional spaces.

.8.6.4.2.2.4.6.8.5.5.5.5.5.5.5.5.5.5.5.5.5.5.5.5.5.5.5.5.5.5.5.5.5.5.5.5.5.5.5.5.5.5 Fig. 2. Evolution of the Sgc algorithm on the noisy 3D circular data set: Set of clusters (upper row) obtained by Sgc and position of the moving points (lower row) after 25 iterations (left), after 5 iterations (middle), and after iterations (right). References. F. Wang, Space and Space-Time Modeling of Directional Data. PhD thesis, Duke University, 23. 2. K. Mardia, C. Taylor, and G. Subramaniam, Protein bioinformatics and mixtures of bivariate von mises distributions for angular data, Biometrics, vol. 63, no. 2, pp. 5552, 27. 3. I. S. Dhillon and S. Sra, Modeling data using directional distributions, tech. rep., The University of Texas at Austin, 23. 4. K. Mardia and P. Jupp, Directional Statistics. Jhon Wiley and Sons, 2. 5. K. Hornik, I. Feinerer, M. Kober, and C. Buchta, Spherical k-means clustering, Journal of Statistical Software, vol. 5, pp. 22, 9 22. 6. J. Gomez, D. Dasgupta, and O. Nasraoui, A new gravitational clustering algorithm, in Proceedings of the Third SIAM International Conference on Data Mining 23, pp. 8394, 23. 7. J. Gomez, O. Nasraoui, and E. Leon, Rain: Data clustering using randomized interactions between data points, in Proceedings of the Third International Conference on Machine Learning and Applications (ICMLA 24), pp. 25255, 24. 8. W. E. Wright, Gravitational clustering, Pattern Recognition, no. 9, pp. 566, 977. 9. S. Kundu, Gravitational clustering: a new approach based on the spatial distribution of the points, Pattern Recognition, no. 32, pp. 496, 999.. T. Cormer, C. Leiserson, and R. Rivest, Introduction to Algorithms. McGraw Hill.. H. T. Croft, K. J. Falconer, and R. K. Guy, Unsolved Problems in Geometry. New York: Springer-Verlag, 99. 2. H. Steinhaus, Mathematical Snapshots, 3rd ed. New York: Dover, 999. 3. M. C. Hazewinkel, Theory of Gravitation. Springer, 2. 4. T. Padmanabhan, Gravitation: foundations and Frontier. Cambridge University Press, 2.