Self-Organizing Maps and Symbolic Data

Similar documents
A Self Organizing Map for dissimilarity data

Clustering Functional Data with the SOM algorithm

LS-SVM Functional Network for Time Series Prediction

Speeding Up the Dissimilarity Self-Organizing Maps by Branch and Bound

A comparison between dissimilarity SOM and kernel SOM for clustering the vertices of a graph

Neural networks for variable star classification

Self-Organizing Maps for Analysis of Expandable Polystyrene Batch Process

9.1. K-means Clustering

A Topography-Preserving Latent Variable Model with Learning Metrics

Comparison of supervised self-organizing maps using Euclidian or Mahalanobis distance in classification context

Unsupervised Learning

Chapter 4: Clustering

HOT asax: A Novel Adaptive Symbolic Representation for Time Series Discords Discovery

A Graph Theoretic Approach to Image Database Retrieval

Machine Learning : Clustering, Self-Organizing Maps

Automatic Group-Outlier Detection

Validation for Data Classification

Graph projection techniques for Self-Organizing Maps

Career-Path Analysis Using Optimal Matching and Self-Organizing Maps

Self-Organizing Maps for clustering and visualization of bipartite graphs

Improving A Trajectory Index For Topology Conserving Mapping

Figure (5) Kohonen Self-Organized Map

Visualizing Changes in Data Collections Using Growing Self-Organizing Maps *

application of learning vector quantization algorithms. In Proceedings of the International Joint Conference on

Mixed Topological Map

Kernel Methods and Visualization for Interval Data Mining

6. Concluding Remarks

RPKM: The Rough Possibilistic K-Modes

Toward a robust 2D spatio-temporal self-organization

Olmo S. Zavala Romero. Clustering Hierarchical Distance Group Dist. K-means. Center of Atmospheric Sciences, UNAM.

4. Cluster Analysis. Francesc J. Ferri. Dept. d'Informàtica, Universitat de València.

COMBINED METHOD TO VISUALISE AND REDUCE DIMENSIONALITY OF THE FINANCIAL DATA SETS

SOM Quality Measures: An Efficient Statistical Approach

Function approximation using RBF network. 10 basis functions and 25 data points.

Supervised Variable Clustering for Classification of NIR Spectra

Feature Extraction in Wireless Personal and Local Area Networks

A Comparative Study of Conventional and Neural Network Classification of Multispectral Data

Johns Hopkins Math Tournament Proof Round: Point Set Topology

A SOM-view of oilfield data: A novel vector field visualization for Self-Organizing Maps and its applications in the petroleum industry

The Knowledge Mining Suite (KMS)

An Incremental Hierarchical Clustering

Two-step Modified SOM for Parallel Calculation

SOM+EOF for Finding Missing Values

Clustering of Self-Organizing Map

Images Reconstruction using an iterative SOM based algorithm.

Modeling the Real World for Data Mining: Granular Computing Approach

K-Means Clustering With Initial Centroids Based On Difference Operator

Fuzzy sensor for the perception of colour

ECG782: Multidimensional Digital Signal Processing

A NEW ALGORITHM FOR OPTIMIZING THE SELF- ORGANIZING MAP

Selecting Models from Videos for Appearance-Based Face Recognition

2 Renata M.C.R. de Souza et al cluster with its own representation. The advantage of these adaptive distances is that the clustering algorithm is able

DISCRETIZATION BASED ON CLUSTERING METHODS. Daniela Joiţa Titu Maiorescu University, Bucharest, Romania

Enhancing K-means Clustering Algorithm with Improved Initial Center

Exploratory Data Analysis using Self-Organizing Maps. Madhumanti Ray

AN IMPROVED MULTI-SOM ALGORITHM

The Application of K-medoids and PAM to the Clustering of Rules

Image Smoothing and Segmentation by Graph Regularization

Mining Customers Spatio-temporal Behavior Data using Topographic Unsupervised Learning

Spatial Outlier Detection

A 3-D RECOGNITION AND POSITIONING ALGORITHM USING GEOMETRICAL MATCHING BETWEEN PRIMITIVE SURFACES

Signatures of Combinatorial Maps

Smoothing Dissimilarities for Cluster Analysis: Binary Data and Functional Data

UNSUPERVISED STATIC DISCRETIZATION METHODS IN DATA MINING. Daniela Joiţa Titu Maiorescu University, Bucharest, Romania

Clustering and Visualisation of Data

BRACE: A Paradigm For the Discretization of Continuously Valued Data

Songklanakarin Journal of Science and Technology SJST R1 Ghareeb SPATIAL OBJECT MODELING IN SOFT TOPOLOGY

Representation of 2D objects with a topology preserving network

Chapter 7: Competitive learning, clustering, and self-organizing maps

Image Classification Using Wavelet Coefficients in Low-pass Bands

Dimension reduction : PCA and Clustering

Extract an Essential Skeleton of a Character as a Graph from a Character Image

Unsupervised Recursive Sequence Processing

Cluster Analysis on Statistical Data using Agglomerative Method

S-Class, A Divisive Clustering Method, and Possible Dual Alternatives

Fuzzy-Kernel Learning Vector Quantization

What to come. There will be a few more topics we will cover on supervised learning

Topology Proceedings. COPYRIGHT c by Topology Proceedings. All rights reserved.

Proximal Manifold Learning via Descriptive Neighbourhood Selection

Cluster Analysis and Visualization. Workshop on Statistics and Machine Learning 2004/2/6

Contextual priming for artificial visual perception

I. INTRODUCTION II. RELATED WORK.

A fuzzy k-modes algorithm for clustering categorical data. Citation IEEE Transactions on Fuzzy Systems, 1999, v. 7 n. 4, p.

Line Simplification Using Self-Organizing Maps

Content-Based Image Retrieval of Web Surface Defects with PicSOM

Neural Networks. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani

Today. Gradient descent for minimization of functions of real variables. Multi-dimensional scaling. Self-organizing maps

An Unsupervised Technique for Statistical Data Analysis Using Data Mining

Fast Efficient Clustering Algorithm for Balanced Data

Artificial Neural Networks Unsupervised learning: SOM

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

SOMSN: An Effective Self Organizing Map for Clustering of Social Networks

Some Other Applications of the SOM algorithm : how to use the Kohonen algorithm for forecasting

Reducing topological defects in self-organizing maps using multiple scale neighborhood functions

Image Segmentation Based on Watershed and Edge Detection Techniques

Advanced visualization techniques for Self-Organizing Maps with graph-based methods

Associative Cellular Learning Automata and its Applications

Seismic regionalization based on an artificial neural network

Self-Organized Similarity based Kernel Fuzzy Clustering Model and Its Applications

Modification of the Growing Neural Gas Algorithm for Cluster Analysis

Transcription:

Self-Organizing Maps and Symbolic Data

Aïcha El Golli, Brieuc Conan-Guez, Fabrice Rossi
AxIS project, National Research Institute in Computer Science and Control (INRIA), Rocquencourt Research Unit, Domaine de Voluceau, Rocquencourt, BP 105, 78153 Le Chesnay Cedex, France
{aicha.elgolli, brieuc.conan-guez, fabrice.rossi}@inria.fr

arXiv:0709.3587v1 [cs.NE] 22 Sep 2007
Submitted: February 2004; Accepted: September 2004
The Electronic Journal of Symbolic Data Analysis, Vol. 2, N. 1 (2004)

Abstract

In data analysis, new forms of complex data have to be considered, such as symbolic data, functional data, web data, trees, SQL queries and multimedia data. In this context, classical data analysis for knowledge discovery based on calculating the center of gravity cannot be used, because the inputs are not vectors of R^p. In this paper, we present an application of the self-organizing map to real-world symbolic data. To this end, we propose an extension of the self-organizing map that can handle symbolic data.

Keywords: Classification, self-organizing map, symbolic data, dissimilarity

1 Introduction

The self-organizing map (SOM) introduced by Kohonen [6] is an unsupervised neural network method which has both clustering and visualization properties. It can be considered as an algorithm that maps a high-dimensional data space, R^p, to a lower-dimensional space, generally of dimension 2, called a map. This projection enables the input data to be partitioned into similar clusters while preserving their topology. Its most similar predecessors are the k-means algorithm [7] and the dynamic clustering method [3], which operate like a SOM but without topology preservation, and therefore without easy visualization.

In data analysis, new forms of complex data have to be considered, most notably symbolic data (data with an internal structure such as interval data, distributions, functional data, etc.) and semi-structured data (trees, XML documents, SQL queries, etc.). In this context, classical data analysis based on calculating the center of gravity cannot be used, because the inputs are not vectors of R^p. Several methods can be considered to solve this problem, depending on the type of data (for example, projection operators for functional data [8]). However, those methods are not fully general, and every data analysis algorithm has to be adapted to the resulting data. Kohonen's SOM is based on the center-of-gravity notion, and unfortunately this concept is not applicable to many kinds of complex data. In this paper, we propose an adaptation of the SOM to dissimilarity data as an alternative solution.

Our goal is to modify the SOM algorithm so that it can work on dissimilarity measures rather than on raw data. To this end, we take our inspiration from the work of Kohonen [5]. To apply the method, only the definition of a dissimilarity for each type of data is necessary, so complex data can be processed.

2 Batch self-organizing map for dissimilarity data

The SOM can be considered as carrying out vector quantization and/or clustering while preserving the spatial ordering of the prototype vectors (also called referent vectors) in a one- or two-dimensional output space. The SOM consists of neurons organized on a regular low-dimensional map. More formally, the map is described by a graph (C, Γ), where C is a set of m interconnected neurons having a discrete topology defined by Γ. For each pair of neurons (c, r) on the map, the distance δ(c, r) is defined as the length of the shortest path between c and r on the graph. This distance imposes a neighborhood relation between the neurons.

The batch training algorithm is an iterative algorithm in which the whole data set (denoted Ω) is presented to the map before any adjustment is made. We denote by z_i an element of Ω and by z_i its representation in the space D, called the representation space of Ω. In our case, the main difference with the classical batch algorithm is that the representation space is not R^p but an arbitrary set on which a dissimilarity (denoted d) is defined. Each neuron c is represented by a set A_c = {z_1, ..., z_q} of elements of Ω with a fixed cardinality q. A_c is called an individual referent. We denote by A the set of all individual referents, i.e. the list A = (A_1, ..., A_m). In our approach, each neuron thus has a finite number of representatives. We define a new adequacy function d^T from Ω × P(Ω) to R^+ by:

d^T(z_i, A_c) = \sum_{r \in C} K^T(\delta(c, r)) \sum_{z_j \in A_r} d^2(z_i, z_j)

d^T is based on a positive kernel function K. K^T(δ(c, r)) is the neighborhood kernel around the neuron r. This function is such that lim_{δ→∞} K(δ) = 0 and allows us to transform the sharp graph distance δ(c, r) between two neurons on the map into a smooth distance. K is used to define a family of functions K^T parameterized by T, with K^T(δ) = K(δ/T). T is used to control the size of the neighborhood [1]: when the parameter T is small, there are few neurons in the neighborhood. A simple example of K^T is K^T(δ) = e^{-δ²/T²}.

During learning, we minimize a cost function E by alternating an assignment step and a representation step. During the assignment step, the assignment function f assigns each individual z_i to the nearest neuron, here in terms of the function d^T:

f(z_i) = \arg\min_{c \in C} d^T(z_i, A_c)

In case of a tie, the individual z_i is assigned to the neuron with the smallest label.
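To make the adequacy function and the assignment step concrete, here is a minimal sketch in Python (not the authors' implementation; the Gaussian kernel choice, the variable names and the use of a precomputed dissimilarity matrix D are assumptions made for illustration):

    import numpy as np

    def neighborhood_kernel(delta, T):
        # Gaussian example K^T(delta) = exp(-delta^2 / T^2); T controls the
        # neighborhood size (a small T means few neurons in the neighborhood).
        return np.exp(-(delta ** 2) / (T ** 2))

    def adequacy(i, c, D, referents, grid_dist, T):
        # d^T(z_i, A_c) = sum_{r in C} K^T(delta(c, r)) * sum_{z_j in A_r} d^2(z_i, z_j)
        # D[i, j]        : dissimilarity d(z_i, z_j) between elements of Omega
        # referents[r]   : indices of the elements forming the individual referent A_r
        # grid_dist[c, r]: graph distance delta(c, r) between neurons on the map
        total = 0.0
        for r in range(grid_dist.shape[0]):
            k = neighborhood_kernel(grid_dist[c, r], T)
            total += k * sum(D[i, j] ** 2 for j in referents[r])
        return total

    def assign(i, D, referents, grid_dist, T):
        # Assignment step: f(z_i) = argmin_c d^T(z_i, A_c); in case of a tie,
        # np.argmin returns the first (i.e. smallest-label) neuron.
        values = [adequacy(i, c, D, referents, grid_dist, T)
                  for c in range(grid_dist.shape[0])]
        return int(np.argmin(values))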

During the representation step, we have to find the new individual referents A* that best represent the set of observations in terms of the following cost function E:

E(f, A) = \sum_{z_i \in \Omega} d^T(z_i, A_{f(z_i)}) = \sum_{z_i \in \Omega} \sum_{r \in C} K^T(\delta(f(z_i), r)) \sum_{z_j \in A_r} d^2(z_i, z_j)

This function measures the adequacy between the partition induced by the assignment function and the map referents A. The criterion E is additive, so this optimization step can be carried out independently for each neuron. Indeed, we minimize the following m functions (a sketch of this per-neuron minimization is given just before Section 3.1):

E_r = \sum_{z_i \in \Omega} K^T(\delta(f(z_i), r)) \sum_{z_j \in A_r} d^2(z_i, z_j)

In the classical batch version, this minimization of E is immediate, because the positions of the referent vectors are simply the averages of the data samples weighted by the kernel function.

3 Experiments

To evaluate our method, we consider real-world interval data. Our adaptation of the SOM to dissimilarity data applies directly to this kind of interval-structured data, once a dissimilarity has been associated with it. The application concerns the monthly minimal and maximal temperatures observed at 265 meteorological stations in China. A natural representation of the monthly temperature recorded by a station is the interval formed by the mean of the daily minimal and the mean of the daily maximal temperatures observed at this station over a month. Table 1 depicts the temperatures recorded by the 265 stations over a 10-year period (between 1979 and 1988): each interval is given by the mean of the minimal and the mean of the maximal monthly temperatures over these 10 years.

Station    January          February         November         December
Abag Qi    [-24.9; -17]     [-22.3; -12.8]   [-16.4; -6.2]    [-24.7; -14.8]
Hailaer    [-28.6; -22.5]   [-25.5; -19.7]   [-17.4; -9.3]    [-25.5; -20.0]

Table 1: Temperatures of the 265 Chinese stations between 1979 and 1988 (two stations shown).

We will now describe the parameters used for this application (dissimilarity, map dimensions, number of iterations, etc.), the choice of which is important for the algorithm, and then describe the results obtained. We use factorial dissimilarity analysis (see [9] and [4] for more details) to visualize the maps.
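Returning to the representation step: the per-neuron criterion E_r is a sum over the members of A_r, so each candidate element of Ω can be scored independently and the q best candidates kept. The sketch below illustrates this under the same assumed conventions as the previous sketch; assign_of is the array holding the current assignments f(z_i):

    import numpy as np

    def representation_step(D, assign_of, grid_dist, T, q=1):
        # For each neuron r, choose the q elements of Omega minimizing
        #   E_r = sum_i K^T(delta(f(z_i), r)) * sum_{z_j in A_r} d^2(z_i, z_j).
        # E_r is additive over the members of A_r, so it suffices to score each
        # candidate z_j by sum_i K^T(delta(f(z_i), r)) * d^2(z_i, z_j)
        # and keep the q smallest scores.
        new_referents = []
        for r in range(grid_dist.shape[0]):
            w = np.exp(-(grid_dist[assign_of, r] ** 2) / (T ** 2))  # K^T(delta(f(z_i), r))
            scores = (w[:, None] * (D ** 2)).sum(axis=0)            # one score per candidate
            new_referents.append(list(np.argsort(scores)[:q]))
        return new_referents

The batch algorithm then simply alternates the assignment and representation steps, typically while decreasing T, until the referents no longer change or a fixed number of iterations is reached.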

3.1 Hausdorff distance

First, we choose to work with the Hausdorff-type L2 distance on interval data, defined as follows:

d(Q, Q') = \sqrt{ \sum_{j=1}^{p} \left( \max\{ |a_j - a'_j|, |b_j - b'_j| \} \right)^2 }

where Q = (I_1, ..., I_p) and Q' = (I'_1, ..., I'_p) are a pair of items described by p intervals, with I_j = [a_j, b_j] and I'_j = [a'_j, b'_j]. It combines the p one-dimensional, coordinate-wise Hausdorff distances in a way which is similar to the definition of the Euclidean distance in R^p.

The map consists of m = 30 neurons (a 10 × 3 grid). We use the elements of Ω in a random order to initialize the map and to choose the initial individual referents A^0. The cardinality q of the individual referents is fixed to 1. Figure 1 shows the initial map on the factorial dissimilarity analysis planes.

Figure 1: Initial map and the data on the factorial dissimilarity analysis planes.

Figure 2 shows the projection of the final map onto the training data in the factorial planes.

Figure 2: Final map and the data on the factorial dissimilarity analysis planes. Each color represents a cluster.

The details of the result, shown in Figure 3, provide a clear picture of all the stations distributed over the 30 clusters: plotting the clusters on the geographical map of China shows each station attached to its referent station. The clusters on the right of Figure 2 are cold stations and correspond to the north and west of China. The warm stations are on the left of Figure 2; they correspond to the south and south-east of China and are characterized by very large variations in temperature.
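For illustration, a small sketch of the Hausdorff-type L2 distance above, applied to the two station rows of Table 1 (the function and variable names are ours, not the authors'):

    from math import sqrt

    def hausdorff_l2(Q, Qp):
        # d(Q, Q') = sqrt( sum_j max(|a_j - a'_j|, |b_j - b'_j|)^2 )
        # Q and Qp are sequences of intervals (a_j, b_j).
        return sqrt(sum(max(abs(a - ap), abs(b - bp)) ** 2
                        for (a, b), (ap, bp) in zip(Q, Qp)))

    # January, February, November and December intervals from Table 1
    abag_qi = [(-24.9, -17.0), (-22.3, -12.8), (-16.4, -6.2), (-24.7, -14.8)]
    hailaer = [(-28.6, -22.5), (-25.5, -19.7), (-17.4, -9.3), (-25.5, -20.0)]
    print(hausdorff_l2(abag_qi, hailaer))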

Figure 3: Distribution of the clusters on the geographical map of China (latitude 20-50, longitude 80-130), using the same colors as in Figure 2.

There is a continuity from the cold stations to the warm and hot ones. The analysis of the distribution of the clusters on the geographical map of China makes it possible to deduce that the variations in temperature depend more on latitude than on longitude.

3.2 Euclidean distance

Secondly, we use a Euclidean distance on interval data, defined as follows:

d(Q, Q') = \sqrt{ \sum_{j=1}^{p} \frac{1}{4} \left[ (a_j - a'_j) + (b_j - b'_j) \right]^2 }

where Q = (I_1, ..., I_p) and Q' = (I'_1, ..., I'_p) are a pair of items described by p intervals, with I_j = [a_j, b_j]; this amounts to the ordinary Euclidean distance between the interval midpoints. We use the same parameters as for the Hausdorff distance.
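The corresponding sketch for the Euclidean distance on interval data defined above (again an illustration under the same conventions as the previous sketches):

    from math import sqrt

    def euclidean_interval(Q, Qp):
        # d(Q, Q') = sqrt( sum_j 1/4 * ((a_j - a'_j) + (b_j - b'_j))^2 ),
        # i.e. the Euclidean distance between the midpoints (a_j + b_j) / 2.
        return sqrt(sum(0.25 * ((a - ap) + (b - bp)) ** 2
                        for (a, b), (ap, bp) in zip(Q, Qp)))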

Figure 4 shows the projection of the final map on the factorial dissimilarity analysis planes, and Figure 5 provides the details of the result.

Figure 4: Final map on the factorial dissimilarity analysis planes. Each color represents a cluster.

Figure 5: Distribution of the clusters on the geographical map of China (latitude 20-50, longitude 80-130), using the same colors as in Figure 4.

The classification is meaningful but differs from the one obtained with the Hausdorff distance. We will now compare the different results.

3.3 Discussion

We also used some other metrics for this application, but we cannot detail the results for lack of space. We used the vertex-type distance, defined as the sum of the squared Euclidean distances between the 2^p vertices of the two hyperrectangles, and we also used the mean temperatures of the stations (a non-symbolic representation of the temperatures). In order to compare the results obtained with these different metrics, we compute the longitude and latitude distortions of each clustering. The distortion is defined as the quadratic mean error between the referents and their assigned individuals. The longitude distortion is defined as follows:

(D_{long})^2 = \sum_{c \in C} \frac{1}{|c|} \sum_{z_i \in c} \left| Lo_{z_i} - Lo_{f(z_i)} \right|^2

where |c| is the cardinality of the cluster c and |Lo_{z_i} - Lo_{f(z_i)}| is the longitude distance between the station z_i and its referent f(z_i). The latitude distortion is defined as follows:

(D_{lati})^2 = \sum_{c \in C} \frac{1}{|c|} \sum_{z_i \in c} \left| La_{z_i} - La_{f(z_i)} \right|^2

where |La_{z_i} - La_{f(z_i)}| is the latitude distance between the station z_i and its referent f(z_i).
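As a sketch of how these distortions can be computed (the names are hypothetical: assign_of[i] is the neuron of station i, referent_of[c] is the index of the referent station of neuron c, a single station here since q = 1, and lon[i] is the longitude of station i); the latitude distortion is obtained by passing latitudes instead of longitudes:

    def squared_distortion(assign_of, referent_of, lon):
        # (D_long)^2 = sum_{c in C} (1/|c|) * sum_{z_i in c} |Lo_{z_i} - Lo_{f(z_i)}|^2
        total = 0.0
        for c in set(assign_of):
            members = [i for i, a in enumerate(assign_of) if a == c]
            ref = referent_of[c]
            total += sum((lon[i] - lon[ref]) ** 2 for i in members) / len(members)
        return total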

Table 2 reports the longitude and latitude distortions of the clusterings obtained with the different metrics.

Data type        Used metric           Longitude distortion   Latitude distortion
Intervals        Euclidean distance    9.250688               1.993213
Intervals        Vertex-type distance  8.625175               2.165838
Means (numeric)  Euclidean distance    7.656033               1.936692
Intervals        Hausdorff distance    7.38314                1.911461

Table 2: Longitude and latitude distortions for the different metrics.

We can conclude that the clustering obtained with the Hausdorff distance induces the smallest latitude and longitude distortions.

4 Conclusion

In this paper, we proposed an adaptation of the self-organizing map to dissimilarity data. This adaptation is based on the batch algorithm and can handle both numerical data and complex data. The experiments showed the usefulness of the method and that it can be applied to symbolic data or to other complex data once a dissimilarity can be defined for these data.

References

[1] Anouar, F., Badran, F. and Thiria, S.: Self Organized Map, a Probabilistic Approach. Proceedings of the Workshop on Self-Organized Maps (1997), Helsinki University of Technology, Espoo, Finland, June 4-6.
[2] De Reyniès, A.: Classification et discrimination en analyse de données symboliques. Thèse de Doctorat, Université Paris Dauphine, 2003.
[3] Diday, E., Schroeder, A. and Ok, Y.: The Dynamic Clusters Method in Pattern Recognition. IFIP Congress 1974, 691-697.
[4] El Golli, A.: Extraction de données symboliques et cartes topologiques : application aux données ayant une structure complexe. Thèse de Doctorat, Université Paris Dauphine, 2004.
[5] Kohonen, T.: Self-Organizing Maps of Symbol Strings. Technical Report A42, Laboratory of Computer and Information Science, Helsinki University of Technology, Finland, 1996.
[6] Kohonen, T.: Self-Organizing Maps. Springer-Verlag, 1997.

[7] MacQueen, J.: Some Methods for Classification and Analysis of Multivariate Observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, pp. 281-297, 1965.
[8] Ramsay, J. and Silverman, B.: Functional Data Analysis. Springer Series in Statistics, Springer-Verlag, June 1997.
[9] Saporta, G.: Probabilités, analyse des données et statistique. Éditions Technip, 1990.