Multivariate Data Visualisation


A Survey of Methods for Multivariate Data Projection, Visualisation and Interactive Analysis

Andreas König
Chair of Electronic Devices and Integrated Circuits, Dresden University of Technology
Mommsenstr. 13, Dresden, Germany
Tel.: Fax.: koenig@iee.et.tu-dresden.de

Key Words: data structure analysis, visual exploratory data analysis, topology preserving mapping, distance preserving mapping, Gestalt theory, Gestalt preserving mapping

Abstract

In this paper, algorithms for multivariate data projection, based on topology or distance preserving mappings, as well as tools and techniques for projection display and user interaction, are briefly reviewed and compared in a unifying approach. Advanced mapping algorithms that focus on improved data structure preservation, following laws of perception as given by Gestalt theory, as well as advanced features of data visualisation and navigation are introduced. These methods help to exploit the remarkable human perceptive and associative capabilities in man/computer dialogue, e.g. for visual exploratory data analysis.

1. Introduction

Projection of multivariate data by dimension reducing mapping, followed by visualisation and interactive analysis, has been a topic of interest for more than three decades [1] [2]. Recently, there has been strong renewed interest in this topic, driven by, e.g., data mining/data warehouse and knowledge discovery applications. Coping with today's flood of data from rapidly growing databases and related computational resources, and especially discovering salient structures and interesting correlations in data, requires advanced methods of machine learning, pattern recognition, data analysis and visualisation. The remarkable ability of human observers to perceive clusters and correlations, and thus structure in data, is of great interest and can be well exploited by effective systems for data projection and interactive visualisation. Methods from various domains, e.g.
pattern recognition and neural networks, were developed and individually applied in conjunction with various visualisation techniques. A survey will be given in section 2 and [...]

2. Multivariate Data Projection

Visualisation of multivariate data requires a dimension reduction to a two- or three-dimensional representation. Fig. 1 gives a taxonomy of state-of-the-art projection methods in a unified presentation. The focus of this paper is on unsupervised mapping procedures that work without any a priori information. For demonstration of some mapping properties, an artificial data set, denoted as Cube-data, will be used in the following. A cube with points only on eight edges of two opposite sides was generated. The cube was rotated by 45° with regard to the coordinate axes.

Figure 1: Taxonomy of Projection Methods (diagram labels: linear methods, nonlinear methods; signal preserving, discriminance based, topology preserving, distance preserving; manual selection, PCA, Factor Analysis, scatter matrices, NLM, Visor, Triang. Method, TOPAS, Koontz and Fukunaga, Backpropagation Network (BP) (Autoassociative), Kohonen Feature Map, BP (Discr. An.))

A mapping can most simply be achieved by a priori selection of two (or three) salient components or factors. However, context knowledge is required for such a selection. The simultaneous display of multiple pairwise plots of the data set is proposed in some tool kits. However, the combinatorial explosion ($M(M-1)/2$ plots for $M$-dimensional data) limits practical application of this approach.

A mapping can also be achieved by the first two (three) principal components of Principal Component Analysis (PCA) [3]. If the PCA assumption of a Gaussian distribution is met and most of the variance is represented by the first two principal components, then suitable plots can be achieved by this linear method [4]. Backpropagation networks have been applied in autoassociative mode with a bottleneck topology to achieve mappings comparable to PCA. With a five-layer topology, a nonlinear, signal preserving mapping is computed from the input to the middle layer. However, projections achieved in this way are hard to interpret and strongly depend on other network parameters.

Figure 2: SOM trained with Cube-data: grid (left) and NLM-projection (right)

Figure 3: U-Matrix principle (left) and plot for Cube-data (right)

Figure 4: SOM component planes for Cube-data

The most promising mapping methods in terms of structure preservation use the criteria of either topology or distance preservation for the nonlinear mapping process. Kohonen's Self-Organizing Map (SOM) [5] is perhaps the most popular method for data visualisation. During training, the SOM unfolds in pattern space and creates a topology preserving mapping of the multivariate data onto the fixed neuron grid (or cube for 3D-SOM). Though this mapping, as given in Fig. 2, is interesting concerning neighborhood relations, no information is given on intra/inter-cluster distance. Researchers working on exploratory data analysis thus complemented the SOM with this distance information by a method denoted as Unified-Distance-Matrix (U-Matrix) [6]. The U-matrix method exploits the third dimension to plot interpoint distance information as a landscape on the SOM grid, i.e. a mountain range indicates a large distance between clusters.
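The landscape idea behind the U-matrix can be illustrated with a small sketch. The function below is a simplified stand-in and not the exact construction of [6]: for every unit of a rectangular SOM grid it averages the Euclidean distances between that unit's weight vector and those of its directly adjacent grid neighbours. The name `u_heights` and the nested-list grid layout are assumptions made for this illustration.

```python
import math


def u_heights(weights):
    """Return a rows x cols grid of mean distances to the 4-connected neighbours.

    weights[r][c] is the weight vector (a sequence of floats) of the SOM unit
    at grid position (r, c). Large values mark "mountain ranges", i.e. units
    whose grid neighbours lie far away in pattern space.
    """
    rows, cols = len(weights), len(weights[0])
    heights = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dists = []
            # Visit the up, down, left and right neighbours that exist.
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    dists.append(math.dist(weights[r][c], weights[nr][nc]))
            heights[r][c] = sum(dists) / len(dists) if dists else 0.0
    return heights


# Toy grid (hypothetical weights): the left three columns sit at the origin,
# the right three columns at (10, 10, 10).
grid = [[(0.0, 0.0, 0.0)] * 3 + [(10.0, 10.0, 10.0)] * 3 for _ in range(4)]
heights = u_heights(grid)
```

In this toy grid only the two columns flanking the cluster boundary receive non-zero heights, which is the ridge a U-matrix plot would display between the two clusters.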
Kohonen proposed Sammon's Non-Linear Mapping (NLM) [1] as a means to include distance information in SOM visualisation (see Fig. 2, right). In addition to the missing distance information, several other practical problems are encountered when using SOM displays. First, due to the quantisation carried out by the SOM along with the topology preserving mapping, the SOM is not suited for identifying the position of individual data vectors in the map visualisation. All data vectors falling in the Voronoi cell of a certain SOM weight vector are represented by the same point on the SOM grid. Second, the interpolation properties of the SOM cause the placement of weight vectors in regions of pattern space that are actually void of data vectors. Third, if the training data possesses not only a high absolute dimension but also a high intrinsic dimension (larger than two or three) [7], the SOM starts to fold and twist in the attempt to establish a mapping to the plane of the neuron grid. This can lead to the scattered representation of an intrinsically high-dimensional cluster all over the map and consequently to misinterpretations by human observers. The Growing Cell Structures of Fritzke [8] offer a remedy to the second SOM problem; however, no improvement is offered for the other issues.

The NLM [1], in contrast, is a distance preserving nonlinear mapping. Interpoint distances $d_{X_{ij}}$, and thus implicitly the data structure, shall be preserved in the mapping according to the cost function $E(m)$:

$$E(m) = \frac{1}{c} \sum_{j=1}^{N} \sum_{i=1}^{j} \frac{\left(d_{X_{ij}} - d_{Y_{ij}}(m)\right)^2}{d_{X_{ij}}} \qquad (1)$$

Here $d_{Y_{ij}}(m) = \sqrt{\sum_{q=1}^{d} \left(y_{iq}(m) - y_{jq}(m)\right)^2}$ denotes the distance of the respective data points in the visualisation plane, $d_{X_{ij}} = \sqrt{\sum_{q=1}^{M} \left(v_{iq} - v_{jq}\right)^2}$ their distance in the original data space, and $c = \sum_{j=1}^{N} \sum_{i=1}^{j} d_{X_{ij}}$. Based on a gradient descent approach, the new coordinates of the $N$ pivot vectors $\vec{y}_i$ in the visualisation plane are determined by:

$$y_{iq}(m+1) = y_{iq}(m) - MF \cdot \Delta y_{iq}(m) \qquad (2)$$

with

$$\Delta y_{iq}(m) = \frac{\partial E(m)}{\partial y_{iq}(m)} \Big/ \left| \frac{\partial^2 E(m)}{\partial y_{iq}(m)^2} \right|$$

and $0 < MF \le 1$. It was shown that for large data sets the computational effort is considerable and that the gradient procedure does not always achieve an accurate projection [9].

Lee, Slagle and Blum (LSB) [10] presented a fast distance preserving mapping that focuses on the exact preservation of only a limited number of distances ($2(N-3)+3$). For this mapping, the Minimal Spanning Tree (MST) of the data distance graph is computed. Then the MST is traversed and points are mapped by a triangulation method, based on the previously mapped MST neighbors (see Fig. 5). This mapping achieves a fast and quite accurate projection. They also introduced the idea of a global reference point to be used in the triangulation step. However, MST computation has $O(N^2)$ complexity. Thus, in prior work, a mapping algorithm with fast determination of global pivot points for the triangulation step was developed [11]. This algorithm has $O(N)$ complexity and thus provides data projections with a very short response time. As shown by prior investigations with a quantitative mapping quality measure, the achievable mapping quality is similar to that of the NLM [11] (see Fig. 6). In comparison with other mapping techniques, distance preserving mappings are esteemed as the most convenient and powerful alternative [4].

Figure 5: MST and triangulation (point labels P1, P2, P3)

Figure 7: Four Iris NLM component cards

3. Data Visualisation

After data projection, the next step is visualisation and user interaction. Some examples of SOM visualisation have already been shown in Figs. 2, 3 and 4. Their generalisation to other mapping techniques, e.g. NLM, LSB or Visor, will be demonstrated here.
In prior work, the WeightWatcher (WW), initially devised to analyse neural networks, was developed. WW offers numerous visualisation aids such as SOM/NLM display, component planes for SOM/NLM, Voronoi tessellation, SOM orthogonal mesh and weight icons (all component values are plotted contiguously at the projection point, e.g. as a grey-value image block). An additional attribute display complements the projection points with class labels, pattern index or textual description in the plot. Further, navigation in the projection is supported by zoom and pan functions and an additional overview or navigation window. This feature allows the user to interactively explore the data from global to local aspects. It also links to our recent work on more powerful, hierarchical mapping schemes that closely connect mapping and visualisation.

Figure 6: Two-dimensional projections of the well known Iris data by NLM (left) and Visor (right)

Figure 8: Actual local neighborhood display

The mappings achieved by NLM, LSB or Visor give a fair impression of the underlying global data structure. However, with growing intrinsic dimensionality, mapping faults occur and lead to distortions and twists in the 2D representation. For accurate interpretation of displays, WW provides an actual local neighborhood display, demonstrated in Fig. 8. The user can find, traverse, and analyse the actual k-nearest neighbors of selected points (data entries) of interest. Thus, mapping faults can be identified and largely overcome by this interactive feature. Currently, search functions are being implemented that direct navigation according to simple and, in the future, more complex search cues.

4. Advanced Projection Methods

Advances beyond the state of the art described in the last two sections are highly desirable concerning improved structure preservation of the mapping without impairing mapping speed. Also, concepts to cope with the often unavoidable mapping error caused by high intrinsic dimensionality are of interest. For instance, a mapper should aim at preserving and accurately mapping structures of perceptual relevance for the human operator, focusing on the current region of interest (ROI). The unavoidable mapping error should be relocated by the mapping algorithm to areas between perceptually relevant structures and outside the ROI. Linking mapping and visualisation more closely opens the way to improved data displays and more lucid analysis. For instance, the reference point in the LSB method can be interactively selected and a new projection for this ROI-like selection can be carried out in the advanced visualisation tool. In a similar multistep approach, the rapidly computed LSB or Visor mappings [11] give an idea of the basic data structure. Zooming in, using WW, either the first, simple mapping is just scaled, or a more demanding mapping is started for the data subset enclosed in the current ROI.
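The two zoom options can be pictured with a small sketch. It assumes a rectangular ROI in the projection plane and uses hypothetical helper names (`roi_indices`, `rescale`) and made-up sample values; how WW actually represents the ROI is not described here.

```python
def roi_indices(projection, x_range, y_range):
    """Indices of projected points that fall inside the rectangular ROI."""
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    return [i for i, (x, y) in enumerate(projection)
            if x_lo <= x <= x_hi and y_lo <= y <= y_hi]


def rescale(points, x_range, y_range):
    """Stretch already projected points so that the ROI fills the unit square."""
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    return [((x - x_lo) / (x_hi - x_lo), (y - y_lo) / (y_hi - y_lo))
            for x, y in points]


# Hypothetical first-pass projection and the data vectors it was computed from.
projection = [(0.1, 0.1), (0.4, 0.5), (0.45, 0.55), (0.9, 0.8)]
data = [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0), (4.1, 5.2, 6.1), (9.0, 8.0, 7.0)]

roi_x, roi_y = (0.3, 0.6), (0.4, 0.6)
selected = roi_indices(projection, roi_x, roi_y)

# Option 1: reuse the first, simple mapping and only scale it to the ROI.
zoomed = rescale([projection[i] for i in selected], roi_x, roi_y)

# Option 2: hand only the enclosed data vectors to a more demanding mapper.
subset = [data[i] for i in selected]
```

Only the second option changes which distances the mapper has to preserve, since points outside the ROI no longer take part in the mapping.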
Higher complexity of this second mapping could be accepted, as the ROI typically comprises only a limited number of data entries. Following this idea, in prior work we developed an experimental mapping (TOpology Preserving mapping of sample Sets, TOPAS) to achieve mappings of high quality and reliability. In brief, this method is based on rank order evaluation of data vectors in the original space X and in the mapping space Y. If the respective rank positions in X- and Y-space are not identical, a correction is required to obtain the proper rank order. To achieve gradual corrections in the iterative mapping process, as in the NLM, distance information is used in the correction rule

$$y_{ij}(t+1) = y_{ij}(t) + \Delta y_{ij}(t) \qquad (3)$$

with

$$\Delta y_{ij}(t) = \alpha_p(t; Y(\vec{y}_i)) \, \frac{y_{ij} - y_{pj}}{d_{Y_{ip}}} \qquad (4)$$

and

$$d_{Y_{ip}} = \sqrt{\sum_{j=1}^{d} \left[y_{ij} - y_{pj}\right]^2} \qquad (5)$$

Similar to the SOM, $\alpha_p(t; Y(\vec{y}_i))$ denotes a time and position dependent learning rate. The temporal factor decays with time. In addition, the regarded neighborhood shrinks as in the SOM, so that the mapping focuses more and more on close neighbors. Fig. 9 shows some achieved results in comparison to NLM. Obviously, TOPAS provides a much better structure preservation than NLM. However, besides a scaling problem associated with the current learning rule, the complexity of $O(N^3)$ is a disadvantage of this algorithm.

Figure 9: NLM and TOPAS mapping of Cube

The salient features of TOPAS can be easily integrated in an Enhanced NLM (ENLM). Replacing Sammon's Magic Factor MF, the basic correction rule is enhanced to

$$y_{iq}(m+1) = y_{iq}(m) - \alpha(m) \, \Delta y_{iq}\left(m; \eta(r(\vec{X}_i); m)\right) \qquad (6)$$

by a time and position dependent learning rate. For the neighborhood function $\eta(r(\vec{X}_i); m)$, a Gaussian function with decaying $\sigma(m)$ is chosen, so that after achieving a global arrangement, long distances $d_{X_{ij}}$ are more and more neglected in the mapping and a better fine tuning of local data structures is achieved. The idea of the reference point introduced in [10] can be incorporated in TOPAS and ENLM by adding an additional weighting factor in the correction rule:

$$g_r(d_{X_{ri}}) = 1 - \frac{d_{X_{ri}} - d_{X_{ri}}^{\min}}{d_{X_{ri}}^{\max}} \qquad (7)$$

By this factor, computed for each data vector before the actual mapping step, emphasis is placed on accurately preserving distances (rank order) close to the reference point, while accuracy of distance preservation may decay with growing displacement from the reference point.

Another hierarchical approach, targeting the preservation of perceptually relevant structures, is pursued in our current work. Data partitioning into clusters takes place as a first step according to the work of Zahn [12]. This clustering process exploits simple laws from Gestalt theory. After this identification of perceptually relevant structures, mapping will take place in a hierarchical scheme, e.g. using variants of the ENLM or the presented triangulation techniques.

5. Conclusions and Future Work

A review and a brief assessment of state-of-the-art data projection and visualisation techniques was given. Salient SOM visualisation techniques were generalized to distance preserving mappings. Advanced mapping techniques were introduced in conjunction with improved visualisation techniques and tools. The presented work is based on the experience gathered with our PD SUN-tool-package ( koeniga) ~ that incorporates several mapping methods and the WW. In autumn 1998 our QuickCog system for PC (Windows 95, NT) will become commercially available; it comprises most of the described mappers and the WW as part of an environment for image/signal processing and rapid cognitive systems design.
Our new mapping algorithms and visualisation techniques will be implemented on this platform to be instantly available for applications ranging from pattern recognition and medical data analysis to database navigation tasks.

References

[1] J. W. Sammon. A Nonlinear Mapping for Data Structure Analysis. In IEEE Transactions on Computers C-18, No. 5, pages 401-409,

[2] J. W. Sammon. Interactive Pattern Analysis and Classification. In IEEE Transactions on Computers C-19, No. 7, pages 594-616,

[3] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, Inc., Harcourt Brace Jovanovich, Publishers, Boston San Diego New York London Sydney Tokyo Toronto,

[4] W. Siedlecki, K. Siedlecka, and J. Sklansky. An Overview of Mapping Techniques for Exploratory Pattern Analysis. In Pattern Recognition, Vol. 21, No. 5, Pergamon Press plc, pages 411-429,

[5] T. Kohonen. Self-Organization and Associative Memory. Springer Verlag Berlin Heidelberg London Paris Tokyo Hong Kong,

[6] A. Ultsch and H. P. Siemon. Exploratory Data Analysis: Using Kohonen Networks on Transputers. In Interner Bericht Nr. 329, Universität Dortmund, Dezember 1989,

[7] K. Fukunaga and D. R. Olsen. An Algorithm for Finding Intrinsic Dimensionality of Data. In IEEE Transactions on Computers C-20, No. 2, pages 176-183,

[8] B. Fritzke. Growing Cell Structures - A Self-Organizing Network for Unsupervised and Supervised Learning. In Neural Networks, Vol. 7, No. 9, pages 1441-1460. Pergamon Press,

[9] W. Dzwinel. How to make Sammon's Mapping useful for Multidimensional Data Structure Analysis. In Pattern Recognition, Vol. 27, No. 7, Elsevier Science Ltd, pages 949-959,

[10] R. C. T. Lee, J. R. Slagle, and H. Blum. A Triangulation Method for the Sequential Mapping of Points from N-Space to Two-Space. In IEEE Transactions on Computers C-26, pages 288-292,

[11] A. König, O. Bulmahn, and M. Glesner. Systematic Methods for Multivariate Data Visualization and Numerical Assessment of Class Separability and Overlap in Automated Visual Industrial Quality Control. In Proceedings of the 5th British Machine Vision Conference BMVC'94, pages 195-204, September

[12] C. T. Zahn. Graph-Theoretical Methods for Detecting and Describing Gestalt Clusters. In IEEE Transactions on Computers, Vol. C-20, pages 68-86, 1971.
