Machine Learning and Visualisation


1 Machine Learning and Visualisation. Ian T. Nabney, Aston University, Birmingham, UK. March 2015.

2 Outline. The challenge of hidden knowledge; data visualisation: latent variable models; data visualisation: topographic mappings; non-linear modelling and feature selection.

3 Acknowledgements. Collaborators: Chris Bishop, Mike Tipping, David Lowe, Markus Svensén, Chris Williams; Peter Tiňo, Yi Sun, Dharmesh Maniyar, John Owen; Phil Laflin, Bruce Williams, Paola Gaolini, Jens Lösel; Martin Schroeder, Ain Abdul Karim, Dan Cornford, Cliff Bailey, Naomi Hubber, Shahzad Mumtaz, Michel Randrianandrasana; Richard Barnes, Colin Smith, Dan Wells.

4 Hidden Knowledge. Understanding the vast quantities of data that surround us is a real challenge, particularly in situations with many variables, but we can understand more of it with help. Machine learning is the computer-based generation of models from data. A model is a parameterised function from input attributes to an output prediction; the parameters of the model express the hidden connection between inputs and predictions, and they are learned from data.

5 Data Visualisation. What is Visualisation? The goal of visualisation is to present data in a human-readable way. Visualisation is an important tool for developing a better understanding of large, complex datasets; it is particularly helpful for users, such as research scientists or clinicians, who are not specialists in data modelling. Uses include: detection of outliers; clustering and segmentation; an aid to feature selection; feedback on the results of analysis. There are two aspects: data projection and information visualisation.

6 Data Visualisation. Data Projection. The goal is to project data to a lower-dimensional space (usually 2-d) while preserving as much information or structure as possible. Once the projection is done, standard information visualisation approaches can be used to support user interaction. The quantity and complexity of many datasets mean that simple visualisation methods, such as Principal Component Analysis, are not very effective.

7 Data Visualisation. Information Visualisation. Shneiderman: overview first; zoom and filter; details on demand. The overview is provided by the projection; zooming is possible in Matlab plots; filtering is done by user interaction, e.g. specifying a pattern of values that is of interest; details are given by providing local information. We will see more of this later in practical examples.

8 Data Visualisation. Information Visualisation Examples: word cloud.

9 Data Visualisation. Uncertainty. "Doubt is not a pleasant condition, but certainty is absurd." (Voltaire) Real data is noisy, so we are forced to deal with uncertainty, yet we need to be quantitative. The optimal formalism for inference in the presence of uncertainty is probability theory. We assume the presence of an underlying regularity in order to make predictions. Bayesian inference allows us to reason probabilistically about the model as well as the data.

10 Data Visualisation. Data Projection. [Diagram: a parameterised mapping f(y; W) from the data space D (coordinates y1, y2, y3) to the visualisation space V.] Define f to optimise some criterion: for PCA the criterion is variance, for the Sammon mapping it is stress.

11 Data Visualisation. What can we learn from this? [Plot: a two-dimensional visualisation with points labelled Sinus, VEL and VER.]

12 Data Visualisation. Projection. What is the simplest way to project data? A linear map. What is the best way to linearly project data? We want to preserve as much information as possible. If we assume that information is measured by variance, this implies choosing new coordinate axes along the directions of maximal variance; these can be found by analysing the covariance matrix of the data. This gives Principal Component Analysis (PCA). For large datasets, the end result is usually a circular blob in the middle of the screen.

13 Data Visualisation. PCA. Let S be the covariance matrix of the data, so that $S_{ij} = \frac{1}{N} \sum_{n} (x_i^n - \bar{x}_i)(x_j^n - \bar{x}_j)$. The first q principal components are the first q eigenvectors $w_j$ of S, ordered by the size of the eigenvalues $\lambda_j$. The percentage of the variance explained by the first q PCs is $\sum_{j=1}^{q} \lambda_j / \sum_{j=1}^{d} \lambda_j$, where the data dimension is d. These vectors are orthonormal (perpendicular and of unit length), and the variance of the data projected onto them is maximal. To plot the sorted principal values in Matlab: plot(-sort(-eig(cov(data))));
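
Expanding that one-liner, here is a minimal base-Matlab sketch of the full PCA projection; it assumes data is an N-by-d matrix, and all variable names are illustrative:

% Project an N-by-d data matrix onto its first two principal components.
Xc = data - repmat(mean(data, 1), size(data, 1), 1);   % centre the data
S = cov(Xc);                         % d-by-d covariance matrix
[V, D] = eig(S);                     % eigenvectors and eigenvalues of S
[lambda, idx] = sort(diag(D), 'descend');
W = V(:, idx(1:2));                  % first two principal directions
proj = Xc * W;                       % N-by-2 projected coordinates
explained = sum(lambda(1:2)) / sum(lambda);   % fraction of variance retained
plot(proj(:, 1), proj(:, 2), '.');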

14 Data Visualisation: Topographic Mappings. The basic aim is that distances in the visualisation space are as close as possible to those in the original data space. Given a dissimilarity matrix $d_{ij}$, we want to map data points $x_i$ to points $y_i$ in a feature space such that their dissimilarities $\tilde{d}_{ij}$ in feature space are as close as possible to the $d_{ij}$; we say that the map preserves similarities. The stress measure is used as the objective function: $E = \frac{1}{\sum_{i<j} d_{ij}} \sum_{i<j} \frac{(d_{ij} - \tilde{d}_{ij})^2}{d_{ij}}$.
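
A base-Matlab sketch of this stress computation, assuming X holds the data and Y a candidate 2-d configuration (illustrative names):

% Stress of a 2-d configuration Y against the original data X.
sqx = sum(X.^2, 2);
D  = sqrt(max(bsxfun(@plus, sqx, sqx') - 2 * (X * X'), 0));   % data-space distances
sqy = sum(Y.^2, 2);
Dy = sqrt(max(bsxfun(@plus, sqy, sqy') - 2 * (Y * Y'), 0));   % feature-space distances
mask = triu(true(size(D)), 1);       % use each pair i < j once
d = D(mask); dy = Dy(mask);
d(d == 0) = eps;                     % guard against coincident points
E = sum((d - dy).^2 ./ d) / sum(d);  % the stress measure above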

15 Data Visualisation: Topographic Mappings. Multi-Dimensional Scaling. Given distances or dissimilarities $d_{rs}$ between every pair of observations, we try to preserve these as far as possible in a lower-dimensional space. In classical scaling, the distance between the objects is assumed to be Euclidean; a linear projection then corresponds to PCA. The Sammon mapping is a non-linear multidimensional scaling technique that is more general (and more widely used) than classical scaling. NeuroScale is a neural-network-based scaling technique that has the advantage of actually giving a map that generalises!

16 Data Visualisation: Topographic Mappings. NeuroScale. [Figure: the NeuroScale mapping.]

17 Data Visualisation: Topographic Mappings. Biological Application: Streptomyces Gene Expression. Data supplied by Colin Smith (Surrey University). Streptomyces coelicolor is a bacterium which undergoes developmental changes correlated with sporulation and the production of antibiotics; its genes include more than 20 clusters coding for secondary metabolites, including a large proportion of regulatory genes. The dataset consists of ten time points from 16 to 67 hours after inoculation of the growth medium; the analysis is based on 3067 genes that were significantly expressed. SCO6283, SCO6284, SCO6277 and SCO6278 are co-regulated genes involved in the synthesis of a type I polyketide; SCO3245 is involved in the synthesis of lipid.

18 Data Visualisation: Topographic Mappings. Streptomycin. [Image: the life of streptomycin.] Bioinformatics: measuring the expression levels of thousands of genes over multiple timepoints.

19 Data Visualisation: Topographic Mappings. SCO6283, SCO6284, SCO6277 and SCO6278 lie in cluster 11; SCO3245 lies in cluster 12.

20 Data Visualisation: Topographic Mappings. Genes involved in the synthesis of two distinct secondary metabolites may be co-regulated by a common network.

21 Data Visualisation: Latent Variable Models. The projection approach is one way of reducing the data complexity; an alternative view is to hypothesise how the data might have been generated. Hidden Connections: "A hidden connection is stronger than an obvious one." (Heraclitus)

22 Data Visualisation: Latent Variable Models. How is the idea of hidden connections applied to statistical pattern recognition? Separate the observed variables from the latent variables: latent variables generate observations, and we use (probabilistic) inference to deduce what is happening in latent variable space, often via Bayes' theorem: $P(L \mid O) = \frac{P(O \mid L)\,P(L)}{P(O)}$. Static case: GTM, with two latent variables and a non-linear transformation to observation space. Dynamic cases: Hidden Markov Models (discrete state space; speech recognition) and State Space Models (continuous state space; tracking).
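
A toy numeric illustration of Bayes' theorem for a discrete latent variable (the probabilities below are invented):

% Two latent states with prior P(L) and likelihoods P(O | L).
priorL = [0.7, 0.3];          % P(L = 1), P(L = 2)
likeO  = [0.2, 0.9];          % P(O | L = 1), P(O | L = 2)
postL  = priorL .* likeO;     % numerator P(O | L) P(L)
postL  = postL / sum(postL);  % divide by P(O) to normalise
disp(postL);                  % posterior P(L | O), approx [0.34, 0.66]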

23 Data Visualisation: Latent Variable Models. Visualisation with Density Models. Construct a generative model for the data: a mapping from a low-dimensional latent space H to the data space D. It maps latent variables r to observed variables x, giving a probability density $p(x \mid r)$. To visualise the data we want to map from observed variables to latent variables, so we use Bayes' theorem to compute $p(r \mid x) = \frac{p(x \mid r)\,p(r)}{p(x)}$. We then plot a summary statistic of $p(r_i \mid x_i)$ for each data point $x_i$: usually the mean. If the mapping is linear and there is a single Gaussian noise model, we recover PCA.
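
A sketch of the posterior-mean computation for a grid-based generative model of this kind, assuming an isotropic Gaussian noise model and a uniform prior over K latent grid points; R (K-by-2 latent grid), M (K-by-d images of the grid under the trained mapping), X (N-by-d data) and s2 (noise variance) are illustrative names:

% Responsibilities p(r_k | x_n) and posterior means in latent space.
sqm = sum(M.^2, 2);
sqx = sum(X.^2, 2);
dist2 = bsxfun(@plus, sqx, sqm') - 2 * (X * M');  % N-by-K squared distances
logR = -dist2 / (2 * s2);                         % log p(x | r_k) up to a constant;
                                                  % a uniform prior cancels on normalising
logR = bsxfun(@minus, logR, max(logR, [], 2));    % stabilise before exponentiating
Resp = exp(logR);
Resp = bsxfun(@rdivide, Resp, sum(Resp, 2));      % responsibilities sum to one per point
means = Resp * R;                                 % N-by-2 posterior means to plot
plot(means(:, 1), means(:, 2), '.');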

24 Data Visualisation: Latent Variable Models. [Diagram: the mapping y(z; w) from the latent space (coordinates z1, z2) to the data space (coordinates x1, x2, x3).]

25 Data Visualisation: Latent Variable Models. The Generative Topographic Mapping. GTM (Bishop, Svensén and Williams) is a latent variable model with a non-linear RBF mapping from a (usually two-dimensional) latent space H to the data space D. The data doesn't live exactly on the manifold, so we smear it with Gaussian noise. We introduce a latent space density p(x), approximated by a data sample. This is a generative probabilistic model. It assumes that the data lies close to a two-dimensional manifold; this is likely to be too simple a model for interesting data, but we can measure the non-linearity of the sheet and use this to understand the visualisation plot. The model is trained in a maximum likelihood framework using an iterative algorithm (EM).
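
For reference, training a GTM with the Netlab toolbox looks roughly like the following. This is a sketch modelled on Netlab's demgtm demo; the function names and argument layouts (gtm, gtminit, gtmem, gtmlmean and the foptions vector) are assumptions tied to that toolbox and may differ between versions:

% Fit a GTM with an 8-by-8 latent grid and a 4-by-4 grid of RBF centres.
net = gtm(2, 64, size(data, 2), 16, 'gaussian', 0.1);
options = foptions;                 % Netlab's default options vector
net = gtminit(net, options, data, 'regular', [8 8], [4 4]);
options(1) = 1;                     % display error values during training
options(14) = 30;                   % number of EM iterations
[net, options, errlog] = gtmem(net, data, options);
means = gtmlmean(net, data);        % posterior means in latent space
plot(means(:, 1), means(:, 2), '.');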

29 Data Visualisation: Latent Variable Models. Enhancements to GTM. Curvatures give more information about the shape of the manifold. A hierarchy allows the user to drill down into the data, with either user-defined or automated (MML) selection of sub-model positions. Temporal dependencies in the data are handled by GTM through Time. Discrete data is handled by the Latent Trait Model (LTM), and all the other goodies work for it as well. The model can cope with missing data in training and visualisation. Further extensions: MML methods for feature selection; structured covariance; mixed data types.

30 Data Visualisation: Latent Variable Models. Local Parallel Coordinates. Parallel coordinates maps a d-dimensional data space onto two display dimensions by using d equidistant axes parallel to the y-axis. Each data point is displayed as a piecewise-linear graph intersecting each axis at the position corresponding to the data value for that dimension. It is impractical to display this for all the data points, so we allow the user to select a region of interest; the user can also interact with the local parallel coordinates plot to obtain detailed information (see the sketch below).
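
A minimal base-Matlab sketch of the local parallel coordinates view; data (N-by-d) and the logical selection index sel are illustrative names:

% Parallel-coordinates view of a user-selected region of interest.
local = data(sel, :);
% Scale each dimension to [0, 1] so the axes are comparable.
lo = min(local, [], 1); hi = max(local, [], 1);
scaled = bsxfun(@rdivide, bsxfun(@minus, local, lo), max(hi - lo, eps));
plot(1:size(local, 2), scaled', '-');   % one piecewise-linear graph per point
xlabel('dimension'); ylabel('scaled value');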

31 Data Visualisation: Latent Variable Models. Hierarchical GTM: Drilling Down. Bishop and Tipping introduced the idea of hierarchical visualisation for probabilistic PCA; we have developed a general framework for arbitrary latent variable models. Because GTM is a generative latent variable model, it is straightforward to train hierarchical mixtures of GTMs. We model the whole data set with a GTM at the top level, which is broken down into clusters at deeper levels of the hierarchy. Because the data can be visualised at each level of the hierarchy, the selection of the clusters that are used to train GTMs at the next level down can be carried out interactively by the user.

32 Data Visualisation: Latent Variable Models. Chemometric Application: HTS Data Exploration. Scientists at Pfizer searching for active compounds can now screen millions of compounds in a fortnight. The aims are to gain a better understanding of the results of multiple screens through the use of novel data visualisation and modelling techniques; to find clusters of similar compounds (measured in terms of biological activity) and use a representative subset to reduce the number of compounds in a screen; and to build local prediction models.

33 Data Visualisation: Latent Variable Models. We have taken data from Jens Lösel (Pfizer) consisting of 14-dimensional vectors representing chemical compounds using topological indices developed at Pfizer; the task is to predict logP. The plots segment the data (by responsibility), which can be used to build local predictive models that are often more accurate than global models. Only 14 inputs are used, far fewer than other methods of predicting logP require, and the results are comparable with other logP algorithms.

34 Data Visualisation: Latent Variable Models. [Visualisation plot.]

35 Data Visualisation: Latent Variable Models. [Visualisation plot.]

36 Data Visualisation: Latent Variable Models. Gaussian Process Latent Variable Model. [Visualisation plot.]

37 Non-linear Modelling and Feature Selection. Many chemometric problems can best be addressed using non-linear predictive models (e.g. QSAR). Models must be multivariate (there is no single "silver bullet"), but there are hundreds (thousands, tens of thousands) of possible features (e.g. for small molecules, proteins, ...). Linear models have a constant sensitivity to input variables; non-linear models have a variable sensitivity, with niches of good performance and variable importance, as the sketch below illustrates.
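
A small base-Matlab illustration of the contrast: the sensitivity (derivative) of a linear model is constant, while that of a single Gaussian RBF varies with the input (all values below are invented):

% Constant vs variable sensitivity of a model output to its input.
x = linspace(-3, 3, 200)';
w = 0.8; c = 0; s2 = 0.5;
lin_sens = w * ones(size(x));            % d/dx of w*x is w everywhere
rbf = exp(-(x - c).^2 / (2 * s2));
rbf_sens = -(x - c) / s2 .* rbf;         % d/dx of a Gaussian basis function
plot(x, lin_sens, '-', x, rbf_sens, '--');
legend('linear model', 'RBF model');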

38 Non-linear Modelling and Feature Selection. GTM-FS. [Plot: $d_1$ and $d_2$ have high saliency, $d_3$ has low saliency.]

39 Non-linear Modelling and Feature Selection. Chemometric Data. [Panels: GTM visualisation and GTM-FS visualisation, with magnification factors on a log scale.]

40 Non-linear Modelling and Feature Selection. Feature Saliencies. Both GTM models outperform the Kohonen SOM. GTM-FS performs better than GTM on magnification factors (71 to 126) and (subjectively) has more coherent clusters; GTM-FS performs worse than GTM on nearest-neighbour error (41% to 38%).
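
Such a nearest-neighbour label error can be computed directly in the projected space. A minimal base-Matlab sketch, assuming proj is the N-by-2 projection and labels the class labels (both illustrative names):

% Leave-one-out nearest-neighbour label error in visualisation space.
N = size(proj, 1);
sq = sum(proj.^2, 2);
D = bsxfun(@plus, sq, sq') - 2 * (proj * proj');  % squared distances
D(1:N+1:end) = inf;                  % exclude each point from its own neighbours
[~, nn] = min(D, [], 2);             % index of each point's nearest neighbour
err = mean(labels(nn) ~= labels);    % fraction with a differently-labelled neighbour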

41 Block GTM. Block-structured Covariance. Include prior information about the correlations of variables in a GTM by using a full covariance matrix in the noise model and enforcing a block structure: $\Sigma = \begin{pmatrix} \Sigma_1 & & 0 \\ & \ddots & \\ 0 & & \Sigma_p \end{pmatrix}$. This results in a reasonably sparse covariance matrix and keeps the number of unknown parameters low, while the additional flexibility allows the model to fit the data more closely. The extension of the learning algorithm is straightforward; the only changes occur in the computation of the responsibilities in the E-step and of $\Sigma$ in the M-step.
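
As a small illustration of the block structure, base Matlab's blkdiag assembles a covariance of this form from per-block matrices (the block values below are made up):

% Build a block-structured covariance; off-block entries are fixed at zero.
Sigma1 = [1.0 0.6; 0.6 1.0];
Sigma2 = [1.0 0.8 0.5; 0.8 1.0 0.7; 0.5 0.7 1.0];
Sigma3 = 1.0;
Sigma = blkdiag(Sigma1, Sigma2, Sigma3);
% Only the within-block entries are free parameters, far fewer
% than a full 6-by-6 covariance would need.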

42 Block-structured Covariance. Finding the Blocks: I. Find the block structure by visualising the correlation coefficients as a heat map. For this method to be successful, one needs to order the heat map so that highly correlated variables are close to each other (i.e. forming blocks). Generate a dendrogram using hierarchical clustering, combined with heuristics to reorder the leaves to reflect their proximity: the tree is ordered in such a way that the distance between neighbouring leaves is minimised. We use a recursive algorithm, Optimal Leaf Ordering (OLO), available in the Matlab Bioinformatics Toolbox, which swaps sub-trees if this reduces the distances to neighbours.
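
A sketch of the heat-map reordering, assuming the toolbox functions pdist, linkage and optimalleaforder are available (the talk cites the Bioinformatics Toolbox; recent Statistics Toolbox releases also ship optimalleaforder, so exact placement varies):

% Reorder a correlation heat map with hierarchical clustering + OLO.
C = corrcoef(data);                    % d-by-d correlation matrix
dissim = pdist(C);                     % dissimilarity between variables' profiles
tree = linkage(dissim, 'average');
order = optimalleaforder(tree, dissim);
imagesc(C(order, order)); colorbar;    % blocks of correlated variables emerge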

43 Block-structured Covariance. Finding the Blocks: II. Bayesian Correlation Estimation, based on the paper of Liechty et al. (2004). For the grouping, one is only interested in the off-diagonal elements of the empirical correlation matrix C. Assume that $C_{ij} \sim N(\mu, \sigma^2)$ with priors $\mu \sim N(0, \tau^2)$ and $\sigma^2 \sim IG(\alpha, \beta)$, with the hyperparameters known. Extend this to groups with means $\mu_{\theta_i, \theta_j}$, where the posterior $p(\theta_i)$ defines the groups. The full posterior distribution of $\theta_i$, $\mu$ and $\sigma$ can be sampled using the Metropolis-Hastings algorithm, but this is very slow, so we created a simpler "Quick BCE" which just estimates $p(\theta_i = k)$.
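
A minimal random-walk Metropolis sketch for the simplest version of this model: a single mean mu, with sigma2 and tau2 treated as known and c holding the off-diagonal correlations (all names illustrative, not the full grouped sampler):

% Random-walk Metropolis sampling of mu given the correlations c.
logpost = @(mu) -sum((c - mu).^2) / (2 * sigma2) - mu^2 / (2 * tau2);
mu = 0; samples = zeros(5000, 1);
for t = 1:numel(samples)
    prop = mu + 0.1 * randn;                 % random-walk proposal
    if log(rand) < logpost(prop) - logpost(mu)
        mu = prop;                           % accept the proposal
    end
    samples(t) = mu;                         % otherwise keep the current value
end
hist(samples(1000:end), 30);                 % posterior of mu after burn-in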

44 Block-structured Covariance. Results on Toy Data. [Plot: the nearest-neighbour label error with high (ST=20) and low (ST=2) structure for the GTM model with different covariance structures. Legend: PCA, blue dotted line with large dots; S-GTM, green constant line with crosses; B-GTM, red dashed line with diamonds; F-GTM, black dash-dotted line.]

45 Conclusions. Visualisation is an important tool for all types of user, and the domain expert must be involved in the process. Interaction with the plots allows the user to query the data more effectively. Presenting the data in the right way is key. Feature selection is a very important tool. Accounting for known structure (e.g. block covariance) improves results.

46 AgustaWestland. AW has pioneered CVM, the continuous recording of airframe vibration (0-200 Hz), to improve the investigation of unusual occurrences and to monitor airframe integrity. The aims are to develop a probabilistic framework for inferring flight mode and key parameters from multiple streams of vibration data; to improve indicators of airframe condition, using the wavelet transform and kernel entropy to assess the dynamics (i.e. non-stationary characteristics) of the vibration signal; and to provide integrated diagnosis based on probabilistic models of normality, using a belief network to model prior knowledge about the domain and the interactions between key variables.

47 Understanding the Data. Eight sensors measuring vibration; 108 frequency bands per sensor.

