Learning Manifolds in Forensic Data

Frédéric Ratle 1, Anne-Laure Terrettaz-Zufferey 2, Mikhail Kanevski 1, Pierre Esseiva 2, and Olivier Ribaux 2

1 Institut de Géomatique et d'Analyse du Risque, Faculté des Géosciences et de l'Environnement, Université de Lausanne, Amphipôle, CH-1015, Switzerland. frederic.ratle@unil.ch
2 Institut de Police Scientifique et de Criminologie, Ecole des Sciences Criminelles, Université de Lausanne, Batochime, CH-1015, Switzerland

Abstract. Chemical data related to illicit cocaine seizures is analyzed using linear and nonlinear dimensionality reduction methods. The goal is to find relevant features that could guide the data analysis process in chemical drug profiling, a recent field in the crime mapping community. The data has been collected using gas chromatography analysis. Several methods are tested: PCA, kernel PCA, isomap, spatio-temporal isomap and locally linear embedding. ST-isomap is used to detect a potential time-dependent nonlinear manifold, the data being sequential. Results show that the presence of a simple nonlinear manifold in the data is very likely and that this manifold cannot be detected by a linear PCA. The presence of temporal regularities is also observed with ST-isomap. Kernel PCA and isomap perform better than the other methods, and kernel PCA is more robust than isomap when random perturbations are introduced into the dataset.

1 Introduction

Chemical profiling of illicit drugs has become an important field in crime mapping in recent years. While traditional crime mapping research has focused on criminal events, i.e., the analysis of spatial and temporal events with traditional statistical methods, the analysis of the chemical composition of drug samples can reveal important information about the evolution and dynamics of the illicit drug market. As described in [2], many types of substances can be found in a cocaine sample seized from a street dealer. Among those are the main constituents of the drug itself, but also chemical residues of the fabrication process and cutting agents used to dilute the final product. Each of these can provide information about a certain stage of drug processing, from the growth conditions of the original plant to street distribution. This study focuses on the main constituents of cocaine, which are enumerated in Section 3.

This work was supported by the Swiss National Science Foundation (grant no. ). S. Kollias et al. (Eds.): ICANN 2006, Part II, LNCS 4132, © Springer-Verlag Berlin Heidelberg 2006.

2 Related Work

A preliminary study was made by the same authors in [1], where heroin data was used. PCA, clustering and classification algorithms (MLP, PNN, RBF networks and k-nearest neighbors) were successfully applied. However, heroin data has fewer variables (6 main constituents), which makes it more amenable to reduction to a few features. A thorough review of the field of chemical drug profiling can be found in Guéniat and Esseiva [2]. In this book, the authors tested several statistical methods for heroin and cocaine profiling. Among other methods, they mainly used similarity measures between samples to determine the main data classes. A methodology based on the squared cosine function as an intercorrelation measure is explained in further detail in Esseiva et al. [3]. Also, principal component analysis (PCA) and soft independent modelling of class analogies (SIMCA) were applied for dimensionality reduction and supervised classification. A radial basis function network trained on the processed data showed encouraging results. The classes used for classification were based solely on indices of chemical similarity found between data points. This methodology was further developed by the same authors in [4].

Another type of data was studied by Madden and Ryder [5]: Raman spectroscopy obtained from solid mixtures containing cocaine. The goal was to predict, based on the Raman spectrum, the cocaine concentration in a solid using k-nearest neighbors, neural networks and partial least squares. They also used a genetic algorithm to perform feature selection. However, their study was constrained by a very limited number of experimental samples, even though results were good. Moreover, their experimental method of sample analysis is fundamentally different from the one used in this study (gas chromatography). Similarly, Raman spectroscopy data was studied in [6] using support vector machines with RBF and polynomial kernels, KNN, the C4.5 decision tree and a naive Bayes classifier. The goal of the classification algorithm was to discriminate samples containing acetaminophen (used as a cutting agent) from those that do not. The RBF-kernel SVM outperformed all the other algorithms on a dataset of 217 samples using 22-fold cross-validation.

3 The Data

The data has 13 initial features, i.e., the 13 main chemical components of cocaine, measured by peak areas on the spectrum obtained for each sample:

1. Cocaine
2. Tropacocaine
3. Benzoic acid
4. Norcocaine
5. Ecgonine

6. Ecgonine methyl ester
7. N-formylcocaine
8. Trans-cinnamic acid
9. Anhydroecgonine
10. Anhydroecgonine methyl ester
11. Benzoylecgonine
12. Cis-cinnamoylecgonine methyl ester
13. Trans-cinnamoylecgonine methyl ester

Time is also implicitly considered in ST-isomap. Five dimensionality reduction algorithms are used: standard principal component analysis, kernel PCA [7], locally linear embedding (LLE) [8], isomap [9] and spatio-temporal isomap [10]. The latter is used to detect any relationship in the temporal evolution of the drug's chemical composition, given that the analyses were ordered sequentially by date of seizure for that experiment.

Every sample has been normalized by dividing each variable by the total area of the peaks of the chromatogram for that sample, every peak being associated with one chemical substance. This normalization is common practice in chemometrics and aims at accounting for variation in the purity of samples, i.e., the concentration of pure cocaine in the sample. 9500 samples were considered. It is worth noting that a dataset of this size is rather unusual, due to the restricted availability of this type of data.

4 Methodology and Results

Due to the size of the dataset (9500 samples, 13 variables), the methods involving the computation of a Gram matrix or distance matrix were repeated several times with random subsets of the data of 50% of its initial size. All the experiments were done in Matlab. The kernel PCA implementation was taken from the pattern classification toolbox by Stork and Yom-Tov [11], which implements algorithms described in Duda et al. [12]. LLE, isomap and ST-isomap implementations were provided by the respective authors of the algorithms.

4.1 Principal Component Analysis

Following normalization and centering of the data, a simple PCA was performed. The eigenvalues seem to increase linearly in absolute value, and a subset of at least six variables is necessary to explain 80% of the data variability. Fig. 1 shows the residual variance vs. the number of components in the subset. Given that the data can be reduced at most to 6 or 7 components, the results obtained with PCA are not convincing and suggest the use of methods for detecting nonlinear structures, i.e., no simple linear structure seems to lie in the high-dimensional space. As an indication, the first two principal components are illustrated in Fig. 2.
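As a minimal illustration of this pipeline (peak-area normalization followed by PCA), the following Python/scikit-learn sketch can be considered. The array `peaks` is a hypothetical placeholder for the raw chromatogram peak areas; the original experiments were run in Matlab, so this is a reconstruction under assumptions, not the authors' code.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical raw data: one row per seizure sample, one column per
# chemical component (13 peak areas from gas chromatography).
rng = np.random.default_rng(0)
peaks = rng.gamma(shape=2.0, scale=1.0, size=(9500, 13))  # placeholder data

# Normalize each sample by the total peak area of its chromatogram,
# as described above (accounts for varying cocaine purity).
X = peaks / peaks.sum(axis=1, keepdims=True)

# PCA on the normalized data (scikit-learn centers it internally).
pca = PCA(n_components=13)
pca.fit(X)

# Residual variance after keeping k components; the paper reports that
# at least six components are needed to explain 80% of the variability.
cum = np.cumsum(pca.explained_variance_ratio_)
for k, c in enumerate(cum, start=1):
    print(f"{k} components: residual variance = {1 - c:.3f}")
```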

Fig. 1. Residual variance vs. number of components

Fig. 2. The two main principal components

4.2 Kernel PCA

Kernel PCA was introduced by Schölkopf et al. [7] and aims at performing a PCA in feature space, where the nonlinear manifold becomes linear, using the kernel trick. KPCA is thus a simple yet very powerful technique for learning nonlinear structures. Using the Gram matrix K, defined by a positive semidefinite kernel (usually linear, polynomial or Gaussian), rather than the empirical covariance matrix, and knowing that, as for PCA, the new variables can be expressed as the product of the eigenvectors of the covariance matrix and the data, the nonlinear projection can be expressed as:

$(V^k \cdot \Phi(x)) = \sum_{i=1}^{N} \alpha_i^k K(x_i, x)$  (1)

where N is the number of data points and $\alpha_i^k$ is the i-th component of the eigenvector $\alpha^k$ of the Gram matrix corresponding to the eigenvector $V^k$ of the covariance matrix in feature space, which never needs to be computed explicitly. The radial basis function kernel provided the best results (among linear, polynomial and Gaussian kernels), using a Gaussian width of 0.1. Fig. 3 shows the two-dimensional manifold obtained with KPCA. Unlike PCA, a coherent structure is recognizable here, and it seems that two nonlinear features reasonably account for the variation in the whole dataset.

Fig. 3. Two-dimensional embedding with kernel PCA

4.3 Locally Linear Embedding

LLE [8] aims at constructing a low-dimensional manifold by building local linear models in the data. Each point is embedded in the lower-dimensional coordinate system as a linear combination of its neighbors:

$\hat{X}_i = \sum_{j \in N_k(X_i)} W_{ij} X_j$  (2)

where $N_k(X_i)$ is the neighborhood of the point $X_i$, of size k. The quality of the resulting projection is measured by the squared difference between the original point and its projection. The main parameter to tune is the number k of neighbors used for the projection. Values from 3 to 50 were tested, and the setting k = 40 provided the best resulting manifold, even though this neighborhood value is unusually large. Fig. 4 shows the three-dimensional embedding obtained. As with KPCA, a structure can be recognized. However, it is not as distinct, which suggests that LLE cannot represent the underlying manifold as easily as KPCA. A sketch of both embeddings follows.
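The sketch below reproduces the two embeddings just described with scikit-learn rather than the Matlab toolboxes the authors used. The placeholder data and, in particular, the mapping from a "Gaussian width" of 0.1 to scikit-learn's gamma parameter are assumptions: if the width is the sigma of exp(-||x - x'||^2 / (2 sigma^2)), then gamma = 1/(2 sigma^2) = 50.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.manifold import LocallyLinearEmbedding

# Placeholder for the normalized 13-variable data (see the earlier sketch);
# a small random subset keeps the example fast.
rng = np.random.default_rng(0)
peaks = rng.gamma(shape=2.0, scale=1.0, size=(500, 13))
X = peaks / peaks.sum(axis=1, keepdims=True)

# Kernel PCA with an RBF kernel; gamma = 50 assumes the width/gamma
# correspondence described above.
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=50.0)
Z_kpca = kpca.fit_transform(X)  # two nonlinear features (cf. Fig. 3)

# LLE with the unusually large neighborhood k = 40 reported above.
lle = LocallyLinearEmbedding(n_neighbors=40, n_components=3)
Z_lle = lle.fit_transform(X)    # three-dimensional embedding (cf. Fig. 4)

print(Z_kpca.shape, Z_lle.shape)  # (500, 2) (500, 3)
```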

Fig. 4. Three-dimensional embedding with LLE

4.4 Isomap

Isomap [9] is a global method for dimensionality reduction. It uses the classical linear method of multidimensional scaling (MDS) [13], but with geodesic rather than Euclidean distances. The geodesic distance between two points is the length of the shortest path along the manifold. Indeed, the Euclidean distance does not appropriately estimate the distance between two points lying on a nonlinear manifold. However, it is usually accurate locally, i.e., between neighboring points. Isomap can therefore be summarized as:

1. Determination of every point's nearest neighbors (using Euclidean distances);
2. Construction of a graph connecting every point to its nearest neighbors;

3. Calculation of the shortest path on the graph between every pair of points;
4. Application of multidimensional scaling to the resulting (geodesic) distances.

The application of this algorithm to the chemical variables also provided good results compared to PCA. As for LLE, the number of neighbors k was studied and set to 5. As an indication, Fig. 5 shows the residual variance for subsets of 1 to 10 components; the residual variance with only one component is much lower than for PCA. Fig. 6 illustrates the two-dimensional embedding. From this figure, it appears that the underlying structure is captured better than with LLE, which may suggest that isomap is more effective on this dataset. A minimal sketch of the procedure follows the figure captions.

Fig. 5. Residual variance vs. Isomap dimensionality

Fig. 6. Two-dimensional embedding with isomap
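The four steps above are bundled in scikit-learn's Isomap estimator, used in the hedged sketch below. The placeholder data and the use of the estimator's reconstruction error as a stand-in for the residual-variance curve of Fig. 5 are assumptions, not the paper's exact statistic.

```python
import numpy as np
from sklearn.manifold import Isomap

# Placeholder normalized data (see the PCA sketch above).
rng = np.random.default_rng(0)
peaks = rng.gamma(shape=2.0, scale=1.0, size=(500, 13))
X = peaks / peaks.sum(axis=1, keepdims=True)

# Isomap internally performs the four steps listed above: k-NN search,
# neighborhood graph, all-pairs shortest paths, classical MDS.
for d in range(1, 11):
    iso = Isomap(n_neighbors=5, n_components=d)
    Z = iso.fit_transform(X)
    # reconstruction_error() is only a proxy for the residual variance
    # plotted in Fig. 5.
    print(f"dim {d}: reconstruction error = {iso.reconstruction_error():.4f}")
```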

4.5 Spatio-temporal Isomap

It is well known in the crime research community that time series analysis often yields patterns that reflect police activity rather than underlying criminal behavior. This is especially true in drug profiling research, where police seizures can vary in time independently of criminal activity. On the other hand, for data such as burglaries, time series analysis could prove more effective, since the vast majority of events are actually reported. Methods assuming sequential rather than time-referenced data are therefore perhaps more promising in drug profiling for capturing true underlying patterns rather than sampling patterns. Spatio-temporal isomap [10], presented by Jenkins and Matarić, is an extension of isomap for the analysis of sequential data. Here, the data is of course feature-temporal rather than spatio-temporal. The number of neighbors and the obtained embedding are the same as with isomap. However, the feature-temporal distance matrix, shown in Fig. 7, reveals that regularities are present in the dataset. Given that the samples cover a period of several years, this data could be used from a predictive point of view and could help in understanding the organization of distribution networks. This remains the subject of future study.

Fig. 7. Feature-temporal distance matrix

4.6 Robustness Assessment

Following these results, the robustness of the two best-suited methods (KPCA and isomap) was tested using a method similar to that used in [14]. Indeed, few quantitative criteria exist to assess the quality of dimensionality reduction methods, since the reconstruction of patterns in input space is not straightforward, which limits our ability to measure the accuracy of a given algorithm. The algorithm used follows this outline (a sketch is given after the table):

1. Randomly divide the dataset D into three partitions: F, P1 and P2.
2. Construct embeddings using F ∪ P1 and F ∪ P2.
3. Compute the mean squared difference (MSD) between the two embeddings obtained for F.
4. Repeat the previous steps for a fixed number of iterations.

The embeddings were constructed 15 times for kernel PCA and isomap, and the results are summarized in Table 1. It can be observed that kernel PCA is considerably more stable than isomap here. Isomap, being based on a graph of nearest neighbors, may be more sensitive to random variations in the dataset and could therefore lead to different results with different sets of observations of a given phenomenon.

Table 1. Normalized mean squared difference for KPCA and isomap

Algorithm     MSD    std (MSD)
Kernel PCA
Isomap
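A hedged Python sketch of this stability test follows, using kernel PCA as the embedding. Since the paper does not specify how the two embeddings of F were made comparable, scipy's Procrustes analysis is used here to remove arbitrary rotation, reflection and scale before computing a normalized squared difference; that alignment step and all names are assumptions.

```python
import numpy as np
from scipy.spatial import procrustes
from sklearn.decomposition import KernelPCA

def stability_msd(X, make_embedder, n_iter=15, seed=0):
    """Normalized MSD between embeddings of a fixed subset F obtained
    from two overlapping training sets F+P1 and F+P2 (outline above)."""
    rng = np.random.default_rng(seed)
    msds = []
    for _ in range(n_iter):
        idx = rng.permutation(len(X))
        F, P1, P2 = np.array_split(idx, 3)  # three random partitions
        Z1 = make_embedder().fit_transform(X[np.concatenate([F, P1])])[:len(F)]
        Z2 = make_embedder().fit_transform(X[np.concatenate([F, P2])])[:len(F)]
        # Procrustes alignment makes the two embeddings of F comparable;
        # `disparity` is a normalized squared difference (an assumption,
        # not necessarily the statistic reported in Table 1).
        _, _, disparity = procrustes(Z1, Z2)
        msds.append(disparity)
    return np.mean(msds), np.std(msds)

rng = np.random.default_rng(0)
peaks = rng.gamma(2.0, 1.0, size=(600, 13))          # placeholder data
X = peaks / peaks.sum(axis=1, keepdims=True)
mean_msd, std_msd = stability_msd(
    X, lambda: KernelPCA(n_components=2, kernel="rbf", gamma=50.0))
print(f"KPCA: MSD = {mean_msd:.4f} +/- {std_msd:.4f}")
```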

5 Conclusion

Five methods of dimensionality reduction were applied to the problem of chemical profiling of cocaine. The application of PCA showed that linear methods for feature extraction have serious limits in this field of application. Kernel PCA, isomap, locally linear embedding and ST-isomap demonstrated the presence of simple nonlinear structures that were not detected by conventional PCA. Kernel PCA and isomap gave the best results in terms of an interpretable set of features; however, kernel PCA proved more robust than isomap. Of course, research by experts in drug profiling will still have to confirm the relevance of the obtained results and provide a practical interpretation. Further research will aim at selecting appropriate methods for determining classes on these low-dimensional structures. This clustering task will enable researchers in crime science to determine whether distinct production or distribution networks can be brought to light by analyzing the data clusters obtained from the chemical composition of drug seizures. Regarding sequential data, other methods could also be tested, particularly hidden Markov models.

References

1. F. Ratle, A.L. Terrettaz, M. Kanevski, P. Esseiva, O. Ribaux, Pattern analysis in illicit heroin seizures: a novel application of machine learning algorithms, Proc. of the 14th European Symposium on Artificial Neural Networks, d-side publi., 2006.
2. O. Guéniat, P. Esseiva, Le Profilage de l'Héroïne et de la Cocaïne, Presses polytechniques et universitaires romandes, Lausanne.
3. P. Esseiva, L. Dujourdy, F. Anglada, F. Taroni, P. Margot, A methodology for illicit drug intelligence perspective using large databases, Forensic Science International, 132.
4. P. Esseiva, F. Anglada, L. Dujourdy, F. Taroni, P. Margot, E. Du Pasquier, M. Dawson, C. Roux, P. Doble, Chemical profiling and classification of illicit heroin by principal component analysis, calculation of inter sample correlation and artificial neural networks, Talanta, 67.
5. M.G. Madden, A.G. Ryder, Machine Learning Methods for Quantitative Analysis of Raman Spectroscopy Data, Proceedings of the International Society for Optical Engineering (SPIE 2002), 4876.
6. M.L. O'Connell, T. Howley, A.G. Ryder, M.G. Madden, Classification of a target analyte in solid mixtures using principal component analysis, support vector machines, and Raman spectroscopy, Proceedings of the International Society for Optical Engineering (SPIE 2005).
7. B. Schölkopf, A. Smola, K.R. Müller, Nonlinear component analysis as a kernel eigenvalue problem, Neural Computation 10, 1998.

8. S.T. Roweis, L.K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290, 2000.
9. J.B. Tenenbaum, V. de Silva, J.C. Langford, A global geometric framework for nonlinear dimensionality reduction, Science 290, 2000.
10. O.C. Jenkins, M.J. Matarić, A spatio-temporal extension to isomap nonlinear dimension reduction, Proc. of the 21st International Conference on Machine Learning, 2004.
11. D.G. Stork, E. Yom-Tov, Computer Manual in MATLAB to accompany Pattern Classification, Wiley, Hoboken (NJ), 2004.
12. R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, 2nd Edition, Wiley, New York, 2001.
13. J.B. Kruskal, M. Wish, Multidimensional Scaling, SAGE Publications, 1978.
14. Y. Bengio, J.F. Paiement, P. Vincent, O. Delalleau, N. Le Roux, M. Ouimet, Out-of-sample extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering, Advances in Neural Information Processing Systems 16, 2004.
