Metabolomic Data Analysis with MetaboAnalyst


User ID: guest (April 14)

1 Data Processing and Normalization

1.1 Reading and Processing the Raw Data

MetaboAnalyst accepts a variety of data types generated in metabolomic studies, including compound concentration data, binned NMR/MS spectra data, NMR/MS peak list data, as well as MS spectra (NetCDF, mzXML, mzData). Users need to specify the data type when uploading their data so that MetaboAnalyst can select the correct algorithm to process it. The R scripts datautils.r and processing.r are required to read in and process the uploaded data.

1.1.1 Reading Binned Spectral Data

The binned spectra data should be uploaded in comma separated values (.csv) format. Samples can be in rows or columns, with class labels immediately following the sample IDs. The uploaded file is in comma separated values (.csv) format, with samples in rows and features in columns. The uploaded data file contains a 50 (samples) by 200 (spectra bins) data matrix.

1.1.2 Filtering Baseline Noises

A significant proportion of bins contain close-to-zero values that come from baseline noise. These values prevent some algorithms from working properly and should be excluded before further data analysis. MetaboAnalyst uses a simple linear filter based on the maximal value of each bin. The default cut-off threshold removes the lowest 25% of spectra bins. Please see Figure 1 for a summary graph. A total of 51 bins were excluded based on the selected cut-off threshold.

Figure 1: Filtering baseline noise for binned spectra. The bars represent the maximum intensity of each bin; with the current cut-off, 149 bins are left.
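For illustration, the filtering step can be sketched in a few lines of R. This is a minimal sketch of a quantile-based max-intensity filter under assumed names (filter.baseline, frac); it is not the actual code in processing.r.

```r
# Sketch: drop the fraction of bins whose maximum intensity falls below
# the chosen quantile of all bin maxima (default: lowest 25%).
filter.baseline <- function(dat, frac = 0.25) {
  # dat: numeric matrix, samples in rows, spectra bins in columns
  bin.max <- apply(dat, 2, max, na.rm = TRUE)
  cutoff  <- quantile(bin.max, probs = frac)
  dat[, bin.max > cutoff, drop = FALSE]
}
```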

1.1.3 Data Integrity Check

Before data analysis, a data integrity check is performed to make sure that all the necessary information has been collected. The class labels must be present and contain only two classes. If samples are paired, the class labels must run from -n/2 to -1 for one group and from 1 to n/2 for the other group (n is the sample number and must be an even number). Class labels with the same absolute value are assumed to be pairs. Compound concentration or peak intensity values must all be non-negative numbers.

1.1.4 Missing Value Imputation

Too many zero or missing values will cause difficulties for downstream analysis. MetaboAnalyst offers several different methods for this purpose. The default method replaces all missing and zero values with a small value (half of the minimum positive value in the original data), assumed to be the detection limit. The rationale of this approach is that most missing values are caused by low-abundance metabolites (i.e., below the detection limit). In addition, since zero values may cause problems for data normalization (e.g., log transformation), they are also replaced with this small value. Users can also specify other methods, such as replacement by the mean/median, or use the Probabilistic PCA (PPCA), Bayesian PCA (BPCA), or Singular Value Decomposition (SVD) methods to impute the missing values [1]. Please choose the one that is most appropriate for your data. Table 1 summarizes the results of the data processing steps. Missing variables were replaced with a small value.

[1] Stacklies W, Redestig H, Scholz M, Walther D, Selbig J. pcaMethods: a Bioconductor package providing PCA methods for incomplete data. Bioinformatics 2007, 23(9):1164-1167.

Table 1: Summary of data processing results, listing for each sample (patients P002-P113, controls C002-C034) the number of positive features, missing/zero features, baseline-filtered features, and processed features.
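As a concrete illustration, the default half-minimum replacement described above can be sketched in R as follows. This is a minimal sketch under the stated assumption (a samples-by-features numeric matrix), not the actual MetaboAnalyst implementation.

```r
# Sketch of the default imputation: treat zeros as missing, then replace
# all missing entries with half the smallest positive value in the data
# (taken as an estimate of the detection limit).
impute.halfmin <- function(dat) {
  dat[dat == 0] <- NA
  lod <- min(dat[dat > 0], na.rm = TRUE) / 2
  dat[is.na(dat)] <- lod
  dat
}
```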

1.2 Data Normalization

The data is stored as a table with one sample per row and one variable (bin/peak/metabolite) per column. There are two types of normalization. Row-wise normalization aims to make each sample (row) comparable to the others (e.g., urine samples with different dilution effects). Column-wise normalization aims to make each variable (column) comparable in magnitude to the others; this procedure is useful when variables are of very different orders of magnitude. The normalization options are:

1. Row-wise normalization:
   - Normalization by the sum
   - Normalization by a reference sample (probabilistic quotient normalization) [2]
   - Normalization by a reference feature (e.g., creatinine, internal control)
   - Sample-specific normalization (e.g., normalize by dry weight, volume)

2. Column-wise normalization:
   - Log transformation (log 2)
   - Unit scaling (mean-centered and divided by the standard deviation of each variable)
   - Pareto scaling (mean-centered and divided by the square root of the standard deviation of each variable)
   - Range scaling (mean-centered and divided by the value range of each variable)

The R script normalization.r is required. Figure 2 shows the effects before and after normalization.

[2] Dieterle F, Ross A, Schlotterbeck G, Senn H. Probabilistic quotient normalization as robust method to account for dilution of complex biological mixtures. Application in 1H NMR metabonomics. Anal Chem 2006, 78(13):4281-4290.

Figure 2: Box plots and kernel density plots before and after normalization. The box plots show at most 80 features due to space limits; the density plots are based on all samples. Row-wise normalization: normalization to constant sum. Column-wise normalization: log normalization.
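The combination reported in Figure 2 (constant-sum row normalization followed by a log transform) can be sketched in R as below; the function names are placeholders, and Pareto scaling is included only for comparison with the option list above.

```r
# Sketch of normalization steps (samples in rows, variables in columns).
norm.rowsum <- function(dat) dat / rowSums(dat)            # constant sum
norm.log    <- function(dat) log2(dat)                     # log transformation
norm.pareto <- function(dat) {                             # Pareto scaling
  scale(dat, center = TRUE, scale = sqrt(apply(dat, 2, sd)))
}

dat.norm <- norm.log(norm.rowsum(dat))
```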

2 Statistical and Machine Learning Data Analysis

MetaboAnalyst offers a variety of methods commonly used in metabolomic data analyses, including:

1. Univariate analysis methods:
   - Fold change analysis
   - t-tests
   - Volcano plot

2. Dimension reduction methods:
   - Principal Component Analysis (PCA)
   - Partial Least Squares - Discriminant Analysis (PLS-DA)

3. Robust feature selection methods from microarray studies:
   - Significance Analysis of Microarrays (SAM)
   - Empirical Bayesian Analysis of Microarrays (EBAM)

4. Clustering analysis:
   - Hierarchical clustering (dendrogram, heatmap)
   - Partitional clustering (K-means clustering, Self-Organizing Map (SOM))

5. Supervised classification and feature selection methods:
   - Random Forest
   - Support Vector Machine (SVM)

2.1 Principal Component Analysis (PCA)

PCA is an unsupervised method that aims to find the directions that best explain the variance in a data set (X) without referring to the class labels (Y). The data are summarized into far fewer variables called scores, which are weighted averages of the original variables; the weighting profiles are called loadings. The PCA analysis is performed using the prcomp function, whose calculation is based on singular value decomposition. The R script chemometrics.r is required.

Figure 3 shows pairwise score plots providing an overview of the separation patterns among the most significant PCs; Figure 4 is the scree plot showing the variance explained by the selected PCs; Figure 5 shows the 2-D score plot between selected PCs; Figure 6 shows the 3-D score plot between selected PCs; Figure 7 shows the loading plot for the selected PCs; Figure 8 shows the biplot between the selected PCs.

Figure 3: Pairwise score plots between the selected PCs. The explained variance of each PC is shown in the corresponding diagonal cell.
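A minimal sketch of this step in R, assuming dat.norm is the normalized samples-by-bins matrix from above (prcomp is the base R function the report names; the plotting choices are illustrative):

```r
# SVD-based PCA on the normalized data; scaling was already handled by
# the normalization step, so only centering is applied here.
pca <- prcomp(dat.norm, center = TRUE, scale. = FALSE)
summary(pca)                      # variance explained by each PC
plot(pca$x[, 1], pca$x[, 2],      # 2-D score plot (cf. Figure 5)
     xlab = "PC 1", ylab = "PC 2")
head(pca$rotation[, 1:2])         # loadings on the first two PCs
```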

Figure 4: Scree plot showing the variance explained by the PCs. The green line on top shows the accumulated variance explained; the blue line underneath shows the variance explained by each individual PC.

Figure 5: Score plot between the selected PCs (PC 1, 96.7%; PC 2, 1.2%). The explained variances are shown in brackets.

Figure 6: 3-D score plot between the selected PCs (PC 1, 96.7%; PC 2, 1.2%; PC 3, 0.4%). The explained variances are shown in brackets.

Figure 7: Loading plot for the selected PCs.

Figure 8: PCA biplot between the selected PCs. Note: you may want to test different centering and scaling normalization methods for the biplot to be displayed properly.

2.2 Partial Least Squares - Discriminant Analysis (PLS-DA)

PLS is a supervised method that uses multivariate regression techniques to extract, via linear combinations of the original variables (X), the information that can predict the class membership (Y). The PLS regression is performed using the plsr function provided by the R pls package [3]. The classification and cross-validation are performed using the corresponding wrapper function offered by the caret package [4].

To assess the significance of class discrimination, a permutation test was performed. In each permutation, a PLS-DA model was built between the data (X) and the permuted class labels (Y), using the optimal number of components determined by cross-validation for the model based on the original class assignment. The ratio of the between-group sum of squares to the within-group sum of squares (B/W ratio) for the class assignment prediction of each model was calculated. If the B/W ratio of the original class assignment falls within the distribution of B/W ratios based on the permuted class assignments, the contrast between the two class assignments cannot be considered statistically significant.

There are two variable importance measures in PLS-DA. The first, Variable Importance in Projection (VIP), is a weighted sum of squares of the PLS loadings, taking into account the amount of explained Y-variation in each dimension. The other importance measure is based on the weighted sum of the PLS regression coefficients [5]; the weights are a function of the reduction of the sums of squares across the number of PLS components.

The R script chemometrics.r is required. Figure 9 shows the overview of score plots; Figure 10 shows the 2-D score plot between selected components; Figure 11 shows the 3-D score plot between selected components; Figure 12 shows the loading plot between the selected components; Figure 13 shows the classification performance with different numbers of components; Figure 14 shows the important features identified by PLS-DA; Figure 15 shows the permutation test results for model validation.

[3] Ron Wehrens and Bjorn-Helge Mevik. pls: Partial Least Squares Regression (PLSR) and Principal Component Regression (PCR). R package, 2007.
[4] Max Kuhn, with contributions from Jed Wing, Steve Weston and Andre Williams. caret: Classification and Regression Training. R package, 2008.
[5] Bijlsma et al. Large-scale Human Metabolomics Studies: A Strategy for Data (Pre-) Processing and Validation. Anal Chem, 2006.
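A minimal sketch of the PLS-DA fit with the pls package, assuming X is the normalized data matrix and cls the two-level class factor; the dummy coding and component count here are illustrative, not the report's exact settings:

```r
library(pls)

# PLS-DA reduces to PLS regression on a dummy-coded class response.
y   <- as.numeric(cls) - 1                 # e.g., control = 0, patient = 1
fit <- plsr(y ~ X, ncomp = 5, validation = "LOO")

RMSEP(fit)            # cross-validated error for 1..5 components
scores(fit)[, 1:2]    # component scores for a 2-D score plot (cf. Figure 10)
loadings(fit)[, 1:2]  # loadings (cf. Figure 12)
```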

Figure 9: Pairwise score plots between the selected components. The explained variance of each component is shown in the corresponding diagonal cell.

Figure 10: Score plot between the selected components (component 1, 34%; component 2, 10.9%). The explained variances are shown in brackets.

Figure 11: 3-D score plot between the selected components (component 1, 34%; component 2, 10.9%; component 3, 4.9%). The explained variances are shown in brackets.

Figure 12: Loading plot between the selected components.

Figure 13: PLS-DA classification using different numbers of components. The red circle indicates the best classifier.

Figure 14: Important features identified by PLS-DA (top 15). The left panel shows the features ranked by VIP score; the right panel shows the features ranked by their regression coefficients.

Figure 15: PLS-DA model validation by permutation tests. The top panel shows the B/W ratio calculated for both the original and the permuted PLS-DA models. The bottom panel shows the distribution of B/W ratios from the random class assignments. The green line (top) and green area (bottom) mark the 95% confidence regions of B/W for the permuted data.

2.3 Hierarchical Clustering

In (agglomerative) hierarchical cluster analysis, each sample begins as a separate cluster and the algorithm proceeds to combine them until all samples belong to one cluster. Two parameters need to be considered when performing hierarchical clustering. The first is the similarity measure: Euclidean distance, Pearson's correlation, or Spearman's rank correlation. The other is the clustering algorithm, including average linkage (clustering uses the average distance between all pairs of observations in the two groups), complete linkage (clustering uses the farthest pair of observations between the two groups), single linkage (clustering uses the closest pair of observations), and Ward's linkage (clustering to minimize the increase in the within-cluster sum of squares when two clusters are merged). A heatmap is often presented as a visual aid in addition to the dendrogram.

Hierarchical clustering is performed with the hclust function in the stats package. The R script clustering.r is required. Figure 16 shows the clustering result in the form of a dendrogram; Figure 17 shows it in the form of a heatmap.
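A minimal sketch of the reported settings (Euclidean distance, Ward's linkage) with base R, assuming dat.norm is the normalized data matrix; note that in R versions from the report's era the Ward method was named "ward" rather than "ward.D2":

```r
# Hierarchical clustering with Euclidean distance and Ward's linkage.
d  <- dist(dat.norm, method = "euclidean")
hc <- hclust(d, method = "ward.D2")       # "ward" in older R versions
plot(hc)                                  # dendrogram (cf. Figure 16)

# Heatmap using the same distance/linkage for both rows and columns.
heatmap(as.matrix(dat.norm),
        distfun   = dist,
        hclustfun = function(x) hclust(x, method = "ward.D2"))
```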

Figure 16: Clustering result shown as a dendrogram (distance measure: Euclidean; clustering algorithm: Ward).

Figure 17: Clustering result shown as a heatmap (distance measure: Euclidean; clustering algorithm: Ward).

2.4 Random Forest (RF)

Random Forest is a supervised learning algorithm suitable for high-dimensional data analysis. It uses an ensemble of classification trees, each of which is grown by random feature selection from a bootstrap sample at each branch. Class prediction is based on the majority vote of the ensemble. RF also provides other useful information, such as the OOB (out-of-bag) error and variable importance measures. During tree construction, about one third of the instances are left out of the bootstrap sample. This OOB data is then used as a test sample to obtain an unbiased estimate of the classification error (OOB error). Variable importance is evaluated by measuring the increase in the OOB error when the values of that variable are permuted.

RF analysis is performed using the randomForest package [6]. The R script classification.r is required. Table 2 shows the confusion matrix of the random forest. Figure 18 shows the cumulative error rates of the random forest analysis for the given parameters. Figure 19 shows the important features ranked by random forest. The OOB error is 0.04.

Figure 18: Cumulative error rates of the Random Forest classification. The overall error rate is shown as the black line; the red and green lines represent the error rates for each class.

[6] Andy Liaw and Matthew Wiener. Classification and Regression by randomForest. R News, 2002.
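A minimal sketch with the randomForest package, assuming X and cls as before; the seed and tree count are illustrative, not the report's settings:

```r
library(randomForest)

set.seed(42)                              # illustrative seed
rf <- randomForest(x = X, y = cls, ntree = 500, importance = TRUE)

rf$confusion                              # confusion matrix (cf. Table 2)
rf$err.rate[rf$ntree, "OOB"]              # final OOB error
varImpPlot(rf, type = 1)                  # mean decrease in accuracy (cf. Figure 19)
```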

Figure 19: Significant features identified by Random Forest (top 15). The features are ranked by the mean decrease in classification accuracy when their values are permuted.

Table 2: Random Forest classification performance: the confusion matrix (rows and columns: control, patient) together with the per-class error rate (class.error).

3 Data Annotation

Please be advised that MetaboAnalyst also supports metabolomic data annotation. For NMR, MS, or GC-MS peak list data, users can perform peak identification by searching the corresponding libraries. For compound concentration data, users can perform pathway mapping. These tasks require a lot of manual effort and are not performed by default.

The report was generated on Tue Apr 14 21:30 with R version ( ) on an i386-redhat-linux-gnu platform.

Thank you for using MetaboAnalyst! For suggestions and feedback please contact Jeff Xia (jianguox@ualberta.ca).
