Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

Similar documents
Metabolomic Data Analysis with MetaboAnalyst

Network Traffic Measurements and Analysis

High-throughput Processing and Analysis of LC-MS Spectra

Feature Selection. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani

An Unsupervised Approach for Combining Scores of Outlier Detection Techniques, Based on Similarity Measures

Exploratory data analysis for microarrays

Understanding Clustering Supervising the unsupervised

Supervised vs unsupervised clustering

Clustering and Visualisation of Data

Chapter 1. Using the Cluster Analysis. Background Information

Practical OmicsFusion

Data Mining Chapter 3: Visualizing and Exploring Data Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Acquisition Description Exploration Examination Understanding what data is collected. Characterizing properties of data.

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University

Midterm Examination CS 540-2: Introduction to Artificial Intelligence

9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology

CANCER PREDICTION USING PATTERN CLASSIFICATION OF MICROARRAY DATA. By: Sudhir Madhav Rao &Vinod Jayakumar Instructor: Dr.

INF 4300 Classification III Anne Solberg The agenda today:

Package stattarget. December 5, 2017

Machine Learning in Biology

Review of feature selection techniques in bioinformatics by Yvan Saeys, Iñaki Inza and Pedro Larrañaga.

Learner Expectations UNIT 1: GRAPICAL AND NUMERIC REPRESENTATIONS OF DATA. Sept. Fathom Lab: Distributions and Best Methods of Display

Measures of Dispersion

Unsupervised Learning and Clustering

Gene Clustering & Classification

Contents Machine Learning concepts 4 Learning Algorithm 4 Predictive Model (Model) 4 Model, Classification 4 Model, Regression 4 Representation

Introduction to Pattern Recognition Part II. Selim Aksoy Bilkent University Department of Computer Engineering

Unsupervised Learning

Hard clustering. Each object is assigned to one and only one cluster. Hierarchical clustering is usually hard. Soft (fuzzy) clustering

ECG782: Multidimensional Digital Signal Processing

Bioinformatics - Lecture 07

Multivariate Analysis

Cluster Analysis. Prof. Thomas B. Fomby Department of Economics Southern Methodist University Dallas, TX April 2008 April 2010

DI TRANSFORM. The regressive analyses. identify relationships

Hierarchical Clustering

Machine Learning with MATLAB --classification

Research Methods for Business and Management. Session 8a- Analyzing Quantitative Data- using SPSS 16 Andre Samuel

Introduction to Data Science. Introduction to Data Science with Python. Python Basics: Basic Syntax, Data Structures. Python Concepts (Core)

Using Machine Learning to Optimize Storage Systems

1) Give decision trees to represent the following Boolean functions:

Data Preprocessing. S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha

Workload Characterization Techniques

Artificial Neural Networks (Feedforward Nets)

Unsupervised Learning and Clustering

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York

Lecture 25: Review I

Preface to the Second Edition. Preface to the First Edition. 1 Introduction 1

Computational Statistics The basics of maximum likelihood estimation, Bayesian estimation, object recognitions

Hierarchical Clustering 4/5/17

Data Preprocessing. Javier Béjar. URL - Spring 2018 CS - MAI 1/78 BY: $\

STATISTICS (STAT) Statistics (STAT) 1

Cluster Analysis for Microarray Data

Network Traffic Measurements and Analysis

Unsupervised Learning : Clustering

Cluster Analysis: Agglomerate Hierarchical Clustering

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor

Chapter 2 Describing, Exploring, and Comparing Data

Statistics 202: Data Mining. c Jonathan Taylor. Week 8 Based in part on slides from textbook, slides of Susan Holmes. December 2, / 1

VIDAEXPERT: DATA ANALYSIS Here is the Statistics button.

Machine Learning and Data Mining. Clustering (1): Basics. Kalev Kask

AN IMPROVED HYBRIDIZED K- MEANS CLUSTERING ALGORITHM (IHKMCA) FOR HIGHDIMENSIONAL DATASET & IT S PERFORMANCE ANALYSIS

CS570: Introduction to Data Mining

Olmo S. Zavala Romero. Clustering Hierarchical Distance Group Dist. K-means. Center of Atmospheric Sciences, UNAM.

Package stattarget. December 23, 2018

Forestry Applied Multivariate Statistics. Cluster Analysis

Accelerometer Gesture Recognition

ECLT 5810 Data Preprocessing. Prof. Wai Lam

Applied Regression Modeling: A Business Approach

Spatial Outlier Detection

Predicting Diabetes and Heart Disease Using Diagnostic Measurements and Supervised Learning Classification Models

10-701/15-781, Fall 2006, Final

Clustering analysis of gene expression data

Analyzing ICAT Data. Analyzing ICAT Data

Analysis of Process and biological data using support vector machines

Facial Expression Classification with Random Filters Feature Extraction

Statistical Pattern Recognition

Regression on SAT Scores of 374 High Schools and K-means on Clustering Schools

Lecture on Modeling Tools for Clustering & Regression

Data Mining. ❷Chapter 2 Basic Statistics. Asso.Prof.Dr. Xiao-dong Zhu. Business School, University of Shanghai for Science & Technology

Resources for statistical assistance. Quantitative covariates and regression analysis. Methods for predicting continuous outcomes.

DATA MINING LECTURE 7. Hierarchical Clustering, DBSCAN The EM Algorithm

Clustering. Chapter 10 in Introduction to statistical learning

How do microarrays work

Data Warehousing and Machine Learning

CS4445 Data Mining and Knowledge Discovery in Databases. A Term 2008 Exam 2 October 14, 2008

Chapter 13 Multivariate Techniques. Chapter Table of Contents

Note Set 4: Finite Mixture Models and the EM Algorithm

Lecture 27: Review. Reading: All chapters in ISLR. STATS 202: Data mining and analysis. December 6, 2017

Data Processing and Analysis in Systems Medicine. Milena Kraus Data Management for Digital Health Summer 2017

Big Data Methods. Chapter 5: Machine learning. Big Data Methods, Chapter 5, Slide 1

Content-based image and video analysis. Machine learning

Pattern Recognition. Kjell Elenius. Speech, Music and Hearing KTH. March 29, 2007 Speech recognition

Dimension reduction : PCA and Clustering

CSE 158. Web Mining and Recommender Systems. Midterm recap

Further Maths Notes. Common Mistakes. Read the bold words in the exam! Always check data entry. Write equations in terms of variables

Clustering Lecture 5: Mixture Model

Applying Supervised Learning

CHAPTER 4: CLUSTER ANALYSIS

Unsupervised Learning

Prepare a stem-and-leaf graph for the following data. In your final display, you should arrange the leaves for each stem in increasing order.

Transcription:

Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

Outline Introduction Data pre-treatment 1. Normalization 2. Centering, scaling, transformation Univariate analysis 1. Student s t-tes 2. Volcano plot Multivariate analysis 1. PCA 2. PLS-DA Machine learning Software packages 2

Results from data processing Next, analysis of the quantitative metabolite information 3

Metabolomics data analysis Goals biomarker discovery by identifying significant features associated with certain conditions Disease diagnosis via classification Challenges Limited sample size Many metabolites / variables Workflow pre-treatment univariate analysis multivariate analysis machine learning 4

Data Pre-treatment 5

Normalization (I) Goals to reduce systematic variation to separate biological variation from variations introduced in the experimental process to improve the performance of downstream statistical analysis Sources of experimental variation sample inhomogeneity differences in sample preparation ion suppression 6

Normalization (II) Approaches Sample-wise normalization: to make samples comparable to each other Feature/variable-wise normalization: to make features more comparable in magnitude to each other. Sample-wise normalization normalize to a constant sum normalize to a reference sample normalize to a reference feature (an internal standard) sample-specific normalization (dry weight or tissue volume) Feature-wise normalization (i.e., centering, scaling, and transformation) 7

Centering, scaling, transformation 8

Centering Converts all the concentrations to fluctuations around zero instead of around the mean of the metabolite concentrations Focuses on the fluctuating part of the data Is applied in combination with data scaling and transformation 9

Scaling Divide each variable by a factor Different variables have a different scaling factor Aim to adjust for the differences in fold differences between the different metabolites. Results in the inflation of small values Two subclasses Uses a measure of the data dispersion Uses a size measure 10

Scaling: subclass 1 Use data dispersion as a scaling factor auto: use the standard deviation as the scaling factor. All the metabolites have a standard deviation of one and therefore the data is analyzed on the basis of correlations instead of covariance. pareto: use the square root of the standard deviation as the scaling factor. Large fold changes are decreased more than small fold changes and thus large fold changes are less dominant compared to clean data. vast: use standard deviation and the coefficient of variation as scaling factors. This results in a higher importance for metabolites with a small relative sd. range: use (max-min) as scaling factors. Sensitive to outliers. 11

Scaling: subclass 2 Use average as scaling factors The resulting values are changes in percentages compared to the mean concentration. The median can be used as a more robust alternative. 12

Transformation Log and power transformation Both reduce large values relatively more than the small values. Log transformation pros: removal of heteroscedasticity cons: unable to deal with the value zero. Power transformation pros: similar to log transformation cons: not able to make multiplicative effects additive 13

Centering, scaling, transformation A. original data B. centering C. auto D. pareto E. range F. vast G. level H. log I. power 14

Log transformation, again Hard to do useful statistical tests with a skewed distribution. A skewed distribution or exponentially decaying distribution can be transformed into a Gaussian distribution by applying a log transformation. http://www.medcalc.org/manual/transforming_data_to_normality.php 15

Univariate vs. multivariate analysis Univariate analysis examines each variable separately. t-tests volcano plot Multivariate analysis considers two or more variables simultaneously and takes into account relationships between variables. PCA: Principle Component Analysis PLS-DA: Partial Least Squares-Discriminant Analysis Univariate analyses are often first used to obtain an overview or rough ranking of potentially important features before applying more sophisticated multivariate analyses. 16

Univariate Statistics t-test volcano plot 17

Univariate statistics A basic way of presenting univariate data is to create a frequency distribution of the individual cases. Due to the Central Limit Theorem, many of these frequency distributions can be modeled as a normal/gaussian distribution. 18

Gaussian distribution The total area underneath each density curve is equal to 1. 1.0 0.8 = 0, = 0, = 0, = 2 = 0.2, 2 = 1.0, 2 = 2 = 1 ϕ(x) = σ mean = µ " x µ % $ ' 2π e # σ & variance = σ 2 standard deviation = σ https://en.wikipedia.org/wiki/normal_distribution 2 2( x) 0.0 0.1 0.2 0.3 0.4 0.6 0.4 0.2 0.0 0 1 2 4 0.1% 2.1% x 34.1% 34.1% 13.6% 13.6% 2.1% 0.1% µ 19

Sample statistics Sample mean: X= 1 n n i=1 X i Sample variance: S 2 = 1 n 1 n i=1 ( X i X) 2 Sample standard deviation: S = S 2 https://en.wikipedia.org/wiki/normal_distribution 20

t-test (I) One-sample t-test: is the sample drawn from a known population? Null hypothesis H 0 : µ = µ 0 Alternative hypethesis H 1 : µ<µ 0 Test statistic: t = x µ 0 s n Sample standard deviation: s = 1 n 1 n i=1 ( x i x) 2 The test statistic t follows a student s t distribution. The distribution has n-1 degrees of freedom. 21

t-test (II): p-value When the null hypothesis is rejected, the result is said to be statistically significant. 22

t-test (III) Two-sample t-test: are the two populations different? Null hypothesis H 0 : µ 1 µ 2 = 0 Alternative hypethesis H 1 : µ 1 µ 2 0 Test statistic: t = x 1 x 2 ( ) ( µ 1 µ 2 ) s 1 2 2 n 1 + s 2 n 2 The two samples should be independent. 23

t-test (IV) Equivalent statements: The p-value is small. The difference between the two populations is unlikely to have occurred by chance, i.e. is statistically significant. 24

t-test (V) The p-value is big. The difference between the two populations are said NOT to be statistically significant. 25

t-test (VI) Paired t-test: what is the effect of a treatment? Measurements made on the same individuals before and after the treatment. Example: Subjects participated in a study on the effectiveness of a certain diet on serum cholesterol levels. Subject Before After Difference 1 201 200-1 2 231 236 +5 3 221 216-5 5 260 243-17 6 228 224-4 7 245 235-10 H 0 : µ d = 0 H a : µ d 0 Test statistic: t = d µ d s d n 26

Volcano plot (I) Plot fold change vs. significance y-axis: negative log of the p-value x-axis: log of the fold change so that changes in both directions (up and down) appear equidistant from the center Two regions of interest: those points that are found towards the top of the plot that are far to either the left- or the righthand side. 27

Volcano plot (II) 28

Multivariate statistics PCA PLS-DA 29

PCA (I) PCA is a statistical procedure to transform a set of correlated variables into a set of linearly uncorrelated variables. 30

PCA (II) The uncorrelated variables are ordered in such a way that the first one accounts for as much of the variability in the data as possible and each succeeding one has the highest variance possible in the remaining variables. These ordered uncorrelated variables are called principle components. By discarding low-variance variables, PCA helps us reduce data dimension and visualize the data. 31

PCA (III) The transformation matrix The transformation p are called the 1 st, 2 nd, and n th 1, p 2,, p n principle components, respectively. 32

PCA (IV) Each original sample is represented by an n-dimensional vector: S before transformation = [ x 1, x 2,, x n ] After the transformation If all of the principle components are kept, then each sample is still represented by an n-dimensional vector: S after transformation = [ y 1, y 2,, y n ] If only m < n principle components are kept, then each sample will be represented by an m-dimensional vector: S after transformation = [ y 1, y 2,, y m ] 33

PC1 PCA (V) y are called scores. For visualization purpose, m is usually chosen to be 2 or 3. As a result, each sample will be represented by a 2- or 3- dimenational point in the score plot. Principal Component Analysis (Scores) PC2 10 0 10 20 30 ko15 wt15 wt16 wt18 wt19 ko22 ko21 wt22 wt21 ko19 ko16 30 20 10 0 10 ko18 34

PCA (VI) Loadings [ p 11, p 12 ],[ p 21, p 22 ],, p n1, p n2 [ ] are denoted as points in the loadings plot 35

PCA (VII) 36 0.06 0.04 0.02 0.00 0.02 0.04 0.06 0.04 0.02 0.00 0.02 0.04 0.06 Principal Component Analysis (Loadings) PC1 PC2 200.1/2924 200.9/3473 202.1/3497 204.9/2590 205/2788 206/2789 207/2744 207.1/2717 213.2/3497 214.1/2519 215.2/3478 215.9/3506 215.9/3170 218.2/3366 219.1/2523 225/3232 225.1/3265 225.1/3151 225.1/3668 226/2962 229.1/3424 231.1/2849 231.1/2528 233/3026 233.1/3025 235/3130 235.1/2716 236.1/2522 236.9/3632 239.1/3252 241.1/3646 241.2/3641 243.1/3542 243.2/3651 244/4103 244.1/2830 244.9/3130 245.1/2882 245.9/3132 246.4/3287 246.7/3295 246.8/3281 247/3490 247.1/2648 247.9/2979 248.2/3240 249.1/3631 250.1/3639 250.1/3674 250.2/3627 250.2/4059 251.1/3880 251.2/3882 254.1/3226 255.1/3668 256.1/3444 256.1/3467 256.1/3652 256.2/3454 256.2/3636 256.2/3662 257.1/3660 257.1/3640 257.2/3647 257.9/3425 258/3497 258/3232 259.1/2802 261.1/3148 264.2/3158 265.1/2620 265.2/3659 265.2/4114 266.1/3624 266.2/3634 267.1/2783 267.2/3628 268.1/3868 268.2/3888 269.1/3886 269.2/3888 269.2/4129 270.1/3891 270.2/3880 270.5/3166 271.1/3879 271.2/3404 271.3/3264 272.1/3123 272.1/2727 273.1/3425 273.1/3655 273.2/3424 273.9/2640 274.2/3431 277/2638 277.2/3845 279/2788 279/2755 280/2788 280.1/3315 280.2/3319 281/2793 281.1/2788 281.2/3672 282.1/2616 282.1/3468 282.2/3471 283.1/3636 283.2/3883 283.2/3640 283.2/3675 283.2/3134 284.1/2716 284.1/3634 284.2/3884 284.2/3627 284.2/3675 285.1/3854 285.1/4129 285.1/2716 285.2/3861 285.2/4127 285.2/3637 286/2620 286.1/3263 286.1/3283 286.1/3851 286.2/3256 286.2/2855 287/2647 287.1/3256 287.1/2635 287.2/3262 288.1/2794 288.2/2795 288.2/2922 289/2952 289.1/3118 289.1/2718 289.2/3125 291.1/3629 292.1/3492 292.2/3217 292.2/3459 294.1/2577 296.1/2727 296.9/4127 297.1/3848 297.2/3854 297.3/3381 298.1/3182 298.1/3671 298.2/3187 299.2/4126 300.2/3390 301/2787 301.1/3891 301.1/3384 301.1/3681 301.2/3391 301.2/3886 301.2/3656 302/2621 302/2789 302.1/3253 302.1/3395 302.1/2818 302.2/3389 302.3/3901 303/2622 303.1/3238 303.2/3388 304.1/2924 304.1/3236 304.2/3471 305.1/3506 305.1/2926 305.1/2877 305.2/3290 306.1/3506 306.1/2929 306.9/2617 307/2621 307.1/3141 307.1/3503 307.1/3864 308/2623 308.1/3257 308.2/3261 309.1/2923 309.1/3649 309.2/3633 310.1/3463 310.1/2923 310.1/2798 310.2/3635 310.2/3468 311.1/4131 311.1/3080 311.1/3625 311.2/3875 312.1/3000 312.2/3877 312.2/4134 313/2786 313.1/4137 313.1/3290 313.1/3506 313.1/2824 313.2/3290 313.2/4137 314/2786 314.1/3289 314.1/4141 314.1/2767 314.1/3857 314.2/3289 315/2509 315/2782 315/2536 315/3270 315.1/2773 315.1/3653 316.1/3087 316.1/3653 316.2/3089 317/2788 317.2/3492 317.2/3092 317.2/3844 317.3/3942 318.1/3246 319.2/3245 321.1/3641 321.1/3673 322.1/3507 322.1/3394 322.1/2899 322.2/3504 322.2/3635 322.2/3149 323.1/3503 323.1/3391 323.1/3896 323.2/3511 323.2/3496 323.2/3392 323.2/3883 324/3268 324.1/2789 324.1/3509 324.2/3507 324.2/3277 325/3255 325.2/3649 325.2/3238 326.1/3300 326.1/3090 326.2/3420 326.4/4114 327.1/3505 327.1/2927 327.2/3420 327.2/3851 328.1/3508 328.2/3631 328.2/3607 328.2/3376 328.2/3500 329/2921 329/3150 329.1/3499 329.2/3880 329.2/3489 329.2/3611 330.1/3503 330.2/3491 330.2/3166 330.2/3630 330.2/3606 331.1/3498 331.1/3459 331.2/3497 331.2/3453 332/2509 332/2537 332.1/3496 333/2514 333.1/3121 333.1/2548 334/2531 334.2/2811 335/2785 335/3278 336/2785 336.1/3533 336.2/3192 336.2/2995 337/2537 337/2509 337.1/3650 337.2/3438 338/2511 338/2536 338.1/3022 338.1/2789 338.2/3829 338.2/3630 339.1/4124 339.2/3319 339.2/4124 340.1/3328 340.1/3307 340.2/3317 340.2/2758 341.2/3556 343/2684 343/2597 343/2662 343.1/3510 343.1/3650 344/2684 344.2/3350 344.2/3152 344.2/2940 345/2684 345/3238 345.1/2692 345.2/3388 345.2/3852 345.3/3386 346.1/3499 346.1/3275 346.2/3493 346.9/2782 347.1/3499 347.2/3494 347.2/3669 348.1/3420 348.1/3288 348.2/3499 348.2/3430 348.2/3285 349.1/3666 349.1/3421 349.1/3633 349.1/3290 349.2/3278 349.2/3665 349.2/3407 350/3385 350.1/3630 350.2/3626 351.1/3499 351.1/3028 351.2/3503 351.2/3485 351.2/3615 351.9/2564 352.1/3498 352.1/2789 352.2/3490 353/2528 353.1/3498 353.2/3495 354/2681 354.1/2782 354.1/3671 354.2/3599 355.2/3104 355.2/3591 355.2/3567 356.1/3869 356.2/3830 357.1/3469 357.2/3474 357.2/3833 358/2783 358/3958 358.1/2778 358.2/3832 359/3507 359/3225 359.1/3514 359.2/3720 360/2692 360.3/3397 360.8/2915 361/2684 361.1/3167 361.1/2693 361.2/3167 361.9/2922 362/2690 362/2664 362.1/3165 362.1/2685 362.2/3156 362.2/3325 363.1/3259 363.1/3481 363.2/3263 364.1/3264 364.2/3262 364.2/2582 364.2/3655 364.2/3464 364.9/2553 365/2684 365/2595 365/2659 365/3484 365.1/3388 365.2/3404 365.2/3385 366/2684 366/2662 366.1/2788 366.1/3395 366.2/3842 367/2666 367.2/3528 367.2/4132 367.2/3584 368.1/2792 368.2/3529 368.2/2786 368.2/4133 368.3/3863 369.2/4113 370.2/3044 370.2/3918 371.2/3073 371.2/3605 372.1/3626 372.1/3291 375/3071 376.1/3450 376.2/3459 376.2/3722 377/3074 377/3509 377.1/3077 377.2/2997 377.2/3458 378.2/3831 379.1/3327 379.1/3476 379.1/3505 379.2/3336 379.2/3307 380.1/3154 380.1/3203 380.1/3329 380.2/3333 380.2/2971 380.2/3660 381/2686 381/2663 381.1/3154 381.1/3194 381.2/3194 382/2677 382.1/3246 382.2/3788 383.2/3241 383.2/4087 384.2/3999 384.3/4003 385.1/3181 385.2/4134 385.2/3174 385.2/3449 386/2535 386.1/3177 386.2/3179 386.2/4133 386.3/3577 387.2/3243 388/3959 388.1/2685 388.1/2663 388.1/3651 389/3886 389.1/3646 389.2/3349 390.1/3376 390.2/3353 390.3/3050 391.1/3595 391.1/3632 391.1/3611 391.2/3490 391.2/3864 391.2/3610 392.1/3604 392.1/3472 392.1/4127 392.1/3495 392.1/3631 392.2/2929 392.2/3616 394.2/3077 395/4121 395.1/3558 395.2/3719 395.2/3674 395.2/3570 396.1/3348 396.1/3560 396.2/3326 396.2/2868 396.2/3862 397/2780 397.1/3315 397.2/3326 397.2/3877 397.2/2872 398.3/4059 398.3/3636 399/2530 399.1/2799 400.1/2792 401.1/3334 401.2/2862 402.1/2688 402.1/3337 402.1/2672 402.2/3325 403.1/2950 403.1/3329 403.2/3889 404/2707 404.1/2686 404.3/3900 405/2686 405/3251 405.2/3391 405.3/3940 407.2/3501 408.1/2894 408.1/3494 408.2/2576 410.2/3938 410.3/3944 411.2/3944 411.3/3933 412.2/3944 412.2/3369 412.3/4118 413.1/3634 413.3/4116 414.1/3624 414.2/3062 416.1/2686 416.2/2915 417.1/2683 419.1/3062 419.1/3505 419.2/3854 419.2/3062 419.2/3807 419.3/3531 420.1/3786 420.2/2722 421.1/2798 421.1/3265 421.3/3867 422.2/2887 422.3/2572 423.1/3254 423.2/3272 423.2/3253 424.2/2952 424.2/4005 424.2/3155 424.3/4002 425.1/2950 425.2/2955 426.1/3017 426.2/3060 426.9/2692 427/2687 427/2707 427.1/3015 427.3/3269 428.3/3634 429.2/3888 429.2/4080 429.2/3155 429.2/2951 429.2/4031 429.2/4054 429.4/3464 430.1/2686 430.2/3885 430.2/4073 431.1/2685 431.1/3910 431.2/2880 431.2/3884 432.1/2683 432.2/2719 433/3506 433.1/2682 433.2/3810 433.2/2703 434.1/3792 436/3126 436.2/2954 436.2/3631 436.2/3596 436.3/2950 437.1/2789 438.1/3455 438.2/3461 438.2/4074 438.3/4058 438.3/3548 439.1/3456 439.2/3459 439.2/3481 439.3/4054 440.1/3150 440.2/3471 440.2/2854 440.2/3160 440.2/4060 440.3/4059 440.3/3089 441.2/3604 441.3/3495 441.3/3050 442.9/2691 443/2684 444/2666 444.1/2686 444.2/2897 444.2/3883 444.3/3867 445/2684 445.1/2683 447.2/3877 447.2/3859 448.1/3016 448.1/3549 448.2/3862 448.2/3550 449.1/2895 449.1/3290 449.1/2743 449.1/3534 449.2/2739 450.1/3879 452.1/3076 452.1/2550 452.2/2547 452.2/3146 453.2/3074 453.2/2959 454.1/3291 454.1/3997 454.2/3286 454.2/3088 455.1/3290 455.2/3288 456.1/3288 456.2/3289 457.1/3292 457.2/4112 457.2/3293 458.2/3046 458.2/2941 458.3/2945 458.9/2940 459.1/4137 459.2/3047 459.3/3866 460.1/3452 460.2/3450 460.4/3550 461.1/3542 461.2/3912 461.3/3933 462.1/3094 462.2/3273 462.3/4433 463.2/3046 463.3/3826 464.1/3047 464.2/3468 464.2/3044 465.2/3466 466.1/3203 466.2/3470 466.2/3658 466.2/3200 466.2/3695 467.2/3661 467.2/3699 467.3/3460 468.2/3139 468.2/2940 468.2/3718 468.3/2941 469.2/3864 471.3/4112 471.9/2941 472/2954 472.2/3121 473.2/2935 473.2/3290 473.2/3138 473.9/3293 474.1/3075 474.4/3722 475.2/3089 475.2/2563 475.2/2536 475.2/2787 475.2/2725 475.2/2511 475.2/2689 475.2/2665 475.2/2617 475.2/3058 475.2/2830 476.1/3290 476.2/2678 477.1/3288 477.2/3293 478/3033 478.1/3162 478.2/3603 478.2/3639 478.4/3022 479.1/3171 479.2/3599 479.3/3630 480.1/3318 480.1/3336 480.2/3317 480.2/3099 480.3/4131 481.1/3318 481.2/3314 482.2/3318 482.2/3598 482.3/3201 483.1/3538 483.1/3054 483.2/3598 483.2/3553 484.1/3249 484.1/3588 484.2/3599 484.2/2845 485.1/4136 485.1/3607 485.2/2847 485.2/3041 485.2/3609 485.3/4134 485.3/3057 486/3425 486.1/3464 486.2/2799 486.2/3476 486.2/3457 486.3/4121 487.2/2799 488.2/2882 490.3/3253 491.3/3840 491.3/3539 492.2/3643 492.3/3873 492.3/3633 493.1/2882 493.2/3667 494.2/3428 494.2/3120 494.2/3165 495.2/3430 495.3/2930 495.3/3417 496.2/3403 496.2/3337 496.2/3442 496.2/3363 497.2/3337 497.2/3407 497.2/3363 498.2/3444 498.2/3403 498.4/4051 499.3/4112 499.4/4120 500.1/3045 500.2/2798 500.2/3040 501.1/3046 501.2/2797 501.2/3493 502.1/3166 502.1/3189 502.1/4192 502.2/3032 502.2/3171 503.1/3166 503.2/3167 504.1/3260 504.1/3164 504.1/3598 504.2/3260 504.2/3031 505.1/3262 505.2/3264 505.4/3699 506.1/3261 506.2/3391 506.2/3258 506.3/3028 506.3/3310 507.1/3017 507.2/3030 507.2/3390 508.1/3367 508.2/3528 508.2/3373 508.3/3839 509.1/3030 509.2/3528 509.3/3835 510.1/3471 510.2/3527 510.2/3761 511.2/3763 512/2971 512.1/3368 512.2/2923 512.2/3761 512.2/3127 512.2/3358 512.3/2926 512.3/3127 512.4/4066 513.1/3756 513.2/3365 513.2/2913 513.3/2928 513.3/3522 513.4/3526 515.2/2869 515.9/2929 516.2/3262 517.1/2909 517.1/2937 517.2/2925 517.9/2512 518.2/3409 518.2/3436 518.2/3112 518.3/3522 519/3727 520.2/3212 520.2/3268 520.2/3284 520.3/4135 520.4/4140 521.2/3212 521.3/3163 521.3/4137 521.4/4138 522.2/3363 522.2/3426 523.1/3037 523.2/3362 523.2/3426 523.2/3459 523.2/3389 523.3/4145 523.4/4140 524.1/3157 524.1/2884 524.2/3696 524.2/3624 524.3/2927 525.2/3697 525.2/3162 526.1/3177 526.2/3693 526.2/3180 527.1/3178 527.2/3191 527.2/3701 527.3/3532 528.1/3241 528.1/3179 528.2/2837 528.2/3239 528.2/3168 528.3/2835 529.1/3242 529.2/3240 529.2/3170 529.4/2928 530.2/3352 530.2/3541 530.2/3744 531.2/3350 531.2/3514 532.1/3517 532.2/3489 532.2/2870 532.2/3353 532.2/3458 533/2886 533.1/3351 533.2/3488 533.2/2870 533.2/2681 533.2/3459 533.2/3343 533.2/3898 533.3/3892 533.3/3171 533.4/4086 534.1/3455 534.2/3583 534.3/3864 534.3/4115 534.3/2667 535.2/3583 535.3/3117 535.3/3867 535.3/3518 535.3/3832 536.1/3940 536.2/3565 536.2/3719 536.2/3673 536.3/3419 537/3930 537.2/3564 537.2/3672 537.2/2870 537.4/3174 537.4/4125 538/3520 538.2/3975 538.3/4134 538.3/4098 538.4/4138 538.9/2509 539.2/2908 539.3/4135 539.4/4141 540.4/4132 540.4/3698 540.5/4170 541.4/4132 542.3/3209 543.4/3699 544/3714 544.1/3444 544.1/3370 544.2/3207 544.2/3429 544.2/3389 545.2/3201 545.2/3381 545.3/4120 545.5/2646 546.2/3205 546.2/3018 546.2/3701 546.3/4137 546.3/3016 546.3/3657 546.4/4141 547.2/3021 547.2/3683 547.2/3206 547.3/3019 547.3/4139 547.4/4141 548.1/3187 548.1/3425 548.2/3020 548.3/3015 548.3/4235 548.4/4232 549/3522 549/3262 549.1/3514 549.1/3186 549.3/4128 549.3/3428 550.1/3237 550.2/3665 550.2/3604 550.2/3245 550.3/4130 550.3/3600 550.3/3685 550.4/4134 550.4/4115 551.2/3665 551.2/3013 551.2/3032 551.2/3603 551.3/3601 551.3/3025 552.1/3353 552.1/3369 552.2/3350 552.3/3819 553/2545 553.1/3357 553.1/4164 553.2/3348 553.2/4057 553.3/3817 554.1/3215 554.1/2526 554.2/3339 554.2/3215 554.2/3490 556.2/3111 556.2/3444 556.2/3592 556.3/2914 556.3/3118 557/3161 557.1/2574 557.2/2916 557.2/3115 557.3/2914 557.3/3121 558/3514 558.1/3526 558.1/3396 558.2/3535 558.2/3392 559.1/3391 559.1/3520 559.2/3393 559.2/3536 560.2/3626 560.2/3096 560.3/3089 561.2/2916 561.3/3635 561.3/2910 561.4/3504 561.9/2539 562.1/3514 562.2/3426 562.2/3740 563/3515 563.4/4133 564.2/3883 564.2/3452 564.2/3481 564.2/3621 564.3/3049 564.3/3511 564.3/3886 564.3/2977 564.4/4140 565.2/3452 565.2/3881 565.3/3479 565.4/4140 565.4/3903 565.4/2649 566.2/3573 566.2/3204 566.3/4130 566.3/3595 566.4/4231 567.1/3030 567.2/3573 567.2/3016 567.4/4231 567.5/4234 568.2/3207 568.2/3224 568.3/3617 568.3/2910 568.4/2913 568.4/4231 568.8/2627 569.1/3295 569.2/3212 569.3/3616 569.3/3594 569.8/2919 570.1/3212 570.3/4124 570.4/3509 570.5/3514 571/3508 571.4/4130 571.4/3715 571.4/3677 571.6/2509 571.8/2913 572.1/3676 572.2/3390 572.2/3663 572.2/3417 572.2/3844 572.3/2827 572.4/3932 572.8/2915 573.2/3004 573.2/3420 573.3/3608 573.3/3391 573.8/2921 574.2/4044 574.3/4051 574.8/2915 575.3/4115 575.4/4130 576/3212 576.2/2858 576.2/3663 576.3/2859 576.3/4112 576.4/4123 576.4/3863 577.2/2861 577.3/4136 577.4/4208 578.3/3941 579.1/2789 579.2/2788 579.3/3844 579.4/3510 579.5/3513 580.1/2789 580.2/3574 580.3/3314 580.5/2523 581.1/2789 581.2/2789 581.2/3575 581.2/2861 582.3/3485 582.3/4122 582.4/3827 583.4/4097 584.1/3394 584.2/3507 584.3/3371 584.4/4137 585/3503 585.3/3087 586.1/3285 586.2/3279 586.3/3301 587.2/3274 588/3273 588.2/2794 588.3/2789 589.2/2797 589.4/4134 589.5/4402 590.1/3369 590.1/3453 590.2/3214 590.2/3467 590.2/3007 590.3/3005 591.2/3220 591.2/2982 591.3/3005 591.5/4105 592.2/3629 592.2/2998 592.3/3006 592.4/4130 593/3501 593.2/3274 593.3/3260 593.3/4128 593.4/3705 593.4/3672 594/2618 594.1/3671 594.2/3376 594.2/3408 594.3/3619 594.3/3602 594.4/2537 595/2627 595.2/3004 595.3/3016 596/3029 596.2/3008 596.2/3844 596.2/4052 596.3/3807 596.4/3803 597.2/4057 597.2/4128 597.3/3812 597.3/3000 597.3/3789 597.4/3795 598.3/3704 599.3/4126 599.4/4131 Loadings plot for all variables

PC1 Loadings plot for the top 25 varaibles Principal Component Analysis (Loadings) Top 25 PCA (VIII) PC2 0.06 0.04 0.02 0.00 0.02 0.04 0.06 556.2/3444 359/3507 532.2/3458 392.1/3631 460.1/3452 324/3268 472/2954 533.2/2870 358.2/3832362.1/3165 357.2/3833 356.2/3830 396.2/3862 323.1/3391 322.1/3394410.2/3938 382.1/3246 349.1/3290 412.2/3944 301.2/3391 300.2/3390 548.1/3425 302.2/3389 523.2/3459 466.2/3658 356.2/3830 357.2/3833 396.2/3862 349.1/3290 301.2/3391 324/3268 548.1/3425 300.2/3390 358.2/3832 412.2/3944 382.1/3246 322.1/3394 523.2/3459 359/3507 460.1/3452 302.2/3389 472/2954 533.2/2870 410.2/3938 466.2/3658 323.1/3391 532.2/3458 556.2/3444 362.1/3165 392.1/3631 0.04 0.02 0.00 0.02 0.04 37

PCA (IX) Scree plot: variance vs. principle component number 38

PLS-DA (I) A supervised method to find a predictive model that describes the direction of maximum covariance between a dataset (X) and the class membership (Y) Similar to PCA, the original variables are summarized into much fewer new variables using their weighted averages. The new variables are called scores. The weighting profiles are called loadings. PLS-DA can perform both classification and feature selection. Feature importance measure: VIP (Variable Importance in Projection) 39

PLS-DA (II) Interpretation of the model R 2 X and R 2 Y Q 2 Y fraction of the variance that the model explains in the independent (X) and dependent variables (Y) Range: 0-1 measure of the predictive accuracy of the model usually estimated by cross validation or permutation testing Range: 0-1 > 0.5 is considered good while > 0.9 is outstanding 40

PLS-DA (III) Note of caution Supervised classification methods are powerful. BUT, they can overfit your data, severely. 41

Machine Learning Clustering Classification 42

Clustering Group similar objects together Any clustering method requires A method to measure similarity/dissimilarity between objects A threshold to decide whether an object belongs to a cluster A way to measure the distance between two clusters Common clustering algorithms K-means Hierarchical Self-organizing map Unsupervised machine learning techniques 43

Hierarchical clustering (I) 1. Find the two closest objects and merge them into a cluster 2. Find and merge the next two closest objects (or an object and a cluster, or two clusters) 3. Repeat step 2 until all objects have been clustered 44

Hierarchical clustering (II) Methods to measure similarity between objects Euclidean, Manhattan Pearson correlation Cosine similarity Linkage: ways to measure the distance between two clusters single complete centroid average 45

Hierarchical clustering (III) 46

Classification Use a training set of correctly-identified observations to build a predictive model Predict to which of a set of categories a new observation belongs Supervised machine learning Methods Linear discriminant analysis Support vector machine (SVM) Artificial neural network (ANN) k-nearest neighbor Random forest PLS-DA 47

Software Packages MetaboAnalyst XCMS 48

For in-depth statistical analysis and data interpretation, please make an appointment with a biostatistician. 49

Thank you! 50