Metabolomic Data Analysis with MetaboAnalyst


User ID: guest (April 14)

1 Data Processing and Normalization

1.1 Reading and Processing the Raw Data

MetaboAnalyst accepts a variety of data types generated in metabolomic studies, including compound concentration data, binned NMR/MS spectra data, NMR/MS peak list data, as well as MS spectra (NetCDF, mzXML, mzData). Users need to specify the data type when uploading their data so that MetaboAnalyst can select the correct algorithm to process it. The R scripts datautils.r and processing.r are required to read in and process the uploaded data.

1.1.1 Reading Binned Spectral Data

The binned spectra data should be uploaded in comma separated values (.csv) format. Samples can be in rows or columns, with class labels immediately following the sample IDs. The uploaded file is in comma separated values (.csv) format, with samples in rows and features in columns. The uploaded data file contains a 50 (samples) by 200 (spectra bins) data matrix.

1.1.2 Filtering Baseline Noises

A significant proportion of bins contain close-to-zero values that come from baseline noise. These values prevent some algorithms from working properly and should be excluded before further data analysis. MetaboAnalyst uses a simple linear filter based on the maximal value of each bin. The default cut-off threshold removes the lowest 25% of spectra bins. Please see Figure 1 for a summary graph. A total of 51 bins were excluded based on the selected cut-off threshold.

Figure 1: Filtering baseline noise for binned spectra. The bars represent the maximum intensity of each bin; with the current cut-off, 149 bins are left.
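For illustration, the filtering step can be sketched in a few lines of R. This is a minimal sketch of a quantile-based max-intensity filter under assumed names (filter.baseline, frac); it is not the actual code in processing.r.

```r
# Sketch: drop the fraction of bins whose maximum intensity falls below
# the chosen quantile of all bin maxima (default: lowest 25%).
filter.baseline <- function(dat, frac = 0.25) {
  # dat: numeric matrix, samples in rows, spectra bins in columns
  bin.max <- apply(dat, 2, max, na.rm = TRUE)
  cutoff  <- quantile(bin.max, probs = frac)
  dat[, bin.max > cutoff, drop = FALSE]
}
```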

1.1.3 Data Integrity Check

Before data analysis, a data integrity check is performed to make sure that all the necessary information has been collected. The class labels must be present and contain only two classes. If samples are paired, the class labels must run from -n/2 to -1 for one group and from 1 to n/2 for the other group (n is the sample number and must be an even number). Class labels with the same absolute value are assumed to be pairs. Compound concentration or peak intensity values must all be non-negative numbers.

1.1.4 Missing Value Imputation

Too many zero or missing values will cause difficulties for downstream analysis. MetaboAnalyst offers several different methods for this purpose. The default method replaces all missing and zero values with a small value (half of the minimum positive value in the original data), assumed to be the detection limit. The rationale of this approach is that most missing values are caused by low-abundance metabolites (i.e., below the detection limit). In addition, since zero values may cause problems for data normalization (e.g., log transformation), they are also replaced with this small value. Users can also specify other methods, such as replacement by the mean/median, or use the Probabilistic PCA (PPCA), Bayesian PCA (BPCA), or Singular Value Decomposition (SVD) methods to impute the missing values [1]. Please choose the one that is most appropriate for your data. Table 1 summarizes the results of the data processing steps. Missing variables were replaced with a small value.

[1] Stacklies W, Redestig H, Scholz M, Walther D, Selbig J. pcaMethods: a Bioconductor package providing PCA methods for incomplete data. Bioinformatics 2007, 23(9):1164-1167.

Table 1: Summary of data processing results, listing for each sample (patients P002-P113, controls C002-C034) the number of positive features, missing/zero features, baseline-filtered features, and processed features.
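As a concrete illustration, the default half-minimum replacement described above can be sketched in R as follows. This is a minimal sketch under the stated assumption (a samples-by-features numeric matrix), not the actual MetaboAnalyst implementation.

```r
# Sketch of the default imputation: treat zeros as missing, then replace
# all missing entries with half the smallest positive value in the data
# (taken as an estimate of the detection limit).
impute.halfmin <- function(dat) {
  dat[dat == 0] <- NA
  lod <- min(dat[dat > 0], na.rm = TRUE) / 2
  dat[is.na(dat)] <- lod
  dat
}
```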

1.2 Data Normalization

The data is stored as a table with one sample per row and one variable (bin/peak/metabolite) per column. There are two types of normalization. Row-wise normalization aims to make each sample (row) comparable to the others (e.g., urine samples with different dilution effects). Column-wise normalization aims to make each variable (column) comparable in magnitude to the others; this procedure is useful when variables are of very different orders of magnitude. The normalization options are:

1. Row-wise normalization:
   - Normalization by the sum
   - Normalization by a reference sample (probabilistic quotient normalization) [2]
   - Normalization by a reference feature (e.g., creatinine, internal control)
   - Sample-specific normalization (e.g., normalize by dry weight, volume)

2. Column-wise normalization:
   - Log transformation (log 2)
   - Unit scaling (mean-centered and divided by the standard deviation of each variable)
   - Pareto scaling (mean-centered and divided by the square root of the standard deviation of each variable)
   - Range scaling (mean-centered and divided by the value range of each variable)

The R script normalization.r is required. Figure 2 shows the effects before and after normalization.

[2] Dieterle F, Ross A, Schlotterbeck G, Senn H. Probabilistic quotient normalization as robust method to account for dilution of complex biological mixtures. Application in 1H NMR metabonomics. Anal Chem 2006, 78(13):4281-4290.

Figure 2: Box plots and kernel density plots before and after normalization. The box plots show at most 80 features due to space limits; the density plots are based on all samples. Row-wise normalization: normalization to constant sum. Column-wise normalization: log normalization.
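The combination reported in Figure 2 (constant-sum row normalization followed by a log transform) can be sketched in R as below; the function names are placeholders, and Pareto scaling is included only for comparison with the option list above.

```r
# Sketch of normalization steps (samples in rows, variables in columns).
norm.rowsum <- function(dat) dat / rowSums(dat)            # constant sum
norm.log    <- function(dat) log2(dat)                     # log transformation
norm.pareto <- function(dat) {                             # Pareto scaling
  scale(dat, center = TRUE, scale = sqrt(apply(dat, 2, sd)))
}

dat.norm <- norm.log(norm.rowsum(dat))
```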

2 Statistical and Machine Learning Data Analysis

MetaboAnalyst offers a variety of methods commonly used in metabolomic data analyses, including:

1. Univariate analysis methods:
   - Fold change analysis
   - t-tests
   - Volcano plot

2. Dimension reduction methods:
   - Principal Component Analysis (PCA)
   - Partial Least Squares - Discriminant Analysis (PLS-DA)

3. Robust feature selection methods from microarray studies:
   - Significance Analysis of Microarrays (SAM)
   - Empirical Bayesian Analysis of Microarrays (EBAM)

4. Clustering analysis:
   - Hierarchical clustering (dendrogram, heatmap)
   - Partitional clustering (K-means clustering, Self-Organizing Map (SOM))

5. Supervised classification and feature selection methods:
   - Random Forest
   - Support Vector Machine (SVM)

2.1 Principal Component Analysis (PCA)

PCA is an unsupervised method that aims to find the directions that best explain the variance in a data set (X) without referring to the class labels (Y). The data are summarized into far fewer variables called scores, which are weighted averages of the original variables; the weighting profiles are called loadings. The PCA analysis is performed using the prcomp function, whose calculation is based on singular value decomposition. The R script chemometrics.r is required.

Figure 3 shows pairwise score plots providing an overview of the separation patterns among the most significant PCs; Figure 4 is the scree plot showing the variance explained by the selected PCs; Figure 5 shows the 2-D score plot between selected PCs; Figure 6 shows the 3-D score plot between selected PCs; Figure 7 shows the loading plot for the selected PCs; Figure 8 shows the biplot between the selected PCs.

Figure 3: Pairwise score plots between the selected PCs. The explained variance of each PC is shown in the corresponding diagonal cell.
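A minimal sketch of this step in R, assuming dat.norm is the normalized samples-by-bins matrix from above (prcomp is the base R function the report names; the plotting choices are illustrative):

```r
# SVD-based PCA on the normalized data; scaling was already handled by
# the normalization step, so only centering is applied here.
pca <- prcomp(dat.norm, center = TRUE, scale. = FALSE)
summary(pca)                      # variance explained by each PC
plot(pca$x[, 1], pca$x[, 2],      # 2-D score plot (cf. Figure 5)
     xlab = "PC 1", ylab = "PC 2")
head(pca$rotation[, 1:2])         # loadings on the first two PCs
```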

Figure 4: Scree plot showing the variance explained by the PCs. The green line on top shows the accumulated variance explained; the blue line underneath shows the variance explained by each individual PC.

Figure 5: Score plot between the selected PCs (PC 1, 96.7%; PC 2, 1.2%). The explained variances are shown in brackets.

Figure 6: 3-D score plot between the selected PCs (PC 1, 96.7%; PC 2, 1.2%; PC 3, 0.4%). The explained variances are shown in brackets.

Figure 7: Loading plot for the selected PCs.

Figure 8: PCA biplot between the selected PCs. Note: you may want to test different centering and scaling normalization methods for the biplot to be displayed properly.

2.2 Partial Least Squares - Discriminant Analysis (PLS-DA)

PLS is a supervised method that uses multivariate regression techniques to extract, via linear combinations of the original variables (X), the information that can predict the class membership (Y). The PLS regression is performed using the plsr function provided by the R pls package [3]. The classification and cross-validation are performed using the corresponding wrapper function offered by the caret package [4].

To assess the significance of class discrimination, a permutation test was performed. In each permutation, a PLS-DA model was built between the data (X) and the permuted class labels (Y), using the optimal number of components determined by cross-validation for the model based on the original class assignment. The ratio of the between-group sum of squares to the within-group sum of squares (B/W ratio) for the class assignment prediction of each model was calculated. If the B/W ratio of the original class assignment falls within the distribution of B/W ratios based on the permuted class assignments, the contrast between the two class assignments cannot be considered statistically significant.

There are two variable importance measures in PLS-DA. The first, Variable Importance in Projection (VIP), is a weighted sum of squares of the PLS loadings, taking into account the amount of explained Y-variation in each dimension. The other importance measure is based on the weighted sum of the PLS regression coefficients [5]; the weights are a function of the reduction of the sums of squares across the number of PLS components.

The R script chemometrics.r is required. Figure 9 shows the overview of score plots; Figure 10 shows the 2-D score plot between selected components; Figure 11 shows the 3-D score plot between selected components; Figure 12 shows the loading plot between the selected components; Figure 13 shows the classification performance with different numbers of components; Figure 14 shows the important features identified by PLS-DA; Figure 15 shows the permutation test results for model validation.

[3] Ron Wehrens and Bjorn-Helge Mevik. pls: Partial Least Squares Regression (PLSR) and Principal Component Regression (PCR). R package, 2007.
[4] Max Kuhn, with contributions from Jed Wing, Steve Weston and Andre Williams. caret: Classification and Regression Training. R package, 2008.
[5] Bijlsma et al. Large-scale Human Metabolomics Studies: A Strategy for Data (Pre-) Processing and Validation. Anal Chem, 2006.
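A minimal sketch of the PLS-DA fit with the pls package, assuming X is the normalized data matrix and cls the two-level class factor; the dummy coding and component count here are illustrative, not the report's exact settings:

```r
library(pls)

# PLS-DA reduces to PLS regression on a dummy-coded class response.
y   <- as.numeric(cls) - 1                 # e.g., control = 0, patient = 1
fit <- plsr(y ~ X, ncomp = 5, validation = "LOO")

RMSEP(fit)            # cross-validated error for 1..5 components
scores(fit)[, 1:2]    # component scores for a 2-D score plot (cf. Figure 10)
loadings(fit)[, 1:2]  # loadings (cf. Figure 12)
```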

Figure 9: Pairwise score plots between the selected components. The explained variance of each component is shown in the corresponding diagonal cell.

Figure 10: Score plot between the selected components (component 1, 34%; component 2, 10.9%). The explained variances are shown in brackets.

Figure 11: 3-D score plot between the selected components (component 1, 34%; component 2, 10.9%; component 3, 4.9%). The explained variances are shown in brackets.

Figure 12: Loading plot between the selected components.

Figure 13: PLS-DA classification using different numbers of components. The red circle indicates the best classifier.

Figure 14: Important features identified by PLS-DA (top 15). The left panel shows the features ranked by VIP score; the right panel shows the features ranked by their regression coefficients.

Figure 15: PLS-DA model validation by permutation tests. The top panel shows the B/W ratio calculated for both the original and the permuted PLS-DA models. The bottom panel shows the distribution of B/W ratios from the random class assignments. The green line (top) and green area (bottom) mark the 95% confidence regions of B/W for the permuted data.

2.3 Hierarchical Clustering

In (agglomerative) hierarchical cluster analysis, each sample begins as a separate cluster and the algorithm proceeds to combine them until all samples belong to one cluster. Two parameters need to be considered when performing hierarchical clustering. The first is the similarity measure: Euclidean distance, Pearson's correlation, or Spearman's rank correlation. The other is the clustering algorithm, including average linkage (clustering uses the average distance between all pairs of observations in the two groups), complete linkage (clustering uses the farthest pair of observations between the two groups), single linkage (clustering uses the closest pair of observations), and Ward's linkage (clustering to minimize the increase in the within-cluster sum of squares when two clusters are merged). A heatmap is often presented as a visual aid in addition to the dendrogram.

Hierarchical clustering is performed with the hclust function in the stats package. The R script clustering.r is required. Figure 16 shows the clustering result in the form of a dendrogram; Figure 17 shows it in the form of a heatmap.
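A minimal sketch of the reported settings (Euclidean distance, Ward's linkage) with base R, assuming dat.norm is the normalized data matrix; note that in R versions from the report's era the Ward method was named "ward" rather than "ward.D2":

```r
# Hierarchical clustering with Euclidean distance and Ward's linkage.
d  <- dist(dat.norm, method = "euclidean")
hc <- hclust(d, method = "ward.D2")       # "ward" in older R versions
plot(hc)                                  # dendrogram (cf. Figure 16)

# Heatmap using the same distance/linkage for both rows and columns.
heatmap(as.matrix(dat.norm),
        distfun   = dist,
        hclustfun = function(x) hclust(x, method = "ward.D2"))
```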

Figure 16: Clustering result shown as a dendrogram (distance measure: Euclidean; clustering algorithm: Ward).

Figure 17: Clustering result shown as a heatmap (distance measure: Euclidean; clustering algorithm: Ward).

2.4 Random Forest (RF)

Random Forest is a supervised learning algorithm suitable for high-dimensional data analysis. It uses an ensemble of classification trees, each of which is grown by random feature selection from a bootstrap sample at each branch. Class prediction is based on the majority vote of the ensemble. RF also provides other useful information, such as the OOB (out-of-bag) error and variable importance measures. During tree construction, about one third of the instances are left out of the bootstrap sample. This OOB data is then used as a test sample to obtain an unbiased estimate of the classification error (OOB error). Variable importance is evaluated by measuring the increase in the OOB error when the values of that variable are permuted.

RF analysis is performed using the randomForest package [6]. The R script classification.r is required. Table 2 shows the confusion matrix of the random forest. Figure 18 shows the cumulative error rates of the random forest analysis for the given parameters. Figure 19 shows the important features ranked by random forest. The OOB error is 0.04.

Figure 18: Cumulative error rates of the Random Forest classification. The overall error rate is shown as the black line; the red and green lines represent the error rates for each class.

[6] Andy Liaw and Matthew Wiener. Classification and Regression by randomForest. R News, 2002.
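A minimal sketch with the randomForest package, assuming X and cls as before; the seed and tree count are illustrative, not the report's settings:

```r
library(randomForest)

set.seed(42)                              # illustrative seed
rf <- randomForest(x = X, y = cls, ntree = 500, importance = TRUE)

rf$confusion                              # confusion matrix (cf. Table 2)
rf$err.rate[rf$ntree, "OOB"]              # final OOB error
varImpPlot(rf, type = 1)                  # mean decrease in accuracy (cf. Figure 19)
```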

Figure 19: Significant features identified by Random Forest (top 15). The features are ranked by the mean decrease in classification accuracy when their values are permuted.

Table 2: Random Forest classification performance: the confusion matrix (rows and columns: control, patient) together with the per-class error rate (class.error).

3 Data Annotation

Please be advised that MetaboAnalyst also supports metabolomic data annotation. For NMR, MS, or GC-MS peak list data, users can perform peak identification by searching the corresponding libraries. For compound concentration data, users can perform pathway mapping. These tasks require a lot of manual effort and are not performed by default.

The report was generated on Tue Apr 14 21:30 with R version ( ) on an i386-redhat-linux-gnu platform.

Thank you for using MetaboAnalyst! For suggestions and feedback please contact Jeff Xia (jianguox@ualberta.ca).
