Metabolomic Data Analysis with MetaboAnalyst
User ID: guest. Report date: April 14,

1 Data Processing and Normalization

1.1 Reading and Processing the Raw Data

MetaboAnalyst accepts a variety of data types generated in metabolomic studies, including compound concentration data, binned NMR/MS spectra, NMR/MS peak lists, and raw MS spectra (NetCDF, mzXML, mzData). Users must specify the data type when uploading their data so that MetaboAnalyst can select the correct processing algorithm. The R scripts datautils.r and processing.r are required to read and process the uploaded data.

1.1.1 Reading Binned Spectral Data

Binned spectra should be uploaded in comma-separated values (.csv) format. Samples can be in rows or columns, with class labels immediately following the sample IDs. The uploaded file is in .csv format, with samples in rows and features in columns. The uploaded data file contains a 50 (samples) by 200 (spectra bins) data matrix.

1.1.2 Filtering Baseline Noise

A significant proportion of bins contain close-to-zero values that come from baseline noise. These values prevent some algorithms from working properly and should be excluded before further data analysis. MetaboAnalyst uses a simple linear filter based on the maximal value of each bin. The default cut-off threshold removes the 25% of spectra bins with the lowest values. Please see Figure 1 for a summary graph. The selected cut-off threshold is . A total of 51 bins were excluded based on this cut-off.
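The baseline filter described above can be sketched as follows. This is an illustrative Python/NumPy version, not MetaboAnalyst's R code; the function name and the keep fraction are assumptions for illustration.

```python
import numpy as np

def filter_baseline_bins(X, keep_fraction=0.75):
    """Keep only the bins (columns) whose maximum intensity exceeds
    the cut-off; by default the lowest 25% of bins are removed."""
    bin_max = X.max(axis=0)                       # maximal value of each bin
    cutoff = np.quantile(bin_max, 1 - keep_fraction)
    keep = bin_max > cutoff                       # bins above the cut-off
    return X[:, keep], keep

# A synthetic 50-sample by 200-bin matrix, matching the report's dimensions
rng = np.random.default_rng(0)
X = rng.random((50, 200))
X_filt, keep = filter_baseline_bins(X)
```

With a 25% cut-off on 200 bins, roughly 150 bins remain, consistent with the counts reported above.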
Figure 1: Filtering baseline noise for binned spectra. The bars represent the maximum values of each corresponding bin; with the current cut-off, 149 bins are left.
1.1.3 Data Integrity Check

Before data analysis, a data integrity check is performed to make sure that all the necessary information has been collected. The class labels must be present and contain only two classes. If samples are paired, the class labels must run from -n/2 to -1 for one group and from 1 to n/2 for the other (n is the sample number and must be even). Class labels with the same absolute value are assumed to be pairs. Compound concentration or peak intensity values must all be non-negative numbers.

1.1.4 Missing Value Imputation

Too many zero or missing values will cause difficulties for downstream analysis. MetaboAnalyst offers several methods for this purpose. The default method replaces all missing and zero values with a small value (half of the minimum positive value in the original data), assumed to be the detection limit. The assumption of this approach is that most missing values are caused by low-abundance metabolites (i.e., below the detection limit). In addition, since zero values may cause problems for data normalization (e.g., log transformation), they are also replaced with this small value. Users can also choose other methods, such as replacement by the mean or median, or imputation by Probabilistic PCA (PPCA), Bayesian PCA (BPCA), or Singular Value Decomposition (SVD) [1]. Please choose the one that is most appropriate for your data. Table 1 summarizes the results of the data processing steps. Missing variables were replaced with a small value.

[1] Stacklies W, Redestig H, Scholz M, Walther D, Selbig J. pcaMethods: a Bioconductor package providing PCA methods for incomplete data. Bioinformatics (9).
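The default half-minimum imputation can be sketched as follows. This is an illustrative Python/NumPy version of the rule described above, not MetaboAnalyst's R implementation; the function name is an assumption.

```python
import numpy as np

def impute_half_min(X):
    """Replace missing (NaN) and zero/negative values with half of the
    minimum positive value in the data, treated as the detection limit."""
    X = np.asarray(X, dtype=float).copy()
    positive = X[np.isfinite(X) & (X > 0)]
    small = positive.min() / 2.0                  # half the minimum positive value
    X[~np.isfinite(X) | (X <= 0)] = small
    return X

X = np.array([[1.0, 0.0, 4.0],
              [np.nan, 2.0, 0.5]])
X_imp = impute_half_min(X)   # the zero and the NaN become 0.25 (= 0.5 / 2)
```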
Table 1: Summary of data processing results. For each sample (P and C series, including P013b and P100b), the table lists the number of features with positive values, the number of missing/zero values, the number of features removed by baseline filtering, and the number of features remaining after processing.
1.2 Data Normalization

The data is stored as a table with one sample per row and one variable (bin/peak/metabolite) per column. There are two types of normalization. Row-wise normalization aims to make each sample (row) comparable to the others (e.g., urine samples with different dilution effects). Column-wise normalization aims to make each variable (column) comparable across samples; this procedure is useful when variables are of very different orders of magnitude.

The normalization options are:

1. Row-wise normalization:
   - Normalization by the sum
   - Normalization by a reference sample (probabilistic quotient normalization) [2]
   - Normalization by a reference feature (e.g., creatinine, an internal control)
   - Sample-specific normalization (e.g., normalization by dry weight or volume)

2. Column-wise normalization:
   - Log transformation (log base 2)
   - Unit scaling (mean-centered and divided by the standard deviation of each variable)
   - Pareto scaling (mean-centered and divided by the square root of the standard deviation of each variable)
   - Range scaling (mean-centered and divided by the value range of each variable)

The R script normalization.r is required. Figure 2 shows the effects before and after normalization.

[2] Dieterle F, Ross A, Schlotterbeck G, Senn H. Probabilistic quotient normalization as robust method to account for dilution of complex biological mixtures. Application in 1H NMR metabonomics. Anal Chem 2006, 78(13).
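Two of the options above can be sketched as follows: row-wise normalization to a constant sum, and column-wise Pareto scaling. This is an illustrative Python/NumPy version, not the report's normalization.r script; function names are assumptions.

```python
import numpy as np

def normalize_sum(X):
    """Row-wise: scale each sample (row) so its values sum to 1."""
    return X / X.sum(axis=1, keepdims=True)

def pareto_scale(X):
    """Column-wise: mean-center each variable and divide by the
    square root of its standard deviation."""
    centered = X - X.mean(axis=0)
    return centered / np.sqrt(X.std(axis=0, ddof=1))

X = np.array([[1.0, 3.0],
              [2.0, 6.0],
              [3.0, 5.0]])
Xn = normalize_sum(X)    # every row of Xn sums to 1
Xp = pareto_scale(X)     # every column of Xp has mean 0
```

Pareto scaling shrinks large fold changes less aggressively than unit scaling, which is why it is a common choice for spectral intensities.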
Figure 2: Box plots and kernel density plots before and after normalization. The box plots show at most 80 features due to space limits; the density plots are based on all samples. Row-wise normalization: normalization to constant sum. Column-wise normalization: log transformation.
2 Statistical and Machine Learning Data Analysis

MetaboAnalyst offers a variety of methods commonly used in metabolomic data analysis. They include:

1. Univariate analysis methods:
   - Fold change analysis
   - t-tests
   - Volcano plot

2. Dimension reduction methods:
   - Principal Component Analysis (PCA)
   - Partial Least Squares - Discriminant Analysis (PLS-DA)

3. Robust feature selection methods from microarray studies:
   - Significance Analysis of Microarrays (SAM)
   - Empirical Bayesian Analysis of Microarrays (EBAM)

4. Cluster analysis:
   - Hierarchical clustering (dendrogram, heatmap)
   - Partitional clustering (K-means, Self-Organizing Map (SOM))

5. Supervised classification and feature selection methods:
   - Random Forest
   - Support Vector Machine (SVM)
2.1 Principal Component Analysis (PCA)

PCA is an unsupervised method that aims to find the directions that best explain the variance in a data set (X) without reference to the class labels (Y). The data are summarized by a much smaller number of variables called scores, which are weighted averages of the original variables; the weighting profiles are called loadings. The PCA is performed with R's prcomp function, whose calculation is based on singular value decomposition. The R script chemometrics.r is required.

Figure 3 shows pairwise score plots providing an overview of the separation patterns among the most significant PCs; Figure 4 is the scree plot showing the variance explained by the selected PCs; Figure 5 shows the 2-D score plot between the selected PCs; Figure 6 shows the 3-D score plot; Figure 7 shows the loading plot; Figure 8 shows the biplot.

Figure 3: Pairwise score plots between the selected PCs. The explained variance of each PC is shown in the corresponding diagonal cell.
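PCA by singular value decomposition on mean-centered data can be sketched as follows. This is an illustrative Python/NumPy version of what R's prcomp computes, not the report's chemometrics.r script; the function name is an assumption.

```python
import numpy as np

def pca_svd(X):
    """PCA via SVD of the mean-centered data matrix."""
    Xc = X - X.mean(axis=0)                       # center each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U * s                                # sample scores
    loadings = Vt.T                               # variable loadings
    explained = s**2 / np.sum(s**2)               # variance explained per PC
    return scores, loadings, explained

rng = np.random.default_rng(1)
X = rng.random((20, 5))
scores, loadings, explained = pca_svd(X)          # explained sums to 1
```

The scree plot in Figure 4 corresponds to plotting `explained` (individual) and its cumulative sum against the PC index.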
Figure 4: Scree plot showing the variance explained by the PCs. The green line on top shows the cumulative variance explained (reaching 98.7% by the fifth PC); the blue line underneath shows the variance explained by each individual PC.
Figure 5: Score plot between the selected PCs (PC1, 96.7%; PC2, 1.2%), with patient and control samples labeled. The explained variances are shown in brackets.
Figure 6: 3-D score plot between the selected PCs (PC1, 96.7%; PC2, 1.2%; PC3, 0.4%). The explained variances are shown in brackets.
Figure 7: Loading plot for the selected PCs.
Figure 8: PCA biplot between the selected PCs. Note that you may need to test different centering and scaling normalization methods for the biplot to display properly.
2.2 Partial Least Squares - Discriminant Analysis (PLS-DA)

PLS is a supervised method that uses multivariate regression techniques to extract, via linear combinations of the original variables (X), the information that can predict class membership (Y). The PLS regression is performed with the plsr function of the R pls package [3]. Classification and cross-validation are performed with the corresponding wrapper functions of the caret package [4].

To assess the significance of the class discrimination, a permutation test was performed. In each permutation, a PLS-DA model was built between the data (X) and the permuted class labels (Y), using the optimal number of components determined by cross-validation for the model based on the original class assignment. For each model, the ratio of the between-group sum of squares to the within-group sum of squares (B/W ratio) of the class assignment prediction was calculated. If the B/W ratio of the original class assignment falls within the distribution of B/W ratios from the permuted class assignments, the contrast between the two classes cannot be considered statistically significant.

PLS-DA provides two variable importance measures. The first, Variable Importance in Projection (VIP), is a weighted sum of squares of the PLS loadings that takes into account the amount of explained Y-variation in each dimension. The second is based on a weighted sum of the PLS regression coefficients [5], where the weights are a function of the reduction of the sums of squares across the number of PLS components. The R script chemometrics.r is required.

Figure 9 shows an overview of the pairwise score plots; Figure 10 shows the 2-D score plot between the selected components; Figure 11 shows the 3-D score plot; Figure 12 shows the loading plot; Figure 13 shows the classification performance with different numbers of components.
Figure 14 shows the important features identified by PLS-DA. Figure 15 shows the permutation test results for model validation.

[3] Ron Wehrens and Bjorn-Helge Mevik. pls: Partial Least Squares Regression (PLSR) and Principal Component Regression (PCR), 2007. R package.
[4] Max Kuhn, with contributions from Jed Wing, Steve Weston and Andre Williams. caret: Classification and Regression Training, 2008. R package.
[5] Bijlsma et al. Large-scale human metabolomics studies: a strategy for data (pre-)processing and validation. Anal Chem 2006.
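The B/W ratio at the heart of the permutation test above can be sketched as follows, computed here on a single predicted score vector as a simplified stand-in for the full PLS-DA model output. This is an illustrative Python/NumPy version, not the report's R code; names and the toy data are assumptions.

```python
import numpy as np

def bw_ratio(scores, labels):
    """Between-group sum of squares divided by within-group sum of
    squares of a prediction score vector."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    grand = scores.mean()
    between = within = 0.0
    for g in np.unique(labels):
        grp = scores[labels == g]
        between += len(grp) * (grp.mean() - grand) ** 2
        within += ((grp - grp.mean()) ** 2).sum()
    return between / within

labels = np.array([0, 0, 0, 1, 1, 1])
well_separated = np.array([0.1, 0.2, 0.15, 5.0, 5.1, 4.9])   # original labels
mixed = np.array([0.1, 5.0, 0.2, 0.15, 5.1, 4.9])            # like a permutation
```

A well-separated original model yields a much larger B/W ratio than models built on permuted labels, which is exactly the comparison shown in Figure 15.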
Figure 9: Pairwise score plots between the selected components. The explained variance of each component (e.g., 34% for component 1) is shown in the corresponding diagonal cell.
Figure 10: Score plot between the selected components (component 1, 34%; component 2, 10.9%), with patient and control samples labeled. The explained variances are shown in brackets.
Figure 11: 3-D score plot between the selected components (component 1, 34%; component 2, 10.9%; component 3, 4.9%). The explained variances are shown in brackets.
Figure 12: Loading plot between the selected components.
Figure 13: PLS-DA classification using different numbers of components. The red circle indicates the best classifier.
Figure 14: Important features identified by PLS-DA. The left panel shows the top 15 features ranked by VIP score; the right panel shows the top 15 features ranked by their regression coefficients.
Figure 15: PLS-DA model validation by permutation tests. The top panel shows the B/W ratio calculated for both the original and the permuted PLS-DA models. The bottom panel shows the distribution of B/W ratios under random class assignment. The green line (top) and green area (bottom) mark the 95% confidence region of B/W for the permuted data.
2.3 Hierarchical Clustering

In (agglomerative) hierarchical cluster analysis, each sample begins as a separate cluster, and the algorithm proceeds to combine clusters until all samples belong to a single cluster. Two parameters must be chosen. The first is the similarity measure: Euclidean distance, Pearson's correlation, or Spearman's rank correlation. The second is the clustering algorithm: average linkage (merging based on the average distance between all pairs of observations in the two groups), complete linkage (based on the farthest pair of observations between the two groups), single linkage (based on the closest pair of observations), or Ward's linkage (merging the pair of clusters that minimizes the increase in the total within-cluster sum of squares). A heatmap is often presented as a visual aid in addition to the dendrogram.

Hierarchical clustering is performed with the hclust function of the stats package. The R script clustering.r is required. Figure 16 shows the clustering result in the form of a dendrogram. Figure 17 shows the clustering result in the form of a heatmap.
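The Euclidean-distance/Ward's-linkage combination used for the dendrogram and heatmap can be sketched as follows. This is an illustrative Python/SciPy version rather than R's hclust; the synthetic two-group data is an assumption for demonstration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
# two well-separated synthetic groups of 5 samples each
X = np.vstack([rng.normal(0.0, 0.1, (5, 3)),
               rng.normal(5.0, 0.1, (5, 3))])

# agglomerative clustering: Euclidean distance, Ward's linkage
Z = linkage(X, method="ward", metric="euclidean")
clusters = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
```

Cutting the resulting tree at two clusters recovers the two synthetic groups; in the report's data, the analogous cut largely separates patients from controls (Figure 16).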
Figure 16: Clustering result shown as a dendrogram (Euclidean distance measure, Ward's clustering algorithm), with patient and control samples labeled.
Figure 17: Clustering result shown as a heatmap (Euclidean distance measure, Ward's clustering algorithm).
2.4 Random Forest (RF)

Random Forest is a supervised learning algorithm suitable for high-dimensional data analysis. It uses an ensemble of classification trees, each of which is grown by random feature selection from a bootstrap sample at each branch, and predicts class by the majority vote of the ensemble. RF also provides other useful information, such as the OOB (out-of-bag) error and variable importance measures. During the construction of each tree, about one-third of the instances are left out of the bootstrap sample; this OOB data is then used as a test sample to obtain an unbiased estimate of the classification error (the OOB error). Variable importance is evaluated by measuring the increase in the OOB error when that variable is permuted.

RF analysis is performed with the randomForest package [6]. The R script classification.r is required. Table 2 shows the confusion matrix of the random forest. Figure 18 shows the cumulative error rates of the random forest analysis for the given parameters. Figure 19 shows the important features ranked by random forest. The OOB error is 0.04.

Figure 18: Cumulative error rates of Random Forest classification. The overall error rate is shown as the black line; the red and green lines represent the error rates for each class (control and patient).

[6] Andy Liaw and Matthew Wiener. Classification and regression by randomForest. R News, 2002.
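The OOB error and permutation-based importance described above can be sketched as follows. This is an illustrative Python/scikit-learn version rather than the report's R randomForest code; the synthetic data, in which only the first feature carries the class signal, is an assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
y = (X[:, 0] > 0).astype(int)      # class depends only on feature 0

rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)

oob_error = 1.0 - rf.oob_score_    # out-of-bag classification error estimate
# permutation importance: drop in accuracy when each feature is shuffled
perm = permutation_importance(rf, X, y, n_repeats=5, random_state=0)
importances = perm.importances_mean
```

Because only feature 0 determines the class, permuting it causes the largest accuracy drop, mirroring the mean-decrease-in-accuracy ranking shown in Figure 19.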
Figure 19: Significant features identified by Random Forest (top 15). The features are ranked by the mean decrease in classification accuracy when they are permuted.
Table 2: Random Forest classification performance. The confusion matrix lists, for the control and patient classes, the predicted class counts and the per-class error (class.error).
3 Data Annotation

Please be advised that MetaboAnalyst also supports metabolomic data annotation. For NMR, MS, or GC-MS peak list data, users can perform peak identification by searching the corresponding libraries. For compound concentration data, users can perform pathway mapping. These tasks require considerable manual effort and are not performed by default.

The report was generated on Tue Apr 14 21:30: with R version ( ) on an i386-redhat-linux-gnu platform. Thank you for using MetaboAnalyst! For suggestions and feedback please contact Jeff Xia (jianguox@ualberta.ca).
Bioinformatics - Lecture 07 Bioinformatics Clusters and networks Martin Saturka http://www.bioplexity.org/lectures/ EBI version 0.4 Creative Commons Attribution-Share Alike 2.5 License Learning on profiles
More informationContents. Preface to the Second Edition
Preface to the Second Edition v 1 Introduction 1 1.1 What Is Data Mining?....................... 4 1.2 Motivating Challenges....................... 5 1.3 The Origins of Data Mining....................
More informationPractical OmicsFusion
Practical OmicsFusion Introduction In this practical, we will analyse data, from an experiment which aim was to identify the most important metabolites that are related to potato flesh colour, from an
More informationClassification of Subject Motion for Improved Reconstruction of Dynamic Magnetic Resonance Imaging
1 CS 9 Final Project Classification of Subject Motion for Improved Reconstruction of Dynamic Magnetic Resonance Imaging Feiyu Chen Department of Electrical Engineering ABSTRACT Subject motion is a significant
More informationUsing the DATAMINE Program
6 Using the DATAMINE Program 304 Using the DATAMINE Program This chapter serves as a user s manual for the DATAMINE program, which demonstrates the algorithms presented in this book. Each menu selection
More informationMIT Samberg Center Cambridge, MA, USA. May 30 th June 2 nd, by C. Rea, R.S. Granetz MIT Plasma Science and Fusion Center, Cambridge, MA, USA
Exploratory Machine Learning studies for disruption prediction on DIII-D by C. Rea, R.S. Granetz MIT Plasma Science and Fusion Center, Cambridge, MA, USA Presented at the 2 nd IAEA Technical Meeting on
More informationStatistical Pattern Recognition
Statistical Pattern Recognition Features and Feature Selection Hamid R. Rabiee Jafar Muhammadi Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2/ Agenda Features and Patterns The Curse of Size and
More informationECS 234: Data Analysis: Clustering ECS 234
: Data Analysis: Clustering What is Clustering? Given n objects, assign them to groups (clusters) based on their similarity Unsupervised Machine Learning Class Discovery Difficult, and maybe ill-posed
More informationContents Machine Learning concepts 4 Learning Algorithm 4 Predictive Model (Model) 4 Model, Classification 4 Model, Regression 4 Representation
Contents Machine Learning concepts 4 Learning Algorithm 4 Predictive Model (Model) 4 Model, Classification 4 Model, Regression 4 Representation Learning 4 Supervised Learning 4 Unsupervised Learning 4
More informationCSE 158. Web Mining and Recommender Systems. Midterm recap
CSE 158 Web Mining and Recommender Systems Midterm recap Midterm on Wednesday! 5:10 pm 6:10 pm Closed book but I ll provide a similar level of basic info as in the last page of previous midterms CSE 158
More informationMULTIVARIATE ANALYSIS USING R
MULTIVARIATE ANALYSIS USING R B N Mandal I.A.S.R.I., Library Avenue, New Delhi 110 012 bnmandal @iasri.res.in 1. Introduction This article gives an exposition of how to use the R statistical software for
More information10. Clustering. Introduction to Bioinformatics Jarkko Salojärvi. Based on lecture slides by Samuel Kaski
10. Clustering Introduction to Bioinformatics 30.9.2008 Jarkko Salojärvi Based on lecture slides by Samuel Kaski Definition of a cluster Typically either 1. A group of mutually similar samples, or 2. A
More informationPackage DiffCorr. August 29, 2016
Type Package Package DiffCorr August 29, 2016 Title Analyzing and Visualizing Differential Correlation Networks in Biological Data Version 0.4.1 Date 2015-03-31 Author, Kozo Nishida Maintainer
More information10701 Machine Learning. Clustering
171 Machine Learning Clustering What is Clustering? Organizing data into clusters such that there is high intra-cluster similarity low inter-cluster similarity Informally, finding natural groupings among
More informationClustering. CS294 Practical Machine Learning Junming Yin 10/09/06
Clustering CS294 Practical Machine Learning Junming Yin 10/09/06 Outline Introduction Unsupervised learning What is clustering? Application Dissimilarity (similarity) of objects Clustering algorithm K-means,
More informationUnsupervised learning: Clustering & Dimensionality reduction. Theo Knijnenburg Jorma de Ronde
Unsupervised learning: Clustering & Dimensionality reduction Theo Knijnenburg Jorma de Ronde Source of slides Marcel Reinders TU Delft Lodewyk Wessels NKI Bioalgorithms.info Jeffrey D. Ullman Stanford
More informationVIDAEXPERT: DATA ANALYSIS Here is the Statistics button.
Here is the Statistics button. After creating dataset you can analyze it in different ways. First, you can calculate statistics. Open Statistics dialog, Common tabsheet, click Calculate. Min, Max: minimal
More informationHierarchical Clustering
What is clustering Partitioning of a data set into subsets. A cluster is a group of relatively homogeneous cases or observations Hierarchical Clustering Mikhail Dozmorov Fall 2016 2/61 What is clustering
More informationJMP Book Descriptions
JMP Book Descriptions The collection of JMP documentation is available in the JMP Help > Books menu. This document describes each title to help you decide which book to explore. Each book title is linked
More informationClustering CS 550: Machine Learning
Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf
More informationNMRProcFlow Macro-command Reference Guide
NMRProcFlow Macro-command Reference Guide This document is the reference guide of the macro-commands Daniel Jacob UMR 1332 BFP, Metabolomics Facility CGFB Bordeaux, MetaboHUB - 2018 1 NMRProcFlow - Macro-command
More informationEvaluation Measures. Sebastian Pölsterl. April 28, Computer Aided Medical Procedures Technische Universität München
Evaluation Measures Sebastian Pölsterl Computer Aided Medical Procedures Technische Universität München April 28, 2015 Outline 1 Classification 1. Confusion Matrix 2. Receiver operating characteristics
More informationFinal Report: Kaggle Soil Property Prediction Challenge
Final Report: Kaggle Soil Property Prediction Challenge Saurabh Verma (verma076@umn.edu, (612)598-1893) 1 Project Goal Low cost and rapid analysis of soil samples using infrared spectroscopy provide new
More informationStatistical Pattern Recognition
Statistical Pattern Recognition Features and Feature Selection Hamid R. Rabiee Jafar Muhammadi Spring 2012 http://ce.sharif.edu/courses/90-91/2/ce725-1/ Agenda Features and Patterns The Curse of Size and
More informationStatistical Pattern Recognition
Statistical Pattern Recognition Features and Feature Selection Hamid R. Rabiee Jafar Muhammadi Spring 2013 http://ce.sharif.edu/courses/91-92/2/ce725-1/ Agenda Features and Patterns The Curse of Size and
More informationCSE 40171: Artificial Intelligence. Learning from Data: Unsupervised Learning
CSE 40171: Artificial Intelligence Learning from Data: Unsupervised Learning 32 Homework #6 has been released. It is due at 11:59PM on 11/7. 33 CSE Seminar: 11/1 Amy Reibman Purdue University 3:30pm DBART
More informationHigh throughput Data Analysis 2. Cluster Analysis
High throughput Data Analysis 2 Cluster Analysis Overview Why clustering? Hierarchical clustering K means clustering Issues with above two Other methods Quality of clustering results Introduction WHY DO
More informationPackage stattarget. December 23, 2018
Type Package Title Statistical Analysis of Molecular Profiles Version 1.12.0 Author Hemi Luan Maintainer Hemi Luan Depends R (>= 3.3.0) Package stattarget December 23, 2018 Imports
More informationMultivariate Methods
Multivariate Methods Cluster Analysis http://www.isrec.isb-sib.ch/~darlene/embnet/ Classification Historically, objects are classified into groups periodic table of the elements (chemistry) taxonomy (zoology,
More informationFacial Expression Classification with Random Filters Feature Extraction
Facial Expression Classification with Random Filters Feature Extraction Mengye Ren Facial Monkey mren@cs.toronto.edu Zhi Hao Luo It s Me lzh@cs.toronto.edu I. ABSTRACT In our work, we attempted to tackle
More informationINF 4300 Classification III Anne Solberg The agenda today:
INF 4300 Classification III Anne Solberg 28.10.15 The agenda today: More on estimating classifier accuracy Curse of dimensionality and simple feature selection knn-classification K-means clustering 28.10.15
More informationFeature Selection. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani
Feature Selection CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Dimensionality reduction Feature selection vs. feature extraction Filter univariate
More informationStatistical Methods for Data Mining
Statistical Methods for Data Mining Kuangnan Fang Xiamen University Email: xmufkn@xmu.edu.cn Unsupervised Learning Unsupervised vs Supervised Learning: Most of this course focuses on supervised learning
More informationDI TRANSFORM. The regressive analyses. identify relationships
July 2, 2015 DI TRANSFORM MVstats TM Algorithm Overview Summary The DI Transform Multivariate Statistics (MVstats TM ) package includes five algorithm options that operate on most types of geologic, geophysical,
More informationClustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani
Clustering CE-717: Machine Learning Sharif University of Technology Spring 2016 Soleymani Outline Clustering Definition Clustering main approaches Partitional (flat) Hierarchical Clustering validation
More informationPredict Outcomes and Reveal Relationships in Categorical Data
PASW Categories 18 Specifications Predict Outcomes and Reveal Relationships in Categorical Data Unleash the full potential of your data through predictive analysis, statistical learning, perceptual mapping,
More informationDATA MINING LECTURE 7. Hierarchical Clustering, DBSCAN The EM Algorithm
DATA MINING LECTURE 7 Hierarchical Clustering, DBSCAN The EM Algorithm CLUSTERING What is a Clustering? In general a grouping of objects such that the objects in a group (cluster) are similar (or related)
More informationMachine Learning in Biology
Università degli studi di Padova Machine Learning in Biology Luca Silvestrin (Dottorando, XXIII ciclo) Supervised learning Contents Class-conditional probability density Linear and quadratic discriminant
More informationFeatures: representation, normalization, selection. Chapter e-9
Features: representation, normalization, selection Chapter e-9 1 Features Distinguish between instances (e.g. an image that you need to classify), and the features you create for an instance. Features
More informationINF4820. Clustering. Erik Velldal. Nov. 17, University of Oslo. Erik Velldal INF / 22
INF4820 Clustering Erik Velldal University of Oslo Nov. 17, 2009 Erik Velldal INF4820 1 / 22 Topics for Today More on unsupervised machine learning for data-driven categorization: clustering. The task
More informationClassification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University
Classification Vladimir Curic Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Outline An overview on classification Basics of classification How to choose appropriate
More informationCourse on Microarray Gene Expression Analysis
Course on Microarray Gene Expression Analysis ::: Normalization methods and data preprocessing Madrid, April 27th, 2011. Gonzalo Gómez ggomez@cnio.es Bioinformatics Unit CNIO ::: Introduction. The probe-level
More informationData Processing and Analysis in Systems Medicine. Milena Kraus Data Management for Digital Health Summer 2017
Milena Kraus Digital Health Summer Agenda Real-world Use Cases Oncology Nephrology Heart Insufficiency Additional Topics Data Management & Foundations Biology Recap Data Sources Data Formats Business Processes
More informationADVANCED ANALYTICS USING SAS ENTERPRISE MINER RENS FEENSTRA
INSIGHTS@SAS: ADVANCED ANALYTICS USING SAS ENTERPRISE MINER RENS FEENSTRA AGENDA 09.00 09.15 Intro 09.15 10.30 Analytics using SAS Enterprise Guide Ellen Lokollo 10.45 12.00 Advanced Analytics using SAS
More informationComputing with large data sets
Computing with large data sets Richard Bonneau, spring 2009 Lecture 8(week 5): clustering 1 clustering Clustering: a diverse methods for discovering groupings in unlabeled data Because these methods don
More informationSalford Systems Predictive Modeler Unsupervised Learning. Salford Systems
Salford Systems Predictive Modeler Unsupervised Learning Salford Systems http://www.salford-systems.com Unsupervised Learning In mainstream statistics this is typically known as cluster analysis The term
More informationForestry Applied Multivariate Statistics. Cluster Analysis
1 Forestry 531 -- Applied Multivariate Statistics Cluster Analysis Purpose: To group similar entities together based on their attributes. Entities can be variables or observations. [illustration in Class]
More information3 Feature Selection & Feature Extraction
3 Feature Selection & Feature Extraction Overview: 3.1 Introduction 3.2 Feature Extraction 3.3 Feature Selection 3.3.1 Max-Dependency, Max-Relevance, Min-Redundancy 3.3.2 Relevance Filter 3.3.3 Redundancy
More informationPathway Analysis of Untargeted Metabolomics Data using the MS Peaks to Pathways Module
Pathway Analysis of Untargeted Metabolomics Data using the MS Peaks to Pathways Module By: Jasmine Chong, Jeff Xia Date: 14/02/2018 The aim of this tutorial is to demonstrate how the MS Peaks to Pathways
More informationMay 1, CODY, Error Backpropagation, Bischop 5.3, and Support Vector Machines (SVM) Bishop Ch 7. May 3, Class HW SVM, PCA, and K-means, Bishop Ch
May 1, CODY, Error Backpropagation, Bischop 5.3, and Support Vector Machines (SVM) Bishop Ch 7. May 3, Class HW SVM, PCA, and K-means, Bishop Ch 12.1, 9.1 May 8, CODY Machine Learning for finding oil,
More informationGene signature selection to predict survival benefits from adjuvant chemotherapy in NSCLC patients
1 Gene signature selection to predict survival benefits from adjuvant chemotherapy in NSCLC patients 1,2 Keyue Ding, Ph.D. Nov. 8, 2014 1 NCIC Clinical Trials Group, Kingston, Ontario, Canada 2 Dept. Public
More informationClustering and Dimensionality Reduction. Stony Brook University CSE545, Fall 2017
Clustering and Dimensionality Reduction Stony Brook University CSE545, Fall 2017 Goal: Generalize to new data Model New Data? Original Data Does the model accurately reflect new data? Supervised vs. Unsupervised
More informationData Preprocessing. Javier Béjar. URL - Spring 2018 CS - MAI 1/78 BY: $\
Data Preprocessing Javier Béjar BY: $\ URL - Spring 2018 C CS - MAI 1/78 Introduction Data representation Unstructured datasets: Examples described by a flat set of attributes: attribute-value matrix Structured
More informationAcquisition Description Exploration Examination Understanding what data is collected. Characterizing properties of data.
Summary Statistics Acquisition Description Exploration Examination what data is collected Characterizing properties of data. Exploring the data distribution(s). Identifying data quality problems. Selecting
More informationCLUSTER ANALYSIS. V. K. Bhatia I.A.S.R.I., Library Avenue, New Delhi
CLUSTER ANALYSIS V. K. Bhatia I.A.S.R.I., Library Avenue, New Delhi-110 012 In multivariate situation, the primary interest of the experimenter is to examine and understand the relationship amongst the
More informationOlmo S. Zavala Romero. Clustering Hierarchical Distance Group Dist. K-means. Center of Atmospheric Sciences, UNAM.
Center of Atmospheric Sciences, UNAM November 16, 2016 Cluster Analisis Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster)
More informationCLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS
CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of
More informationClustering analysis of gene expression data
Clustering analysis of gene expression data Chapter 11 in Jonathan Pevsner, Bioinformatics and Functional Genomics, 3 rd edition (Chapter 9 in 2 nd edition) Human T cell expression data The matrix contains
More informationHard clustering. Each object is assigned to one and only one cluster. Hierarchical clustering is usually hard. Soft (fuzzy) clustering
An unsupervised machine learning problem Grouping a set of objects in such a way that objects in the same group (a cluster) are more similar (in some sense or another) to each other than to those in other
More informationMachine Learning and Data Mining. Clustering. (adapted from) Prof. Alexander Ihler
Machine Learning and Data Mining Clustering (adapted from) Prof. Alexander Ihler Overview What is clustering and its applications? Distance between two clusters. Hierarchical Agglomerative clustering.
More informationCluster Analysis: Agglomerate Hierarchical Clustering
Cluster Analysis: Agglomerate Hierarchical Clustering Yonghee Lee Department of Statistics, The University of Seoul Oct 29, 2015 Contents 1 Cluster Analysis Introduction Distance matrix Agglomerative Hierarchical
More informationUnsupervised Learning
Unsupervised Learning Learning without Class Labels (or correct outputs) Density Estimation Learn P(X) given training data for X Clustering Partition data into clusters Dimensionality Reduction Discover
More informationWeek 7 Picturing Network. Vahe and Bethany
Week 7 Picturing Network Vahe and Bethany Freeman (2005) - Graphic Techniques for Exploring Social Network Data The two main goals of analyzing social network data are identification of cohesive groups
More informationIntroduction to Pattern Recognition Part II. Selim Aksoy Bilkent University Department of Computer Engineering
Introduction to Pattern Recognition Part II Selim Aksoy Bilkent University Department of Computer Engineering saksoy@cs.bilkent.edu.tr RETINA Pattern Recognition Tutorial, Summer 2005 Overview Statistical
More informationLesson 3. Prof. Enza Messina
Lesson 3 Prof. Enza Messina Clustering techniques are generally classified into these classes: PARTITIONING ALGORITHMS Directly divides data points into some prespecified number of clusters without a hierarchical
More informationAnalyzing ICAT Data. Analyzing ICAT Data
Analyzing ICAT Data Gary Van Domselaar University of Alberta Analyzing ICAT Data ICAT: Isotope Coded Affinity Tag Introduced in 1999 by Ruedi Aebersold as a method for quantitative analysis of complex
More informationUnsupervised learning in Vision
Chapter 7 Unsupervised learning in Vision The fields of Computer Vision and Machine Learning complement each other in a very natural way: the aim of the former is to extract useful information from visual
More information