Tutorial: From chemical structures to outlier analysis. G. Marcou, D. Horvath, A. Varnek


Download your materials
Download URL (tinyurl to Renater FileSender repository): https://tinyurl.com/y99km3nd
Content (<150 Mo when unzipped):
- Software: ExtCV.exe, xfragmentor.exe, xmodelanalyzerr.exe
- Datasets: Datamining
  - 5-HT2B (a database)
  - Iter0 (tutorial directory)
  - Iter1 (result after the tutorial)
  - Pubmed (relevant articles)

Introduction
The IUPHAR 5-HT2B dataset:
- 88 compounds
- Affinity values: pKi, pKd (mostly human)
- Type: agonist or antagonist
- Bibliography: PubMed Ids

5-HT2A
- Expressed in the forebrain, cerebral cortex
- Famous 5-HT2A agonist: LSD
- Implicated in Flower Power

5-HT2C
- Expressed in the CNS
- Example 5-HT2C antagonist: Methysergide
- Implicated in obesity, seizures, psychotic disorders

5-HT2B, the anti-target
- Expressed in liver, kidney, heart
- Example ligands: PhenFen, Benfluorex
- Drug-induced valvular heart disease

Exercise 1
Goal:
- Create a database of 5-HT2B ligands
Software:
- InstantJChem
File:
- IUPHAR_5HT2B.sdf
Output:
- Directory 5-HT2B

Create a local database

Configure local database

Load an SDF into the database

SDF field management

Grid view column management

Final state of the interface
- A database of 5-HT2B ligands
- Access to chemical structures, affinity, type, identifier, bibliographic references
- Possibility to edit the dataset: structures, instances and values

Exercise 2
Goal:
- Set up an external cross-validation procedure
Software:
- ExtCrossValidate
- xfragmentor
File:
- IUPHAR_5HT2B.sdf
Output:
- Directory CV, files trainIteriFoldj.sdf and testIteriFoldj.sdf
- ISIDA Substructural Molecular Fragments files: *.hdr, *.svm, *.arff

Software: ExtCrossValidate
1. Select the IUPHAR_5HT2B.sdf file
2. Choose Cross Validation
3. Set iterations to 1 (k=1) and folds to 4 (N=4)
4. Uncheck the tick-box
Store all train and test sets into a folder named CV

External Cross-Validation
4-fold cross-validation: each compound is used exactly once in a test set. Training and test sets are structural (SDF) files. Each fold produces a dedicated pair of training and test files in a directory named CV:
- trainIter1Fold1.sdf / testIter1Fold1.sdf
- trainIter1Fold2.sdf / testIter1Fold2.sdf
- trainIter1Fold3.sdf / testIter1Fold3.sdf
- trainIter1Fold4.sdf / testIter1Fold4.sdf
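The fold construction performed by ExtCrossValidate can be sketched in a few lines of Python (a hypothetical stand-in for illustration only; the actual tool also writes the SDF files): every compound lands in exactly one test set.

```python
import random

def kfold_split(items, n_folds=4, seed=0):
    """Shuffle the dataset and build n_folds (train, test) pairs;
    each item appears in exactly one test set."""
    items = list(items)
    random.Random(seed).shuffle(items)
    folds = [items[i::n_folds] for i in range(n_folds)]
    splits = []
    for i in range(n_folds):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        splits.append((train, test))
    return splits

# 88 compounds, as in the IUPHAR 5-HT2B dataset
splits = kfold_split(range(88), n_folds=4)
```

With 88 compounds and 4 folds, each training set holds 66 compounds and each test set 22.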

Software: xfragmentor

ISIDA fragments
- Sequences of atoms and bonds, containing from Nmin (≥2) to Nmax (≤15) atoms
- Augmented atoms (a central atom and its direct neighbours)

ISIDA Fragmentor: the pattern matrix
Each molecule of the dataset is described by its counts of substructural fragments (e.g. C-C-C-N-C-C, C=O, N(C)(C)(C), N(CO)(CC)(CC), C-N-C-C*C), one row per molecule:

  10 1 1 1 0
   8 1 1 1 0
   4 1 1 1 4
  etc.

Compute training set descriptors
Add SDF files (area 1):
- IUPHAR_5-HT2B.sdf
- CV/trainIteriFoldj.sdf
Add fragmentations (area 2):
- Atom count: E
- Atom-centered fragments of length 2 to 4: IIAB(2-4)
- Use formal charge: FC
Save the configuration:
- Click the Save XML button
- Name the file train_E_IIAB_R(2-4)_FC.xml
Click the Run button

Compute test set descriptors
Remove all SDF files and add the test SDF files (area 1):
- Remove IUPHAR_5-HT2B.sdf
- Remove CV/trainIteriFoldj.sdf
- Add CV/testIteriFoldj.sdf
Check the tickbox Use Predefined Fragments (area 3):
- Leave Basis name at its default value, or empty
Use the Check button (optional):
- You should receive the message "All header files were found."
Save the configuration:
- Click the Save XML button
- Name the file test_E_IIAB_R(2-4)_FC.xml
Click the Run button

Files created by xfragmentor
- XML: configuration file.
- HDR: list of fragments.
- SVM: molecular descriptors (attributes) / property value files in sparse format. Each molecule corresponds to one line, in order of appearance in the SDF file.
- ARFF: molecular descriptors (attributes) / property value files in sparse format, compatible with the Weka software. Concatenates information from the HDR and SVM files.
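The sparse one-line-per-molecule layout described above can be sketched as follows (a minimal illustration of the format, not ISIDA code; function names and the exact field layout are assumptions):

```python
def to_svm_line(property_value, counts):
    """One molecule -> one sparse line: 'y i1:c1 i2:c2 ...'
    counts maps 1-based fragment indices to occurrence counts."""
    feats = " ".join(f"{i}:{c}" for i, c in sorted(counts.items()) if c)
    return f"{property_value} {feats}".strip()

def parse_svm_line(line):
    """Inverse of to_svm_line: recover the property and the counts."""
    head, *pairs = line.split()
    counts = {int(i): int(c) for i, c in (p.split(":") for p in pairs)}
    return float(head), counts

line = to_svm_line(7.2, {1: 10, 2: 1, 5: 4})
```

Fragments with a zero count are simply absent from the line, which is what makes the format sparse.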

Summary
- Division of the dataset into 4 sets of training and test structural files
- Calculation of the ISIDA Substructural Molecular Fragments descriptors:
  - Atom count
  - Atom-centered fragments including from 2 to 4 atoms
  - Atoms and bonds standard annotation
  - Formal charge

Exercise 3
Goal:
- Set up Gaussian Processes models using only the training sets
Software:
- Weka/Experimenter
File:
- CV/trainIteriFoldj.sdf and CV/trainIteriFoldj_*.arff
Output:
- ExtCVtrain.arff
- ExtCVtrain.exp

Introduction to Gaussian Processes
Goal:
- Model a probability distribution of the property.
Y_train and Y_test are random vectors representing the properties; X_train and X_test are the molecular descriptor matrices:

  [ Y_train ]        (    [ K(X_train, X_train) + σ²I   K(X_train, X_test) ] )
  [ Y_test  ]  ~  N ( 0,  [ K(X_test, X_train)          K(X_test, X_test)  ] )

Parameters of the method:
- K(.,.), the kernel. In this tutorial, the linear kernel is used.
- σ², the noise level. Closely related to the experimental noise on the property.

Inference of Gaussian Processes
Two quantities are inferred for a test set:
- The mean vector:
  Y̅_test = K(X_test, X_train) [K(X_train, X_train) + σ²I]^-1 Y_train
- The covariance:
  σ²_Y_test = K(X_test, X_test) − K(X_test, X_train) [K(X_train, X_train) + σ²I]^-1 K(X_train, X_test)
[Figure: data points, the mean vector, and a ± two-variance band]
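The two formulas above can be checked with a short NumPy sketch (linear kernel, `noise` standing in for σ²; this mirrors the slide's equations, not Weka's implementation):

```python
import numpy as np

def gp_predict(X_train, y_train, X_test, noise=4.0):
    """Posterior mean and covariance of a GP with linear kernel
    K(A, B) = A @ B.T and observation noise sigma^2 = noise."""
    K = X_train @ X_train.T + noise * np.eye(len(X_train))
    K_star = X_test @ X_train.T            # K(X_test, X_train)
    K_inv = np.linalg.inv(K)
    mean = K_star @ K_inv @ y_train
    cov = X_test @ X_test.T - K_star @ K_inv @ K_star.T
    return mean, cov

X_train = np.array([[1.0], [2.0], [3.0]])
y_train = np.array([2.0, 4.0, 6.0])        # a perfectly linear property
mean, cov = gp_predict(X_train, y_train, np.array([[4.0]]), noise=1e-8)
```

With negligible noise and perfectly linear data the posterior mean extrapolates the line (here to about 8 at x=4) and the posterior variance collapses toward zero; raising `noise` widens the predictive band.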

Weka/Experimenter

Weka/Experimenter
- Click New.
- Results destination: ExtCVtrain.arff
- 4-fold cross-validation repeated 4 times
- Experiment type: Regression
- Add the training files to the Datasets
- Add the ZeroR method to the Algorithms
- Add the Gaussian Processes method to the Algorithms

Setup Gaussian Processes
- Add Gaussian Processes: no data transformations
- Noise level in the range [1..10]

Finalize the experiment
- Click the Save button
- Save your setup as ExtCVtrain.exp
- Click the Run tab, then Start
- Click the Analyse tab

Analyze the experiment on training data
- Click the Experiment button
- Set the Comparison field to Relative Absolute Error
- Click the Perform test button

Summary
- The optimum noise level is deduced only from the internal cross-validation on the training sets (optimal value about 4).
- Performances differ between training sets; the differences must be statistically demonstrated.
- The test sets remain fully independent: the setup of the model building is complete without the help of the test sets.

Exercise 4
Goal:
- External validation of the Gaussian Processes models
Software:
- Weka/Experimenter
Files:
- CV/trainIteriFoldj.sdf and CV/trainIteriFoldj_*.arff
- CV/testIteriFoldj.sdf and CV/testIteriFoldj_*.arff
Output:
- ExtCVtrain.arff
- ExtCVtrain.exp

Weka/Experimenter Advanced mode

Weka/Experimenter Advanced mode
- Set the Destination to ExpCVtest.arff
- Set the Runs from 1 to 1
- Check or set the Datasets

Setup the Results Generator
- Set the value of RelationFind to (.*train \.sdf.*)
- Set the value of testSetPrefix to test
- Set the value of testSetSuffix to _E_IIAB(2-4)_FC.arff
- Set the testSetDir to CV
- Click the RegressionSplitEvaluator

RegressionSplitEvaluator: GP with optimum noise level

Finalize the experiment
- Click the Save button
- Save your setup as ExtCVtrain.exp
- Click the Run tab, then Start
- Click the Analyse tab

Analyze the experiment on training data
- Click the Experiment button
- Set the Comparison field to Relative Absolute Error
- Click the Perform test button

Summary
- The optimum noise level is deduced only from the internal cross-validation on the training sets (optimal value about 4).
- The configuration is used to validate the models on the test sets (relative absolute error about 80%).
- Performances depend on the cross-validation fold: are there outliers?

Exercise 5
Goal:
- Outlier identification
Software:
- Weka/Explorer
- xmodelanalyzerr
- InstantJChem
Files:
- ARFF files
Output:
- Log files of Weka/Explorer
- A new SDF without an outlier

Weka/Explorer Load the file IUPHAR_5HT2B_E_IIAB_R(2-4)_FC.arff

Setup a Gaussian Processes model
- Setup a Gaussian Processes model with an optimal noise value (for instance 4)
- Select the Cross-validation option and set the Folds value to 4
- Click the Start button

Weka/Explorer: Gaussian Processes cross-validation results
- Experiment summary
- Description of the model
- Cross-validation statistics

Setup a Gaussian Processes model
- Setup a Gaussian Processes model if needed
- Select the Supplied test set option and set the training file as the test set. This produces fitting results.
- Click the More options button

Weka/Explorer classifier options
- Set the Output predictions to CSV
- Click OK, then click the Start button

Weka/Explorer: Gaussian Processes fitting results
- Estimation for each instance (Weka/Explorer does not provide the covariance matrix)
- Fit performances: as they are excellent, the outliers become visible. Outliers do not fit.

Save the Weka result buffer
- Right-click on the model's line
- Save your file as GP_Fit_all.out

xmodelanalyzerr
- Open the file GP_Fit_all.out and the file IUPHAR_5HT2B.sdf
- Click the OK button

Outlier detection: example of a single-step procedure
All points outside the [−3σ, 3σ] range are flagged.
- Assumption: data follow a normal distribution.
- High impact of outliers on the mean (μ) and standard deviation (σ) values.
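A minimal sketch of this single-step rule in pure Python; note how μ and σ are estimated from the contaminated data themselves, which is exactly the weakness the slide points out:

```python
from statistics import mean, stdev

def three_sigma_outliers(values, k=3.0):
    """Flag every value outside [mu - k*sigma, mu + k*sigma].
    mu and sigma are computed on the full data, outliers included."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) > k * sigma]

data = [0.0] * 100 + [50.0]   # 100 well-behaved points plus one gross outlier
flagged = three_sigma_outliers(data)
```

On small datasets the same rule can fail entirely: in a sample of N points, the largest possible deviation from the mean is (N−1)/√N standard deviations, so a single outlier among, say, 7 points can never reach 3σ.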

Interval width and risk in outlier detection

  Interval             Risk (%)   u (sigma)
  [−3.5σ, 3.5σ]        0.05       3.5
  [−3σ, 3σ]            0.25       3.0
  [−2.6σ, 2.6σ]        1.00       2.6
  [−1.96σ, 1.96σ]      5.00       1.96
  [−σ, σ]              32.00      1.00
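The risk column is just the two-sided tail probability of the standard normal, 2(1 − Φ(u)); a quick check (the computed values match the slide up to its rounding):

```python
import math

def two_sided_risk(u):
    """P(|Z| > u) for a standard normal Z, via the error function."""
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(u / math.sqrt(2.0))))

for u in (3.5, 3.0, 2.6, 1.96, 1.0):
    print(f"u = {u}: risk = {100 * two_sided_risk(u):.2f}%")
```

For example, u = 1.96 gives 5.00% and u = 1 gives 31.73% (the slide rounds the latter to 32%).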

Outlier detection: example of a sequential procedure
History:
- WR Thompson: formulation of the test, 1935
- ES Pearson and C Chandra Sekar: domain of validity of the test, 1936
- FE Grubbs: tables of critical values, 1950
The Grubbs test:
- Compute the decision variables: G_1 = (r̄ − r_1) / s and G_N = (r_N − r̄) / s
- Compute the critical value: G_c = ((N−1)/√N) · √( t² / (N−2+t²) ), where t is the α/(2N) fractile of the Student distribution with N−2 degrees of freedom
- Take a decision: if G_1 > G_c, r_1 is an outlier; if G_N > G_c, r_N is an outlier
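The decision variables above are easy to compute; the sketch below compares them against a tabulated critical value (G_c ≈ 2.29 for N = 10 at α = 0.05) rather than re-deriving G_c from Student fractiles:

```python
from statistics import mean, stdev

def grubbs_statistics(values):
    """Decision variables from the slide: G_1 for the smallest value,
    G_N for the largest, both scaled by the sample standard deviation."""
    r = sorted(values)
    m, s = mean(r), stdev(r)
    return (m - r[0]) / s, (r[-1] - m) / s

def grubbs_flag(values, g_crit):
    """Flag the extreme value(s) whose statistic exceeds G_c."""
    g1, gn = grubbs_statistics(values)
    r = sorted(values)
    flagged = []
    if g1 > g_crit:
        flagged.append(r[0])
    if gn > g_crit:
        flagged.append(r[-1])
    return flagged

# N = 10; tabulated two-sided critical value ~2.29 at alpha = 0.05
sample = [1, 2, 2, 3, 3, 3, 4, 4, 5, 30]
```

In the sequential procedure this test is repeated: remove the flagged point, recompute the mean and standard deviation, and test again until nothing exceeds G_c.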

Outlier identification
- Outlier: compound 33, Melatonin (PubMed Id 12750432)

Exercise 5
Goal:
- Outlier identification
Software:
- Weka/Explorer
- xmodelanalyzerr
- InstantJChem
Files:
- ARFF files
Output:
- Log files of Weka/Explorer
- A new SDF without an outlier

Remove Melatonin from the training set
- In the InstantJChem software, click the Query button
- Set the query field Original CdId to "not in list 50" (the Original CdId of Melatonin)
- Click the Run Query button
- Click the Export button
- Save the exported SDF file as IUPHAR_5HT2B-33.sdf

Fragment the curated dataset
In the xfragmentor interface:
- If needed, click the Read XML button and read the file train_E_IIAB_R(2-4)_FC.xml
- Remove all SDF files
- Add the file IUPHAR_5HT2B-33.sdf
- Click the Run button

Load the curated dataset into the Weka/Explorer software Load the file IUPHAR_5HT2B-33_E_IIAB_R(2-4)_FC.arff

Re-fit the Gaussian Processes model
- Setup a Gaussian Processes model if needed
- Select the Supplied test set option and set the training file as the test set. This produces fitting results.
- Click the More options button

Re-evaluate the Gaussian Processes model
- Setup a Gaussian Processes model with an optimal noise value (for instance 4)
- Select the Cross-validation option and set the Folds value to 4
- Click the Start button

Gaussian Processes model on the curated dataset
- Fit and cross-validation performances are (slightly) improved

Summary
Single-step procedure:
- Fast and simple
- False estimation of the risk in outlier detection: the distribution parameters are distorted by the outliers
- The risk estimation is valid for only one point: the bigger the dataset, the greater the chance of interpreting a standard instance as an outlier (false positive). If P(X > a) = p for a single instance, the number of sampled instances with x_i > a follows a Binomial(N, p) distribution.
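The Binomial(N, p) remark can be made concrete: the probability that at least one perfectly normal instance falls outside the interval grows quickly with dataset size. The 88-compound figure below is an illustration using the 3σ tail probability p ≈ 0.0027:

```python
def prob_any_false_outlier(n, p):
    """P(at least one of n standard instances exceeds the bound),
    given the per-instance exceedance probability p."""
    return 1.0 - (1.0 - p) ** n

# 3-sigma rule (p ~ 0.27%) applied to a dataset of 88 compounds
p_any = prob_any_false_outlier(88, 0.0027)   # roughly 0.21
```

So even with the strict 3σ rule, a clean 88-compound dataset has roughly a one-in-five chance of producing at least one false outlier.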

Summary
Sequential procedure:
- Tedious: the whole procedure repeats after each identified outlier
- The distribution parameters are distorted by the outlier
Use robust statistics:
- Resampling by the half-means method or by the smallest half-volume method
- Egan WJ, Morgan SL, Anal. Chem., 1998, 70(11), 2372-2379
Identify multiple outliers:
- Rank instances by outlier likeliness
- Give a higher rank to instances that are consistently not fitted in consensus modeling

Conclusion
- Outlier detection may depend on the data modeling: molecular descriptors, machine-learning method. Therefore, a consensus model-based approach may help (pick the outliers that are common to several distinct models).
- Use internal and external cross-validation to avoid overfitting. Exploit the fitted values (rather than cross-validated predictions) for outlier detection.
- An outlier not confirmed as anomalous cannot be discarded: it may help understand the limitations of your model, and it might be a discovery!

Thanks

Thank you