Many Patterns & Many Methods

Similar documents
NEW POSSIBILITIES FOR X-RAY DIFFRACTOMETRY. Bernd Hasse Incoatec, Geesthacht, Germany

COMPARISON BETWEEN CONVENTIONAL AND TWO-DIMENSIONAL XRD

Polygonal Graphic Representations Applied to Semi-Quantitative Analysis for 3 and 4 Elements in XRF. Marcia Garcia and Rodolfo Figueroa

SIeve+ will identify patterns of various X-ray powder diffraction (XRPD) data files:

DI TRANSFORM. The regressive analyses. identify relationships

How to Analyze Materials

ANOMALOUS SCATTERING FROM SINGLE CRYSTAL SUBSTRATE

Parametric. Practices. Patrick Cunningham. CAE Associates Inc. and ANSYS Inc. Proprietary 2012 CAE Associates Inc. and ANSYS Inc. All rights reserved.

Chemometrics. Description of Pirouette Algorithms. Technical Note. Abstract

Raman Spectra of Chondrocytes in Cartilage: hyperspec s chondro data set

Clustering and Visualisation of Data

LECTURE 16. Dr. Teresa D. Golden University of North Texas Department of Chemistry

Dimension reduction : PCA and Clustering

10701 Machine Learning. Clustering

Solving Systems of Equations Using Matrices With the TI-83 or TI-84

Principal Component Analysis

AUTOMATED RIETVELD-ANALYSIS OF LARGE NUMBERS OF DATASETS

Network Traffic Measurements and Analysis

Matrices 4: use of MATLAB

Data Science and Statistics in Research: unlocking the power of your data Session 3.4: Clustering

Unsupervised learning in Vision

Cluster Analysis and Visualization. Workshop on Statistics and Machine Learning 2004/2/6

Distances, Clustering! Rafael Irizarry!

An introduction to plotting data

A Simple 2-D Microfluorescence

Vectors of movement, a new approach to cluster multidimensional big data on mobility

Year 8 End of Year Exams Revision List

Predict Outcomes and Reveal Relationships in Categorical Data

Statistical Pattern Recognition

Using OPUS to Process Evolved Gas Data (8/12/15 edits highlighted)

Hyperspectral Chemical Imaging: principles and Chemometrics.

Spectral Classification

The Power and Sample Size Application

CS 229: Machine Learning Final Report Identifying Driving Behavior from Data

Release by the International Centre for Diffraction Data (ICDD) of New Powder Data Mining Tools: the PDF-4/Full File and PDF-4/Organics Databases

Maths for Signals and Systems Linear Algebra in Engineering. Some problems by Gilbert Strang

Fundamentals of Rietveld Refinement III. Refinement of a Mixture

2. Use elementary row operations to rewrite the augmented matrix in a simpler form (i.e., one whose solutions are easy to find).

Technical Computing with MATLAB

CSE 6242 A / CS 4803 DVA. Feb 12, Dimension Reduction. Guest Lecturer: Jaegul Choo

8 th Grade Mathematics Unpacked Content For the new Common Core standards that will be effective in all North Carolina schools in the

Progenesis QI for proteomics User Guide. Analysis workflow guidelines for DDA data

Learning a Manifold as an Atlas Supplementary Material

Metabolomic Data Analysis with MetaboAnalyst

Progenesis LC-MS Tutorial Including Data File Import, Alignment, Filtering, Progenesis Stats and Protein ID

Importing data in a database with levels

Survey of Math: Excel Spreadsheet Guide (for Excel 2016) Page 1 of 9

Automated Crystal Structure Identification from X-ray Diffraction Patterns

Visual Analytics Tools for the Global Change Assessment Model. Michael Steptoe, Ross Maciejewski, & Robert Link Arizona State University

Data Analysis Guidelines

9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology

Scaling Techniques in Political Science

Graphical Analysis of Data using Microsoft Excel [2016 Version]

STATS306B STATS306B. Clustering. Jonathan Taylor Department of Statistics Stanford University. June 3, 2010

Statistical Package for the Social Sciences INTRODUCTION TO SPSS SPSS for Windows Version 16.0: Its first version in 1968 In 1975.

Week 7 Picturing Network. Vahe and Bethany

Statistical Pattern Recognition

Final Exam Assigned: 11/21/02 Due: 12/05/02 at 2:30pm

Understanding Clustering Supervising the unsupervised

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering

VIDAEXPERT: DATA ANALYSIS Here is the Statistics button.

CIS 520, Machine Learning, Fall 2015: Assignment 7 Due: Mon, Nov 16, :59pm, PDF to Canvas [100 points]

Data Preprocessing. Javier Béjar AMLT /2017 CS - MAI. (CS - MAI) Data Preprocessing AMLT / / 71 BY: $\

Tools for Monitoring and Controlling Uniformity of Solid Dosage Forms

8 th Grade Pre Algebra Pacing Guide 1 st Nine Weeks

Data Clustering. Danushka Bollegala

Exploratory Data Analysis using Self-Organizing Maps. Madhumanti Ray

Lecture 16: High-dimensional regression, non-linear regression

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

COMBINED METHOD TO VISUALISE AND REDUCE DIMENSIONALITY OF THE FINANCIAL DATA SETS

( ) =cov X Y = W PRINCIPAL COMPONENT ANALYSIS. Eigenvectors of the covariance matrix are the principal components

Assessing the homo- or heterogeneity of noisy experimental data. Kay Diederichs Konstanz, 01/06/2017

Statistical Pattern Recognition

Key Stage 3 Curriculum

Recognition, SVD, and PCA

HW4 VINH NGUYEN. Q1 (6 points). Chapter 8 Exercise 20

Machine Learning : supervised versus unsupervised

Raman Spectra of Chondrocytes in Cartilage: hyperspec s chondro data set

Customizable information fields (or entries) linked to each database level may be replicated and summarized to upstream and downstream levels.

Stage 7 Checklists Have you reached this Standard?

The latest trend of hybrid instrumentation

Clustering CS 550: Machine Learning

Data preprocessing Functional Programming and Intelligent Algorithms

Package SC3. September 29, 2018

Using Excel for Graphical Analysis of Data

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search

Step-by-Step Guide to Relatedness and Association Mapping Contents

AN IMPROVED HYBRIDIZED K- MEANS CLUSTERING ALGORITHM (IHKMCA) FOR HIGHDIMENSIONAL DATASET & IT S PERFORMANCE ANALYSIS

ADAPTIVE GRAPH CUTS WITH TISSUE PRIORS FOR BRAIN MRI SEGMENTATION

Data Preprocessing. Javier Béjar. URL - Spring 2018 CS - MAI 1/78 BY: $\

SAS/STAT 13.1 User s Guide. The Power and Sample Size Application

Visual Analytics. Visualizing multivariate data:

WITec Suite FIVE. Project FIVE Control FIVE Project FIVE+

MATH 423 Linear Algebra II Lecture 17: Reduced row echelon form (continued). Determinant of a matrix.

Mathematics Expectations Page 1 Grade 06

ECLT 5810 Clustering

COSC160: Detection and Classification. Jeremy Bolton, PhD Assistant Teaching Professor

MATH (CRN 13695) Lab 1: Basics for Linear Algebra and Matlab

Version 2.4 of Idiogrid

Step Change in Design: Exploring Sixty Stent Design Variations Overnight

Transcription:

Many Patterns & Many Methods New methods for visualising & utilising multiple analysis techniques in polymorph and salt screening systems 2009 Gordon Barr, Chris Gilmore & Gordon Cunningham WestCHEM, Chemistry Department, University of Glasgow www.chem.gla.ac.uk/snap

This document was presented at PPXRD - Pharmaceutical Powder X-ray Diffraction Symposium Sponsored by The International Centre for Diffraction Data This presentation is provided by the International Centre for Diffraction Data in cooperation with the authors and presenters of the PPXRD symposia for the express purpose of educating the scientific community. All copyrights for the presentation are retained by the original authors. The ICDD has received permission from the authors to post this material on our website and make the material available for viewing. Usage is restricted for the purposes of education and scientific research. PPXRD Website www.icdd.com/ppxrd ICDD Website - www.icdd.com

The Problem High throughput screening experiments can generate hundreds of PXRD patterns a day Problems with: Data quality. Sample quality. Data quantity. Need for automation, and speed. How do you deal with hundreds of samples from a single technique (e.g. XRPD), let alone more than one at once?

How to cluster powder patterns? Compare pairs of patterns using full-profile parametric and non-parametric statistics Match every data point forget about the peaks! Use correlation coefficients: Pearson correlation coefficient (parametric). Spearman correlation coefficient (non-parametric). Correlation coefficient +1.0 Correlation coefficient -1.0

How to cluster powder patterns? Match two patterns: -> Get a correlation coefficient Match n patterns: Pattern A matches Pattern B with a correlation of: 0.314 -> Get a correlation between every pair of patterns -> can build a n x n correlation matrix

Correlations and Distances Have a correlation matrix Convert correlations to distances: Correlation = 1.0 distance = 0.0 Correlation = -1.0 distance = 1.0 Correlation = 0.0 distance = 0.5 Take the distance matrix and perform: Cluster analysis, Principal components analysis, Metric multidimensional scaling, Fuzzy clustering, Minimum spanning trees etc. To find interesting patterns and to visualize the data.

Methodology n XRPD Patterns Optional Preprocessing Full profile matching all patterns against all patterns nxn Correlation Matrix nxn Distance Matrix PCA Principal Components Analysis MMDS Metric Multi- Dimensional Scaling Clustering via Dendrograms Estimate number of clusters Identify possible mixtures Identify Most Representative Patterns for each cluster Cluster visualisation tools Colour-coded Cell Display

Example: Doxazosin Also indexed as: Cardura XL, Cardura Doxazosin is a member of the alpha blocker family of drugs used to lower blood pressure in people with hypertension. Doxazosin is also used to treat symptoms of benign prostatic hyperplasia (BPH). Study performed using 21 patterns of 5 polymorphic forms of Doxazosin Cut Level

Metric multidimensional scaling (MMDS)

Example: Carbamazepine <- No background subtraction <- Normal background subtraction Strict background subtraction ->

A1 H1 A2 H2 A3 H3 A4 H4 A5 H5 A6 H6 A7 H7 Alpha Beta 30 % alpha, 70 % beta Gamma-Beta mixture DMSO Dioxane 1-4 Form II Dioxane Data collection by Leo Woning, Bruker-AXS

2000 Pattern Dendrogram

Raman data works too! I'd cast my eye over the spectra and have done a spectral comparison of the data by eye. I INDEPENDENTLY came up with five different spectral groups. So bottom line is PolySNAP using background subtraction routines gave EXACTLY the same result as me doing a spectral comparison by eye..thought you all should know that IMHO this is a significant step forward. Don Clark, Pfizer Global R&D

Raman Data Differences Different background types Much smaller differences between patterns Cosmic spike problems XRPD Form A Form B Raman

Raman Example 3 form pharma

Different Data Types Doesn t have to be PXRD or Raman data: IR DSC Other Profile Data Numeric Data XRF

Multiple datasets Combined XRPD + Raman instruments now available Applying multiple techniques to the same samples helps give additional information to work with How would we actually combine results from two (or more) such different techniques?

Methodology n Full profile matching nxn nxn XRD results XRPD Patterns all patterns against all patterns Correlation Matrix Distance Matrix Combined results Combine nxn Distance Matrix n Full profile matching nxn nxn Raman Patterns all patterns against all patterns Correlation Matrix Distance Matrix Raman results

Combining Datasets Manual weighting: Give a single weight to each dataset as a whole Combine datasets on that basis e.g. Powder 0.8, Raman 0.2 Dynamic weighting: Automatically calculate optimal weighting for each entry in each dataset

Dynamic Weighting Each data set has a 2-D distance matrix d D k is square (nxn) distance matrix for dataset k e.g. we have Raman and XRPD data on 20 samples, so k = 2, n=20. We want a Group Average Matrix G to optimally describe our data Specify diagonal weight matrices W k which can vary over the k datasets

Dynamic Weighting We want to minimise (1) Where (a double-centering operation on D), and Solve (1) to get best values for G and W

Example Bryne, D.V., O'Sullivan, Dijksterhuis, G.B., Bredie, W.L.P. & Martens, M. (2001) Food Quality Pref. 12, 171-187. Attempt to develop a sensory vocabulary to describe the flavours of warmed up meat patties (!) The data were assessed by 8 assessors ( = no. of data types) judging 6 products ( = no. of data sets for each type) using 20 attributes ( = no. of points in each data set). So you have 8 D matrices. The group average G in 2 dimensions gives:

Example A, B...F are the six patties. It can be seen that D and E are judged to be very similar but there is a big difference in the others.

Example: Combining Four Techniques Dataset of Sulphathiazol, Carbamazepine + Mixtures 16 samples each had data from: 1. PXRD (collected on a Bruker C2 GADDS) 2. DSC (collected on a TA instruments Q100) 3. IR (collected on a JASCO FT/IR 4100) 4. Raman (collected on a Renishaw invia Reflex) Combinations: PXRD+Raman PXRD+Raman+DSC PXRD+Raman+DSC+IR. etc. [up to 15 sets of results!]

Side by side: Dendrograms

Side by side: 3D MMDS

Side by side: 3D MMDS

Combined Data: All Four

Live Demo Multiple Datasets

Combined Conclusions Full Profile Matching + Cluster analysis methods do very well in distinguishing forms automatically using either Raman or PXRD data individually Combined results using Dynamic Weighting seem to do better than either PXRD or Raman individually Use of combined data helps highlight any inconsistencies in separate analyses Such inconsistencies would not be obvious with only one data source Outliers can then be examined manually in detail Seeing similar clustering from multiple original data sources increases confidence in the overall results

Pre-screening large datasets Full analysis as shown limited to up to 2,000 patterns per data set. What if you ve got more? Is this new sample something seen before, or new? Pre-screening allows a single sample pattern to be compared to large in-house database of existing patterns. Compare e.g. >66,000 samples to new unknown in ~20 mins Return the best 50 matches, then visualise using dendrograms, 3D Plots etc as before

Salt Screening Mode Salt Screening: not interested in samples consisting of One of our starting materials Mixture of multiple starting materials Given a library of starting materials to compare the new samples to: Just highlight what s new and interesting

How do I do this? PolySNAP Matlab or other stats packages dsnap -Cluster & visualise 3D fragment geometry similarities from the Cambridge Structural Database H H O H 3 C 2 C 4 OH 6 R 1 1 R 2 5

Acknowledgements Many thanks to. Arnt Kern & Karsten Knorr, Bruker AXS Chris Frampton & Susie Buttar, Pharmorphix For more information, please contact us: Email: Web: snap@chem.gla.ac.uk www.chem.gla.ac.uk/snap