Large and Sparse Mass Spectrometry Data Processing in the GPU Jose de Corral 2012 GPU Technology Conference


Agenda
- Overview of LC/IMS/MS
- 3D Data Processing
- 4D Data Processing
- Results and Conclusions

Overview of LC/IMS/MS

Overview of LC/IMS/MS
What is LC/IMS/MS?
- Liquid Chromatography (LC)
- Ion Mobility Separation (IMS)
- Mass Spectrometry (MS)
- A combined analytical technique with unparalleled separation and identification
Applications
- Biopharmaceutical
- Life Sciences
- Food safety
- Environmental protection
- Clinical, Chemical, Materials

Overview of LC/IMS/MS
- LC/IMS/MS instruments produce and measure ions of the analytical sample
- Output is a 4D plot of ion intensity vs. time, mass-to-charge ratio (m/z), and ion mobility (drift)
[Figure: 3D plot with axes time, m/z, and drift]

Overview of LC/IMS/MS
How is the data acquired?
- LC/IMS/MS instruments generate mass scans of m/z values (~5 ms)
- 200 consecutive scans represent 200 mobility (drift) values at a point in time
- Blocks of 200 scans are acquired at each time value
- Scan data show sharp peaks, each corresponding to an ion
- Data is very noisy
- Scans have many zero-intensity values (sparse)
- Scans are stored in compressed form
  o Simple compression (no zeros)
  o Similar to Yale sparse matrix compression

The Data Size Problem
- Uncompressed data size is huge
  o Typical analysis: 2,000 time values x 200,000 m/z values x 200 drift values = 80 billion data points
  o This is often larger and will increase in the near future
- Even when compressed, data size is large (several GB)
  o Can't store the entire data set in device memory (single GPU)
  o May even be a problem to store the entire data set in main memory
- Data must be retrieved and processed in reasonably sized tiles
  o But there are minimum tile dimensions due to the nature of the computation
  o Tiles can't be made arbitrarily small
- A challenge for GPU processing

Objective of LC/IMS/MS Data Processing: Ion Detection
- Find the location and intensity of every ion (every peak) in the volume of output data
- Compute properties of each peak found
- Generate a peak list
- Since the data is noisy, it has to be filtered to find peaks precisely
  o Run convolution filters along all three axes (time, m/z, and drift)
- Computing convolution filters over the entire (uncompressed) data is impractical
  o Too many data points
  o An alternate method is needed

Ion Detection in the GPU
- Broken down into two major steps: 3D and 4D Processing
- Takes advantage of sparsity and the nature of the data (how mobility is acquired)
3D Processing
- 4D data is collapsed to 3D by summing data along the drift axis
- Data is still sparse
- Peaks are detected in 3D space
[Figure: 4D data (drift, m/z, time) collapsed into a 3D plot (intensity, m/z, time)]

Ion Detection in the GPU
4D Processing
- Uses peaks detected in 3D Processing
- Builds a volume of 4D data around each peak
- Peaks are detected in 4D space inside each volume
[Figure: search volumes in (drift, m/z, time) space around the detected peaks]

3D Data Visualization
- Data size: 3,745 (time) x 172,289 (mass) x 200 (drift) = 129 billion data points
- Compressed data file: 9 GB
- Data points in 3D: 645 million

3D Data Processing

3D Processing Steps
- Pre-processing
  o Computes filter coefficients and other required parameters
  o One-time computation done in the CPU
- Read raw data
  o Decompresses the data and collapses (sums) the drift axis
  o Currently done in the CPU
- Filter data
  o Runs convolution filters along the mass and time axes
- Peak detection
  o Finds peak apices in the filtered data
- Peak properties computation
  o Computes properties of each detected peak

Data Flow 3D Processing
Filter
- Eight types of data
- Five convolution filters of three types
  o Smooth, second derivative (deriv2), and combined (smooth + deriv2)
- Filters along both axes
  o mass (ms) and time (tm)
3D Processing Peak Detection and Peak Properties
- Use five of those eight types of data
- Generate the peak list
[Diagram: rawdata passes through the MS_SMOOTH, MS_DERIV2, TM_DERIV2, TM_COMBINED, and TM_SMOOTH filters to produce mssmooth, msderiv2, step1, output, msshape, tmshape, and step2, which feed Peak Detection (peaks found) and Peak Properties (peak list)]

Scan Packs
- Data is grouped in Scan Packs
  o Consecutive mass scans
  o Covers a section of the time axis
  o The number of scans depends on the data to process (64 to 100 typical)
- If needed, each scan pack is broken into Mass Sectors
  o Covers a section of the mass axis
  o The number of sectors depends on the device memory available (1 or 2 sectors typical)
[Figure: a mass vs. time grid of scans 0 .. N-1, split into Mass Sector 0 (mass rows 0 .. M/2-1) and Mass Sector 1 (mass rows M/2 .. M-1)]

Scan Packs
- There are scan packs for each of the eight types of data mentioned (raw data scan pack, mssmooth scan pack, ...)
- Processing is done one scan pack at a time
  o One input scan pack, one output scan pack
- At least two consecutive scan packs are maintained in device memory for each type of data (three for some)
[Figure: Previous, Current, and Next Scan Packs along the time axis]

Scan Packs
- For each output data point, convolution filters require as many input data points as the number of filter coefficients
  o The filter boundary is defined here as the number of coefficients minus one
  o Additional boundary input data points are needed after the point being filtered
- For filters in the mass direction
  o The filter boundary only imposes some overlap of mass sectors
  o But scan packs can be filtered independently
[Figure: the input Current Scan Pack passes through the ms filters to produce the filtered Current Scan Pack]

Scan Packs
- This applies to the two mass axis filters
- Only the current scan pack must be in memory
[Diagram: the MS_SMOOTH and MS_DERIV2 stages read the current rawdata scan pack and write the current mssmooth and msderiv2 scan packs]

Scan Packs
- For filters in the time direction
  o The filter boundary requires part of the next scan pack to filter the current scan pack
  o Scan packs cannot be filtered independently
[Figure: the time filters read the input Current Scan Pack plus boundary data from the input Next Scan Pack to produce the filtered Current Scan Pack]

Scan Packs
- This applies to the three time axis filters
- Both the current and the next scan packs must be in memory
[Diagram: the TM_DERIV2, TM_COMBINED, and TM_SMOOTH stages read the current and next rawdata, mssmooth, and msderiv2 scan packs and write the current step1, output, msshape, tmshape, and step2 scan packs]

Scan Packs
- Peak detection uses three consecutive scans of the output scan pack
- Data in the previous and next scan packs is needed to find peaks at the edges
  o At least one scan of output data from each of these scan packs must be in memory
[Figure: the output Current Scan Pack, the last scan of the Previous Scan Pack, and the first scan of the Next Scan Pack]

Scan Packs
- Similarly, peak properties computation uses the tmshape scan pack to extract a profile, or shape, of each peak along the time axis
- Data in the previous and next scan packs is needed to extract the profile of peaks found near the edges
  o The previous, current, and next scan packs of tmshape data must be in memory
[Figure: the tmshape Current Scan Pack in the mass vs. time plane]

Scan Packs
- The length of this profile determines the minimum scan pack size
  o Guarantees that no more than three tmshape scan packs are needed
[Figure: the previous, current, and next tmshape scan packs; time shape profile length: 21; minimum scan pack size: 10]

Scan Packs
- Resulting device memory needed for the eight types of data
- All scan packs have to be aligned
  o Peak properties computation needs matching points for each data type
  o All scan packs are delayed to keep the next scan pack in memory
[Figure: Previous, Current, and Next Scan Packs held for tmshape, step1, step2, msshape, output, rawdata, mssmooth, and msderiv2]

Data Filter
- Two-dimensional thread blocks are used
  o Mass along the x dimension
  o Time along the y dimension
- Currently each thread filters one data point
- Prefer to exactly fit a whole number of thread blocks in the scan pack
  o No wasted threads
  o The scan pack size is set to meet this requirement (our choice)
[Figure: a 4 x 3 grid of thread blocks (TB 00 .. TB 32) covering the scan pack, mass along X and time (scans) along Y]

Data Filter
- Each thread uses almost the same input data points as adjacent threads in the direction of the filter
  o Nature of convolution filters
- Dimension the thread blocks to maximize data reuse
  o blockDim.x > blockDim.y for mass filters (along the mass axis)
  o blockDim.y > blockDim.x for time filters (along the time axis)
[Figure: wide blocks for mass filters and tall blocks for time filters in the mass (X) vs. time (Y) plane]

Data Filter
- Thread block dimensions are chosen
  o Dependent on the GPU used
  o To favor data reuse as described
  o But we still prefer to exactly fit a whole number of blocks in the scan pack
- Two types of blocks with different form factors
  o Compute a virtual block whose dimensions are the least common multiple of both block shapes (sketched below)
  o Use the virtual block only to
    - Set the scan pack size to a multiple of the virtual block y dimension
    - Pad the scans with zeros to a multiple of the virtual block x dimension
[Figure: mass filter block (32 x 8) and time filter block (16 x 32) combine into a virtual block (32 x 32)]
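The following host-side sketch illustrates the virtual-block sizing just described. The 32 x 8 and 16 x 32 block shapes come from the slide; the helper names, the example sizes, and the rounding policy are illustrative assumptions, not the Waters implementation.

    #include <cstdio>

    static int gcd(int a, int b) { return b == 0 ? a : gcd(b, a % b); }
    static int lcm(int a, int b) { return a / gcd(a, b) * b; }
    static int roundUp(int v, int m) { return ((v + m - 1) / m) * m; }

    int main() {
        // Block shapes from the slide: mass filter 32 x 8, time filter 16 x 32.
        const int massBlockX = 32, massBlockY = 8;
        const int timeBlockX = 16, timeBlockY = 32;

        // Virtual block: least common multiple of both shapes in each dimension.
        const int virtX = lcm(massBlockX, timeBlockX);   // 32
        const int virtY = lcm(massBlockY, timeBlockY);   // 32

        // Hypothetical raw sizes: pad each scan (mass points) with zeros to a
        // multiple of virtX, and round the scan pack (number of scans) to a
        // multiple of virtY, so both filter kernels tile the pack exactly.
        int scanLength = 180000, requestedScans = 70;
        int paddedScanLength = roundUp(scanLength, virtX);
        int scanPackScans    = roundUp(requestedScans, virtY);

        printf("virtual block %dx%d, padded scan length %d, scan pack %d scans\n",
               virtX, virtY, paddedScanLength, scanPackScans);
        return 0;
    }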

Mass Axis Filter
- The mass axis convolution filters (ms filters)
  o Use a different set of filter coefficients for each point along the mass axis
  o The (odd) number of coefficients increases with mass
  o All coefficient sets are organized in a matrix in device memory
    - Coefficients are symmetrical (only half the coefficients are stored)
  o Two matrices are needed, one for each filter type (smooth and deriv2)

Coefficients matrix (half-stored, zero-padded):
Mass point   Stored coefficients          Num coeffs
0            C0_0  C0_1  0      0         3
1            C1_0  C1_1  0      0         3
2            C2_0  C2_1  0      0         3
3            C3_0  C3_1  0      0         3
4            C4_0  C4_1  C4_2   0         5
5            C5_0  C5_1  C5_2   0         5
6            C6_0  C6_1  C6_2   0         5
7            C7_0  C7_1  C7_2   0         5
8            C8_0  C8_1  C8_2   C8_3      7
9            C9_0  C9_1  C9_2   C9_3      7
10           C10_0 C10_1 C10_2  C10_3     7
11           C11_0 C11_1 C11_2  C11_3     7

Mass Axis Filter
- One kernel computes both mass axis filters in one invocation
  o Raw data is copied to shared memory
  o The shared memory input is used for both filters
- Both coefficient matrices are bound to linear textures
  o Too big for 2D textures
[Diagram: the 3D filter pipeline with the MS_SMOOTH and MS_DERIV2 stages highlighted]
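A simplified sketch of such a fused kernel is shown below: one thread per output point evaluates both the smooth and deriv2 filters from the same input, applying the half-stored symmetric coefficient rows. It is a minimal illustration under assumed names and layouts; it reads global memory directly and clamps at the edges, whereas the talk stages the input in shared memory, binds the coefficient matrices to linear textures, and takes boundary data from neighboring sectors.

    #define MAX_HALF 32   // illustrative upper bound on the stored half filter length

    __global__ void msFilterBoth(const float* __restrict__ raw,      // numScans x scanLen
                                 const float* __restrict__ smoothC,  // scanLen x MAX_HALF, half-stored
                                 const float* __restrict__ deriv2C,  // scanLen x MAX_HALF, half-stored
                                 const int*   __restrict__ halfLen,  // per-mass-point half width
                                 float* msSmooth, float* msDeriv2,
                                 int scanLen, int numScans)
    {
        int m = blockIdx.x * blockDim.x + threadIdx.x;   // mass index
        int t = blockIdx.y * blockDim.y + threadIdx.y;   // scan (time) index
        if (m >= scanLen || t >= numScans) return;

        const float* row = raw + (size_t)t * scanLen;
        int h = halfLen[m];                              // 1, 2, 3 -> 3, 5, 7 taps
        float s = smoothC[m * MAX_HALF] * row[m];        // center coefficient
        float d = deriv2C[m * MAX_HALF] * row[m];
        for (int k = 1; k <= h; ++k) {
            int lo = max(m - k, 0), hi = min(m + k, scanLen - 1);   // clamp at edges
            float pair = row[lo] + row[hi];              // symmetric coefficients
            s += smoothC[m * MAX_HALF + k] * pair;
            d += deriv2C[m * MAX_HALF + k] * pair;
        }
        msSmooth[(size_t)t * scanLen + m] = s;
        msDeriv2[(size_t)t * scanLen + m] = d;
    }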

Time Axis Filter
- The time axis convolution filters
  o Use the same set of filter coefficients for all points along the time axis
  o Three coefficient sets are needed, one for each type of filter
- One kernel computes two of the three time axis filters
  o Also computes the two averaged outputs to avoid additional reads
[Diagram: the 3D filter pipeline with two of the time filter stages highlighted]

Time Axis Filter
- A second kernel computes the third time axis filter
- Each filter's input data is copied to shared memory
- All filter coefficients are copied to constant memory
[Diagram: the 3D filter pipeline with the remaining time filter stage highlighted]
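Below is a minimal sketch of keeping a time-filter coefficient set in constant memory, where every thread of a warp reads the same coefficient per iteration (the broadcast pattern constant memory favors). The names, the maximum size, and the edge clamping are assumptions; in the real pipeline the boundary samples come from the next scan pack rather than being clamped, and the input is staged through shared memory.

    #include <cuda_runtime.h>

    #define MAX_TM_COEFFS 63   // assumed upper bound on the time filter length

    __constant__ float c_tmCoeffs[MAX_TM_COEFFS];
    __constant__ int   c_tmNumCoeffs;

    // Host side: upload one coefficient set before launching the kernel.
    void uploadTimeCoeffs(const float* h_coeffs, int n)
    {
        cudaMemcpyToSymbol(c_tmCoeffs, h_coeffs, n * sizeof(float));
        cudaMemcpyToSymbol(c_tmNumCoeffs, &n, sizeof(int));
    }

    // Device side: apply the filter along the time axis at mass index m, scan t.
    __device__ float applyTimeFilter(const float* data, int m, int t,
                                     int scanLen, int numScans)
    {
        int half = (c_tmNumCoeffs - 1) / 2;
        float acc = 0.0f;
        for (int k = -half; k <= half; ++k) {
            int tt = min(max(t + k, 0), numScans - 1);   // clamp (simplification)
            acc += c_tmCoeffs[k + half] * data[(size_t)tt * scanLen + m];
        }
        return acc;
    }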

Peak Detection
- Finds peak apices in the output scan pack
- An apex is found at a point if it is a local maximum
  o Intensity is above a threshold
  o Intensity is greater than that of its eight neighbor points
- Each detected peak is identified by its coordinates within the scan pack
[Diagram: output, step1, step2, tmshape, and msshape feed Peak Detection (peaks found) and Peak Properties (peak list)]

Peak Detection
- Peak detection kernel (sketched below)
  o Uses square thread blocks to maximize data reuse
  o output scan packs are read from textures
  o Each thread tests one point
- Peak detection array
  o An array of detected peak coordinates
  o A counter is incremented for each detected peak
    - Using an atomic operation
    - Determines the location of the peak in the peak detection array
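The sketch below shows the apex test and the atomic append described above. Names are assumptions, the input is read from global memory instead of textures, and edge scans (which the real code handles with the previous/next scan packs) are simply skipped.

    __global__ void detectPeaks3D(const float* __restrict__ output,  // numScans x scanLen
                                  int scanLen, int numScans, float threshold,
                                  int2* peakCoords, int* peakCount, int maxPeaks)
    {
        int m = blockIdx.x * blockDim.x + threadIdx.x;   // mass index
        int t = blockIdx.y * blockDim.y + threadIdx.y;   // scan (time) index
        if (m <= 0 || m >= scanLen - 1 || t <= 0 || t >= numScans - 1) return;

        float v = output[(size_t)t * scanLen + m];
        if (v <= threshold) return;

        // Apex test: strictly greater than all eight neighbors.
        for (int dt = -1; dt <= 1; ++dt)
            for (int dm = -1; dm <= 1; ++dm) {
                if (dt == 0 && dm == 0) continue;
                if (v <= output[(size_t)(t + dt) * scanLen + (m + dm)]) return;
            }

        int idx = atomicAdd(peakCount, 1);               // reserve a slot in the array
        if (idx < maxPeaks) peakCoords[idx] = make_int2(m, t);
    }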

Peak Properties Computation
- Compute a set of properties for each detected peak
- Use the peak detection array and all five scan pack outputs from the filter step
[Diagram: output, step1, step2, tmshape, and msshape feed Peak Detection (peaks found) and Peak Properties (peak list)]

Peak Properties Computation
- Peak properties are saved in device memory arrays
  o One array per peak property type
  o All the same size (one element per detected peak)
  o Contain peak properties from detected peaks in all scan packs
[Table: peak indices 0 .. P-1 against the peak detection coordinates and peak property types 1, 2, 3]

Peak Properties Computation
- Two groups of peak properties
- The first group is computed from the output, step1, and step2 scan packs
  o Intensity of the apex
  o Exact time and mass of the apex
  o Peak exclusion rules
[Diagram: output, step1, step2, tmshape, and msshape feed Peak Detection and Peak Properties]

Peak Properties Computation
- One kernel computes the first group of peak properties (sketched below)
  o Uses one-dimensional thread blocks and grid
  o Input scan packs are bound to linear textures
  o Each thread computes properties for one peak
- The coordinates from the peak detection array
  o Used to read from the scan packs at the right location
- The index into the peak detection array
  o Used to address the peak property arrays
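A minimal sketch of this one-thread-per-peak kernel follows. The property formulas are not given in the talk, so only the apex intensity is filled in; the scan packs are read from global memory here rather than from the linear textures the slide mentions, and all names are assumptions.

    __global__ void peakProperties3D(const int2* __restrict__ peakCoords, int numPeaks,
                                     const float* __restrict__ output,
                                     const float* __restrict__ step1,
                                     const float* __restrict__ step2,
                                     int scanLen,
                                     float* apexIntensity, float* apexMass, float* apexTime)
    {
        int p = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per detected peak
        if (p >= numPeaks) return;

        int2 c = peakCoords[p];                          // (mass, scan) within the scan pack
        size_t at = (size_t)c.y * scanLen + c.x;

        apexIntensity[p] = output[at];
        // The real kernel refines the exact apex time and mass from the step1 and
        // step2 scan packs and applies exclusion rules; those formulas are not in
        // the talk, so the integer coordinates are recorded here instead.
        apexMass[p] = (float)c.x;
        apexTime[p] = (float)c.y;
        (void)step1; (void)step2;
    }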

Peak Properties Computation
- A second group of peak properties is computed from the tmshape and msshape scan packs
  o Exact time and mass of the apex (from the profile)
  o Inflection points of the peak
  o Start and stop points of the peak
  o FWHM (full width at half maximum) of the peak
  o Area of the peak
  o Signal-to-noise ratio of the peak
  o Additional peak exclusion rules
- These peak properties are computed twice
  o One set of properties along the time axis (using tmshape)
  o A second set of properties along the mass axis (using msshape)
- Computation of these peak properties is complex and will not be discussed here
  o A subject for a talk by itself
  o Several kernels are involved
[Diagram: output, step1, step2, tmshape, and msshape feed Peak Detection and Peak Properties]

4D Data Processing

Parent Peaks
- Peaks detected in 3D Processing
  o Determine where we search for peaks in 4D Processing (child peaks)
- For each parent peak, specify a search volume with dimensions
  o In mass and time, set by the peak's start/stop points found in 3D Processing
  o In drift, the entire drift range
[Figure: a parent peak in the (mass, time) plane with child peaks spread along the drift axis]

Parent Peaks
- Parent peak volumes are filled with raw data, then filtered along each axis
- Filter boundaries require filling a larger volume (the data volume)
- The volume dimension in each axis is reduced after the filter (the output volume)
[Figure: the data volume shrinking to the output volume after the time, mass, and drift filters]

Parent Peaks
- A third type of volume (the buffer volume) is larger than any data volume
  o The same dimensions for all parent peaks
  o A common memory allocation size
  o Helps process volumes in parallel
- Each parent peak
  o Receives an allocation of buffer volume dimensions
  o Fills only its data volume dimensions with raw data
  o Filters the volume, resulting in its output volume dimensions

Parent Peaks
- Pack a group of buffer volumes in a device memory buffer
  o Helps process a group of parent peaks in parallel
  o Three of these buffers are required for processing
  o A large device memory allocation (keep the group small)
- Buffer volume data is organized in rows in the memory buffer
  o One row per parent peak
  o Kernels must convert between 3D and linear coordinates (see the sketch below)
[Figure: parent peak indices 0 .. P-1 within the group, each row holding buffer volume elements 0 .. B-1]
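A small sketch of that 3D-to-linear conversion, assuming one flat row per parent peak and an illustrative ordering (mass fastest, then drift, then time); the actual layout used in the Waters code is not specified in the talk.

    struct VolumeDims { int nMass, nDrift, nTime; };

    __host__ __device__ inline size_t toLinear(const VolumeDims& d, int m, int df, int t)
    {
        return ((size_t)t * d.nDrift + df) * d.nMass + m;
    }

    __host__ __device__ inline void toCoords(const VolumeDims& d, size_t idx,
                                             int& m, int& df, int& t)
    {
        m  = (int)(idx % d.nMass);
        df = (int)((idx / d.nMass) % d.nDrift);
        t  = (int)(idx / ((size_t)d.nMass * d.nDrift));
    }

    // A point of parent peak p in the group buffer:
    //   groupBuffer[(size_t)p * rowStride + toLinear(dims, m, df, t)]
    // where rowStride is the buffer volume size common to all parent peaks.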

Data Organization
- Processing is done in scan packs of compressed 4D data
  o Process all parent peaks with volumes falling inside the scan pack
  o The largest volume dimension in time determines the scan pack size
  o A large device memory allocation (two scan packs required)
- Parent peaks are sorted by stop time
  o Guarantees that all volumes with a stop time falling inside the current scan pack can be processed with the current and the previous scan packs
[Figure: previous and current scan packs in the mass vs. time plane; volumes extending past the current scan pack will be processed in the next scan pack]

Thread Blocks
- Three-dimensional thread blocks and grid are used to filter volumes
  o x (mass), y (drift), z (time)
  o Dimensions are set to favor each type of computation
  o Each thread filters multiple data points
[Figure: example of a volume covered by 27 thread blocks indexed over mass (x), drift (y), and time (z)]

Thread Blocks
- However, this three-dimensional grid is virtual (the design predates support for three-dimensional grids)
- The actual grid is two-dimensional and covers all volumes in a group of parent peaks
  o blockIdx.y = parent peak index in the group
  o blockIdx.x = the 3D blocks of each volume, distributed linearly
  o Kernels must convert from linear to 3D block indices (see the sketch below)
[Figure: a two-dimensional grid spanning Volume 0, Volume 1, ...]
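A sketch of that conversion follows; the helper name and the per-volume block counts are assumptions.

    __device__ inline void linearBlockTo3D(int linearBlock, int blocksX, int blocksY,
                                           int& bx, int& by, int& bz)
    {
        bx = linearBlock % blocksX;
        by = (linearBlock / blocksX) % blocksY;
        bz = linearBlock / (blocksX * blocksY);
    }

    // Inside a volume kernel:
    //   int peak = blockIdx.y;                      // parent peak index in the group
    //   int bx, by, bz;
    //   linearBlockTo3D(blockIdx.x, blocksX, blocksY, bx, by, bz);
    //   int m  = bx * blockDim.x + threadIdx.x;     // mass
    //   int df = by * blockDim.y + threadIdx.y;     // drift
    //   int t  = bz * blockDim.z + threadIdx.z;     // time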

Fill Volumes
- Parent peak volumes are filled with raw data
  o From compressed scan packs
  o All parent peak volumes in a group (a volume buffer)
  o Decompression is done in the GPU (only the volumes, not entire scans)
- Raw data comes compressed, without zero-intensity points
  o All non-zero intensities from 200 mobility scans are packed in a block
  o A companion block of mass indices specifies the location of the non-zero values
  o Each pair of these blocks makes up a regular mass scan (a point in time)
[Figure: paired blocks of mass indices and intensities for scan 0, scan 1, scan 2, ..., scan N-1]
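The sketch below illustrates this index/intensity pairing for a single mobility scan and how it expands back to a dense scan; field names are assumptions, and the bookkeeping that packs all 200 mobility scans of a time point into one block pair is omitted because the talk does not detail it.

    struct CompressedScan {
        const float* intensity;   // non-zero intensities, packed
        const int*   massIndex;   // mass index of each non-zero intensity
        int          count;       // number of non-zero values in this scan
    };

    // Expanding one compressed scan into a dense array (host-side illustration).
    void expandScan(const CompressedScan& s, float* dense, int scanLen)
    {
        for (int m = 0; m < scanLen; ++m) dense[m] = 0.0f;        // sparse -> zeros
        for (int i = 0; i < s.count; ++i) dense[s.massIndex[i]] = s.intensity[i];
    }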

Fill Volumes
- All blocks from the scans needed to build a scan pack are organized differently in device memory
  o All intensity blocks are consecutive in one array
  o All mass index blocks are consecutive in another array
  o Easier to address mobility scans inside the kernel
[Figure: an array of mass index blocks and an array of intensity blocks for scan 0 .. scan S-1]

Fill Volumes
- The kernel uses two-dimensional thread blocks and a one-dimensional grid
  o x (time), y (drift)
  o One block per parent peak
  o Each thread handles multiple points in the volume
- For each volume
  o Find the volume's start mass index in each mobility scan intersecting the volume
    - Using a binary search in the array of mass index blocks (sketched below)
- Then, for each intersecting scan
  o Copy all non-zero intensities to the volume, at the positions determined by the mass indices
  o Stop the copy when the volume's stop mass index is encountered
  o The volume intensity is set to zero at all mass indices not found
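The binary search is a plain lower bound over a scan's sorted mass-index block; the sketch below shows only that piece, with assumed names, and leaves out the surrounding copy loop.

    __device__ int findStartIndex(const int* __restrict__ massIndex, int count,
                                  int startMass)
    {
        int lo = 0, hi = count;                 // search in [lo, hi)
        while (lo < hi) {
            int mid = (lo + hi) >> 1;
            if (massIndex[mid] < startMass) lo = mid + 1;
            else                            hi = mid;
        }
        return lo;                              // first position with massIndex >= startMass
    }

    // The fill kernel then walks forward from this position, copying intensities
    // into the volume until it encounters an index past the volume's stop mass.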

Filter Volumes
- All parent peak volumes in a volume buffer are filtered
  o Prepares the volumes for peak detection
  o Requires additional volume buffers for temporary results (reused)
- A complete filter sequence includes
  o Eight convolution filters along the three axes (time, mass, drift) (tm, ms, df)
  o Two types of filters (smooth and second derivative)
  o Three volume buffers (B1, B2, B3)
[Diagram: three filter sub-sequences routed from Raw Data through the TM, MS, and DF smooth/deriv2 stages using buffers B1, B2, and B3, accumulating into B3]

Filter Volumes
- Filter sequence
  o Three filter sub-sequences (three rows) that are run top to bottom
  o Each sub-sequence has a filter along each axis (from raw data to output)
  o The order of the filters is chrom (time), mass, drift in each sub-sequence
  o Each sub-sequence has only one second derivative filter
  o The final content of buffer B3 is the average of the three sub-sequences
    - B3 is written in the first sub-sequence and accumulated in the other two
  o The minimum number of buffers required is three
[Diagram: the three filter sub-sequences routed through buffers B1, B2, and B3]

Filter Volumes
- Time filter kernel
  o Uses three-dimensional thread blocks and grid as described earlier
  o Launched only once per group of parent peaks
  o Computes both time filters from the raw data volume buffer
    - One input, two outputs
    - Frees buffer B1 for reuse in the next filter steps
  o Filter coefficients are the same as in 3D processing (constant memory)
[Diagram: the filter sub-sequences with the time filter stages highlighted]

Filter Volumes
- Mass filter kernel
  o Uses three-dimensional thread blocks and grid as described earlier
  o Launched three times per group of parent peaks
  o Computes the smooth or the second derivative filter depending on the coefficients passed (one input, one output)
  o Filter coefficients are the same as in 3D processing (a matrix in device memory)
[Diagram: the filter sub-sequences with the mass filter stages highlighted]

Filter Volumes
- Drift filter kernel
  o Uses three-dimensional thread blocks and grid as described earlier
  o Launched three times per group of parent peaks
  o Computes the smooth or the second derivative filter depending on the coefficients passed (one input, one output)
  o Filter coefficients are different for each drift value (a matrix in device memory)
  o The first launch writes to buffer B3 and the other two accumulate on B3
[Diagram: the filter sub-sequences with the drift filter stages highlighted]
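Putting the three kernels together, a host-side driver for one group of parent peaks might look like the sketch below. The kernel names, signatures, and the exact routing through B1, B2, and B3 are assumptions pieced together from these slides (time kernel once with two outputs, mass and drift kernels three times each, the drift filter accumulating into B3 after the first launch); the real code may order the buffers differently.

    struct FilterCoeffs { const float* smooth; const float* deriv2; };

    __global__ void timeFilterKernel (const float* in, float* outDeriv2, float* outSmooth,
                                      FilterCoeffs tm);
    __global__ void massFilterKernel (const float* in, float* out, const float* coeffs);
    __global__ void driftFilterKernel(const float* in, float* out, const float* coeffs,
                                      bool accumulate);

    void filterVolumeGroup(float* B1, float* B2, float* B3,   // group-sized volume buffers
                           FilterCoeffs tm, FilterCoeffs ms, FilterCoeffs df,
                           dim3 grid, dim3 block)
    {
        // One launch computes both time filters from the raw data (held in B1).
        timeFilterKernel<<<grid, block>>>(B1, /*deriv2*/ B3, /*smooth*/ B2, tm);

        // Sub-sequence 1: TM_DERIV2 -> MS_SMOOTH -> DF_SMOOTH, writing B3.
        massFilterKernel <<<grid, block>>>(B3, B1, ms.smooth);
        driftFilterKernel<<<grid, block>>>(B1, B3, df.smooth, /*accumulate=*/false);

        // Sub-sequence 2: TM_SMOOTH -> MS_DERIV2 -> DF_SMOOTH, accumulating into B3.
        massFilterKernel <<<grid, block>>>(B2, B1, ms.deriv2);
        driftFilterKernel<<<grid, block>>>(B1, B3, df.smooth, /*accumulate=*/true);

        // Sub-sequence 3: TM_SMOOTH -> MS_SMOOTH -> DF_DERIV2, accumulating into B3.
        massFilterKernel <<<grid, block>>>(B2, B1, ms.smooth);
        driftFilterKernel<<<grid, block>>>(B1, B3, df.deriv2, /*accumulate=*/true);
        // B3 now holds the three sub-sequences combined; the averaging can be folded
        // into the last launch or into the peak detection threshold.
    }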

Peak Detection
- Finds child peaks in filtered parent peak volumes
  o All volumes in buffer B3
- An apex is found at a point if it is a local maximum
  o Intensity is above a threshold
  o Intensity is greater than that of its 26 neighbor points
- Each detected peak is identified by its global coordinates within the data
- Peak detection array
  o A list of detected peak coordinates
- Other arrays created
  o Intensity value of the lower neighbor in each axis
  o Intensity value of the higher neighbor in each axis
  o Distance from parent to child

Peak Detection
- Peak detection kernel
  o Uses three-dimensional thread blocks and grid as described earlier
  o Launched only once per group of parent peaks
- A counter is incremented for each detected peak
  o Using an atomic operation
  o Determines the location of the peak in the peak detection array

Duplicate Removal
- Remove child peaks detected in more than one volume
  o Parent peak volumes overlap
  o Duplicate child peaks have identical coordinates
  o Keep the one closer to its parent peak
- Thrust is used for this computation (sort and unique), as sketched below
[Figure: parent peaks with overlapping volumes and a duplicate child peak]
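A minimal Thrust sketch of the sort-and-unique step: sorting by coordinates with the parent distance as tiebreaker makes the closest child come first in each run of duplicates, so unique keeps it. The ChildPeak layout and the tiebreak-via-sort approach are assumptions about how the two primitives are combined.

    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/unique.h>

    struct ChildPeak {
        int   t, m, d;        // global time, mass, drift coordinates
        float parentDist;     // distance from this child's parent peak
    };

    struct PeakLess {
        __host__ __device__ bool operator()(const ChildPeak& a, const ChildPeak& b) const {
            if (a.t != b.t) return a.t < b.t;
            if (a.m != b.m) return a.m < b.m;
            if (a.d != b.d) return a.d < b.d;
            return a.parentDist < b.parentDist;   // closest parent sorts first
        }
    };

    struct SameCoords {
        __host__ __device__ bool operator()(const ChildPeak& a, const ChildPeak& b) const {
            return a.t == b.t && a.m == b.m && a.d == b.d;
        }
    };

    void removeDuplicates(thrust::device_vector<ChildPeak>& peaks)
    {
        thrust::sort(peaks.begin(), peaks.end(), PeakLess());
        // unique keeps the first of each run of equal coordinates, i.e. the child
        // peak closest to its parent after the sort above.
        auto newEnd = thrust::unique(peaks.begin(), peaks.end(), SameCoords());
        peaks.erase(newEnd, peaks.end());
    }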

Peak Properties Computation
- Similar to the peak properties computation in 3D Processing
  o A set of properties for each detected child peak in each scan pack
  o Intensity and exact time, mass, and drift of the apex
- Peak properties are saved in device memory arrays
  o One array per peak property type
  o Arrays may be saved in host memory if the number of child peaks is excessive
    - Millions of child peaks are common
- One kernel computes the peak properties
  o Uses one-dimensional thread blocks and grid
  o Each thread computes properties for one peak
  o Uses the peak detection array and the additional arrays saved during peak detection

Results and Conclusions

Results
- Application support is limited to Tesla C2050, C2070, and C2075
  o Requirement for large device memory and TCC mode
- Speedup varies for each processing step
  o Double precision is used only when needed
  o Filters in 4D processing dominate the computation time
  o Overall speedup is limited by data I/O
- Speedup also varies for different types of LC/IMS/MS data and data sizes
- Comparison against a Xeon X5650 2.67 GHz CPU (single threaded), with ECC mode off in the GPU

Average speedup:
Data filters                40x
Peak detection/properties   15x
Overall                      8x

Conclusions
- The GPU is enabling in LC/IMS/MS Proteomics applications
  o Unacceptably long processing times in the CPU for many users
  o Also enables real-time processing during data acquisition
    - Embed the GPU inside or near the instrument
  o The GPU will be more important in the future as LC/IMS/MS data size increases
- Future work
  o Port to the GPU some processing steps still done in the CPU
  o Performance optimizations
  o Add support for features required by new Mass Spectrometry instruments
- Acknowledgements: Marc Gorenstein, Dan Golick

Questions? jose.de.corral@waters.com