SEASHORE / SARUMAN. Short Read Matching using GPU Programming. Tobias Jakobi

Similar documents
BIOINFORMATICS ORIGINAL PAPER

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS

Finite Element Integration and Assembly on Modern Multi and Many-core Processors

CS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST

GPU Cluster Computing. Advanced Computing Center for Research and Education

General Purpose GPU Computing in Partial Wave Analysis

Adaptive-Mesh-Refinement Hydrodynamic GPU Computation in Astrophysics

GPUBwa -Parallelization of Burrows Wheeler Aligner using Graphical Processing Units

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

The rcuda middleware and applications

Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs

NVIDIA s Compute Unified Device Architecture (CUDA)

NVIDIA s Compute Unified Device Architecture (CUDA)

Comparison of High-Speed Ray Casting on GPU

Comparative Analysis of Protein Alignment Algorithms in Parallel environment using CUDA

INTRODUCING NVBIO: HIGH PERFORMANCE PRIMITIVES FOR COMPUTATIONAL GENOMICS. Jonathan Cohen, NVIDIA Nuno Subtil, NVIDIA Jacopo Pantaleoni, NVIDIA

High Performance Computing on GPUs using NVIDIA CUDA

Introduction to GPU hardware and to CUDA

Chapter 04. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

arxiv: v1 [physics.comp-ph] 4 Nov 2013

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni

FMM implementation on CPU and GPU. Nail A. Gumerov (Lecture for CMSC 828E)

JCudaMP: OpenMP/Java on CUDA

Performance analysis of parallel de novo genome assembly in shared memory system

Genome 373: Mapping Short Sequence Reads I. Doug Fowler

Tesla Architecture, CUDA and Optimization Strategies

Introduction to Multicore Programming

Mathematical computations with GPUs

Facial Recognition Using Neural Networks over GPGPU

high performance medical reconstruction using stream programming paradigms

GPU Programming. Lecture 1: Introduction. Miaoqing Huang University of Arkansas 1 / 27

Particle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA

GPGPU. Peter Laurens 1st-year PhD Student, NSC

A GPU Algorithm for Comparing Nucleotide Histograms

Performance potential for simulating spin models on GPU

Tesla GPU Computing A Revolution in High Performance Computing

N-Body Simulation using CUDA. CSE 633 Fall 2010 Project by Suraj Alungal Balchand Advisor: Dr. Russ Miller State University of New York at Buffalo

Parallelizing FPGA Technology Mapping using GPUs. Doris Chen Deshanand Singh Aug 31 st, 2010

Introduction to CUDA

HPC Enabling R&D at Philip Morris International

GPGPU, 4th Meeting Mordechai Butrashvily, CEO GASS Company for Advanced Supercomputing Solutions

First Swedish Workshop on Multi-Core Computing MCC 2008 Ronneby: On Sorting and Load Balancing on Graphics Processors

Bioinformatics Services for HT Sequencing

GPU ACCELERATED DATABASE MANAGEMENT SYSTEMS

Accelerating image registration on GPUs

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli

When MPPDB Meets GPU:


GPU Accelerated API for Alignment of Genomics Sequencing Data

Hybrid Implementation of 3D Kirchhoff Migration

QR Decomposition on GPUs

CUDA Programming Model

Dense matching GPU implementation

CUDA. GPU Computing. K. Cooper 1. 1 Department of Mathematics. Washington State University

Using Virtual Texturing to Handle Massive Texture Data

Using MPI One-sided Communication to Accelerate Bioinformatics Applications

UNIVERSITY OF OSLO. Department of informatics. Parallel alignment of short sequence reads on graphics processors. Master thesis. Bjørnar Andreas Ruud

Fundamental CUDA Optimization. NVIDIA Corporation

Speedup Altair RADIOSS Solvers Using NVIDIA GPU

called Hadoop Distribution file System (HDFS). HDFS is designed to run on clusters of commodity hardware and is capable of handling large files. A fil

The Dell Precision T3620 tower as a Smart Client leveraging GPU hardware acceleration

Fast Snippet Generation. Hybrid System

GPGPU, 1st Meeting Mordechai Butrashvily, CEO GASS

Accelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin

DIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka

Using GPUs to compute the multilevel summation of electrostatic forces

The Fusion Distributed File System

CMPE 665:Multiple Processor Systems CUDA-AWARE MPI VIGNESH GOVINDARAJULU KOTHANDAPANI RANJITH MURUGESAN

Accelerated Machine Learning Algorithms in Python

X10 specific Optimization of CPU GPU Data transfer with Pinned Memory Management

Numerical Simulation on the GPU

Conquering Massive Clinical Models with GPU. GPU Parallelized Logistic Regression

POST-SIEVING ON GPUs

Solving Dense Linear Systems on Graphics Processors

Stream Processing with CUDA TM A Case Study Using Gamebryo's Floodgate Technology

High Performance Computing with Accelerators

G-NET: Effective GPU Sharing In NFV Systems

WHITE PAPER SINGLE & MULTI CORE PERFORMANCE OF AN ERASURE CODING WORKLOAD ON AMD EPYC

GPU 101. Mike Bailey. Oregon State University. Oregon State University. Computer Graphics gpu101.pptx. mjb April 23, 2017

GPU 101. Mike Bailey. Oregon State University

Scientific discovery, analysis and prediction made possible through high performance computing.

Introduction to Parallel Computing with CUDA. Oswald Haan

Mapping NGS reads for genomics studies

GPU > CPU. FOR HIGH PERFORMANCE COMPUTING PRESENTATION BY - SADIQ PASHA CHETHANA DILIP

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators

Tutorial on Memory Management, Deadlock and Operating System Types

Lecture 1: Introduction and Computational Thinking

EE 7722 GPU Microarchitecture. Offered by: Prerequisites By Topic: Text EE 7722 GPU Microarchitecture. URL:

Accelerating String Matching Using Multi-threaded Algorithm

n N c CIni.o ewsrg.au

CUDA and OpenCL Implementations of 3D CT Reconstruction for Biomedical Imaging

Getting Started with MATLAB Francesca Perino

PFAC Library: GPU-Based String Matching Algorithm

SolexaLIMS: A Laboratory Information Management System for the Solexa Sequencing Platform

GSNAP: Fast and SNP-tolerant detection of complex variants and splicing in short reads by Thomas D. Wu and Serban Nacu

Keywords -Bioinformatics, sequence alignment, Smith- waterman (SW) algorithm, GPU, CUDA

High-performance short sequence alignment with GPU acceleration

Large-scale Deep Unsupervised Learning using Graphics Processors

Transcription:

SEASHORE SARUMAN Summary 1 / 24 SEASHORE / SARUMAN Short Read Matching using GPU Programming Tobias Jakobi Center for Biotechnology (CeBiTec) Bioinformatics Resource Facility (BRF) Bielefeld University 28.04.2010

2 / 24 Outline SEASHORE SARUMAN Summary 1 Read Filter Algorithm - SEASHORE 2 GPU implementation - SARUMAN 3 Summary

3 / 24 SEASHORE SARUMAN Summary Short Read Matching New sequencing technologies with short reads Solexa, SOLiD Reads of length ~35bp Solexa up to 100bp by now Mainly used for resequencing Matching reads to a reference genome

4 / 24 History SEASHORE SARUMAN Summary History Algorithm Summary Bowtie, MAQ, BWA, SOAP2 not yet available Normal alignment algorithms not feasable BLAST: Does not work for short reads SSAHA, agrep: No mismatch positions or alignments returned ELAND: Limited to 32bp read length SWIFT: Showed errors for 25bp reads Decision to implement own alignment algorithm

5 / 24 SEASHORE SEASHORE SARUMAN Summary History Algorithm Summary SEmiglobal Alignment of SHOrt REads Developed by Jochen Blom Create qgram index of reference genome Use index to estimate auspicious alignment positions Modified Needleman-Wunsch alignment Parallel calculation on the compute cluster

6 / 24 SEASHORE SARUMAN Summary History Algorithm Summary Read segmentation Figure: The two sets of segments S i and k i. The two sets are shifted by the distance of R.

7 / 24 Filter strategy SEASHORE SARUMAN Summary History Algorithm Summary Figure: Try to match all segments of k iteratively until a match is found or k c is reached.

8 / 24 Stats SEASHORE SARUMAN Summary History Algorithm Summary Algorithm is exact All possible hits are found Reasonable fast Perl implementation 7 mio. reads mapped to bacterial genome 1h on 100 CPUs

9 / 24 CUDA SEASHORE SARUMAN Summary History Algorithm Summary Time consuming step: Alignment Alignments are small (m (m + 2e)) Can be computed parallel Huge amount of small jobs CUDA

10 / 24 SEASHORE SARUMAN Summary Introduction to CUDA Design of SARUMAN CUDA - Compute Unified Device Architecture GPU computing power today exceeds 1 TFlops (GTX480, 1,35 TFlops SP / 168 GFlops DP, 1536MB) Reasonable prices for those cards: 500 EUR Even mainstream cards of normal family PCs sufficent Use this power to solve computationally intensive problems NVIDIA released CUDA API in 2006 C for CUDA (C with NVIDIA extensions) Multi platform (32/64Bit): Windows, Linux, Mac OS X

11 / 24 SEASHORE SARUMAN Summary Introduction to CUDA Design of SARUMAN Computing power of recent graphics cards

12 / 24 CUDA facts SEASHORE SARUMAN Summary Introduction to CUDA Design of SARUMAN Many core architecture of GPUs Optimized hardware for processing massive amounts of data in parallel Up to 512 (stream-)processors per card (varying) Grouped to multiprocessors with shared memory (8-16kb) Processor features Full support for integer and bitwise operations Double precision operations Thousands of threads per core

13 / 24 SEASHORE SARUMAN Summary Introduction to CUDA Design of SARUMAN How to use CUDA? Requirement: Problem should be dividable ( divide & conquer ) each thread solves a small part of the problem Kernel: C-function ( device code ) executed on the GPU hundreds of times in parallel with different data Called from the host code e.g. C/C++ Should be quite simple and relatively easy to compute

SEASHORE SARUMAN Summary Introduction to CUDA Design of SARUMAN 14 / 24 General CUDA processing flow 1 Copy input data from host memory to GPU memory 2 One or more kernels are executed on the input data 3 Copy output data back to the host memory

15 / 24 SEASHORE SARUMAN Summary Introduction to CUDA Design of SARUMAN SARUMAN runs in two phases SARUMAN Semiglobal Alignment of Short Reads using CUDA and Needleman-Wunsch Phase one: executed on the host Reads data files (reference genome & reads) Creates q-gram index Runs the SEASHORE matching algorithm Does I/O operations Phase two: executed on the GPU Computes the edit-distance of genome and read Does backtracking and stores the resulting alignment

16 / 24 SEASHORE SARUMAN Summary Introduction to CUDA Design of SARUMAN SARUMAN: program flow Figure: Overview of SARUMAN s program flow

SEASHORE SARUMAN Summary Introduction to CUDA Design of SARUMAN 17 / 24 The q-gram index Implementation: Normal mode: complete genome read at once Creating index using hashtable For larger genomes: reference genome dividable into several chunks

18 / 24 SEASHORE SEASHORE SARUMAN Summary Introduction to CUDA Design of SARUMAN SEASHORE runs on the CPU Perfect matches are pre filtered Possible hits are collected and saved in memory Batch alignment starts if a defined number of hits is reached

19 / 24 Alignment phase SEASHORE SARUMAN Summary Introduction to CUDA Design of SARUMAN A large number (e.g. 100.000) of hits is prepared for aligning Corresponding genome and read sequences are stored in auxilary data structures Memory is allocated on the card for sequences, alignments and scores Sequences and parameters are copied to the GPU GPU aligns ten-thousands of sequences in parallel

20 / 24 Alignment phase SEASHORE SARUMAN Summary Introduction to CUDA Design of SARUMAN During alignment on the GPU the host already collects new hits If all alignments are done: data is returned to the host Maximal number of hits processable in parallel depends on type of card and VRAM

SEASHORE SARUMAN Summary Introduction to CUDA Design of SARUMAN 21 / 24 Results Comparison: Perl implementation <=> SARUMAN Bacterial reference genome, 3.65 MB Over 6 million Solexa reads System: Intel E8400, 3GHz, 8GB RAM, NVidia GTX280 Perl implementation: 79 minutes SARUMAN prototype: 2,3 minutes

SEASHORE SARUMAN Summary Introduction to CUDA Design of SARUMAN 22 / 24 Results Comparison: Perl implementation <=> SARUMAN Perl implementation: 79 minutes SARUMAN prototype: 2,3 minutes over 30 speedup, reduced memory footprint

23 / 24 Summary SEASHORE SARUMAN Summary SEASHORE: exact matching algorithm SARUMAN: very fast short read mapping tool All parameters (read length, e, alignment cost) variable Complete alignments returned CUDA implementation extremely fast

SEASHORE SARUMAN Summary 24 / 24 Thank you for your attention Questions?