Accelerating InDel Detection on Modern Multi-Core SIMD CPU Architecture


Accelerating InDel Detection on Modern Multi-Core SIMD CPU Architecture
Da Zhang
Collaborators: Hao Wang, Kaixi Hou, Jing Zhang
Advisor: Wu-chun Feng

Evolution of Genome Sequencing
In 2003: sequencing 1 human genome took $3.4 billion, 13 years, and hundreds of international universities and research centers.
In 2014: $1,000 per genome; 45 human genomes in a single day.
DNA sequencing: determining the precise order of nucleotides within a DNA molecule. The Human Genome Project (HGP) used first-generation sequencing technology and was completed in 2003. The HiSeq X Ten was released by Illumina Inc. in 2014.

Next Generation Sequencing (NGS)
Next Generation Sequencing (NGS): millions of short-read sequences are processed by tens, hundreds, or thousands of processing units (PUs) in a massively parallel fashion.
However, much software built on NGS techniques does not fully exploit the computational power of modern hardware (GPUs, APUs, MIC, FPGAs), due to a lack of expertise in their architectures!

Indels and Their Importance
What is an indel? The biology term for an Insertion or Deletion.
Insertion: before ACTG TCAGTC, after ACTGCATCAGTC.
Deletion: before ACTGCATCAGTC, after ACTG TCAGTC.
Why are they so important? Indels are the second most common type of polymorphism(1) in the human genome; they affect human traits and cause severe human diseases. In NGS, indel detection is recommended as a post-alignment step in order to reduce the artifacts generated during sequence alignment.
1. Polymorphism: a DNA variation that is common in a population; the cut-off point is 1%.

Indel Detection and Dindel
Tools: Dindel, VarScan, SAMtools mpileup, the Genome Analysis Toolkit (GATK), SOAPindel, and so forth.
Dindel(1): commonly used software that uses a Bayesian realignment model for calling small indels (<50 nucleotides) from short-read data.
Advantage: the highest sensitivity for calling small, low-coverage indels from short reads.
Disadvantage: time consumption.
Motivation & Opportunity: Dindel is an unoptimized sequential C++ program that is hindered from fully exploiting the computational power of today's high-performance architectures!
1. Proposed by C. A. Albers et al. in 2010.

Contribution 1. Analyze Dindel and identify the performance bottlenecks. 2. Accelerate Dindel with Thread-Level Parallelism (TLP) and Data-Level Parallelism (DLP). 3. Achieve up to 37-fold speedup for the parallelized code region and 9-fold speedup for the total execution time over the original Dindel on a modern multi-core processor.

Dindel: Performance Bottlenecks
Pipeline: Identify Reads in Windows, Generate Candidate Haplotypes, Calculate Read-to-Haplotype Likelihoods, Estimate Haplotype Frequencies and Prior Probabilities, Estimate Quality Scores of Candidate Indels and Other Variants.
The likelihood calculation accounts for more than 90% of the original Dindel's total execution time: over 90% of the computation is spent processing independent tasks sequentially! Most of this computation (>66%) exhibits regular memory access patterns, with some irregular patterns.

Dindel: Sequential Processing & Memory Access Patterns
[Figure: n reads are processed against m candidate haplotypes one pair at a time. Per-loop access over the read bases (A T T A C) and haplotype bases (A C G A T G C T) in Loop1-Loop5 of the base-pair probability computation is mostly regular; a conditional branch makes some accesses irregular.]
1. In the processing figure, green balls denote independent tasks and arrows denote work flow.
2. In the memory access figures, arrows denote dependency: A -> B means updating the value of A depends on the value of B.

Design & Optimization Overview
Data-Level Parallelism (DLP): auto-vectorization, manual vectorization, and a mixture of the two.
Thread-Level Parallelism (TLP): OpenMP.
Mixture of DLP and TLP.

Data-Level Parallelism: Vectorization
Vectorization (SIMDization): preparing a program to run on a SIMD processor.
Auto-: by the compiler, e.g. GCC/G++, ICC/ICPC.
Manual-: by the programmer, using low-level intrinsics.
Traditional SISD processor: one scalar operation per instruction, e.g. 1 + 5 = 6.
SIMD processor: load contiguous data into vector registers, perform one vector operation, and store the result back to memory, e.g. [1,2,3,4] + [5,6,7,8] = [6,8,10,12].(1)
We adopt a mixed solution (auto- & manual-) for best performance in practice!
1. Assumes processing 32-bit integers with a 128-bit SSE register. Intel's AVX-512 now supports 512-bit registers that can handle 16 32-bit integers or 8 64-bit doubles.

Thread-Level Parallelism
The likelihood calculation over n reads and m candidate haplotypes(1) is multi-threaded.
We use OpenMP instead of the traditional Pthreads solution.
OpenMP: a framework for developing parallel applications in C, C++, and Fortran.
Advantages: programmability, portability, and the same level of performance as Pthreads.
Each thread processes one read-haplotype task, for load-balancing reasons.
1. The most time-consuming region of Dindel, covering more than 90% of the sequential Dindel's total execution time.

Dindel: Before & After Optimization
Pipeline: Identify Reads in Windows, Generate Candidate Haplotypes, Calculate Read-to-Haplotype Likelihoods, Estimate Haplotype Frequencies and Prior Probabilities, Estimate Quality Scores of Candidate Indels and Other Variants.
After optimization, the likelihood-calculation stage is parallelized with TLP and DLP; the rest of the pipeline is unchanged.

Evaluation: Setup
CPUs: two Intel(R) Xeon(R) E5-2697 v2 @ 2.70 GHz with AVX and Hyper-Threading (24 physical cores, 48 logical cores).
Compiler: G++ 4.8.2 with -O3 and -mavx.
Three data sets:

Data set | Original Dindel runtime (s) | Likelihood-calculation runtime coverage | Eight heavy-duty loops runtime coverage
file1    | 321                         | 92.79%                                  | 69.56%
file2    | 757                         | 92.52%                                  | 69.43%
file3    | 3480                        | 92.68%                                  | 69.58%

Performance Evaluation
With 48 threads and vectorization, we achieve up to 9-fold speedup for total execution time and up to 37-fold speedup for the calculation of read-haplotype likelihoods. Hyper-Threading, which hides memory access latency, further improves performance when the thread count increases from 24 to 48.
[Charts: runtime in seconds for file1, file2, and file3, comparing original, vec, omp24, omp48, omp24+vec, and omp48+vec, for the parallelized region and for the whole program. For the largest data set, the parallelized region drops from 3225.60 s (original) to 85.07 s (omp48+vec), and overall runtime drops from 3480.29 s to 363.46 s.]
vec: vectorization. omp#: optimization with # threads.

Recall of Contribution 1. Analyze Dindel and identify the performance bottlenecks. 2. Accelerate Dindel with Thread-Level Parallelism (TLP) and Data-Level Parallelism (DLP). 3. Achieve up to 37-fold speedup for the parallelized code region and 9-fold speedup for the total execution time over the original Dindel on a modern multi-core processor.

BACKUP

Nucleic Acid Sequence
The three most important molecules for all forms of life: DNA, RNA, and protein.
Nucleotide: a biological molecule; the building block of DNA and RNA.
Nucleic acid sequence: a succession of letters indicating the order of nucleotides within a DNA or RNA molecule.
Short read: a read that is shorter (e.g. 50-150 bp) than those of earlier mainstream technologies (1000 bp), but provides high-quality, deep coverage of genomes.

Indel Detection and Dindel
Indel detection algorithms:
Local realignment of gapped reads to the reference genome or an alternative candidate haplotype.(1)
Local de novo assembly of the reads aligned around the target region, followed by construction of a consensus sequence for indel discovery.
Tools: Dindel, VarScan, SAMtools mpileup, the Genome Analysis Toolkit (GATK), SOAPindel, and so forth.
Dindel(2): commonly used software that uses a Bayesian realignment model for calling small indels (<50 nucleotides) from short-read data.
Advantage: the highest sensitivity for calling small, low-coverage indels from short reads.
Disadvantage: time consumption.
Motivation & Opportunity: Dindel is an unoptimized sequential C++ program that is hindered from fully exploiting the computational power of today's high-performance architectures!
1. A haplotype is a set of DNA variations, or polymorphisms, that tend to be inherited together on the same chromosome.
2. Proposed by C. A. Albers et al. in 2010.

Dindel: Probabilistic Realignment Model
b_0: anchor point.
X_ib: the position in the haplotype that read base b aligns to.
I_ib: in {0,1}, indicates whether read base b is part of an insertion w.r.t. the haplotype.
The alignment types of read bases w.r.t. the haplotype:
A read base aligns to a haplotype base.
A read base aligns to the left or right of the haplotype.
A read base is part of an insertion.

Vectorization Opportunities in Dindel
Eight heavy-duty loops are not vectorized by the compiler (G++ 4.8.2)!
Reasons a loop is not vectorizable:
1. The number of iterations cannot be computed in advance: Dindel uses a class data member as the loop bound, and the vectorizer cannot recognize the class data member.
2. Divergent control flow: conditional branches prevent vectorization.
3. Noncontiguous memory access: two-level nested loops result in noncontiguous memory access.

Loop  | Reasons
Loop1 | 1, 2, 3
Loop2 | 1, 2
Loop3 | 1
Loop4 | 1, 2
Loop5 | 1, 2, 3
Loop6 | 1, 2
Loop7 | 1
Loop8 | 1, 2

Auto-Vectorization: Loop Restructure 1
Loop restructure 1 ("number of iterations cannot be computed in advance"): replace the class data members used as loop boundaries with local variables for each loop.

Auto-Vectorization: Loop Restructure 2
Loop restructure 2 ("divergent control flow"):
Use the ternary operator (also called the conditional operator, i.e. e1 ? e2 : e3) instead of an if...else... condition.
Or split the loop into two parts according to the condition, avoiding the if...else... altogether.
For combinations of multiple conditions, use bitwise operations.

Auto-Vectorization: Loop Restructure 3
Loop restructure 3 ("noncontiguous memory access"): swap the inner and outer loops to make the memory access contiguous, and split the new outer loop to avoid control flow.
[Figure: the j loop over haplotype bases and the i loop over read bases are interchanged so that the inner loop walks memory contiguously.]

Better Performance! Manual Vectorization
Why? The compiler is conservative in optimization, subject to many restrictions. With knowledge of the program, the programmer is more likely to find vectorization opportunities.
Intrinsics: assembly-coded functions that allow using C++ function calls and variables in place of assembly instructions, and that provide access to instructions that cannot be generated using the standard constructs of the C and C++ languages. They are expanded inline to eliminate function-call overhead, giving the same benefit as inline assembly with better readability.
Choosing intrinsics for Dindel: 128-bit SSE registers for 32-bit integer variables and 256-bit AVX registers for 64-bit double variables.
Handling conditional branches:
1. Using a mask: the general approach to handling branches in manual vectorization.
2. Avoiding the condition by splitting the loop: each resulting loop is condition-free, but this is only applicable to some conditions.
We choose option 2 whenever it is applicable.

Manual Vectorization: Pseudocode
[Figure: pseudocode for the two branch-handling strategies, splitting the loop and using a mask.]

Auto- vs. Manual Vectorization
Auto- wins when the compiler can further improve performance by applying other optimizations (software prefetch, loop unrolling, etc.); the compiler cannot apply these optimizations to intrinsics code.
Manual- wins on computation-intensive operations, where the programmer, with knowledge of the program, can vectorize more confidently and aggressively.
[Chart: per-loop speedup over the original for auto_vec and manual_vec on loop1-loop8; auto- beats manual vectorization on some loops and vice versa.]
For each loop, we choose whichever of auto- or manual vectorization beats the other!

Thread Granularities
For every incoming task over n reads and m candidate haplotypes:
1read_1hap: each thread processes one read-haplotype pair.
nread_1hap: each thread processes the pairs of all reads against one haplotype.
1read_mhap: each thread processes the pairs of one read against all haplotypes.

Performance of Different Granularities
Up to 23-fold speedup using 1read_1hap, primarily because the coarser granularities suffer from workload imbalance.
[Chart: runtime in seconds with 24 threads(1) for file1, file2, and file3. For the largest data set: original 3225.60 s, 1read_mhap 203.31 s, nread_1hap 185.08 s, 1read_1hap 137.04 s.]
1. 24 threads with different granularities.