AAlign: A SIMD Framework for Pairwise Sequence Alignment on x86-based Multi- and Many-core Processors

Similar documents
Automatic SIMDization of Parallel Sorting on x86-based Manycore Processors

Accelerating Genomic Sequence Alignment Workload with Scalable Vector Architecture

Dynamic SIMD Scheduling

Accelerating InDel Detection on Modern Multi-Core SIMD CPU Architecture

Architecture without explicit locks for logic simulation on SIMD machines

Native Computing and Optimization. Hang Liu December 4 th, 2013

Fast Sequence Alignment Method Using CUDA-enabled GPU

Evaluation of Intel Xeon Phi "Knights Corner": Opportunities and Shortcomings

Intel Advisor XE Future Release Threading Design & Prototyping Vectorization Assistant

Highly Efficient Compensationbased Parallelism for Wavefront Loops on GPUs

Improving CUDASW++, a Parallelization of Smith-Waterman for CUDA Enabled Devices

SIMD Exploitation in (JIT) Compilers

A Framework for the Automatic Vectorization of Parallel Sort on x86-based Processors

ASPaS: A Framework for Automatic SIMDization of Parallel Sorting on x86-based Many-core Processors

Introduction to the Xeon Phi programming model. Fabio AFFINITO, CINECA

A Design of a Hybrid System for DNA Sequence Alignment

GPU Accelerated API for Alignment of Genomics Sequencing Data

Efficient and Simplified Parallel Graph Processing over CPU and MIC

Biological Sequence Comparison on Hybrid Platforms with Dynamic Workload Adjustment

Comparative Analysis of Protein Alignment Algorithms in Parallel environment using CUDA

Protein Alignment on the Intel Xeon Phi Coprocessor

Efficient Smith-Waterman on multi-core with FastFlow

Boundary element quadrature schemes for multi- and many-core architectures

Accelerating Smith-Waterman Alignment workload with Scalable Vector Computing

Operating System Support for Shared-ISA Asymmetric Multi-core Architectures

Accelerator Programming Lecture 1

Fast Segmented Sort on GPUs

When MPPDB Meets GPU:

Conflict Detection-based Run-Length Encoding AVX-512 CD Instruction Set in Action

OpenCL Vectorising Features. Andreas Beckmann

COSC 6385 Computer Architecture - Data Level Parallelism (III) The Intel Larrabee, Intel Xeon Phi and IBM Cell processors

Intel Knights Landing Hardware

Biology 644: Bioinformatics

Introduction to CUDA Programming

Keywords -Bioinformatics, sequence alignment, Smith- waterman (SW) algorithm, GPU, CUDA

Parallelization of Shortest Path Graph Kernels on Multi-Core CPUs and GPU

Optimization of Lattice QCD with CG and multi-shift CG on Intel Xeon Phi Coprocessor

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it

High Performance Computing: Tools and Applications

THE Smith-Waterman (SW) algorithm [1] is a wellknown

Tools for Intel Xeon Phi: VTune & Advisor Dr. Fabio Baruffa - LRZ,

Parallel Systems. Project topics

Parallel Exact Inference on the Cell Broadband Engine Processor

Kevin O Leary, Intel Technical Consulting Engineer

On Parallel Evaluation of Matrix-Based Dynamic Programming Algorithms

Dan Stafford, Justine Bonnot

Growth in Cores - A well rehearsed story

Intel MIC Programming Workshop, Hardware Overview & Native Execution LRZ,

High Performance Computing: Tools and Applications

OPEN MP-BASED PARALLEL AND SCALABLE GENETIC SEQUENCE ALIGNMENT

GPU-Supercomputer Acceleration of Pattern Matching

Outline. Motivation Parallel k-means Clustering Intel Computing Architectures Baseline Performance Performance Optimizations Future Trends

Advanced OpenMP Features

Accelerating Octo-Tiger: Stellar Mergers on Intel Knights Landing with HPX

Efficient streaming applications on multi-core with FastFlow: the biosequence alignment test-bed

High Performance Technique for Database Applications Using a Hybrid GPU/CPU Platform

Advanced programming with OpenMP. Libor Bukata a Jan Dvořák

Alignment of Long Sequences

PORTING CP2K TO THE INTEL XEON PHI. ARCHER Technical Forum, Wed 30 th July Iain Bethune

Accelerating Smith-Waterman on Kepler. Erich Elsen. Vishal Vaidyanathan

Multi-threaded Queries. Intra-Query Parallelism in LLVM

Multicore Performance and Tools. Part 1: Topology, affinity, clock speed

Intel MIC Programming Workshop, Hardware Overview & Native Execution. IT4Innovations, Ostrava,

Intel Xeon Phi Coprocessors

Overview Implicit Vectorisation Explicit Vectorisation Data Alignment Summary. Vectorisation. James Briggs. 1 COSMOS DiRAC.

Harnessing Associative Computing for Sequence Alignment with Parallel Accelerators

rabbit.engr.oregonstate.edu What is rabbit?

HPC code modernization with Intel development tools

Masterpraktikum Scientific Computing

Accelerating the Smith-Waterman algorithm with interpair pruning and band optimization for the all-pairs comparison of base sequences

Programming Models for Multi- Threading. Brian Marshall, Advanced Research Computing

Scientific Computing with Intel Xeon Phi Coprocessors

Finite Element Integration and Assembly on Modern Multi and Many-core Processors

Multithreaded Architectures and The Sort Benchmark. Phil Garcia Hank Korth Dept. of Computer Science and Engineering Lehigh University

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes

Manycore Processors. Manycore Chip: A chip having many small CPUs, typically statically scheduled and 2-way superscalar or scalar.

Performance potential for simulating spin models on GPU

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013

Introduction to Intel Xeon Phi programming techniques. Fabio Affinito Vittorio Ruggiero

Bring your application to a new era:

Sequence Alignment with GPU: Performance and Design Challenges

PROGRAMOVÁNÍ V C++ CVIČENÍ. Michal Brabec

Compiling for Scalable Computing Systems the Merit of SIMD. Ayal Zaks Intel Corporation Acknowledgements: too many to list

A GPU Algorithm for Comparing Nucleotide Histograms

Matheus Serpa, Eduardo Cruz and Philippe Navaux

Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins

Vincent C. Betro, R. Glenn Brook, & Ryan C. Hulguin XSEDE Xtreme Scaling Workshop Chicago, IL July 15-16, 2012

On the Efficacy of Haskell for High Performance Computational Biology

VECTORIZING RECOMPRESSION IN COLUMN-BASED IN-MEMORY DATABASE SYSTEMS

HPC trends (Myths about) accelerator cards & more. June 24, Martin Schreiber,

GPU Architecture. Alan Gray EPCC The University of Edinburgh

Intel Architecture for HPC

Data Blocks: Hybrid OLTP and OLAP on Compressed Storage using both Vectorization and Compilation

CS4961 Parallel Programming. Lecture 14: Reasoning about Performance 10/07/2010. Administrative: What s Coming. Mary Hall October 7, 2010

ASW: Accelerating Smith-Waterman Algorithm on Coupled CPU-GPU Architecture

Technical Report. Research Lab: LERIA

Preliminary Performance Evaluation of Application Kernels using ARM SVE with Multiple Vector Lengths

HPC-BLAST Scalable Sequence Analysis for the Intel Many Integrated Core Future

What s New August 2015

Piecewise Holistic Autotuning of Compiler and Runtime Parameters

Transcription:

AAlign: A SIMD Framework for Pairwise Sequence Alignment on x86-based Multi- and Many-core Processors Kaixi Hou, Hao Wang, Wu-chun Feng {kaixihou,hwang121,wfeng}@vt.edu

Pairwise Sequence Alignment Algorithms Essential computational kernels in bioinformatics apps Quantify similarity between pairs of sequences Different types of algorithms Local alignment, e.g., Smith-Waterman Global alignment, e.g., Needleman-Wunsch Different gap systems Constant, linear, affine gap, etc. 2

Pairwise Sequence Alignment Algorithms Essential computational kernels in bioinformatics apps Quantify similarity between pairs of sequences Different types of algorithms Local alignment, e.g., Smith-Waterman Global alignment, e.g., Needleman-Wunsch Different gap systems These algorithms and gap systems ALL Constant, linear, Subject affine S n gap, etc. follow the same computational pattern. D i-1,j-1 U i-1,j-1 D i,j-1 U i,j-1 L i-1,j-1 T i-1,j-1 Diagonal Dep. L i,j-1 T i,j-1 Vertical Dep. D i-1,j U i-1,j D i,j U i,j L i-1,j T i-1,j L i,j T i,j 2 Horizontal Dep.

Pairwise Sequence Alignment Algorithms Different Vectorization Strategies Two popular strategies: iterate & scan methods Subject S n Subject S n 4 * M. Farrar, Striped Smith-Waterman Speeds Database Searches, Bioinformatics, 2007. ** A. Khajeh-Saeed et al., Acceleration of the Smith- Waterman Algorithm, J. Comp. Phys., 2010.

Pairwise Sequence Alignment Algorithms Different Vectorization Strategies Two popular strategies: iterate & scan methods Iterate*: use a certain number of iterations to validate results Subject S n Subject S n 1. Preprocess 4 * M. Farrar, Striped Smith-Waterman Speeds Database Searches, Bioinformatics, 2007. ** A. Khajeh-Saeed et al., Acceleration of the Smith- Waterman Algorithm, J. Comp. Phys., 2010.

Pairwise Sequence Alignment Algorithms Different Vectorization Strategies Two popular strategies: iterate & scan methods Iterate*: use a certain number of iterations to validate results Subject S n Subject S n 1. Preprocess 2. Check 4 * M. Farrar, Striped Smith-Waterman Speeds Database Searches, Bioinformatics, 2007. ** A. Khajeh-Saeed et al., Acceleration of the Smith- Waterman Algorithm, J. Comp. Phys., 2010.

Pairwise Sequence Alignment Algorithms Different Vectorization Strategies Two popular strategies: iterate & scan methods Iterate*: use a certain number of iterations to validate results Subject S n Subject S n 1. Preprocess 2. Check 3. Correct 4 * M. Farrar, Striped Smith-Waterman Speeds Database Searches, Bioinformatics, 2007. ** A. Khajeh-Saeed et al., Acceleration of the Smith- Waterman Algorithm, J. Comp. Phys., 2010.

Pairwise Sequence Alignment Algorithms Different Vectorization Strategies Two popular strategies: iterate & scan methods Iterate*: use a certain number of iterations to validate results Subject S n Subject S n 1. Preprocess 2. Check 3. Correct 4 * M. Farrar, Striped Smith-Waterman Speeds Database Searches, Bioinformatics, 2007. ** A. Khajeh-Saeed et al., Acceleration of the Smith- Waterman Algorithm, J. Comp. Phys., 2010.

Pairwise Sequence Alignment Algorithms Different Vectorization Strategies Two popular strategies: iterate & scan methods Iterate*: use a certain number of iterations to validate results Subject S n Subject S n 1. Preprocess 2. Check 3. Correct 4 * M. Farrar, Striped Smith-Waterman Speeds Database Searches, Bioinformatics, 2007. ** A. Khajeh-Saeed et al., Acceleration of the Smith- Waterman Algorithm, J. Comp. Phys., 2010.

Pairwise Sequence Alignment Algorithms Different Vectorization Strategies Two popular strategies: iterate & scan methods Iterate*: use a certain number of iterations to validate results Scan**: use a round of scan to validate results Subject S n Subject S n 1. Preprocess 2. Check 3. Correct 1. Preprocess 4 * M. Farrar, Striped Smith-Waterman Speeds Database Searches, Bioinformatics, 2007. ** A. Khajeh-Saeed et al., Acceleration of the Smith- Waterman Algorithm, J. Comp. Phys., 2010.

Pairwise Sequence Alignment Algorithms Different Vectorization Strategies Two popular strategies: iterate & scan methods Iterate*: use a certain number of iterations to validate results Scan**: use a round of scan to validate results Subject S n Subject S n 1. Preprocess 2. Check 3. Correct 1. Preprocess 2. Scan 4 * M. Farrar, Striped Smith-Waterman Speeds Database Searches, Bioinformatics, 2007. ** A. Khajeh-Saeed et al., Acceleration of the Smith- Waterman Algorithm, J. Comp. Phys., 2010.

Pairwise Sequence Alignment Algorithms Different Vectorization Strategies Two popular strategies: iterate & scan methods Iterate*: use a certain number of iterations to validate results Scan**: use a round of scan to validate results Subject S n Subject S n 1. Preprocess 2. Check 3. Correct 1. Preprocess 2. Scan 4 * M. Farrar, Striped Smith-Waterman Speeds Database Searches, Bioinformatics, 2007. ** A. Khajeh-Saeed et al., Acceleration of the Smith- Waterman Algorithm, J. Comp. Phys., 2010.

Pairwise Sequence Alignment Algorithms Different Vectorization Strategies Two popular strategies: iterate & scan methods Iterate*: use a certain number of iterations to validate results Scan**: use a round of scan to validate results Subject S n Subject S n 1. Preprocess 2. Check 3. Correct 1. Preprocess 2. Scan 4 * M. Farrar, Striped Smith-Waterman Speeds Database Searches, Bioinformatics, 2007. ** A. Khajeh-Saeed et al., Acceleration of the Smith- Waterman Algorithm, J. Comp. Phys., 2010.

Pairwise Sequence Alignment Algorithms Different Vectorization Strategies Two popular strategies: iterate & scan methods Iterate*: use a certain number of iterations to validate results Scan**: use a round of scan to validate results Subject S n Subject S n 1. Preprocess 2. Check 3. Correct 1. Preprocess 2. Scan 3. Correct 4 * M. Farrar, Striped Smith-Waterman Speeds Database Searches, Bioinformatics, 2007. ** A. Khajeh-Saeed et al., Acceleration of the Smith- Waterman Algorithm, J. Comp. Phys., 2010.

Pairwise Sequence Alignment Algorithms Different Vectorization Strategies Two popular strategies: iterate & scan methods Iterate*: use a certain number of iterations to validate results Scan**: use a round of scan to validate results Subject S n Subject S n 1. Preprocess 2. Check 3. Correct 1. Preprocess 2. Scan 3. Correct 4 * M. Farrar, Striped Smith-Waterman Speeds Database Searches, Bioinformatics, 2007. ** A. Khajeh-Saeed et al., Acceleration of the Smith- Waterman Algorithm, J. Comp. Phys., 2010.

Pairwise Sequence Alignment Algorithms Different Vectorization Strategies Two popular strategies: iterate & scan methods Iterate*: use a certain number of iterations to validate results Scan**: use a round of scan to validate results Subject S n Subject S n 1. Need an unpredictable # of iterations to correct results 2. Do NOT need to visit all the values 1. Need a scan to correct results 2. Need to visit all the values 1. Preprocess 2. Check 3. Correct 1. Preprocess 2. Scan 3. Correct 4 * M. Farrar, Striped Smith-Waterman Speeds Database Searches, Bioinformatics, 2007. ** A. Khajeh-Saeed et al., Acceleration of the Smith- Waterman Algorithm, J. Comp. Phys., 2010.

Motivation & Challenges Which vectorization strategy is better on x86 systems? Affected by different algorithms, configurations, inputs & platforms Different Algorithm with same affine gap over same input data Different gap systems with same SW algorithm over same input data Different inputs for the same SW algorithm with the same affine gap 19

Motivation & Challenges Which vectorization strategy is better on x86 systems? Affected by different algorithms, configurations, inputs & platforms 1. Can we automatically generate the vector codes? 2. Can we design an even better vectorization strategy? Different Algorithm with same affine gap over same input data Different gap systems with same SW algorithm over same input data Different inputs for the same SW algorithm with the same affine gap 19

Introduction & Motivation Background Vector ISA AAlign Framework Generalized Paradigm Vector Modules Vector Code Constructs Hybrid Method Evaluation & Discussion Conclusion Roadmap 19

Vector Processing Units Background: Vector ISA Carry out a single operation over a vector of data simultaneously AVX2 Instructions Platform: Vector ISA in current multi-core Haswell CPUs Width: 256 bits (two 128-bit lanes) Operations: Gather, cross-lane permute, per-element shift, etc. IMCI Instructions Platform: Vector ISA in many-core Knights Corner MIC Width: 512 bits (four 128-bit lanes) Operations: Scatter, gather, inner/cross-lane permute, etc. 20

Approaches to Using Vector ISA Compiler-based approaches Compiler options Pragma directives 23

Approaches to Using Vector ISA Compiler-based approaches Compiler options Pragma directives Issue: Fail to auto-vec loops due to complex memory access, convoluted data rearrangement, etc. 23

Approaches to Using Vector ISA Compiler-based approaches Compiler options Pragma directives Manual optimization via Compiler intrinsics Assembly code Issue: Tedious and error-prone. 23

Approaches to Using Vector ISA Compiler-based approaches Compiler options Pragma directives Manual optimization via Need a framework Compiler to intrinsics hide the details of vector codes and make it cross-platform. Assembly code Serial C codes for(i=0; i<2w; i++) { if(i<offset) array[i]=x; else array[i]= array[i-offset]; } Right-shift an array of length 2w AVX2 intrinsics on CPUs m256i v_ret; m256i cv_rev = _mm256_set_epi32(6, 5, 4, 3, 2, 1, 0, 7); v_ret = _mm256_load_si256(( m256i *)array); v_ret = _mm256_permutevar8x32_epi32(v_ret, cv_rev); v_ret = _mm256_insert_epi32(v_ret, x, 0); IMCI intrinsics on MIC m512i v_ret; m512i cv_rev = _mm512_set_epi32(14,13,12,11,10,9,8,7,6,5,4,3,2,1,0,15); unsigned short mask = 0xffff; mask <<= num; m512i cv_fil = _mm512_set1_epi32(x); v_ret = _mm512_load_epi32(array); v_ret = _mm512_permutevar_epi32(cv_rev, v_ret); v_ret = _mm512_mask_swizzle_epi32(cv_fil, mask, v_ret, _MM_SWIZ_REG_NONE); 23

Introduction & Motivation Background Vector ISA AAlign Framework Generalized Paradigm Vector Modules Vector Code Constructs Hybrid Method Evaluation & Discussion Conclusion Roadmap 25

Architectural Overview AAlign Framework seq code Clang framework AST (seq code) vec code constructs mod mod ISA-specific modules AST (vec code constructs) vec code Traverse Identify Build vec code Use hybrid method seq code mod mod ISA-specific modules vec code constructs vec code 26

Proposed Generalized Paradigm Sequential codes follow our generalized paradigm Support global and local alignment algorithms Support different gap systems: constant, linear, affine T i,j = max 0 max 0 l<j (T i,l + θ i,l + β i,k ) j k=l+1 j max 0 l<i (T l,j + θ l,j + β k,j ) T i 1,j 1 + γ i,j k=l+1 Example serial code follows the paradigm 27

Vector Operation Modules Used to express required primitive vector operations Basic vector operations (e.g., load/store/add/max) Application-specific vector operations Module Name set_vector rshift_x_fill influence_test wgt_max_scan Description Initialize a new vector using the given gap info. Right shift the vector and fill the gaps with specified value x Check if vector can affect another vector weighted max-scan over the values in a given array using vectorized method 28

Portability of Vector Modules Example: rshift_x_fill() 31

Portability of Vector Modules 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 16 x 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Example: rshift_x_fill() x x x x x x x x x x x x x x x rshift_x_fill (IMCI 32-bit int) m512_permutevar_epi32 m512_set1_epi32 x 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 m512_mask_swizzle_epi32 1 2 3 4 5 6 7 8 8 1 2 3 4 5 6 7 x 1 2 3 4 5 6 7 rshift_x_fill (AVX2 32-bit int) m256_permutevar_epi32 m256_insert_epi32 31

Portability of Vector Modules Example: rshift_x_fill() 1 2 3 4 5 6 7 8 8 1 2 3 4 5 6 7 x 1 2 3 4 5 6 7 rshift_x_fill (AVX2 32-bit int) m256_permutevar_epi32 m256_insert_epi32 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 4 1 2 3 8 5 6 7 12 9 10 11 16 13 14 15 rshift_x_fill (AVX2 16-bit int) m256_shufflehi/lo_epi16 16 13 14 15 4 1 2 3 8 5 6 7 12 9 10 11 m256_permutevar8x32_epi16 16 x 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 m256_blend_epi16 m256_insert_epi16 31

Portability of Vector Modules Example: rshift_x_fill() Provide such portability of the same functionality over different ISAs and built-in datatypes. 31

Vector Code Constructs Striped layout of the query sequence SIMD-friendly due to the elimination of data-dependency among adjacent elements Original: i a b c d e A B C D E Striped: i a A b B c C d D e E v 1 v 2 v 3 v 4 v 5 Both iterate and scan methods can use the striped layout Provide foundations of merging the two methods 33

Striped-Iterate Vector Code Constructs Iteratively correct the results if the updated predefined vectors affect the results Striped-Scan Correct the results using the vectorized weighted scan Function aalign_iterate() // Preprocess the column values Function aalign_scan() // Preprocess the column values 34

Hybrid Method Can we design an even better method? Observations (e.g., SW with affine gap) 38

Hybrid Method Can we design an even better method? Observations (e.g., SW with affine gap) If two sequences are similar, "iterate checks each position with more re-computation steps to eliminate false negatives 38

Hybrid Method Can we design an even better method? Observations (e.g., SW with affine gap) If two sequences are not similar, only a few rounds of re-computation steps are needed. 38

Hybrid Method Can we design an even better method? Observations (e.g., SW with affine gap) In "scan", the re-computations steps are fixed, i.e. scan, correction. 38

Hybrid Method Our idea: Automatically switch to the better strategy based on the current # of re-computations at runtime 42

Hybrid Method Our idea: Automatically switch to the better strategy based on the current # of re-computations at runtime 1.Switch 42

Hybrid Method Our idea: Automatically switch to the better strategy based on the current # of re-computations at runtime 2. Probe : noop 1.Switch 42

Hybrid Method Our idea: Automatically switch to the better strategy based on the current # of re-computations at runtime 2. Probe : noop 1.Switch 3. Probe : noop 42

Hybrid Method Our idea: Automatically switch to the better strategy based on the current # of re-computations at runtime 2. Probe : noop 1.Switch 3. Probe : noop 4. Probe : Switch 42

Details of the AAlign Framework Code translation Use clang driver to create Abstract Syntax Tree (AST) Detect the type of sequential code (e.g., SW or NW; linear or affine gap; etc.) Create real vector codes based on vector code constructs and type information Multi-threaded version Perform one-to-all sequence alignment Database sequences have been sorted for better load balance 44

Introduction & Motivation Background Vector ISA AAlign Framework Generalized Paradigm Vector Modules Vector Code Constructs Hybrid Method Evaluation & Discussion Conclusion Roadmap 45

Evaluation & Discussion Experiment Setup The queries are real sequences selected from NCBI, i.e., AL4A1_HUMAN(Q500), COSA1_HUMAN(Q1000), B0I1R8_HUMAN(Q2000), MUC17_HUMAN(Q4000) The database is Swiss-prot containing over 570k sequences Parameter CPU MIC Product Name Intel Xeon E5-2680 v3 Intel Xeon Phi 5110P Code Name Haswell Knights Corner # of Cores 24 60 Clock Rate 2.5 GHz 1.05 GHz L1/L2/L3 Cache 32 KB/ 256 KB/ 30 MB 32 KB/ 512 KB/ - Memory 128 GB DDR3 8 GB GDDR5 Compiler icpc 15.3 icpc 15.3 Compiler Options -xcore-avx2 O3 -mmic -O3 Vector ISA AVX2 IMCI 46

Evaluation & Discussion Speedups over Serial Counterparts SW on Haswell CPU SW on Knights Corner MIC AAlign-Iterate can achieve better vectorization efficiency Superlinear speedups for AAlign-Iterate comes from the elimination of a considerable amount of re-computation when the influence_test() fails 47

Evaluation & Discussion Performance of Hybrid Method SW on Haswell CPU SW on Knights Corner MIC Use query coverage (QC) and max identity (MI) to describe the similarity of two sequences. (Format: <QC>_<MI>, e.g., lo_hi) 51

Evaluation & Discussion Performance of Hybrid Method lo_hi means only a small portion (<30%) of two sequences overlaps each other, but the overlapped areas are highly similar (>70%). SW on Haswell CPU SW on Knights Corner MIC Use query coverage (QC) and max identity (MI) to describe the similarity of two sequences. (Format: <QC>_<MI>, e.g., lo_hi) 51

Evaluation & Discussion Performance of Hybrid Method SW on Haswell CPU SW on Knights Corner MIC Use query coverage (QC) and max identity (MI) to describe the similarity of two sequences. (Format: <QC>_<MI>, e.g., lo_hi) Hybrid method can achieve better performance than both vector algorithms; for some cases, it can approximate the superior one 51

Performance Comparison with Open-Source Tools AAlign vs. SWPS3* on CPU AAlign vs. SWAPHI** on MIC SW on Haswell CPU Higher the better SW on Knights Corner MIC CPU: AAlign codes can outperform SWPS3 by up to 2.5x MIC: AAlign codes can outperform SWAPHI by up to 1.6x *A. Szalkowski, C. Ledergerber, P. Krhenbhl, and C. Dessimoz, SWPS3 fast multithreaded vectorized Smith-Waterman for IBM Cell/B.E. and 86/SSE2, BMC Res Notes, 2008 **Y. Liu and B. Schmidt, SWAPHI: Smith-waterman protein database search on Xeon Phi coprocessors, Int l Conf. on Application-specific Systems, Architectures and Processors (ASAP), 2014. 51

Conclusion AAlign: A specialized framework for pairwise alignment algorithms on the x86-based processors Efficient vector codes based on striped-iterate & striped-scan Sets of platform-specific vector modules Design: A new input-agnostic hybrid method Performance: Significant performance gains over serial counterparts Auto-switching to better vectorization strategy at runtime (hybrid method) Up to 2.5x performance benefit over existing multi-threaded tools Availability: https://github.com/vtsynergy/aalign THANK YOU! More info: http:// 52