Efficient streaming applications on multi-core with FastFlow: the biosequence alignment test-bed


Efficient streaming applications on multi-core with FastFlow: the biosequence alignment test-bed Marco Aldinucci Computer Science Dept. - University of Torino - Italy Marco Danelutto, Massimiliano Meneghin, Massimo Torquati Computer Science Dept. - University of Pisa - Italy Peter Kilpatrick Computer Science Dept. - Queen's University Belfast - U.K. ParCo 2009 - Sep. 1st - Lyon - France

Outline: Motivation (commodity architecture evolution, efficiency for fine-grained computation, POSIX thread evaluation); FastFlow (architecture, implementation); Experimental results (micro-benchmarks; real-world app: the Smith-Waterman sequence alignment application); Conclusion, future work, and surprise dessert (before lunch)

[< 2004] Shared Front-Side Bus (Centralized Snooping)

[2005] Dual Independent Buses (Centralized Snooping)

[2007] Dedicated High-Speed Interconnects (Centralized Snooping)

[2009] QuickPath (MESI-F Directory Coherence)

This and next-generation SCM exploit cache coherence, and this is likely to remain true in the near future. Memory fences are expensive, and an increasing core count will make them worse. Atomic operations do not solve the problem (they still imply fences). As a result, fine-grained parallelism is off-limits: I/O-bound problems, high-throughput and streaming applications, irregular DP problems, and automatic or assisted parallelization all suffer.

Micro-benchmarks: farm of tasks. Used to implement parameter sweeping, master-worker, etc. (network: E -> W1 ... Wn -> C)

void Emitter() { for (i = 0; i < streamlen; ++i) { task = create_task(); queue = select_worker_queue(); queue->PUSH(task); } }

void Worker() { while (!end_of_stream) { myqueue->POP(&task); do_work(task); } }

int main() { spawn_thread(Emitter); for (i = 0; i < nworkers; ++i) { spawn_thread(Worker); } wait_end(); }
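The farm pseudocode above can be fleshed out into a self-contained C++11 sketch. This is not the benchmark code from the talk: the per-worker queues here are mutex-based blocking queues (the baseline the slides then improve upon), the task is just a number, `do_work` is a sum, and a negative task is used as an end-of-stream marker; all names are hypothetical.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Minimal blocking queue standing in for the per-worker queues of the
// slide; a task < 0 acts as the end-of-stream marker.
class TaskQueue {
    std::queue<long> q;
    std::mutex m;
    std::condition_variable cv;
public:
    void push(long t) {
        { std::lock_guard<std::mutex> lk(m); q.push(t); }
        cv.notify_one();
    }
    long pop() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return !q.empty(); });
        long t = q.front(); q.pop();
        return t;
    }
};

// Emitter dispatches streamlen tasks round-robin to nworkers workers;
// each worker accumulates into its own slot; main plays the collector.
long run_farm(int nworkers, long streamlen) {
    std::vector<TaskQueue> queues(nworkers);
    std::vector<long> partial(nworkers, 0);
    std::vector<std::thread> workers;
    for (int w = 0; w < nworkers; ++w)
        workers.emplace_back([&, w] {
            for (;;) {                       // Worker: POP until end-of-stream
                long task = queues[w].pop();
                if (task < 0) break;
                partial[w] += task;          // do_work(task)
            }
        });
    for (long i = 0; i < streamlen; ++i)     // Emitter: round-robin PUSH
        queues[i % nworkers].push(i);
    for (int w = 0; w < nworkers; ++w)       // one end-of-stream per worker
        queues[w].push(-1);
    for (auto &t : workers) t.join();
    long sum = 0;                            // Collector: reduce partials
    for (long p : partial) sum += p;
    return sum;
}
```

The lock/condition-variable round trip on every task is exactly the per-item overhead that dominates at sub-microsecond grain, which is what the following slides measure.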

Using POSIX lock/unlock queues. [Chart: speedup of the farm (E, W1..Wn, C) vs. number of cores (2-8) for 50 μs, 5 μs, and 0.5 μs task granularities, against the ideal speedup.]

Using CompareAndSwap queues. [Chart: speedup of the farm (E, W1..Wn, C) vs. number of cores (2-8) for 50 μs, 5 μs, and 0.5 μs task granularities, against the ideal speedup.]

Evaluation: poor performance for fine-grained computations; memory fences seriously affect performance.

What about avoiding fences in SCM? High-level semantics matters! DP paradigms entail bidirectional data exchange among cores: cache reconciliation can be made faster, but not avoided. Task-parallel, streaming, and systolic paradigms usually result in a one-way data flow, well described by data-flow graphs (streaming networks). Is cache coherency really strictly needed there?

Streaming Networks. A streaming network can easily be built from POSIX (or other) threads plus asynchronous channels, while still exploiting a global address space (threads can still share memory using locks). An asynchronous channel is thread lifecycle control plus a FIFO queue. Queue flavors: Single-Producer Single-Consumer (SPSC), Single-Producer Multiple-Consumer (SPMC), Multiple-Producer Single-Consumer (MPSC), Multiple-Producer Multiple-Consumer (MPMC). Lifecycle: ready / active waiting (yield + over-provisioning).

Queues: state of the art. MPMC: dozens of lock-free (and wait-free) proposals; quality is usually measured by the number of atomic operations (CAS). SPSC: lock-free and fence-free, with no CAS at all: J. Giacomoni, T. Moseley, and M. Vachharajani. FastForward for efficient pipeline parallelism: a cache-optimized concurrent lock-free queue. PPoPP 2008, ACM. It supports Total Store Order OOO architectures (e.g. Intel Core) and uses active waiting, invoking the OS as little as possible. Native SPMC and MPSC: see MPMC.
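The FastForward idea cited above can be sketched in a few lines: each slot of a circular buffer holds a pointer, with nullptr meaning "empty", so producer and consumer never share an index and no CAS is needed. This is an illustrative reduction, not FastFlow's actual queue; the acquire/release operations used here compile to plain loads and stores on TSO machines such as x86, matching the fence-free claim, and all names are hypothetical.

```cpp
#include <atomic>
#include <cstddef>

// FastForward-style SPSC queue sketch: the slot itself carries the
// full/empty information, so producer-only `tail` and consumer-only
// `head` are never shared, and no CAS or fence is required on TSO.
template <typename T, size_t N>
class SPSCQueue {
    std::atomic<T*> buf[N];
    size_t head = 0;   // advanced only by the consumer
    size_t tail = 0;   // advanced only by the producer
public:
    SPSCQueue() {
        for (auto &s : buf) s.store(nullptr, std::memory_order_relaxed);
    }
    bool push(T *item) {                     // producer side
        if (buf[tail].load(std::memory_order_acquire) != nullptr)
            return false;                    // queue full
        buf[tail].store(item, std::memory_order_release);
        tail = (tail + 1) % N;
        return true;
    }
    T *pop() {                               // consumer side
        T *item = buf[head].load(std::memory_order_acquire);
        if (item == nullptr) return nullptr; // queue empty
        buf[head].store(nullptr, std::memory_order_release);
        head = (head + 1) % N;
        return item;
    }
};
```

A failed push or pop is where the "active waiting" of the slide comes in: the caller spins (optionally yielding) instead of blocking in the OS.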

SPMC and MPSC via SPSC + control. SPMC(x): a fence-free queue with x consumers, built from one SPSC input queue and x SPSC output queues; one flow of control (thread, E) dispatches items from the input to the outputs. MPSC(y): a fence-free queue with y producers, built from one SPSC output queue and y SPSC input queues; one flow of control (thread, C) gathers items from the inputs to the output. x and y can be changed dynamically. MPMC = MPSC + SPMC: just juxtapose the two parametric networks.

FastFlow: a step forward. It implements lock-free SPSC, SPMC, MPSC, and MPMC queues by exploiting streaming networks. Features can be composed as parametric streaming networks (graphs): e.g. an optimized memory allocator can be added by fusing the allocator graph with the application graph (not described here). Features are represented as skeletons, whose compilation target is streaming networks. The C++ STL-like implementation can be used as a low-level library, or to generatively compile skeletons into streaming networks. It is blazing fast on fine-grained computations.

Very fine grain (0.5 μs). [Chart: speedup of the farm (E, W1..Wn, C) vs. number of cores (2-8) for POSIX lock, CAS, and FastFlow queues, against the ideal speedup.]

Fine grain (5 μs). [Chart: speedup of the farm (E, W1..Wn, C) vs. number of cores (2-8) for POSIX lock, CAS, and FastFlow queues, against the ideal speedup.]

Medium grain (50 μs). [Chart: speedup of the farm (E, W1..Wn, C) vs. number of cores (2-8) for POSIX lock, CAS, and FastFlow queues, against the ideal speedup.]

Biosequence alignment. The Smith-Waterman algorithm computes local alignment by dynamic programming; it is time- and space-demanding, O(mn), and is often replaced by the approximate BLAST. It is a real-world application: it has been accelerated using FPGA, GPGPU (CUDA), SSE2/x86, and IBM Cell. The best software implementation is SWPS3, an evolution of Farrar's implementation (SSE2 + POSIX IPC).

Smith-Waterman algorithm Local alignment - dynamic programming - O(nm)
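The O(nm) recurrence can be shown in a minimal C++ sketch. Note the simplifications: the talk uses the BLOSUM50 substitution matrix and affine gap penalties, while this illustrative version (names and scores hypothetical, not the SWPS3 kernel) uses a flat +2/-1 match/mismatch score and a linear gap penalty of -2.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Smith-Waterman local-alignment score in O(n*m) time and space.
// Simplified scoring: +2 match, -1 mismatch, -2 linear gap penalty.
int sw_score(const std::string &a, const std::string &b) {
    const int match = 2, mismatch = -1, gap = -2;
    const size_t n = a.size(), m = b.size();
    std::vector<std::vector<int>> H(n + 1, std::vector<int>(m + 1, 0));
    int best = 0;
    for (size_t i = 1; i <= n; ++i)
        for (size_t j = 1; j <= m; ++j) {
            int s = (a[i - 1] == b[j - 1]) ? match : mismatch;
            H[i][j] = std::max({0,                   // local: restart alignment
                                H[i - 1][j - 1] + s, // diagonal: (mis)match
                                H[i - 1][j] + gap,   // up: gap in b
                                H[i][j - 1] + gap}); // left: gap in a
            best = std::max(best, H[i][j]);
        }
    return best;  // score of the best local alignment
}
```

The clamping to 0 is what makes the alignment local (an alignment may start anywhere); replacing the single `gap` term with separate gap-open and gap-extend costs yields the affine-gap variant used in the experiments.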

Substitution Matrix: describes the rate at which one character in a sequence changes to another character state over time. Gap Penalty: describes the cost of gaps, possibly as a function of gap length. Experiment parameters: affine gap penalties 10-2k, 5-2k, ...; substitution matrix BLOSUM50.

Biosequence testbed. Each query sequence (protein) is aligned against the whole protein DB by a farm of threads, processes, or similar (SW1 ... SWn), e.g. to compare an unknown sequence against a DB of known sequences. The SWPS3 implementation exploits POSIX processes and pipes, which turn out to be faster than POSIX threads + locks; the DB is kept in read-only shared memory. DB: UniProtKB Swiss-Prot, 471472 sequences, 167326533 amino-acids.

Smith-Waterman (10-2k gap penalty). [Chart: GCUPS (the higher the better), SWPS3 vs. FastFlow, for query sequence lengths 144, 189, 246, 464, 553, 1000, 2005, 3005, 4061, 22152.]

Smith-Waterman (5-2k gap penalty). [Chart: GCUPS (the higher the better), SWPS3 vs. FastFlow, for query sequence lengths 144, 189, 246, 464, 553, 1000, 2005, 3005, 4061, 22152.]

Conclusions. FastFlow efficiently supports streaming applications on commodity SCM (e.g. the Intel Core architecture), more efficiently than POSIX threads (with standard locks or CAS). The Smith-Waterman implementation with FastFlow was obtained from SWPS3 by syntactically substituting reads and writes on POSIX pipes with FastFlow pop and push (and POSIX pipes are in turn faster than POSIX threads + locks in this case). It scores twice the speed of the best known parallel implementation (SWPS3) on the same hardware (Intel 2 x quad-core 2.5 GHz).

Future Work. FastFlow is open source (an STL-like C++ library will be released soon); contact me if you are interested. It includes a specialized (very fast) parallel memory allocator. Since it efficiently supports fine-grained computations, it can be used to automatically parallelize a wide class of problems. It can be used as a compilation target for skeletons, supporting parametric parallelism schemas and compositionality (which can be formalized as graph rewriting). It can be extended to CC-NUMA architectures, and it can be used to extend Intel TBB and OpenMP, increasing the performance of those tools.

5-2k gap penalty. [Chart: GCUPS for FastFlow, OpenMP, Cilk, TBB, and SWPS3 across query sequences (proteins P14942 ... Q8WXI7, lengths 144 to 22152).] FastFlow is also faster than OpenMP, Intel TBB, and Cilk (at least for streaming on Intel 2 x quad-core).

THANK YOU! QUESTIONS?... and one question for you: are those chips really built for parallel computing?