Parallel Programming using FastFlow

Parallel Programming using FastFlow
Massimo Torquati <torquati@di.unipi.it>
Computer Science Department, University of Pisa, Italy
Karlsruhe, September 2nd, 2014

Outline
- Structured Parallel Programming
- Algorithmic Skeletons & Parallel Design Patterns
- FastFlow: a data-flow framework for heterogeneous many-core platforms
- Applications developed using FastFlow & Structured Parallel Programming
In this talk we consider mainly multi-core systems.

Structured Parallel Programming

Structured parallel programming aims to provide standard (and effective) rules for composing parallel computations in a machine-independent way. The goal is to reduce the complexity of parallelization problems by introducing constraints, i.e. by restricting the computation structure. Modularity, portability and programmability are the keywords. Parallel paradigms are the base components of parallel applications, and using structured parallel programming forces the programmer to think parallel.

Structured Parallel Programming

The parallel programmer is relieved from all concerns related to the implementation of the parallel paradigms on the target platform, and has to concentrate only on the computational aspects: the separation-of-concerns principle.

Skeletons & Patterns

Algorithmic Skeletons: from the HPC community, since the early '90s; pre-defined parallel higher-order functions provided as constructs or library calls.
Parallel Design Patterns: from the SW engineering community, since the early '00s; recipes to handle parallelism (name, problem, algorithms, solutions, ...).
The same concept at different abstraction levels: we use the two terms, patterns and skeletons, interchangeably, to emphasise the similarities of these two concepts.

Structured Parallel Programming using Patterns

(Diagram.) The app developer writes high-level code by selecting, instantiating, composing, extending and iterating over the patterns provided by the system developer on top of the run-time support: Map, Reduce, Stencil, ... (Data Parallel); Pipeline, Task-Farm (Stream Parallel); Divide&Conquer, MDF, ... (Task Parallel). Which pattern to use depends on the class of the problem.

Structured Parallel Programming: example

Problem: apply the function F to all N elements of the input array A.

App developer: selects the Map pattern and instantiates it with static partitioning.
Run-time support: selects the best Map implementation on the target platform with static partitioning of data, executes the code and provides the result.
App developer: analyses the result; the workload is unbalanced, so the Map pattern is re-instantiated with dynamic partitioning.
Run-time support: selects the best Map implementation on the target platform with dynamic partitioning of data, executes the code and provides the result.

A sketch of the two instantiations is shown below.
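To make the two instantiations concrete, here is a minimal sketch using FastFlow's ParallelFor pattern. It assumes the interface described in the FastFlow tutorial, where an optional grain argument controls chunking (parallel_for_static with grain 0 partitions the iterations statically; a positive grain makes workers grab chunks dynamically); F, N and A are the names from the example above.

    #include <ff/parallel_for.hpp>
    using namespace ff;

    const long N = 1000000;
    double A[N];
    double F(long i) { return i * 0.5; }   // stand-in for the real function

    int main() {
        ParallelFor pf;

        // First instantiation: static partitioning
        // (iterations divided once into contiguous blocks, one per worker)
        pf.parallel_for_static(0, N, 1, 0, [&](const long i) {
            A[i] = F(i);
        });

        // Second instantiation: dynamic partitioning
        // (workers grab chunks of 1024 iterations on demand)
        pf.parallel_for(0, N, 1, 1024, [&](const long i) {
            A[i] = F(i);
        });
        return 0;
    }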

Assessment (algorithmic skeletons)

Separation of concerns: the application programmer states what is computed; the system programmer decides how the result is computed.
Inversion of control: the program structure is suggested by the programmer; the run-time selects the optimizations for the target platform.
Performance: close to hand-tuned code (sometimes better); reduced development time and lower total cost to solution.

Structured Parallel Programming by Marco Danelutto, available on-line as SPM course material at M. Danelutto's web page: http://www.di.unipi.it/~marcod

The FastFlow framework

http://mc-fastflow.sourceforge.net
http://calvados.di.unipi.it/fastflow

A C++ class library that promotes structured parallel programming. It aims to be flexible and efficient enough to target multi-core, many-core and distributed systems. Layered design:
- Building blocks: a minimal set of mechanisms: channels, code wrappers, combinators.
- Core patterns: the streaming patterns (pipeline and task-farm) plus the feedback pattern modifier.
- High-level patterns: flexible, reusable parametric patterns for solving specific parallel problems.

FastFlow Building Blocks

Nodes are concurrent activities (POSIX threads); arrows are shared-memory channels implemented as SPSC lock-free queues. Minimal definition of a node:

    struct mynode: ff_node {
        // svc() is called on each input task; the returned
        // pointer is sent out on the output channel
        void *svc(void *task) { return task; }
    };

Stream Parallel Patterns (core patterns)

A stream is a sequence of data items having the same type.

pipeline:

    ff_pipe<mytask> pipe(s1, s2, ..., sn);
    pipe.run_and_wait_end();

task-farm (the Emitter E schedules input data items to the workers; the Collector C gathers their results):

    std::vector<ff_node*> WorkerArray;
    ff_farm<> farm(WorkerArray, &E, &C);
    farm.run_and_wait_end();

A complete minimal example combining a node, a pipeline and a farm is sketched below.
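The snippets above compose naturally. The following is a minimal complete program, a sketch assuming the classic FastFlow interface (a farm is itself an ff_node, so it can be used as a pipeline stage); Source, Worker and Sink are hypothetical stages introduced here:

    #include <cstdio>
    #include <vector>
    #include <ff/pipeline.hpp>
    #include <ff/farm.hpp>
    using namespace ff;

    struct Source: ff_node {             // generates a stream of 100 longs
        void *svc(void *) {
            for (long i = 1; i <= 100; ++i) ff_send_out((void*)i);
            return NULL;                 // NULL terminates the stream (EOS)
        }
    };
    struct Worker: ff_node {             // doubles each stream item
        void *svc(void *task) { return (void*)(2 * (long)task); }
    };
    struct Sink: ff_node {               // consumes the results
        void *svc(void *task) { printf("%ld\n", (long)task); return GO_ON; }
    };

    int main() {
        std::vector<ff_node*> W = { new Worker, new Worker, new Worker };
        ff_farm<> farm(W);
        farm.add_collector(NULL);        // default Collector, feeds the next stage

        Source source; Sink sink;
        ff_pipeline pipe;                // source -> farm(3 workers) -> sink
        pipe.add_stage(&source);
        pipe.add_stage(&farm);
        pipe.add_stage(&sink);
        pipe.run_and_wait_end();
        return 0;
    }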

FastFlow core patterns

(Diagram.) The pipeline and task-farm core patterns and their specializations.

Core Patterns Composition

(Diagram.) Core patterns can be composed and nested, e.g. a pipeline whose stages are farms, or a farm whose workers are pipelines; a sketch of the latter follows.
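As a sketch of one such composition (anticipating the image-filtering example on the next slide), here is a farm whose workers are two-stage pipelines, again assuming the classic interface and the includes of the previous sketch; Blur, Emboss, Read and Write are hypothetical ff_node stages:

    std::vector<ff_node*> W;
    for (int i = 0; i < 4; ++i) {        // 4 pipeline workers
        ff_pipeline *p = new ff_pipeline;
        p->add_stage(new Blur);
        p->add_stage(new Emboss);
        W.push_back(p);                  // a pipeline is itself an ff_node
    }
    ff_farm<> farm(W);
    farm.add_collector(NULL);

    ff_pipeline outer;                   // Read -> farm(Blur|Emboss) -> Write
    outer.add_stage(new Read);
    outer.add_stage(&farm);
    outer.add_stage(new Write);
    outer.run_and_wait_end();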

Example: filtering images

Applying the Blur & Emboss filters to a stream of images. Measured times of the successive parallelizations (platform: 2 x Xeon E5-2695 @ 2.40GHz, 1 disk storage):
- initial version: Time = 288s
- 4-stage pipeline: Time = 112s
- replicating the Blur & Emboss functions by using task-farms: Time = 71s
- optimizing resources: Time = 34s
- farm with pipeline workers: Time = 75s
- parallelizing I/O too: Time = 33s

Data Parallel Patterns (1)

ParallelFor (M: map function, computed in parallel):

    ParallelFor pf;
    pf.parallel_for(0, N, [&](const long i) {
        A[i] = F(i);
    });

ParallelForReduce (M: map function, R: reduce function; M & R computed in parallel):

    ParallelForReduce<double> pfr;
    pfr.parallel_reduce(sum, 0.0, 0, N,
        [&](const long i, double &sum) { sum += F(i); },
        [](double &sum, const double v) { sum += v; });

ParallelForPipeReduce (M & R computed in pipeline):

    ParallelForPipeReduce<task_t> pfpr;
    auto MapF = [&](... ff_buffernode &n) { ...; n.put(task); };
    auto ReduceF = [&](task_t *task) { ... };
    pfpr.parallel_reduce_idx(0, N, step, chunk, MapF, ReduceF);

All of them support static and dynamic scheduling of tasks, with or without a scheduler thread (Sched).

Data Parallel Patterns (2)

Map, Stencil and pool-evolution patterns. They may be used inside stream parallel patterns (pipeline and task-farm). On multi-core systems all of them are implemented on top of the ParallelFor* patterns (algorithms + ParallelFor* wrappers).

For example, the Map pattern is a FastFlow node with, inside, an optimized instantiation of the ParallelForReduce pattern:

    struct mymap: ff_map<> {
        void *svc(void *task) {
            ...;
            ff_map<>::parallel_for(...);
        }
    };

Targeting GPGPUs: support for both CUDA and/or OpenCL is available for some data-parallel patterns (work still in progress).

Example: Mandelbrot set (1)

A very simple data-parallel computation: each pixel can be computed independently, so a simple ParallelFor implementation is possible. However, black pixels require much more computation, and a naive partitioning of the image quickly leads to load-unbalanced computation and poor performance.

Let's take as the minimum computation unit a single image line (image size 2048x2048, max 10^3 iterations per point), with 48 workers:
- static partitioning of lines: max speedup 14
- dynamic partitioning of lines: max speedup 37

Data partitioning may have a big impact on performance (see the sketch below). Platform: 2 x Xeon E5-2695 @ 2.40GHz, 1 disk storage.
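A minimal sketch of the dynamic variant, assuming the same ParallelFor interface as before; mandel_line is a hypothetical kernel computing one image line:

    #include <ff/parallel_for.hpp>
    using namespace ff;

    const long DIM = 2048, MAXIT = 1000;   // image size and iteration bound

    // hypothetical kernel: iterates z = z^2 + c for each point of line y
    void mandel_line(long y, long maxiter) { /* ... */ }

    int main() {
        ParallelFor pf(48);                // up to 48 workers
        // grain = 1: each worker dynamically grabs one line at a time,
        // so expensive (black) lines do not unbalance the computation
        pf.parallel_for(0, DIM, 1, 1, [&](const long y) {
            mandel_line(y, MAXIT);
        });
        return 0;
    }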

Example: Mandelbrot set (2)

Suppose now we want to compute a number of Mandelbrot images, for example varying the computation threshold per point. We have basically two options:
1. one single parallel-for inside a sequential for iterating over all the different threshold values: for each threshold value, parallel_for(Mandel(threshold));
2. a task-farm with map workers, implementing two different scheduling strategies.

Which one is better, having limited resources? It depends on many factors and is too difficult to say in advance; being able to move quickly between the two solutions is the key point. Both are sketched below.
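The two options in a sketch, reusing the names and includes from the previous sketches (thresholds, mandel_line, MandelMap and ThresholdGen are hypothetical):

    // Option 1: sequential for over the threshold values,
    // a single parallel-for over the image lines inside
    ParallelFor pf;
    for (long t : thresholds)
        pf.parallel_for(0, DIM, [&](const long y) { mandel_line(y, t); });

    // Option 2: a task-farm whose workers are map nodes;
    // ThresholdGen streams one task per threshold value and each
    // MandelMap worker (an ff_map<> subclass) computes a whole image
    std::vector<ff_node*> W;
    for (int i = 0; i < 4; ++i) W.push_back(new MandelMap);
    ff_farm<> farm(W);
    ff_pipeline pipe;
    pipe.add_stage(new ThresholdGen);
    pipe.add_stage(&farm);
    pipe.run_and_wait_end();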

Task-parallel patterns

Macro-Data-Flow (MDF): the MDF pattern executes a data-dependency graph (DAG) and is a general approach to parallelism. It is implemented as pipeline(taskgen, farm(Ws, Sched)). The user specifies the data dependencies in a sequential function (taskgen) that generates the tasks (by using the AddTask method); the run-time automatically takes care of the dependencies and schedules ready tasks to the workers (Ws):

    void taskgen(mdf ...) {
        Param1 = {&A, INPUT};
        Param2 = {&A, OUTPUT};
        mdf->AddTask(Params, F, A);
    }
    ff_mdf<Params> mdf(taskgen);
    mdf.run_and_wait_end();

Divide&Conquer: currently implemented as a task-farm with a feedback channel and specialized Emitter and Collector.

Example: Strassen's algorithm

Matrix multiplication using Strassen's algorithm:

    | A11 A12 |   | B11 B12 |   | C11 C12 |
    | A21 A22 | x | B21 B22 | = | C21 C22 |

    S1 = A11 + A22    S2  = B11 + B22    P1 = S1 * S2
    S3 = A21 + A22                       P2 = S3 * B11
    S4 = B12 - B22                       P3 = A11 * S4
    S5 = B21 - B11                       P4 = A22 * S5
    S6 = A11 + A12                       P5 = S6 * B22
    S7 = A21 - A11    S8  = B11 + B12    P6 = S7 * S8
    S9 = A12 - A22    S10 = B21 + B22    P7 = S9 * S10

    C11 = P1 + P4 - P5 + P7
    C12 = P3 + P5
    C21 = P2 + P4
    C22 = P1 - P2 + P3 + P6

The sequential function taskgen is responsible for generating the instructions S1, S2, P1, S3, P2, ... in any order, specifying their INPUT and OUTPUT dependencies. Each macro-instruction can then be computed in parallel using a ParallelFor pattern or optimized linear-algebra matrix operations (BLAS, PLASMA, LAPACK, ...). A sketch of such a taskgen follows.
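As an illustration, this is what the first few Strassen macro-instructions could look like with the MDF interface of the previous slide. This is a minimal sketch, assuming the param_info/AddTask style of FastFlow's ff_mdf as it appears in its example programs (the real API details may differ); matadd, matmul and the Params layout are hypothetical names introduced here.

    #include <cstdint>
    #include <vector>
    #include <ff/mdf.hpp>
    using namespace ff;

    // Hypothetical macro-instruction kernels working on n x n tiles:
    void matadd(double *X, double *Y, double *Z, long n) { /* Z = X + Y */ }
    void matmul(double *X, double *Y, double *Z, long n) { /* Z = X * Y */ }

    struct Params {                      // tiles and bookkeeping for taskgen
        double *A11, *A22, *B11, *B22, *S1, *S2, *P1;
        long n;
        ff_mdf *mdf;
    };

    void taskgen(Params *const P) {
        auto mdf = P->mdf;
        {   // S1 = A11 + A22
            const param_info in1 = {(uintptr_t)P->A11, INPUT};
            const param_info in2 = {(uintptr_t)P->A22, INPUT};
            const param_info out = {(uintptr_t)P->S1,  OUTPUT};
            std::vector<param_info> par = {in1, in2, out};
            mdf->AddTask(par, matadd, P->A11, P->A22, P->S1, P->n);
        }
        {   // S2 = B11 + B22
            const param_info in1 = {(uintptr_t)P->B11, INPUT};
            const param_info in2 = {(uintptr_t)P->B22, INPUT};
            const param_info out = {(uintptr_t)P->S2,  OUTPUT};
            std::vector<param_info> par = {in1, in2, out};
            mdf->AddTask(par, matadd, P->B11, P->B22, P->S2, P->n);
        }
        {   // P1 = S1 * S2: ready only after both S1 and S2 are produced
            const param_info in1 = {(uintptr_t)P->S1, INPUT};
            const param_info in2 = {(uintptr_t)P->S2, INPUT};
            const param_info out = {(uintptr_t)P->P1, OUTPUT};
            std::vector<param_info> par = {in1, in2, out};
            mdf->AddTask(par, matmul, P->S1, P->S2, P->P1, P->n);
        }
        // ... S3, P2 and the remaining instructions follow the same scheme
    }

    int main() {
        Params P{};                      // tiles would be allocated here
        ff_mdf dag(taskgen, &P);
        P.mdf = &dag;
        dag.run_and_wait_end();
        return 0;
    }

The run-time matches the OUTPUT tags of S1 and S2 against the INPUT tags of P1, so P1 is scheduled to a worker only once both operands are available.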

Real applications (some)

Stream Parallel: Bowtie (BT) and BWA sequence alignment tools; Peafowl, an open-source parallel DPI framework.
Task Parallel: YaDT-FF, a fast C4.5 classifier (*); block-based LU & Cholesky factorizations.
Data Parallel: two-stage image and video denoiser.

(*) M. Aldinucci, S. Ruggieri and M. Torquati, "Decision tree building on multi-core using FastFlow", Concurrency and Computation: Practice and Experience, Vol. 26, 2014.

Stream: Bowtie (BT) and BWA Sequence Alignment Tools

Very widely used tools for DNA alignment: hand-tuned C/C++/SSE2 code, spin-locks + POSIX threads; reads are streamed from memory-mapped files to worker threads. The FastFlow version uses a task-farm + feedback implementation, with thread pinning, memory affinity and affinity scheduling, yielding a quite substantial improvement in performance.

C. Misale, G. Ferrero, M. Torquati and M. Aldinucci, "Sequence alignment tools: one parallel pattern to rule them all?", BioMed Research International, 2014.

Stream: 10 Gbit Deep Packet Inspection (DPI) on multi-core using Peafowl

Peafowl is an open-source high-performance DPI framework with a FastFlow-based run-time (task-farm + customized Emitter and Collector). On top of it we developed an HTTP virus pattern-matching application for 10 Gbit networks, able to sustain the full network bandwidth using commodity hardware.

M. Danelutto, L. Deri, D. De Sensi and M. Torquati, "Deep packet inspection on commodity hardware using FastFlow", PARCO 2013 conference, Vol. 25, 2013.

Task-Parallel: LU & Cholesky factorizations using the MDF pattern

Dense-matrix, block-based algorithms. The Macro-Data-Flow (MDF) pattern encodes the dependency graph (DAG), which is generated dynamically during the computation; scheduling is configurable (e.g. affinity scheduling of tasks). Performance is comparable with a specialized multi-core dense linear algebra framework (PLASMA).

D. Buono, M. Danelutto, T. De Matteis, G. Mencagli and M. Torquati, "A light-weight run-time support for fast dense linear algebra on multi-core", PDCN 2014 conference, 2014.

(Figure: the DAG of the left-looking version of the Cholesky algorithm on 5 tiles.)

Data-Parallel: two-stage image restoration

Detect stage: adaptive median filter, produces a noise map. Denoise stage: variational restoration (an iterative optimization algorithm), a 9-point stencil computation. It provides high-quality edge-preserving filtering, at a higher computational cost w.r.t. other edge-preserving filters: without parallelization the technique is too costly for practical use. The two phases can be pipelined for video streaming.

M. Aldinucci, C. Spampinato, M. Drocco, M. Torquati and S. Palazzo, "A parallel edge preserving algorithm for salt and pepper image denoising", IPTA 2012 conference, 2012.

Salt & Pepper image restoration (example figures).

Stream + Data-Parallel: video de-noising, possible parallelization options using patterns.

M. Aldinucci, M. Torquati, M. Drocco, G. Peretti Pezzi and C. Spampinato, "Combining pattern level abstractions and efficiency in GPGPUs", GTC 2014 invited talk, 2014.

Video de-noising: different deployments

Deployments evaluated: CPU only (C++); OpenCL on CPU + GPGPU; GPGPUs only. The best option is to use all available GPGPUs.

Video de-noising demo

Thanks to Marco Aldinucci for the demo video.

Conclusions

Structured parallel programming models have been around for many years, and they are proved to work: clear semantics, enforced separation of concerns, rapid prototyping and portability of code, and almost the same performance as hand-tuned parallel code.

FastFlow is a C++ class library and research-project framework that promotes structured parallel programming. It can be used at different levels, with each level providing a small number of building blocks/patterns with clear parallel semantics.

Thanks for your attention!

Thanks to: Marco Danelutto (University of Pisa), Marco Aldinucci (University of Turin), Peter Kilpatrick (Queen's University Belfast).

EU-FP7 projects using FastFlow: (project logos).

FastFlow project site:
http://mc-fastflow.sourceforge.net
http://calvados.di.unipi.it/fastflow