Supporting Advanced Patterns in GrPPI, a Generic Parallel Pattern Interface


Supporting Advanced Patterns in GrPPI, a Generic Parallel Pattern Interface
David del Rio, Manuel F. Dolz, Javier Fernández, Daniel Garcia
University Carlos III of Madrid
Autonomic Solutions for Parallel and Distributed Data Stream Processing (Auto-DaSP), Euro-Par 2017
Santiago de Compostela, August 29th, 2017

RePhrase Project: Refactoring Parallel Heterogeneous Software, a Software Engineering Approach (ICT-644235)
2015-2018, 3.6M budget, 8 partners, 6 European countries: UK, Spain, Italy, Austria, Hungary, Israel
http://www.rephrase-ict.eu

Thinking in Parallel
Fundamentally, programmers must learn to think parallel; this requires new high-level programming constructs:
- you cannot program effectively while worrying about deadlocks etc.: they must be eliminated from the design!
- you cannot program effectively while handling communication etc.: this needs to be packaged/abstracted!
- you cannot program effectively without performance information: this needs to be included!
How can this be solved? We use two key technologies:
- Refactoring (changing the source code structure)
- Parallel Patterns (high-level functions of parallel algorithms)

GrPPI: A Generic and Reusable Parallel Pattern Interface
Objectives:
- Provide a high-level parallel pattern interface for C++ applications: a Generic and Reusable Parallel Pattern Interface (GrPPI)
- High-level C++ interface based on metaprogramming and template-based techniques
- Support for different parallel programming frameworks as back ends
- Support for basic stream and data patterns and their composability
[Figure: (a) Pipeline. (b) Farm. (c) Filter. (d) Accumulator.]

GrPPI: A Generic and Reusable Parallel Pattern Interface
Example of the Pipeline pattern interface (shown as a code figure on the slide; see the sketch below):
Journal publication: D. R. Astorga, M. F. Dolz, J. Fernandez and J. D. Garcia, "A generic parallel pattern interface for stream and data processing", Concurrency and Computation: Practice and Experience, http://dx.doi.org/10.1002/cpe.4175 (open access). https://github.com/arcosuc3m/grppi
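The pipeline code shown on the original slide is not reproduced in this transcription. The following is a minimal sketch of a GrPPI Pipeline invocation, written in the same style as the composability example later in this talk; the element type (int), the helpers read_value, is and os, and the omitted GrPPI headers are assumptions.

Pipeline( parallel_execution_thr{4},
  // Stage 1 (consumer): produce items until the input stream is exhausted
  [&]() -> optional<int> {
    auto value = read_value(is);                       // hypothetical helper
    return (value > 0) ? optional<int>{value} : optional<int>{};
  },
  // Stage 2 (kernel): transform each item; all stages run concurrently
  [](int e) { return e * e; },
  // Stage 3 (producer): consume results from the output stream
  [&](int e) { os << e << endl; }
);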

GrPPI: A Generic and Reusable Parallel Pattern Interface
Patterns vs. parallel programming frameworks: stream and data parallel patterns; they can be composed among them!

Basic patterns (✓ = supported, ✗ = not supported):

Stream patterns       Sequential  OpenMP  TBB  C++ Threads  CUDA Thrust
Pipeline              ✓           ✓       ✓    ✓            ✗
Farm                  ✓           ✓       ✓    ✓            ✓
StreamFilter          ✓           ✓       ✓    ✓            ✗
StreamAccumulator     ✓           ✓       ✓    ✓            ✓
StreamIterator        ✓           ✓       ✓    ✓            ✗

Data patterns         Sequential  OpenMP  TBB  C++ Threads  CUDA Thrust
Map                   ✓           ✓       ✓    ✓            ✓
Stencil               ✓           ✓       ✓    ✓            ✗
Reduce                ✓           ✓       ✓    ✓            ✓
MapReduce             ✓           ✓       ✓    ✓            ✓
Divide&Conquer        ✓           ✓       ✓    ✓            ✗

Journal publication: D. R. Astorga, M. F. Dolz, J. Fernandez and J. D. Garcia, "A generic parallel pattern interface for stream and data processing", Concurrency and Computation: Practice and Experience, http://dx.doi.org/10.1002/cpe.4175 (open access). https://github.com/arcosuc3m/grppi

GrPPI: A Generic and Reusable Parallel Pattern Interface
Patterns vs. parallel programming frameworks (same table of basic stream and data patterns as on the previous slide): we include advanced patterns!

Advanced patterns (stream/data)   Sequential  OpenMP  TBB  C++ Threads  CUDA Thrust
(shown as "?" on the slide: support to be presented next)

(Journal publication and repository as on the previous slide.)

Advanced parallel patterns
New advanced patterns in GrPPI: Pool, Windowed-Farm and Stream-Iterator

Advanced parallel patterns
The Pool pattern:
- Models the evolution of a population of individuals
- Iteratively applies the following functions to the population P:
  - Selection (S): selects specific individuals (pure func.)
  - Evolve (E): evolves individuals into any number of new or modified individuals (pure func.)
  - Filter (F): filters individuals and inserts them back into the population
  - Termination (T): determines whether the evolution should finish or continue
[Figure legend: S = Selection; E = Evolution; F = Filter; T = Termination]

Advanced parallel patterns
The Pool pattern is mainly used in:
- Symbolic computing domain: orbit patterns.
- Evolutionary computing domain: genetic algorithm patterns, multi-agent systems, etc.
Example: travelling salesman, an NP-hard problem computing the shortest path among cities.
GrPPI interface:

Listing 1.1: Pool interface.
template <typename EM, typename P, typename S, typename E, typename F, typename T>
void Pool(EM exec_mod, P &popul, S &&select, E &&evolve, F &&filt, T &&term, int num_select);

Parallelism degree: EM exec_mod. Number of selections: int num_select.
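As an illustration only, the sketch below drives the Listing 1.1 interface with a toy integer population. The lambda signatures are assumptions inferred from the roles of S, E, F and T described above; GrPPI and standard headers (<vector>, <algorithm>) are omitted for brevity.

// Toy population: integers that "evolve" by doubling until a bound is reached.
std::vector<int> population = {1, 2, 3, 5, 8, 13};

Pool( parallel_execution_thr{12}, population,
  // Selection (S): pick the fittest individual (pure)
  [](const std::vector<int> &pop) {
    return *std::max_element(pop.begin(), pop.end());
  },
  // Evolve (E): derive a new individual from the selected one (pure)
  [](int ind) { return 2 * ind; },
  // Filter (F): decide which evolved individuals are inserted back
  [](int ind) { return ind < 1'000'000; },
  // Termination (T): stop once the population contains a large enough value
  [](const std::vector<int> &pop) {
    return *std::max_element(pop.begin(), pop.end()) > 100'000;
  },
  200  // number of selections per iteration
);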

Advanced parallel patterns
The Windowed-Farm pattern:
- Stream pattern that delivers windows of processed items to the output stream
- Performs the following actions:
  - The (pure) window-farm function WF transforms consecutive windows of size x into windows of size y.
  - The output items may contain the items from the input windows collapsed in a specific way.
  - The windows can have an overlap factor.

Advanced parallel patterns
The Windowed-Farm pattern can be used in:
- Real-time processing engines.
- Wireless sensor networks.
Example: compute average window values from sensor readings.
GrPPI interface:

Listing 1.2: Windowed-Farm interface.
template <typename EM, typename I, typename WF, typename O>
void WindowedFarm(EM exec_mod, I &&in, WF &&task, O &&out, int win_size, int overlap);

Parallelism degree: EM exec_mod. Window size and overlap: int win_size, int overlap.
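A sketch of how the Listing 1.2 interface might be driven for the sensor-averaging example. The helper read_sensor, the output stream os, and the consumer/producer lambda signatures are assumptions modelled on the other GrPPI stream patterns; GrPPI headers are omitted.

WindowedFarm( parallel_execution_thr{12},
  // Consumer: emit one sample per call; an empty optional ends the stream
  [&]() -> optional<double> {
    double sample = read_sensor();                     // hypothetical helper
    return (sample >= 0.0) ? optional<double>{sample} : optional<double>{};
  },
  // Window task (WF): collapse a window of samples into its average
  [](const std::vector<double> &window) {
    double sum = 0.0;
    for (double s : window) sum += s;
    return sum / window.size();
  },
  // Producer: deliver each averaged window to the output stream
  [&](double avg) { os << avg << endl; },
  100,  // window size
  90    // overlap between consecutive windows
);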

Advanced parallel patterns
The stand-alone and farmed Stream-Iterator pattern:
[Figure: (c) Stream-Iterator. (d) Farm Stream-Iterator. Legend: F = Farm; T = Termination; G = Guard]
- Stream pattern that recurrently computes a given pure function
- Applies the following functions to the input stream items:
  - Farm (F): transforms a single stream input; it can be computed in parallel (pure func.).
  - Termination (T): determines whether the computation of F should be continued or not.
  - Guard (G): determines in each iteration whether the result of the function F should be delivered to the output stream or not.

Advanced parallel patterns
The stand-alone and farmed Stream-Iterator patterns can be used in:
- Real-time processing applications.
Example: reduce the frames of an input video to different resolutions; the resolution is halved in each iteration.
GrPPI interface:

Listing 1.3: Stream-Iterator interface.
template <typename EM, typename I, typename F, typename O, typename T, typename G>
void StreamIteration(EM exec_mod, I &&in, F &&task, O &&out, T &&term, G &&guard);

Parallelism degree: EM exec_mod.
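A sketch of the video-resolution example using the Listing 1.3 interface. The type Image and the helpers next_image, halve_resolution, write_image and width() are hypothetical; the direction of the termination predicate follows the composability example on the next slide.

StreamIteration( parallel_execution_thr{12},
  // Consumer: read the next frame; an empty optional ends the stream
  [&]() -> optional<Image> { return next_image(); },
  // Kernel (F): halve the resolution of a frame (pure; can be farmed)
  [](Image img) { return halve_resolution(img); },
  // Producer: write the delivered resolutions to the output stream
  [&](const Image &img) { write_image(img); },
  // Termination (T): keep halving while above the smallest target resolution
  [](const Image &img) { return img.width() > 128; },
  // Guard (G): deliver only the requested resolutions
  [](const Image &img) {
    return img.width() == 128 || img.width() == 512 || img.width() == 1024;
  }
);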

Advanced parallel patterns
Composability features: Stream-Iterator + Pipeline

StreamIteration( parallel_execution_thr{4},
  // Consumer function
  [&]() -> optional<int> {
    auto value = read_value(is);
    return (value > 0) ? optional<int>{value} : optional<int>{};
  },
  // Kernel function
  Pipeline(
    [](int e) { return e + 2*e; },
    [](int e) { return e - 1; }
  ),
  // Producer function
  [&](int e) { os << e << endl; },
  // Termination function
  [](int e) { return e < 100; },
  // Output guard function
  [](int e) { return e % 2 == 0; }
);

Experimental evaluation
Usability and performance evaluation of the parallel patterns:
- Target platform: 2x Intel Xeon Ivy Bridge E5-2695 (24 cores)
- Parallel technologies: C++11 threads, OpenMP and Intel TBB
Benchmarks:
- Pool pattern: travelling salesman (TSP) using a regular evolutionary algorithm; an NP-hard problem computing the shortest route among different cities, visiting each only once and returning to the origin city.
- Windowed-Farm pattern: computation of average window values from emulated sensor readings.
- Stream-Iteration pattern: reduction of the resolution of the images appearing in the input stream, producing images of different resolutions on the output stream.

Experimental evaluation
Usability evaluation: analysis of the modified lines of code (LOCs) w.r.t. the sequential version

% of modified lines of code
Advanced pattern    C++ Threads   OpenMP    Intel TBB   GrPPI
Pool                +55.0 %       +70.0 %   +55.0 %     +22.5 %
Windowed-Farm       +152.1 %      +75.8 %   +51.7 %     +31.0 %
Stream-Iterator     +153.5 %      +56.4 %   +46.1 %     +30.8 %

- C++ threads require more modified LOCs than the frameworks providing high-level interfaces (OpenMP and Intel TBB)
- Windowed-Farm and Stream-Iterator are more difficult to parallelize!
- GrPPI requires fewer modified LOCs than any other framework

Experimental evaluation
Pool pattern evaluation using the travelling salesman problem:
- Population of 50 individuals representing feasible routes
[Figure: (a) Speedup vs. number of threads (number of selections = 200); (b) Speedup vs. number of selections (number of threads = 12); back ends: C++ Threads, Intel TBB, OpenMP]
a) Good speedup scaling w.r.t. the number of threads; however, the Intel TBB and OpenMP back ends perform better.
b) The speedup grows with the number of selections -> only the selection and evolution functions are parallelized!

Experimental evaluation
Windowed-Farm pattern evaluation using a benchmark computing average window values from input sensor readings:
- Sensor sampling frequency is set to 1 kHz
- Fixed overlapping factor among windows is 90%
[Figure: (a) Speedup vs. number of threads (window size = 100); (b) Speedup vs. window size (number of threads = 12); back ends: C++ Threads, Intel TBB, OpenMP]
a) Good speedup scaling w.r.t. the number of threads, with no difference among frameworks.
b) The speedup decreases with increasing window sizes -> the number of non-overlapping items grows.

Experimental evaluation
Stream-Iteration pattern evaluation using a benchmark that halves the resolution of a stream of images and delivers them at concrete resolutions:
- Input square images of resolution 8,192 pixels
- Output square image resolutions of 128, 512 and 1,024 pixels
[Figure: (a) Speedup vs. number of threads (input square image resolution = 8,192); (b) Speedup vs. image size (number of threads = 12); back ends: C++ Threads, Intel TBB, OpenMP]
a) Good speedup scaling up to 12 threads (75% efficiency), but strong degradation at 24 threads -> the benchmark becomes memory-bound when the threads access the input images simultaneously.
b) The speedup decreases with increasing image sizes -> the images accessed by the threads no longer fit into the L2/L3 caches (30 MB) of the target platform.

Conclusions
- Most programming models are too low-level; we need to ease the parallelization task!
- Patterns hide away the complexity of parallel programming
- GrPPI is a usable, simple, generic and high-level parallel pattern interface
- The overheads of GrPPI are negligible with respect to using the parallel programming frameworks directly
- Advanced patterns can aid in developing complex applications from different specific domains
- Parallelizing code with GrPPI only requires modifying ~30% of the lines of the sequential code
Future work:
- Support for other parallel programming frameworks: FastFlow

THANK YOU!