Communicating Process Architectures in Light of Parallel Design Patterns and Skeletons


Communicating Process Architectures in Light of Parallel Design Patterns and Skeletons
Dr Kevin Chalmers, School of Computing, Edinburgh Napier University, Edinburgh
k.chalmers@napier.ac.uk

Overview
- I started looking into patterns and skeletons when I wrote some nice helper functions for C++11 CSP (a hypothetical sketch of the first of these follows below):
  - par_for
  - par_read
  - par_write
- I started wondering what other helper functions and blocks I could develop
- Which led me to writing the paper, which I've done some further thinking about
- So, I'll start with my proposals to the CPA community and add in some extra ideas not in the paper
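As context for the kind of helper meant here, the following is a minimal, hypothetical sketch of a par_for built on plain C++11 threads. It is not the author's actual C++11 CSP implementation; the name is taken from the slide, but the signature and the chunking strategy are assumptions made purely for illustration.

    #include <algorithm>
    #include <functional>
    #include <thread>
    #include <vector>

    // Hypothetical helper: run f(i) for every i in [begin, end) across n worker threads.
    void par_for(int begin, int end, int n, const std::function<void(int)>& f) {
        std::vector<std::thread> workers;
        const int chunk = (end - begin + n - 1) / n;   // iterations per worker, rounded up
        for (int w = 0; w < n; ++w) {
            const int lo = begin + w * chunk;
            const int hi = std::min(end, lo + chunk);
            if (lo >= hi) break;                       // no work left for further workers
            workers.emplace_back([lo, hi, &f] {
                for (int i = lo; i < hi; ++i) f(i);    // each worker handles its own chunk
            });
        }
        for (auto& t : workers) t.join();              // wait for all workers to finish
    }

A call such as par_for(0, m, 4, [](int i) { /* work on element i */ }); would then run the loop body across four workers.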

Outline
1 Creating Patterns and Skeletons with CPA
2 CSP as a Descriptive Language for Skeletal Programs
3 Using CCSP as a Lightweight Runtime
4 Targeting Cluster Environments
5 Summary

Comparing Pattern Definitions

Table: Mapping Catanzaro's and Massingill's views of parallel design patterns.

    Catanzaro                  Massingill
    Not Covered                Finding Concurrency
    Structural                 Supporting Structures
    Computational              Not Covered
    Algorithm Strategy         Algorithm Structures
    Implementation Strategy    Supporting Structures
    Concurrent Execution       Implementation Mechanisms

Common Patterns Discussed in the Literature
- Pipeline (or pipe and filter).
- Master-slave (or work farm, worker-farmer).
- Agent and repository.
- Map-reduce.
- Task-graph.
- Loop parallelism (or parallel for).
- Thread pool (or shared queue).
- Single Program - Multiple Data (SPMD).
- Message passing.
- Fork-join.
- Divide and conquer.

Slight Aside - The 7 Dwarves (computational problem patterns)
- Structured grid.
- Unstructured grid.
- Dense matrix.
- Sparse matrix.
- Spectral (FFT).
- Particle methods.
- Monte Carlo (map-reduce).

Pipeline and Map-reduce

Figure: Pipeline Design Pattern (Process 1 → Process 2 → ... → Process n).
Figure: Map-reduce Design Pattern (a function f(x) is mapped over the inputs and the results are combined with g(x, y)).

Skeletons
- Pipeline.
- Master-slave.
- Map-reduce.
- Loop parallelism.
- Divide and conquer.
- Fold.
- Map.
- Scan.
- Zip.

Data Transformation - How Functional Programmers Think
- I'll come back to this again later
- Basically many of these ideas come from the functional people
- Everything in their mind is a data transform
- Having spent time with functional people (Scotland has a lot of Haskellers), I've found they see every parallel problem as a map-reduce one
- This has real problems for scalability

Example - FastFlow

Creating a Pipeline with FastFlow

    #include <vector>
    #include <ff/pipeline.hpp>
    #include <ff/farm.hpp>
    using namespace ff;

    int main() {
        // Create a vector of two workers
        std::vector<ff_node*> workers = { new worker, new worker };
        // Create a pipeline of two stages and a farm
        ff_pipe<fftask_t> pipeline(new stage_1, new stage_2, new ff_farm<>(workers));
        // Execute the pipeline
        pipeline.run_and_wait_end();
        return 0;
    }
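The worker and stage classes are not shown on the slide; in FastFlow they would be subclasses of ff_node that implement the svc() method. A minimal sketch of such a worker, assuming a task type fftask_t defined elsewhere, might look like this:

    #include <ff/node.hpp>
    using namespace ff;

    // Hypothetical worker stage: receives a task pointer, processes it, forwards it on.
    struct worker : ff_node {
        void* svc(void* t) {
            fftask_t* task = static_cast<fftask_t*>(t);
            // ... do some work on *task ...
            return task;   // pass the (possibly modified) task to the next stage
        }
    };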

Plug and Play with CPA
- We've actually been working with skeletons for a long time
- The plug and play set of processes capture some of the ideas - but not quite in the same way
- Some of the more interesting processes we have are:
  - Paraplex (gather)
  - Deparaplex (scatter)
  - Delta
  - Basically any communication pattern
- So we already think in this way. We just need to extend our thinking a little.

An aside - Shader Programming on the GPU

Some GLSL

    // Incoming / outgoing values
    layout (location = 0) in vec3 position;
    layout (location = 0) out float shade;

    // Setting the value
    shade = 5.0;

    // Emitting vertices and primitives
    for (int i = 0; i < 3; ++i) {
        // ... do some calculation
        EmitVertex();
    }
    EndPrimitive();

Tasks as a Unit of Computation

Task Interface in C++11 CSP

    void my_task(chan_in<input_type> input, chan_out<output_type> output) {
        while (true) {
            // Read input
            auto x = input();
            // ... compute y from x ...
            // Write output
            output(y);
        }
    }

- Unlike a pipeline task we can match arbitrary input to arbitrary output
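As a concrete illustration, a minimal task written against this interface, assuming the chan_in/chan_out call operators behave as shown above, might simply double each incoming value:

    // Hypothetical example task: reads integers and writes their doubles.
    void doubler(chan_in<int> input, chan_out<int> output) {
        while (true) {
            auto x = input();    // read a value from the input channel
            output(x * 2);       // write the transformed value to the output channel
        }
    }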

Tasks as a Unit of Computation

Creating a Pipeline in C++11 CSP

    // Plug processes together directly
    task_1.out(task_2.in());

    // Define a pipeline (a pipeline is also a task)
    pipeline<input_type, output_type> pipe1{ task_1, task_2, task_3 };

    // Could also add processes together
    task<input_type, output_type> pipe2 = task_1 + task_2 + task_3;

Programmers not Plumbers
- These potential examples adopt a different style to standard CPA
- Notice that we don't have to create channels to connect tasks together
- Although the task method uses channels
- This pluggable approach to process composition is something I tried back in 2006 with .NET CSP
- I don't think it was well received, however

Descriptive Languages
- Descriptive languages have been put forward to describe skeletal programs
- A limited set of base skeletons is used to describe further skeletons
- The aim is to describe the structure of a parallel application using this small set of components
- The description can then be reasoned about to enable simplification
- In other words, examine the high-level description and determine whether a different combination of skeletons would provide the same behaviour but faster (less communication, reduction, etc.)

RISC-pb2l
- Describes a collection of general-purpose blocks:
  - Wrappers: describe how a function is to be run (e.g. sequentially or in parallel)
  - Combinators: describe communication between blocks
    - 1-to-N (a deparaplex)
    - N-to-1 (a paraplex)
    - a policy, for example unicast, gather, scatter, etc.
  - Functionals: run parallel computations (e.g. spread, reduce, pipeline)

Example - Task Farm

    TaskFarm(f) = Unicast(Auto) • [ f ]^n • Gather

Reading from left to right:
- Unicast(Auto) denotes a 1-to-N communication using a unicast policy that is auto-selected; Auto means that each item of work is sent to a single available node from the available processes.
- • is a separator between stages of the pipeline.
- [ ]^n denotes that n computations are occurring in parallel.
- The content of the brackets is the computation being undertaken, which is f in the declaration TaskFarm(f).
- Gather denotes an N-to-1 communication using a gather policy.

SkeTo
- Uses a functional approach to composition
- For example map:

      map_L(f, [x_1, x_2, ..., x_n]) = [f(x_1), f(x_2), ..., f(x_n)]

- And reduce:

      reduce_L(⊕, [x_1, x_2, ..., x_n]) = x_1 ⊕ x_2 ⊕ ... ⊕ x_n

- We can therefore describe a Monte Carlo π computation as:

      pi(points) = result
        where f(x, y) = sqr(x) + sqr(y) <= 1
              result  = reduce_L(+, map_L(f, points)) / n
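As a rough illustration of the same structure in C++ (the language used elsewhere in this talk), the map and reduce steps can be fused into a single std::accumulate. The point generation and the final scaling by 4 (to turn the quarter-circle hit ratio into an estimate of π) are assumptions of this sketch rather than part of the SkeTo definition above.

    #include <iostream>
    #include <numeric>
    #include <random>
    #include <utility>
    #include <vector>

    int main() {
        const std::size_t n = 1000000;
        std::mt19937 gen(42);
        std::uniform_real_distribution<double> dist(0.0, 1.0);

        // Generate n random points in the unit square
        std::vector<std::pair<double, double>> points(n);
        for (auto& p : points) p = { dist(gen), dist(gen) };

        // map: f(x, y) = 1 if sqr(x) + sqr(y) <= 1, else 0
        // reduce: sum the results (both steps fused into one accumulate here)
        const std::size_t hits = std::accumulate(points.begin(), points.end(), std::size_t{0},
            [](std::size_t acc, const std::pair<double, double>& p) {
                return acc + (p.first * p.first + p.second * p.second <= 1.0 ? 1u : 0u);
            });

        // Scale the hit ratio by 4 to estimate pi (we only sampled one quadrant)
        std::cout << 4.0 * static_cast<double>(hits) / n << "\n";
        return 0;
    }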

Thinking about CSP as a Description Language
- OK, I've not thought about this too hard
- I'll leave this to the CSP people
- However I see the same sort of terms used in the descriptive languages:
  - Description
  - Reasoning
  - Communication
  - etc.
- Creating a set of CSP blocks that could be used to describe skeleton systems could be interesting

What Doesn't Work for Parallelism
- There has been discussion around what doesn't work for exploiting parallelism in the wider world (don't blame me, blame the literature):
  - automatic parallelization.
  - compiler support is limited to low-level optimizations.
  - explicit technologies such as OpenMP and MPI require too much effort.
- The creation of new languages is also not considered a viable route (again, don't shoot the messenger)
- So how do we use what we have?

CCSP as a Runtime
- To quote Peter: we have the fastest multicore scheduler
- So why isn't it used elsewhere?
- I would argue we need to use the runtime as a target platform for existing ideas

Example OpenMP

OpenMP Parallel For

    #pragma omp parallel for num_threads(n)
    for (int i = 0; i < m; ++i) {
        // ... do some work
    }

- The pre-processor generates the necessary code
- OpenMP is restrictive on n above - usually 64 max
- A CCSP runtime could overcome this
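To make the comparison concrete, the same loop expressed through a par_for-style helper (such as the hypothetical sketch shown earlier) needs no pre-processor support, and the degree of parallelism becomes an ordinary argument:

    // Hypothetical usage of the par_for sketch shown earlier: m iterations spread
    // across 1024 workers; a CCSP-style lightweight scheduler need not map each
    // worker onto its own OS thread.
    par_for(0, m, 1024, [](int i) {
        // ... do some work ...
    });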

Skeletons Aimed at MPI
- Just one slide!
- Most skeleton frameworks use MPI under the hood
- They help exploit parallelism using MPI
- This is something any CPA skeleton framework would have to look into supporting
- Handily, I'm working on that just now
- As we consider communication more, it shouldn't be that difficult.

Summary
- This work is really about pointing to some potential future directions for CPA
- I have put forward four proposals:
  - To the community at large: the description and implementation of parallel design patterns and skeletons with CPA techniques
  - To the CSP people: the use of CSP as a description language for these skeletons
  - To the CCSP developers: the use of CCSP as a runtime to support parallel execution (such as OpenMP)
  - To the distributed runtime developers: the use of these ideas in distributed computing to better target cluster computing
- We also need to disseminate these ideas to the wider parallel community if we want them to use these techniques

Questions?