Overview: Emerging Parallel Programming Models (COMP4300/8300 L28, 2017)

- the partitioned global address space (PGAS) paradigm: the HPCS initiative; the basic idea of PGAS; the Chapel language (design principles, task and data parallelism, sum-array example, domains and locales)
- directed acyclic task graph programming models: motivations and ideas; PLASMA (tiled Cholesky factorization example, architecture); Intel CnC

Refs: Chapel home page; Chamberlain et al., User-Defined Parallel Zippered Iterators in Chapel, PGAS 2011; see embedded hyperlinks.

The High Productivity Computing Systems (HPCS) Initiative
- a DARPA project (begun 2003) to provide "a new generation of economically viable high productivity computing systems ... that double in productivity (or value) every 18 months"
- the HPCS program envisioned large-scale systems with aggressively multicore components
- P is for productivity (performance, programmability, portability & robustness):
  - measure/predict the ease or difficulty of developing HPC applications
  - reduce the time to result: write code, run code, analyse results
- introduced a variety of experimental languages: X10 (IBM), Chapel (Cray), Fortress (Sun; no longer supported by DARPA)
- state until late 2008: implementations were in a poor state (very slow to compile & run, support for shared-memory targets only, inadequate documentation)

Partitioned Global Address Space
- recall the shared-memory model: multiple threads with pointers into a global address space
- in the partitioned global address space (PGAS) model: multiple threads, each with affinity to some partition of the global address space
  - SPMD or fork-join thread creation
  - remote pointers to access data in other partitions
- the model maps to a cluster with remote memory access; it can also map to NUMA domains (see the sketch below)
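To make the partitioning concrete, here is a minimal Chapel sketch (Chapel is introduced on the following slides). It assumes a multi-locale Chapel build; the array a and the choice of locale are illustrative only.

    // Each variable lives in the partition of the locale whose code declares it,
    // but any task can read or write it remotely through the global address space.
    config const n = 8;

    var a: [0..#n] int;              // allocated in locale 0's partition

    on Locales[numLocales-1] {       // run a task with affinity to the last locale
      var b = 10;                    // b lives in that locale's partition
      a[0] = b;                      // remote write back into locale 0's partition
    }
    writeln(a[0]);                   // prints 10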

Chapel: Design Principles
- object-oriented (Java-like syntax, but influenced by ZPL & HPF)
- supports exploratory programming: implicit (statically-inferable) types, run-time settable parameters (config), implicit main and module wrappings (see the sketch below)
- high-level abstractions for fine-grained parallel programming: coforall, atomic code sections, sync variables
- and for coarse-grained parallel programming (global-view abstractions):
  - tasks (cobegin blocks)
  - locales (UMA places to run tasks)
  - (index) domains, used to specify arrays and iteration ranges
  - distributions (mappings of domains to locales)
- claims to drastically reduce code size relative to MPI programs
- more info on the home page; the current implementation is in C++
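A small illustration of the exploratory-programming points above (a sketch only, not from the slides): the program needs no explicit main or module, its variable types are inferred, and the config constant can be overridden at run time (e.g. ./sum --n=1000000, where sum is a hypothetical binary name).

    config const n = 100;       // run-time settable parameter
    var total = 0;              // type (int) inferred statically
    for i in 1..n do
      total += i;
    writeln("sum of 1..", n, " = ", total);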

Chapel: Task Parallelism
- task creation:

    begin dostuff();
    cobegin { dostuff1(); dostuff2(); }

- synchronisation variables:

    var a$: sync int;
    begin a$ = foo();
    c = 2 * a$;        // suspends until a$ is assigned

- example: Fibonacci numbers

    // version using a sync variable as a future
    proc fib(n: int): int {
      if (n < 2) then return 1;
      var t1$: sync int;
      var t2: int;
      begin t1$ = fib(n-1);
      t2 = fib(n-2);
      return t1$ + t2;   // waits on t1$
    }

    // alternative version using cobegin: both recursive calls run as child tasks
    proc fib(n: int): int {
      if (n < 2) then return 1;
      var t1, t2: int;
      cobegin {
        t1 = fib(n-1);
        t2 = fib(n-2);
      }
      return t1 + t2;
    }

Chapel: Data Parallelism
- ranges:

    var r1 = 0..9;            // equiv: r1 = 0..#10;
    const MyRange = 0..#N;

- data-parallel loops:

    var A, B: [0..#N] real;
    forall i in 0..#N do      // cf. coforall
      A(i) = A(i) + B(i);

- scalar promotion:

    A = A + B;

- reductions and scans:

    max = (max reduce A);
    A = (+ scan A);           // prefix sum of A

- example: DAXPY

    config const N = 1000, alpha = 3.0;
    proc daxpy(x: [MyRange] real, y: [MyRange] real) {
      forall i in MyRange do
        y(i) = alpha * x(i) + y(i);
    }

  Alternatively, via promotion, the forall loop can be replaced by: y = alpha * x + y;
- the target of data parallelism could be SIMD, a GPU or ordinary threads

Sum Array in Chapel

    use Time;
    config const size: int = 5*1000*1000;

    class ArraySum {
      var array: [0..size-1] int = 1;
      var currentsum: sync int = 0;

      proc sum(numthreads: int) {
        var blocksize: int = size / numthreads;
        coforall i in 0..numthreads-1 {
          var partsum: int = 0;
          for j in i*blocksize .. (i+1)*blocksize-1 do
            partsum += array(j);
          currentsum += partsum;   // the sync variable serialises the accumulation
        }
      }
    }

    var thearray: ArraySum = new ArraySum();
    var sumtimer: Timer = new Timer();
    for i in 0..6 {
      thearray.currentsum.writeXF(0);
      sumtimer.start();
      thearray.sum(2**i);
      writeln(2**i, " threads ", sumtimer.elapsed(TimeUnits.milliseconds), " ms");
      sumtimer.clear();
    }

Chapel: Domains and Locales
- a rectangular domain is a tensor product of ranges, e.g.

    const D = {0..#M, 0..#N};

  there are also sparse and associative domains
- locale: a unit of the target architecture: processing elements with (uniform) local memory

    const Locales: [0..#numLocales] locale = ...;   // built in
    on Locales[1] do foo();
    coforall (loc, id) in zip(Locales, 1..) do      // `zippered' iteration
      on loc do                                     // migrates this task to loc
        forall tid in 0..#numTasks do
          writeln("Task ", id, " thread ", tid, " executes on ", loc);

- use domain maps to map the indices of a domain to locales:

    use CyclicDist;
    const Dist = new dmap(new Cyclic(startIdx=1, targetLocales=Locales[0..1]));
    const D = {0..#N} dmapped Dist;
    var x, y: [D] real;
    const alpha = 2.0;
    forall i in D do
      y(i) = alpha * x(i) + y(i);   // we could use x(i-1) too!
    y = alpha * x + y;              // same!

DAG Execution Models: Motivations and Ideas
- in most programming models, serial or parallel, the algorithm is over-specified:
  - sequencing that is not necessary is often specified
  - specifying which (sub-)tasks of a program can run in parallel is difficult and error-prone
  - the model may constrain the program to run on a particular architecture (e.g. a single address space)
- directed acyclic task graph (DAG) programming models specify only the necessary semantic ordering constraints:
  - we express an instance of an executing program as a graph of tasks
  - a node has an edge pointing to a second node if there is a (data) dependency between them
  - the DAG run-time system can then determine when and where each task executes, with the potential to extract maximum parallelism

DAG Execution Models: Mechanisms
- in DAG programming models we version data (say by iteration count); it thus has declarative, write-once semantics
- a node in a DAG has associated with it:
  - the input data items (including version numbers) it requires
  - the output data items it produces (usually with updated version numbers)
  - the function which performs the task
- running a task-DAG program involves:
  - generating the graph
  - allowing an execution engine to schedule tasks to processors
  - a task may execute when all of its input data items are ready
  - its function first unpacks the input data items
  - the function informs the engine that its output data items have been produced before exiting
  (a toy dataflow sketch in Chapel follows)
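The following toy Chapel sketch (not PLASMA's or CnC's API) mimics the execution rule above, using sync variables as write-once data items: each task blocks until its inputs are full, then publishes its output for downstream tasks, with the Chapel runtime playing the role of the execution engine.

    var a$, b$, c$: sync real;                     // data items: empty until produced

    begin a$.writeEF(2.0);                         // task with no inputs, produces a
    begin b$.writeEF(3.0);                         // task with no inputs, produces b
    begin c$.writeEF(a$.readFF() + b$.readFF());   // runs once a and b are ready, produces c

    writeln("c = ", c$.readFF());                  // the consumer blocks until c is produced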

DAG Example: Tiled Cholesky Factorization
- PLASMA is a domain-specific programming model for dense linear algebra
- a (positive definite) symmetric matrix A may be factored A = L L^T, where L is triangular
- the right-looking tiled algorithm executes in place (A may be stored as a (lower) triangular matrix), with L overwriting A, tile by tile
- the (PLASMA) DAG pseudo-code is:

    for k = 0 .. nb_tiles-1
      DPOTRF(A[k][k])                       // A[k][k] = sqrt(A[k][k])  (Cholesky of the tile)
      for m = k+1 .. nb_tiles-1
        DTRSM(A[k][k], A[m][k])             // A[m][k] = A[k][k]^-1 * A[m][k]
      for n = k+1 .. nb_tiles-1
        DSYRK(A[n][k], A[n][n])             // A[n][n] -= A[n][k] * A[n][k]^T
        for m = n+1 .. nb_tiles-1
          DGEMM(A[m][k], A[n][k], A[m][n])  // A[m][n] -= A[m][k] * A[n][k]^T

- the tile size (and hence nb_tiles) is a trade-off: parallelism / load balance vs amortising task startup and data fetch/store costs, cache performance, etc.

Tiled DAG Cholesky Factorization (II)
[figure omitted; courtesy Haidar et al., "Analysis of Dynamically Scheduled Tile Algorithms for Dense Linear Algebra on Multicore Architectures"]

Tiled DAG Cholesky Factorization (III)
[figure omitted: task graph with nb_tiles = 5 (courtesy Haidar et al.); the column on the left shows the depth:width ratio of the DAG]

Cholesky Factorization in PLASMA
- for each core task, we define a function to insert it into PLASMA's task-DAG scheduler and one to perform the task:

    void Dsched_dpotrf(Dsched *dsched, int nb, double *A, int lda) {
      DSCHED_Insert_Task(dsched, TASK_core_dpotrf,
                         sizeof(int),          &nb,  VALUE,
                         sizeof(double)*nb*nb,  A,   INOUT | LOCALITY,
                         sizeof(int),          &lda, VALUE,
                         0);
    }

    void TASK_core_dpotrf(Dsched *dsched) {
      int nb, lda;
      double *A;
      dsched_unpack_args_3(dsched, nb, A, lda);
      dpotrf("L", &nb, A, &lda, ...);
    }

- and these are inserted into the scheduler, which works out the implicit dependencies:

    for (k = 0; k < nb_tiles; k++) {
      Dsched_dpotrf(dsched, nb, A[k][k], lda);
      for (m = k+1; m < nb_tiles; m++)
        Dsched_dtrsm(dsched, nb, A[k][k], A[m][k], lda);
      for (n = k+1; n < nb_tiles; n++) {
        Dsched_dsyrk(dsched, nb, A[n][k], A[n][n], lda);
        for (m = n+1; m < nb_tiles; m++)
          Dsched_dgemm(dsched, nb, A[m][k], A[n][k], A[m][n], lda);
      }
    }

PLASMA Architecture
[figures courtesy Haidar et al.]
- architecture of the scheduler:
  - inserted tasks go into an implicit DAG
  - tasks can be in NotReady, Queued, or Done states
  - workers execute queued tasks
  - descendants of Done tasks are examined to see whether they are now ready
- execution trace of Cholesky: large vs small window sizes
  - a full DAG becomes very large; it is necessary only to expand a window of the next tasks to be executed
  - the window needs to be reasonably large to get good scheduling

Intel Concurrent Collections (CnC)
- Intel CnC is based on two sources of ordering requirements:
  - producer / consumer (data dependence)
  - controller / controllee (control dependence)
- CnC combines ideas from streaming, tuple spaces and dataflow
- users need to supply a CnC dependency graph and code for the step functions
- step functions get and put data items from/into the CnC store
[figures courtesy Knobe & Sarkar, "CnC Programming Model"]

Cholesky in CnC
[figures omitted: the dependency graph specification and the step function for the symmetric rank-k update (courtesy Knobe & Sarkar, "CnC Programming Model", and Schlimbach, Brodman & Knobe, "Concurrent Collections on Distributed Memory")]
- (Cholesky = dpotrf, Trisolve = dtrsm, Update = dsyrk + dgemm)
- note that the iteration is an explicit part of the tag
- the task graph is generated by executing a user-supplied harness

DAG Task Graph Programming Models: Summary
- the programmer needs only to break the computation into tasks; otherwise it is not much more than specifying a sequential computation
  - specify the task executor and the task-harness generator
  - this becomes trickier when the task DAG is data-dependent, e.g. iterative linear system solvers
- advantages:
  - maximizing (optimizing) parallelism, transparent load balancing
  - arguably a simpler programming model: no race hazards / deadlocks! testing on a serial processor should reveal all bugs
  - abstraction over the underlying architecture
  - permits fault tolerance: tasks on a failed process may be re-executed (requires data items to be kept in a resilient store)
- shows promise for both small- and large-scale parallel execution; distributed versions exist (CnC, DPLASMA, DARMA)