
Programming in Chapel
Kenjiro Taura, University of Tokyo

Contents
1. Chapel
   - Chapel overview
   - Minimum introduction to syntax
   - Task parallelism
   - Locales
   - Data parallel constructs
   - Ranges, domains, and arrays
   - Other nice things about Chapel

Chapel: brief history
- 2003: started as a DARPA-funded project under the HPCS (High Productivity Computing Systems) program
  (DARPA: the Defense Advanced Research Projects Agency)
- 2008: first public release

References
- http://chapel.cray.com/
- section numbers below refer to those in the Chapel Specification (version 0.92):
  http://chapel.cray.com/spec/spec-0.92.pdf
- tutorials:
  - concise, cut-and-pastable: http://faculty.knox.edu/dbunde/teaching/chapel/
  - extensive: http://chapel.cray.com/tutorials/sc11/SC11-Chapel.tar.gz
- cheat sheet: http://chapel.cray.com/spec/quickreference.pdf

This tutorial uses the implementation from Cray (http://chapel.cray.com/), version 1.6.0 (the newest release as of November 2012).

Compiling and running Chapel programs
compile with the chpl command, with the following environment variables set:
  CHPL_COMM, CHPL_TASKS, CHPL_COMM_SUBSTRATE
e.g. with bash:
  $ CHPL_COMM=gasnet CHPL_COMM_SUBSTRATE=udp CHPL_TASKS=fifo chpl program.chpl
run the executable giving the number of nodes (locales) with -nl; the exact command line depends on the choice of CHPL_COMM_SUBSTRATE. e.g. with udp:
  $ ./a.out -nl 1
  $ SSH_SERVERS="oooka000 oooka001" ./a.out -nl 2
see Appendix 2 for more details

Hello world in Chapel
  proc main() {
    writeln("hello");
  }
- proc introduces a function
- a function called main is the entry point
- writeln is a versatile print-with-newline function

Chapel programming model basics
- there is only one main thread (or "task") at the start, as opposed to the SPMD model of MPI/UPC
- a task can create another task at an arbitrary point (task parallelism) (§24):
  begin, sync variables, sync, cobegin, coforall
- a node is represented as a locale object and can be used to specify task/data distribution (§26):
  Locales, on
- objects and arrays can be remotely referenced and shared (global address space)
- higher-level data parallel constructs are built on top of them (§25): forall
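
To fix the picture, here is a minimal sketch (not from the slides) combining the three ingredients above: a task created with begin, moved to another locale with on, and a data-parallel forall loop. It assumes the program runs with at least two locales.
  var done : single bool;
  begin {                       // create a new task ...
    on Locales[1] {             // ... and run its body on locale 1
      writeln("hello from locale ", here.id);
    }
    done = true;                // signal completion
  }
  var A : [1..10] int;
  forall i in 1..10 {           // data-parallel loop executed by the main task
    A[i] = i;
  }
  const ok = done;              // reading the single variable blocks until it is written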

Chapel and other languages
             distributed memory   global address   arbitrary nested
             support              space            parallelism
  OpenMP     n                    N/A              n
  TBB        n                    N/A              y
  MPI        y                    n                n
  UPC        y                    y                n
  Chapel     y                    y                y

Primitive types
- int(8), int(16), int(32), int(64); similarly for uint; int = int(64); uint = uint(64)
- real(32), real(64); real = real(64)
- complex(64), complex(128); complex = complex(128)
- bool (true/false)
- string
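
A small sketch (not on the slide) declaring one variable of each of the types above; the literal values are arbitrary.
  var i8 : int(8)   = 100;
  var u  : uint     = 42;          // uint = uint(64)
  var x  : real(32) = 1.5;
  var z  : complex  = 1.0 + 2.0i;  // complex = complex(128)
  var b  : bool     = true;
  var s  : string   = "chapel";
  writeln(i8, " ", u, " ", x, " ", z, " ", b, " ", s);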

Variable declaration
three kinds of variables: param, const, var
- param : compile-time constant
- const : run-time constant (initialized and never assigned again)
- var : general variable (assigned an arbitrary number of times)
you are advised to make constantness explicit
types are automatically inferred from the initializing expression:
  param x = 2;
  const r = rs.getnext();
  var s = 0.0;
types can/should be given explicitly when necessary:
  param x : real = 2;
  var s : string;

Procedure definition
begins with the proc keyword; the return type is automatically inferred from the returned expressions:
  proc f(x : int) { return x + 1; }
it can/should be specified explicitly when necessary:
  proc g(x : int) : real { return x + 1; }
in particular, it is mandatory for recursive procedures:
  proc fib(n : int) : int {
    if (n < 2) then return 1;
    else return fib(n-1) + fib(n-2);
  }

For loop
simplest examples of for loops:
  for i in 1..n { ... }
  var A : [1..n] real;
  for i in A.domain { A[i] = i; }
a similar syntax is used for parallel loops (coforall for task-parallel loops and forall for data-parallel loops); more about loops later

Overview of task parallelism in Chapel
- begin (§24.2) creates a new task (cf. TBB's task_group.run)
- synchronization variables (§24.3) can be used for synchronization (cf. TBB's task_group.wait)
- cobegin (§24.4), coforall (§24.5), and sync (§24.6) are higher-level constructs built on top of the two

begin statement and synchronization variables
example:
  proc fib(n : int) : int {
    if (n < 2) then return 1;
    else {
      var a : single int;
      begin { a = fib(n-1); }
      const b = fib(n-2);
      return a + b;
    }
  }
- the begin statement creates a new task executing the given statement
- variables are shared between the parent and the new task
- reading a synchronization variable blocks until it has been written
(figure: the parent task reads a in "a + b" while the begun task writes it)

Synchronization variables (§24.3)
there are two kinds of synchronization variables (sync and single):
- single : written once, read many times
- sync : a bounded buffer of capacity one
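
A hedged sketch (not from the slides) contrasting the two kinds. For single, every read sees the one value written; for sync, each read empties the variable and each write fills it, so reads and writes alternate like a one-slot buffer.
  var s1 : single int;
  begin { s1 = 42; }           // producer task
  writeln(s1 + s1);            // both reads block until s1 is written, then see 42

  var s2 : sync int;
  begin { s2 = 1; s2 = 2; }    // the second write blocks until the first value is consumed
  writeln(s2);                 // first read prints 1 and empties s2
  writeln(s2);                 // second read prints 2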

cobegin, sync
cobegin { A1, ..., An } will begin each of A1, ..., An and wait for them to finish; e.g.
  var a, b : int;
  cobegin {
    a = fib(n - 1);
    b = fib(n - 2);
  }
sync { ... } will wait for all tasks begun inside it to finish; e.g.
  var primes : [2..100] bool;
  sync {
    for i in 2..100 {
      begin { primes[i] = is_prime(i); }
    }
  }

coforall
coforall var in ... { ... } will begin each iteration of the loop; e.g.
  coforall i in 2..100 {
    primes[i] = is_prime(i);
  }

Chapel and distributed memory basics
Chapel does not automatically distribute tasks:
- tasks are created on the local node
- they do not automatically migrate to other nodes
Chapel does not automatically distribute data either:
- variables are allocated on the local node
- objects and arrays are allocated on the local node
it is (ultimately) the programmer who distributes tasks/data across nodes

Locale
Chapel abstracts a compute node as a locale object
- Locales is an array of all participating nodes
- the on statement explicitly moves a task: "on (locale) statement" executes statement on locale
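
A common idiom (not shown on this slide, but built only from the constructs above) is to start one task per locale with coforall and move each task to its own locale with on:
  coforall loc in Locales {
    on loc {
      writeln("hello from locale ", here.id, " (", here.name, ")");
    }
  }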

Inter-node communication in Chapel
it happens as a result of:
- the on statement (RPC style)
- accessing remote non-const/param variables:
    var a = 0;
    on (Locales[1]) { a = a + 1; }
- referencing remote object fields:
    class foo { var x : int; }
    var f : foo;
    on (Locales[1]) { f = new foo(); }
    f.x = f.x + 1;
Note: objects are assigned by reference; similar for arrays, but things are more complex due to value-assignment semantics (later)

Querying locales
useful to understand what's going on...
- here is the locale you are currently executing on
- expression.locale gives the locale hosting that location (variable, array element, etc.)
- locale.id is an integer identifier of the locale
- locale.name is a symbolic name (hostname) of the locale
  var a : int;
  on (Locales[1]) {
    writeln("accessing a at locale ", a.locale.id,
            " from locale ", here.id, " (", here.name, ")");
  }

A quick latency/overhead test
  var a = 0;
  on (Locales[1]) {
    for i in 1..n { a = i; }
  }
  stack           network         round-trip latency
  UDP             10G Ethernet    108979 ns
  OpenMPI         Infiniband      35931 ns (?)
  OpenFabric      Infiniband      3740 ns
  within a node                   3 ns

forall : data-parallel for loop
forall var in ... { ... } will partition iterations among tasks; e.g.
  forall i in 2..100 {
    primes[i] = is_prime(i);
  }

coforall and forall
coforall creates a task for each iteration
- not suitable for fine-grained iterations; may simply not run with some tasking layers (fifo)
- you may synchronize between tasks
forall partitions iterations into (a small number of) tasks
- OK for fine-grained iterations
- iterations may be serialized in an arbitrary way; synchronization between iterations is not allowed (it may lead to deadlocks)
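
A hedged sketch (not from the slides) of why synchronization between iterations is only safe with coforall: iteration 1 waits for a value produced by iteration 2. With coforall each iteration gets its own task, so this completes; under forall the two iterations may be serialized onto one task and the program can deadlock.
  var v : sync int;
  coforall i in 1..2 {
    if i == 1 {
      writeln("got ", v);   // blocks until v is written by the other iteration
    } else {
      v = 42;
    }
  }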

More about loops
  for x in ... { statements }
three kinds of loops:
- for : serial
- forall : data parallel
- coforall : task parallel (one task per iteration)
"..." can take various things, including:
- a range : 1..n
- a domain : {1..n, 1..m}
- an array : [ 1, 2, 3, 4 ]
they are all first-class entities
most generally, it can take an iterator: a function-like object defined by iter instead of proc, or any object implementing a these() method (iterator)
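
A small sketch (not on the slide) driving the same serial for loop with a range, a domain, and an array literal:
  for i in 1..3 do write(i, " ");                    // range
  for (i, j) in {1..2, 1..2} do write((i, j), " ");  // 2D domain yields index tuples
  for x in [10, 20, 30] do write(x, " ");            // array literal
  writeln();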

Iterator
the syntax is similar to a procedure definition (proc); it may call yield to generate the next value
a trivial example:
  iter gen_prime() {
    yield 2; yield 3; yield 5; yield 7;
  }
  for x in gen_prime() { write(x, " "); }
  writeln();
output: 2 3 5 7

Slightly more useful iterator
  iter gen_fib(n : int) {
    var a : int = 1, b : int = 1, c : int;
    yield a;   /* fib 0 */
    yield b;   /* fib 1 */
    for i in 2..n {
      c = a + b; a = b; b = c;
      yield c;
    }
  }
  for x in gen_fib(10) { write(x, " "); }
  writeln();
output: 1 1 2 3 5 8 13 21 34 55 89

Distributed arrays and loops
we've so far covered:
- task creation within a node via begin etc.
- data parallelism on top of task creation via forall (still within a node)
- explicit migration of tasks via on
- remote references to objects
not yet covered: parallelism over distributed memory
- distributed arrays (how do we distribute an array over multiple locales?)
- distributed parallel loops (e.g. how do we distribute the iterations of a loop over distributed arrays, without using on clauses every time?)

Chapel design goals around distributed arrays/loops
build and abstract them within Chapel, on top of the low-level machinery of locales and within-node parallelism:
1. users are able to write something as simple as
     for x in A { ... }
   for distributed arrays, with execution automagically distributed
2. the distribution of elements/iterations to locales is implementable within Chapel

Ranges, domains, and arrays
- range : an interval
- domain : a multidimensional rectangular region (a set of multidimensional indexes)
- array : a mapping from indexes in a domain to values
both ranges and domains are first-class data
  const r : range = 1..n;

Distributed domains and arrays (§27)
- a domain can be distributed (dmapped)
- an array whose domain is dmapped is a distributed array

Distributed domains/arrays example (§33)
  const r = 1..n;                   // range literal (includes n!)
  const d1 = {1..n, 1..n};          // domain literal
  const d2 = {r, r};                // a range is first class
  const a : [1..n, 1..n] real;      // an array specifies a domain and the value type
  const b : [d2] real;              // a domain is first class too
  const c = [ 1.0, 2.0, 3.0 ];      // array literal
Note: type declarations were omitted where possible

Dmapped domains yield distributed execution
  use BlockDist;
  const blocked_dom = {0..9} dmapped Block({0..9});
  forall x in blocked_dom {
    write(x, ":", here.id, " ");
  }
  writeln();
executed with 2 locales:
  0:0 1:0 2:0 3:0 4:0 5:1 6:1 7:1 8:1 9:1
- this is implemented in an iterator (the these() method) of the Block distribution class (presumably using task parallelism and on)
- the Chapel compiler does not have any built-in policy about where to execute particular iterations

Distributed arrays too
  use BlockDist;
  const blocked_dom = {0..9} dmapped Block({0..9});
  var A : [blocked_dom] real;
  forall i in A.domain {
    writeln(i, ": executing at ", here.id, ", accessing ", A[i].locale.id);
    A[i] = i;
  }
  forall a in A { a = ...; }

Other distributions
cyclic:
  use Cyclic;
  const cyclic_dom = {0..9} dmapped Cyclic(startIdx=1);
block-cyclic:
  const block_cyclic_dom = {0..9} dmapped BlockCyclic(startIdx=1, blocksize=3);
all are flexible enough to accommodate:
- multidimensional domains
- distributing to a subset of all locales
you are able to define your own distribution (I haven't mastered it yet)

Calling external C functions (§31)
Chapel is designed to make it easy to call external C functions
all you need to do to call C's system function getpid() is:
  extern proc getpid() : int(32);
it is as straightforward as this to call many system-supplied functions

Calling external C functions (§31)
want to call a C function you wrote?
1. write a C file containing the function (func.c):
     int foo(int x) { return x + 1; }
2. write a corresponding header file containing its declaration (func.h):
     int foo(int x);
3. write this in your Chapel program (as you did for getpid()):
     extern proc foo(x : int(32)) : int(32);
4. include all files on the command line:
     $ chpl func.h func.c program.chpl

Configuration variable (§8.5)
writing a program that takes command line options is very straightforward and scalable
1. define your global variable as config:
     config const n = 10;   // 10 is the default
2. run your program with -svar=value:
     $ ./a.out -nl 2 -sn=1000

Appendix
a detailed note for those who work on Chapel; now under construction, stay tuned

Chapel configurations: summary
a Chapel installation can choose:
- a tasking layer implementation
- a communication layer implementation and the underlying GASNet conduit
you must build Chapel (i.e. run make) for each combination, but you can keep all of them in a single build tree
you choose the configuration to use when you compile your Chapel program into an executable, through environment variables

Tasking layer choices
see chapel-1.6.0/doc/readme.tasks for the full list
what we have experience with:
- fifo : the default
- massivethreads : U-Tokyo's MassiveThreads library
- qthreads : Sandia National Laboratories' Qthreads library
you choose one of them through the environment variable CHPL_TASKS, both when you build Chapel and when you compile your Chapel program

Communication layer choices
Chapel currently uses the GASNet library, which in turn lets us choose an underlying communication substrate from several; see chapel-1.6.0/doc/readme.multilocale for an overview
what we have experience with:
- udp : UDP sockets provided by the OS (portable but slow)
- mpi : MPI; see chapel-1.6.0/third-party/gasnet/gasnet-1.18.2/mpi-conduit/ for further details
- ibv : InfiniBand; see chapel-1.6.0/third-party/gasnet/gasnet-1.18.2/vapi-conduit/ for further details
you choose one of them through the environment variable CHPL_COMM_SUBSTRATE, both when you build Chapel and when you compile your Chapel program
you also set CHPL_COMM=gasnet, which is common to all substrates

Building Chapel
so the basic procedure is to run make with the three environment variables:
  CHPL_COMM=gasnet
  CHPL_COMM_SUBSTRATE={udp,mpi,ibv}
  CHPL_TASKS={fifo,massivethreads,qthreads}
to use massivethreads or qthreads, you must
  cd third-party; make massivethreads   (or qthreads)
before compiling Chapel
when using MPI, you might want to set MPI_CC and MPIRUN_CMD to point to your MPI compiler and launch command
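
A concrete end-to-end sketch of the procedure above, assuming a bash shell, the udp substrate, and the qthreads tasking layer (the hostnames are illustrative only):
  # inside the Chapel source tree, build the runtime for this combination
  $ export CHPL_COMM=gasnet CHPL_COMM_SUBSTRATE=udp CHPL_TASKS=qthreads
  $ (cd third-party && make qthreads)   # build the bundled tasking library first
  $ make                                # build Chapel itself for this configuration
  # compile and run a program with the same settings
  $ chpl program.chpl
  $ SSH_SERVERS="node000 node001" ./a.out -nl 2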