
Programming in Chapel
Kenjiro Taura, University of Tokyo

Contents
1. Chapel
   - Chapel overview
   - Minimum introduction to syntax
   - Task parallelism
   - Locales
   - Data parallel constructs
   - Ranges, domains, and arrays
   - Other nice things about Chapel

Chapel: brief history
- 2003: started as a DARPA-funded project under the HPCS (High Productivity Computing Systems) program
  (DARPA: the Defense Advanced Research Projects Agency)
- 2008: first public release

References
- http://chapel.cray.com/
- section numbers below refer to those in the Chapel Specification (version 0.92):
  http://chapel.cray.com/spec/spec-0.92.pdf
- tutorials:
  - concise, cut-and-pastable: http://faculty.knox.edu/dbunde/teaching/chapel/
  - extensive: http://chapel.cray.com/tutorials/sc11/SC11-Chapel.tar.gz
- cheat sheet: http://chapel.cray.com/spec/quickreference.pdf

This tutorial uses the implementation from Cray (http://chapel.cray.com/), version 1.6.0 (the newest release as of November 2012).

Compiling and running Chapel programs
compile with the chpl command, with the following environment variables set:
  CHPL_COMM, CHPL_TASKS, CHPL_COMM_SUBSTRATE
e.g. with bash:
  $ CHPL_COMM=gasnet CHPL_COMM_SUBSTRATE=udp CHPL_TASKS=fifo chpl program.chpl
run the executable giving the number of nodes (locales) with -nl; the exact command line depends on the choice of CHPL_COMM_SUBSTRATE. e.g. with udp:
  $ ./a.out -nl 1
  $ SSH_SERVERS="oooka000 oooka001" ./a.out -nl 2
see Appendix 2 for more details

Hello world in Chapel
  proc main() {
    writeln("hello");
  }
- proc introduces a function
- a function called main is the entry point
- writeln is a versatile print-with-newline function

Chapel programming model basics
- there is only one main thread (or "task") at the start, as opposed to the SPMD model of MPI/UPC
- a task can create another task at an arbitrary point (task parallelism) (§24):
  begin, sync variables, sync, cobegin, coforall
- a node is represented as a locale object and can be used to specify task/data distribution (§26):
  Locales, on
- objects and arrays can be remotely referenced and shared (global address space)
- higher-level data parallel constructs are built on top of them (§25): forall
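
To fix the picture, here is a minimal sketch (not from the slides) combining the three ingredients above: a task created with begin, moved to another locale with on, and a data-parallel forall loop. It assumes the program runs with at least two locales.
  var done : single bool;
  begin {                       // create a new task ...
    on Locales[1] {             // ... and run its body on locale 1
      writeln("hello from locale ", here.id);
    }
    done = true;                // signal completion
  }
  var A : [1..10] int;
  forall i in 1..10 {           // data-parallel loop executed by the main task
    A[i] = i;
  }
  const ok = done;              // reading the single variable blocks until it is written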

Chapel and other languages
             distributed memory   global address   arbitrary nested
             support              space            parallelism
  OpenMP     n                    N/A              n
  TBB        n                    N/A              y
  MPI        y                    n                n
  UPC        y                    y                n
  Chapel     y                    y                y

Primitive types
- int(8), int(16), int(32), int(64); similarly for uint; int = int(64); uint = uint(64)
- real(32), real(64); real = real(64)
- complex(64), complex(128); complex = complex(128)
- bool (true/false)
- string
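
A small sketch (not on the slide) declaring one variable of each of the types above; the literal values are arbitrary.
  var i8 : int(8)   = 100;
  var u  : uint     = 42;          // uint = uint(64)
  var x  : real(32) = 1.5;
  var z  : complex  = 1.0 + 2.0i;  // complex = complex(128)
  var b  : bool     = true;
  var s  : string   = "chapel";
  writeln(i8, " ", u, " ", x, " ", z, " ", b, " ", s);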

Variable declaration
three kinds of variables: param, const, var
- param : compile-time constant
- const : run-time constant (initialized and never assigned again)
- var : general variable (assigned an arbitrary number of times)
you are advised to make constantness explicit
types are automatically inferred from the initializing expression:
  param x = 2;
  const r = rs.getnext();
  var s = 0.0;
types can/should be given explicitly when necessary:
  param x : real = 2;
  var s : string;

Procedure definition
begins with the proc keyword; the return type is automatically inferred from the returned expressions:
  proc f(x : int) { return x + 1; }
it can/should be specified explicitly when necessary:
  proc g(x : int) : real { return x + 1; }
in particular, it is mandatory for recursive procedures:
  proc fib(n : int) : int {
    if (n < 2) then return 1;
    else return fib(n-1) + fib(n-2);
  }

For loop
simplest examples of for loops:
  for i in 1..n { ... }
  var A : [1..n] real;
  for i in A.domain { A[i] = i; }
a similar syntax is used for parallel loops (coforall for task-parallel loops and forall for data-parallel loops); more about loops later

Overview of task parallelism in Chapel
- begin (§24.2) creates a new task (cf. TBB's task_group.run)
- synchronization variables (§24.3) can be used for synchronization (cf. TBB's task_group.wait)
- cobegin (§24.4), coforall (§24.5), and sync (§24.6) are higher-level constructs built on top of the two

begin statement and synchronization variables
example:
  proc fib(n : int) : int {
    if (n < 2) then return 1;
    else {
      var a : single int;
      begin { a = fib(n-1); }
      const b = fib(n-2);
      return a + b;
    }
  }
- the begin statement creates a new task executing the given statement
- variables are shared between the parent and the new task
- reading a synchronization variable blocks until it has been written
(figure: the parent task reads a in "a + b" while the begun task writes it)

Synchronization variables (§24.3)
there are two kinds of synchronization variables (sync and single):
- single : written once, read many times
- sync : a bounded buffer of capacity one
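
A hedged sketch (not from the slides) contrasting the two kinds. For single, every read sees the one value written; for sync, each read empties the variable and each write fills it, so reads and writes alternate like a one-slot buffer.
  var s1 : single int;
  begin { s1 = 42; }           // producer task
  writeln(s1 + s1);            // both reads block until s1 is written, then see 42

  var s2 : sync int;
  begin { s2 = 1; s2 = 2; }    // the second write blocks until the first value is consumed
  writeln(s2);                 // first read prints 1 and empties s2
  writeln(s2);                 // second read prints 2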

cobegin, sync
cobegin { A1, ..., An } will begin each of A1, ..., An and wait for them to finish; e.g.
  var a, b : int;
  cobegin {
    a = fib(n - 1);
    b = fib(n - 2);
  }
sync { ... } will wait for all tasks begun inside it to finish; e.g.
  var primes : [2..100] bool;
  sync {
    for i in 2..100 {
      begin { primes[i] = is_prime(i); }
    }
  }

coforall
coforall var in ... { ... } will begin each iteration of the loop; e.g.
  coforall i in 2..100 {
    primes[i] = is_prime(i);
  }

Chapel and distributed memory basics
Chapel does not automatically distribute tasks:
- tasks are created on the local node
- they do not automatically migrate to other nodes
Chapel does not automatically distribute data either:
- variables are allocated on the local node
- objects and arrays are allocated on the local node
it is (ultimately) the programmer who distributes tasks/data across nodes

Locale
Chapel abstracts a compute node as a locale object
- Locales is an array of all participating nodes
- the on statement explicitly moves a task: "on (locale) statement" executes statement on locale
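
A common idiom (not shown on this slide, but built only from the constructs above) is to start one task per locale with coforall and move each task to its own locale with on:
  coforall loc in Locales {
    on loc {
      writeln("hello from locale ", here.id, " (", here.name, ")");
    }
  }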

Inter-node communication in Chapel
it happens as a result of:
- the on statement (RPC style)
- accessing remote non-const/param variables:
    var a = 0;
    on (Locales[1]) { a = a + 1; }
- referencing remote object fields:
    class foo { var x : int; }
    var f : foo;
    on (Locales[1]) { f = new foo(); }
    f.x = f.x + 1;
Note: objects are assigned by reference; similar for arrays, but things are more complex due to value-assignment semantics (later)

Querying locales
useful to understand what's going on...
- here is the locale you are currently executing on
- expression.locale gives the locale hosting that location (variable, array element, etc.)
- locale.id is an integer identifier of the locale
- locale.name is a symbolic name (hostname) of the locale
  var a : int;
  on (Locales[1]) {
    writeln("accessing a at locale ", a.locale.id,
            " from locale ", here.id, " (", here.name, ")");
  }

A quick latency/overhead test
  var a = 0;
  on (Locales[1]) {
    for i in 1..n { a = i; }
  }
  stack           network         round-trip latency
  UDP             10G Ethernet    108979 ns
  OpenMPI         Infiniband      35931 ns (?)
  OpenFabric      Infiniband      3740 ns
  within a node                   3 ns

forall : data-parallel for loop
forall var in ... { ... } will partition iterations among tasks; e.g.
  forall i in 2..100 {
    primes[i] = is_prime(i);
  }

coforall and forall
coforall creates a task for each iteration
- not suitable for fine-grained iterations; may simply not run with some tasking layers (fifo)
- you may synchronize between tasks
forall partitions iterations into (a small number of) tasks
- OK for fine-grained iterations
- iterations may be serialized in an arbitrary way; synchronization between iterations is not allowed (it may lead to deadlocks)
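
A hedged sketch (not from the slides) of why synchronization between iterations is only safe with coforall: iteration 1 waits for a value produced by iteration 2. With coforall each iteration gets its own task, so this completes; under forall the two iterations may be serialized onto one task and the program can deadlock.
  var v : sync int;
  coforall i in 1..2 {
    if i == 1 {
      writeln("got ", v);   // blocks until v is written by the other iteration
    } else {
      v = 42;
    }
  }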

More about loops
  for x in ... { statements }
three kinds of loops:
- for : serial
- forall : data parallel
- coforall : task parallel (one task per iteration)
"..." can take various things, including:
- a range : 1..n
- a domain : {1..n, 1..m}
- an array : [ 1, 2, 3, 4 ]
they are all first-class entities
most generally, it can take an iterator: a function-like object defined by iter instead of proc, or any object implementing a these() method (iterator)
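
A small sketch (not on the slide) driving the same serial for loop with a range, a domain, and an array literal:
  for i in 1..3 do write(i, " ");                    // range
  for (i, j) in {1..2, 1..2} do write((i, j), " ");  // 2D domain yields index tuples
  for x in [10, 20, 30] do write(x, " ");            // array literal
  writeln();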

Iterator
the syntax is similar to a procedure definition (proc); it may call yield to generate the next value
a trivial example:
  iter gen_prime() {
    yield 2; yield 3; yield 5; yield 7;
  }
  for x in gen_prime() { write(x, " "); }
  writeln();
output: 2 3 5 7

Slightly more useful iterator
  iter gen_fib(n : int) {
    var a : int = 1, b : int = 1, c : int;
    yield a;   /* fib 0 */
    yield b;   /* fib 1 */
    for i in 2..n {
      c = a + b; a = b; b = c;
      yield c;
    }
  }
  for x in gen_fib(10) { write(x, " "); }
  writeln();
output: 1 1 2 3 5 8 13 21 34 55 89

Distributed arrays and loops
we've so far covered:
- task creation within a node via begin etc.
- data parallelism on top of task creation via forall (still within a node)
- explicit migration of tasks via on
- remote references to objects
not yet covered: parallelism over distributed memory
- distributed arrays (how do we distribute an array over multiple locales?)
- distributed parallel loops (e.g. how do we distribute the iterations of a loop over distributed arrays, without using on clauses every time?)

Chapel design goals around distributed arrays/loops
build and abstract them within Chapel, on top of the low-level machinery of locales and within-node parallelism:
1. users are able to write something as simple as
     for x in A { ... }
   for distributed arrays, with execution automagically distributed
2. the distribution of elements/iterations to locales is implementable within Chapel

Ranges, domains, and arrays
- range : an interval
- domain : a multidimensional rectangular region (a set of multidimensional indexes)
- array : a mapping from indexes in a domain to values
both ranges and domains are first-class data
  const r : range = 1..n;

Distributed domains and arrays (§27)
- a domain can be distributed (dmapped)
- an array whose domain is dmapped is a distributed array

Distributed domains/arrays example (§33)
  const r = 1..n;                   // range literal (includes n!)
  const d1 = {1..n, 1..n};          // domain literal
  const d2 = {r, r};                // a range is first class
  const a : [1..n, 1..n] real;      // an array specifies a domain and the value type
  const b : [d2] real;              // a domain is first class too
  const c = [ 1.0, 2.0, 3.0 ];      // array literal
Note: type declarations were omitted where possible

Dmapped domains yield distributed execution
  use BlockDist;
  const blocked_dom = {0..9} dmapped Block({0..9});
  forall x in blocked_dom {
    write(x, ":", here.id, " ");
  }
  writeln();
executed with 2 locales:
  0:0 1:0 2:0 3:0 4:0 5:1 6:1 7:1 8:1 9:1
- this is implemented in an iterator (the these() method) of the Block distribution class (presumably using task parallelism and on)
- the Chapel compiler does not have any built-in policy about where to execute particular iterations

Distributed arrays too
  use BlockDist;
  const blocked_dom = {0..9} dmapped Block({0..9});
  var A : [blocked_dom] real;
  forall i in A.domain {
    writeln(i, ": executing at ", here.id, ", accessing ", A[i].locale.id);
    A[i] = i;
  }
  forall a in A { a = ...; }

Other distributions
cyclic:
  use Cyclic;
  const cyclic_dom = {0..9} dmapped Cyclic(startIdx=1);
block-cyclic:
  const block_cyclic_dom = {0..9} dmapped BlockCyclic(startIdx=1, blocksize=3);
all are flexible enough to accommodate:
- multidimensional domains
- distributing to a subset of all locales
you are able to define your own distribution (I haven't mastered it yet)

Calling external C functions (§31)
Chapel is designed to make it easy to call external C functions
all you need to do to call C's system function getpid() is:
  extern proc getpid() : int(32);
it is as straightforward as this to call many system-supplied functions

Calling external C functions (§31)
want to call a C function you wrote?
1. write a C file containing the function (func.c):
     int foo(int x) { return x + 1; }
2. write a corresponding header file containing its declaration (func.h):
     int foo(int x);
3. write this in your Chapel program (as you did for getpid()):
     extern proc foo(x : int(32)) : int(32);
4. include all files on the command line:
     $ chpl func.h func.c program.chpl

Configuration variable (§8.5)
writing a program that takes command line options is very straightforward and scalable
1. define your global variable as config:
     config const n = 10;   // 10 is the default
2. run your program with -svar=value:
     $ ./a.out -nl 2 -sn=1000

Appendix
a detailed note for those who work on Chapel; now under construction, stay tuned

Chapel configurations: summary
a Chapel installation can choose:
- a tasking layer implementation
- a communication layer implementation and the underlying GASNet conduit
you must build Chapel (i.e. run make) for each combination, but you can keep all of them in a single build tree
you choose the configuration to use when you compile your Chapel program into an executable, through environment variables

Tasking layer choices
see chapel-1.6.0/doc/readme.tasks for the full list
what we have experience with:
- fifo : the default
- massivethreads : U-Tokyo's MassiveThreads library
- qthreads : Sandia National Laboratories' Qthreads library
you choose one of them through the environment variable CHPL_TASKS, both when you build Chapel and when you compile your Chapel program

Communication layer choices
Chapel currently uses the GASNet library, which in turn lets us choose an underlying communication substrate from several; see chapel-1.6.0/doc/readme.multilocale for an overview
what we have experience with:
- udp : UDP sockets provided by the OS (portable but slow)
- mpi : MPI; see chapel-1.6.0/third-party/gasnet/gasnet-1.18.2/mpi-conduit/ for further details
- ibv : InfiniBand; see chapel-1.6.0/third-party/gasnet/gasnet-1.18.2/vapi-conduit/ for further details
you choose one of them through the environment variable CHPL_COMM_SUBSTRATE, both when you build Chapel and when you compile your Chapel program
you also set CHPL_COMM=gasnet, which is common to all substrates

Building Chapel
so the basic procedure is to run make with the three environment variables:
  CHPL_COMM=gasnet
  CHPL_COMM_SUBSTRATE={udp,mpi,ibv}
  CHPL_TASKS={fifo,massivethreads,qthreads}
to use massivethreads or qthreads, you must
  cd third-party; make massivethreads   (or qthreads)
before compiling Chapel
when using MPI, you might want to set MPI_CC and MPIRUN_CMD to point to your MPI compiler and launch command
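
A concrete end-to-end sketch of the procedure above, assuming a bash shell, the udp substrate, and the qthreads tasking layer (the hostnames are illustrative only):
  # inside the Chapel source tree, build the runtime for this combination
  $ export CHPL_COMM=gasnet CHPL_COMM_SUBSTRATE=udp CHPL_TASKS=qthreads
  $ (cd third-party && make qthreads)   # build the bundled tasking library first
  $ make                                # build Chapel itself for this configuration
  # compile and run a program with the same settings
  $ chpl program.chpl
  $ SSH_SERVERS="node000 node001" ./a.out -nl 2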