Introduction to Parallel Programming

2012 Summer School on Concurrency, August 22-29, 2012, St. Petersburg, Russia
Introduction to Parallel Programming, Section 4
Victor Gergel, Professor, D.Sc.
Lobachevsky State University of Nizhni Novgorod (UNN)

Contents
- Distributed memory systems: MPI technology
- Shared memory systems: OpenMP technology
- The model of partitioned global address space (PGAS) and the Chapel language: language elements, task parallelism, extended capabilities for data processing, the concept of an executor (locale), data distribution management, examples

Approaches to parallel program development:
- Use of libraries for existing programming languages: MPI for C and Fortran on distributed memory systems
- Use of above-language means (directives, comments): OpenMP for C and Fortran on shared memory systems
- Extensions of existing programming languages, e.g. CAF, UPC
- Creation of new parallel programming languages, e.g. Chapel, X10, Fortress

Development of parallel programs for distributed memory systems: MPI
In computing systems with distributed memory, processors work independently of each other. To organize parallel computation, one needs to:
- distribute the computational load,
- organize informational interaction (data passing) between processors.
(Figure: processors, each with its own cache and operating memory, connected by data passing paths.)
A solution for both items is provided by MPI (the Message Passing Interface).

Development of parallel programs for distributed memory systems: MPI
Within MPI a single program is developed for the problem and is run simultaneously on all available processors. To arrange different computations on different processors, the program can:
- use different data on different processors,
- identify the processor on which it is being executed.
This way of organizing parallel computation is usually called "single program, multiple processes" (SPMP).

Development of parallel programs for distributed memory systems: MPI
MPI offers a rich set of data transfer operations:
- different modes of sending data,
- practically all of the main communication operations.
These capabilities are the strongest advantage of MPI (the name of the technology, Message Passing Interface, speaks for itself).

Development of parallel programs for distributed memory systems: MPI
What is MPI?
- MPI is the standard that message passing facilities must meet.
- MPI is also the software that provides message passing and meets the requirements of the MPI standard:
  - the software is arranged as a library of program modules (the MPI library),
  - the software is available for the most widely used algorithmic languages, C and Fortran.

Development of parallel programs for distributed memory systems: MPI
MPI advantages:
- MPI significantly mitigates the problem of parallel program portability.
- MPI helps increase the efficiency of parallel computations, since there are MPI library implementations for nearly all types of computing systems.
- MPI decreases the complexity of parallel program development:
  - most of the main data transfer operations are provided by the MPI standard,
  - many libraries of parallel methods are built on top of MPI.

MPI: main concepts and definitions
The notion of a parallel program. A parallel program in MPI is a set of simultaneously executed processes:
- each process of the parallel program is created from a copy of the same source code (the SPMP model),
- processes can run on different processors; at the same time, one processor can host several processes.
The source code is developed in C or Fortran using the MPI library. The number of processes and the number of processors to be used are defined by the MPI execution environment at program start. All processes of a program are enumerated; a process number is called the process rank.

MPI: main concepts and definitions
Four main concepts underlie the MPI technology:
- message transfer operations,
- data types in messages,
- communicators (groups of processes),
- virtual topologies.

MPI: main concepts and definitions
Data transfer operations. Message transfer operations are the foundation of MPI. MPI provides two kinds of functions:
- paired (point-to-point) operations between two processes,
- collective communication operations in which several processes cooperate simultaneously.
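For illustration only (this sketch is not part of the original slides): a minimal C program, assuming a standard MPI installation and at least two processes, that contrasts a paired MPI_Send/MPI_Recv exchange with a collective MPI_Bcast over MPI_COMM_WORLD.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char* argv[]) {
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Paired (point-to-point) exchange between processes 0 and 1
       (assumes the program is launched with at least two processes) */
    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process 1 received %d\n", value);
    }

    /* Collective operation: process 0 broadcasts value to every process */
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}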

MPI: main concepts and definitions
The communicator concept. A communicator in MPI is a specially created service object that joins a group of processes with a set of additional parameters (a context):
- paired data transfer operations are executed between processes belonging to the same communicator,
- collective operations are applied simultaneously to all processes of a communicator.
Specifying the communicator in use is mandatory for data transfer operations in MPI.

MPI: main concepts and definitions
The communicator concept (continued). New communicators can be created and existing ones destroyed during the computation. The same process can belong to several communicators. All processes of a parallel program are included in the default communicator, identified by MPI_COMM_WORLD. To arrange data transfer between processes from different groups, one needs to create a global communicator (an intercommunicator).
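For illustration only (not part of the original slides): a minimal C sketch, assuming the standard MPI API, that splits MPI_COMM_WORLD into two new communicators with MPI_Comm_split; the variable names are chosen for this example.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char* argv[]) {
    int worldRank, groupRank;
    MPI_Comm halfComm;    /* name chosen for the example */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &worldRank);

    /* color selects the group (even vs. odd world ranks);
       key (here worldRank) orders processes inside each new communicator */
    int color = worldRank % 2;
    MPI_Comm_split(MPI_COMM_WORLD, color, worldRank, &halfComm);

    MPI_Comm_rank(halfComm, &groupRank);
    printf("World rank %d has rank %d in group %d\n", worldRank, groupRank, color);

    MPI_Comm_free(&halfComm);
    MPI_Finalize();
    return 0;
}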

MPI: main concepts and definitions
Data types. The type of the transferred data must be indicated when specifying the data to be sent or received in MPI functions. MPI contains a large set of base data types that closely match the data types of the algorithmic languages C and Fortran. MPI also offers ways to construct new, derived data types for a more exact and concise specification of message contents.
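For illustration only (not part of the original slides): a minimal C sketch, assuming the standard MPI derived-datatype functions, that describes one column of a small row-major matrix with MPI_Type_vector so that the column can be transferred as a single unit.

#include <mpi.h>

#define ROWS 4
#define COLS 5

int main(int argc, char* argv[]) {
    double a[ROWS][COLS];
    MPI_Datatype columnType;    /* name chosen for the example */

    MPI_Init(&argc, &argv);

    /* ROWS blocks of 1 double each, separated by a stride of COLS elements:
       exactly one column of the row-major matrix a */
    MPI_Type_vector(ROWS, 1, COLS, MPI_DOUBLE, &columnType);
    MPI_Type_commit(&columnType);

    /* column 2 could now be sent in one call, e.g.
       MPI_Send(&a[0][2], 1, columnType, dest, tag, MPI_COMM_WORLD); */

    MPI_Type_free(&columnType);
    MPI_Finalize();
    return 0;
}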

MPI: main concepts and definitions
Virtual topologies. The logical topology of the communication lines between processes is a complete graph (regardless of the actual physical communication links between processors). MPI offers the capability of describing a set of processes as a grid of arbitrary dimension. Boundary processes of a grid can be declared adjacent, so torus-type structures can be defined on top of grids. MPI also has means of forming logical (virtual) topologies of any desired type.
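For illustration only (not part of the original slides): a minimal C sketch, assuming the standard MPI topology functions, that arranges processes into a 2x2 periodic grid (a torus) with MPI_Cart_create and queries each process's grid coordinates; the 2x2 shape is an assumption made for the example.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char* argv[]) {
    int rank, coords[2];
    int dims[2] = {2, 2};       /* assumed 2x2 grid, intended for 4 processes */
    int periods[2] = {1, 1};    /* wrap around in both dimensions: a torus */
    MPI_Comm gridComm;          /* name chosen for the example */

    MPI_Init(&argc, &argv);
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &gridComm);

    if (gridComm != MPI_COMM_NULL) {    /* extra processes are left out of the grid */
        MPI_Comm_rank(gridComm, &rank);
        MPI_Cart_coords(gridComm, rank, 2, coords);
        printf("Rank %d is at grid position (%d, %d)\n", rank, coords[0], coords[1]);
        MPI_Comm_free(&gridComm);
    }

    MPI_Finalize();
    return 0;
}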

MPI: Sample program

#include "mpi.h"
#include <stdio.h>

int main(int argc, char* argv[]) {
    int ProcNum, ProcRank, RecvRank;
    MPI_Status Status;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &ProcNum);
    MPI_Comm_rank(MPI_COMM_WORLD, &ProcRank);
    if ( ProcRank != 0 ) {
        // Actions for all processes except process 0
        MPI_Send(&ProcRank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }
    else {
        // Actions for process 0
        printf("\n Hello from process %3d", ProcRank);
        for ( int i = 1; i < ProcNum; i++ ) {
            MPI_Recv(&RecvRank, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &Status);
            printf("\n Hello from process %3d", RecvRank);
        }
    }
    // Job finishing
    MPI_Finalize();
    return 0;
}
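As a usage note (not from the original slides): with a typical MPI installation such a program is built with the implementation's wrapper compiler and launched with an explicit process count, for example mpicc hello.c -o hello followed by mpiexec -n 4 ./hello; the exact command names depend on the MPI implementation being used.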

MPI: additional information
1. Gergel, V.P. Theory and Practice of Parallel Computing. Moscow: Binom. Knowledge Laboratory, Internet University of Information Technologies, 2007. 424 pp. (In Russian)
2. Antonov, A.S. Parallel Programming Using MPI Technology: Tutorial. Moscow: MSU Publishing House, 2004. 71 pp. (In Russian)
3. Nemnyugin, S.A., Stesik, O.L. Parallel Programming for Multi-Processor Computer Systems. BHV-Petersburg, 2002. 400 pp.

Development of parallel programs for shared memory systems: OpenMP
The OpenMP interface is intended to be a parallel programming standard for multiprocessor systems with shared memory (SMP, ccNUMA, ...). In the general case, shared memory systems are described by the model of a parallel computer with random access to memory (parallel random-access machine, PRAM).
(Figure: several processors, each with its own cache, connected to a common operating memory.)

Development of parallel programs for shared memory systems: OpenMP
Fundamentals of the approach:
- To indicate opportunities for parallelization, the developer inserts directives or comments into the source code.
- The program's source code remains unchanged: a compiler may ignore the added indications and thus build a regular serial program.
- A compiler that supports OpenMP replaces the parallelism directives with additional code (usually in the form of calls to functions of a parallel runtime library).

Development of parallel programs for shared memory systems: OpenMP
Why this is effective: the data shared by the parallel processes reside in shared memory, so no message passing operations are needed to arrange the communication.

Development of parallel programs for shared memory systems: OpenMP
Principle of parallelism arrangement:
- use of threads (a common address space),
- fork-join parallelism.
(Figure: a main thread repeatedly forking teams of threads for parallel regions and joining them afterwards.)

Development of parallel programs for shared memory systems: OpenMP
Principle of parallelism arrangement:
- While executing ordinary code (outside parallel regions) the program is executed by one thread (the master thread).
- When a parallel directive is encountered, a team of threads is created for parallel execution of the computations.
- On leaving the region covered by the parallel directive, the threads are synchronized and all threads except the master are destroyed.
- Serial execution of the code then continues (until the next occurrence of a parallel directive).
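For illustration only (not part of the original slides): a minimal C sketch, assuming a compiler with OpenMP support enabled, that shows the fork-join pattern described above: a single master thread, a team of threads inside the parallel region, and serial execution after the implicit barrier.

#include <omp.h>
#include <stdio.h>

int main(void) {
    printf("Before the parallel region: one (master) thread\n");

    #pragma omp parallel    /* fork: a team of threads is created */
    {
        int id = omp_get_thread_num();
        int n  = omp_get_num_threads();
        printf("Hello from thread %d of %d\n", id, n);
    }                       /* join: implicit barrier, the team is dissolved */

    printf("After the parallel region: back to one thread\n");
    return 0;
}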

Development of parallel programs for shared memory systems: OpenMP
Advantages:
- Step-by-step (incremental) parallelization: serial programs can be parallelized gradually, without changing their structure.
- Single source code: there is no need to maintain separate serial and parallel variants of a program, since the directives are ignored by regular compilers (in the general case).
- Efficiency: the resources of shared memory systems are taken into account and exploited.
- Portability: support in the most widespread languages (C, Fortran) and platforms (Windows, Unix).

Development of parallel programs for shared memory systems: OpenMP
Example: sum of vector elements

#include <omp.h>
#include <stdio.h>
#define NMAX 1000

int main() {
    int i;
    float sum = 0;
    float a[NMAX];
    /* <data initialization> */
    /* reduction(+:sum) combines the per-thread partial sums */
    #pragma omp parallel for shared(a) private(i) reduction(+:sum)
    for (i = 0; i < NMAX; i++)
        sum += a[i];
    /* End of parallel region */
    printf("Sum of vector elements is equal to %f\n", sum);
    return 0;
}

Development of parallel programs for shared memory systems: OpenMP
Additional information:
1. Gergel, V.P. High Performance Computing for Multiprocessor Multicore Systems. Moscow: MSU Publishing House, 2010. (In Russian)
2. Antonov, A.S. Parallel Programming Using OpenMP Technology: Tutorial. Moscow: MSU Publishing House, 2009. 77 pp. (In Russian)
3. Chandra, R., Dagum, L., Kohr, D., Maydan, D., McDonald, J., and Menon, R. Parallel Programming in OpenMP. San Francisco, CA: Morgan Kaufmann Publishers, 2000.

The PGAS model
HPCS Program: High Productivity Computing Systems (DARPA)
Goal: a 10x productivity increase by 2010 (Efficiency = Performance + Programmability + Portability + Reliability)
Phase II: Cray, IBM, Sun (July 2003 - June 2006): analysis of existing architectures; three new programming languages (Chapel, X10, Fortress)
Phase III: Cray, IBM (July 2006 - 2010): implementation; work continued on all the proposed languages

The PGAS model
General overview of the HPCS program languages:
- New object-oriented languages that support a wide set of programming means, parallelism and safety.
- A global name space, multithreading and explicit mechanisms for working with locality are supported.
- Means for distributing arrays among computational nodes.
- Means for linking execution threads with the data those threads process.
- Both data parallelism and task parallelism are supported.

The PGAS model
General overview of the HPCS program languages: locality management
All three languages give the user access to virtual units of locality, called locales in Chapel, regions in Fortress and places in X10. Each execution of a program is bound to a set of such locality units, which are mapped by the operating system to physical entities such as computational nodes. This provides users with a mechanism for (1) distributing data collections among locality units, (2) balancing different data collections and (3) linking threads with the data they process.

The PGAS model
General overview of the HPCS program languages: data distributions
Fortress and X10 provide extensive libraries of built-in distributions. These languages also make it possible to create user-defined data distributions by decomposing an index space or by combining distributions in different dimensions. Chapel does not provide built-in distributions, but it does provide an extensive infrastructure that supports arbitrary user-defined data distributions, capable of managing sparse data.

The PGAS model
General overview of the HPCS program languages: multithreading (loop parallelization)
- Chapel distinguishes serial for loops and parallel forall loops, in which iterations over the elements of an index region are performed without ordering constraints. Users are responsible for avoiding dependencies that can lead to data races.
- In Fortress the for loop is parallel by default; if loop iterations run over a distributed dimension of an array, the iterations are grouped by processors according to the data distribution.
- X10 has two kinds of parallel loops: foreach, which is limited to a single locality unit, and ateach, which allows iterations to be performed over several locality units.

The PGAS model
General overview of the HPCS program languages:
- Chapel: http://chapel.cray.com/
- Fortress: http://fortress.sunsource.net/
- X10: http://x10-lang.org/

The PGAS model: Chapel
Preview: the "Hello, world" program

Fast prototyping:
writeln("Hello, world");

Structured programming:
def main() {
  writeln("Hello, world");
}

Application:
module HelloWorld {
  def main() {
    writeln("Hello, world");
  }
}

The PGAS model: Chapel
Language overview:
- Syntax: borrows features from C, C#, C++, Java, Ada, Perl, ...
- Semantics: imperative, block-structured, arrays; object-oriented programming (optional); static typing.

The PGAS model: Chapel
Language elements: ranges
Syntax:
range-expr: [low] .. [high] [by stride]
Semantics: a regular sequence of integer values
- stride > 0: low, low+stride, low+2*stride, ..., high
- stride < 0: high, high+stride, high+2*stride, ..., low
Examples:
1..6 by 2   // 1, 3, 5
1..6 by -1  // 6, 5, 4, 3, 2, 1
3.. by 3    // 3, 6, 9, 12, ...

The PGAS model: Chapel
Language elements: arrays
Syntax:
array-type: [ index-set-expr ] type
Semantics: an array of elements whose size is defined by a set of indices
Examples:
var A: [1..3] int,        // array of 3 elements
    B: [1..3, 1..5] real, // 2D array
    C: [1..3][1..5] real; // array of arrays

The PGAS model: Chapel
Language elements: loops
Syntax:
for-loop: for index-expr in iterator-expr { stmt-list }
Semantics: the loop body is evaluated on each iteration
Example:
var A: [1..3] string = ("DO", "RE", "MI");
for i in 1..3 do write(A(i));         // DOREMI
for a in A { a += "LA"; write(a); }   // DOLARELAMILA

The PGAS model: Chapel
Language elements: conditional and selection statements
Conditional expression:
if cond then computeA() else computeB();
while loop:
while cond { compute(); }
Select statement:
select key {
  when value1 do compute1();
  when value2 do compute2();
  otherwise compute3();
}

The PGAS model: Chapel
Language elements: functions
Example:
def area(radius: real) return 3.14 * radius**2;
writeln(area(2.0));    // 12.56
Example of a function with default argument values:
def writeCoord(x: real = 0.0, y: real = 0.0) {
  writeln("(", x, ", ", y, ")");
}
writeCoord(2.0);       // (2.0, 0.0)
writeCoord(y=2.0);     // (0.0, 2.0)

The PGAS model: Chapel
Task parallelism: the begin statement
Syntax:
begin-stmt: begin stmt
Semantics: creates a parallel task that executes stmt; execution of the parent task is not suspended
Example:
begin writeln("hello world");
writeln("good bye");
Possible outputs:
(1) hello world    (2) good bye
    good bye           hello world

The PGAS model: Chapel
Task parallelism: sync types
Syntax:
sync-type: sync type
Semantics:
- a variable of sync type has two states, "full" and "empty",
- writing to a sync variable changes its state to "full",
- reading from a sync variable changes its state to "empty".
Example 1 (a lock):
var lock$: sync bool;
lock$ = true;
critical();
lock$;
Example 2 (a future):
var future$: sync real;
begin future$ = compute();
computeSomethingElse();
useComputeResults(future$);

The PGAS model: Chapel
Task parallelism: the cobegin statement
Syntax:
cobegin-stmt: cobegin { stmt-list }
Semantics: spawns a parallel task for each statement in stmt-list; synchronization at the end of the block
Example:
cobegin {
  consumer(1);
  consumer(2);
  producer();
}

The PGAS model: Chapel
Task parallelism: the coforall loop
Syntax:
coforall-loop: coforall index-expr in iterator-expr { stmt }
Semantics: spawns a parallel task for every loop iteration; synchronization at the end of the loop
Example (producer-consumer):
begin producer();
coforall i in 1..numConsumers {
  consumer(i);
}

The PGAS model: Chapel
Task parallelism: the atomic statement
Syntax:
atomic-statement: atomic stmt
Semantics: stmt is executed as an atomic operation
Example:
atomic A(i) = A(i) + 1;

The PGAS model: Chapel
Domains: advanced ways to define index sets
var Dense: domain(2) = [1..10, 1..20],
    Strided: domain(2) = Dense by (2, 4),
    Sparse: subdomain(Dense) = genIndices(),
    Associative: domain(string) = readNames(),
    Opaque: domain(opaque);

The PGAS model: Chapel
Domain operations
forall (i,j) in Sparse {
  DenseArr(i,j) += SparseArr(i,j);
}

The PGAS model: Chapel
Data reduction: the reduce operation
Syntax:
reduce-expr: reduce-op reduce iterator-expr
Semantics: applies the reduce-op operation over all data elements
Examples:
total = + reduce A;
bigDiff = max reduce [i in InnerD] abs(A(i) - B(i));

The PGAS model: Chapel
Data reduction: the scan operation
Syntax:
scan-expr: scan-op scan iterator-expr
Semantics: applies the scan-op operation over all data elements, returning all partial results
Example:
var A, B, C: [1..5] int;
A = 1;             // A: 1 1 1 1 1
B = + scan A;      // B: 1 2 3 4 5
B(3) = -B(3);      // B: 1 2 -3 4 5
C = min scan B;    // C: 1 1 -3 -3 -3

The PGAS model: Chapel
Computing system model: the locale
Definition: an abstraction of a computing element; it contains a processing unit and memory (storage).
Properties:
- tasks running within a locale have uniform access to the local memory,
- accessing the memory of other locales has longer latency.
Examples of a locale: an SMP node, a multicore processor.

The PGAS model: Chapel
Computing system model
Declaration of the set of all locales:
config const numLocales: int;
const LocaleSpace: domain(1) = [0..numLocales-1];
const Locales: [LocaleSpace] locale;
Defining the locale set at program start:
prompt> a.out --numLocales=8
Defining a topology over the locale set:
var TaskALocs = Locales[0..1];
var TaskBLocs = Locales[2..numLocales-1];
var Grid2D = Locales.reshape([1..2, 1..4]);

The PGAS model: Chapel
Computing system model: locale operations
Get the locale index:                    def locale.id: int { ... }
Get the locale name:                     def locale.name: string { ... }
Get the number of cores on a locale:     def locale.numCores: int { ... }
Get the available memory on a locale:    def locale.physicalMemory(...) { ... }
Example: computing the total memory available in the system:
const totalSystemMemory = + reduce Locales.physicalMemory();

The PGAS model: Chapel
Computing system model: linking tasks and locales
Running tasks on a locale: the on statement
Syntax:
on-stmt: on expr { stmt }
Semantics: stmt is executed on the locale defined by expr
Example:
var A: [LocaleSpace] int;
coforall loc in Locales do
  on loc do
    A(loc.id) = compute(loc.id);

The PGAS model: Chapel
Computing system model: example
var x, y: real;        // x and y live on locale 0
on Locales(1) {        // migrate the task to locale 1
  var z: real;         // z lives on locale 1
  z = x + y;           // remote access to x and y
  on Locales(0) do     // return to locale 0
    z = x + y;         // remote access to z
                       // back on locale 1
  on x do              // move to the locale of x (locale 0)
    z = x + y;         // remote access to z
                       // back on locale 1
}                      // return to locale 0

The PGAS model: Chapel
Computing system model: data distribution
Distribution of domains (index sets) over a set of locales. Example:
const Dist = new dmap(new Cyclic(startIdx=(1,1)));
var Dom: domain(2) dmapped Dist = [1..4, 1..8];
The data are distributed over a grid of locales. There is a library of standard distributions, and it is possible to define new (user-defined) distributions.

The PGAS model: Chapel
Computing system model: data distribution
Distribution is performed in the same way for every domain type.

The PGAS model: Chapel
The NAS MG stencil in Fortran+MPI
(Figure: Fortran+MPI code listing, shown for comparison with the Chapel version below.)

The PGAS model: Chapel
The NAS MG stencil in Chapel
def rprj3(S, R) {
  const Stencil = [-1..1, -1..1, -1..1],
        W: [0..3] real = (0.5, 0.25, 0.125, 0.0625),
        W3D = [(i,j,k) in Stencil] W((i!=0) + (j!=0) + (k!=0));
  forall inds in S.domain do
    S(inds) = + reduce [offset in Stencil]
                (W3D(offset) * R(inds + offset*R.stride));
}

The PGAS model: Chapel
Algorithm for banded matrix multiplication
(Figure: performance results measured on an Intel Core 2 Duo T5250 @ 1.5 GHz with 2.00 GB RAM.)

Conclusion
Major approaches to developing parallel programs:
- use of libraries for existing programming languages,
- use of a preprocessor (directives, comments),
- extension of existing programming languages,
- creation of new parallel programming languages.
Major technologies: OpenMP for shared memory systems, MPI for distributed memory systems. A direction based on the model of a partitioned global address space (PGAS) is being actively developed.

Contacts:
Nizhny Novgorod State University
Department of Computational Mathematics and Cybernetics
Victor P. Gergel, gergel@unn.ru

Thank you for your attention! Any questions?