Parallel Programming in Chapel
LACSI 2006, October 18th, 2006
Steve Deitz, Chapel project, Cray Inc.

Why is Parallel Programming Hard?
  Partitioning of data across processors
  Partitioning of tasks across processors
  Inter-processor communication
  Load balancing
  Race conditions
  Deadlock
  And more...

How is it getting easier?
  Support for new global address space languages
    Co-Array Fortran
      Modest extension to Fortran 95 for SPMD parallelism
      Developed at Cray; accepted as part of the next Fortran standard
    Unified Parallel C (UPC)
      Small extension to C for SPMD parallelism
  Development of new high-level parallel languages
    Chapel
      Under development at Cray (part of HPCS)
      Builds on MTA C, HPF, and ZPL
      Provides modern language features (C#, Java, ...)
    Fortress and X10

Chapel's Context
  HPCS = High Productivity Computing Systems (a DARPA program)
  Overall goal: increase productivity for High-End Computing (HEC) users by 10x for the year 2010
  Productivity = Programmability + Performance + Portability + Robustness
  Result must be revolutionary, not evolutionary
  Marketable to users beyond the program sponsors
  Phase II competitors (July '03 - June '06): Cray, IBM, Sun

Outline
  Chapel Features
    Domains and Arrays
    Cobegin, Begin, and Synchronization Variables
  3-Point Stencil Case Study
    MPI
    Co-Array Fortran
    Unified Parallel C
    Chapel

Domains and Arrays
  Domain: an index set
    Specifies size and shape of arrays
    Supports sequential and parallel iteration
    Potentially decomposed across locales
  Three main classes:
    Arithmetic: indices are Cartesian tuples
      Rectilinear, multidimensional
      Optionally strided and/or sparse
    Indefinite: indices serve as hash keys
      Supports hash tables, associative arrays, dictionaries
    Opaque: indices are anonymous
      Supports sets, graph-based computations
  Fundamental Chapel concept for data parallelism
  A generalization of ZPL's region concept
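A minimal sketch of declaring one domain of each class, following this talk's syntax (the identifiers are illustrative, not from the slides; current Chapel has since revised some of this syntax, e.g. brace-delimited domain literals):

  // Illustrative sketch only; names are assumptions, not from the talk.
  var Grid: domain(2) = [1..4, 1..8];   // arithmetic: indices are Cartesian tuples
  var Names: domain(string);            // indefinite: indices serve as hash keys
  var Verts: domain(opaque);            // opaque: anonymous indices for sets and graphs

  var Temp: [Grid] float;               // an array declared over an arithmetic domain
  var Age: [Names] int;                 // an associative array keyed by strings
  Names += "deitz";                     // adding an index grows Age to cover it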

A Simple Domain Declaration
  var m: int = 4;
  var n: int = 8;
  var D: domain(2) = [1..m, 1..n];
  (figure: the 4 x 8 index set D)

A Simple Domain Declaration
  var m: int = 4;
  var n: int = 8;
  var D: domain(2) = [1..m, 1..n];
  var DInner: subdomain(D) = [2..m-1, 2..n-1];
  (figure: DInner as the interior indices of D)

Domain Usage
  Declaring arrays:
    var A, B: [D] float;
  Sub-array references:
    A(DInner) = B(DInner);
  Sequential iteration:
    for i,j in DInner { A(i,j) }
    or: for ij in DInner { A(ij) }
  Parallel iteration:
    forall ij in DInner { A(ij) }
    or: [ij in DInner] A(ij)
  Array reallocation:
    D = [1..2*m, 1..2*n];
  (figures: A and B laid out over D with DInner highlighted; reallocation grows both arrays)

Other Arithmetic Domains
  var StridedD: subdomain(D) = D by (2,3);
  var SparseD: sparse domain(D) = (/ (1,1), (4,1), (2,3), (4,2) /);
  (figures: the strided and sparse index sets within D)
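A brief hedged sketch of using these domains, assuming the declarations of D and A from the surrounding slides (the statements are illustrative, not from the talk):

  // Illustrative only; assumes D, A, StridedD, and SparseD as declared above.
  var S: [SparseD] float;       // storage only for the four listed indices
  forall ij in StridedD do      // visits every 2nd row and every 3rd column of D
    A(ij) = 0.0;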

Cobegin
  Executes statements in parallel
  Synchronizes at join point
    cobegin {
      leftsum = computeleft();
      rightsum = computeright();
    }
    sum = leftsum + rightsum;
  Composable with data parallel abstractions
    var D: domain(1) distributed(block) = [1..n];
    var A: [D] float;
    ...
    cobegin {
      [i in 1..n/2] computeleft(A(i));
      [i in n/2+1..n] computeright(A(i));
    }

Synchronization Variables
  Implements producer/consumer idioms
    cobegin {
      var s: sync int = computeproducer();
      computeconsumer(s);
    }
  Subsumes locks
    var lock: sync bool;
    lock = true;
    compute();
    lock;
  Implements futures

Begin
  Spawns a concurrent computation
  No synchronization
    var leftsum: sync int;
    begin leftsum = computeleft();
    rightsum = computeright();
    ...
    sum = leftsum + rightsum;
  Subsumes SPMD model
    for locale in Locales do
      begin mymain(locale);

More Chapel Features
  Locales and user-defined distributions
  Objects (classes, records, and unions)
  Generics and local type inference
  Tuples
  Sequences and iterators
  Reductions and scans (parallel prefix)
  Default arguments, name-based argument passing
  Function and operator overloading
  Modules (namespace management)
  Atomic transactions
  Automatic garbage collection
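Two of these features, reductions and scans, in a brief hedged sketch using the bracketed forall form shown earlier (the array and its values are illustrative, not from the talk):

  // Illustrative sketch only; names and values are assumptions.
  var Nums: [1..5] int;
  [i in 1..5] Nums(i) = i;        // Nums = 1, 2, 3, 4, 5
  var total = + reduce Nums;      // 15: combine all elements with +
  var most = max reduce Nums;     // 5
  var running = + scan Nums;      // 1, 3, 6, 10, 15: parallel prefix sums
  writeln(total, " ", most, " ", running);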

Outline
  Chapel Features
    Domains and Arrays
    Cobegin, Begin, and Synchronization Variables
  3-Point Stencil Case Study
    MPI
    Co-Array Fortran
    Unified Parallel C
    Chapel

3-Point Stencil Case Study
  Compute the counting numbers via iterative averaging
    Initial condition:  0  0     0     0     4
    Iteration 1:        0  0     0     2     4
    Iteration 2:        0  0     1     2     4
    Iteration 3:        0  0.5   1     2.5   4
    Iteration 4:        0  0.5   1.5   2.5   4
    Iteration 5:        0  0.75  1.5   2.75  4
    Iteration 6:        0  0.75  1.75  2.75  4
    ...
    Iteration K:        0  1     2     3     4
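The per-language code on the following slides gives the pieces of this computation step by step; as a hedged sketch, the Chapel fragments could be tied together by a driver loop along these lines (the Tmp declaration, epsilon, and the copy-back step are assumptions not shown on the slides):

  // Illustrative driver only; assumes D, InnerD, and A as declared on the
  // "Array Allocation" slide. Tmp and epsilon are assumptions.
  var Tmp: [D] float;
  const epsilon = 0.000001;
  var delta = 1.0;
  var iters = 0;
  while (delta > epsilon) {
    forall i in InnerD do                            // stencil step
      Tmp(i) = (A(i-1) + A(i+1)) / 2.0;
    delta = + reduce abs(A(InnerD) - Tmp(InnerD));   // convergence test
    A(InnerD) = Tmp(InnerD);                         // copy new values back (assumed)
    iters += 1;
  }
  writeln("Iterations: ", iters);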

Array Allocation
  C + MPI:
    my_n = n*(me+1)/p - n*me/p;
    A = (double *)malloc((my_n+2)*sizeof(double));
  Co-Array Fortran:
    MY_N = N*ME/P - N*(ME-1)/P
    ALLOCATE(A(0:MY_N+1)[*])
    (fragmented view)
  Unified Parallel C:
    A = (shared double *)upc_all_alloc((n+2), sizeof(double));
  Chapel:
    var D: domain(1) distributed(block) = [0..n+1];
    var InnerD: subdomain(D) = [1..n];
    var A: [D] float;
    (global view)

Array Initialization
  C + MPI:
    for (i = 0; i <= my_n; i++)
      A[i] = 0.0;
    if (me == p - 1)
      A[my_n+1] = n + 1.0;
  Co-Array Fortran:
    A(:) = 0.0
    IF (ME .EQ. P) THEN
      A(MY_N+1) = N + 1.0
    ENDIF
  Unified Parallel C:
    upc_forall (i = 0; i <= n; i++; &A[i])
      A[i] = 0.0;
    if (MYTHREAD == THREADS-1)
      A[n+1] = n + 1.0;
  Chapel:
    A(0..n) = 0.0;
    A(n+1) = n + 1.0;

Stencil Computation
  C + MPI:
    if (me < p-1)
      MPI_Send(&(A[my_n]), 1, MPI_DOUBLE, me+1, 1, MPI_COMM_WORLD);
    if (me > 0)
      MPI_Recv(&(A[0]), 1, MPI_DOUBLE, me-1, 1, MPI_COMM_WORLD, &status);
    if (me > 0)
      MPI_Send(&(A[1]), 1, MPI_DOUBLE, me-1, 1, MPI_COMM_WORLD);
    if (me < p-1)
      MPI_Recv(&(A[my_n+1]), 1, MPI_DOUBLE, me+1, 1, MPI_COMM_WORLD, &status);
    for (i = 1; i <= my_n; i++)
      Tmp[i] = (A[i-1]+A[i+1])/2.0;
  Co-Array Fortran:
    CALL SYNC_ALL()
    IF (ME .LT. P) THEN
      A(MY_N+1) = A(1)[ME+1]
      A(0)[ME+1] = A(MY_N)
    ENDIF
    CALL SYNC_ALL()
    DO I = 1,MY_N
      TMP(I) = (A(I-1)+A(I+1))/2.0
    ENDDO
  Unified Parallel C:
    upc_barrier;
    upc_forall (i = 1; i <= n; i++; &A[i])
      Tmp[i] = (A[i-1]+A[i+1])/2.0;
  Chapel:
    forall i in InnerD do
      Tmp(i) = (A(i-1)+A(i+1))/2.0;

Reduction
  C + MPI:
    my_delta = 0.0;
    for (i = 1; i <= my_n; i++)
      my_delta += fabs(A[i]-Tmp[i]);
    MPI_Allreduce(&my_delta, &delta, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  Co-Array Fortran:
    MY_DELTA = 0.0
    DO I = 1,MY_N
      MY_DELTA = MY_DELTA + ABS(A(I)-TMP(I))
    ENDDO
    CALL SYNC_ALL()
    DELTA = 0.0
    DO I = 1,P
      DELTA = DELTA + MY_DELTA[I]
    ENDDO
  Unified Parallel C:
    my_delta = 0.0;
    upc_forall (i = 1; i <= n; i++; &A[i])
      my_delta += fabs(A[i]-Tmp[i]);
    upc_lock(lock);
    delta = delta + my_delta;
    upc_unlock(lock);
  Chapel:
    delta = + reduce abs(A(InnerD) - Tmp(InnerD));

Reduction (alternative Chapel formulation)
  C + MPI, Co-Array Fortran, and Unified Parallel C: as on the previous slide
  Chapel:
    delta = 0.0;
    forall i in InnerD do
      atomic delta += abs(A(i)-Tmp(i));

Output
  C + MPI:
    if (me == 0)
      printf("Iterations: %d\n", iters);
  Co-Array Fortran:
    IF (ME .EQ. 1) THEN
      WRITE(*,*) 'Iterations: ', ITERS
    ENDIF
  Unified Parallel C:
    if (MYTHREAD == 0)
      printf("Iterations: %d\n", iters);
  Chapel:
    writeln("Iterations: ", iters);

Case Study Summary
  Desirable features compared across C+MPI, CAF, UPC, and Chapel:
    Global view of computation
    Global view of data
    Data distributions
    Arbitrary data distributions
    Multi-dimensional arrays
    Syntactic execution model
    One-sided communication
    Support for reductions
    Simple mechanisms
  (table: the slide marks which of the four languages provides each feature)

Implementation Status
  Serial implementation
    Covers over 95% of the features in Chapel
    Prototype quality
  Non-distributed threaded implementation
    Implements sync variables, single variables, begin, and cobegin

For More Information
  http://chapel.cs.washington.edu/
  chapel_info@cray.com
  deitz@cray.com