Parallel Programming Languages. HPC Fall 2010 Prof. Robert van Engelen


Overview

- Partitioned Global Address Space (PGAS)
- A selection of PGAS parallel programming languages: CAF and UPC
- Further reading

Global Address Space (GAS)

Global address space languages:
- take advantage of the ease of programmability of shared memory
- support SPMD parallelism
- allow a local-global distinction of data, because data layout matters for performance

The global address space is logically shared, physically distributed:
- shared arrays are distributed over the processor memories
- communication for remote data access is implicit

(Figure: shared array elements X(1), X(2), ..., X(N) spread across Processor 1 through Processor N)

Partitioned Global Address Space (PGAS)

Global address space with a two-level model that supports locality management:
- local memory (private variables)
- remote memory (shared variables)

    shared int X[P];
    int *ptr = &X[1];
    int n = ...;

(Figure: the shared array X[0], X[1], ..., X[P-1] spans the shared part of the global address space, while each of Processor 1 through Processor P keeps its own private ptr and n with different values)

Partitioned Global Address Space (PGAS) Model

Global address space with a two-level memory model that supports locality management:
- local memory (private variables)
- remote memory (shared variables)

The programmer controls the critical decisions:
- data partitioning (by data placement in PGAS memory)
- communication (implicitly, via remote PGAS memory access)

Suitable for mapping to a range of parallel architectures: shared memory, message passing, and hybrid.

Languages: CAF (Fortran), UPC (C), Titanium (Java)

PGAS Model vs Implementation

PGAS is an abstract model; implementations differ with respect to details:
- How the address space is partitioned by processors:
  - physically, at the memory address level (= DSM, e.g. Cray T3D/E)
  - logically, at the variable level, where each variable can be placed arbitrarily in the local memory of a remote processor
- Whether remote memory is cached locally, and which coherence protocol is used
- Communication:
  - one-sided (e.g. DMA), which is usually faster
  - two-sided (e.g. MPI send/recv)
  - bulk memory copy operations or individual copies

Co-Array Fortran (CAF)

- Explicitly-parallel extension of Fortran 90/95
- Commercial compiler from Cray/SGI
- Open source compiler from Rice University
- Partitioned global address space: SPMD with a two-level model that supports locality management
  - local memory (private variables)
  - remote memory (shared variables)
- As usual, the programmer controls the critical decisions: data partitioning and communication

CAF: Co-Arrays

A co-array is an array extended with an image dimension:

    REAL, DIMENSION(N)[*] :: X,Y
    X[:] = Y[Q]

(Figure: Y(1..N) and X(1..N) exist on every image 1 through P; the assignment copies Y from image Q into X on all images)

CAF: Array Syntax and Implicit Remote Memory Operations

    REAL, DIMENSION(N)      :: X
    REAL, DIMENSION(N)[*]   :: Y
    REAL, DIMENSION(N,P)[*] :: Z

    X = Y[PE]          ! get from Y[PE]
    Y[PE] = X          ! put into Y[PE]
    Y[:] = X           ! broadcast X
    Y[LIST] = X        ! broadcast X over subset of PE's in array LIST
    Z(:) = Y[:]        ! all-gather, collect all Y
    S = MINVAL(Y[:])   ! min (reduce) all Y
    Z[:] = S           ! S scalar, promoted to array of shape (1:N,1:P)

CAF: Synchronization

    COMMON/XCTILB4/ B(N,4)[*]
    SAVE /XCTILB4/

    ME = THIS_IMAGE()
    IF (ME > 1 .AND. ME < NUM_IMAGES()) THEN
      CALL SYNC_ALL( WAIT=(/ME-1,ME+1/) )
      B(:,1) = B(:,3)[ME-1]
      B(:,4) = B(:,2)[ME+1]
      CALL SYNC_ALL( WAIT=(/ME-1,ME+1/) )
    ENDIF

Each image waits for the images on its left and right before and after exchanging boundary data.

(Figure: images 1 through P in a row, each synchronizing with its neighbors)

Unified Parallel C (UPC)

- UPC is an explicit parallel extension of ANSI C
- Commercial compilers from Cray/SGI and HP
- Open source compilers from LBNL/UCB/MTU/UF and the GCC-UPC project
- Follows the C language philosophy: programmers are clever and careful, and may need to work close to the hardware level to get performance, but can get into trouble; concise and efficient syntax
- UPC is a PGAS language:
  - global address space with private and shared variables
  - private/shared pointers to private/shared variables
  - array data distributions (block/cyclic)
  - forall worksharing loops
  - barriers and locks
  - bulk copy operations between shared and private memory
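
As a quick illustration of the SPMD model and the built-in MYTHREAD and THREADS identifiers listed above, here is a minimal UPC program (a sketch added for this transcript, not from the original slides):

    #include <upc.h>
    #include <stdio.h>

    int main(void) {
        /* Every thread executes main() (SPMD execution);
           MYTHREAD and THREADS are predefined by UPC. */
        printf("hello from thread %d of %d\n", MYTHREAD, THREADS);

        upc_barrier;    /* wait until every thread has reached this point */

        if (MYTHREAD == 0)
            printf("all %d threads have reported in\n", THREADS);
        return 0;
    }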

UPC: Shared Variables

- Variables are private by default: C variables and objects are allocated in the private memory space of each thread
- Shared variables are explicitly declared and allocated once (by thread 0)
- Shared variables must be globally declared (i.e. static)

    shared int ours;
    int mine;

(Figure: ours lives in the shared part of the global address space; each of Processor 1 through Processor P keeps its own private mine)
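
The private/shared distinction can be made concrete with a small sketch (added for this transcript, not from the slides): every thread has its own mine, while there is exactly one ours, which any thread may read or write.

    #include <upc.h>
    #include <stdio.h>

    shared int ours;    /* one instance, with affinity to thread 0 */
    int mine;           /* one private instance per thread */

    int main(void) {
        mine = MYTHREAD;              /* each thread updates its own copy */
        if (MYTHREAD == THREADS - 1)
            ours = mine;              /* one thread writes the shared variable */

        upc_barrier;                  /* make the write visible to all threads */

        printf("thread %d: mine = %d, ours = %d\n", MYTHREAD, mine, ours);
        return 0;
    }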

UPC: Simple Example - Monte Carlo pi Calculation

Randomly throw darts at (x, y) positions; if x^2 + y^2 <= 1, the point lies inside the circle of radius r = 1. Compute the ratio of points inside to total points; then pi = 4*ratio.

    int hit() {
        int const rand_max = 0xFFFFFF;
        double x = ((double) rand()) / RAND_MAX;
        double y = ((double) rand()) / RAND_MAX;
        return ((x*x + y*y) <= 1.0);
    }

UPC: Simple Example - Monte Carlo pi Calculation

    #include <upc.h>

    shared int hits = 0;

    main() {
        int i;
        int my_trials, trials = ...;
        my_trials = (trials + THREADS - 1)/THREADS;   /* divide the work */
        srand(MYTHREAD*17);
        for (i=0; i < my_trials; i++)
            hits += hit();                            /* score hits */
        if (MYTHREAD == 0)
            printf("pi estimated to %g\n",
                   4*(double)hits/(double)trials);
    }

What can go wrong?

UPC: Simple Example - Monte Carlo pi Calculation

    shared int hits = 0;

    main() {
        int i, my_trials, trials = ...;
        upc_lock_t *hit_lock = upc_all_lock_alloc();
        my_trials = (trials + THREADS - 1)/THREADS;
        srand(MYTHREAD*17);
        for (i=0; i < my_trials; i++) {
            upc_lock(hit_lock);
            hits += hit();                  /* score hits under the lock */
            upc_unlock(hit_lock);
        }
        upc_barrier;                        /* synchronize */
        if (MYTHREAD == 0)
            printf("pi estimated to %g\n",
                   4*(double)hits/(double)trials);
        upc_lock_free(hit_lock);
    }

Anything wrong here?

UPC: Simple Example - Monte Carlo pi Calculation

    shared int hits[THREADS] = { 0 };

    main() {
        int i, my_trials, tot_trials, trials = ...;
        my_trials = (trials + THREADS - 1)/THREADS;
        srand(MYTHREAD*17);
        for (i=0; i < my_trials; i++)
            hits[MYTHREAD] += hit();        /* score hits locally */
        upc_barrier;                        /* sync */
        if (MYTHREAD == 0) {
            for (i=1; i < THREADS; i++)     /* sum the per-thread hits */
                hits[0] += hits[i];
            tot_trials = THREADS*my_trials;
            printf("pi estimated to %g\n",
                   4*(double)hits[0]/(double)tot_trials);
        }
    }

Corrected.

UPC: Forall Work Sharing

    shared int v1[n], v2[n], sum[n];
    int i;
    upc_forall(i=0; i<n; i++; i)
        sum[i] = v1[i] + v2[i];

The fourth clause is the affinity expression; here it forces the owner-computes rule. The default array distribution is cyclic, so assuming THREADS=4 the loop is equivalent to:

    for(i=0; i<n; i++)
        if (MYTHREAD == i%THREADS)
            sum[i] = v1[i] + v2[i];

(Figure: the elements with affinity to thread 0 are highlighted)
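
The affinity expression may also be a shared address rather than an integer; the sketch below (added for this transcript, not from the slides) uses &sum[i], which assigns iteration i to the thread that owns sum[i] regardless of the distribution. The array sizes include THREADS so the declaration stays valid even when the thread count is fixed only at run time.

    #include <upc_relaxed.h>

    #define BLOCK 256
    shared int v1[BLOCK*THREADS], v2[BLOCK*THREADS], sum[BLOCK*THREADS];

    int main(void) {
        int i;
        /* Owner computes: iteration i runs on the thread that sum[i]
           has affinity to, whatever layout sum was declared with. */
        upc_forall(i = 0; i < BLOCK*THREADS; i++; &sum[i]) {
            v1[i] = i;
            v2[i] = 2*i;
            sum[i] = v1[i] + v2[i];
        }
        upc_barrier;
        return 0;
    }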

UPC: Pointers

Two questions: where does the pointer reside, and where does the referenced value reside?

    Pointer resides \ Value resides    Local        Shared
    Private                            PP (p1)      PS (p2)
    Shared                             SP (p3)      SS (p4)

    int *p1;                /* private pointer to local memory */
    shared int *p2;         /* private pointer to shared space */
    int *shared p3;         /* shared pointer to local memory */
    shared int *shared p4;  /* shared pointer to shared space */

A shared pointer to private local memory (p3) is not recommended.

UPC: Pointers

(Figure: p3 and p4 reside in the shared region of the global address space with affinity to thread 0, while every thread holds its own private p1 and p2; p2 and p4 point into shared space, p1 and p3 point into a thread's local memory)

    int *p1;                /* private pointer to local memory */
    shared int *p2;         /* private pointer to shared space */
    int *shared p3;         /* shared pointer to local memory */
    shared int *shared p4;  /* shared pointer to shared space */

UPC: Pointer Example

    shared int v1[n], v2[n], sum[n];
    int i;
    shared int *p1, *p2;

    p1 = v1;
    p2 = v2;
    upc_forall(i=0; i<n; i++, p1++, p2++; i)
        sum[i] = *p1 + *p2;
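
When a thread touches only elements it owns, a common UPC idiom is to cast the pointer-to-shared to an ordinary C pointer, so that local accesses avoid shared-pointer arithmetic. A minimal sketch (added for this transcript, not from the slides):

    #include <upc_relaxed.h>

    #define BLOCK 256
    shared [BLOCK] double a[BLOCK*THREADS];   /* one block of BLOCK elements per thread */

    int main(void) {
        /* a[MYTHREAD*BLOCK] has affinity to this thread, so its address
           may legally be cast to a plain (private) pointer. */
        double *local = (double *)&a[MYTHREAD * BLOCK];
        int i;
        for (i = 0; i < BLOCK; i++)
            local[i] = 1.0;                   /* fast, purely local accesses */
        upc_barrier;
        return 0;
    }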

UPC: Pointers

In UPC, a pointer to a shared object has three fields:
- thread number
- local address of the block (for blocked data distributions)
- phase (the position within the block)

A typical packed 64-bit representation:

    bits 63-49: phase
    bits 48-38: thread
    bits 37-0:  virtual address
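
UPC exposes these fields through the library queries upc_threadof, upc_phaseof and upc_addrfield. The sketch below (added for this transcript; assumes at least two threads) prints them for one element of a blocked shared array:

    #include <upc.h>
    #include <stdio.h>

    shared [4] int x[4*THREADS];   /* blocked distribution, block size 4 */

    int main(void) {
        if (MYTHREAD == 0) {
            /* x[5] sits in block 1 (phase 1), which has affinity
               to thread 1 % THREADS. */
            printf("x[5]: thread = %zu, phase = %zu, addrfield = %zu\n",
                   upc_threadof(&x[5]),
                   upc_phaseof(&x[5]),
                   upc_addrfield(&x[5]));
        }
        return 0;
    }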

UPC: Shared Variable Layout

Non-array shared variables have affinity with thread 0. Array layouts are cyclic or blocked:

    shared double x[n];        /* cyclic */
    shared [b] double y[n];    /* blocked */

where b is the block size. For blocked layouts, element i has affinity with thread (i/b) % THREADS; therefore use i/b as the affinity expression in upc_forall (owner computes):

    upc_forall(i=0; i<n; i++; i/b)
        y[i] = ...;
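
A small sketch (added for this transcript, not from the slides) makes the blocked mapping visible by recording which thread wrote each element; with THREADS = 2 and block size 4, ownership alternates in groups of four, since (i/4) % 2 flips once per block.

    #include <upc_relaxed.h>
    #include <stdio.h>

    #define B 4
    shared [B] double y[2*B*THREADS];    /* two blocks of B elements per thread */

    int main(void) {
        int i, n = 2*B*THREADS;
        /* Owner computes: iteration i runs on thread (i/B) % THREADS. */
        upc_forall(i = 0; i < n; i++; i/B)
            y[i] = MYTHREAD;             /* record the writing thread */
        upc_barrier;
        if (MYTHREAD == 0)
            for (i = 0; i < n; i++)
                printf("y[%d] written by thread %g\n", i, y[i]);
        return 0;
    }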

UPC: Consistency Model

The consistency model of shared memory accesses is controlled by qualifiers:
- strict: accesses always appear in program order
- relaxed: accesses may appear out of order to other threads

    strict: { x = y; z = y+1; }

Use strict on variables that are used for synchronization. For example, with flag initialized to 0:

    Thread 1:           Thread 2:
    data = ...;         while (!flag) /* spin */;
    flag = 1;           ... = data;

Select the default consistency model for a file with:

    #include <upc_strict.h>
    #include <upc_relaxed.h>
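
In actual UPC source the same pattern is usually written by declaring the flag with the strict qualifier while the data stays relaxed; a sketch (added for this transcript, not from the slides; run with at least two threads):

    #include <upc_relaxed.h>   /* relaxed is the default in this file */
    #include <stdio.h>

    strict shared int flag = 0;    /* strict: used only for synchronization */
    shared int data;               /* relaxed (the file default) */

    int main(void) {
        if (MYTHREAD == 0) {
            data = 42;             /* relaxed write */
            flag = 1;              /* strict write: data is complete before flag is set */
        } else if (MYTHREAD == 1) {
            while (!flag) ;        /* strict read: spin until thread 0 signals */
            printf("thread 1 sees data = %d\n", data);
        }
        upc_barrier;
        return 0;
    }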

UPC: Fence

UPC provides a fence construct:

    upc_fence;

It is equivalent to a null strict reference:

    strict { }

and ensures that all shared references issued before the upc_fence are complete.

Further Reading

- CAF: www.co-array.org
- UPC: upc.gwu.edu