Lecture 32: Partitioned Global Address Space (PGAS) programming models

COMP 322: Fundamentals of Parallel Programming
Lecture 32: Partitioned Global Address Space (PGAS) programming models
Zoran Budimlić and Mack Joyner, {zoran, mjoyner}@rice.edu
http://comp322.rice.edu
COMP 322, Lecture 32, 4 April 2018

Worksheet #31: PageRank Example
Name: ________    Net ID: ________
In the space below, indicate what you expect the relative ranking to be for the three pages below (with the given links). Show your computation (approximations are fine).
Final ranking, after 7 iterations: (1) Amazon = 1.22, (2) Yahoo = 1.15, (3) Microsoft = 0.65

Partitioned Global Address Space Languages
- Global address space: one-sided communication (GET/PUT)
- Programmer has control over performance-critical factors:
  - data distribution and locality control
  - computation partitioning
  - communication placement
- Data movement and synchronization as language primitives: amenable to compiler-based communication optimization
- Global view rather than local view: simpler than two-sided message passing in MPI; lacking in thread-based models
- HJ places (Lecture 34) help with locality control, but not with data distribution
[Figure: global view vs. local view (4 processes)]

Partitioned Global Address Space (PGAS) Languages
- Unified Parallel C (C): http://upc.wikinet.org
- Titanium (early Java): http://titanium.cs.berkeley.edu
- Coarray Fortran 2.0 (Fortran): http://caf.rice.edu
- UPC++ (C++): https://bitbucket.org/upcxx
- Habanero-UPC++ (C++): http://habanero-rice.github.io/habanero-upc/
- OpenSHMEM (C): http://openshmem.org/site/
Related efforts: newer languages developed since 2003 as part of the DARPA High Productivity Computing Systems (HPCS) program:
- IBM: X10 (starting point for Habanero-Java)
- Cray: Chapel
- Oracle/Sun: Fortress

PGAS Model
- A collection of threads (like MPI processes) operating in a partitioned global address space that is logically distributed across threads.
- Each thread has affinity with a portion of the globally shared address space. Each thread also has a private space.
- Elements in the partitioned global space that are co-located with a thread are said to have affinity to that thread.

Unified Parallel C (UPC) Execution Model
- Multiple threads working independently in a SPMD fashion
  - MYTHREAD specifies the thread index (0..THREADS-1), like MPI processes and ranks
  - the number of threads is specified at compile time or at program launch time
- Partitioned Global Address Space (different from MPI)
  - a pointer-to-shared can reference all locations in the shared space
  - a pointer-to-local ("plain old C pointer") may only reference addresses in its private space or addresses in its portion of the shared space
  - static and dynamic memory allocation are supported for both shared and private memory
- Threads synchronize as necessary using synchronization primitives and shared variables
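The execution model above can be made concrete with a small SPMD example. The sketch below is not from the lecture (the file name, array, and printed message are illustrative), but MYTHREAD, THREADS, upc_barrier, and the cast from a pointer-to-shared to a plain C pointer for data with local affinity are all standard UPC:

    /* hello_upc.upc -- hypothetical sketch of the UPC SPMD execution model */
    #include <upc_relaxed.h>
    #include <stdio.h>

    shared int counter[THREADS];      /* one element with affinity to each thread */

    int main(void) {
        shared int *ps = &counter[MYTHREAD];   /* pointer-to-shared */
        int *pl = (int *) ps;                  /* pointer-to-local: legal because this
                                                  element has affinity to MYTHREAD */
        *pl = MYTHREAD;                        /* write through the local pointer */
        upc_barrier;                           /* make all writes visible */
        if (MYTHREAD == 0)
            printf("running with %d threads; counter[THREADS-1] = %d\n",
                   THREADS, counter[THREADS-1]);
        return 0;
    }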

Shared and Private Data
- Static and dynamic memory allocation are supported for each type of data
- Shared objects are placed in memory based on affinity
  - shared scalars have affinity to thread 0; here, a scalar means a non-array instance of any type (it could be a struct, for example)
  - by default, elements of shared arrays are allocated round-robin among the memory modules co-located with each thread (cyclic distribution)

A One-dimensional Shared Array
Consider the following data layout directive:
    shared int y[2 * THREADS + 1];
For THREADS = 3, we get the following cyclic layout:
    Thread 0: y[0], y[3], y[6]
    Thread 1: y[1], y[4]
    Thread 2: y[2], y[5]

A Multi-dimensional Shared Array
Consider the declaration:
    shared int A[4][THREADS];
For THREADS = 3, we get the following cyclic layout:
    Thread 0: A[0][0], A[1][0], A[2][0], A[3][0]
    Thread 1: A[0][1], A[1][1], A[2][1], A[3][1]
    Thread 2: A[0][2], A[1][2], A[2][2], A[3][2]

Shared and Private Data
Consider the following data layout directives:
    shared int x;           // x has affinity to thread 0
    shared int y[THREADS];
    int z;                  // private
For THREADS = 3, we get the following layout:
    Thread 0: x, y[0], and a private copy of z
    Thread 1: y[1], and a private copy of z
    Thread 2: y[2], and a private copy of z
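The affinity rules illustrated above can also be queried at run time. The following sketch is not from the lecture; it reuses the declarations from this slide and calls the standard UPC function upc_threadof() to report which thread each shared object has affinity to:

    /* affinity_check.upc -- hypothetical sketch using upc_threadof() */
    #include <upc.h>
    #include <stdio.h>

    shared int x;               /* affinity to thread 0 */
    shared int y[THREADS];      /* cyclic: y[i] has affinity to thread i */
    int z;                      /* one private copy per thread */

    int main(void) {
        if (MYTHREAD == 0) {
            int i;
            printf("x has affinity to thread %d\n", (int) upc_threadof(&x));
            for (i = 0; i < THREADS; i++)
                printf("y[%d] has affinity to thread %d\n",
                       i, (int) upc_threadof(&y[i]));
        }
        return 0;
    }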

Controlling the Layout of Shared Arrays
- A blocking factor can be specified for shared arrays to obtain block-cyclic distributions
  - the default block size is 1 element, i.e., a cyclic distribution
- Shared arrays are distributed on a block-per-thread basis, with round-robin allocation of block-size chunks
- Example layout for a block size of 2, e.g., shared [2] int a[16], with THREADS = 3:
    Thread 0: a[0], a[1], a[6], a[7], a[12], a[13]
    Thread 1: a[2], a[3], a[8], a[9], a[14], a[15]
    Thread 2: a[4], a[5], a[10], a[11]

Blocking Multi-dimensional Data
Consider the data declaration:
    shared [3] int A[4][THREADS];
When THREADS = 4, this results in the following data layout (blocks of 3 consecutive elements, taken in row-major order and dealt round-robin to threads):
    Thread 0: A[0][0], A[0][1], A[0][2], A[3][0], A[3][1], A[3][2]
    Thread 1: A[0][3], A[1][0], A[1][1], A[3][3]
    Thread 2: A[1][2], A[1][3], A[2][0]
    Thread 3: A[2][1], A[2][2], A[2][3]
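The block-cyclic layouts above can be verified the same way. This sketch is not from the lecture; it takes the 1-D blocked array from the previous slide (shared [2] int a[16]) and prints its block size and element placement using the standard upc_blocksizeof operator and upc_threadof() function (run with 3 threads to reproduce the table above):

    /* blocked_layout.upc -- hypothetical sketch of a block-cyclic layout */
    #include <upc.h>
    #include <stdio.h>

    shared [2] int a[16];       /* block size 2, dealt round-robin to threads */

    int main(void) {
        if (MYTHREAD == 0) {
            int i;
            printf("block size of a = %d elements\n", (int) upc_blocksizeof(a));
            for (i = 0; i < 16; i++)
                printf("a[%2d] has affinity to thread %d\n",
                       i, (int) upc_threadof(&a[i]));
        }
        return 0;
    }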

A Simple UPC Program: Vector Addition

    // vect_add.c
    #include <upc_relaxed.h>
    #define N 100*THREADS
    shared int v1[N], v2[N], v1plusv2[N];

    void main() {
        int i;
        for (i = 0; i < N; i++)
            if (MYTHREAD == i % THREADS)
                v1plusv2[i] = v1[i] + v2[i];
    }

[Figure: with 2 threads, thread 0 owns v1[0], v1[2], ... in the shared space and handles iterations 0, 2, ...; thread 1 owns v1[1], v1[3], ... and handles iterations 1, 3, ...]
Each thread executes every iteration of the loop just to check whether it has work.

A More Efficient Vector Addition

    // vect_add.c
    #include <upc_relaxed.h>
    #define N 100*THREADS
    shared int v1[N], v2[N], v1plusv2[N];

    void main() {
        int i;
        for (i = MYTHREAD; i < N; i += THREADS)
            v1plusv2[i] = v1[i] + v2[i];
    }

[Figure: same cyclic layout as before, with thread 0 handling iterations 0, 2, ... and thread 1 handling iterations 1, 3, ...]
Each thread executes only its own iterations.

Worksharing with upc_forall
- Distributes independent iterations across threads
- Simple C-like syntax and semantics:
    upc_forall(init; test; loop; affinity)
- The affinity expression is used to enable locality control
  - usually, the goal is to map each iteration to the thread where (all or most of) the iteration's data resides
- Affinity can be an integer expression (taken implicitly modulo THREADS), or a reference to (address of) a shared object

Work Sharing + Affinity with upc_forall
Example 1: explicit data affinity using shared references

    shared int a[100], b[100], c[100];
    int i;
    upc_forall (i=0; i<100; i++; &a[i])   // execute iteration i on the thread that owns a[i]
        a[i] = b[i] * c[i];

Example 2: implicit data affinity with integer expressions

    shared int a[100], b[100], c[100];
    int i;
    upc_forall (i=0; i<100; i++; i)       // execute iteration i on thread i % THREADS
        a[i] = b[i] * c[i];

Both yield a round-robin distribution of iterations.

Work Sharing + Affinity with upc_forall
Example 3: implicit affinity by chunks

    shared [25] int a[100], b[100], c[100];
    int i;
    upc_forall (i=0; i<100; i++; (i*THREADS)/100)
        a[i] = b[i] * c[i];

Assuming 4 threads, the distribution of upc_forall iterations is as follows:

    iteration i:        0..24    25..49     50..74     75..99
    i*THREADS:          0..96    100..196   200..296   300..396
    (i*THREADS)/100:    0        1          2          3
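Putting the vector-addition and upc_forall slides together, the whole example can be written so that the affinity expression, rather than an explicit MYTHREAD test, selects which thread runs each iteration. This is a sketch rather than lecture code; the initialization values and the choice of &v1plusv2[i] as the affinity expression are illustrative:

    // vect_add_forall.c -- hypothetical upc_forall version of vector addition
    #include <upc_relaxed.h>
    #include <stdio.h>
    #define N 100*THREADS

    shared int v1[N], v2[N], v1plusv2[N];

    int main(void) {
        int i;
        // each thread initializes the elements it owns (default cyclic layout)
        upc_forall (i = 0; i < N; i++; &v1[i]) {
            v1[i] = i;
            v2[i] = 2 * i;
        }
        upc_barrier;                            // make the inputs visible to all threads
        // run iteration i on the thread that owns v1plusv2[i]
        upc_forall (i = 0; i < N; i++; &v1plusv2[i])
            v1plusv2[i] = v1[i] + v2[i];
        upc_barrier;
        if (MYTHREAD == 0)
            printf("v1plusv2[N-1] = %d\n", v1plusv2[N-1]);
        return 0;
    }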

Synchronization in UPC
- Barriers (blocking)
  - upc_barrier, like the next operation in HJ
- Split-phase barriers (non-blocking)
  - upc_notify, like an explicit (non-blocking) signal on an HJ phaser
  - upc_wait, like an explicit wait on an HJ phaser
- Lock primitives
  - void upc_lock(upc_lock_t *l)
  - int upc_lock_attempt(upc_lock_t *l)   // like trylock()
  - void upc_unlock(upc_lock_t *l)
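As a usage illustration (not from the slides), the sketch below combines the lock and barrier primitives to accumulate one contribution per thread into a single shared counter; upc_all_lock_alloc() and upc_lock_free() are the standard UPC library routines for creating and releasing a lock:

    /* lock_sum.upc -- hypothetical sketch of UPC locks and barriers */
    #include <upc.h>
    #include <stdio.h>

    shared int total;            /* affinity to thread 0, zero-initialized */
    upc_lock_t *lock;            /* private pointer to a shared lock object */

    int main(void) {
        lock = upc_all_lock_alloc();   /* collective: all threads get the same lock */
        upc_lock(lock);
        total += MYTHREAD + 1;         /* critical section */
        upc_unlock(lock);
        upc_barrier;                   /* wait until every contribution is in */
        if (MYTHREAD == 0) {
            printf("total = %d (expected %d)\n",
                   total, THREADS * (THREADS + 1) / 2);
            upc_lock_free(lock);
        }
        return 0;
    }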

UPC++ Library: a Compiler-Free Approach for PGAS (source: LBNL)
- Leverage C++ standards and compilers
  - implement UPC++ as a C++ template library
  - C++ templates can be used as a mini-language to extend C++ syntax
- Many new features in C++11
  - e.g., type inference, variadic templates, lambda functions, r-value references
  - C++11 is well supported by major compilers
[Figure: compilation flow: a UPC++ program includes the UPC++ template header files; UPC++ idioms are translated to C++ and compiled by a standard C++ compiler into an object file with runtime calls, which the linker combines with the UPC++ runtime, GASNet, and system libraries to produce the executable]

Habanero-UPC++: Extending UPC++ with Task Parallelism (LBNL, Rice)

    finish ( [capture_list1] () {
        // Any Habanero dynamic tasking constructs
        ...   // finish, async, asyncawait
        ...
        // Remote function invocation
        asyncat ( destplace, [capture_list2] ( ) {
            Statements;
        });
        ...
        // Remote copy with completion signal in result
        asynccopy ( src, dest, count, ddf=null );
        ...
        asyncawait(ddf, ...);   // local
    });   // waits for all local/remote asyncs to complete

Reference: "HabaneroUPC++: A Compiler-free PGAS Library", V. Kumar, Y. Zheng, V. Cavé, Z. Budimlić, V. Sarkar, PGAS 2015.

Example code structure from an application run on an ORNL supercomputer (LSMS)

MPI version:
    // Post MPI_Irecv() calls
    ...
    // Post MPI_Isend() calls
    ...
    // Perform all MPI_Wait() calls
    ...
    // Perform tasks; each task needs results from two MPI_Irecv() calls
    ...

Habanero-UPC++ version:
    // Issue one-sided asynccopy() calls
    ...
    // Issue data-driven tasks in any order, without any wait/barrier operations
    hcpp::asyncawait(... result1, result2, [=]() { task body });

The MPI version waits for all Irecv() calls to complete before executing any tasks (like a barrier). The Habanero-UPC++ version specifies that each asyncawait() task can complete as soon as its two results become available from the asynccopy() calls.