XLUPC: Developing an optimizing compiler for the UPC Programming Language


1 IBM Software Group Rational XLUPC: Developing an optimizing compiler for the UPC Programming Language Ettore Tiotto IBM Toronto Laboratory

2 IBM Disclaimer Copyright IBM Corporation. All rights reserved. The information contained in these materials is provided for informational purposes only, and is provided AS IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, these materials. Nothing contained in these materials is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software. References in these materials to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. Product release dates and/or capabilities referenced in these materials may change at any time at IBM's sole discretion based on market opportunities or other factors, and are not intended to be a commitment to future product or feature availability in any way. IBM, the IBM logo, Rational, the Rational logo, Telelogic, the Telelogic logo, and other IBM products and services are trademarks of the International Business Machines Corporation, in the United States, other countries or both. Other company, product, or service names may be trademarks or service marks of others. Please Note: IBM's statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM's sole discretion. Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision. The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract. The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.

3 Outline 1. Overview of the PGAS programming model 2. The UPC Programming Language 3. Compiler optimizations for PGAS Languages

4 IBM Software Group Rational 1. Overview of the PGAS programming model Ettore Tiotto IBM Toronto Laboratory

5 Partitioned Global Address Space Explicitly parallel, shared-memory-like programming model Global address space: allows programmers to declare and directly access data distributed across machines Partitioned address space: memory is logically partitioned between local and remote (a two-level hierarchy) Forces the programmer to pay attention to data locality, by exposing the inherent NUMA-ness of current architectures Single Program Multiple Data (SPMD) execution model: all threads of control execute the same program Number of threads fixed at startup Newer languages such as X10 escape this model, allowing fine-grain threading Different language implementations: UPC (C-based), Coarray Fortran (Fortran-based), Titanium and X10 (Java-based)

6 Partitioned Global Address Space Computation is performed in multiple places. A place contains data that can be operated on remotely. Data lives in the place it was created, for its lifetime. A datum in one place may point to a datum in another place. Data structures (e.g. arrays) may be distributed across many places. Places may have different computational properties (mapping to a hierarchy of compute engines). A place expresses locality. (Diagram: process/thread address spaces under three models: message passing with MPI, shared memory with OpenMP, and PGAS with UPC, CAF, X10.)

7 UPC Execution Model A number of threads working independently in a SPMD fashion Number of threads available as program variable THREADS MYTHREAD specifies thread index (0..THREADS-1) upc_barrier is a global synchronization: all wait There is a form of worksharing loop that we will see later Two compilation modes Static threads mode: THREADS is specified at compile time by the user (compiler option) The program may use THREADS as a compile-time constant The compiler can generate more efficient code Dynamic threads mode: Compiled code may be run with varying numbers of threads THREADS is specified at run time by the user (via env. variable)
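As a minimal sketch of the two modes with the xlupc driver (the -qupc=threads option and the UPC_NTHREADS environment variable both appear later in these slides; the file name is illustrative):

# Static threads mode: THREADS is a compile-time constant (4 here)
xlupc -qupc=threads=4 hello.upc
./a.out

# Dynamic threads mode: the same binary runs with any thread count
xlupc hello.upc
env UPC_NTHREADS=8 ./a.out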

8 Hello World in UPC Any legal C program is also a legal UPC program If you compile and run it as UPC with N threads, it will run N copies of the program (Single Program executed by all threads).

#include <upc.h>
#include <stdio.h>

int main() {
    printf("Thread %d of %d: Hello UPC world\n", MYTHREAD, THREADS);
    return 0;
}

hello > xlupc helloworld.upc
hello > env UPC_NTHREADS=4 ./a.out
Thread 1 of 4: Hello UPC world
Thread 0 of 4: Hello UPC world
Thread 3 of 4: Hello UPC world
Thread 2 of 4: Hello UPC world

9 IBM Software Group Rational 2. The UPC Programming Language Ettore Tiotto IBM Toronto Laboratory Some slides adapted with permission from Kathy Yelick

10 An Overview of the UPC Language Private vs. Shared Variables in UPC Shared Array Declaration and Layouts Parallel Work sharing Loop Synchronization Pointers in UPC Shared data transfer and collectives

11 Private vs. Shared Variables in UPC Normal C variables and objects are allocated in the private memory space for each thread (thread-local variables) Shared variables are allocated only once, on thread 0

shared int ours;
int mine;

Shared variables may not be declared automatic, i.e., may not occur in a function definition, except as static. (Diagram: in the global address space, ours lives in the shared section with affinity to thread 0, while each of threads 0..n keeps its own private copy of mine.)

12 Shared Arrays Shared arrays are distributed over the threads, in a cyclic fashion

shared int x[THREADS];     /* 1 element per thread      */
shared int y[3][THREADS];  /* 3 elements per thread     */
shared int z[3][3];        /* 2 or 3 elements per thread */

Assuming THREADS = 4 (in the diagram of x, y and z, colors map to threads): think of a linearized C array, then map it round-robin on THREADS

13 Distribution of a shared array in UPC Elements are distributed in block-cyclic fashion Each thread owns blocks of adjacent elements

shared [2] int X[10];   /* blocking factor 2 */

With 2 threads, the owners of X[0..9] are: th0 th0 th1 th1 th0 th0 th1 th1 th0 th0

14 Layouts in General All non-array objects have affinity with thread zero. Array layouts are controlled by layout specifiers: Empty (cyclic layout) [*] (blocked layout) [0] or [] (indefinite layout, all on 1 thread) [b] or [b1][b2]...[bn] = [b1*b2*...*bn] (fixed block size) The affinity of an array element is defined in terms of the block size (a compile-time constant) and THREADS: element i has affinity with thread (i / block_size) % THREADS In 2D and higher, linearize the elements as in a C representation, and then use the above mapping
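A small worked check of the affinity formula against the shared [2] int X[10] example from the previous slide (a sketch; the owner() helper is illustrative and not part of UPC):

#include <stdio.h>

/* Owner thread of element i, per (i / block_size) % THREADS */
static int owner(int i, int block_size, int nthreads) {
    return (i / block_size) % nthreads;
}

int main(void) {
    /* shared [2] int X[10] with THREADS == 2 */
    for (int i = 0; i < 10; i++)
        printf("X[%d] -> thread %d\n", i, owner(i, 2, 2));
    return 0;   /* prints the pattern th0 th0 th1 th1 th0 th0 th1 th1 th0 th0 */
}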

15 Physical layout of shared arrays

shared [2] int X[10];   /* 2 threads */

Logical distribution (owner of X[0..9]): th0 th0 th1 th1 th0 th0 th1 th1 th0 th0
Physical distribution (elements grouped by owner): thread 0 holds X[0], X[1], X[4], X[5], X[8], X[9]; thread 1 holds X[2], X[3], X[6], X[7]

16 Work Sharing with upc_forall() UPC adds a special type of loop: upc_forall(init; test; loop; affinity) statement; The affinity expression indicates which iterations to run on each thread. It may have one of two types: Integer: affinity % THREADS == MYTHREAD, e.g. upc_forall(i=0; i<N; ++i; i) Address: upc_threadof(affinity) == MYTHREAD, e.g. upc_forall(i=0; i<N; ++i; &A[i]) The programmer indicates the iterations are independent Undefined if there are dependencies across threads No early exit from or entry into the loop is allowed

17 Work Sharing with upc_forall() Similar to a C for loop; the 4th field indicates the affinity The thread that owns element A[i] executes iteration i

shared int A[10], B[10], C[10];
upc_forall(i=0; i < 10; i++; &A[i]) {
    A[i] = B[i] + C[i];
}

(Diagram: with 2 threads and the default cyclic layout, thread 0 executes the even iterations and thread 1 the odd iterations; no communication is needed.)

18 Vector Addition with upc_forall

#define N 100*THREADS
shared int v1[N], v2[N], sum[N];

void main() {
    int i;
    upc_forall(i=0; i<N; i++; i)
        sum[i] = v1[i] + v2[i];
}

Equivalent code could use &sum[i] for affinity The code would be correct but slow if the affinity expression were i+1 rather than i. Why?

19 UPC Global Synchronization Synchronization between threads is necessary to control the relative execution of threads The programmer is responsible for synchronization UPC has two basic forms of global barriers: Barrier: block until all other threads arrive (upc_barrier) Split-phase barrier: upc_notify; (this thread is ready for the barrier), then perform local computation, then upc_wait; (wait for others to be ready)
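A minimal sketch of the split-phase barrier, overlapping local work with the synchronization (the printf stands in for arbitrary local computation):

#include <upc.h>
#include <stdio.h>

int main() {
    upc_notify;   /* signal: this thread has reached the barrier point */
    /* local work that does not touch other threads' shared data */
    printf("thread %d doing local work while others catch up\n", MYTHREAD);
    upc_wait;     /* complete the barrier: block until all threads have notified */
    return 0;
}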

20 Recap: Shared Variables, Work sharing and Synchronization With what you've seen until now, you can write a bare-bones data-parallel program Shared variables are distributed and visible to all threads Shared scalars have affinity to thread 0 Shared arrays are distributed (cyclically by default) on all threads Execution model is SPMD, with the upc_forall loop provided to share work Barriers and split-phase barriers provide global synchronization

21 Pointers to Shared vs. Arrays In the C tradition, arrays can be accessed through pointers Here is a vector addition example using pointers

#define N 100*THREADS
shared int v1[N], v2[N], sum[N];

void main() {
    int i;
    shared int *p1, *p2;
    p1 = &v1[0]; p2 = &v2[0];
    upc_forall(i=0; i<N; i++; &sum[i]) {
        sum[i] = p1[i] + p2[i];
    }
}

(Diagram: p1 points to the start of v1, p2 to the start of v2.)

22 Common Uses for UPC Pointer Types int *p1; These pointers are fast (just like C pointers) Use to access local data in part of code performing local work Often cast a pointer-to-shared to one of these to get faster access to shared data that is local shared int *p2; Use to refer to remote data Larger and slower due to test-for-local + possible communication int * shared p3; Not recommended Why not? shared int * shared p4; Use to build shared linked structures, e.g., a linked list

23 UPC Pointers Where does the pointer point to, and where does the pointer reside?

                            Points to private (local)   Points to shared
Resides in private (local)  PP (p1)                      SP (p2)
Resides in shared           PS (p3)                      SS (p4)

int *p1;                /* private pointer to local memory */
shared int *p2;         /* private pointer to shared space */
int *shared p3;         /* shared pointer to local memory  */
shared int *shared p4;  /* shared pointer to shared space  */

Shared to private is not recommended. Why?

24 UPC Pointers (Diagram: in the global address space, p3 and p4 reside in the shared section, while each of threads 0..n keeps its own p1 and p2 in its private section.)

int *p1;                /* private pointer to local memory */
shared int *p2;         /* private pointer to shared space */
int *shared p3;         /* shared pointer to local memory  */
shared int *shared p4;  /* shared pointer to shared space  */

Pointers to shared often require more storage and are more costly to dereference than local pointers; they may refer to local or remote memory.

25 UPC Pointers Pointer arithmetic supports blocked and non-blocked array distributions Casting of shared to private pointers is allowed, but not vice versa! When casting a pointer-to-shared to a private pointer, the thread number of the pointer-to-shared may be lost Casting of shared to private is well defined only if the object pointed to by the pointer-to-shared has affinity with the thread performing the cast

shared [BF] int A[N];
shared [BF] int *p = &A[BF*MYTHREAD];
int *lp = (int *) p;   // cast to a local pointer (careful here!)
for (i=0; i<BF; i++)
    lp[i] = i;         // valid _only_ if p points to the local portion of the array!

26 Data movement Fine-grain (element-by-element) accesses are easy to program in an imperative way. However, especially on distributed-memory machines, block transfers are more efficient UPC provides library functions for shared data movement: upc_memset sets a block of values in shared memory upc_memget and upc_memput transfer blocks of data between shared memory and private memory upc_memcpy transfers blocks of data from shared memory to shared memory
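A minimal sketch of a bulk transfer with upc_memget, pulling one thread's block into a private buffer with a single call instead of BF fine-grain accesses (array names and sizes are illustrative):

#include <upc.h>

#define BF 256
#define N (BF*THREADS)

shared [BF] int data[N];   /* one block of BF ints per thread */
int local_block[BF];       /* private destination buffer      */

int main() {
    /* fetch thread 1's block; all BF elements have affinity to thread 1 */
    if (THREADS > 1)
        upc_memget(local_block, &data[1*BF], BF * sizeof(int));
    return 0;
}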

27 Data movement Collectives Used to move shared data across threads:
- upc_all_broadcast(shared void* dst, shared void* src, size_t nbytes, ...) A single thread copies a block of memory it owns and sends it to all threads
- upc_all_scatter(shared void* dst, shared void* src, size_t nbytes, ...) A single thread splits a block of memory it owns and sends each piece to a different thread
- upc_all_gather(shared void* dst, shared void* src, size_t nbytes, ...) Each thread copies a block of memory it owns and sends it to a single thread
- upc_all_gather_all(shared void* dst, shared void* src, size_t nbytes, ...) Each thread copies a block of memory it owns and sends it to all threads
- upc_all_exchange(shared void* dst, shared void* src, size_t nbytes, ...) Each thread splits a block of memory it owns into pieces and sends one piece to each thread
- upc_all_permute(shared void* dst, shared void* src, shared int* perm, size_t nbytes, ...) Each thread i copies a block of memory it owns and sends it to thread perm[i]

28 upc_all_broadcast (Diagram: thread 0's src block is copied into the dst block of every thread.) Thread 0 copies a block of memory and sends it to all threads
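A sketch of how the broadcast above might be invoked (NELEMS and the array names are illustrative):

#include <upc.h>
#include <upc_collective.h>

#define NELEMS 4

shared [] int src[NELEMS];                /* source block, entirely on thread 0 */
shared [NELEMS] int dst[NELEMS*THREADS];  /* one destination block per thread   */

int main() {
    if (MYTHREAD == 0)
        for (int i = 0; i < NELEMS; i++) src[i] = i;

    /* every thread receives thread 0's block into its own block of dst;
       UPC_IN_ALLSYNC ensures data is not read before all threads arrive */
    upc_all_broadcast(dst, src, NELEMS * sizeof(int),
                      UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC);
    return 0;
}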

29 upc_all_scatter (Diagram: thread 0's src is split into blocks and block i lands in thread i's dst.) Thread 0 sends a unique block to each thread

30 upc_all_gather (Diagram: each thread's src block is collected into thread 0's dst.) Each thread sends a block to thread 0

31 Computational Collectives Used to perform data reductions
- upc_all_reduceT(shared void* dst, shared void* src, upc_op_t op, ...)
- upc_all_prefix_reduceT(shared void* dst, shared void* src, upc_op_t op, ...)
One version for each type T (22 versions in total) Many operations are supported; OP can be: +, *, &, |, xor, &&, ||, min, max Perform OP on all elements of the src array and place the result in the dst array
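A minimal sketch using the signed-int variant, upc_all_reduceI, to sum an array distributed in blocks (the sizes and names are illustrative):

#include <upc.h>
#include <upc_collective.h>
#include <stdio.h>

#define BF 4

shared [BF] int src[BF*THREADS];
shared int result;   /* reduction result, affinity to thread 0 */

int main() {
    for (int i = 0; i < BF; i++)
        src[MYTHREAD*BF + i] = 1;   /* each thread fills its own block */

    /* UPC_ADD is the '+' operation; the function pointer is only needed
       for user-defined operations, so NULL is passed here */
    upc_all_reduceI(&result, src, UPC_ADD, BF*THREADS, BF,
                    NULL, UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC);

    if (MYTHREAD == 0)
        printf("sum = %d\n", result);   /* expect BF*THREADS */
    return 0;
}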

32 upc_all_reduce (Diagram: each thread contributes the elements of its src block; the combined result, e.g. 45, is stored in dst on thread 0.) Threads perform partial sums; the partial sums are added and the result is stored on thread 0

33 Simple sum: Introduction Simple forall loop to add values into a sum

shared int values[N], sum;

sum = 0;
upc_forall (int i=0; i<N; i++; &values[i])
    sum += values[i];

Is there a problem with this code? The implementation is broken: sum is not guarded by a lock The concurrent read-modify-write updates to sum race with each other; you will get different answers every time

34 Synchronization primitives We have seen upc_barrier, upc_notify and upc_wait UPC supports locks: Represented by an opaque type: upc_lock_t Must be allocated before use: upc_lock_t *upc_all_lock_alloc(void); collective, allocates 1 lock and returns the same pointer to all threads upc_lock_t *upc_global_lock_alloc(void); non-collective, allocates 1 lock and returns the pointer to the calling thread To use a lock: void upc_lock(upc_lock_t *l) void upc_unlock(upc_lock_t *l) use at the start and end of a critical region Locks can be freed when not in use: void upc_lock_free(upc_lock_t *ptr);
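A sketch that repairs the simple sum from the previous slide by protecting the update with a lock (N and the initialization are illustrative; per-thread partial sums or upc_all_reduceI would be faster than locking on every iteration):

#include <upc.h>
#include <stdio.h>

#define N (100*THREADS)

shared int values[N];
shared int sum;          /* affinity to thread 0, zero-initialized */
upc_lock_t *sum_lock;    /* every thread ends up with the same lock pointer */

int main() {
    sum_lock = upc_all_lock_alloc();   /* collective allocation */

    upc_forall (int i = 0; i < N; i++; &values[i])
        values[i] = 1;
    upc_barrier;

    upc_forall (int i = 0; i < N; i++; &values[i]) {
        upc_lock(sum_lock);            /* enter critical region */
        sum += values[i];
        upc_unlock(sum_lock);          /* leave critical region */
    }
    upc_barrier;

    if (MYTHREAD == 0) {
        printf("sum = %d\n", sum);     /* expect N */
        upc_lock_free(sum_lock);
    }
    return 0;
}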

35 Dynamic memory allocation As in C, memory can be dynamically allocated UPC provides several memory allocation routines to obtain space in the shared heap shared void* upc_all_alloc(size_t nblocks, size_t nbytes) a collective operation that allocates memory on all threads; layout equivalent to: shared [nbytes] char[nblocks * nbytes] shared void* upc_global_alloc(size_t nblocks, size_t nbytes) a non-collective operation, invoked by one thread to allocate memory on all threads; layout equivalent to: shared [nbytes] char[nblocks * nbytes] shared void* upc_alloc(size_t nbytes) a non-collective operation to obtain memory in the thread's shared section of memory void upc_free(shared void *ptr) a non-collective operation to free data allocated in shared memory Why do we need just one version?
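A minimal sketch with upc_all_alloc and upc_free (the BLOCK size is illustrative):

#include <upc.h>

#define BLOCK 256

shared [BLOCK] int *A;   /* private pointer-to-shared, one copy per thread */

int main() {
    /* collective: every thread makes the same call and receives a pointer
       to the same shared array of THREADS blocks of BLOCK ints             */
    A = (shared [BLOCK] int *) upc_all_alloc(THREADS, BLOCK * sizeof(int));

    A[MYTHREAD * BLOCK] = MYTHREAD;   /* first element of my block is local */

    upc_barrier;
    if (MYTHREAD == 0)
        upc_free((shared void *)A);   /* any single thread may free it      */
    return 0;
}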

36 IBM Software Group Rational 3. Compiler Optimizations for PGAS Languages Ettore Tiotto IBM Toronto Laboratory

37 XLUPC Compiler Architecture Built to leverage the IBM XLC technology Common Front End for C and UPC (extends C technology to UPC, generates WCode IL) Common High Level Optimizer (common infrastructure with C, UPC-specific optimizations developed) Common Low Level Optimizer (largely unmodified, UPC artifacts handled upstream) PGAS Runtime (exploitation of P7IH specialized network hardware)

38 XLUPC Optimizer Architecture

Data and Control Flow Optimizer / Loop Optimizer phases: Constant Propagation, Dead Code Elimination, Loop Unrolling, Copy Propagation, Expression Simplification, Loop Normalization, Loop Unswitching, Dead Store Elimination, Backward and Forward Store Motion, Redundant Condition Elimination, Traditional Loop Optimizations, Shared Objects Coalescing, Shared Objects Privatization, Remote Update, Shared Array Idioms Optimization, Prefetching of Remote Shared Data, Parallel Work-Sharing Optimization

Optimization Strategy:
1) Remove overhead: Parallel Work-Sharing Loop Optimization, Shared Pointer Arithmetic Inlining
2) Exploit locality information: Shared Objects Privatization
3) Reduce communication requirements: Shared Objects Coalescing, Remote Update, Shared Array Idioms Recognition
4) P7IH exploitation (through PAMI)

Code Generation: UPC Code Gen, Thread Local Storage
PGAS Runtime System: PAMI (HFI Immediate Send Instruction support, Collective Acceleration Unit support, HFI remote updates support, PAMI network endpoints)

39 XL UPC Compiler optimizations 3.0 Translation of shared object accesses Preview of compiler optimized code 3.1 Optimizing a Work Sharing Loop Removing the affinity branch in a work sharing loop 3.2 Locality Analysis for Work Sharing Loops 3.3 Shared Array Optimizations Optimizations for Local Accesses (privatization) Optimizations for Remote Accesses (coalescing) Optimizing Commonly Used Code Patterns 3.4 Optimizing Shared Accesses through UPC pointers Optimizing Shared Accesses through pointers (versioning)

40 3.0 Translation of shared object accesses Achieved by translating shared accesses into calls to our PGAS runtime system:

Source code:

shared int a[10];
int foo () {
    int i, res;
    if (MYTHREAD==0)
        for (i=0; i<10; i++)
            res += a[i];
    upc_barrier;
    return res;
}

Pseudocode generated by the compiler:

int foo() {
    if (!(mythread == 0)) goto lab_1;
    i = 0;
    do {
        ...
        xlupc_incr_shared_addr(&shared_addr.a, i);
        xlupc_ptrderef(shared_addr.a, &tmp);
        res = res + tmp;
        i = i + 1;
    } while (i < 10);
lab_1:
    xlupc_barrier(0,1);
    return res;
}

41 3.0 Translation of shared object accesses To translate the shared array access a[i] the compiler generates: A PGAS RT call to position a pointer-to-shared to the correct array element (&a[i]): xlupc_incr_shared_addr(&shared_addr.a, i); A PGAS RT call to retrieve the value of a[i] through the pointer: xlupc_ptrderef(shared_addr.a, &tmp); The value of a[i] is now in a local buffer tmp Performance considerations: Function call overhead (setting up arguments, branching to the runtime routine) Runtime implementation cost (receiving arguments, performing local computation) Communication cost (if the target array element is not local) Pointer increment cost (actually inlined by the compiler for common cases)

42 3.0 Preview of compiler optimized code Besides code generation, the compiler concerns itself with generating efficient code (removing overhead). For example, certain shared accesses can be proven local at compile time, so the compiler can: Cast the shared array element address to a local C pointer (on each thread) Use the local C pointer to dereference the array element directly

Source code:

shared int A[THREADS];
int foo(void) {
    A[MYTHREAD] = MYTHREAD;
}

Optimized code (-O3):

int foo() {
    threadspernode = xlupc_threadspernode();
    shr_address = xlupc_get_array_base_address(A);
    shr_address[(((mythread % 4) % threadspernode) * 4 + (mythread/4)*4)] = mythread;
    return; /* function */
}

43 3.1 Optimizing a Work Sharing Loop A UPC work sharing loop (upc_forall) is equivalent to a for loop where the execution of the loop body is controlled by the affinity expression Example:

Original upc_forall loop:

shared int a[N], b[N], c[N];
void foo () {
    int i;
    upc_forall (i=0; i<N; i++; &a[i])
        a[i] = b[i] + c[i];
}

Equivalent loop:

shared int a[N], b[N], c[N];
void foo() {
    int i;
    for (i=0; i<N; i++) {
        if (upc_threadof(&a[i]) == MYTHREAD)
            a[i] = b[i] + c[i];
    }
}

The affinity branch is evaluated by every thread on each loop iteration What is the performance implication of this translation scheme?

44 3.1 Optimizing a Work Sharing Loop To eliminate the affinity branch the compiler transforms the upc_forall loop into a loop nest comprising two for loops:

Original upc_forall loop:

shared int a[N], b[N], c[N];
void foo () {
    int i;
    upc_forall (i=0; i<N; i++; &a[i])
        a[i] = b[i] + c[i];
}

Equivalent (optimized) for loop:

shared int a[N], b[N], c[N];
void foo() {
    int i, j;
    for (i=MYTHREAD; i<N; i+=THREADS)
        for (j=i; j<=i; j++)
            a[j] = b[j] + c[j];
}

For each thread the outer loop cycles through each block of elements while the inner loop cycles through each element in a block (Diagram: with 2 threads and the default cyclic layout, thread 0 handles i = 0, 2, 4, ... and thread 1 handles i = 1, 3, 5, ...)

45 3.1 Optimizing a Work Sharing Loop Objective: remove the affinity branch overhead in upc_forall loops (Chart: Stream Triad, 20 million elements, address affinity; MB/s vs. threads for the unoptimized baseline and the optimized, strip-mined loop, showing an 8.5X improvement; affinity branch evaluation limits the scalability of the baseline.)

Original loop:

shared [BF] int A[N], B[N], C[N];
upc_forall (i=0; i < N; i++; &A[i])
    A[i] = B[i] + k*C[i];

Naïve translation (affinity branch evaluated on every iteration):

shared [BF] int A[N], B[N], C[N];
for (i=0; i < N; i++)
    if (upc_threadof(&A[i]) == MYTHREAD)
        A[i] = B[i] + k*C[i];

Optimized (strip-mined) loop:

shared [BF] int A[N], B[N], C[N];
for (i=MYTHREAD*BF; i<N; i+=THREADS*BF)
    for (j=i; j < i+BF; j++)
        A[j] = B[j] + k*C[j];

46 3.2 Locality Analysis for Work Sharing Loops (Diagram: shared array A laid out in blocks of 4 across 4 threads / 4 nodes; per block, 3 accesses are local and 1 is remote.)

shared [4] int A[2][16];
int main ( ) {
    int i, j;
    for (i=0; i<2; i++) {   // row traversal
        upc_forall (j=0; j<16; j++; &A[i][j]) {
            A[i][j+1] = 2;
        }
    }
}

Each block has 4 elements The upc_forall loop is used to traverse the inner array dimension Note the index [j+1], which causes ¼ of the accesses to fall into the next block Every 4th loop iteration we have one remote access; the first 3 accesses are shared local Can we somehow isolate the remote access from the local accesses?

47 3.2 Locality Analysis for Work Sharing Loops (Diagram: the same array A across 4 threads / 4 nodes, with each block split into region A1, where the access is local, and region A2, where it is remote.)

shared [4] int A[2][16];
int main ( ) {
    for (int i=0; i < 2; i++)
        upc_forall (int j=0; j < 16; j++; &A[i][j]) {
            if (j % 4 < 3)
                A[i][j+1] = 2;   // A1: region 1
            else
                A[i][j+1] = 2;   // A2: region 2
        }
}

The compiler identifies the point in the loop iteration space where the locality of shared accesses changes The compiler then splits the loop body into 2 regions + Region 1: shared access is always local + Region 2: shared access is always remote Now shared local accesses (region 1) can be privatized, and shared remote accesses can be coalesced (more on this later)

48 3.3 Optimizations for Local Accesses After the compiler has proven (via locality analysis) that a shared array access is local, shared object access privatization can be applied:

Original shared access:

upc_forall (int j=0; j < 16; j++; &A[j]) {
    if (j % 4 < 3)
        A[j+1] = 2;   // shared local
    else
        A[j+1] = 2;   // shared remote
}

Privatized:

A_address = xlupc_get_array_address(A);
upc_forall (int j=0; j < 16; j++; &A[j]) {
    if (j % 4 < 3)
        A_address[new_index] = 2;   // privatized
    else
        A[j+1] = 2;                 // shared remote
}

The compiler translates the shared local access so that no call to the UPC runtime system is necessary (no call overhead): the base address of the shared array is retrieved outside the loop, the index into the local portion of the shared array is computed, and the shared array access becomes just a local memory dereference

49 3.3 Optimizations for Remote Accesses Locality analysis tells us whether a shared reference is local or remote Local shared references are privatized; how can remote shared references be optimized?

Example of remote accesses:

shared [BF] int A[N], B[N+BF];
void loop() {
    upc_forall(int i=0; i<N; i++; &A[i])
        A[i] = B[i+BF];
}

Observations (1) A[i] is shared local, therefore the compiler privatizes the access (2) B[i+BF] is always remote (for every i), so it cannot be privatized In order to read the elements of shared array B the compiler can generate a call to the UPC runtime system for each element of B What is the performance implication of this translation scheme? Can we reduce the number of calls required to read the elements of array B?

50 3.3 Optimizations for Remote Accesses The compiler transforms the parallel loop to prefetch multiple elements of array B (reducing the number of messages required)

Original parallel loop:

#define BF (N/THREADS)
shared [BF] int A[N], B[N+BF];
void loop() {
    upc_forall(int i=0; i<N; i++; &A[i])
        A[i] = B[i+BF];
}

Optimized parallel loop:

#define BF (N/THREADS)
shared [BF] int A[N], B[N+BF];
void loop() {
    upc_forall(int i=0; i<THREADS; i++; &A[BF*i]) {
        int stride = sizeof(int);   // prefetch stride
        int nelems = BF;            // n. of elems to prefetch
        xlupc_ptrderef_strided(&B[BF*i+BF], &B_buffer, sizeof(int), stride, nelems);
        for (int j=BF*i; j < min(BF*i+BF, N-BF); j++)
            A[j] = B_buffer[j - BF*i];
    }
}

The outer loop reads BF elements from B (one message) and fills in a buffer The inner loop consumes the buffer; the communication cost is reduced by roughly a factor of BF

51 3.3 Optimizations for Remote Accesses (Sobel kernel with coalescing; the coalesce call is managed by the compiler and scheduled outside the inner loop, fetching into buf the row data owned by thread MYTHREAD+1)

shared [ROWS*COLUMNS/THREADS] BYTE orig[ROWS][COLUMNS];

upc_forall (i=1; i < ROWS-1; i++; &orig[i][0])
    for (k=0; k < nblocks; k++) {
        coalesce_call(orig, &buf, index, offset, stride, bsize);
        l = k * bsize;
        for (j=0; j < bsize; j++) {
            gx  = (int) orig[i-1][j+l+2] - orig[i-1][j+l];
            gx += ((int) orig[i][j+l+2] - orig[i][j+l]) * 2;
            gx += (int) buf[j+2] - buf[j];
            gy  = (int) buf[j] - orig[i-1][j+l];
            gy += ((int) buf[j+1] - orig[i-1][j+l+1]) * 2;
            gy += (int) buf[j+2] - orig[i-1][j+l+2];
            edge[i][j+l+1] = (BYTE) gradient;
        }
    }

52 Sobel Edge Detection banded layout (Charts: Sobel execution time in seconds vs. number of nodes, image size 4096 X 4096, lower is better, comparing No Opt., Privatization, Coalescing and Full Opt.; left: distributed, 1 thread per node, with a +5.7X coalescing callout; right: hybrid, 16 threads per node, with a +130X callout.) Distributed: Coalescing accounts for a 3X improvement Hybrid: Coalescing accounts for a 74X improvement

53 3.3 Optimizing Commonly Used Code Patterns This transformation detects common initialization code patterns and changes them into one of the 4 string handling functions provided by the UPC standard (upc_memset, upc_memget, upc_memput, upc_memcpy). The compiler is also able to recognize non-shared initialization idioms and change them to one of the C standard string handling functions

Source code snippet:

#define N 64000
#define BF 512
shared [BF] int A[N];
shared [BF] int B[N];
int a[N], b[N];
void func(void) {
    for (int i = 0; i < N; i++) {
        A[i] = a[i];
        b[i] = B[i];
    }
}

Code generated by the XLUPC compiler (-O2), after array-init idiom recognition (excerpt):

do {
    do {
        ...
        upc_memput(&A + 4ll*@CIV1, &a + 4ll*@CIV1, 2048ll);
        upc_memget(&b + 4ll*@CIV1, &B + 4ll*@CIV1, 2048ll);
        ...
    } while (@CIV1 < min(@BCIV2 * 512ll + 512ll, 64000ll));
    ...
} while (@BCIV2 < 125ll);

54 3.4 Optimizing Shared Accesses through pointers In order to determine whether a shared access in a parallel loop is local or remote, locality analysis has to compare it to the loop affinity expression (to determine its distance) If the access is through a pointer to shared memory, the distance computation depends on where the pointer points

Shared accesses via pointers-to-shared:

shared [BF] int A[N], B[N];
void foo(shared [BF] int *p1, shared [BF] int *p2) {
    upc_forall (int i=0; i<N; ++i; &p1[i])
        p1[i] = p2[i+1];
}

int main() {
    foo (&A[0], &B[0]);     // call site 1
    foo (&A[0], &B[BF-1]);  // call site 2
}

Observations (1) p1[i] is trivially a shared local access for every value of i, because the affinity expression is &p1[i] (2) What is the distance of p2[i+1] from the affinity expression? Call site 1: p2 points to the first element of B; the relative distance is 1 (out of each BF accesses, 1 is remote) Call site 2: p2 points to the last element of the 1st block of B; the relative distance is BF (all accesses are remote) (3) How can we privatize p2[i+1] to handle the call from site 1 without breaking the call at site 2?

55 3.4 Optimizing Shared Accesses through pointers Key insight from the prior example: privatization needs to be sensitive to different calling contexts (call site 1 vs. call site 2) The approach we implemented is to create different versions of the parallel loop based on the value of a variable (ver_p2 below)

Shared accesses via pointers-to-shared:

shared [BF] int A[N], B[N];
void foo(shared [BF] int *p1, shared [BF] int *p2) {
    upc_forall (int i=0; i<N; ++i; &p1[i])
        p1[i] = p2[i+1];
}

int main() {
    foo (&A[0], &B[0]);     // case 1
    foo (&A[0], &B[BF-1]);  // case 2
}

Optimized parallel loop:

void foo(shared [BF] int *p1, shared [BF] int *p2) {
    _Bool ver_p2 = (upc_phaseof(p2)==0) ? 1 : 0;
    if (ver_p2) {   // p2 points to the first elem. in a block
        upc_forall (int i=0; i<N; i++; &p1[i])
            p1[i] = p2[i+1];   // locality analysis possible
    } else {
        upc_forall (int i=0; i<N; i++; &p1[i])
            p1[i] = p2[i+1];   // p2[] will not be privatized
    }
}

56 Research Directions Advanced Prefetching Optimizations Applicable to for loops as well as upc_forall loops Hybrid in nature: Collection phase: shared accesses in the loop are described to the runtime system Scheduling phase: the runtime analyzes access patterns, coalesces related accesses and prefetches them Execution phase: prefetched buffers are consumed in the loop Applicable in a dynamic execution environment (threads not known at compile time) (Chart: throughput in MB/sec vs. threads for the loop A[i] = B[i+1] + C[i+1], millions of iterations, 228 MB, Power 7 hardware, 1 node, RHEL6; the prefetched version achieves a ~15X speedup over the -O3 baseline.)

57 Research Directions Advanced Runtime Diagnostics (-qcheck) The compiler instruments the UPC application; errors are detected while the application runs (Diagram: the UPC program passes through the UPC compiler (code analysis, code optimizer, code generator, UPC checker), producing an instrumented executable that reports to the PGAS runtime diagnostician through a diagnostic API interface.)

1  #include <upc.h>
2  #include <upc_collective.h>
3  #define BF
4
5  shared [BF] int A[BF * THREADS], *ptr;
6  shared [BF*THREADS] int B[THREADS][BF*THREADS];
7  shared int sh_num_bytes;
8
9  int main() {
10     if (sh_num_bytes>=0) ptr = &A[2*BF];
11     else ptr = A;
12     upc_all_gather_all(B, ptr, sizeof(int)*BF, UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC);
13     return 0;
14 }

% xlupc -qcheck coll_diag.upc -qupc=threads=4
% ./a.out -procs 4
Invalid target affinity detected at line 12 in file 'coll_diag.upc'. Parameter 2 of function 'upc_all_gather_all' should have affinity to thread 0.

58 Research Directions UPC and MPI interoperability

#include <upc.h>
#include "mpi.h"

extern int xlpgas_tsp_thrd_id (void);

shared int BigArray[THREADS];

int main(int argc, char ** argv) {
    // We are in a UPC program, but we can initialize MPI:
    int mpirank, mpisize, result;
    if (xlpgas_tsp_thrd_id()==0) MPI_Init (&argc, &argv);
    upc_barrier;
    MPI_Comm_rank (MPI_COMM_WORLD, &mpirank);
    MPI_Comm_size (MPI_COMM_WORLD, &mpisize);

    // let's dump our MPI and UPC identification:
    printf ("I'm UPC thread %d of %d and MPI task %d of %d\n",
            MYTHREAD, THREADS, mpirank, mpisize);

    // let's do something fun with MPI.
    BigArray[MYTHREAD] = MYTHREAD + 1;
    int * localBigArray = (int *) &BigArray[MYTHREAD];
    MPI_Allreduce(localBigArray, &result, 1, MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD);
    if (MYTHREAD==0) printf ("The sum = %d\n", result);

    upc_barrier;
    if (xlpgas_tsp_thrd_id()==0) MPI_Finalize();
}

% ./a.out -msg_api pgas,mpi -procs 4
I'm UPC thread 0 of 4 and MPI task 0 of 4
I'm UPC thread 1 of 4 and MPI task 1 of 4
I'm UPC thread 2 of 4 and MPI task 2 of 4
I'm UPC thread 3 of 4 and MPI task 3 of 4
The sum = 10

59 IBM Software Group Rational Thank You!!! Ettore Tiotto IBM Toronto Laboratory


More information

Introduction to MPI. Ekpe Okorafor. School of Parallel Programming & Parallel Architecture for HPC ICTP October, 2014

Introduction to MPI. Ekpe Okorafor. School of Parallel Programming & Parallel Architecture for HPC ICTP October, 2014 Introduction to MPI Ekpe Okorafor School of Parallel Programming & Parallel Architecture for HPC ICTP October, 2014 Topics Introduction MPI Model and Basic Calls MPI Communication Summary 2 Topics Introduction

More information

Lecture 15. Parallel Programming Languages

Lecture 15. Parallel Programming Languages Lecture 15 Parallel Programming Languages Announcements Wednesday s office hour is cancelled 11/7/06 Scott B. Baden/CSE 160/Fall 2006 2 Today s lecture Revisiting processor topologies in MPI Parallel programming

More information

OpenMP 4.5: Threading, vectorization & offloading

OpenMP 4.5: Threading, vectorization & offloading OpenMP 4.5: Threading, vectorization & offloading Michal Merta michal.merta@vsb.cz 2nd of March 2018 Agenda Introduction The Basics OpenMP Tasks Vectorization with OpenMP 4.x Offloading to Accelerators

More information

Parallel programming models. Prof. Rich Vuduc Georgia Tech Summer CRUISE Program June 6, 2008

Parallel programming models. Prof. Rich Vuduc Georgia Tech Summer CRUISE Program June 6, 2008 Parallel programming models Prof. Rich Vuduc Georgia Tech Summer CRUISE Program June 6, 2008 1 Attention span In conclusion 5 min 10 min 50 min Patterson s Law of Attention Span 2 Problem parallel algorithm

More information

A Characterization of Shared Data Access Patterns in UPC Programs

A Characterization of Shared Data Access Patterns in UPC Programs IBM T.J. Watson Research Center A Characterization of Shared Data Access Patterns in UPC Programs Christopher Barton, Calin Cascaval, Jose Nelson Amaral LCPC `06 November 2, 2006 Outline Motivation Overview

More information

Message Passing Interface

Message Passing Interface Message Passing Interface DPHPC15 TA: Salvatore Di Girolamo DSM (Distributed Shared Memory) Message Passing MPI (Message Passing Interface) A message passing specification implemented

More information

OpenMP Tutorial. Seung-Jai Min. School of Electrical and Computer Engineering Purdue University, West Lafayette, IN

OpenMP Tutorial. Seung-Jai Min. School of Electrical and Computer Engineering Purdue University, West Lafayette, IN OpenMP Tutorial Seung-Jai Min (smin@purdue.edu) School of Electrical and Computer Engineering Purdue University, West Lafayette, IN 1 Parallel Programming Standards Thread Libraries - Win32 API / Posix

More information

UPC: A Portable High Performance Dialect of C

UPC: A Portable High Performance Dialect of C UPC: A Portable High Performance Dialect of C Kathy Yelick Christian Bell, Dan Bonachea, Wei Chen, Jason Duell, Paul Hargrove, Parry Husbands, Costin Iancu, Wei Tu, Mike Welcome Parallelism on the Rise

More information

LAPI on HPS Evaluating Federation

LAPI on HPS Evaluating Federation LAPI on HPS Evaluating Federation Adrian Jackson August 23, 2004 Abstract LAPI is an IBM-specific communication library that performs single-sided operation. This library was well profiled on Phase 1 of

More information

Agenda. Peer Instruction Question 1. Peer Instruction Answer 1. Peer Instruction Question 2 6/22/2011

Agenda. Peer Instruction Question 1. Peer Instruction Answer 1. Peer Instruction Question 2 6/22/2011 CS 61C: Great Ideas in Computer Architecture (Machine Structures) Introduction to C (Part II) Instructors: Randy H. Katz David A. Patterson http://inst.eecs.berkeley.edu/~cs61c/sp11 Spring 2011 -- Lecture

More information

6.189 IAP Lecture 5. Parallel Programming Concepts. Dr. Rodric Rabbah, IBM IAP 2007 MIT

6.189 IAP Lecture 5. Parallel Programming Concepts. Dr. Rodric Rabbah, IBM IAP 2007 MIT 6.189 IAP 2007 Lecture 5 Parallel Programming Concepts 1 6.189 IAP 2007 MIT Recap Two primary patterns of multicore architecture design Shared memory Ex: Intel Core 2 Duo/Quad One copy of data shared among

More information

Parallel Programming in C with MPI and OpenMP

Parallel Programming in C with MPI and OpenMP Parallel Programming in C with MPI and OpenMP Michael J. Quinn Chapter 4 Message-Passing Programming Learning Objectives Understanding how MPI programs execute Familiarity with fundamental MPI functions

More information

Memory management. Johan Montelius KTH

Memory management. Johan Montelius KTH Memory management Johan Montelius KTH 2017 1 / 22 C program # include int global = 42; int main ( int argc, char * argv []) { if( argc < 2) return -1; int n = atoi ( argv [1]); int on_stack

More information

COMP Parallel Computing. SMM (2) OpenMP Programming Model

COMP Parallel Computing. SMM (2) OpenMP Programming Model COMP 633 - Parallel Computing Lecture 7 September 12, 2017 SMM (2) OpenMP Programming Model Reading for next time look through sections 7-9 of the Open MP tutorial Topics OpenMP shared-memory parallel

More information

ROSE-CIRM Detecting C-Style Errors in UPC Code

ROSE-CIRM Detecting C-Style Errors in UPC Code ROSE-CIRM Detecting C-Style Errors in UPC Code Peter Pirkelbauer 1 Chunhuah Liao 1 Thomas Panas 2 Daniel Quinlan 1 1 2 Microsoft Parallel Data Warehouse This work was funded by the Department of Defense

More information

Programming Scalable Systems with MPI. Clemens Grelck, University of Amsterdam

Programming Scalable Systems with MPI. Clemens Grelck, University of Amsterdam Clemens Grelck University of Amsterdam UvA / SurfSARA High Performance Computing and Big Data Course June 2014 Parallel Programming with Compiler Directives: OpenMP Message Passing Gentle Introduction

More information

Arrays and Pointers in C. Alan L. Cox

Arrays and Pointers in C. Alan L. Cox Arrays and Pointers in C Alan L. Cox alc@rice.edu Objectives Be able to use arrays, pointers, and strings in C programs Be able to explain the representation of these data types at the machine level, including

More information

OpenMP: Open Multiprocessing

OpenMP: Open Multiprocessing OpenMP: Open Multiprocessing Erik Schnetter June 7, 2012, IHPC 2012, Iowa City Outline 1. Basic concepts, hardware architectures 2. OpenMP Programming 3. How to parallelise an existing code 4. Advanced

More information