XLUPC: Developing an optimizing compiler for the UPC Programming Language


1 IBM Software Group Rational XLUPC: Developing an optimizing compiler for the UPC Programming Language Ettore Tiotto IBM Toronto Laboratory

2 IBM Disclaimer Copyright IBM Corporation. All rights reserved. The information contained in these materials is provided for informational purposes only, and is provided AS IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, these materials. Nothing contained in these materials is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software. References in these materials to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. Product release dates and/or capabilities referenced in these materials may change at any time at IBM's sole discretion based on market opportunities or other factors, and are not intended to be a commitment to future product or feature availability in any way. IBM, the IBM logo, Rational, the Rational logo, Telelogic, the Telelogic logo, and other IBM products and services are trademarks of the International Business Machines Corporation, in the United States, other countries or both. Other company, product, or service names may be trademarks or service marks of others. Please Note: IBM's statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM's sole discretion. Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision. The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract. The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.

3 Outline 1. Overview of the PGAS programming model 2. The UPC Programming Language 3. Compiler optimizations for PGAS Languages

4 IBM Software Group Rational 1. Overview of the PGAS programming model Ettore Tiotto IBM Toronto Laboratory

5 Partitioned Global Address Space Explicitly parallel, shared-memory-like programming model Global address space: allows programmers to declare and directly access data distributed across machines Partitioned address space: memory is logically partitioned between local and remote (a two-level hierarchy) Forces the programmer to pay attention to data locality, by exposing the inherent NUMA-ness of current architectures Single Program Multiple Data (SPMD) execution model: all threads of control execute the same program Number of threads fixed at startup Newer languages such as X10 escape this model, allowing fine-grain threading Different language implementations: UPC (C-based), Coarray Fortran (Fortran-based), Titanium and X10 (Java-based)

6 Partitioned Global Address Space Computation is performed in multiple places. A place contains data that can be operated on remotely. Data lives in the place it was created, for its lifetime. A datum in one place may point to a datum in another place. Data structures (e.g. arrays) may be distributed across many places. Places may have different computational properties (mapping to a hierarchy of compute engines). A place expresses locality. (Diagram: process/thread address spaces under three models: message passing with MPI, shared memory with OpenMP, and PGAS with UPC, CAF, X10.)

7 UPC Execution Model A number of threads working independently in a SPMD fashion Number of threads available as program variable THREADS MYTHREAD specifies thread index (0..THREADS-1) upc_barrier is a global synchronization: all wait There is a form of worksharing loop that we will see later Two compilation modes Static threads mode: THREADS is specified at compile time by the user (compiler option) The program may use THREADS as a compile-time constant The compiler can generate more efficient code Dynamic threads mode: Compiled code may be run with varying numbers of threads THREADS is specified at run time by the user (via env. variable)
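As a minimal sketch of the two modes with the xlupc driver (the -qupc=threads option and the UPC_NTHREADS environment variable both appear later in these slides; the file name is illustrative):

# Static threads mode: THREADS is a compile-time constant (4 here)
xlupc -qupc=threads=4 hello.upc
./a.out

# Dynamic threads mode: the same binary runs with any thread count
xlupc hello.upc
env UPC_NTHREADS=8 ./a.out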

8 Hello World in UPC Any legal C program is also a legal UPC program If you compile and run it as UPC with N threads, it will run N copies of the program (Single Program executed by all threads).

#include <upc.h>
#include <stdio.h>

int main() {
    printf("Thread %d of %d: Hello UPC world\n", MYTHREAD, THREADS);
    return 0;
}

hello > xlupc helloworld.upc
hello > env UPC_NTHREADS=4 ./a.out
Thread 1 of 4: Hello UPC world
Thread 0 of 4: Hello UPC world
Thread 3 of 4: Hello UPC world
Thread 2 of 4: Hello UPC world

9 IBM Software Group Rational 2. The UPC Programming Language Ettore Tiotto IBM Toronto Laboratory Some slides adapted with permission from Kathy Yelick

10 An Overview of the UPC Language Private vs. Shared Variables in UPC Shared Array Declaration and Layouts Parallel Work sharing Loop Synchronization Pointers in UPC Shared data transfer and collectives

11 Private vs. Shared Variables in UPC Normal C variables and objects are allocated in the private memory space for each thread (thread-local variables) Shared variables are allocated only once, on thread 0

shared int ours;
int mine;

Shared variables may not be declared automatic, i.e., may not occur in a function definition, except as static. (Diagram: in the global address space, ours lives in the shared section with affinity to thread 0, while each of threads 0..n keeps its own private copy of mine.)

12 Shared Arrays Shared arrays are distributed over the threads, in a cyclic fashion

shared int x[THREADS];     /* 1 element per thread      */
shared int y[3][THREADS];  /* 3 elements per thread     */
shared int z[3][3];        /* 2 or 3 elements per thread */

Assuming THREADS = 4 (in the diagram of x, y and z, colors map to threads): think of a linearized C array, then map it round-robin on THREADS

13 Distribution of a shared array in UPC Elements are distributed in block-cyclic fashion Each thread owns blocks of adjacent elements

shared [2] int X[10];   /* blocking factor 2 */

With 2 threads, the owners of X[0..9] are: th0 th0 th1 th1 th0 th0 th1 th1 th0 th0

14 Layouts in General All non-array objects have affinity with thread zero. Array layouts are controlled by layout specifiers: Empty (cyclic layout) [*] (blocked layout) [0] or [] (indefinite layout, all on 1 thread) [b] or [b1][b2]...[bn] = [b1*b2*...*bn] (fixed block size) The affinity of an array element is defined in terms of the block size (a compile-time constant) and THREADS: element i has affinity with thread (i / block_size) % THREADS In 2D and higher, linearize the elements as in a C representation, and then use the above mapping
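A small worked check of the affinity formula against the shared [2] int X[10] example from the previous slide (a sketch; the owner() helper is illustrative and not part of UPC):

#include <stdio.h>

/* Owner thread of element i, per (i / block_size) % THREADS */
static int owner(int i, int block_size, int nthreads) {
    return (i / block_size) % nthreads;
}

int main(void) {
    /* shared [2] int X[10] with THREADS == 2 */
    for (int i = 0; i < 10; i++)
        printf("X[%d] -> thread %d\n", i, owner(i, 2, 2));
    return 0;   /* prints the pattern th0 th0 th1 th1 th0 th0 th1 th1 th0 th0 */
}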

15 Physical layout of shared arrays

shared [2] int X[10];   /* 2 threads */

Logical distribution (owner of X[0..9]): th0 th0 th1 th1 th0 th0 th1 th1 th0 th0
Physical distribution (elements grouped by owner): thread 0 holds X[0], X[1], X[4], X[5], X[8], X[9]; thread 1 holds X[2], X[3], X[6], X[7]

16 Work Sharing with upc_forall() UPC adds a special type of loop: upc_forall(init; test; loop; affinity) statement; The affinity expression indicates which iterations to run on each thread. It may have one of two types: Integer: affinity % THREADS == MYTHREAD, e.g. upc_forall(i=0; i<N; ++i; i) Address: upc_threadof(affinity) == MYTHREAD, e.g. upc_forall(i=0; i<N; ++i; &A[i]) The programmer indicates the iterations are independent Undefined if there are dependencies across threads No early exit from or entry into the loop is allowed

17 Work Sharing with upc_forall() Similar to a C for loop; the 4th field indicates the affinity The thread that owns element A[i] executes iteration i

shared int A[10], B[10], C[10];
upc_forall(i=0; i < 10; i++; &A[i]) {
    A[i] = B[i] + C[i];
}

(Diagram: with 2 threads and the default cyclic layout, thread 0 executes the even iterations and thread 1 the odd iterations; no communication is needed.)

18 Vector Addition with upc_forall

#define N 100*THREADS
shared int v1[N], v2[N], sum[N];

void main() {
    int i;
    upc_forall(i=0; i<N; i++; i)
        sum[i] = v1[i] + v2[i];
}

Equivalent code could use &sum[i] for affinity The code would be correct but slow if the affinity expression were i+1 rather than i. Why?

19 UPC Global Synchronization Synchronization between threads is necessary to control the relative execution of threads The programmer is responsible for synchronization UPC has two basic forms of global barriers: Barrier: block until all other threads arrive (upc_barrier) Split-phase barrier: upc_notify; (this thread is ready for the barrier), then perform local computation, then upc_wait; (wait for others to be ready)
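A minimal sketch of the split-phase barrier, overlapping local work with the synchronization (the printf stands in for arbitrary local computation):

#include <upc.h>
#include <stdio.h>

int main() {
    upc_notify;   /* signal: this thread has reached the barrier point */
    /* local work that does not touch other threads' shared data */
    printf("thread %d doing local work while others catch up\n", MYTHREAD);
    upc_wait;     /* complete the barrier: block until all threads have notified */
    return 0;
}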

20 Recap: Shared Variables, Work sharing and Synchronization With what you've seen until now, you can write a bare-bones data-parallel program Shared variables are distributed and visible to all threads Shared scalars have affinity to thread 0 Shared arrays are distributed (cyclically by default) on all threads Execution model is SPMD, with the upc_forall loop provided to share work Barriers and split-phase barriers provide global synchronization

21 Pointers to Shared vs. Arrays In the C tradition, arrays can be accessed through pointers Here is a vector addition example using pointers

#define N 100*THREADS
shared int v1[N], v2[N], sum[N];

void main() {
    int i;
    shared int *p1, *p2;
    p1 = &v1[0]; p2 = &v2[0];
    upc_forall(i=0; i<N; i++; &sum[i]) {
        sum[i] = p1[i] + p2[i];
    }
}

(Diagram: p1 points to the start of v1, p2 to the start of v2.)

22 Common Uses for UPC Pointer Types int *p1; These pointers are fast (just like C pointers) Use to access local data in part of code performing local work Often cast a pointer-to-shared to one of these to get faster access to shared data that is local shared int *p2; Use to refer to remote data Larger and slower due to test-for-local + possible communication int * shared p3; Not recommended Why not? shared int * shared p4; Use to build shared linked structures, e.g., a linked list

23 UPC Pointers Where does the pointer point to, and where does the pointer reside?

                            Points to private (local)   Points to shared
Resides in private (local)  PP (p1)                      SP (p2)
Resides in shared           PS (p3)                      SS (p4)

int *p1;                /* private pointer to local memory */
shared int *p2;         /* private pointer to shared space */
int *shared p3;         /* shared pointer to local memory  */
shared int *shared p4;  /* shared pointer to shared space  */

Shared to private is not recommended. Why?

24 UPC Pointers (Diagram: in the global address space, p3 and p4 reside in the shared section, while each of threads 0..n keeps its own p1 and p2 in its private section.)

int *p1;                /* private pointer to local memory */
shared int *p2;         /* private pointer to shared space */
int *shared p3;         /* shared pointer to local memory  */
shared int *shared p4;  /* shared pointer to shared space  */

Pointers to shared often require more storage and are more costly to dereference than local pointers; they may refer to local or remote memory.

25 UPC Pointers Pointer arithmetic supports blocked and non-blocked array distributions Casting of shared to private pointers is allowed, but not vice versa! When casting a pointer-to-shared to a private pointer, the thread number of the pointer-to-shared may be lost Casting of shared to private is well defined only if the object pointed to by the pointer-to-shared has affinity with the thread performing the cast

shared [BF] int A[N];
shared [BF] int *p = &A[BF*MYTHREAD];
int *lp = (int *) p;   // cast to a local pointer (careful here!)
for (i=0; i<BF; i++)
    lp[i] = i;         // valid _only_ if p points to the local portion of the array!

26 Data movement Fine-grain (element-by-element) accesses are easy to program in an imperative way. However, especially on distributed-memory machines, block transfers are more efficient UPC provides library functions for shared data movement: upc_memset sets a block of values in shared memory upc_memget and upc_memput transfer blocks of data between shared memory and private memory upc_memcpy transfers blocks of data from shared memory to shared memory
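A minimal sketch of a bulk transfer with upc_memget, pulling one thread's block into a private buffer with a single call instead of BF fine-grain accesses (array names and sizes are illustrative):

#include <upc.h>

#define BF 256
#define N (BF*THREADS)

shared [BF] int data[N];   /* one block of BF ints per thread */
int local_block[BF];       /* private destination buffer      */

int main() {
    /* fetch thread 1's block; all BF elements have affinity to thread 1 */
    if (THREADS > 1)
        upc_memget(local_block, &data[1*BF], BF * sizeof(int));
    return 0;
}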

27 Data movement Collectives Used to move shared data across threads:
- upc_all_broadcast(shared void* dst, shared void* src, size_t nbytes, ...) A single thread copies a block of memory it owns and sends it to all threads
- upc_all_scatter(shared void* dst, shared void* src, size_t nbytes, ...) A single thread splits a block of memory it owns and sends each piece to a different thread
- upc_all_gather(shared void* dst, shared void* src, size_t nbytes, ...) Each thread copies a block of memory it owns and sends it to a single thread
- upc_all_gather_all(shared void* dst, shared void* src, size_t nbytes, ...) Each thread copies a block of memory it owns and sends it to all threads
- upc_all_exchange(shared void* dst, shared void* src, size_t nbytes, ...) Each thread splits a block of memory it owns into pieces and sends one piece to each thread
- upc_all_permute(shared void* dst, shared void* src, shared int* perm, size_t nbytes, ...) Each thread i copies a block of memory it owns and sends it to thread perm[i]

28 upc_all_broadcast (Diagram: thread 0's src block is copied into the dst block of every thread.) Thread 0 copies a block of memory and sends it to all threads
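A sketch of how the broadcast above might be invoked (NELEMS and the array names are illustrative):

#include <upc.h>
#include <upc_collective.h>

#define NELEMS 4

shared [] int src[NELEMS];                /* source block, entirely on thread 0 */
shared [NELEMS] int dst[NELEMS*THREADS];  /* one destination block per thread   */

int main() {
    if (MYTHREAD == 0)
        for (int i = 0; i < NELEMS; i++) src[i] = i;

    /* every thread receives thread 0's block into its own block of dst;
       UPC_IN_ALLSYNC ensures data is not read before all threads arrive */
    upc_all_broadcast(dst, src, NELEMS * sizeof(int),
                      UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC);
    return 0;
}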

29 upc_all_scatter (Diagram: thread 0's src is split into blocks and block i lands in thread i's dst.) Thread 0 sends a unique block to each thread

30 upc_all_gather (Diagram: each thread's src block is collected into thread 0's dst.) Each thread sends a block to thread 0

31 Computational Collectives Used to perform data reductions
- upc_all_reduceT(shared void* dst, shared void* src, upc_op_t op, ...)
- upc_all_prefix_reduceT(shared void* dst, shared void* src, upc_op_t op, ...)
One version for each type T (22 versions in total) Many operations are supported; OP can be: +, *, &, |, xor, &&, ||, min, max Perform OP on all elements of the src array and place the result in the dst array
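A minimal sketch using the signed-int variant, upc_all_reduceI, to sum an array distributed in blocks (the sizes and names are illustrative):

#include <upc.h>
#include <upc_collective.h>
#include <stdio.h>

#define BF 4

shared [BF] int src[BF*THREADS];
shared int result;   /* reduction result, affinity to thread 0 */

int main() {
    for (int i = 0; i < BF; i++)
        src[MYTHREAD*BF + i] = 1;   /* each thread fills its own block */

    /* UPC_ADD is the '+' operation; the function pointer is only needed
       for user-defined operations, so NULL is passed here */
    upc_all_reduceI(&result, src, UPC_ADD, BF*THREADS, BF,
                    NULL, UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC);

    if (MYTHREAD == 0)
        printf("sum = %d\n", result);   /* expect BF*THREADS */
    return 0;
}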

32 upc_all_reduce (Diagram: each thread contributes the elements of its src block; the combined result, e.g. 45, is stored in dst on thread 0.) Threads perform partial sums; the partial sums are added and the result is stored on thread 0

33 Simple sum: Introduction Simple forall loop to add values into a sum

shared int values[N], sum;

sum = 0;
upc_forall (int i=0; i<N; i++; &values[i])
    sum += values[i];

Is there a problem with this code? The implementation is broken: sum is not guarded by a lock The concurrent read-modify-write updates to sum race with each other; you will get different answers every time

34 Synchronization primitives We have seen upc_barrier, upc_notify and upc_wait UPC supports locks: Represented by an opaque type: upc_lock_t Must be allocated before use: upc_lock_t *upc_all_lock_alloc(void); collective, allocates 1 lock and returns the same pointer to all threads upc_lock_t *upc_global_lock_alloc(void); non-collective, allocates 1 lock and returns the pointer to the calling thread To use a lock: void upc_lock(upc_lock_t *l) void upc_unlock(upc_lock_t *l) use at the start and end of a critical region Locks can be freed when not in use: void upc_lock_free(upc_lock_t *ptr);
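A sketch that repairs the simple sum from the previous slide by protecting the update with a lock (N and the initialization are illustrative; per-thread partial sums or upc_all_reduceI would be faster than locking on every iteration):

#include <upc.h>
#include <stdio.h>

#define N (100*THREADS)

shared int values[N];
shared int sum;          /* affinity to thread 0, zero-initialized */
upc_lock_t *sum_lock;    /* every thread ends up with the same lock pointer */

int main() {
    sum_lock = upc_all_lock_alloc();   /* collective allocation */

    upc_forall (int i = 0; i < N; i++; &values[i])
        values[i] = 1;
    upc_barrier;

    upc_forall (int i = 0; i < N; i++; &values[i]) {
        upc_lock(sum_lock);            /* enter critical region */
        sum += values[i];
        upc_unlock(sum_lock);          /* leave critical region */
    }
    upc_barrier;

    if (MYTHREAD == 0) {
        printf("sum = %d\n", sum);     /* expect N */
        upc_lock_free(sum_lock);
    }
    return 0;
}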

35 Dynamic memory allocation As in C, memory can be dynamically allocated UPC provides several memory allocation routines to obtain space in the shared heap shared void* upc_all_alloc(size_t nblocks, size_t nbytes) a collective operation that allocates memory on all threads; layout equivalent to: shared [nbytes] char[nblocks * nbytes] shared void* upc_global_alloc(size_t nblocks, size_t nbytes) a non-collective operation, invoked by one thread to allocate memory on all threads; layout equivalent to: shared [nbytes] char[nblocks * nbytes] shared void* upc_alloc(size_t nbytes) a non-collective operation to obtain memory in the thread's shared section of memory void upc_free(shared void *ptr) a non-collective operation to free data allocated in shared memory Why do we need just one version?
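A minimal sketch with upc_all_alloc and upc_free (the BLOCK size is illustrative):

#include <upc.h>

#define BLOCK 256

shared [BLOCK] int *A;   /* private pointer-to-shared, one copy per thread */

int main() {
    /* collective: every thread makes the same call and receives a pointer
       to the same shared array of THREADS blocks of BLOCK ints             */
    A = (shared [BLOCK] int *) upc_all_alloc(THREADS, BLOCK * sizeof(int));

    A[MYTHREAD * BLOCK] = MYTHREAD;   /* first element of my block is local */

    upc_barrier;
    if (MYTHREAD == 0)
        upc_free((shared void *)A);   /* any single thread may free it      */
    return 0;
}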

36 IBM Software Group Rational 3. Compiler Optimizations for PGAS Languages Ettore Tiotto IBM Toronto Laboratory

37 XLUPC Compiler Architecture Built to leverage the IBM XLC technology Common Front End for C and UPC (extends C technology to UPC, generates WCode IL) Common High Level Optimizer (common infrastructure with C, UPC-specific optimizations developed) Common Low Level Optimizer (largely unmodified, UPC artifacts handled upstream) PGAS Runtime (exploitation of P7IH specialized network hardware)

38 XLUPC Optimizer Architecture

Data and Control Flow Optimizer / Loop Optimizer phases: Constant Propagation, Dead Code Elimination, Loop Unrolling, Copy Propagation, Expression Simplification, Loop Normalization, Loop Unswitching, Dead Store Elimination, Backward and Forward Store Motion, Redundant Condition Elimination, Traditional Loop Optimizations, Shared Objects Coalescing, Shared Objects Privatization, Remote Update, Shared Array Idioms Optimization, Prefetching of Remote Shared Data, Parallel Work-Sharing Optimization

Optimization Strategy:
1) Remove overhead: Parallel Work-Sharing Loop Optimization, Shared Pointer Arithmetic Inlining
2) Exploit locality information: Shared Objects Privatization
3) Reduce communication requirements: Shared Objects Coalescing, Remote Update, Shared Array Idioms Recognition
4) P7IH exploitation (through PAMI)

Code Generation: UPC Code Gen, Thread Local Storage
PGAS Runtime System: PAMI (HFI Immediate Send Instruction support, Collective Acceleration Unit support, HFI remote updates support, PAMI network endpoints)

39 XL UPC Compiler optimizations 3.0 Translation of shared object accesses Preview of compiler optimized code 3.1 Optimizing a Work Sharing Loop Removing the affinity branch in a work sharing loop 3.2 Locality Analysis for Work Sharing Loops 3.3 Shared Array Optimizations Optimizations for Local Accesses (privatization) Optimizations for Remote Accesses (coalescing) Optimizing Commonly Used Code Patterns 3.4 Optimizing Shared Accesses through UPC pointers Optimizing Shared Accesses through pointers (versioning)

40 3.0 Translation of shared object accesses Achieved by translating shared accesses into calls to our PGAS runtime system:

Source code:

shared int a[10];
int foo () {
    int i, res;
    if (MYTHREAD==0)
        for (i=0; i<10; i++)
            res += a[i];
    upc_barrier;
    return res;
}

Pseudocode generated by the compiler:

int foo() {
    if (!(mythread == 0)) goto lab_1;
    i = 0;
    do {
        ...
        xlupc_incr_shared_addr(&shared_addr.a, i);
        xlupc_ptrderef(shared_addr.a, &tmp);
        res = res + tmp;
        i = i + 1;
    } while (i < 10);
lab_1:
    xlupc_barrier(0,1);
    return res;
}

41 3.0 Translation of shared object accesses To translate the shared array access a[i] the compiler generates: A PGAS RT call to position a pointer-to-shared to the correct array element (&a[i]): xlupc_incr_shared_addr(&shared_addr.a, i); A PGAS RT call to retrieve the value of a[i] through the pointer: xlupc_ptrderef(shared_addr.a, &tmp); The value of a[i] is now in a local buffer tmp Performance considerations: Function call overhead (setting up arguments, branching to the runtime routine) Runtime implementation cost (receiving arguments, performing local computation) Communication cost (if the target array element is not local) Pointer increment cost (actually inlined by the compiler for common cases)

42 3.0 Preview of compiler optimized code Besides code generation, the compiler concerns itself with generating efficient code (removing overhead). For example, certain shared accesses can be proven local at compile time, so the compiler can: Cast the shared array element address to a local C pointer (on each thread) Use the local C pointer to dereference the array element directly

Source code:

shared int A[THREADS];
int foo(void) {
    A[MYTHREAD] = MYTHREAD;
}

Optimized code (-O3):

int foo() {
    threadspernode = xlupc_threadspernode();
    shr_address = xlupc_get_array_base_address(A);
    shr_address[(((mythread % 4) % threadspernode) * 4 + (mythread/4)*4)] = mythread;
    return; /* function */
}

43 3.1 Optimizing a Work Sharing Loop A UPC work sharing loop (upc_forall) is equivalent to a for loop where the execution of the loop body is controlled by the affinity expression Example:

Original upc_forall loop:

shared int a[N], b[N], c[N];
void foo () {
    int i;
    upc_forall (i=0; i<N; i++; &a[i])
        a[i] = b[i] + c[i];
}

Equivalent loop:

shared int a[N], b[N], c[N];
void foo() {
    int i;
    for (i=0; i<N; i++) {
        if (upc_threadof(&a[i]) == MYTHREAD)
            a[i] = b[i] + c[i];
    }
}

The affinity branch is evaluated by every thread on each loop iteration What is the performance implication of this translation scheme?

44 3.1 Optimizing a Work Sharing Loop To eliminate the affinity branch the compiler transforms the upc_forall loop into a loop nest comprising two for loops:

Original upc_forall loop:

shared int a[N], b[N], c[N];
void foo () {
    int i;
    upc_forall (i=0; i<N; i++; &a[i])
        a[i] = b[i] + c[i];
}

Equivalent (optimized) for loop:

shared int a[N], b[N], c[N];
void foo() {
    int i, j;
    for (i=MYTHREAD; i<N; i+=THREADS)
        for (j=i; j<=i; j++)
            a[j] = b[j] + c[j];
}

For each thread the outer loop cycles through each block of elements while the inner loop cycles through each element in a block (Diagram: with 2 threads and the default cyclic layout, thread 0 handles i = 0, 2, 4, ... and thread 1 handles i = 1, 3, 5, ...)

45 3.1 Optimizing a Work Sharing Loop Objective: remove the affinity branch overhead in upc_forall loops (Chart: Stream Triad, 20 million elements, address affinity; MB/s vs. threads for the unoptimized baseline and the optimized, strip-mined loop, showing an 8.5X improvement; affinity branch evaluation limits the scalability of the baseline.)

Original loop:

shared [BF] int A[N], B[N], C[N];
upc_forall (i=0; i < N; i++; &A[i])
    A[i] = B[i] + k*C[i];

Naïve translation (affinity branch evaluated on every iteration):

shared [BF] int A[N], B[N], C[N];
for (i=0; i < N; i++)
    if (upc_threadof(&A[i]) == MYTHREAD)
        A[i] = B[i] + k*C[i];

Optimized (strip-mined) loop:

shared [BF] int A[N], B[N], C[N];
for (i=MYTHREAD*BF; i<N; i+=THREADS*BF)
    for (j=i; j < i+BF; j++)
        A[j] = B[j] + k*C[j];

46 3.2 Locality Analysis for Work Sharing Loops (Diagram: shared array A laid out in blocks of 4 across 4 threads / 4 nodes; per block, 3 accesses are local and 1 is remote.)

shared [4] int A[2][16];
int main ( ) {
    int i, j;
    for (i=0; i<2; i++) {   // row traversal
        upc_forall (j=0; j<16; j++; &A[i][j]) {
            A[i][j+1] = 2;
        }
    }
}

Each block has 4 elements The upc_forall loop is used to traverse the inner array dimension Note the index [j+1], which causes ¼ of the accesses to fall into the next block Every 4th loop iteration we have one remote access; the first 3 accesses are shared local Can we somehow isolate the remote access from the local accesses?

47 3.2 Locality Analysis for Work Sharing Loops (Diagram: the same array A across 4 threads / 4 nodes, with each block split into region A1, where the access is local, and region A2, where it is remote.)

shared [4] int A[2][16];
int main ( ) {
    for (int i=0; i < 2; i++)
        upc_forall (int j=0; j < 16; j++; &A[i][j]) {
            if (j % 4 < 3)
                A[i][j+1] = 2;   // A1: region 1
            else
                A[i][j+1] = 2;   // A2: region 2
        }
}

The compiler identifies the point in the loop iteration space where the locality of shared accesses changes The compiler then splits the loop body into 2 regions + Region 1: shared access is always local + Region 2: shared access is always remote Now shared local accesses (region 1) can be privatized, and shared remote accesses can be coalesced (more on this later)

48 3.3 Optimizations for Local Accesses After the compiler has proven (via locality analysis) that a shared array access is local, shared object access privatization can be applied:

Original shared access:

upc_forall (int j=0; j < 16; j++; &A[j]) {
    if (j % 4 < 3)
        A[j+1] = 2;   // shared local
    else
        A[j+1] = 2;   // shared remote
}

Privatized:

A_address = xlupc_get_array_address(A);
upc_forall (int j=0; j < 16; j++; &A[j]) {
    if (j % 4 < 3)
        A_address[new_index] = 2;   // privatized
    else
        A[j+1] = 2;                 // shared remote
}

The compiler translates the shared local access so that no call to the UPC runtime system is necessary (no call overhead): the base address of the shared array is retrieved outside the loop, the index into the local portion of the shared array is computed, and the shared array access becomes just a local memory dereference

49 3.3 Optimizations for Remote Accesses Locality analysis tells us whether a shared reference is local or remote Local shared references are privatized; how can remote shared references be optimized?

Example of remote accesses:

shared [BF] int A[N], B[N+BF];
void loop() {
    upc_forall(int i=0; i<N; i++; &A[i])
        A[i] = B[i+BF];
}

Observations (1) A[i] is shared local, therefore the compiler privatizes the access (2) B[i+BF] is always remote (for every i), so it cannot be privatized In order to read the elements of shared array B the compiler can generate a call to the UPC runtime system for each element of B What is the performance implication of this translation scheme? Can we reduce the number of calls required to read the elements of array B?

50 3.3 Optimizations for Remote Accesses The compiler transforms the parallel loop to prefetch multiple elements of array B (reducing the number of messages required)

Original parallel loop:

#define BF (N/THREADS)
shared [BF] int A[N], B[N+BF];
void loop() {
    upc_forall(int i=0; i<N; i++; &A[i])
        A[i] = B[i+BF];
}

Optimized parallel loop:

#define BF (N/THREADS)
shared [BF] int A[N], B[N+BF];
void loop() {
    upc_forall(int i=0; i<THREADS; i++; &A[BF*i]) {
        int stride = sizeof(int);   // prefetch stride
        int nelems = BF;            // n. of elems to prefetch
        xlupc_ptrderef_strided(&B[BF*i+BF], &B_buffer, sizeof(int), stride, nelems);
        for (int j=BF*i; j < min(BF*i+BF, N-BF); j++)
            A[j] = B_buffer[j - BF*i];
    }
}

The outer loop reads BF elements from B (one message) and fills in a buffer The inner loop consumes the buffer; the communication cost is reduced by roughly a factor of BF

51 3.3 Optimizations for Remote Accesses (Sobel kernel with coalescing; the coalesce call is managed by the compiler and scheduled outside the inner loop, fetching into buf the row data owned by thread MYTHREAD+1)

shared [ROWS*COLUMNS/THREADS] BYTE orig[ROWS][COLUMNS];

upc_forall (i=1; i < ROWS-1; i++; &orig[i][0])
    for (k=0; k < nblocks; k++) {
        coalesce_call(orig, &buf, index, offset, stride, bsize);
        l = k * bsize;
        for (j=0; j < bsize; j++) {
            gx  = (int) orig[i-1][j+l+2] - orig[i-1][j+l];
            gx += ((int) orig[i][j+l+2] - orig[i][j+l]) * 2;
            gx += (int) buf[j+2] - buf[j];
            gy  = (int) buf[j] - orig[i-1][j+l];
            gy += ((int) buf[j+1] - orig[i-1][j+l+1]) * 2;
            gy += (int) buf[j+2] - orig[i-1][j+l+2];
            edge[i][j+l+1] = (BYTE) gradient;
        }
    }

52 Sobel Edge Detection banded layout (Charts: Sobel execution time in seconds vs. number of nodes, image size 4096 X 4096, lower is better, comparing No Opt., Privatization, Coalescing and Full Opt.; left: distributed, 1 thread per node, with a +5.7X coalescing callout; right: hybrid, 16 threads per node, with a +130X callout.) Distributed: Coalescing accounts for a 3X improvement Hybrid: Coalescing accounts for a 74X improvement

53 3.3 Optimizing Commonly Used Code Patterns This transformation detects common initialization code patterns and changes them into one of the 4 string handling functions provided by the UPC standard (upc_memset, upc_memget, upc_memput, upc_memcpy). The compiler is also able to recognize non-shared initialization idioms and change them to one of the C standard string handling functions

Source code snippet:

#define N 64000
#define BF 512
shared [BF] int A[N];
shared [BF] int B[N];
int a[N], b[N];
void func(void) {
    for (int i = 0; i < N; i++) {
        A[i] = a[i];
        b[i] = B[i];
    }
}

Code generated by the XLUPC compiler (-O2), after array-init idiom recognition (excerpt):

do {
    do {
        ...
        upc_memput(&A + 4ll*@CIV1, &a + 4ll*@CIV1, 2048ll);
        upc_memget(&b + 4ll*@CIV1, &B + 4ll*@CIV1, 2048ll);
        ...
    } while (@CIV1 < min(@BCIV2 * 512ll + 512ll, 64000ll));
    ...
} while (@BCIV2 < 125ll);

54 3.4 Optimizing Shared Accesses through pointers In order to determine whether a shared access in a parallel loop is local or remote, locality analysis has to compare it to the loop affinity expression (to determine its distance) If the access is through a pointer to shared memory, the distance computation depends on where the pointer points

Shared accesses via pointers-to-shared:

shared [BF] int A[N], B[N];
void foo(shared [BF] int *p1, shared [BF] int *p2) {
    upc_forall (int i=0; i<N; ++i; &p1[i])
        p1[i] = p2[i+1];
}

int main() {
    foo (&A[0], &B[0]);     // call site 1
    foo (&A[0], &B[BF-1]);  // call site 2
}

Observations (1) p1[i] is trivially a shared local access for every value of i, because the affinity expression is &p1[i] (2) What is the distance of p2[i+1] from the affinity expression? Call site 1: p2 points to the first element of B; the relative distance is 1 (out of each BF accesses, 1 is remote) Call site 2: p2 points to the last element of the 1st block of B; the relative distance is BF (all accesses are remote) (3) How can we privatize p2[i+1] to handle the call from site 1 without breaking the call at site 2?

55 3.4 Optimizing Shared Accesses through pointers Key insight from the prior example: privatization needs to be sensitive to different calling contexts (call site 1 vs. call site 2) The approach we implemented is to create different versions of the parallel loop based on the value of a variable (ver_p2 below)

Shared accesses via pointers-to-shared:

shared [BF] int A[N], B[N];
void foo(shared [BF] int *p1, shared [BF] int *p2) {
    upc_forall (int i=0; i<N; ++i; &p1[i])
        p1[i] = p2[i+1];
}

int main() {
    foo (&A[0], &B[0]);     // case 1
    foo (&A[0], &B[BF-1]);  // case 2
}

Optimized parallel loop:

void foo(shared [BF] int *p1, shared [BF] int *p2) {
    _Bool ver_p2 = (upc_phaseof(p2)==0) ? 1 : 0;
    if (ver_p2) {   // p2 points to the first elem. in a block
        upc_forall (int i=0; i<N; i++; &p1[i])
            p1[i] = p2[i+1];   // locality analysis possible
    } else {
        upc_forall (int i=0; i<N; i++; &p1[i])
            p1[i] = p2[i+1];   // p2[] will not be privatized
    }
}

56 Research Directions Advanced Prefetching Optimizations Applicable to for loops as well as upc_forall loops Hybrid in nature: Collection phase: shared accesses in the loop are described to the runtime system Scheduling phase: the runtime analyzes access patterns, coalesces related accesses and prefetches them Execution phase: prefetched buffers are consumed in the loop Applicable in a dynamic execution environment (threads not known at compile time) (Chart: throughput in MB/sec vs. threads for the loop A[i] = B[i+1] + C[i+1], millions of iterations, 228 MB, Power 7 hardware, 1 node, RHEL6; the prefetched version achieves a ~15X speedup over the -O3 baseline.)

57 Research Directions Advanced Runtime Diagnostics (-qcheck) The compiler instruments the UPC application; errors are detected while the application runs (Diagram: the UPC program passes through the UPC compiler (code analysis, code optimizer, code generator, UPC checker), producing an instrumented executable that reports to the PGAS runtime diagnostician through a diagnostic API interface.)

1  #include <upc.h>
2  #include <upc_collective.h>
3  #define BF
4
5  shared [BF] int A[BF * THREADS], *ptr;
6  shared [BF*THREADS] int B[THREADS][BF*THREADS];
7  shared int sh_num_bytes;
8
9  int main() {
10     if (sh_num_bytes>=0) ptr = &A[2*BF];
11     else ptr = A;
12     upc_all_gather_all(B, ptr, sizeof(int)*BF, UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC);
13     return 0;
14 }

% xlupc -qcheck coll_diag.upc -qupc=threads=4
% ./a.out -procs 4
Invalid target affinity detected at line 12 in file 'coll_diag.upc'. Parameter 2 of function 'upc_all_gather_all' should have affinity to thread 0.

58 Research Directions UPC and MPI interoperability

#include <upc.h>
#include "mpi.h"

extern int xlpgas_tsp_thrd_id (void);

shared int BigArray[THREADS];

int main(int argc, char ** argv) {
    // We are in a UPC program, but we can initialize MPI:
    int mpirank, mpisize, result;
    if (xlpgas_tsp_thrd_id()==0) MPI_Init (&argc, &argv);
    upc_barrier;
    MPI_Comm_rank (MPI_COMM_WORLD, &mpirank);
    MPI_Comm_size (MPI_COMM_WORLD, &mpisize);

    // let's dump our MPI and UPC identification:
    printf ("I'm UPC thread %d of %d and MPI task %d of %d\n",
            MYTHREAD, THREADS, mpirank, mpisize);

    // let's do something fun with MPI.
    BigArray[MYTHREAD] = MYTHREAD + 1;
    int * localBigArray = (int *) &BigArray[MYTHREAD];
    MPI_Allreduce(localBigArray, &result, 1, MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD);
    if (MYTHREAD==0) printf ("The sum = %d\n", result);

    upc_barrier;
    if (xlpgas_tsp_thrd_id()==0) MPI_Finalize();
}

% ./a.out -msg_api pgas,mpi -procs 4
I'm UPC thread 0 of 4 and MPI task 0 of 4
I'm UPC thread 1 of 4 and MPI task 1 of 4
I'm UPC thread 2 of 4 and MPI task 2 of 4
I'm UPC thread 3 of 4 and MPI task 3 of 4
The sum = 10

59 IBM Software Group Rational Thank You!!! Ettore Tiotto IBM Toronto Laboratory


More information

Introduction to MPI. Ekpe Okorafor. School of Parallel Programming & Parallel Architecture for HPC ICTP October, 2014

Introduction to MPI. Ekpe Okorafor. School of Parallel Programming & Parallel Architecture for HPC ICTP October, 2014 Introduction to MPI Ekpe Okorafor School of Parallel Programming & Parallel Architecture for HPC ICTP October, 2014 Topics Introduction MPI Model and Basic Calls MPI Communication Summary 2 Topics Introduction

More information

Lecture 15. Parallel Programming Languages

Lecture 15. Parallel Programming Languages Lecture 15 Parallel Programming Languages Announcements Wednesday s office hour is cancelled 11/7/06 Scott B. Baden/CSE 160/Fall 2006 2 Today s lecture Revisiting processor topologies in MPI Parallel programming

More information

OpenMP 4.5: Threading, vectorization & offloading

OpenMP 4.5: Threading, vectorization & offloading OpenMP 4.5: Threading, vectorization & offloading Michal Merta michal.merta@vsb.cz 2nd of March 2018 Agenda Introduction The Basics OpenMP Tasks Vectorization with OpenMP 4.x Offloading to Accelerators

More information

Parallel programming models. Prof. Rich Vuduc Georgia Tech Summer CRUISE Program June 6, 2008

Parallel programming models. Prof. Rich Vuduc Georgia Tech Summer CRUISE Program June 6, 2008 Parallel programming models Prof. Rich Vuduc Georgia Tech Summer CRUISE Program June 6, 2008 1 Attention span In conclusion 5 min 10 min 50 min Patterson s Law of Attention Span 2 Problem parallel algorithm

More information

A Characterization of Shared Data Access Patterns in UPC Programs

A Characterization of Shared Data Access Patterns in UPC Programs IBM T.J. Watson Research Center A Characterization of Shared Data Access Patterns in UPC Programs Christopher Barton, Calin Cascaval, Jose Nelson Amaral LCPC `06 November 2, 2006 Outline Motivation Overview

More information

Message Passing Interface

Message Passing Interface Message Passing Interface DPHPC15 TA: Salvatore Di Girolamo DSM (Distributed Shared Memory) Message Passing MPI (Message Passing Interface) A message passing specification implemented

More information

OpenMP Tutorial. Seung-Jai Min. School of Electrical and Computer Engineering Purdue University, West Lafayette, IN

OpenMP Tutorial. Seung-Jai Min. School of Electrical and Computer Engineering Purdue University, West Lafayette, IN OpenMP Tutorial Seung-Jai Min (smin@purdue.edu) School of Electrical and Computer Engineering Purdue University, West Lafayette, IN 1 Parallel Programming Standards Thread Libraries - Win32 API / Posix

More information

UPC: A Portable High Performance Dialect of C

UPC: A Portable High Performance Dialect of C UPC: A Portable High Performance Dialect of C Kathy Yelick Christian Bell, Dan Bonachea, Wei Chen, Jason Duell, Paul Hargrove, Parry Husbands, Costin Iancu, Wei Tu, Mike Welcome Parallelism on the Rise

More information

LAPI on HPS Evaluating Federation

LAPI on HPS Evaluating Federation LAPI on HPS Evaluating Federation Adrian Jackson August 23, 2004 Abstract LAPI is an IBM-specific communication library that performs single-sided operation. This library was well profiled on Phase 1 of

More information

Agenda. Peer Instruction Question 1. Peer Instruction Answer 1. Peer Instruction Question 2 6/22/2011

Agenda. Peer Instruction Question 1. Peer Instruction Answer 1. Peer Instruction Question 2 6/22/2011 CS 61C: Great Ideas in Computer Architecture (Machine Structures) Introduction to C (Part II) Instructors: Randy H. Katz David A. Patterson http://inst.eecs.berkeley.edu/~cs61c/sp11 Spring 2011 -- Lecture

More information

6.189 IAP Lecture 5. Parallel Programming Concepts. Dr. Rodric Rabbah, IBM IAP 2007 MIT

6.189 IAP Lecture 5. Parallel Programming Concepts. Dr. Rodric Rabbah, IBM IAP 2007 MIT 6.189 IAP 2007 Lecture 5 Parallel Programming Concepts 1 6.189 IAP 2007 MIT Recap Two primary patterns of multicore architecture design Shared memory Ex: Intel Core 2 Duo/Quad One copy of data shared among

More information

Parallel Programming in C with MPI and OpenMP

Parallel Programming in C with MPI and OpenMP Parallel Programming in C with MPI and OpenMP Michael J. Quinn Chapter 4 Message-Passing Programming Learning Objectives Understanding how MPI programs execute Familiarity with fundamental MPI functions

More information

Memory management. Johan Montelius KTH

Memory management. Johan Montelius KTH Memory management Johan Montelius KTH 2017 1 / 22 C program # include int global = 42; int main ( int argc, char * argv []) { if( argc < 2) return -1; int n = atoi ( argv [1]); int on_stack

More information

COMP Parallel Computing. SMM (2) OpenMP Programming Model

COMP Parallel Computing. SMM (2) OpenMP Programming Model COMP 633 - Parallel Computing Lecture 7 September 12, 2017 SMM (2) OpenMP Programming Model Reading for next time look through sections 7-9 of the Open MP tutorial Topics OpenMP shared-memory parallel

More information

ROSE-CIRM Detecting C-Style Errors in UPC Code

ROSE-CIRM Detecting C-Style Errors in UPC Code ROSE-CIRM Detecting C-Style Errors in UPC Code Peter Pirkelbauer 1 Chunhuah Liao 1 Thomas Panas 2 Daniel Quinlan 1 1 2 Microsoft Parallel Data Warehouse This work was funded by the Department of Defense

More information

Programming Scalable Systems with MPI. Clemens Grelck, University of Amsterdam

Programming Scalable Systems with MPI. Clemens Grelck, University of Amsterdam Clemens Grelck University of Amsterdam UvA / SurfSARA High Performance Computing and Big Data Course June 2014 Parallel Programming with Compiler Directives: OpenMP Message Passing Gentle Introduction

More information

Arrays and Pointers in C. Alan L. Cox

Arrays and Pointers in C. Alan L. Cox Arrays and Pointers in C Alan L. Cox alc@rice.edu Objectives Be able to use arrays, pointers, and strings in C programs Be able to explain the representation of these data types at the machine level, including

More information

OpenMP: Open Multiprocessing

OpenMP: Open Multiprocessing OpenMP: Open Multiprocessing Erik Schnetter June 7, 2012, IHPC 2012, Iowa City Outline 1. Basic concepts, hardware architectures 2. OpenMP Programming 3. How to parallelise an existing code 4. Advanced

More information