Parallel Numerical Algorithms http://sudalab.is.s.u-tokyo.ac.jp/~reiji/pna16/ [ 9 ] Shared Memory Performance Parallel Numerical Algorithms / IST / UTokyo 1
PNA16 Lecture Plan General Topics 1. Architecture and Performance 2. Dependency 3. Locality 4. Scheduling MIMD / Distributed Memory 5. MPI: Message Passing Interface 6. Collective Communication 7. Distributed Data Structure MIMD / Shared Memory 8. OpenMP 9. Performance Special Lectures 5/30 How to use FX10 (Prof. Ohshima) 6/6 Dynamic Parallelism (Prof. Peri) SIMD / Shared Memory 10. GPU and CUDA 11. SIMD Performance Parallel Numerical Algorithms / IST / UTokyo 2
Memory models Distributed memory: each processor has its own memory, and processors communicate over a network. Shared memory: Uniform Memory Access (UMA), where all processors access one shared memory, and Non-Uniform Memory Access (NUMA), where each processor has its own local memory but all memory is shared. Parallel Numerical Algorithms / IST / UTokyo 3
OpenMP A frequently used API for shared memory parallel computing in high performance computing. FX10 supports OpenMP version 3.0. Shared memory, global view: describe the whole data structure and the whole computation. Parallel Numerical Algorithms / IST / UTokyo 4
Weak Consistency The compiler can reorder operations, as long as it does not change the meaning of the sequential execution. The hardware can also reorder operations, as long as it does not change the meaning of the sequential execution. Weak consistency: the order of operations is guaranteed only at special commands. Flush is the special command in OpenMP; it is usually used implicitly in parallel, barrier, atomic, etc. Parallel Numerical Algorithms / IST / UTokyo 5
The solution:

int data;
#pragma omp parallel
{
  if (producer) {
    data = produce_data();
    #pragma omp barrier
  } else { // consumer
    #pragma omp barrier
    consume_data(data);
  }
}

Flush alone is not enough: the flush of the producer must happen before the flush of the consumer, and the barrier guarantees that ordering. Parallel Numerical Algorithms / IST / UTokyo 6
Barrier should be inserted before writing data, after writing data, before reading data, and after reading data. (Diagram: one thread writes data, then any thread reads it, repeated over and over, with a barrier between each phase.) Parallel Numerical Algorithms / IST / UTokyo 7
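A minimal sketch of this pattern (hypothetical, not from the lecture: the variable data, the step count, and the update are made up). One thread writes the shared value in each step, every thread then reads it, and barriers separate the write phase from the read phase:

#include <omp.h>
#include <stdio.h>

#define STEPS 10

int main(void) {
    double data = 0.0;                /* shared by all threads */

    #pragma omp parallel
    {
        for (int step = 0; step < STEPS; step++) {
            /* barrier before writing: no thread may still be reading */
            #pragma omp barrier
            #pragma omp single
            data = step * 2.0;        /* one thread writes the shared data */
            /* the implicit barrier at the end of 'single' separates the
               write from the reads of the other threads */
            double local = data;      /* every thread reads the data */
            (void)local;              /* placeholder for real work */
        }
    }
    printf("final data = %f\n", data);
    return 0;
}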
Performance Issues Mutual exclusion Synchronization Load imbalance Memory access congestion More issues Parallel Numerical Algorithms / IST / UTokyo 8
Mutual exclusion Atomic operation (#pragma omp atomic): the operation is done in an inseparable way; limited to a small number of operations; may be done in hardware, so possibly very fast. Critical section (#pragma omp critical): any block of code can be declared a critical section; while one thread resides in a critical section, no other thread can enter a critical section; software implementation, so slower. Lock (omp_set_lock(), omp_unset_lock(), etc.): while one thread keeps the lock, no other thread can get the lock; software implementation, so slower. Parallel Numerical Algorithms / IST / UTokyo 9
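A minimal sketch of the lock API (hypothetical example; the shared counter and iteration count are made up):

#include <omp.h>
#include <stdio.h>

int main(void) {
    omp_lock_t lock;
    omp_init_lock(&lock);
    long counter = 0;

    #pragma omp parallel for
    for (int i = 0; i < 1000000; i++) {
        omp_set_lock(&lock);          /* only one thread may hold the lock */
        counter++;                    /* protected update of shared data */
        omp_unset_lock(&lock);
    }

    omp_destroy_lock(&lock);
    printf("counter = %ld\n", counter);
    return 0;
}

For a simple increment like this, #pragma omp atomic (or a reduction clause) would be both simpler and faster; a lock pays off when the protected region is more than a single operation.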
Atomic and Critical Section With atomic: #pragma omp atomic x += a; and #pragma omp atomic y *= b; can be done in parallel; atomic may be hardware supported. With critical: #pragma omp critical x += a; and #pragma omp critical y *= b; cannot be done in parallel, because all unnamed critical sections exclude each other; critical is perhaps software implemented. Use atomic if applicable. Parallel Numerical Algorithms / IST / UTokyo 10
Synchronization #pragma omp barrier Wait until all the threads reach the barrier. A barrier takes on the order of 1 μs, which amounts to many thousands of operations. Load imbalance produces idle times. Parallel Numerical Algorithms / IST / UTokyo 11
Load balancing = equal time? Load balancing: assign the same amount of computation to each thread. When threads do the same computations, do they consume the same time? Actually not, on shared memory processors: arbitration at atomic operations and critical sections, collisions at memory accesses, OS tasks, etc. There is more fluctuation if all cores are used. Sometimes, dynamic load balancing is better than perfect static load balancing. Parallel Numerical Algorithms / IST / UTokyo 12
Loop Scheduling clauses Schedule clause in omp for: #pragma omp for schedule(kind [, chunk_size]) Kinds of OpenMP loop scheduling: static, dynamic, guided, auto (schedule decided by the compiler or the system), runtime (taken from the environment variable OMP_SCHEDULE or set by the function omp_set_schedule(kind, modifier)). Parallel Numerical Algorithms / IST / UTokyo 13
Loop Scheduling 1. Static: round-robin assignment of fixed-size chunks (like a block-cyclic distribution in distributed computing). 2. Dynamic: dynamic assignment of fixed-size chunks; an idle thread asks for a new chunk to compute. 3. Guided: dynamic assignment, starting with a big chunk size and shrinking toward the given chunk size. Parallel Numerical Algorithms / IST / UTokyo 14
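A minimal sketch of the schedule clause (hypothetical: the loop body and the chunk size of 64 are made up):

#include <omp.h>

void scale(double *a, int n) {
    /* static: fixed chunks assigned round-robin before the loop runs */
    #pragma omp parallel for schedule(static, 64)
    for (int i = 0; i < n; i++)
        a[i] *= 2.0;

    /* dynamic: an idle thread grabs the next chunk at run time;
       useful when iteration costs vary */
    #pragma omp parallel for schedule(dynamic, 64)
    for (int i = 0; i < n; i++)
        a[i] *= 2.0;

    /* guided: chunks start large and shrink toward the given minimum */
    #pragma omp parallel for schedule(guided, 64)
    for (int i = 0; i < n; i++)
        a[i] *= 2.0;
}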
Performance tips Use atomic rather than critical (if possible) Reduce synchronization Balance the loads, consider dynamic load balancing Choose best performing loop scheduling Parallel Numerical Algorithms / IST / UTokyo 15
Memory models Shared memory: Uniform Memory Access (UMA) and Non-Uniform Memory Access (NUMA). In NUMA, memory access costs (latency and bandwidth) differ between a core's own memory and the memory of other cores. Parallel Numerical Algorithms / IST / UTokyo 16
First touch principle In whose memory is the data allocated? First touch principle: at allocation (malloc etc.), the physical location is not yet determined; at the first access to the allocated memory (which must be a write), it is placed in the memory attached to the accessing core. The assignment is usually done in units of pages (4 KB etc.). Parallel Numerical Algorithms / IST / UTokyo 17
Affinity Use (mostly) static scheduling, so that each thread always accesses the same (or a similar) memory area. Stop the OS from moving threads among cores. Use the same scheduling for the initialization: #pragma omp parallel for schedule(static, 512) for (i = 0; i < n; i++) a[i] = 0.0; If possible, design so that each thread uses a memory size that is a multiple of the page size. If possible, align the starting address to a page boundary. Parallel Numerical Algorithms / IST / UTokyo 18
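A minimal sketch of first-touch initialization (hypothetical: the array, its size, and the chunk size are made up). The same static schedule is used for initialization and for computation, so each thread first touches, and therefore owns, the pages it later works on:

#include <stdlib.h>

#define CHUNK 512

int main(void) {
    int n = 1 << 24;
    double *a = malloc(n * sizeof(double));    /* pages not yet placed */

    /* first touch: each thread writes its own chunks, so the pages are
       allocated in the memory of that thread's NUMA node */
    #pragma omp parallel for schedule(static, CHUNK)
    for (int i = 0; i < n; i++)
        a[i] = 0.0;

    /* computation with the same schedule: each thread mostly accesses
       memory attached to its own node */
    #pragma omp parallel for schedule(static, CHUNK)
    for (int i = 0; i < n; i++)
        a[i] = a[i] * 2.0 + 1.0;

    free(a);
    return 0;
}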
Consistency A cache should contain a copy of main memory, but the data in main memory may be overwritten by other threads. There are several (~10) algorithms for cache consistency: the line in the other cache must be updated, or at least invalidated. Parallel Numerical Algorithms / IST / UTokyo 20
False sharing Happens when private variables of different threads reside on the same cache line: an update of one variable invalidates the cache line in the other thread's cache, even though no data is actually shared. Parallel Numerical Algorithms / IST / UTokyo 23
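A minimal sketch of the problem and a common fix (hypothetical: the 64-byte line size, the counter array, and the thread limit are assumptions). Per-thread counters packed next to each other share a cache line; padding each counter to a full line removes the false sharing:

#include <omp.h>

#define MAX_THREADS 64
#define LINE 64                               /* assumed cache-line size */

/* bad layout, shown only for contrast: adjacent counters share a line */
long counts_bad[MAX_THREADS];

/* fix: pad each counter so it occupies a cache line of its own */
struct padded { long value; char pad[LINE - sizeof(long)]; };
struct padded counts_good[MAX_THREADS];

void sum_per_thread(const int *data, int n) {
    #pragma omp parallel
    {
        int t = omp_get_thread_num();         /* each thread uses its own slot */
        #pragma omp for
        for (int i = 0; i < n; i++)
            counts_good[t].value += data[i];  /* no line shared between threads */
    }
}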
Performance tips Remote memory access False sharing Solution: Locality! Collect data used by each thread into one place Block distribution Parallel Numerical Algorithms / IST / UTokyo 27
Locality! On Shared Memory Systems (Diagram: CPUs with private caches ($) sharing a memory; a timeline alternating computation and memory access.) Parallel Numerical Algorithms / IST / UTokyo 29
Locality! Do your computations in the cache: exploit locality! Remember the 3rd lecture on computational intensity. High: matrix-matrix multiply, O(m^1.5). Middle: stencil, O(k). Relatively low: FFT, O(log m). Low: matrix-vector multiply, reduction, O(1). (m: data size, k: number of iterations) Parallel Numerical Algorithms / IST / UTokyo 30
Remember the cache. Key parameters: data (cache) size, line size, associativity. Accessed data is automatically stored in the cache; old data is evicted if the line (set) is full. Parallel Numerical Algorithms / IST / UTokyo 31
Padding Array a[n][m] versus array a[n][m+2]: the last 2 elements in each row are not used, but the extra columns change how rows map onto cache lines and sets, which can avoid conflict misses. Parallel Numerical Algorithms / IST / UTokyo 32
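A minimal sketch of row padding (hypothetical sizes; assumes the unpadded row length maps the elements of one column to the same cache sets):

#define N 1024
#define M 1024
#define PAD 2

/* unpadded: rows are exactly 1024 doubles apart, so walking down one
   column keeps hitting the same cache sets (conflict misses) */
double a_plain[N][M];

/* padded: rows are 1026 doubles apart, so column accesses spread over
   different cache sets; the last PAD elements of each row are unused */
double a_padded[N][M + PAD];

double column_sum(int j) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        s += a_padded[i][j];
    return s;
}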
Tiling Matrix-matrix multiply: C = C + A*B

Naive loop nest:
for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      c_ij = c_ij + a_ik * b_kj

Tiled loop nest (with block sizes b_i, b_j, b_k):
for s_i = 1 to n step b_i
  for s_j = 1 to n step b_j
    for s_k = 1 to n step b_k
      for i = s_i to s_i + b_i
        for j = s_j to s_j + b_j
          for k = s_k to s_k + b_k
            c_ij = c_ij + a_ik * b_kj

Viewing A (and similarly B and C) as a block matrix A_11 ... A_44, the tiled loops compute C_ij = C_ij + A_ik * B_kj block by block. Choose the block sizes so that C_ij, A_ik and B_kj can be stored in the cache at once. Parallel Numerical Algorithms / IST / UTokyo 33
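A minimal tiled implementation in C (hypothetical: the block size BS is made up and is assumed to divide n; matrices are row-major):

#define BS 64    /* block size: 3 blocks of BS x BS doubles should fit in cache */

/* C = C + A * B for n x n row-major matrices */
void matmul_tiled(int n, const double *A, const double *B, double *C) {
    for (int si = 0; si < n; si += BS)
        for (int sj = 0; sj < n; sj += BS)
            for (int sk = 0; sk < n; sk += BS)
                /* multiply one BS x BS block of A by one block of B */
                for (int i = si; i < si + BS; i++)
                    for (int j = sj; j < sj + BS; j++) {
                        double s = C[i * n + j];
                        for (int k = sk; k < sk + BS; k++)
                            s += A[i * n + k] * B[k * n + j];
                        C[i * n + j] = s;
                    }
}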
Oblivious Algorithm Matrix-matrix multiply, with A, B, and C each divided into four submatrices (A_11, A_12, A_21, A_22, and so on):

MMM(A, B, C) {
  if (small enough) compute directly;
  else {
    divide A, B, C into four submatrices;
    for i = 1, 2
      for j = 1, 2
        for k = 1, 2
          MMM(A_ik, B_kj, C_ij);
  }
}

Reformulated as divide-and-conquer, some level of the recursion fits into the cache. Parallel Numerical Algorithms / IST / UTokyo 34
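A minimal recursive sketch in C (hypothetical: row-major storage with leading dimension ld, a made-up base-case size, and the matrix size assumed to be a power of two):

#define BASE 32    /* below this size, multiply directly */

/* C = C + A * B for size x size blocks inside matrices of leading dimension ld */
void mmm(int size, int ld, const double *A, const double *B, double *C) {
    if (size <= BASE) {
        for (int i = 0; i < size; i++)
            for (int j = 0; j < size; j++)
                for (int k = 0; k < size; k++)
                    C[i * ld + j] += A[i * ld + k] * B[k * ld + j];
        return;
    }
    int h = size / 2;
    /* offsets of the four submatrices within a row-major block */
    int off[2][2] = { { 0, h }, { h * ld, h * ld + h } };
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            for (int k = 0; k < 2; k++)
                mmm(h, ld, A + off[i][k], B + off[k][j], C + off[i][j]);
}

At some recursion depth the three h x h blocks fit in the cache, whatever the cache size is, which is what makes the algorithm cache-oblivious.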
Array dimension / loop exchange
Higher spatial locality (the innermost loop runs over the last, contiguous index):
for i = 0 to n-1
  for j = 0 to n-1
    a[i][j] = ...;
for j = 0 to n-1
  for i = 0 to n-1
    a[j][i] = ...;
Lower spatial locality (the innermost loop strides through memory):
for i = 0 to n-1
  for j = 0 to n-1
    a[j][i] = ...;
for j = 0 to n-1
  for i = 0 to n-1
    a[i][j] = ...;
Parallel Numerical Algorithms / IST / UTokyo 35
Array of structures / structure of arrays typedef struct { double x, y, z; } point; point p[n]; (array of structures) versus struct { double x[n]; double y[n]; double z[n]; } p; (structure of arrays). Case 1: increase x of all elements by 1. Case 2: compute the norm sqrt(x*x + y*y + z*z) for each element. Parallel Numerical Algorithms / IST / UTokyo 36
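A minimal sketch of the two layouts for Case 1 (hypothetical element count). With the structure of arrays, the loop touches only the x values, so every loaded cache line is fully used; with the array of structures, y and z are dragged through the cache as well:

#define N 1000000

typedef struct { double x, y, z; } point;

point aos[N];                                 /* array of structures */
struct { double x[N], y[N], z[N]; } soa;      /* structure of arrays */

void bump_x_aos(void) {
    for (int i = 0; i < N; i++)
        aos[i].x += 1.0;      /* y and z are loaded into cache but never used */
}

void bump_x_soa(void) {
    for (int i = 0; i < N; i++)
        soa.x[i] += 1.0;      /* contiguous access, fully used cache lines */
}

For Case 2 (the norm), all three coordinates of each element are needed, and the array of structures keeps them on the same cache line, so the better layout depends on the access pattern.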
Loop fusion / Loop fission
Separate loops:
for i = 0 to n-1
  compute1(i);
for i = 0 to n-1
  compute2(i);
Fused loop:
for i = 0 to n-1 {
  compute1(i);
  compute2(i);
}
Fusion: if compute1(i) and compute2(i) access the same (or nearby) addresses, locality is improved; a temporary array may be removed.
Fission: reduces the working-set size, which may then fit in the cache.
Example pair of loops:
for i = 0 to n-1
  b[i] = 2 * a[i];
for i = 0 to n-1
  c[i] = sqrt(b[i]);
Parallel Numerical Algorithms / IST / UTokyo 37
Tiled data / Space-filling curve Tiled data structure Space-filling curve (Z-curve)
Debugging is hard! Debugging a shared memory parallel program is harder than debugging a distributed memory parallel program: unintentional data races happen, and wrong results appear non-deterministically. (Diagram: one thread writes data and other threads read it, separated by barriers.) Parallel Numerical Algorithms / IST / UTokyo 39
Hybrid Parallelization Flat MPI model: one MPI rank per core, using only the distributed memory model. Hybrid parallel programming: use both OpenMP and MPI, e.g. one MPI rank per node with OpenMP threads inside the node. Parallel Numerical Algorithms / IST / UTokyo 40
Pros and Cons Pros of the flat MPI model: simpler programming, less to learn; sometimes faster than hybrid. Cons of the flat MPI model: partially duplicated memory allocation; contention between messages (the network is shared); too many MPI ranks on today's supercomputers. Pros of the hybrid model: less duplicated memory; message contention can be avoided; fewer MPI ranks. Cons of the hybrid model: must learn both MPI and OpenMP; sometimes not faster than flat MPI. The hybrid model is recommended for high parallelism. Parallel Numerical Algorithms / IST / UTokyo 41
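A minimal hybrid sketch (hypothetical: assumes one MPI rank per node and OMP_NUM_THREADS set to the number of cores per node):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank;
    /* MPI_THREAD_FUNNELED: only the main thread makes MPI calls,
       OpenMP threads only compute */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local_sum = 0.0, global_sum = 0.0;

    /* OpenMP parallelism inside each MPI rank (within one node) */
    #pragma omp parallel for reduction(+ : local_sum)
    for (int i = 0; i < 1000000; i++)
        local_sum += 1.0;

    /* MPI parallelism across nodes */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global_sum);
    MPI_Finalize();
    return 0;
}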
PNA16 Lecture Plan General Topics 1. Architecture and Performance 2. Dependency 3. Locality 4. Scheduling MIMD / Distributed Memory 5. MPI: Message Passing Interface 6. Collective Communication 7. Distributed Data Structure MIMD / Shared Memory 8. OpenMP 9. Performance Special Lectures 5/30 How to use FX10 (Prof. Ohshima) 6/6 Dynamic Parallelism (Prof. Peri) SIMD / Shared Memory 10. GPU and CUDA 11. SIMD Performance Parallel Numerical Algorithms / IST / UTokyo 42