Parallel Numerical Algorithms
http://sudalab.is.s.u-tokyo.ac.jp/~reiji/pna16/
[8] OpenMP
Parallel Numerical Algorithms / IST / UTokyo
PNA16 Lecture Plan

General Topics
1. Architecture and Performance
2. Dependency
3. Locality
4. Scheduling

MIMD / Distributed Memory
5. MPI: Message Passing Interface
6. Collective Communication
7. Distributed Data Structure

MIMD / Shared Memory
8. OpenMP
9. Cache Performance

SIMD / Shared Memory
10. GPU and CUDA
11. SIMD Performance

Special Lectures
5/30 How to use FX10 (Prof. Ohshima)
6/6 Dynamic Parallelism (Prof. Peri)
Memory models

Distributed memory: each processor has its own memory, and processors communicate over a network.

Shared memory: all processors access a common memory.
- Uniform Memory Access (UMA): one shared memory, accessed uniformly by all processors
- Non-Uniform Memory Access (NUMA): memory is physically distributed among the processors but globally addressable
Parallel Computer Nowadays

Hierarchy: Register and PU inside a Core (shared memory, SIMD); Cores sharing a Memory inside a Node (shared memory, MIMD); Nodes connected by a Network into a System (distributed memory, MIMD).

Terminology:
- Processor: any computing part (PU, core, or node)
- Computer: may be equivalent to system
- Socket: set of cores on the same die / module
- CPU: can be a socket or a core
- Sequential or Serial: antonym of Parallel
OpenMP

- A frequently used API for shared-memory parallel computing in high-performance computing
- FX10 supports OpenMP version 3.0
- Shared memory, global view: describe the whole data structure and the whole computation
- It is not automatic parallelization! It parallelizes only where you explicitly parallelize
- It does not guarantee correctness! It runs just as your code is written (not as you intended)
OpenMP Summary

A summary of the API is available on the OpenMP web site.
A tiny code with OpenMP

    #include <stdio.h>
    #include <omp.h>                 /* include this header file */

    int main(void)
    {
        omp_set_num_threads(8);      /* number of threads is set */

    #pragma omp parallel             /* run the next block in parallel (duplicated) */
        {
            printf("I am %d out of %d threads\n",
                   omp_get_thread_num(),    /* my thread ID */
                   omp_get_num_threads());  /* number of threads (must be 8) */
        }
        return 0;
    }

Output (one line per thread, in nondeterministic order):

    I am 1 out of 8 threads
    I am 7 out of 8 threads
    I am 0 out of 8 threads
    I am 2 out of 8 threads
    I am 3 out of 8 threads
    I am 4 out of 8 threads
    I am 5 out of 8 threads
    I am 6 out of 8 threads
Another tiny code

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        omp_set_num_threads(8);
        int i;

    #pragma omp parallel for         /* parallel for-loop */
        for (i = 0; i < 10; i++)
            printf("I am %d executed by %d\n", i, omp_get_thread_num());
        return 0;
    }

Output (the ten iterations are distributed over the threads):

    I am 2 executed by 1
    I am 3 executed by 1
    I am 0 executed by 0
    I am 1 executed by 0
    I am 4 executed by 2
    I am 5 executed by 2
    I am 8 executed by 4
    I am 9 executed by 4
    I am 6 executed by 3
    I am 7 executed by 3
Disclosing the trick

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        omp_set_num_threads(8);
        int i;

    #pragma omp parallel      /* do the following in parallel (duplicated): all threads do this */
        {
            printf("I am thread %d\n", omp_get_thread_num());

    #pragma omp for           /* assign one thread per iteration */
            for (i = 0; i < 10; i++)
                printf("I am %d executed by %d\n", i, omp_get_thread_num());
        }
        return 0;
    }

Output:

    I am thread 0
    I am 0 executed by 0
    I am 1 executed by 0
    I am thread 1
    I am 2 executed by 1
    I am 3 executed by 1
    I am thread 2
    I am 4 executed by 2
    I am 5 executed by 2
    I am thread 5
    I am thread 6
    I am thread 7
    I am thread 4
    I am 8 executed by 4
    I am 9 executed by 4
    I am thread 3
    I am 6 executed by 3
    I am 7 executed by 3
Another tiny code (again)

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        omp_set_num_threads(8);
        int i;

    #pragma omp parallel for    /* this is actually a combination of parallel and for */
        for (i = 0; i < 10; i++)
            printf("I am %d executed by %d\n", i, omp_get_thread_num());
        return 0;
    }
Start parallel computations

#pragma omp parallel
- Executes the following computation in parallel
- The following computation can be a statement or a block
- A team of threads is created

    A;
    #pragma omp parallel
    B;
    C;

A is executed once; B is executed by every thread of the team; C is executed once after the team joins.
Setting number of threads

There are three ways, from weakest to strongest:
1. Environment variable: OMP_NUM_THREADS (weak)
2. Function: void omp_set_num_threads(int)
3. Clause: #pragma omp parallel num_threads(8) (strong)
Work-sharing

#pragma omp for
- Assigns one thread per iteration of the following for-loop
- The for-loop must be in a canonical form like: for (i = 0; i < n; i++)

#pragma omp single
- Only one of the threads executes the following computation
Some functions

void omp_set_num_threads(int);
- Sets the number of threads (for the next parallel execution)

int omp_get_num_threads(void);
- Returns the number of threads (of this parallel execution)

int omp_get_thread_num(void);
- Returns my thread ID

double omp_get_wtime(void);
- Returns the wall-clock time (in seconds)
Synchronization

#pragma omp barrier
- Wait until all the threads reach the barrier

Timing a code:

    #pragma omp barrier
    t0 = omp_get_wtime();
    do_computations();
    #pragma omp barrier
    time = omp_get_wtime() - t0;
BREAK
Three Pitfalls
1. Shared and Private Variables
2. Race Condition
3. Weak Consistency
Disclosing the trick (again)

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        omp_set_num_threads(8);
        int i;

    #pragma omp parallel
        {
            printf("I am thread %d\n", omp_get_thread_num());

    #pragma omp for
            for (i = 0; i < 10; i++)
                printf("I am %d executed by %d\n", i, omp_get_thread_num());
        }
        return 0;
    }

Every thread announces itself, and the ten iterations of the loop are divided among the threads.
What happens if?

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        omp_set_num_threads(8);
        int i;                       /* shared variable */

    #pragma omp parallel
        {
            printf("I am thread %d\n", omp_get_thread_num());

    //#pragma omp for
            for (i = 0; i < 10; i++)
                printf("I am %d executed by %d\n", i, omp_get_thread_num());
        }
        return 0;
    }

All threads loop for 10 iterations? Completely different! With omp for commented out, every thread executes the whole loop, and all of them increment the single shared variable i concurrently.
Thread-private variable

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        omp_set_num_threads(8);

    #pragma omp parallel
        {
            int i;                   /* private variable: one copy per thread */
            printf("I am thread %d\n", omp_get_thread_num());

    //#pragma omp for
            for (i = 0; i < 10; i++)
                printf("I am %d executed by %d\n", i, omp_get_thread_num());
        }
        return 0;
    }

Now each thread has its own i and loops over its own 10 iterations independently.
Shared and private variables

Shared variable
- The storage is accessible from all threads
- Updates must be done with care

Private variable
- A separate copy is allocated for each thread
- Allocated when the thread starts, and destroyed when the thread stops
Shared or private

Shared by default:
- Global variables
- Static variables
- Variables declared before omp parallel

Private by default:
- Variables declared within omp parallel
- Loop induction variable of omp for

    int func(int k, int *m)      /* k and m are private, but *m is n and thus shared */
    {
        int x;                   /* private */
        static int c = 0;        /* static: shared */
        ...
    }

    int q = 1024;                /* global: shared */

    int main(void)
    {
        int n = 32;              /* declared before omp parallel: shared */
    #pragma omp parallel
        {
            int z = func(q, &n); /* declared within omp parallel: private */
        }
    }
Clauses for the parallel construct

#pragma omp parallel [clause[[,] clause] ...]

- private(variable, ...): declares the listed variables as private
- shared(variable, ...): declares the listed variables as shared
- firstprivate(variable, ...): declares them as private, and initializes each copy with the value just before omp parallel
- ...and more clauses
My recommendation

Extract the parallel part as a function, and depend on the default shared/private settings:

    void do_comp(arg0, arg1)
    {
    #pragma omp parallel
        ...
    }

    do_comp(arg0, arg1);

- Necessary and sufficient information is passed as arguments
- Reduced accidental side effects
- Assignments to the arguments do not affect the caller's variables
- Side effects (updates of shared variables) are possible only via pointers, global variables, etc.
Three Pitfalls
1. Shared and Private Variables
2. Race Condition
3. Weak Consistency
Race condition

Count up solutions for each type:

    int counter;
    #pragma omp parallel
    {
        ...
        if (found) {
            type = get_type();
            counter++;
        }
        ...
    }

Race condition: multiple threads access the same variable concurrently. Here several threads may execute counter++ at the same time, and increments can be lost.
Reduction clause

reduction(operation: variable)
- Produces code for the reduction operation

    int counter;
    #pragma omp parallel reduction(+: counter)
    {
        ...
        if (found) {
            type = get_type();
            counter++;
        }
        ...
    }

Applicable only to scalar variables; no vector reduction.
Vector updates

Count up solutions for each type:

    int counter[n];
    #pragma omp parallel
    {
        ...
        if (found) {
            type = get_type();
            counter[type]++;
        }
        ...
    }

Different threads may update the same element counter[type] concurrently: again a race condition, and reduction does not apply to a vector.
Atomic Operation

#pragma omp atomic
- Executes the following statement as one inseparable operation
- Allowed operations: x binop= expr; x++; ++x; x--; --x;

    int counter[n];
    #pragma omp parallel
    {
        ...
        if (found) {
            type = get_type();
    #pragma omp atomic
            counter[type]++;
        }
        ...
    }
Three Pitfalls
1. Shared and Private Variables
2. Race Condition
3. Weak Consistency
Producer-Consumer Signal

This is not provided by OpenMP. Don't do the following!

    int data, flag = 0;
    #pragma omp parallel num_threads(2)
    {
        if (producer) {
            data = generate_data();
            flag = 1;
        } else {  // consumer
            while (flag == 0);    // wait until flag is set
            consume_data(data);
        }
    }

The producer writes the data and raises the flag; the consumer spins on the flag and then reads the data. This looks plausible, but it is broken, as the next slides explain.
Freedom of execution order

- The compiler can reorder operations, as long as it does not change the meaning of the sequential execution
- The compiler can keep data in registers, without writing it to main memory, for as long as it wants
- The hardware can reorder operations, as long as it does not change the meaning of the sequential execution
- The hardware can keep data in cache, without writing it to main memory, for as long as it wants

In short, the program does not run exactly as it is written!
Weak consistency

Consistency
- A set of restrictions on the execution of concurrent programs, so that the concurrent execution behaves similarly to sequential ones
- But every attempt at strict consistency resulted in severe performance degradation
- Still, we need some control over the execution order

Weak consistency
- The order of operations is guaranteed only at special commands
Memory synchronization

#pragma omp flush
- Every memory read and write issued before the flush is made complete
- No memory read or write after the flush has been started yet
- Rarely used by itself

Implicit flushes are automatically inserted:
- At barrier, atomic, and lock operations
- At entry to and exit from parallel, critical, and ordered
The solution

    int data;
    #pragma omp parallel
    {
        if (producer) {
            data = produce_data();
    #pragma omp barrier
        } else {  // consumer
    #pragma omp barrier
            consume_data(data);
        }
    }

Flush alone is not enough: the flush of the producer must come earlier than the flush of the consumer, and only a barrier enforces that ordering.
Barrier should be inserted

- Before writing data: wait until the threads that need the old data have actually read it
- After writing data: make the threads that will read the new data wait until the new data is written
- Before reading data: wait until the thread that produces the new data has actually produced it
- After reading data: keep the other threads from updating the data too early
Self-check questions

- Explain private and shared variables
- Which variables are private/shared by default?
- What is Suda's recommended style?
- What is a race condition?
- Show a few methods to resolve race conditions
- What is weak consistency?
- What does flush do?
- Where are implicit flushes inserted, and where are they not?
- Explain why a barrier is needed before and after both reading and writing shared data