Parallel Numerical Algorithms

Parallel Numerical Algorithms
http://sudalab.is.s.u-tokyo.ac.jp/~reiji/pna16/
[8] OpenMP

PNA16 Lecture Plan
General Topics
1. Architecture and Performance
2. Dependency
3. Locality
4. Scheduling
MIMD / Distributed Memory
5. MPI: Message Passing Interface
6. Collective Communication
7. Distributed Data Structure
MIMD / Shared Memory
8. OpenMP
9. Cache Performance
SIMD / Shared Memory
10. GPU and CUDA
11. SIMD Performance
Special Lectures
5/30 How to use FX10 (Prof. Ohshima)
6/6 Dynamic Parallelism (Prof. Peri)

Memory models
Distributed memory: each processor has its own memory, and processors communicate over a network.
Shared memory: Uniform Memory Access (UMA) and Non-Uniform Memory Access (NUMA).
[Diagrams: processors with private memories connected by a network; processors sharing one memory (UMA); processors with local memories that all processors can access (NUMA)]

Parallel Computer Nowadays
[Diagram: a system is a network of nodes (distributed memory, MIMD); a node contains cores sharing the node's memory (shared memory, MIMD); a core contains PUs and registers (shared memory, SIMD)]
Terminology:
Processor: any computing part (PU, core, or node)
Computer: may be equivalent to system
Socket: a set of cores on the same die / module
CPU: can be a socket or a core
Sequential or Serial: antonym of Parallel

OpenMP
A frequently used API for shared-memory parallel computing in high-performance computing. FX10 supports OpenMP version 3.0.
Shared memory, global view: you describe the whole data structure and the whole computation.
It is not automatic parallelization! It parallelizes only where you explicitly parallelize.
It does not guarantee correctness! It runs just as your code is written (not as you intended).

OpenMP Summary
A summary of OpenMP is available on the OpenMP web site.

A tiny code with OpenMP

    #include <stdio.h>
    #include <omp.h>                         /* include this header file */

    int main(void) {
        omp_set_num_threads(8);              /* the number of threads is set */
        #pragma omp parallel                 /* run the next block in parallel (duplicated) */
        {
            printf("I am %d out of %d threads\n",
                   omp_get_thread_num(),     /* my thread ID */
                   omp_get_num_threads());   /* number of threads (must be 8) */
        }
        return 0;
    }

Output:
    I am 1 out of 8 threads
    I am 7 out of 8 threads
    I am 0 out of 8 threads
    I am 2 out of 8 threads
    I am 3 out of 8 threads
    I am 4 out of 8 threads
    I am 5 out of 8 threads
    I am 6 out of 8 threads
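
To try the example, compile it with OpenMP support enabled and run the binary; with GCC, for instance, something like gcc -fopenmp tiny.c -o tiny (the file name is arbitrary) should work, and the eight output lines appear in a nondeterministic order.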

Another tiny code

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        omp_set_num_threads(8);
        int i;
        #pragma omp parallel for             /* parallel for-loop */
        for (i = 0; i < 10; i++)
            printf("I am %d executed by %d\n", i, omp_get_thread_num());
        return 0;
    }

Output:
    I am 2 executed by 1
    I am 3 executed by 1
    I am 0 executed by 0
    I am 1 executed by 0
    I am 4 executed by 2
    I am 5 executed by 2
    I am 8 executed by 4
    I am 9 executed by 4
    I am 6 executed by 3
    I am 7 executed by 3

Disclosing the trick

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        omp_set_num_threads(8);
        int i;
        #pragma omp parallel                 /* do the following in parallel (duplicated): all threads do this */
        {
            printf("I am thread %d\n", omp_get_thread_num());
            #pragma omp for                  /* assign one thread per iteration */
            for (i = 0; i < 10; i++)
                printf("I am %d executed by %d\n", i, omp_get_thread_num());
        }
        return 0;
    }

Output:
    I am thread 0
    I am 0 executed by 0
    I am 1 executed by 0
    I am thread 1
    I am 2 executed by 1
    I am 3 executed by 1
    I am thread 2
    I am 4 executed by 2
    I am 5 executed by 2
    I am thread 5
    I am thread 6
    I am thread 7
    I am thread 4
    I am 8 executed by 4
    I am 9 executed by 4
    I am thread 3
    I am 6 executed by 3
    I am 7 executed by 3

Another tiny code (again)
#pragma omp parallel for is actually a combination of parallel and for:

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        omp_set_num_threads(8);
        int i;
        #pragma omp parallel for             /* combination of parallel and for */
        for (i = 0; i < 10; i++)
            printf("I am %d executed by %d\n", i, omp_get_thread_num());
        return 0;
    }

Start parallel computations
#pragma omp parallel
Execute the following computation in parallel. The following computation can be a statement or a block. A team of threads is created.

    A;
    #pragma omp parallel
    B;
    C;

[Diagram: A is executed once, B is executed by every thread of the team, then C is executed once after the threads join]

Setting the number of threads
There are three ways, from weakest to strongest:
1. Environment variable OMP_NUM_THREADS (weak)
2. Function call: void omp_set_num_threads(int)
3. Clause: #pragma omp parallel num_threads(8) (strong)
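
A minimal sketch (my own example, not from the slides) showing how the stronger settings override the weaker ones: the function call overrides OMP_NUM_THREADS, and the num_threads clause overrides both.

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        /* 1. Weakest: the environment, e.g. "export OMP_NUM_THREADS=2" before running */

        omp_set_num_threads(4);                /* 2. overrides the environment variable */
        #pragma omp parallel
        {
            if (omp_get_thread_num() == 0)
                printf("first region: %d threads\n", omp_get_num_threads());
        }

        #pragma omp parallel num_threads(8)    /* 3. strongest: overrides the function call */
        {
            if (omp_get_thread_num() == 0)
                printf("second region: %d threads\n", omp_get_num_threads());
        }
        return 0;
    }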

Work-sharing
#pragma omp for
Assign one thread per iteration of the following for-loop. The for-loop must be something like for (i = 0; i < n; i++).
#pragma omp single
Only one of the threads executes the following computation.
[Diagram: the loop iterations are distributed among the threads; the single region is executed by one thread while the others wait]
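
A small sketch (an assumed example, not from the slides) that combines both work-sharing constructs in one parallel region: the loop iterations are split among the threads, and the closing message is printed exactly once.

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        int i;
        omp_set_num_threads(4);
        #pragma omp parallel
        {
            #pragma omp for                  /* iterations are divided among the threads */
            for (i = 0; i < 8; i++)
                printf("iteration %d by thread %d\n", i, omp_get_thread_num());

            #pragma omp single               /* exactly one thread executes this */
            printf("done (printed by thread %d)\n", omp_get_thread_num());
        }
        return 0;
    }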

Some functions
void omp_set_num_threads(int);   Sets the number of threads (for the next parallel execution)
int omp_get_num_threads(void);   Returns the number of threads (of this parallel execution)
int omp_get_thread_num(void);    Returns my thread ID
double omp_get_wtime(void);      Returns the wall-clock time (in seconds)

Synchronization
#pragma omp barrier
Wait until all the threads reach the barrier.
Timing a code:

    #pragma omp barrier
    t0 = omp_get_wtime();
    do_computations();
    #pragma omp barrier
    time = omp_get_wtime() - t0;
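
Expanded into a complete program, the timing pattern might look like the sketch below (do_computations is a placeholder for the real work; each thread keeps t0 and time in private variables and reports its own measurement).

    #include <stdio.h>
    #include <omp.h>

    void do_computations(void) {
        volatile double x = 0.0;             /* placeholder work */
        for (int i = 0; i < 10000000; i++)
            x += 1.0;
    }

    int main(void) {
        omp_set_num_threads(4);
        #pragma omp parallel
        {
            double t0, time;                 /* declared inside: private to each thread */
            #pragma omp barrier              /* all threads start timing together */
            t0 = omp_get_wtime();
            do_computations();
            #pragma omp barrier              /* wait until every thread has finished */
            time = omp_get_wtime() - t0;
            printf("thread %d: %f seconds\n", omp_get_thread_num(), time);
        }
        return 0;
    }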

BREAK

Three Pitfalls
1. Shared and Private Variables
2. Race Condition
3. Weak Consistency

Disclosing the trick (again)

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        omp_set_num_threads(8);
        int i;
        #pragma omp parallel
        {
            printf("I am thread %d\n", omp_get_thread_num());
            #pragma omp for
            for (i = 0; i < 10; i++)
                printf("I am %d executed by %d\n", i, omp_get_thread_num());
        }
        return 0;
    }

(Same code and output as before: each thread prints its ID, and the ten loop iterations are distributed among the threads.)

What happens if?

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        omp_set_num_threads(8);
        int i;
        #pragma omp parallel
        {
            printf("I am thread %d\n", omp_get_thread_num());
            //#pragma omp for
            for (i = 0; i < 10; i++)
                printf("I am %d executed by %d\n", i, omp_get_thread_num());
        }
        return 0;
    }

All threads loop for 10 iterations? Completely different!

Shared variable
In the code above, i is declared before omp parallel and is therefore a shared variable: all the threads increment the one shared i concurrently.
[Diagram: every thread executes ++ on the same shared i]

Thread-private variable

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        omp_set_num_threads(8);
        #pragma omp parallel
        {
            int i;                           /* declared inside: private variable */
            printf("I am thread %d\n", omp_get_thread_num());
            //#pragma omp for
            for (i = 0; i < 10; i++)
                printf("I am %d executed by %d\n", i, omp_get_thread_num());
        }
        return 0;
    }

[Diagram: each thread increments its own private i]

Shared and private variables
Shared variable: the storage is accessible from all threads; updates must be done carefully.
Private variable: different storage is allocated for each thread; it is allocated when the thread starts and destroyed when the thread stops.

Shared or private?
Shared by default: global variables, static variables, and variables declared before omp parallel.
Private by default: variables declared within omp parallel, and the loop induction variable of omp for.

    int q = 1024;                  /* global: shared */

    int func(int k, int *m) {      /* k and m are private, but *m is n and thus shared */
        int x;                     /* private */
        static int c = 0;          /* static: shared */
        ...
    }

    int main(void) {
        int n = 32;                /* declared before omp parallel: shared */
        #pragma omp parallel
        {
            int z = func(q, &n);   /* declared within omp parallel: private */
        }
    }

Clauses for the parallel construct
#pragma omp parallel [clause[[,] clause] ...]
private(variable, ...)       Declares the listed variables as private
shared(variable, ...)        Declares the listed variables as shared
firstprivate(variable, ...)  Declares them as private and initializes each with the value from just before omp parallel
...and more clauses
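
A small sketch (my own example) contrasting private and firstprivate: inside the region the private copy of x starts with an undefined value, the firstprivate copy of y starts at 10, and after the region both original variables are unchanged.

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        int x = 10, y = 10;
        #pragma omp parallel num_threads(2) private(x) firstprivate(y)
        {
            x = omp_get_thread_num();        /* x is uninitialized here, so assign before use */
            y = y + omp_get_thread_num();    /* y starts at 10 in every thread */
            printf("thread %d: x = %d, y = %d\n", omp_get_thread_num(), x, y);
        }
        printf("after the region: x = %d, y = %d\n", x, y);   /* still 10 and 10 */
        return 0;
    }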

My recommendation
Extract the parallel part as a function, and rely on the default settings of shared / private:

    void do_comp(arg0, arg1) {
        ...
    }

    #pragma omp parallel
    do_comp(arg0, arg1);

Necessary and sufficient information is passed as arguments, which reduces accidental side effects: assignments to the arguments do not affect the caller's variables, and side effects (updates of shared variables) are possible only via pointers, global variables, etc.
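
Filled in as a complete program, the recommended style might look like the following sketch (the function name, arguments, and the doubling loop are mine, for illustration): the loop counter is local to do_comp and therefore private, and the only shared state reachable from inside is the array behind the pointer.

    #include <stdio.h>
    #include <omp.h>

    /* the parallel part, extracted as a function:
       n and a are private copies of the caller's arguments,
       while the array elements behind a remain shared */
    void do_comp(int n, double *a) {
        int i;                           /* local, hence private to each thread */
        #pragma omp for
        for (i = 0; i < n; i++)
            a[i] = 2.0 * a[i];
    }

    int main(void) {
        double a[100];
        for (int i = 0; i < 100; i++)
            a[i] = (double)i;

        omp_set_num_threads(4);
        #pragma omp parallel
        do_comp(100, a);                 /* every thread calls it; omp for splits the loop */

        printf("a[99] = %f\n", a[99]);   /* prints 198.000000 */
        return 0;
    }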

Three Pitfalls
1. Shared and Private Variables
2. Race Condition
3. Weak Consistency

Race condition
Counting up the solutions of each type:

    int counter;
    #pragma omp parallel
    {
        ...
        if (found) {
            type = get_type();
            counter++;
        }
        ...
    }

[Diagram: several threads execute counter++ on the same shared counter]
Race condition: multiple threads access the same variable concurrently.

Reduction clause
reduction(operation: variable)
Generates code for the reduction operation:

    int counter;
    #pragma omp parallel reduction(+: counter)
    {
        ...
        if (found) {
            type = get_type();
            counter++;
        }
        ...
    }

Applicable only to scalar variables; there is no vector reduction.
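
A self-contained sketch (an invented example, not from the slides) of a reduction: every thread accumulates into its own private copy of sum, and OpenMP combines the copies when the loop ends, so the result is always 5050.

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        int i, sum = 0;
        omp_set_num_threads(4);
        #pragma omp parallel for reduction(+: sum)
        for (i = 1; i <= 100; i++)
            sum += i;                    /* each thread adds into its private copy */
        printf("sum = %d\n", sum);       /* 5050, with no race condition */
        return 0;
    }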

Vector updates
Counting up the solutions for each type:

    int counter[n];
    #pragma omp parallel
    {
        ...
        if (found) {
            type = get_type();
            counter[type]++;
        }
        ...
    }

[Diagram: several threads update elements of the same shared array counter]
The reduction clause cannot be used here, so the updates must be protected in another way.

Atomic operation
#pragma omp atomic
Execute the following statement as an inseparable single operation.
Allowed operations: x binop= expr; x++; ++x; x--; --x;

    int counter[n];
    #pragma omp parallel
    {
        ...
        if (found) {
            type = get_type();
            #pragma omp atomic
            counter[type]++;
        }
        ...
    }
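
A complete sketch of this histogram pattern (the input and the stand-in for get_type() are invented for illustration): atomic protects each increment, so concurrent updates of the same bin do not race.

    #include <stdio.h>
    #include <omp.h>

    #define NTYPES 4

    int main(void) {
        int counter[NTYPES] = {0};
        int i;
        omp_set_num_threads(4);
        #pragma omp parallel for
        for (i = 0; i < 1000; i++) {
            int type = i % NTYPES;       /* stand-in for get_type() */
            #pragma omp atomic
            counter[type]++;             /* inseparable read-modify-write */
        }
        for (i = 0; i < NTYPES; i++)
            printf("type %d: %d\n", i, counter[i]);    /* 250 in every bin */
        return 0;
    }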

Three Pitfalls
1. Shared and Private Variables
2. Race Condition
3. Weak Consistency

Producer-consumer signalling
This is not provided by OpenMP. Don't do the following!

    int data, flag = 0;
    #pragma omp parallel num_threads(2)
    if (producer) {
        data = generate_data();
        flag = 1;
    } else {   // consumer
        while (flag == 0);   // wait until flag is set
        consume_data(data);
    }

[Diagram: the producer writes data, the consumer reads it]

Freedom of execution order
The compiler can reorder operations, as long as that does not change the meaning of a sequential execution. The compiler can also keep data in registers, without writing it to main memory, for as long as it wants.
The hardware can reorder operations, as long as that does not change the meaning of a sequential execution. The hardware can also keep data in cache, without writing it to main memory, for as long as it wants.
In short, the program does not run exactly as it is written!
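
To make this concrete, here is a sketch (a hypothetical transformation, shown only for illustration) of what a compiler is allowed to do to the consumer's busy-wait from the previous slide, since nothing tells it that another thread may change flag:

    /* what the programmer wrote */
    while (flag == 0)
        ;                        /* wait until the producer sets flag */

    /* what the compiler may effectively produce: flag is read once
       and kept in a register, so the loop can spin forever */
    int tmp = flag;
    while (tmp == 0)
        ;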

Weak consistency
Consistency: a set of restrictions on the execution of concurrent programs, so that the concurrent execution behaves similarly to a sequential one. Every attempt to enforce such strong consistency, however, has resulted in severe performance degradation, yet we still need some control over execution order.
Weak consistency: the order of operations is guaranteed only at special commands.

Memory synchronization
#pragma omp flush
Every memory read and write operation issued before the flush is made complete, and no memory read or write operation issued after the flush has been started yet.
Flush is rarely used by itself: it is automatically inserted at barrier, atomic and lock operations, and at entry to and exit from parallel, critical and ordered.

The solution

    int data;
    #pragma omp parallel
    if (producer) {
        data = produce_data();
        #pragma omp barrier
    } else {   // consumer
        #pragma omp barrier
        consume_data(data);
    }

[Diagram: produce_data and consume_data separated by a barrier]
Flush alone is not enough: the flush of the producer must happen earlier than the flush of the consumer, and the barrier enforces that ordering.
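
Expanded into a runnable sketch (my own example: thread 0 plays the producer, thread 1 the consumer, and the produce/consume calls are replaced by placeholders), with the if/else restructured so that both threads meet at one and the same barrier:

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        int data = 0;
        #pragma omp parallel num_threads(2)
        {
            if (omp_get_thread_num() == 0)       /* producer */
                data = 42;                       /* stand-in for produce_data() */

            #pragma omp barrier                  /* flush + wait: data is now visible */

            if (omp_get_thread_num() == 1)       /* consumer */
                printf("consumed %d\n", data);   /* stand-in for consume_data(data) */
        }
        return 0;
    }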

Where barriers should be inserted
Before writing data: to wait until the threads that need to read the old data have actually read it.
After writing data: to make the threads that will read the new data wait until it has been written.
Before reading data: to wait until the thread that produces the new data has actually produced it.
After reading data: to keep the other threads from updating the data before it has been read.

Self-check questions
Explain private and shared variables. Which variables are private/shared by default?
What is Suda's recommended style?
What is a race condition? Show a few methods to resolve race conditions.
What is weak consistency? What does flush do? Where are implicit flushes inserted, and where are they not?
Explain why a barrier is needed before and after reading and writing shared data.
