Parallel Programming using OpenMP


Parallel Programming using OpenMP
Mike Bailey, mjb@cs.oregonstate.edu (openmp.pptx)

OpenMP Multithreaded Programming

- OpenMP stands for Open Multi-Processing.
- OpenMP is a multi-vendor (see next page) standard to perform shared-memory multithreading.
- OpenMP uses the fork-join model.
- OpenMP is both directive- and library-based.
- OpenMP threads share a single executable, global memory, and heap (malloc, new).
- Each OpenMP thread has its own stack (function arguments, function return address, local variables).
- Using OpenMP requires no dramatic code changes.
- OpenMP probably gives you the biggest multithreading benefit per amount of work you have to put into using it.

Much of your use of OpenMP will be accomplished by issuing C/C++ pragmas that tell the compiler how to build the threads into the executable:

    #pragma omp directive [clause]

Who is in the OpenMP Consortium?

What OpenMP Isn't:

- OpenMP doesn't check for data dependencies, data conflicts, deadlocks, or race conditions. You are responsible for avoiding those yourself.
- OpenMP doesn't check for non-conforming code sequences.
- OpenMP doesn't guarantee identical behavior across vendors or hardware, or even between multiple runs on the same vendor's hardware.
- OpenMP doesn't guarantee the order in which threads execute, just that they do execute.
- OpenMP is not overhead-free.
- OpenMP does not prevent you from writing false-sharing code (in fact, it makes it really easy). We will get to false sharing in the cache notes.

Memory Allocation in a Multithreaded Program

[Figure: memory-layout diagram. A one-thread program has one stack, plus the program executable, globals, and heap. A multithreaded program has one stack per thread, while the program executable, globals, and heap are common to all threads.]

Don't take this completely literally. The exact arrangement depends on the operating system and the compiler. For example, sometimes the stack and heap are arranged so that they grow towards each other.

Using OpenMP on Linux:

    g++  -o proj  proj.cpp  -lm  -fopenmp
    icpc -o proj  proj.cpp  -lm  -openmp  -align  -qopt-report=3  -qopt-report-phase=vec

Using OpenMP in Microsoft Visual Studio:

1. Go to the Project menu -> Project Properties
2. Change the setting Configuration Properties -> C/C++ -> Language -> OpenMP Support to "Yes (/openmp)"

Seeing if OpenMP is Supported on Your System:

    #ifndef _OPENMP
        fprintf( stderr, "OpenMP is not supported -- sorry!\n" );
        exit( 0 );
    #endif

Number of OpenMP Threads

Two ways to specify how many OpenMP threads you want to have available:

1. Set the OMP_NUM_THREADS environment variable
2. Call omp_set_num_threads( num );

Asking how many cores this program has access to:

    num = omp_get_num_procs( );

Setting the number of threads to the exact number of cores available (note that omp_set_num_threads returns nothing):

    omp_set_num_threads( omp_get_num_procs( ) );

Asking how many OpenMP threads this program is using right now:

    num = omp_get_num_threads( );

Asking which thread this one is:

    me = omp_get_thread_num( );

Creating an OpenMP Team of Threads

    #pragma omp parallel default(none)
    {
        . . .
    }

This creates a team of threads. Each thread then executes all lines of code in this block.

Creating an OpenMP Team of Threads

    #include <stdio.h>
    #include <omp.h>

    int main( )
    {
        omp_set_num_threads( 8 );

        #pragma omp parallel default(none)
        {
            printf( "Hello, World, from thread #%d!\n", omp_get_thread_num( ) );
        }

        return 0;
    }

Hint: run it several times in a row. What do you see? Why?

Uh-oh. Four runs, eight "Hello, World, from thread #N!" lines each, printed in four different orders:

    First Run:   6, 1, 7, 5, 4, 3, 2, 0
    Second Run:  0, 7, 4, 6, 1, 3, 5, 2
    Third Run:   2, 5, 0, 7, 1, 3, 4, 6
    Fourth Run:  1, 3, 5, 2, 4, 7, 6, 0

There is no guarantee of thread execution order!

Creating OpenMP Threads in Loops

    #include <omp.h>
    . . .
    omp_set_num_threads( NUMT );
    . . .
    #pragma omp parallel for default(none)
    for( int i = 0; i < num; i++ )
    {
        . . .
    }

- The code starts out executing in a single thread.
- omp_set_num_threads sets how many threads will be in the thread pool.
- #pragma omp parallel for tells the compiler to parallelize the for-loop into multiple threads: it creates a team of threads from the thread pool and divides the for-loop passes up among those threads.
- Each thread automatically gets its own personal copy of the variable i because it is defined within the for-loop body.
- The default(none) clause forces you to explicitly declare all variables declared outside the parallel region to be either private or shared while they are in the parallel region. Variables declared within the for-loop body are automatically private.
- There is an implied barrier at the end, where each thread waits until all threads are done; then the code continues in a single thread.

OpenMP for-loop Rules

    #pragma omp parallel for default(none), shared( ), private( )
    for( int index = start ; index terminate-condition ; index changed )

- The index must be an int or a pointer.
- The start and terminate conditions must have compatible types.
- Neither the start nor the terminate condition can be changed during the execution of the loop.
- The index can only be modified by the changed expression (i.e., not modified inside the loop body itself).
- There can be no inter-loop data dependencies, such as:

      a[ i ] = a[ i-1 ] + 1.;

  because what if these two iterations end up being given to two different threads:

      a[101] = a[100] + 1.;
      a[102] = a[101] + 1.;

OpenMP For-Loop Rules

The general legal form is:

    for( index = start ; index-compared-to-end ; index-change )

where the comparison is one of:

    index < end      index <= end      index > end      index >= end

and the index change is one of:

    index++          ++index           index--          --index
    index += incr    index = index + incr    index = incr + index
    index -= decr    index = index - decr

OpenMP Directive Data Types

I recommend that you use default(none) in all your OpenMP directives. This will force you to explicitly flag all variables declared outside the parallel region as shared or private. This will help prevent mistakes.

- private(x) means that each thread will have its own copy of the variable x.
- shared(x) means that all threads will share a common x. This is potentially dangerous.

Example:

    #pragma omp parallel for default(none),private(i,j),shared(x)

Single Program Multiple Data (SPMD) in OpenMP

    #define NUM 1000000
    float A[NUM], B[NUM], C[NUM];
    . . .
    #pragma omp parallel default(none), private(me,total)
    {
        total = omp_get_num_threads( );   // must be asked inside the parallel region
        me    = omp_get_thread_num( );
        DoWork( me, total );
    }

    void DoWork( int me, int total )
    {
        int first = NUM *  me    / total;
        int last  = NUM * (me+1) / total  -  1;
        for( int i = first; i <= last; i++ )
        {
            C[i] = A[i] * B[i];
        }
    }

OpenMP Allocation of Work to Threads

- Static threads: all of the work is divided up and assigned to the threads before the loop starts executing.
- Dynamic threads: consists of one master and a pool of threads. The pool is assigned some of the work at runtime, but not all of it. When a thread from the pool becomes idle, the master gives it a new assignment (round-robin assignments).

OpenMP Scheduling:

    schedule(static [,chunksize])
    schedule(dynamic [,chunksize])

- Defaults to static.
- chunksize defaults to 1.
- In static, the iterations are assigned to threads before the loop starts.

OpenMP Allocation of Work to Threads

    #pragma omp parallel for default(none), schedule(static,chunksize)
    for( int index = 0 ; index < 12 ; index++ )

With 3 threads, the 12 iterations are assigned like this:

    Static,1:   thread 0: 0,3,6,9      thread 1: 1,4,7,10     thread 2: 2,5,8,11
    Static,2:   thread 0: 0,1,6,7      thread 1: 2,3,8,9      thread 2: 4,5,10,11
    Static,4:   thread 0: 0,1,2,3      thread 1: 4,5,6,7      thread 2: 8,9,10,11

- chunksize = 1: each thread is assigned one iteration, then the assignments start over.
- chunksize = 2: each thread is assigned two iterations, then the assignments start over.
- chunksize = 4: each thread is assigned four iterations, then the assignments start over.

Arithmetic Operations Among Threads -- A Problem

    #pragma omp parallel for private(mypartialsum),shared(sum)
    for( int i = 0; i < N; i++ )
    {
        float mypartialsum = . . .

        sum = sum + mypartialsum;
    }

There is no guarantee when each thread will execute the sum line correctly. There is not even a guarantee that each thread will finish this line before some other thread interrupts it. (Remember that each line of code usually generates multiple lines of assembly.) This is non-deterministic!

Assembly code:

    Load  sum
    Add   mypartialsum      <-- what if the scheduler decides to switch threads right here?
    Store sum

Conclusion: don't do it this way!

Here's a trapezoid integration example (covered in another note set). The partial sums are added up, as shown on the previous page. The integration was done 30 times. The answer is supposed to be exactly 2. None of the 30 answers is even close. And, not only are the answers bad, they are not even consistently bad!

    0.469635  0.517984  0.438868  0.437553  0.398761  0.506564
    0.489211  0.584810  0.476670  0.530668  0.500062  0.672593
    0.411158  0.408718  0.523448  0.398893  0.446419  0.431204
    0.501783  0.334996  0.484124  0.506362  0.448226  0.434737
    0.444919  0.442432  0.548837  0.363092  0.544778  0.356299

[Figure: scatter plot of sum vs. trial number for the 30 runs, showing the scattered, inconsistent results.]

Don't do it this way!

Arithmetic Operations Among Threads -- Three Solutions

1. Atomic:

       #pragma omp atomic
       sum = sum + mypartialsum;

   - Fixes the non-deterministic problem, but serializes the code.
   - Tries to use a built-in hardware instruction.
   - Operators include +, -, *, /, ++, --, >>, <<, ^, |
   - Operators include +=, -=, *=, /=, etc.

2. Critical:

       #pragma omp critical
       sum = sum + mypartialsum;

   - Also fixes it, but serializes the code.
   - Disables scheduler interrupts during the critical section.

3. Reduction:

       #pragma omp parallel for reduction(+:sum),private(mypartialsum)
       . . .
       sum = sum + mypartialsum;

   - Performs the combining (sum, product, and, or, ...) in O(log2 N) time instead of O(N) time.
   - Secretly creates a private sum variable for each thread, fills them all separately, then adds them together in O(log2 N) time.
   - Operators include +, -, *, /, ++, --
   - Operators include +=, -=, *=, /=
   - Operators include ^=, |=, &=

If You Understand NCAA Basketball Brackets, You Understand Reduction. (Source: ESPN)

Reduction vs. Atomic vs. Critical

[Figure: measured performance, in evaluations per second, comparing reduction, atomic, and critical.]

Why Not Do Reduction by Creating Your Own sums Array, One for Each Thread?

    float *sums = new float [ omp_get_max_threads( ) ];   // omp_get_num_threads( ) would return 1 out here
    for( int i = 0; i < omp_get_max_threads( ); i++ )
        sums[i] = 0.;

    #pragma omp parallel for private(mypartialsum),shared(sums)
    for( int i = 0; i < N; i++ )
    {
        float mypartialsum = . . .

        sums[ omp_get_thread_num( ) ] += mypartialsum;
    }

    float sum = 0.;
    for( int i = 0; i < omp_get_max_threads( ); i++ )
        sum += sums[i];
    delete [ ] sums;

This seems perfectly reasonable, it works, and it gets rid of the problem of multiple threads trying to write into the same reduction variable. The reason we don't do this is that this method provokes a problem called false sharing. We will get to that when we discuss caching.

Synchronization

Mutual exclusion locks (mutexes):

    omp_init_lock( omp_lock_t * );
    omp_set_lock( omp_lock_t * );
    omp_unset_lock( omp_lock_t * );
    omp_test_lock( omp_lock_t * );

- (omp_lock_t is really an array of 4 unsigned chars.)
- omp_set_lock blocks if the lock is not available, then sets it and returns when it is available.
- omp_test_lock does not block: if the lock is not available, it returns 0; if the lock is available, it sets it and returns !0.

Critical sections:

    #pragma omp critical

Restricts execution to one thread at a time.

    #pragma omp single

Restricts execution to a single thread ever.

Barriers:

    #pragma omp barrier

Forces each thread to wait here until all threads arrive. (Note: there is an implied barrier after parallel for loops and OpenMP sections, unless the nowait clause is used.)

Synchronization Examples

    omp_lock_t Sync;
    . . .
    omp_init_lock( &Sync );
    . . .
    omp_set_lock( &Sync );
    << code that needs the mutual exclusion >>
    omp_unset_lock( &Sync );
    . . .
    while( omp_test_lock( &Sync ) == 0 )
    {
        DoSomeUsefulWork( );
    }

Creating Sections of OpenMP Code

Sections are independent blocks of code, able to be assigned to separate threads if they are available.

    #pragma omp parallel sections
    {
        #pragma omp section
        {
            Task 1
        }

        #pragma omp section
        {
            Task 2
        }
    }

There is an implied barrier at the end.

What do OpenMP Sections do for You?

- omp_set_num_threads( 1 ): Tasks 1, 2, and 3 run one after another on the single thread.
- omp_set_num_threads( 2 ): two of the tasks start concurrently; the third begins as soon as one of the two threads becomes free.
- omp_set_num_threads( 3 ): Tasks 1, 2, and 3 all run concurrently, one per thread.

OpenMP Tasks

- An OpenMP task is a single line of code or a structured block which is immediately assigned to one thread in the current thread team.
- The task can be executed immediately, or it can be placed on its thread's list of things to do.
- If the if clause is used and its argument evaluates to 0, then the task is executed immediately, superseding whatever else that thread is doing.
- There has to be an existing parallel thread team for this to work. Otherwise one thread ends up doing all the tasks.
- One of the best uses of this is to make a function call. That function then runs concurrently until it completes:

      #pragma omp task
      Watch_For_Internet_Input( );

- You can create a task barrier with:

      #pragma omp taskwait

- Tasks are very much like OpenMP sections, but sections are more static; that is, the number of sections is set when you write the code, whereas tasks can be created anytime, and in any number, under control of your program's logic.

OpenMP Task Example: Processing Each Element of a Linked List

    #pragma omp parallel
    {
        #pragma omp single
        {
            element *p = listhead;
            while( p != NULL )
            {
                #pragma omp task firstprivate(p)
                Process( p );

                p = p->next;
            }
        }
    }