ME759 High Performance Computing for Engineering Applications

ME759 High Performance Computing for Engineering Applications
Parallel Computing on Multicore CPUs
October 25, 2013
Dan Negrut, 2013, ME964, UW-Madison

"A programming language is low level when its programs require attention to the irrelevant." -- Alan Perlis

Before We Get Started

Last time:
- Parallel computing on the CPU
- Got started with OpenMP for parallel computing on multicore CPUs

Today:
- Continue the OpenMP discussion: sections, tasks, data scoping, synchronization

Miscellaneous:
- Exam on Monday, November 25, at 7:15 PM, Room 1163ME
- Review session held during the regular class hour that Monday

Function Level Parallelism

  a = alice();
  b = bob();
  s = boss(a, b);
  c = cy();
  printf("%6.2f\n", bigboss(s, c));

alice, bob, and cy can be computed in parallel. (The slide's dependency diagram shows boss depending on alice and bob, and bigboss depending on boss and cy.)
[IOMPP]

omp sections

#pragma omp sections (note the "s" at the end)
- Must be inside a parallel region
- Precedes a code block containing N sub-blocks of code that may be executed concurrently by N threads
- Encompasses each omp section, see below

#pragma omp section (no "s" at the end)
- Precedes each sub-block of code within the encompassing block described above
- Enclosed program segments are distributed for parallel execution among the available threads
[IOMPP]

Functional Level Parallelism Using omp sections

  #pragma omp parallel sections
  {
    #pragma omp section
    double a = alice();
    #pragma omp section
    double b = bob();
    #pragma omp section
    double c = cy();
  }
  double s = boss(a, b);
  printf("%6.2f\n", bigboss(s, c));
[IOMPP]

Advantage of Parallel Sections [IOMPP]

Independent sections of code can execute concurrently, which reduces execution time.

  #pragma omp parallel sections
  {
    #pragma omp section
    phase1();
    #pragma omp section
    phase2();
    #pragma omp section
    phase3();
  }

(The slide's timeline figure contrasts serial and parallel execution flow: the pink and green tasks execute at no additional time penalty, in the shadow of the purple task.)

sections, Example

  #include <stdio.h>
  #include <omp.h>

  int main() {
    printf("start with 2 procs\n");

    #pragma omp parallel sections num_threads(2)
    {
      #pragma omp section
      {
        printf("start work 1\n");
        double startTime = omp_get_wtime();
        while ((omp_get_wtime() - startTime) < 2.0);
        printf("finish work 1\n");
      }
      #pragma omp section
      {
        printf("start work 2\n");
        double startTime = omp_get_wtime();
        while ((omp_get_wtime() - startTime) < 2.0);
        printf("finish work 2\n");
      }
      #pragma omp section
      {
        printf("start work 3\n");
        double startTime = omp_get_wtime();
        while ((omp_get_wtime() - startTime) < 2.0);
        printf("finish work 3\n");
      }
    }
    return 0;
  }

sections, Example: 2 threads (the slide shows the console output of a run)

sections, Example

  #include <stdio.h>
  #include <omp.h>

  int main() {
    printf("start with 4 procs\n");

    #pragma omp parallel sections num_threads(4)
    {
      #pragma omp section
      {
        printf("start work 1\n");
        double startTime = omp_get_wtime();
        while ((omp_get_wtime() - startTime) < 2.0);
        printf("finish work 1\n");
      }
      #pragma omp section
      {
        printf("start work 2\n");
        double startTime = omp_get_wtime();
        while ((omp_get_wtime() - startTime) < 6.0);
        printf("finish work 2\n");
      }
      #pragma omp section
      {
        printf("start work 3\n");
        double startTime = omp_get_wtime();
        while ((omp_get_wtime() - startTime) < 2.0);
        printf("finish work 3\n");
      }
    }
    return 0;
  }

sections, Example: 4 threads (the slide shows the console output of a run)

Work Plan

- What is OpenMP?
- Parallel regions
- Work sharing
- Tasks
- Data environment
- Synchronization
- Advanced topics

OpenMP Tasks

The task is the most important feature added in version 3.0 of OpenMP. It allows parallelization of irregular problems:
- Unbounded loops
- Recursive algorithms
- Producer/consumer patterns
[IOMPP]

Tasks: What Are They?

- Tasks are independent units of work
- A thread is assigned to perform a task
- Tasks might be executed immediately or might be deferred; the OS and the runtime decide which
- Tasks are composed of: code to execute, a data environment, and internal control variables (ICVs)
(The slide's figure contrasts serial and parallel execution over time.)
[IOMPP]

Tasks: What Are They? [More specifics]

- Code to execute: the literal code in your program enclosed by the task directive
- Data environment: the shared and private data manipulated by the task
- Internal control variables: thread-scheduling and environment-variable-type controls

A task is a specific instance of executable code and its data environment, generated when a thread encounters a task construct. Two activities are involved: (1) packaging and (2) execution. A thread packages new instances of a task (code and data); some thread in the team executes the task at some later time.
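
A minimal sketch of the packaging/execution split described above (not from the slides; the thread count, task count, and messages are made up for illustration):

  #include <stdio.h>
  #include <omp.h>

  int main() {
      #pragma omp parallel num_threads(4)
      {
          #pragma omp single            // one thread packages the tasks...
          {
              for (int i = 0; i < 8; ++i) {
                  #pragma omp task firstprivate(i)   // i is captured at packaging time
                  printf("task %d executed by thread %d\n", i, omp_get_thread_num());
              }
          } // ...any thread in the team may execute them before the implicit barrier here
      }
      return 0;
  }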

  #include <omp.h>
  #include <list>
  #include <iostream>
  #include <math.h>

  using namespace std;
  typedef list<double> LISTDBL;

  void doSomething(LISTDBL::iterator& itrtr) {
    *itrtr *= 2.;
  }

  int main() {
    LISTDBL test;  // default constructor
    LISTDBL::iterator it;

    for (int i = 0; i < 4; ++i)
      for (int j = 0; j < 8; ++j)
        test.insert(test.end(), pow(10.0, i + 1) + j);

    for (it = test.begin(); it != test.end(); it++) cout << *it << endl;

    it = test.begin();
    #pragma omp parallel num_threads(8)
    {
      #pragma omp single
      {
        while (it != test.end()) {
          #pragma omp task firstprivate(it)  // each task captures the current value of it
          {
            doSomething(it);
          }
          it++;
        }
      }
    }

    for (it = test.begin(); it != test.end(); it++) cout << *it << endl;
    return 0;
  }

The slide shows the list's initial values and its final values (each entry doubled) side by side.

Compile like:
  $ g++ -fopenmp -o testomp.exe testomp.cpp

Task Construct: Explicit Task View

- A team of threads is created at the omp parallel construct
- A single thread is chosen to execute the while loop; call this thread "L"
- Thread L runs the while loop, creates tasks, and fetches next pointers
- Each time L crosses the omp task construct it generates a new task, which has a thread assigned to it
- All tasks complete at the barrier at the end of the parallel region's construct
- Each task has its own stack space, which is destroyed when the task is completed

  #pragma omp parallel // threads are ready to go now
  {
    #pragma omp single
    { // block 1
      node *p = head_of_list;
      while (p != listend) { // block 2
        #pragma omp task firstprivate(p)   // each task captures the current p
          process(p);
        p = p->next; // block 3
      }
    }
  }

See an example in a bit.
[IOMPP]

Why Are Tasks Useful?

They have the potential to parallelize irregular patterns and recursive function calls.

  #pragma omp parallel // threads are ready to go now
  {
    #pragma omp single
    { // block 1
      node *p = head_of_list;
      while (p) { // block 2
        #pragma omp task firstprivate(p)
          process(p);
        p = p->next; // block 3
      }
    }
  }

(The slide's timeline figure compares a single-threaded run of blocks 1-3 against four threads Thr1-Thr4: the per-node work of block 2 overlaps across threads, and the difference shows up as time saved, with some threads idle toward the end.)

How about synchronization issues?
[IOMPP]

Tasks: Synchronization Issues

Setup: assume Task B specifically relies on the completion of Task A. You need to be in a position to guarantee completion of Task A before invoking the execution of Task B.

Tasks are guaranteed to be complete at thread or task barriers:
- At the directive: #pragma omp barrier
- At the directive: #pragma omp taskwait
[IOMPP]

Task Completion Example

  #pragma omp parallel
  {
    #pragma omp task        // multiple foo tasks created here, one for each thread
    foo();

    #pragma omp barrier     // all foo tasks guaranteed to be completed here

    #pragma omp single
    {
      #pragma omp task      // one bar task created here
      bar();
    }                       // bar task guaranteed to be completed here
  }
[IOMPP]
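
For the Task A/Task B dependency described two slides back, taskwait provides the same guarantee without a full thread barrier. A minimal sketch, not from the slides (taskA and taskB are placeholder functions):

  #include <stdio.h>
  #include <omp.h>

  void taskA(void) { printf("A done\n"); }
  void taskB(void) { printf("B starts after A\n"); }

  int main(void) {
      #pragma omp parallel
      #pragma omp single
      {
          #pragma omp task
          taskA();

          #pragma omp taskwait   // child tasks of this region (here: taskA) must finish first

          #pragma omp task
          taskB();               // safe: taskA is guaranteed to have completed
      }
      return 0;
  }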

Comments: sections vs. tasks

- sections have a static attribute: things are mostly settled at compile time
- The task construct is more recent and more sophisticated. It has a dynamic attribute: things are figured out at run time, and under the hood the construct relies on the presence of a scheduling agent
- Tasks can encapsulate any block of code and can handle nested loops and scenarios where the number of jobs is not known in advance
- The runtime system generates and executes the tasks, either at implicit synchronization points in the program or under explicit control of the programmer

NOTE: It is the developer's responsibility to ensure that different tasks can be executed concurrently.

Work Plan

- What is OpenMP?
- Parallel regions
- Work sharing
- Data scoping
- Synchronization
- Advanced topics

Data Scoping: What's Shared

- OpenMP uses a shared-memory programming model
- Shared variable: a variable that can be read or written by multiple threads
- The shared clause can be used to make items explicitly shared
- Global variables are shared by default among tasks
- Other examples of variables that are shared among threads:
  - File-scope variables
  - Namespace-scope variables
  - Variables with const-qualified type having no mutable member
  - Static variables declared in a scope inside the construct
[IOMPP]
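
A minimal sketch of these defaults (not from the slides; the variable names and thread count are made up), showing a file-scope variable that is shared implicitly and a local variable shared via the shared clause:

  #include <stdio.h>
  #include <omp.h>

  int counter = 0;                // file-scope variable: shared by default

  int main() {
      int n = 0;                  // explicitly shared below via the shared clause
      #pragma omp parallel num_threads(4) shared(n)
      {
          #pragma omp atomic      // shared data still needs protection (see synchronization slides)
          counter++;
          #pragma omp atomic
          n++;
      }
      printf("counter=%d n=%d\n", counter, n);   // both end up at 4
      return 0;
  }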

Data Scoping: What's Private

Not everything is shared. Examples of implicitly determined private variables:
- Stack (local) variables in functions called from parallel regions
- Automatic variables within a statement block
- Loop iteration variables
- Implicitly determined private variables within tasks are treated as firstprivate

firstprivate: specifies that each thread should have its own instance of a variable, and that the variable should be initialized with the value it has before the parallel construct.
[IOMPP]
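
A minimal sketch of the firstprivate initialization behavior (not from the slides; the variable name and values are made up):

  #include <stdio.h>
  #include <omp.h>

  int main() {
      int offset = 100;   // value set before the parallel construct

      #pragma omp parallel num_threads(4) firstprivate(offset)
      {
          // each thread starts from its own copy, initialized to 100
          offset += omp_get_thread_num();
          printf("thread %d sees offset=%d\n", omp_get_thread_num(), offset);
      }

      printf("after the region, offset=%d\n", offset);  // still 100: the private copies are discarded
      return 0;
  }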

Data Scoping: The Basic Rule

When in doubt, explicitly indicate who's what. Data scoping is one of the most common sources of errors in OpenMP.
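
One way to enforce this rule (not covered on the slides) is the default(none) clause, which makes the compiler reject any variable whose scoping is not stated explicitly. A minimal sketch:

  #include <stdio.h>
  #include <omp.h>

  int main() {
      int n = 8, sum = 0;

      // default(none): every variable used inside must appear in a clause,
      // otherwise the compiler reports an error instead of silently sharing it
      #pragma omp parallel for default(none) shared(n, sum)
      for (int i = 0; i < n; i++) {     // the loop variable i is predetermined private
          #pragma omp atomic
          sum += i;
      }

      printf("sum=%d\n", sum);   // 28
      return 0;
  }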

  #pragma omp parallel shared(a,b,c,d,nthreads) private(i,tid)
  {
    tid = omp_get_thread_num();
    if (tid == 0) {
      nthreads = omp_get_num_threads();
      printf("number of threads = %d\n", nthreads);
    }
    printf("thread %d starting...\n", tid);

    #pragma omp sections nowait
    {
      #pragma omp section
      {
        printf("thread %d doing section 1\n", tid);
        for (i = 0; i < N; i++) {
          c[i] = a[i] + b[i];
          printf("thread %d: c[%d]= %f\n", tid, i, c[i]);
        }
      }
      #pragma omp section
      {
        printf("thread %d doing section 2\n", tid);
        for (i = 0; i < N; i++) {
          d[i] = a[i] * b[i];
          printf("thread %d: d[%d]= %f\n", tid, i, d[i]);
        }
      }
    } /* end of sections */

    printf("thread %d done.\n", tid);
  } /* end of parallel section */

When in doubt, explicitly indicate who's what.

A Data Environment Example

  float A[10];
  main() {
    int index[10];
    #pragma omp parallel
      Work(index);
    printf("%d\n", index[1]);
  }

  // Assumed to be in another translation unit:
  extern float A[10];
  void Work(int *index) {
    float temp[10];
    static int count;
    <...>
  }

Which variables are shared and which are private? A, index, and count are shared by all threads, but temp is local to each thread.
Includes material from IOMPP

Data Scoping Issue: fib Example

Assume that the parallel region exists outside of fib, and that fib and the tasks inside it are in the dynamic extent of a parallel region.

  int fib(int n) {
    int x, y;
    if (n < 2) return n;

    #pragma omp task
    x = fib(n - 1);
    #pragma omp task
    y = fib(n - 2);

    #pragma omp taskwait
    return x + y;
  }

n is private in both tasks; x is a private variable; y is a private variable. This is very important here.

What's wrong here? The values of the private variables are not available outside of the tasks.
Credit: IOMPP

Data Scoping Issue: fib Example (fixed)

  int fib(int n) {
    int x, y;
    if (n < 2) return n;

    #pragma omp task shared(x)
    x = fib(n - 1);
    #pragma omp task shared(y)
    y = fib(n - 2);

    #pragma omp taskwait
    return x + y;
  }

n is private in both tasks; x & y are now shared because we need both values to compute the sum. The values of the x & y variables will be available outside each task construct after the taskwait.
Credit: IOMPP

Work Plan

- What is OpenMP?
- Parallel regions
- Work sharing
- Data environment
- Synchronization
- Advanced topics
Credit: IOMPP

Implicit Barriers

Several OpenMP constructs have implicit barriers:
- parallel (a necessary barrier that cannot be removed)
- for
- single

Unnecessary barriers hurt performance and can be removed with the nowait clause. The nowait clause is applicable to:
- the for construct
- the single construct
Credit: IOMPP

Nowait Clause

  #pragma omp for nowait
  for (...) { ...; }

  #pragma omp single nowait
  { [...] }

Use when threads would otherwise wait unnecessarily between independent computations:

  #pragma omp for schedule(dynamic,1) nowait
  for (int i = 0; i < n; i++)
    a[i] = bigFunc1(i);

  #pragma omp for schedule(dynamic,1)
  for (int j = 0; j < m; j++)
    b[j] = bigFunc2(j);
Credit: IOMPP

Barrier Construct

Explicit barrier synchronization: each thread waits until all threads arrive.

  #pragma omp parallel shared(A, B, C)
  {
    DoSomeWork(A, B);  // Processed A into B
    #pragma omp barrier
    DoSomeWork(B, C);  // Processed B into C
  }
Credit: IOMPP

Atomic Construct

- Applies only to a simple update of a memory location
- Special case of a critical section, to be discussed shortly
- atomic introduces less overhead than critical

  index[0] = 2; index[1] = 3; index[2] = 4; index[3] = 0;
  index[4] = 5; index[5] = 5; index[6] = 5; index[7] = 1;

  #pragma omp parallel for shared(x, y, index)
  for (i = 0; i < n; i++) {
    #pragma omp atomic
    x[index[i]] += work1(i);   // different iterations can hit the same index[i], hence the atomic
    y[i] += work2(i);          // y[i] is touched by only one iteration, no protection needed
  }
Credit: IOMPP
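
For comparison, here is the same protection expressed with a critical section; this is a sketch, not from the slides (N and work1 are stand-ins for the slide's loop bound and work function). atomic lets the compiler map the update to a hardware atomic operation, while critical serializes through a runtime lock:

  #include <omp.h>

  #define N 8

  float work1(int i) { return 1.0f * i; }

  void update(float *x, const int *index) {
      int i;
      #pragma omp parallel for shared(x, index)
      for (i = 0; i < N; i++) {
          #pragma omp critical
          {
              x[index[i]] += work1(i);   // correct, but heavier-weight than the atomic version
          }
      }
  }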

Example: Dot Product

  float dot_prod(float* a, float* b, int N)
  {
    float sum = 0.0;
    #pragma omp parallel for shared(sum)
    for (int i = 0; i < N; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

What is wrong here?
Credit: IOMPP

Race Condition

A race condition is nondeterministic behavior produced when two or more threads access a shared variable at the same time. For example, suppose that area is shared and both Thread A and Thread B are executing the statement

  area += 4.0 / (1.0 + x*x);
Credit: IOMPP

Two Possible Scenarios

Scenario 1: the two updates do not overlap
  Value of area   Thread A   Thread B
  11.667
                  +3.765
  15.432
                             +3.563
  18.995

Scenario 2: both threads read area before either writes it back (data race; Thread A's update is lost)
  Value of area   Thread A   Thread B
  11.667          reads      reads
                  +3.765     +3.563
  15.432          writes
  15.230                     writes

The order of thread execution causes nondeterministic behavior in a data race.
Credit: IOMPP

Protect Shared Data

The critical construct protects access to shared, modifiable data. A critical section allows only one thread to enter it at a given time.

  float dot_prod(float* a, float* b, int N)
  {
    float sum = 0.0;
    #pragma omp parallel for shared(sum)
    for (int i = 0; i < N; i++) {
      #pragma omp critical
      sum += a[i] * b[i];
    }
    return sum;
  }
Credit: IOMPP
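
The critical version above is correct but serializes every update. For this particular accumulation pattern, OpenMP's reduction clause is the more idiomatic and typically much faster alternative; a minimal sketch, not from the slides:

  #include <omp.h>

  float dot_prod_reduction(const float* a, const float* b, int N)
  {
      float sum = 0.0f;
      // each thread accumulates into its own private copy of sum;
      // the copies are combined once at the end of the loop
      #pragma omp parallel for reduction(+:sum)
      for (int i = 0; i < N; i++)
          sum += a[i] * b[i];
      return sum;
  }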