ECE/ME/EMA/CS 759 High Performance Computing for Engineering Applications


ECE/ME/EMA/CS 759 High Performance Computing for Engineering Applications
Variable Sharing in OpenMP / OpenMP synchronization issues / OpenMP performance issues
November 4, 2015 - Lecture 22
Dan Negrut, 2015, ECE/ME/EMA/CS 759, UW-Madison

Quote of the Day: "You have power over your mind - not outside events. Realize this, and you will find strength." -- Marcus Aurelius, Roman Emperor (121 AD - 180 AD)

Before We Get Started
Issues covered last time: work sharing in OpenMP (parallel for, parallel sections, parallel tasks); data sharing in OpenMP.
Today's topics: data sharing in OpenMP, wrap up; OpenMP caveats.
Other issues: HW07 due today at 11:59 PM; HW08 assigned today, posted on the class website; final project proposal, 2 pages, due on 11/13 at 11:59 PM (Learn@UW dropbox).

Functional Level Parallelism Using omp sections

#pragma omp parallel sections
{
    #pragma omp section
    double a = alice();
    #pragma omp section
    double b = bob();
    #pragma omp section
    double k = kate();
}
double s = boss(a, b);
printf("%6.2f\n", bigboss(s, k));

[Diagram: alice, bob, and kate run as independent sections; boss combines a and b; bigboss combines s and k.]
[IOMPP]

Left: not good - a, b, and k are declared inside the sections, so their values are garbage outside the scope of each section. Right: good - the variables are declared before the parallel region (and are therefore shared), so boss and bigboss see the values produced by the sections.

Left (not good):
#pragma omp parallel sections
{
    #pragma omp section
    double a = alice();
    #pragma omp section
    double b = bob();
    #pragma omp section
    double k = kate();
}
double s = boss(a, b);
printf("%6.2f\n", bigboss(s, k));

Right (good):
double a, b, k;
#pragma omp parallel sections
{
    #pragma omp section
    a = alice();
    #pragma omp section
    b = bob();
    #pragma omp section
    k = kate();
}
double s = boss(a, b);
printf("%6.2f\n", bigboss(s, k));

[IOMPP]
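
For reference, here is a minimal self-contained sketch of the "good" pattern on the right; the bodies of alice, bob, kate, boss, and bigboss below are made-up stand-ins (the deck does not define them), chosen only so the example compiles and runs:

#include <stdio.h>
#include <omp.h>

/* Placeholder work functions - illustrative only */
double alice(void) { return 1.0; }
double bob(void)   { return 2.0; }
double kate(void)  { return 3.0; }
double boss(double a, double b)    { return a + b; }
double bigboss(double s, double k) { return s * k; }

int main(void)
{
    double a, b, k;   /* declared outside the region, hence shared */
    #pragma omp parallel sections
    {
        #pragma omp section
        a = alice();
        #pragma omp section
        b = bob();
        #pragma omp section
        k = kate();
    }   /* implicit barrier: a, b, k hold their values here */
    double s = boss(a, b);
    printf("%6.2f\n", bigboss(s, k));
    return 0;
}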

#include <omp.h>
#include <list>
#include <iostream>
#include <math.h>

using namespace std;
typedef list<double> LISTDBL;

void doSomething(LISTDBL::iterator& itrtr)
{
    *itrtr *= 2.;
}

int main()
{
    LISTDBL test;          // default constructor
    LISTDBL::iterator it;

    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 8; ++j)
            test.insert(test.end(), pow(10.0, i + 1) + j);

    for (it = test.begin(); it != test.end(); it++) cout << *it << endl;

    it = test.begin();
    #pragma omp parallel num_threads(8)
    {
        #pragma omp single
        {
            while (it != test.end()) {
                #pragma omp task firstprivate(it)
                doSomething(it);
                it++;
            }
        }
    }

    for (it = test.begin(); it != test.end(); it++) cout << *it << endl;
    return 0;
}

Data Scoping, Words of Wisdom
When in doubt, explicitly indicate who's what. Data scoping is a common source of errors in OpenMP. It takes some practice before you understand the default behavior; scoping is not always intuitive.
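
One concrete way to follow this advice (a small sketch, not from the slides): add default(none) to the parallel directive so that every variable used inside the region must be scoped explicitly, turning a forgotten scoping decision into a compile-time error.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int n = 100, scale = 3, sum = 0;
    /* default(none): the compiler rejects any variable that is not
       explicitly listed as shared, private, reduction, etc. */
    #pragma omp parallel for default(none) shared(n, scale) reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += scale * i;
    printf("sum = %d\n", sum);   /* 3 * (0 + 1 + ... + 99) = 14850 */
    return 0;
}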

When in doubt, explicitly indicate who's what

#pragma omp parallel shared(a,b,c,d,nthreads) private(i,tid)
{
    tid = omp_get_thread_num();
    if (tid == 0)
    {
        nthreads = omp_get_num_threads();
        printf("Number of threads = %d\n", nthreads);
    }
    printf("Thread %d starting...\n", tid);

    #pragma omp sections nowait
    {
        #pragma omp section
        {
            printf("Thread %d doing section 1\n", tid);
            for (i = 0; i < N; i++)
            {
                c[i] = a[i] + b[i];
                printf("Thread %d: c[%d]= %f\n", tid, i, c[i]);
            }
        }
        #pragma omp section
        {
            printf("Thread %d doing section 2\n", tid);
            for (i = 0; i < N; i++)
            {
                d[i] = a[i] * b[i];
                printf("Thread %d: d[%d]= %f\n", tid, i, d[i]);
            }
        }
    } /* end of sections */
    printf("Thread %d done.\n", tid);
} /* end of parallel section */

Q: Do you see any problem with this piece of code?

Example: Shared & Private Vars.

float A[10];
main()
{
    int index[10];
    #pragma omp parallel
    Work(index);
    printf("%d\n", index[1]);
}

// Assumed to be in another translation unit:
extern float A[10];
void Work(int *index)
{
    float temp[10];
    static int count;
    <...>
}

Which variables are shared and which are private? A, index, and count are shared by all threads, but temp is local (private) to each thread.
[IOMPP]

More on Variable Scoping: the Fibonacci Sequence (0, 1, 1, 2, ...)

Example: Data Scoping Issue - fib

#include <stdio.h>
#include <omp.h>

int fib(int);

int main()
{
    int n = 10;
    omp_set_num_threads(4);
    #pragma omp parallel
    {
        #pragma omp single
        printf("fib(%d) = %d\n", n, fib(n));
    }
}

Example: Data Scoping Issue - fib

Assume that the parallel region exists outside of fib, and that fib and the tasks inside it are in the dynamic extent of that parallel region.

int fib(int n)
{
    int x, y;
    if (n < 2) return n;
    #pragma omp task
    x = fib(n - 1);
    #pragma omp task
    y = fib(n - 2);
    #pragma omp taskwait    // important - wait here for the completion of the two child tasks spawned
    return x + y;
}

n is private in both tasks; x is a private variable; y is a private variable.
What's wrong here? The values of the private variables are not available outside the tasks.
[IOMPP]

Example: Data Scoping Issue - fib

int fib(int n)
{
    int x, y;
    if (n < 2) return n;
    #pragma omp task
    x = fib(n - 1);
    #pragma omp task
    y = fib(n - 2);
    #pragma omp taskwait
    return x + y;
}

x is a private variable; y is a private variable. The values of the private variables are not available outside the task definitions.
[IOMPP]

Example: Data Scoping Issue - fib

int fib(int n)
{
    int x, y;
    if (n < 2) return n;
    #pragma omp task shared(x)
    x = fib(n - 1);
    #pragma omp task shared(y)
    y = fib(n - 2);
    #pragma omp taskwait
    return x + y;
}

n is private in both tasks; x and y are now shared - we need both values to compute the sum. The values of the x and y variables will be available outside each task construct after the taskwait.
[IOMPP]
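
Combining the corrected fib with the driver shown a few slides back gives a complete program; this is just a sketch (n = 10 and 4 threads are the values used on the earlier slide):

#include <stdio.h>
#include <omp.h>

int fib(int n)
{
    int x, y;
    if (n < 2) return n;
    #pragma omp task shared(x)
    x = fib(n - 1);
    #pragma omp task shared(y)
    y = fib(n - 2);
    #pragma omp taskwait    /* wait for the two child tasks */
    return x + y;
}

int main(void)
{
    int n = 10;
    omp_set_num_threads(4);
    #pragma omp parallel
    {
        #pragma omp single   /* one thread seeds the task tree */
        printf("fib(%d) = %d\n", n, fib(n));
    }
    return 0;
}

Expected output: fib(10) = 55.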

#include <stdio.h>
#include <omp.h>

int main(void)
{
    const int N = 3;
    int a[3] = {2, 4, 6};
    int b[3] = {1, 3, 5};
    int c[3], d[3];
    int i, tid, nthreads;

    #pragma omp parallel private(i,tid) shared(a,b)
    {
        tid = omp_get_thread_num();
        if (tid == 0)
        {
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
        printf("Thread %d starting...\n", tid);

        #pragma omp sections
        {
            #pragma omp section
            {
                printf("Thread %d doing section 1\n", tid);
                for (i = 0; i < N; i++)
                {
                    c[i] = a[i] + b[i];
                    printf("Thread %d: c[%d]= %d\n", tid, i, c[i]);
                }
            }
            #pragma omp section
            {
                printf("Thread %d doing section 2\n", tid);
                for (i = 0; i < N; i++)
                {
                    d[i] = a[i] * b[i];
                    printf("Thread %d: d[%d]= %d\n", tid, i, d[i]);
                }
            }
        } /* end of sections */
        printf("Thread %d done.\n", tid);
    } /* end of parallel section */

    for (i = 0; i < N; i++)
        printf("c[%d] = %d AND d[%d]= %d\n", i, c[i], i, d[i]);
    return 0;
}

Same program as on the previous slide, except that c and d are now also private:

    #pragma omp parallel private(i,tid,c,d) shared(a,b)

With c and d private, each thread updates its own copies inside the sections; the c and d arrays declared in main are never written, so the final printf loop after the parallel region prints garbage values.

Work Plan What is OpenMP? Parallel regions Work sharing Data environment Synchronization Advanced topics 17 [IOMPP]

Implicit Barriers
Several OpenMP constructs have implicit barriers: parallel (a necessary barrier - it cannot be removed), for, and single. Unnecessary barriers hurt performance and can be removed with the nowait clause. The nowait clause is applicable to the for and single constructs.
Credit: IOMPP

Nowait Clause

#pragma omp for nowait
for (...)
    ...;

#pragma omp single nowait
{
    [...]
}

Use when threads would otherwise wait unnecessarily between independent computations:

#pragma omp for schedule(dynamic,1) nowait
for (int i = 0; i < n; i++)
    a[i] = bigFunc1(i);

#pragma omp for schedule(dynamic,1)
for (int j = 0; j < m; j++)
    b[j] = bigFunc2(j);

Credit: IOMPP
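
A self-contained version of the same idea (the bigFunc1/bigFunc2 bodies and the array sizes below are made up for illustration): the first loop writes only a[] and the second writes only b[], so a thread that finishes its share of the first loop can safely start on the second without waiting at the implied barrier.

#include <stdio.h>
#include <omp.h>

#define N 16
#define M 16

double bigFunc1(int i) { return 2.0 * i; }   /* placeholder work */
double bigFunc2(int j) { return 3.0 * j; }   /* placeholder work */

int main(void)
{
    double a[N], b[M];
    #pragma omp parallel
    {
        /* nowait removes the implicit barrier at the end of this loop */
        #pragma omp for schedule(dynamic,1) nowait
        for (int i = 0; i < N; i++)
            a[i] = bigFunc1(i);

        /* safe only because the two loops touch different arrays */
        #pragma omp for schedule(dynamic,1)
        for (int j = 0; j < M; j++)
            b[j] = bigFunc2(j);
    }
    printf("a[1] = %g, b[1] = %g\n", a[1], b[1]);
    return 0;
}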

Barrier Construct

Explicit barrier synchronization: each thread waits until all threads arrive.

#pragma omp parallel shared(A, B, C)
{
    DoSomeWork(A, B);   // input is A, output is B
    #pragma omp barrier
    DoMoreWork(B, C);   // input is B, output is C
}

Credit: IOMPP
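
A tiny runnable sketch of the barrier construct (the DoSomeWork/DoMoreWork calls are replaced by prints, purely for illustration): every "phase 1" line is guaranteed to be printed before any "phase 2" line.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel num_threads(4)
    {
        int tid = omp_get_thread_num();
        printf("thread %d: phase 1\n", tid);   /* stands in for DoSomeWork(A,B) */
        #pragma omp barrier                    /* all threads wait here */
        printf("thread %d: phase 2\n", tid);   /* stands in for DoMoreWork(B,C) */
    }
    return 0;
}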

[Preamble] Atomic Construct

index[0] = 5; index[1] = 3; index[2] = 4; index[3] = 0;
index[4] = 5; index[5] = 5; index[6] = 2; index[7] = 1;

The code runs ok sequentially (left); the OpenMP version (right) does not: because index[] contains repeated values (5 appears three times), two threads can update the same x[index[i]] element at the same time - a data race.

Sequential (ok):
for (i = 0; i < 8; i++)
{
    x[index[i]] += work1(i);
    y[i] += work2(i);
}

OpenMP (not ok):
#pragma omp parallel for
for (i = 0; i < 8; i++)
{
    x[index[i]] += work1(i);
    y[i] += work2(i);
}

Atomic Construct

Applies only to a simple update of a memory location. A special case of a critical section, discussed shortly. Atomic introduces less overhead than critical.

index[0] = 5; index[1] = 3; index[2] = 4; index[3] = 0;
index[4] = 5; index[5] = 5; index[6] = 2; index[7] = 1;

#pragma omp parallel for shared(x, y, index)
for (i = 0; i < 8; i++)
{
    #pragma omp atomic
    x[index[i]] += work1(i);
    y[i] += work2(i);
}
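
A compilable version of the fragment above (work1/work2 and the initial array contents are illustrative stand-ins): the atomic protects the x[index[i]] update, the only place where two iterations can hit the same memory location; each y[i] is touched by exactly one iteration and needs no protection.

#include <stdio.h>
#include <omp.h>

double work1(int i) { return 1.0 * i; }   /* placeholder */
double work2(int i) { return 2.0 * i; }   /* placeholder */

int main(void)
{
    int index[8] = {5, 3, 4, 0, 5, 5, 2, 1};
    double x[8] = {0}, y[8] = {0};
    int i;

    #pragma omp parallel for shared(x, y, index)
    for (i = 0; i < 8; i++)
    {
        #pragma omp atomic
        x[index[i]] += work1(i);   /* index repeats, so this update must be atomic */
        y[i] += work2(i);          /* distinct y[i] per iteration: no race */
    }

    for (i = 0; i < 8; i++)
        printf("x[%d] = %g   y[%d] = %g\n", i, x[i], i, y[i]);
    return 0;
}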

Synchronisation, Words of Wisdom
Barriers can be very expensive: typically thousands of cycles to synchronise. Avoid barriers via careful use of the NOWAIT clause, and by parallelizing at the outermost level possible (which may require re-ordering of loops and/or indexes). The choice of CRITICAL / ATOMIC / lock routines may also impact performance.
Credit: Alan Real

Example: Dot Product

float dot_prod(float* a, float* b, int N)
{
    float sum = 0.0;
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        sum += a[i] * b[i];
    return sum;
}

This is not good: sum is shared, and all threads update it at the same time.
Credit: IOMPP

Race Condition
Definition (race condition): two or more threads access a shared variable at the same time, with at least one of the accesses being a write. Leads to nondeterministic behavior. For example, suppose that area is shared and both Thread A and Thread B are executing the statement
    area += 4.0 / (1.0 + x*x);
Credit: IOMPP

Two Possible Scenarios

Scenario 1 (correct): area starts at 11.667. Thread A adds 3.765, making area 15.432; Thread B then adds 3.563, making area 18.995.

Scenario 2 (lost update): area starts at 11.667. Thread A adds 3.765 and writes 15.432, but Thread B had already read the old value 11.667; B adds 3.563 and writes 15.230, overwriting A's contribution.

The order of thread execution causes nondeterministic behavior in a data race.
Credit: IOMPP

Protect Shared Data

The critical construct protects access to shared, modifiable data. The critical section allows only one thread to enter it at a given time.

float dot_prod(float* a, float* b, int N)
{
    float sum = 0.0;
    #pragma omp parallel for shared(sum)
    for (int i = 0; i < N; i++)
    {
        #pragma omp critical
        sum += a[i] * b[i];
    }
    return sum;
}

Credit: IOMPP

OpenMP Critical Construct

#pragma omp critical [(lock_name)]

Defines a critical region on a structured block. Threads wait their turn - only one at a time calls consum(), thereby preventing race conditions. Naming the critical construct (here, RES_lock) is optional but highly recommended.

float RES;
#pragma omp parallel
{
    #pragma omp for
    for (int i = 0; i < niters; i++)
    {
        float B = big_job(i);
        #pragma omp critical (RES_lock)
        consum(B, RES);
    }
}

Includes material from IOMPP

reduction Example

#pragma omp parallel for reduction(+:sum)
for (i = 0; i < N; i++)
    sum += a[i] * b[i];

The local copy of sum for each thread engaged in the reduction is private. Each local sum is initialized to the identity operand associated with the operator that comes into play (here we have +, so it's zero). All local copies of sum are added together and stored in the global variable.
Credit: IOMPP
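
This reduction is the idiomatic fix for the dot product example from a few slides back; a minimal runnable sketch (the array contents are arbitrary):

#include <stdio.h>
#include <omp.h>

float dot_prod(float* a, float* b, int N)
{
    float sum = 0.0f;
    /* each thread accumulates into its own private copy of sum (initialized
       to 0); the copies are combined with + when the loop ends */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i] * b[i];
    return sum;
}

int main(void)
{
    float a[4] = {1, 2, 3, 4};
    float b[4] = {1, 1, 1, 1};
    printf("dot = %f\n", dot_prod(a, b, 4));   /* expect 10.000000 */
    return 0;
}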

OpenMP reduction Clause reduction (op:list) The variables in list will be shared in the enclosing parallel region Here s what happens inside the parallel or work-sharing construct: A private copy of each list variable is created and initialized depending on the op These copies are updated locally by threads At end of construct, local copies are combined through op into a single value Credit: IOMPP 30

OpenMP Reduction Example: Numerical Integration

Compute pi via the integral from 0 to 1 of f(x) = 4.0/(1 + x^2) dx, which equals pi.
[Figure: plot of f(x) = 4.0/(1 + x^2) on 0.0 <= x <= 1.0, with f ranging from 4.0 down to 2.0.]

static long num_steps = 100000;
double step, pi;

void main()
{
    int i;
    double x, sum = 0.0;

    step = 1.0/(double) num_steps;
    for (i = 0; i < num_steps; i++)
    {
        x = (i + 0.5)*step;
        sum = sum + 4.0/(1.0 + x*x);
    }
    pi = step * sum;
    printf("Pi = %f\n", pi);
}

Credit: IOMPP

OpenMP Reduction Example: Numerical Integration

#include <stdio.h>
#include <stdlib.h>
#include "omp.h"

int main(int argc, char* argv[])
{
    int num_steps = atoi(argv[1]);
    double step = 1.0/(double) num_steps;
    double sum = 0.;   // initialize: the reduction combines the private copies into this value

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < num_steps; i++)
    {
        double x = (i + .5)*step;
        sum += 4.0/(1. + x*x);
    }

    double my_pi = sum*step;
    printf("Value of integral is: %f\n", my_pi);
    return 0;
}
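
To build and run the code above (the file name pi_reduction.c is just an example; the OpenMP switch differs by compiler - with gcc it is -fopenmp):

gcc -fopenmp -O2 pi_reduction.c -o pi_reduction
./pi_reduction 1000000      # prints a value close to 3.141593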

C/C++ Reduction Operations

A range of associative operators can be used with reduction. Each is paired with the initial value given to the private copies:

Operator   Initial Value
+          0
*          1
-          0
^          0
&          ~0
|          0
&&         1
||         0

Credit: IOMPP
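
As an illustration of an operator other than +, a logical-AND reduction (initial value 1, per the table above) can check a property across all array elements; a small sketch:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int v[8] = {3, 7, 2, 9, 4, 8, 6, 1};
    int all_positive = 1;   /* && reduction: each private copy starts at 1 */

    #pragma omp parallel for reduction(&&:all_positive)
    for (int i = 0; i < 8; i++)
        all_positive = all_positive && (v[i] > 0);

    printf("all positive? %s\n", all_positive ? "yes" : "no");
    return 0;
}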

OpenMP Performance Issues 34

Performance
It is easy to write an OpenMP program, yet hard to write an efficient one. Five main causes of poor performance: sequential code, communication, load imbalance, synchronisation, and compiler (non-)optimisation.
Credit: Alan Real

Sequential Code
Corollary to Amdahl's law: a parallelized code will never run faster than the time taken by the parts that still execute sequentially. If most of the code continues to run sequentially, parallelization is not going to make a difference. Solution: go back and see whether you can approach the problem from a different perspective that exposes more parallelism. Thinking within the context of OpenMP: all code outside of parallel regions, and inside MASTER, SINGLE, and CRITICAL directives, is sequential; this code should be as small as possible.
[Alan Real]
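
A quick back-of-the-envelope version of this corollary (the numbers are chosen only for illustration): if a fraction f of the runtime stays sequential, then with p threads

    speedup = 1 / (f + (1 - f)/p)  <=  1/f

so with f = 0.2 (20% of the work sequential), the speedup can never exceed 5, no matter how many cores are thrown at the problem.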

Communication
On shared memory machines, which is where OpenMP operates, communication is disguised as increased memory access cost: it takes longer to access data in main memory or in another processor's cache than in the local cache. Memory accesses are expensive: ~100 cycles for a main memory access, compared to 1-3 cycles for a flop. Unlike message passing, communication is spread throughout the program, which makes it hard to analyse and monitor. NOTE: message passing is discussed next week; there, the programmer manages the movement of data through messages.
Credit: Alan Real

Caches and Coherency
Shared memory programming assumes that a shared variable has a unique value at a given time. Sequential computation is sped up through the use of large caches. A consequence of caching: multiple copies of a physical memory location may exist at different hardware locations. For program correctness, caches must be kept coherent. Coherency operations are usually performed on the cache lines in the level of cache closest to memory, the LLC (last level cache); on high-end systems these days the LLC is the level 3 cache, and a high-end Intel CPU can have 45 MB of L3 cache. There is much overhead that goes into cache coherence.
Credit: Alan Real

Memory Hierarchy, Multi-Core Scenario
[Figure: four cores (Core 0 - Core 3), each with its own L1 cache (L1C) and L2 cache (L2C), connected through an interconnect to shared memory.]
Credit: Alan Real

What Does MESI Mean to You? 40

MESI: Invalidation-Based Coherence Protocol
Cache lines have state bits. Data migrates between processor caches; state transitions maintain coherence. The MESI protocol has four states: M (Modified), E (Exclusive), S (Shared), I (Invalid).
[Figure: timeline with Processor A's cache and Processor B's cache. 1. A reads x: A's line goes from I to E (exclusive). 2. B reads x: both lines become S (shared). 3. B writes x: B's line goes from S to M (modified/dirty), and A's line goes from S to I (invalid) - the cache line was invalidated.]
[J. Marathe]

Coherency Simplified Further [cooked-up example]
Further simplify MESI for the sake of the simple discussion on the next slide. Assume now that each cache line can exist in one of three states:
Exclusive: the only valid copy in any cache.
Read-only: a valid copy, but other caches may contain it.
Invalid: out of date and cannot be used.
In this simplified coherency model, a READ on an invalid or absent cache line will be cached as read-only or exclusive; a WRITE on a line not in an exclusive state will cause all other copies to be marked invalid and the written line to be marked exclusive.
Credit: Alan Real

Coherency Example
[Figure: four processors with L1/L2 caches over an interconnect to memory; processors issue reads (R) and writes (W), and cache lines are shown transitioning among the Exclusive, Read-Only, and Invalid states.]
Credit: Alan Real