
CSL 860: Modern Parallel Computation

Hello OpenMP

    #pragma omp parallel
    {
        // I am now thread i of n
        switch (omp_get_thread_num()) {
        case 0: /* blah1 */ break;
        case 1: /* blah2 */ break;
        }
    }
    // Back to normal

Parallel Construct
Extremely simple to use and incredibly powerful
Fork-join model
Every thread has its own execution context
Variables can be declared shared or private
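A complete, compilable version of the sketch above, as a hedged illustration (the printf messages stand in for blah1/blah2; compile with, e.g., gcc -fopenmp hello.c):

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        #pragma omp parallel              /* fork: a team of threads is created */
        {
            switch (omp_get_thread_num()) {
            case 0:
                printf("I am the master of %d threads\n", omp_get_num_threads());
                break;
            default:
                printf("I am thread %d\n", omp_get_thread_num());
            }
        }                                 /* join: implicit barrier */
        return 0;                         /* back to normal: only the master continues */
    }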

Execution Model
The encountering thread creates a team: itself (the master) plus zero or more additional threads
Applies to the structured block immediately following
Each thread executes a copy of the code in { } (but also see the work-sharing constructs)
There is an implicit barrier at the end of the block
Only the master continues beyond the barrier
Parallel regions may be nested, but nesting is sometimes disabled by default
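A minimal sketch of nested regions, assuming nesting must be enabled explicitly (the thread counts and printf are illustrative):

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        omp_set_nested(1);                       /* nesting is often off by default */
        #pragma omp parallel num_threads(2)
        {
            int outer = omp_get_thread_num();
            #pragma omp parallel num_threads(2)  /* each thread becomes the master of a new team */
            printf("outer thread %d, inner thread %d\n", outer, omp_get_thread_num());
        }
        return 0;
    }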

Memory Model
Notion of a temporary view of memory
Allows local caching
Memory must be flushed to propagate values: T1 writes -> T1 flushes -> T2 flushes -> T2 reads
The same flush order is seen by all threads
Supports threadprivate memory
Variables declared before the parallel construct:
  Shared by default
  May be designated as private: n-1 copies of the original variable are created
  These copies may not be initialized by the system
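A minimal threadprivate sketch; the variable name counter is illustrative:

    #include <stdio.h>
    #include <omp.h>

    int counter = 0;                    /* file-scope variable ...       */
    #pragma omp threadprivate(counter)  /* ... given one copy per thread */

    int main(void) {
        #pragma omp parallel
        {
            counter = omp_get_thread_num();   /* each thread writes its own copy */
            printf("thread %d sees counter = %d\n", omp_get_thread_num(), counter);
        }
        return 0;
    }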

Shared Variables
Shared by default:
  Heap-allocated storage
  Static data members
  const-qualified variables (with no mutable members)
Private:
  Variables declared in a scope inside the construct
  The loop variable in a for construct, private to that construct
Everything else is shared unless declared private; you can change the default
Arguments passed by reference inherit the attribute of the original
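These default attributes can be seen together in one fragment; a sketch with placeholder names (fill, n, a, scale):

    #include <stdio.h>
    #include <omp.h>

    void fill(int n, double *a) {
        double scale = 2.0;                  /* declared before the construct: shared by default */
        #pragma omp parallel
        {
            int id = omp_get_thread_num();   /* declared inside the construct: private */
            printf("thread %d starting\n", id);
            #pragma omp for
            for (int i = 0; i < n; i++)      /* i: the for-construct loop variable, private */
                a[i] = scale * i;            /* a points to shared (heap/caller) storage */
        }
    }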

Beware of Compiler Re-ordering

    // a = b = 0 initially
    // thread 1                // thread 2
    b = 1;                     a = 1;
    flush(b);                  flush(a);
    flush(a);                  flush(b);
    if (a == 0) {              if (b == 0) {
        // critical section        // critical section
    }                          }

Beware More of Compiler Re-ordering

    // inside a parallel construct
    {
        int b = initialsalary;
        printf("Initial Salary was %d\n", initialsalary);
        Bookkeeping();     // no read of b, no write of initialsalary
        if (b < 10000) {
            raisesalary(500);
        }
    }

Thread Control

    Environment variable   Modify from program    Retrieve from program   Initial value
    OMP_NUM_THREADS *      omp_set_num_threads    omp_get_max_threads     implementation defined
    OMP_DYNAMIC            omp_set_dynamic        omp_get_dynamic         implementation defined
    OMP_NESTED             omp_set_nested         omp_get_nested          false
    OMP_SCHEDULE *         (none)                 (none)                  implementation defined

    * Also see the corresponding construct clauses: num_threads, schedule
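A sketch combining these knobs; the clause takes precedence over the library call, which takes precedence over the environment variable:

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        /* OMP_NUM_THREADS=8 ./a.out would set the initial value from the environment */
        omp_set_num_threads(4);                   /* override from within the program */
        printf("max threads: %d\n", omp_get_max_threads());        /* prints 4 */

        #pragma omp parallel num_threads(2)       /* clause overrides both, for this region */
        {
            if (omp_get_thread_num() == 0)
                printf("team size: %d\n", omp_get_num_threads());  /* prints 2 */
        }
        return 0;
    }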

Parallel Construct

    #pragma omp parallel \
        if (boolean) \
        private(var1, var2, var3) \
        firstprivate(var1, var2, var3) \
        default(shared | none) \
        shared(var1, var2) \
        copyin(var2) \
        reduction(operator : list) \
        num_threads(n)
    {
        ...
    }

Parallel Loop

    #pragma omp parallel for
    for (i = 0; i < N; ++i) {
        // blah
    }

The number of iterations must be known when the construct is encountered, and must be the same for each thread
The compiler puts a barrier at the end of the parallel for (but see nowait, sketched below)
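Where that barrier is unnecessary, nowait removes it; a sketch with placeholder functions f0 and f1 (note that nowait goes on an omp for inside a parallel region; a combined parallel for keeps its region-ending barrier):

    #pragma omp parallel
    {
        #pragma omp for nowait              /* threads fall through without waiting */
        for (int i = 0; i < n; i++)
            a[i] = f0(i);

        #pragma omp for                     /* safe only because this loop never reads a[] */
        for (int i = 0; i < n; i++)
            b[i] = f1(i);
    }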

Parallel For

    #pragma omp for \
        private(var1, var2, var3) \
        firstprivate(var1, var2, var3) \
        lastprivate(var1, var2) \
        reduction(operator : list) \
        ordered \
        schedule(kind[, chunk_size]) \
        nowait

Requires a canonical for loop: no break out of the loop

schedule(kind[, chunk_size])
Divides iterations into contiguous sets called chunks; chunks are assigned transparently to threads
static: chunks are assigned to threads in a round-robin fashion; when no chunk_size is specified, approximately equal chunks are made
dynamic: chunks are assigned to threads in request order; when no chunk_size is specified, it defaults to 1
guided: like dynamic, but the size of each chunk is proportional to the number of unassigned iterations divided by the number of threads; with chunk_size = k, every chunk has at least k iterations (except possibly the last); when no chunk_size is specified, it defaults to 1
runtime: kind and chunk size are taken from the OMP_SCHEDULE environment variable (see the sketch below)
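A sketch contrasting the kinds on a loop with uneven per-iteration cost; work() is a placeholder:

    #pragma omp parallel for schedule(static)      /* contiguous chunks, assigned round-robin */
    for (int i = 0; i < n; i++) work(i);

    #pragma omp parallel for schedule(dynamic, 4)  /* chunks of 4, handed out on request */
    for (int i = 0; i < n; i++) work(i);

    #pragma omp parallel for schedule(guided)      /* shrinking chunks, at least 1 iteration each */
    for (int i = 0; i < n; i++) work(i);

    /* schedule(runtime) defers the choice, e.g.: OMP_SCHEDULE="dynamic,4" ./a.out */
    #pragma omp parallel for schedule(runtime)
    for (int i = 0; i < n; i++) work(i);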

Single

    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < n; i++)
            a[i] = f0(i);

        #pragma omp single
        x = f1(a);

        #pragma omp for
        for (int i = 0; i < n; i++)
            b[i] = x * f2(i);
    }

Only one of the threads executes the single block
The other threads wait for it, unless nowait is specified
Hidden complexity: threads may be at different instructions

Sections

    #pragma omp sections
    {
        #pragma omp section
        {
            // do this
        }
        #pragma omp section
        {
            // do that
        }
    }

The omp section directives must be closely nested in a sections construct, where no other work-sharing construct may appear.

Private Variables

    int size, extra;
    #pragma omp parallel for private(size, extra)
    for (int i = 0; i < numthreads; i++) {
        size = numtasks / numthreads;
        extra = numtasks - numthreads * size;
        if (i < extra) size++;
        dotask(i, size, numthreads);
    }

    void dotask(int start, int count, int stride) {
        // Each thread's instance has its own activation record
        for (int i = 0, t = start; i < count; i++, t += stride)
            doit(t);
    }

Firstprivate and Lastprivate
The initial value of a private variable is unspecified
firstprivate initializes the copies from the original:
  Once per thread (not once per iteration)
  The original must exist before the construct
Only the original copy is retained after the construct
lastprivate forces sequential-like behavior:
  The thread executing the sequentially last iteration (or the last listed section) writes its value back to the original copy

Firstprivate and Lastprivate

    #pragma omp parallel for firstprivate(simple)
    for (int i = 0; i < n; i++) {
        simple += a[f1(i, omp_get_thread_num())];
        f2(simple);
    }

    #pragma omp parallel for lastprivate(doneearly)
    for (i = 0; i < n && !doneearly; i++) {
        doneearly = f0(i);
    }

Other Synchronization Directives

    #pragma omp master
    { ... }

Binds to the innermost enclosing parallel region
Only the master executes the block
No implied barrier

Master Directive

    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < 100; i++)
            a[i] = f0(i);

        #pragma omp master
        x = f1(a);
    }

Only the master executes the block; there is no synchronization.

Critical Section

    #pragma omp critical (accessbankbalance)
    { ... }

A single thread at a time
Applies to all threads
The name is optional; no name implies a global critical region
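A sketch of a named critical region guarding a shared balance; compute_deposit() is a hypothetical per-thread computation:

    double balance = 0.0;
    #pragma omp parallel
    {
        double amount = compute_deposit();       /* private: declared inside the construct */
        #pragma omp critical (accessbankbalance)
        balance += amount;                       /* one thread at a time, program-wide for this name */
    }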

Barrier Directive

    #pragma omp barrier

Stand-alone directive
Binds to the innermost enclosing parallel region
All threads in the team must execute it; they all wait for each other at this instruction
Dangerous:

    if (!ready)
        #pragma omp barrier

The entire team must encounter the same sequence of work-sharing and barrier regions

Ordered Directive

    #pragma omp ordered
    { ... }

Binds to the innermost enclosing loop region
The structured block is executed in sequential iteration order
The loop must declare the ordered clause
Each iteration may encounter at most one ordered region

Flush Directive

    #pragma omp flush (var1, var2)

Stand-alone, like barrier
Only directly affects the encountering thread
Listing the variables ensures that any compiler re-ordering moves all of their flushes together

Atomic Directive

    #pragma omp atomic
    i++;

A lightweight critical section
Only for certain expressions:
  x binop= expr (no mutual exclusion on the evaluation of expr)
  x++, ++x, x--, --x
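A sketch maintaining a shared counter with atomic; test() is a placeholder predicate:

    int hits = 0;
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        if (test(i)) {
            #pragma omp atomic
            hits++;              /* atomic x++ update, cheaper than a critical section */
        }
    }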

Reductions
Reductions are so common that OpenMP provides direct support for them
Add a reduction clause to the parallel for pragma, specifying the reduction operation and the reduction variable
OpenMP takes care of storing partial results in private variables and combining the partial results after the loop

reduction Clause

    reduction(<op> : <variable>)

    +    sum
    *    product
    &    bitwise and
    |    bitwise or
    ^    bitwise exclusive or
    &&   logical and
    ||   logical or

Add it to a parallel for; OpenMP creates a loop to combine the per-thread copies of the variable (see the sketch below)
The resulting combining loop may not itself be parallel
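A dot-product sketch using the clause; a, b, and n are assumed to be in scope:

    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];      /* each thread accumulates a private copy;
                                    OpenMP combines the copies after the loop */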

Nesting Restrictions
A work-sharing region may not be closely nested inside a work-sharing, critical, ordered, or master region.
A barrier region may not be closely nested inside a work-sharing, critical, ordered, or master region.
A master region may not be closely nested inside a work-sharing region.
An ordered region may not be closely nested inside a critical region.
An ordered region must be closely nested inside a loop region (or parallel loop region) with an ordered clause.
A critical region may not be nested (closely or otherwise) inside a critical region with the same name. Note that this restriction is not sufficient to prevent deadlock.

EXAMPLES

OpenMP Matrix Multiply

    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            c[i][j] = 0.0;
            for (int k = 0; k < n; k++)
                c[i][j] += a[i][k] * b[k][j];
        }

a, b, c are shared; i, j, k are private

OpenMP Matrix Multiply: Triangular

    #pragma omp parallel for schedule(dynamic, 1)
    for (int i = 0; i < n; i++)
        for (int j = i; j < n; j++) {
            c[i][j] = 0.0;
            for (int k = i; k < n; k++)
                c[i][j] += a[i][k] * b[k][j];
        }

This multiplies an upper-triangular matrix A with B
The workload is unbalanced across iterations; the schedule clause improves this

OpenMP Jacobi

    for (/* some number of timesteps/iterations */) {
        #pragma omp parallel for
        for (int i = 1; i < n-1; i++)         // interior points only
            for (int j = 1; j < n-1; j++)
                temp[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j]
                                   + grid[i][j-1] + grid[i][j+1]);

        #pragma omp parallel for
        for (int i = 1; i < n-1; i++)
            for (int j = 1; j < n-1; j++)
                grid[i][j] = temp[i][j];
    }

This could be improved by using just one parallel region (see the sketch below)
The implicit barrier after each loop eliminates the race on grid
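A sketch of the single-parallel-region variant suggested above; timesteps is a placeholder, the team is created once, and the implicit barrier at the end of each omp for provides the synchronization between the two phases:

    #pragma omp parallel
    for (int t = 0; t < timesteps; t++) {     /* every thread runs the timestep loop */
        #pragma omp for
        for (int i = 1; i < n-1; i++)
            for (int j = 1; j < n-1; j++)
                temp[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j]
                                   + grid[i][j-1] + grid[i][j+1]);
        /* implicit barrier: temp is complete before any thread reads it below */
        #pragma omp for
        for (int i = 1; i < n-1; i++)
            for (int j = 1; j < n-1; j++)
                grid[i][j] = temp[i][j];
        /* implicit barrier: grid is complete before the next timestep */
    }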

OpenMP Jacobi

    for (/* some number of timesteps/iterations */) {
        #pragma omp parallel for
        for (int i = 1; i < n-1; i++)
            for (int j = 1; j < n-1; j++) {
                temp[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j]
                                   + grid[i][j-1] + grid[i][j+1]);
                #pragma omp barrier
                grid[i][j] = temp[i][j];
            }
    }

Is the barrier sufficient? What change to the code is needed?
Recall that a barrier is per-team