
Parallel Programming
Exploring local computational resources: OpenMP, parallel programming for multiprocessors, for loops

Single computers nowadays
- Several CPUs (cores): 4 to 8 cores on a single chip, plus hyper-threading
- Specialized array instructions, operating simultaneously on several data elements
- Shared memory (usually symmetric), with a large address space (and memory)
- Powerful GPUs: dozens of individual processing elements, double-layer (non-uniform) memory addressing, specialized instructions (for graphics)

Shared Memory Model
[Figure: several processors connected to a single shared memory]
Code executing on different processors can interact and synchronize with each other through shared variables.

Exploring the resources - CPUs: multiple threads
- Calling the OS directly (pthreads / thread APIs): low level
- Libraries: Threading Building Blocks (TBB) from Intel (still low level); PPL and the Agents Library from Microsoft (C++ STL-like extensions)
- Language constructs: Cilk Plus (C/C++ keyword extensions from Intel), limited availability (Intel compiler, preview in GNU); OpenMP (a language extension (#pragmas) for C/Fortran plus a library); actor-based languages (e.g. Erlang and Scala)

Exploring the resources - GPGPU
- Using the GPU for general computations: dozens of processing units with asymmetric (but specialized) memory; hybrid processing (CPU + GPU)
- Specialized languages: CUDA and OpenCL (C-like, but different languages); HLSL and DirectCompute (Microsoft / DirectX)
- Language extensions plus libraries: C++ AMP (Microsoft, proposed as a standard), a minor C++ extension (2 new keywords) plus a library

OpenMP
- An application programming interface (API) for parallel programming on multiprocessors
- Assumes shared memory
- Takes the form of compiler directives, plus a library of support functions
- Can be used with C, C++, and Fortran

Fork/Join Parallelism (1)
- The main pattern of OpenMP
- Initially only the master thread is active; the master thread executes the sequential code
- Fork: the master thread creates or awakens additional threads to execute parallel code
- Join: at the end of the parallel code the created threads die or are suspended, leaving only the master thread

Fork/Join Parallelism (2)
[Figure: timeline of the master thread forking a team of threads, joining, then forking and joining again]
- A sequential program is a special case of a shared-memory parallel program
- Incremental parallelization: the process of converting a sequential program to a parallel program a little bit at a time
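
As a minimal sketch of the fork/join pattern (my own illustration, not from the slides): the parallel directive forks a team of threads that each execute the block and then join; the team size here is whatever the runtime chooses by default.

#include <stdio.h>
#include <omp.h>

int main(void) {
    printf("Before the parallel region: only the master thread\n");
    #pragma omp parallel            /* fork: a team of threads runs the block */
    {
        printf("Hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }                               /* join: only the master thread continues */
    printf("After the parallel region: only the master thread again\n");
    return 0;
}

With GCC such a program is compiled with the -fopenmp flag (e.g. gcc -fopenmp hello.c, where hello.c is just an example file name).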

Shared memory vs. message passing
- Shared memory: starts with 1 thread, the number of threads changes dynamically, and ends with 1 thread (the master thread); we can incrementally make the program parallel and stop when further effort is not worthwhile
- Message passing: all processes are active throughout execution; the sequential-to-parallel transformation requires a major effort and is done in one giant step instead of multiple tiny steps

Parallel for loops
C programs often express data-parallel operations as for loops:

for (i = first; i < size; i += prime)
    marked[i] = 1;

- OpenMP makes it easy to indicate when the iterations of a loop may execute in parallel
- The compiler takes care of generating code that forks/joins threads and allocates the iterations to threads

Pragmas
- Pragma: a compiler directive in C or C++; stands for "pragmatic information"
- A way for the programmer to communicate with the compiler; the compiler is usually free to ignore pragmas
- Syntax: #pragma omp <rest of pragma>

Parallel for pragma
Format:

#pragma omp parallel for
for (i = 0; i < N; i++)
    a[i] = b[i] + c[i];

The compiler must be able to verify that the runtime system will have the information it needs to schedule the loop iterations, so the loop must have the canonical form

for (index = start; index <, <=, >= or > end; increment)

where the increment is one of: index++, ++index, index--, --index, index += inc, index -= inc, index = index + inc, index = inc + index, index = index - inc.
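
A self-contained sketch of the pragma applied to the vector addition above (the array size, allocation and final print are my additions, not from the slides):

#include <stdio.h>
#include <stdlib.h>

#define N 1000000

int main(void) {
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    int i;

    for (i = 0; i < N; i++) { b[i] = i; c[i] = 2.0 * i; }

    /* iterations are divided among the threads; the index i is made private automatically */
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        a[i] = b[i] + c[i];

    printf("a[N-1] = %f\n", a[N - 1]);
    free(a); free(b); free(c);
    return 0;
}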

Execution Context
- Every thread has its own execution context
- Execution context: an address space containing all of the variables a thread may access
- Contents of the execution context: static variables; dynamically allocated data structures in the heap; variables on the run-time stack; an additional run-time stack for functions invoked by the thread

Shared and Private Variables
- Shared variable: has the same address in the execution context of every thread
- Private variable: has a different address in the execution context of every thread
- A thread cannot access the private variables of another thread

Shared and Private Variables

int main (int argc, char *argv[])
{
    int b[3];
    char *cptr;
    int i;

    cptr = malloc(1);
    #pragma omp parallel for
    for (i = 0; i < 3; i++)
        b[i] = i;

[Figure: heap and stacks; b, cptr and i sit on the master thread's (thread 0) stack, while each additional thread, e.g. thread 1, gets its own private copy of the loop index i]
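
A small illustration of my own (not from the slides) that makes the distinction visible: every thread sees the same address for the shared array b, but a different address for its private copy of i.

#include <stdio.h>
#include <omp.h>

int main(void) {
    int b[3];
    int i;

    #pragma omp parallel for
    for (i = 0; i < 3; i++) {
        b[i] = i;   /* b is shared; i is private to each thread */
        printf("thread %d: &b = %p (shared), &i = %p (private)\n",
               omp_get_thread_num(), (void *)b, (void *)&i);
    }
    return 0;
}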

OpenMP functions
- int omp_get_num_procs (void): returns the number of physical processors available for use by the parallel program
- void omp_set_num_threads (int value): uses the parameter value to set the number of threads to be active in parallel sections of code; may be called at multiple points in a program
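
A small usage sketch of my own (not from the slides) combining the two routines, e.g. to request one thread per available processor:

#include <stdio.h>
#include <omp.h>

int main(void) {
    int p = omp_get_num_procs();    /* physical processors available */
    omp_set_num_threads(p);         /* request one thread per processor */

    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0)   /* only the master thread reports */
            printf("Running with %d threads on %d processors\n",
                   omp_get_num_threads(), p);
    }
    return 0;
}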

Declaring private variables
From Floyd's algorithm:

for (i = 0; i < BLOCK_SIZE(id,p,n); i++)
    for (j = 0; j < n; j++)
        a[i][j] = MIN(a[i][j], a[i][k]+tmp);

- Either loop could be executed in parallel
- We prefer to make the outer loop parallel, to reduce the number of forks/joins
- We then must give each thread its own private copy of variable j

private clause
- Clause: an optional, additional component to a pragma
- The private clause directs the compiler to make one or more variables private: private ( <variable list> )

Example:

#pragma omp parallel for private(j)
for (i = 0; i < BLOCK_SIZE(id,p,n); i++)
    for (j = 0; j < n; j++)
        a[i][j] = MIN(a[i][j], a[i][k]+tmp);

firstprivate clause
- Used to create private variables whose initial values are identical to the value held by the master thread's copy as the loop is entered
- Variables are initialized once per thread, not once per loop iteration
- If a thread modifies a variable's value in an iteration, subsequent iterations will see the modified value
- Syntax: firstprivate ( <variable list> )
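
A brief sketch of my own (variable names are mine): each thread's private copy of x0 starts with the value the master thread gave it before the loop.

#include <stdio.h>

int main(void) {
    double x0 = 0.5;          /* set by the master thread before the loop */
    double y[8];
    int i;

    /* every thread's private x0 is initialized to 0.5, once per thread */
    #pragma omp parallel for firstprivate(x0)
    for (i = 0; i < 8; i++)
        y[i] = x0 + i;

    printf("y[7] = %f\n", y[7]);
    return 0;
}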

lastprivate clause
- Sequentially last iteration: the iteration that occurs last when the loop is executed sequentially
- The lastprivate clause copies back, to the master thread's copy of a variable, the private copy held by the thread that executed the sequentially last iteration
- Syntax: lastprivate ( <variable list> )
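
A brief sketch of my own (variable names are mine): after the loop, last holds the value assigned in the sequentially last iteration (i == n-1), regardless of which thread executed it.

#include <stdio.h>

int main(void) {
    int n = 100, i;
    int last = -1;

    /* last is private inside the loop; the value from iteration n-1 is copied back */
    #pragma omp parallel for lastprivate(last)
    for (i = 0; i < n; i++)
        last = i * i;

    printf("last = %d\n", last);   /* prints 9801 = 99*99 */
    return 0;
}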

Critical Sections and Race Conditions
Consider this C program segment to compute π using the rectangle rule:

double area, pi, x;
int i, n;
...
area = 0.0;
for (i = 0; i < n; i++) {
    x = (i+0.5)/n;
    area += 4.0/(1.0 + x*x);
}
pi = area / n;

Race Condition (cont.)
If we simply parallelize the loop...

double area, pi, x;
int i, n;
...
area = 0.0;
#pragma omp parallel for private(x)
for (i = 0; i < n; i++) {
    x = (i+0.5)/n;
    area += 4.0/(1.0 + x*x);
}
pi = area / n;

Race Condition Time Line
[Figure: two threads concurrently executing area += 4.0/(1.0 + x*x)]
- Thread A and thread B both read the value of area as 11.667
- Thread A adds 3.765 and writes back 15.432
- Thread B adds 3.563 and writes back 15.230, so one of the two updates to area is lost

critical pragma
- Critical section: a portion of code that only one thread at a time may execute
- We denote a critical section by putting the pragma #pragma omp critical in front of a block of C code

Correct, but inefficient

double area, pi, x;
int i, n;
...
area = 0.0;
#pragma omp parallel for private(x)
for (i = 0; i < n; i++) {
    x = (i+0.5)/n;
    #pragma omp critical
    area += 4.0/(1.0 + x*x);
}
pi = area / n;

Source of Inefficiency
- The update to area is inside a critical section
- Only one thread at a time may execute the statement, i.e. it is sequential code
- The time to execute this statement is a significant part of the loop
- By Amdahl's Law we know the speedup will be severely constrained

Reductions
- Reductions are so common that OpenMP provides support for them
- We may add a reduction clause to the parallel for pragma
- We specify the reduction operation and the reduction variable
- OpenMP takes care of storing partial results in private variables and of combining the partial results after the loop

reduction clause
The reduction clause has this syntax: reduction (<op> : <variable>)

Operators:
+    sum
*    product
&    bitwise and
|    bitwise or
^    bitwise exclusive or
&&   logical and
||   logical or

Final solution: π-finding code with reduction clause

double area, pi, x;
int i, n;
...
area = 0.0;
#pragma omp parallel for private(x) reduction(+:area)
for (i = 0; i < n; i++) {
    x = (i + 0.5)/n;
    area += 4.0/(1.0 + x*x);
}
pi = area / n;

Performance Improvement Patterns
- Too many fork/joins can lower performance
- Inverting loops may help performance if: the parallelism is in the inner loop; after inversion, the outer loop can be made parallel; and the inversion does not significantly lower the cache hit rate
- Example: parallelize the loop below (a sketch of the inversion follows)

for (i=1; i<m; i++)
    for (j=0; j<n; j++)
        a[i][j] = 2 * a[i-1][j];
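
A sketch of my own of the inversion (sizes and initialization are assumptions): the i loop carries a dependence on a[i-1][j], so it cannot be parallelized directly, but the columns j are independent, so after swapping the loops the pragma goes on the now-outer j loop and there is a single fork/join for the whole nest.

#include <stdio.h>

#define M 1000
#define N 1000

double a[M][N];

int main(void) {
    int i, j;

    for (j = 0; j < N; j++)    /* initialize the first row */
        a[0][j] = 1.0;

    /* inverted nest: each column j is computed independently;
       i must be declared private since it is shared by default */
    #pragma omp parallel for private(i)
    for (j = 0; j < N; j++)
        for (i = 1; i < M; i++)
            a[i][j] = 2 * a[i-1][j];

    printf("a[M-1][0] = %g\n", a[M-1][0]);
    return 0;
}

Whether the inversion pays off still depends on the cache condition above, since the inner i loop now walks through memory column-wise.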

Performance Improvement Patterns
- If a loop has too few iterations, the fork/join overhead is greater than the time saved by parallel execution
- The if clause instructs the compiler to insert code that determines at run time whether the loop should be executed in parallel: #pragma omp parallel for if(n > 5000)
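
A sketch of my own of the clause in context (the function, names and threshold are assumptions): the loop runs in parallel only when n is large enough to pay for the fork/join.

#include <stdio.h>
#include <stdlib.h>

/* Scale v by alpha; run in parallel only when the loop is long enough. */
void scale(double *v, int n, double alpha) {
    int i;
    #pragma omp parallel for if(n > 5000)
    for (i = 0; i < n; i++)
        v[i] *= alpha;
}

int main(void) {
    int n = 100000, i;
    double *v = malloc(n * sizeof(double));
    for (i = 0; i < n; i++) v[i] = i;
    scale(v, n, 0.5);
    printf("v[10] = %f\n", v[10]);
    free(v);
    return 0;
}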

Performance Improvement Patterns
- Balancing the execution time of each iteration
- Consider the following loops (the inner loop runs n-i times, so the work per outer iteration shrinks as i grows):

for (i=0; i<n; i++)
    for (j=i; j<n; j++)
        a[i][j] = alpha_omega(i, j);

- We can use the schedule clause to specify how the iterations of a loop should be allocated to threads
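
A sketch of my own (alpha_omega is stood in by a dummy function, and the size is assumed): with the default block allocation, the thread that gets the first rows does far more work, so an interleaved schedule evens the load out.

#include <stdio.h>
#include <math.h>

#define N 2000

double a[N][N];

/* stand-in for the slide's alpha_omega(i, j) */
static double alpha_omega(int i, int j) {
    return sin((double)i) * cos((double)j);
}

int main(void) {
    int i, j;

    /* row i does N-i units of work, so hand out rows round-robin */
    #pragma omp parallel for private(j) schedule(static, 1)
    for (i = 0; i < N; i++)
        for (j = i; j < N; j++)
            a[i][j] = alpha_omega(i, j);

    printf("a[0][N-1] = %f\n", a[0][N-1]);
    return 0;
}

schedule(dynamic) or schedule(guided) would also balance the work, at somewhat higher scheduling overhead; these options are described on the following slides.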

Static vs. Dynamic Scheduling
- Static schedule: all iterations are allocated to threads before any iterations execute; low overhead, but may exhibit high workload imbalance
- Dynamic schedule: only some iterations are allocated to threads at the beginning of the loop's execution; the remaining iterations are allocated to threads that complete their assigned iterations; higher overhead, but can reduce workload imbalance

Chunks
- A chunk is a contiguous range of iterations
- Increasing the chunk size reduces overhead and may increase the cache hit rate
- Decreasing the chunk size allows finer balancing of workloads

schedule clause
Syntax: schedule (<type>[, <chunk>])
- The schedule type is required; the chunk size is optional
- Allowable schedule types:
  - static: static allocation
  - dynamic: dynamic allocation
  - guided: guided self-scheduling
  - runtime: type chosen at run time based on the value of the environment variable OMP_SCHEDULE

Scheduling Options
- schedule(static): block allocation of about n/t contiguous iterations to each thread
- schedule(static, C): interleaved allocation of chunks of size C to threads
- schedule(dynamic): dynamic one-at-a-time allocation of iterations to threads
- schedule(dynamic, C): dynamic allocation of C iterations at a time to threads

Scheduling Options (cont.)
- schedule(guided, C): dynamic allocation of chunks to tasks using the guided self-scheduling heuristic; initial chunks are bigger, later chunks are smaller, and the minimum chunk size is C
- schedule(guided): guided self-scheduling with a minimum chunk size of 1
- schedule(runtime): schedule chosen at run time based on the value of OMP_SCHEDULE; Unix (csh) example: setenv OMP_SCHEDULE static,1
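
A sketch of my own of the runtime option (the loop and names are assumptions): the same binary can be rebalanced by changing OMP_SCHEDULE, without recompiling.

#include <stdio.h>
#include <math.h>

int main(void) {
    int i, n = 100000;
    double sum = 0.0;

    /* the schedule is read at run time from the OMP_SCHEDULE environment variable */
    #pragma omp parallel for reduction(+:sum) schedule(runtime)
    for (i = 0; i < n; i++)
        sum += sqrt((double)i);

    printf("sum = %f\n", sum);
    return 0;
}

For example, with bash: OMP_SCHEDULE="dynamic,100" ./a.out, or the csh setenv form shown above.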