Parallel Programming


Parallel Programming: OpenMP
Nils Moschüring, PhD Student (LMU)

1 Overview: What is parallel software development; Why do we need parallel computation?; Problems which benefit from parallelization
2 OpenMP - Basics: Basic properties; Programming Model; Basic Syntax
3 OpenMP - Advanced: Clauses; Directives; Synchronization Constructs
4 Pros and Cons

Acknowledgments
This presentation has been heavily influenced by a lecture series organized by Rolf Rabenseifner from the HLRS (Höchstleistungsrechenzentrum Stuttgart). Go to http://www.hlrs.de/events for currently available courses, and to https://fs.hlrs.de/projects/par/events/2013/parallel_prog_2013/ for an overview. To get the appropriate standards, visit https://fs.hlrs.de/projects/par/par_prog_ws/standards/readme.html. These are highly recommended!

What is parallel software development?
Taking advantage of one or more of the following concepts:
- Pipelining, vector computing
- Functional parallelism
- Multi-core (MIMD)
- Hyper-Threading
- ccNUMA (cache coherent Non-Uniform Memory Access)
- Array processing (SIMD, MMX, SSE2)

Pipelining
[Diagram: two consecutive instructions moving through the pipeline stages A-D over time]
- A: IF - instruction fetch
- B: ID - instruction decoding
- C: EX - execution
- D: WB - write back

Pipelining (continued)
[Diagram: three consecutive instructions overlapping in the pipeline stages A-D over time]
Problems:
- an instruction depends on the outcome of a previous instruction (branch prediction, pipeline flushing)
- resource conflicts
- data conflicts

Why do we need parallel computation?
- Moore's Law: the number of transistors still increases, but clock frequency does not
- Increased memory demands
- One core is too slow

Problems which benefit from parallelization
- Matrix-vector multiplication
- Solving systems of linear equations
- Grid-based algorithms
- and many more!

Basic properties
- Allows incremental parallelization
- Uses mainly preprocessor directives
- Easiest approach to multi-threaded programming (shared memory systems only)

Basic properties
Focus on parallelizable loops.

Serial program:

    int main(int argc, char **argv) {
        double res[1000];
        for (int i = 0; i < 1000; i++) {
            compl_calc(&res[i]);
        }
    }

Parallel program:

    int main(int argc, char **argv) {
        double res[1000];
        #pragma omp parallel for
        for (int i = 0; i < 1000; i++) {
            compl_calc(&res[i]);
        }
    }

Basic properties
Compile with:

    gcc -fopenmp test.c

To set the maximum number of threads, set the environment variable OMP_NUM_THREADS to the desired value, e.g. (bash):

    export OMP_NUM_THREADS=16

And that's it!

Programming Model
- Only for shared memory systems (no multiple processes)
- Workload is distributed among the available threads
- A variable can be shared among all threads or duplicated for each thread
- Threads communicate by sharing variables
- High risk of race conditions (the default behavior is shared for all variables!)
- Synchronization constructs are available to control this
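Because everything is shared by default, a race condition is easy to produce. A minimal sketch (the counter variable is illustrative, not from the slides):

    #include <stdio.h>

    int main(void) {
        int counter = 0;              /* shared by default */
        #pragma omp parallel
        {
            for (int i = 0; i < 100000; i++)
                counter++;            /* unsynchronized read-modify-write */
        }
        /* with N threads the expected value is N*100000, but the printed
           value is usually smaller because concurrent increments get lost */
        printf("counter = %d\n", counter);
        return 0;
    }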

Execution model
[Diagram: number of threads over time, alternating sequential and parallel phases]

Execution model
The so-called fork-join model:
- start as a process with a single thread (the master thread)
- when a parallel pragma is encountered: fork into a team of threads
- at completion of the pragma: synchronization, implicit barrier
- continue with the master thread

Parallel regions
- Basic construct
- Starts multiple threads
- Each thread executes the same code redundantly
Syntax:

    #pragma omp parallel [clause [[,] clause] ...] new-line
        structured block

Clauses can be:
- private(list)
- shared(list)
- ...
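For illustration (a sketch, not from the slides; omp_get_thread_num() is introduced on a later slide), a parallel region using both clause types:

    #include <stdio.h>
    #ifdef _OPENMP
    #include <omp.h>
    #endif

    int main(void) {
        int n = 1000;                 /* shared: all threads see the same n */
        int id;                       /* private: each thread gets its own copy */
        #pragma omp parallel shared(n) private(id)
        {
    #ifdef _OPENMP
            id = omp_get_thread_num();
    #else
            id = 0;                   /* private copy is uninitialized on entry */
    #endif
            printf("thread %d sees n = %d\n", id, n);
        }
        return 0;
    }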

Directives
- case sensitive
- change the behaviour inside parallel regions
Syntax:

    #pragma omp directive [clause [[,] clause] ...] new-line

Library functions
A small number of library functions is available to control OpenMP. Usage:

    #include <stdio.h>
    #ifdef _OPENMP
    #include <omp.h>
    #endif

    int main(int argc, char **argv) {
    #ifdef _OPENMP
        printf("Nr of procs = %d\n", omp_get_num_procs());
    #endif
    }

Library functions
More available functions:
- void omp_set_num_threads(int): sets the number of threads
- int omp_get_thread_num(void): returns the calling thread's number
- int omp_in_parallel(void): detects whether we are in a parallel region
- ...
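A short sketch combining these calls (guarded so it also compiles without OpenMP):

    #include <stdio.h>
    #ifdef _OPENMP
    #include <omp.h>
    #endif

    int main(void) {
    #ifdef _OPENMP
        omp_set_num_threads(4);                          /* request 4 threads */
        printf("serial part, in parallel? %d\n",
               omp_in_parallel());                       /* prints 0 */
        #pragma omp parallel
        printf("thread %d, in parallel? %d\n",
               omp_get_thread_num(), omp_in_parallel()); /* prints 1 inside */
    #endif
        return 0;
    }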

Data scope clauses
- private(list): declares the variables in list to be private to each thread
- shared(list): declares the variables in list to be shared among all threads
The default for all variables is shared, except:
- local variables in a parallel region are private
- the loop control variable is private
- ...
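A sketch of these defaults (variable names are illustrative, not from the slides):

    int global = 1;                   /* file scope: shared */

    int main(void) {
        int outer = 2;                /* declared before the region: shared by default */
        #pragma omp parallel
        {
            int local = 0;            /* declared inside the region: private */
            #pragma omp for
            for (int i = 0; i < 8; i++)   /* loop control variable: private */
                local += global + outer + i;
        }
        return 0;
    }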

Reduction clauses
A reduction combines the private partial results of multiple threads into one shared result; OpenMP offers clauses to accomplish this.
- firstprivate(var): initializes each private copy with the value var had in the non-parallel region
- lastprivate(var): copies the last value of var back into the variable of the non-parallel region (the last iteration for loops, the last section for sections)
- reduction(operator:list): performs a reduction on the variables in list (they must be shared in the enclosing context) with operator operator; operator can be +, *, -, &, ^, |, &&, ||, max, min. At the end of the reduction, the shared variable is updated by combining the private copies of all threads with the operator.

Reduction clauses: Example

    double result = 0.;
    #pragma omp parallel for reduction(+:result)
    for (int i = 0; i < 5; i++) {
        double val = i * i;
        result += val;
    } /*omp end parallel for*/
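The firstprivate and lastprivate clauses can be sketched similarly (the values are illustrative, not from the slides):

    int x = 10;
    int last = -1;
    #pragma omp parallel for firstprivate(x) lastprivate(last)
    for (int i = 0; i < 100; i++) {
        /* every thread starts with its own copy of x, initialized to 10 */
        last = x + i;
    }
    /* x is still 10; last holds the value of the last iteration: 10 + 99 */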

Directives
Properties:
- Divide enclosed code among threads
- Must be inside a parallel region
- No implicit synchronization on entry
- Implicit synchronization on exit (the nowait clause gets rid of this)
Available directives:
- sections: explicitly define different code for different threads
- for: distribute the iterations of the following loop onto different threads
- single: block is executed by a single thread only (reduces fork-join overhead)
- task: generates a new task for the following code, which will be distributed to one free thread

Directives - sections

    int main(int argc, char **argv) {
        #pragma omp parallel
        {
            #pragma omp sections
            {
                #pragma omp section
                { fa(); }
                #pragma omp section
                { fb(); }
            } /*omp end sections*/
        } /*omp end parallel*/
    }

Executes fa() and fb() in parallel.

Directives - for

    #pragma omp parallel private(k)
    {
        k = omp_get_thread_num();
        #pragma omp for
        for (int i = 0; i < 20; i++)
            a[i] = k * i;
    } /*omp end parallel*/

[Diagram: with two threads, thread 0 computes a[i] = k*i for i = 0..9, thread 1 for i = 10..19]

Directives - for
The loop must have canonical shape:

    for ([integer or pointer type] var = b; var < e; var = var + incr)

- different comparisons possible
- different increments possible
- var can not be modified inside the loop
- b, e, incr invariant during the loop
- the number of iterations must be computable at loop begin

Directives - for
Special clauses for the for directive:
- collapse: collapses nested loops and their iterations into a larger iteration space
- nowait: no synchronization at the end of the parallel loop
- schedule(type[, chunk]), with type one of:
  - static: statically assign chunks in a round-robin fashion; the default chunk size amounts to one piece per thread; good if all iterations take the same time; deterministic
  - dynamic: dynamically assign chunks to idling threads; default chunk size 1; more overhead, but better load balancing
  - guided: exponentially decrease the chunk size while dispatching; chunk specifies the smallest piece; default chunk size 1
  - auto: scheduling determined by the compiler and/or at run-time
  - runtime: scheduling determined at run-time, using the OMP_SCHEDULE variable
The default schedule is implementation specific (so better set it yourself!)
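As an illustration (work(), a, b, c are placeholders, not from the slides):

    /* iteration costs vary strongly: dynamic chunks balance the load better */
    #pragma omp parallel for schedule(dynamic, 4)
    for (int i = 0; i < 1000; i++)
        work(i);

    /* uniform iterations: static round-robin assignment has the least overhead */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < 1000; i++)
        a[i] = b[i] + c[i];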

Directives - single
- Block is only executed by one thread
- Implicit barrier at the end (unless nowait is specified)
- Reduces fork-join overhead
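A typical use (a sketch; read_input() and process_data() are placeholder functions) is one-time work inside a parallel region:

    #pragma omp parallel
    {
        #pragma omp single
        read_input();        /* executed by exactly one thread */
                             /* implicit barrier: all threads wait here */
        process_data();      /* executed by every thread of the team */
    }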

Directives - task

    struct node {
        struct node *left;
        struct node *right;
    };

    void traverse(struct node *p) {
        if (p->left) {
            #pragma omp task
            traverse(p->left);
        }
        if (p->right) {
            #pragma omp task
            traverse(p->right);
        }
        process(p); // expensive stuff
    }

    int main(int argc, char **argv) {
        struct node tree;
        // fill tree
        #pragma omp parallel
        {
            #pragma omp single
            { traverse(&tree); } /*omp end single*/
        } /*omp end parallel*/
    }

Directives - task
Further properties:
- tasks are created when a task pragma is encountered
- pending tasks are started if a thread is available
- #pragma omp taskwait can be used to perform task synchronization
- many clauses available
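A sketch of taskwait (compute_left() and compute_right() are hypothetical functions; the code is assumed to run inside a parallel region, e.g. in a single block):

    int left_val, right_val;
    #pragma omp task shared(left_val)
    left_val = compute_left();
    #pragma omp task shared(right_val)
    right_val = compute_right();
    #pragma omp taskwait              /* wait until both child tasks finished */
    int total = left_val + right_val; /* safe to combine the results now */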

Synchronization Constructs - critical
- Enclosed code is executed by all threads, but restricted to only one thread at a time
- One can supply a name after this directive to differentiate different critical sections
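A sketch (evaluate() and the variable names are illustrative) using a named critical section to protect updates to a shared maximum:

    int best = -1;
    #pragma omp parallel for
    for (int i = 0; i < 1000; i++) {
        int score = evaluate(i);      /* thread-local work */
        #pragma omp critical (best_update)
        {
            if (score > best)         /* only one thread at a time in here */
                best = score;
        }
    }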

Pros and Cons
Pros:
- portable multithreading code
- data layout and decomposition is handled automatically
- incremental approach
- code works in serial without adjustments
- original code does not change much
Cons:
- risk of race conditions
- only shared-memory