Introduction to OpenMP


Introduction to OpenMP Ricardo Fonseca https://sites.google.com/view/rafonseca2017/

Outline Shared Memory Programming OpenMP Fork-Join Model Compiler Directives / Run time library routines Compiling and running OpenMP programs OpenMP fundamentals Approaches to Parallelism Data dependencies Shared / private variables OpenMP directives / functions Overview Examples & Projects

Shared Memory Programming

Shared Memory Programming Shared memory systems such as multi-core workstations have a single address space: Applications can be developed in which loop iterations (with no dependencies) are executed by different processors Shared memory codes are mostly data parallel, SIMD-style codes OpenMP is the de facto standard for shared memory programming (compiler directives) Most compilers support these directives natively (gcc, icc, xlc, gfortran, ifort, xlf, etc.)

OpenMP programming model OpenMP (Open Multi-Processing) is an API used to explicitly direct multithreaded, shared memory parallelism Programming model: parallelism is achieved through the use of threads; all threads share the same address space Explicit parallelism: offers full control over parallelization; can be as simple as taking a serial program and inserting compiler directives Simple to use: most parallelism is specified through simple compiler directives and a small API Works on shared memory systems only

Fork-Join Model All OpenMP programs begin as a single process (master thread) Fork: the master thread creates a team of parallel threads The statements in the parallel region are executed in parallel by the team threads Join: when the team threads complete the parallel region they synchronize and terminate, leaving only the master thread The number of parallel regions, and the number of threads in each, is arbitrary

Compiler Directives Appear as comments in the source code and are ignored unless compilers are told otherwise Are used for various purposes Spawning a parallel region Dividing blocks of code among threads Distributing loop iterations between threads Serializing sections of code Synchronization of work among threads /* Fork a team of threads giving them their own copies of variables */ #pragma omp parallel private(nthreads, tid) {... } /* All threads join master thread and disband */ example

Run-time Library Routines The API includes a set of routines for Setting and querying the number of threads Querying a thread's unique identifier (thread ID), a thread's ancestor's identifier, the thread team size Setting, initializing and terminating locks and nested locks Querying wall clock time and resolution etc. /* Obtain thread number */ tid = omp_get_thread_num(); printf("Hello World from thread = %d\n", tid); /* Obtain total number of threads */ nthreads = omp_get_num_threads(); printf("Number of threads = %d\n", nthreads); example

OpenMP Example http://www.openmp.org/

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char *argv[])
{
    int nthreads, tid;

    /* Fork a team of threads giving them their own copies of variables;
       the next block is split over all available threads */
    #pragma omp parallel private(nthreads, tid)
    {
        /* Obtain thread number - all threads do this */
        tid = omp_get_thread_num();
        printf("Hello World from thread = %d\n", tid);

        /* Only master thread does this */
        if (tid == 0) {
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
    } /* All threads join master thread and disband */
}

Compiling/Running OpenMP programs OpenMP programs can be compiled using a compiler supporting OpenMP: $ gcc -fopenmp omp_hello_world.c -o omp_hello_world Launch programs as usual; the number of threads can be controlled through the OMP_NUM_THREADS environment variable: bash$ export OMP_NUM_THREADS=3 bash$ ./omp_hello_world Hello World from thread = 1 Hello World from thread = 0 Hello World from thread = 2 Number of threads = 3 bash$

Environment Variables OpenMP provides several environment variables for controlling the execution of parallel code at run-time. Setting the number of threads Specifying how loop iterations are divided Binding threads to processors Enabling/disabling dynamic threads etc. bash$ export OMP_NUM_THREADS=3 bash$ ./omp_hello_world Hello World from thread = 1 Hello World from thread = 0 Hello World from thread = 2 Number of threads = 3 bash$ example
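The number of threads can also be set from within the program itself. A minimal sketch using the run-time library (the num_threads clause in the second region is just an illustration of a per-region override):

#include <omp.h>
#include <stdio.h>

int main(void)
{
    /* Request 3 threads for subsequent parallel regions;
       this takes precedence over OMP_NUM_THREADS */
    omp_set_num_threads(3);

    #pragma omp parallel
    printf("Hello World from thread = %d\n", omp_get_thread_num());

    /* The num_threads clause overrides the setting for a single region */
    #pragma omp parallel num_threads(2)
    printf("Hello again from thread = %d\n", omp_get_thread_num());

    return 0;
}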

OpenMP fundamentals

Approaches to Parallelism Two main approaches for distributing work among threads: Parallel loops Individual loops are parallelized by assigning each thread a range of loop indexes Parallel regions The code launches a number of threads, each with a unique id, and it is up to the programmer to split the workload Code outside these sections will be executed serially (both approaches are sketched below)
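A minimal sketch of the two approaches; the array names and sizes are illustrative only:

#include <omp.h>
#define N 1000

double a[N], b[N];

void approaches(void)
{
    int i;

    /* Parallel loop: OpenMP assigns ranges of the index i to the threads */
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        a[i] = 2.0 * b[i];

    /* Parallel region: the programmer splits the work using the thread id */
    #pragma omp parallel
    {
        int tid   = omp_get_thread_num();
        int nthr  = omp_get_num_threads();
        int chunk = (N + nthr - 1) / nthr;   /* iterations per thread */
        int lo = tid * chunk;
        int hi = (lo + chunk < N) ? lo + chunk : N;
        for (int j = lo; j < hi; j++)
            a[j] = 2.0 * b[j];
    }
}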

Data dependencies Not all operations in the code can be performed in parallel: Some operations require other operations to complete first: for (i=1; i < N; i++) { a[i] = a[i] + a[i-1]; } example When an operation depends on the result of another one this is called a data dependency A loop can be straightforwardly parallelized if there are no data dependencies: All assignments are performed on arrays Each element of an array is written by at most one iteration No loop iteration reads array elements modified by another iteration (a dependency-free counterpart is sketched below)
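For contrast, a minimal sketch of a loop with no data dependencies, which can be split among threads directly (array names are illustrative):

#include <omp.h>
#define N 1000

double a[N], b[N], c[N];

void add_arrays(void)
{
    int i;
    /* Each iteration writes a[i] and reads only b[i] and c[i]:
       no iteration depends on the result of another */
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        a[i] = b[i] + c[i];
}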

Shared / private variables Since we are in a shared memory environment, variables inside a parallel region share the same address For the loop index in a parallel loop this would of course pose a problem: different threads require different values of this variable. OpenMP offers control as to how variables are shared among threads, or kept private inside each thread, using clauses: private: Each thread has a different copy of the variable. This is the default for the loop index variable. shared: All threads share the variable. This is the default for all other variables. #pragma omp for private(tmp) shared(a,b,c) for (i=0; i < N; i++) { tmp = 2 * a[i]; a[i] = tmp; b[i] = c[i] / tmp; } example

OpenMP Yee field solver Algorithm: Spawn nt threads; each thread handles a given field grid region inside the node No algorithm overhead Only 2 lines of OpenMP code!

!$omp parallel do private(i2,i1)
do i3 = 0, b%nx(3) + 1
  do i2 = 0, b%nx(2) + 1
    do i1 = 0, b%nx(1) + 1
      b%f3( 1, i1, i2, i3 ) = ...
      b%f3( 2, i1, i2, i3 ) = ...
      b%f3( 3, i1, i2, i3 ) = ...
    enddo
  enddo
enddo
!$omp end parallel do

[Figure: local E,B field grid inside a shared memory node, split among Thread 1, Thread 2, Thread 3]

OpenMP directives / functions

Parallel for construct Specifies that the iterations of the loop immediately following it must be executed in parallel by the team

C/C++:
#pragma omp for [clause ...] newline
    schedule (type [,chunk])
    ordered
    private (list)
    firstprivate (list)
    lastprivate (list)
    shared (list)
    reduction (operator: list)
    collapse (n)
    nowait
  for (...) { ... }

The amount of work (chunk) for each thread can be controlled Nested loops are also allowed (a collapse sketch follows the example)

#pragma omp for
for (i=0; i < N; i++)
    c[i] = a[i] + b[i];
example
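A minimal sketch of parallelizing a nested loop with the collapse clause (requires OpenMP 3.0 or later; the matrix name and sizes are illustrative):

#include <omp.h>
#define NROWS 500
#define NCOLS 400

double a[NROWS][NCOLS];

void init_matrix(void)
{
    int i, j;
    /* collapse(2) merges the two loops into a single NROWS*NCOLS
       iteration space, which is then divided among the threads */
    #pragma omp parallel for collapse(2)
    for (i = 0; i < NROWS; i++)
        for (j = 0; j < NCOLS; j++)
            a[i][j] = i + j;
}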

Variables in parallel regions OpenMP includes a number of clauses to control how variables are shared or not among threads:

private(var list): Creates a separate copy of the variables for each thread. The variables are not initialized; the programmer must initialize them inside the parallel region.
shared(var list): All the threads will be able to modify and access the variable. This is the default (except for the loop index).
default(private): The programmer may change the default behavior to private to avoid having to declare a lot of variables (Fortran only; in C/C++ the default clause traditionally accepts only shared or none).

#pragma omp for private(tmp) shared(a,b,c)
for (i=0; i < N; i++) {
    tmp = 2 * a[i];
    a[i] = tmp;
    b[i] = c[i] / tmp;
}

Initializing / retaining private variables OpenMP also includes clauses to control how private variables can be initialized, and how to retain values after the parallel section:

firstprivate(var list): Declares a variable private, and broadcasts the value the variable had before the beginning of the parallel section to all threads.
lastprivate(var list): Declares a variable private, and copies the value from the (sequentially) last loop iteration back to the original variable after the parallel section.

j = 1;
#pragma omp parallel for firstprivate(j)
for (i=0; i<SIZE; i++) {
    a[i] = a[i] + j;
}

#pragma omp parallel for lastprivate(x)
for (i=0; i<SIZE; i++) {
    x = (double) i / (SIZE-1);
    a[i] = x*x;
}
printf("last x = %g\n", x);

Parallel Reduction A parallel reduction is a very common operation so OpenMP includes a clause to avoid doing it explicitly C C++ reduction (operator : list) This can be used, for example, to sum values computed over a large array (here, the dot product of a and b): result = 0.0; #pragma omp for reduction(+:result) for (i=0; i < N; i++) result = result + (a[i] * b[i]); printf("Final result = %f\n", result); example

Order of execution Threads inside a parallel region will be executed in an arbitrary order. If required, the programmer can force a region of code to be executed sequentially, just like the serial version. This is done using the ordered clause: #pragma omp parallel for private(t) ordered for(i=0; i<size; i++) { t = func(i); #pragma omp ordered { printf("func(%d) = %g\n", i, t); } } There will be an ordered section in this loop The next section of the code will be executed in order of increasing loop index

Controlling the work done by each thread When using a parallel for section OpenMP will default to splitting the loop into equal chunks among available threads This may not always be the most efficient way to partition work: In some cases different iterations in a loop may have different workloads This leads to load imbalance and lower performance: the slowest thread dominates computing time OpenMP allows the programmer to control the distribution of iterations over the available threads using the schedule clause

Schedule clause C C++ schedule( type [, chunk_size]) Main schedule types: static - assigns the same number of iterations to each thread dynamic - assigns 1 iteration to each thread. When 1 thread finishes, it is assigned 1 more iteration until all iterations complete. Chunk size (optional) Controls the number of iterations assigned to each thread each time Defaults to Niter/Nthreads for static and 1 for dynamic #pragma omp for schedule(dynamic) for (i=0; i < N; i++) { a[i] = variable_work_load(i); }

Parallel region construct A parallel region is a block of code that will be executed by multiple threads

C/C++:
#pragma omp parallel [clause ...] newline
    if (scalar_expression)
    private (list)
    shared (list)
    default (shared | none)
    firstprivate (list)
    reduction (operator: list)
    copyin (list)
    num_threads (integer-expression)
  { ... }

Can set the number of threads and control which variables are shared / private among threads When reaching a parallel directive the code creates a team of threads and the code is handled by all threads There is an implied barrier at the end of the parallel section

Critical / Barrier C C++ #pragma omp critical {... } #pragma omp barrier Used inside parallel regions The CRITICAL directive specifies a region of code that must be executed by only one thread at a time This allows the threads to avoid conflicts, e.g. when writing to memory The BARRIER directive synchronizes all threads in the team No thread will continue until all threads have reached the barrier (both are used in the sketch below)
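A minimal usage sketch: every thread adds a contribution to a shared total inside a critical section, and a barrier ensures all contributions are in before the total is printed (variable names are illustrative):

#include <omp.h>
#include <stdio.h>

int main(void)
{
    int total = 0;

    #pragma omp parallel shared(total)
    {
        int contribution = omp_get_thread_num() + 1;   /* per-thread value */

        /* critical: only one thread at a time may update the shared total */
        #pragma omp critical
        total += contribution;

        /* barrier: wait until every thread has added its contribution */
        #pragma omp barrier

        if (omp_get_thread_num() == 0)
            printf("Total = %d\n", total);
    }
    return 0;
}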

OpenMP library functions Include <omp.h> to use these routines (not required for programs that only use compiler directives) Querying thread configuration omp_get_num_threads() - Gets the number of active threads inside a parallel region omp_get_thread_num() - Gets the unique thread id inside a parallel region omp_get_max_threads() - Gets the default maximum number of threads in a program Functions for timing your code omp_get_wtime() - Gets the elapsed time from a fixed time in the past, in seconds omp_get_wtick() - Gets the timer precision, in seconds
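A minimal sketch of timing a parallel loop with omp_get_wtime() (the loop body and array size are illustrative):

#include <omp.h>
#include <stdio.h>

#define N 10000000

static double a[N];

int main(void)
{
    int i;
    double t0 = omp_get_wtime();          /* wall-clock time, in seconds */

    #pragma omp parallel for
    for (i = 0; i < N; i++)
        a[i] = 2.0 * i;

    double t1 = omp_get_wtime();
    printf("Elapsed time: %g s (timer resolution %g s)\n",
           t1 - t0, omp_get_wtick());
    return 0;
}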

Overview

Overview OpenMP allows programmers to easily exploit multiple cores on shared memory systems It can be applied from modest laptops to high-end workstations OpenMP provides a standard/portable toolset for this computing paradigm Minimal learning curve and hardware resources are required to begin parallel programming

Further Reading NCSA Introduction to OpenMP course https://www.citutor.org/login.php?course=24 Lawrence Livermore National Laboratory OpenMP tutorial https://computing.llnl.gov/tutorials/openmp/ Parallel Programming in OpenMP, R. Chandra et al., Morgan Kaufmann Using OpenMP, B. Chapman et al., MIT Press

Examples & Projects

Example Programs I/II Source for examples can be found at https://sites.google.com/view/rafonseca2017/ OpenMP Fundamentals Hello World (helloworld.c) Parallel for construct (parallel_for.c) Private/shared variables (private_shared.c) Reduction (reduction.c) Clauses Initializing private variables (firstprivate.c) Retaining value of private variables (lastprivate.c) Ordered execution (ordered.c)

Example Programs II/II Parallel for scheduling Scheduling modes (schedule.c) Run the code with: The OpenMP pragmas commented out (serial execution) Static scheduling Dynamic scheduling Analyze the results

Project 1 Write a program that calculates matrix multiplication using OpenMP Implement C[n×p] = A[n×m] B[m×p] (#rows × #cols) Verify correctness (compare parallel vs. serial execution) Measure speedup for different matrix sizes Indexing: A_{row,col}, so $C_{ij} = \sum_{k=1}^{m} A_{ik} B_{kj}$
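A possible starting point for the parallel kernel, assuming row-major storage and parallelizing the outer loop over rows (a sketch under these assumptions, not a complete solution):

/* C[n][p] = A[n][m] * B[m][p], all stored row-major in 1-D arrays */
void matmul(int n, int m, int p,
            const double *A, const double *B, double *C)
{
    int i, j, k;
    #pragma omp parallel for private(j, k)
    for (i = 0; i < n; i++)
        for (j = 0; j < p; j++) {
            double sum = 0.0;
            for (k = 0; k < m; k++)
                sum += A[i*m + k] * B[k*p + j];
            C[i*p + j] = sum;
        }
}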

Project 2 Calculate π in parallel using: $\pi = \int_0^1 \frac{4}{1+x^2}\,dx$ Integrate using Euler's method Split the integration interval over available threads Calculate final result using OpenMP reduction
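A possible sketch of the reduction-based approach, using a simple rectangle sum over NSTEPS sub-intervals (NSTEPS is an illustrative choice; a left-endpoint sum, x = i*step, matches Euler's method, while a midpoint sum is used here):

#include <omp.h>
#include <stdio.h>

#define NSTEPS 100000000

int main(void)
{
    double step = 1.0 / NSTEPS;
    double sum = 0.0;
    int i;

    /* Each thread accumulates a partial sum; reduction(+) combines them */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < NSTEPS; i++) {
        double x = (i + 0.5) * step;      /* midpoint of sub-interval i */
        sum += 4.0 / (1.0 + x * x);
    }

    printf("pi ~= %.12f\n", sum * step);
    return 0;
}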