OpenMP. A. Klypin.


Outline: shared memory and OpenMP; simple example; threads; dependencies; directives; handling common blocks; synchronization; improving load balance with dynamic schedule; data placement.

OpenMP and shared memory computers. Programming with MPI is difficult: the programmer takes over part of the job that should be handled by the system. OpenMP, by contrast, is easy to use: it allows incremental parallelization, it is relatively easy to get a speedup on modest-sized computers, and it scales to large processor counts. Multi-core processors are already here, and industry is moving in the direction of shared memory systems. A hybrid MPI+OpenMP approach minimizes communications.

Simple example. Variables and common blocks can be either shared or private. Each thread (processor) has its own copy of every private variable, while shared variables are visible to all threads and any thread can change them. The most frequent bug: variables that should have been declared private but were not.
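As a hedged illustration (the subroutine and variable names below are not from the slides), a loop where the arrays are shared and the temporary x must be private:

  subroutine simple(n, a, b)
    integer :: n, i
    real    :: a(n), b(n), x
!$omp parallel do default(shared) private(i, x)
    do i = 1, n
       x    = 0.5*(a(i) + b(i))   ! x is private: each thread needs its own copy
       a(i) = x*x                 ! a, b, n are shared: all threads see the same arrays
    enddo
!$omp end parallel do
  end subroutine simple

If x were left shared, threads would overwrite each other's temporary value, which is exactly the "missed private variable" bug described above.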

Formation of threads: the master thread spawns a team of threads. [Figure: fork-join model. Execution is serial outside the parallel region; the loop Do i=1,n ... EndDo is executed in parallel by the team, after which the threads join and serial execution resumes.]

Dependencies. For a loop to be parallelizable, its results must not depend on the order in which its iterations are executed. For example, the following loop is not parallel:

  do i = 2, n
     a(i) = i*a(i-1)
  enddo

Yet this loop can be parallelized, because each iteration reads only an odd-indexed element that no iteration writes:

  do i = 2, n, 2
     a(i) = i*a(i-1)
  enddo

If the results change with the order of execution, we say that we have a race condition.

Dependencies. Different ways of removing race conditions: (1) rewrite the algorithm; (2) split the loop into two loops, one without dependencies (which does most of the computation) and another that handles the dependencies; (3) introduce new arrays that store results separately for each processor, as in the sketch below.
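A sketch of the third approach, assuming a histogram-style accumulation (the names histo, ibin, part and the combine step are illustrative, not from the slides): each thread fills its own column of a work array, and the columns are added together serially afterwards.

  subroutine histo(n, ibin, nbin, hist)
    use omp_lib                          ! omp_get_max_threads, omp_get_thread_num
    integer :: n, nbin, i, it, nt
    integer :: ibin(n)
    real    :: hist(nbin)
    real, allocatable :: part(:,:)       ! one column of partial results per thread
    nt = omp_get_max_threads()
    allocate(part(nbin, 0:nt-1))
    part = 0.
!$omp parallel private(i, it)
    it = omp_get_thread_num()
!$omp do
    do i = 1, n
       part(ibin(i), it) = part(ibin(i), it) + 1.  ! no race: each thread owns its column
    enddo
!$omp end do
!$omp end parallel
    hist = 0.
    do it = 0, nt-1                      ! serial combine step
       hist(:) = hist(:) + part(:, it)
    enddo
    deallocate(part)
  end subroutine histo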

Example: assignment of density

OpenMP constructs:
  Directives: Do, Schedule (Static; Dynamic,chunk; Guided,chunk), Ordered, Sections, Single
  Environment control: OMP_NUM_THREADS, OMP_DYNAMIC, OMP_SCHEDULE
  Data: ThreadPrivate, Shared, Private, FirstPrivate, LastPrivate, Reduction, Copyin, Default
  Synchronization: Master, Critical, Barrier, Atomic, Ordered

Common blocks: ThreadPrivate. Determine which common blocks must be private and declare them threadprivate in each subroutine. Use the COPYIN(list) clause to assign the master thread's values to the threadprivate common blocks; the list may contain names of common blocks and names of variables. Be careful with large common blocks: you may run out of memory.

  SUBROUTINE Mine
  Common /A/ x, y, z
  Common /B/ v, u, w
!$omp threadprivate(/A/,/B/)
!$omp parallel do default(shared)
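A hedged sketch of COPYIN in action (the routine work and the assigned values are illustrative): the master thread sets the common block before the parallel region, and COPYIN copies those values into every thread's private copy of it.

  subroutine mine(n)
    integer :: n, i
    real    :: x, y, z
    common /a/ x, y, z
!$omp threadprivate(/a/)
    x = 1.; y = 2.; z = 3.            ! set in the master thread's copy of /a/
!$omp parallel do default(shared) private(i) copyin(/a/)
    do i = 1, n
       call work(i)                   ! every thread sees /a/ initialized to 1., 2., 3.
    enddo
  end subroutine mine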

Schedule: handling load balance. Normally every thread receives an equal share of the loop iterations. For example, with 10 threads and the loop do i=1,10000, the first thread gets iterations 1-1000, the second 1001-2000, and so on. This works well if each chunk of iterations involves the same amount of computation. If it does not, we need the DYNAMIC option of the SCHEDULE clause. DYNAMIC takes a parameter, chunk, which defines the number of iterations handed to a thread at a time; the first thread to finish its chunk takes the next available one. The chunk parameter may be a variable and can be assigned inside the code:

  SUBROUTINE Mine(N)
  ...
  Nchunk = N/100
!$omp parallel do default(shared)
!$omp+private(i)
!$omp+schedule(dynamic,Nchunk)
  Do i = 1, N
  ...

Synchronization: critical sections. The CRITICAL directive defines a section of code that is executed by only one thread at a time, which may dramatically slow down the code. If a thread is currently executing inside a CRITICAL region and another thread reaches that region and attempts to execute it, the second thread blocks until the first thread exits the region. Critical sections can be used to sum private contributions into shared variables and to make I/O contiguous (avoid interleaved output from different threads).

  SUBROUTINE Mine
  ...
!$omp parallel do default(shared)
  ...
!$omp critical
  Global(i,j,k) = Global(i,j,k) + dx
!$omp end critical
!$omp critical
  write(*,*) 'I am here:', i
!$omp end critical

Synchronization. The BARRIER directive synchronizes all threads in the team: when a BARRIER is reached, each thread waits at that point until all other threads have reached it, and then all threads resume executing the code that follows the barrier in parallel. The ATOMIC directive specifies that a specific memory location must be updated atomically, rather than letting multiple threads attempt to write to it; in essence, it provides a mini critical section.
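A hedged sketch of ATOMIC replacing the CRITICAL update shown above (the array names cell, weight, global are illustrative):

  subroutine accumulate(n, m, cell, weight, global)
    integer :: n, m, i, j
    integer :: cell(n)
    real    :: weight(n), global(m), dx
!$omp parallel do default(shared) private(i, j, dx)
    do i = 1, n
       j  = cell(i)
       dx = weight(i)
!$omp atomic
       global(j) = global(j) + dx     ! only this single update is protected,
    enddo                             ! which is cheaper than a full critical section
  end subroutine accumulate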

The REDUCTION clause. The REDUCTION clause performs a reduction on the variables that appear in its list. A private copy of each list variable is created for each thread. At the end of the loop, the reduction operation is applied to the private copies, and the final result is written to the global shared variable. This is the way to compute constructs such as scalar products or to find the maximum element of an array.

  REDUCTION (operator|intrinsic : list)

Operators and intrinsics: +, *, -, MAX, MIN, IAND, IOR, .AND., .OR.

Example:
!$omp do reduction(+:x,y) reduction(max:xmax,ymax)
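A minimal sketch combining a sum and a maximum reduction (the subroutine and variable names are illustrative):

  subroutine stats(n, a, total, amax)
    integer :: n, i
    real    :: a(n), total, amax
    total = 0.
    amax  = -huge(amax)
!$omp parallel do default(shared) private(i) reduction(+:total) reduction(max:amax)
    do i = 1, n
       total = total + a(i)        ! each thread accumulates into a private copy;
       amax  = max(amax, a(i))     ! the copies are combined when the loop ends
    enddo
  end subroutine stats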

Environment variables. OMP_NUM_THREADS sets the maximum number of threads to use during execution, for example: setenv OMP_NUM_THREADS 8. OMP_SCHEDULE applies only to DO and PARALLEL DO (Fortran) and for and parallel for (C/C++) directives that have their SCHEDULE clause set to RUNTIME; its value determines how the iterations of the loop are scheduled on processors. For example: setenv OMP_SCHEDULE "guided, 4" or setenv OMP_SCHEDULE "dynamic".
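A hedged sketch of a loop whose scheduling is deferred to OMP_SCHEDULE through SCHEDULE(RUNTIME) (the routine name and loop body are illustrative):

  subroutine run(n, a)
    integer :: n, i
    real    :: a(n)
!$omp parallel do default(shared) private(i) schedule(runtime)
    do i = 1, n
       a(i) = sqrt(real(i))        ! the schedule is read at run time, e.g. after
    enddo                          ! setenv OMP_SCHEDULE "dynamic,100"
  end subroutine run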

Re-ordering data: a 3-D FFT pass in the z direction. [Figure: accessing the array along z is non-local; after re-ordering the data, the same pass has good locality.]

Mapping a multi-dimensional array into 1-D memory: A(Nx,Ny,Nz) is stored so that element (i,j,k) sits at 1-D position i + (j-1)*Nx + (k-1)*Nx*Ny. [Figure: memory consists of consecutive blocks of Nx elements, one block per (j,k) pair, shown for k=1,2,3.]
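The same mapping written as a small helper function (the name idx is hypothetical):

  integer function idx(i, j, k, nx, ny)
    integer :: i, j, k, nx, ny
    idx = i + (j-1)*nx + (k-1)*nx*ny   ! 1-D position of element (i,j,k) in column-major storage
  end function idx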

Transposition of the array: A(i,j,k) → A(k,j,i). Now do the FFT along x, as in the sketch below.
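A hedged sketch of the transposition as an OpenMP loop (the array names a and b are illustrative): in the transposed array the original z index is the fastest-varying one, so the remaining FFT pass reads contiguous memory.

  subroutine transpose_zx(nx, ny, nz, a, b)
    integer :: nx, ny, nz, i, j, k
    real    :: a(nx,ny,nz), b(nz,ny,nx)
!$omp parallel do default(shared) private(i, j, k)
    do k = 1, nz
       do j = 1, ny
          do i = 1, nx
             b(k,j,i) = a(i,j,k)     ! each element of b is written by exactly one thread: no race
          enddo
       enddo
    enddo
  end subroutine transpose_zx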