DPHPC: Introduction to OpenMP Recitation session


Salvatore Di Girolamo. DPHPC: Introduction to OpenMP, Recitation session. Based on http://openmp.org/mp-documents/intro_to_openmp_mattson.pdf

OpenMP: An Introduction. What is it? A set of compiler directives and a runtime library: #pragma omp parallel num_threads(4), #include <omp.h>. Why do we care? It simplifies (and standardizes) how multi-threaded applications are written, for Fortran, C, and C++. [Diagram: process memory layout, with one text, data, and BSS segment shared by all threads and a private stack per thread.]

OpenMP: An Introduction. OpenMP is based on the fork/join model. When the program starts, one master thread is created. The master thread executes the sequential portions of the program. At the beginning of a parallel region, the master thread forks new threads. All the threads together form a team. At the end of the parallel region, the forked threads die.

What's a Shared-Memory Program? One process that spawns multiple threads. Threads can communicate via shared memory: they read and write shared variables, so synchronization can be required. Each thread also has its own private memory. The OS decides how to schedule the threads. [Diagram: four threads, each with its own private memory, all attached to a shared memory area.]

OpenMP: Hello World. Make Hello World multi-threaded. The sequential version:

#include <stdio.h>

int main() {
    int ID = 0;
    printf("hello(%d) ", ID);
    printf("world(%d)\n", ID);
}

Include the OpenMP header and start a parallel region with the default number of threads; omp_get_thread_num() answers "Who am I?":

#include <stdio.h>
#include <omp.h>

int main() {
    #pragma omp parallel
    {
        int ID = omp_get_thread_num();
        printf("hello(%d) ", ID);
        printf("world(%d)\n", ID);
    }
}

Parallel Regions. A parallel region identifies a portion of code that can be executed by different threads. You create a parallel region with the parallel directive. You can request a specific number of threads with omp_set_num_threads(n), or with the num_threads clause:

double A[1000];
omp_set_num_threads(4);
#pragma omp parallel
{
    int ID = omp_get_thread_num();
    pooh(ID, A);
}

double A[1000];
#pragma omp parallel num_threads(4)
{
    int ID = omp_get_thread_num();
    pooh(ID, A);
}

Each thread calls pooh with a different value of ID.

Parallel Regions.

double A[1000];
omp_set_num_threads(4);
#pragma omp parallel
{
    int ID = omp_get_thread_num();
    pooh(ID, A);
}
printf("all done!\n");

[Diagram: the master thread forks four threads that run pooh(0,A), pooh(1,A), pooh(2,A), and pooh(3,A) concurrently; once they have all finished, the master thread prints "all done!".]

All the threads execute the same code. The A array is shared. There is an implicit synchronization at the end of the parallel region.

Behind the scenes. The OpenMP compiler generates code logically analogous to that shown below. All known OpenMP implementations use a thread pool, so the full cost of thread creation and destruction is not incurred for each parallel region. Only three extra threads are created, because the last parallel section is invoked from the parent thread.

#pragma omp parallel num_threads(4)
{
    foobar();
}

The logically equivalent Pthreads code:

void *thunk(void *arg) {
    foobar();
    return 0;
}

pthread_t tid[4];
for (int i = 1; i < 4; ++i)
    pthread_create(&tid[i], 0, thunk, 0);
thunk(0);
for (int i = 1; i < 4; ++i)
    pthread_join(tid[i], 0);

Exercise: Compute PI. Mathematically, we know that the integral of 4/(1+x*x) from 0 to 1 gives the value of pi, which is great since it gives us an easy way to check the answer. We can approximate the integral as a sum of rectangles, where each rectangle has width Δx and height F(x_i), evaluated at the middle of interval i.
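Written out, the approximation the slide describes, with N steps of width Δx = 1/N and midpoints x_i, is:

\pi = \int_0^1 \frac{4}{1+x^2}\,dx \approx \sum_{i=1}^{N} F(x_i)\,\Delta x,
\qquad F(x) = \frac{4}{1+x^2}, \quad x_i = \Big(i - \tfrac{1}{2}\Big)\Delta x, \quad \Delta x = \frac{1}{N}.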

Computing PI: Sequential Version.

#include <stdio.h>
#include <omp.h>

static long num_steps = 100000000;
double step;

int main() {
    int i;
    double x, pi, sum = 0.0;
    double start_time, run_time;

    step = 1.0 / (double) num_steps;
    start_time = omp_get_wtime();
    for (i = 1; i <= num_steps; i++) {
        x = (i - 0.5) * step;
        sum = sum + 4.0 / (1.0 + x * x);
    }
    pi = step * sum;
    run_time = omp_get_wtime() - start_time;
    printf("\n pi with %ld steps is %lf in %lf seconds\n", num_steps, pi, run_time);
}

Parallel PI. Create a parallel version of the pi program using a parallel construct. Pay close attention to shared versus private variables. In addition to a parallel construct, you will need the runtime library routines int omp_get_num_threads(), int omp_get_thread_num(), and double omp_get_wtime(). Possible strategy: run the same program on P processing elements, where P can be arbitrarily large, and use the rank (an ID ranging from 0 to P-1) to select between a set of tasks and to manage any shared data structures. This pattern is very general and has been used to support most (if not all) of the algorithm strategy patterns. MPI programs almost always use this pattern; it is probably the most commonly used pattern in the history of parallel programming. A sketch of such a solution follows.
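A minimal sketch of one such SPMD solution (an illustration under the strategy above, not the code from the original slides; NUM_THREADS, the sum array, and the cyclic iteration split are choices made here):

#include <stdio.h>
#include <omp.h>

#define NUM_THREADS 4

static long num_steps = 100000000;
double step;

int main() {
    int nthreads;
    double pi = 0.0, sum[NUM_THREADS];

    step = 1.0 / (double) num_steps;
    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int nthrds = omp_get_num_threads();
        double x;
        if (id == 0) nthreads = nthrds;   /* one thread records the actual team size */
        sum[id] = 0.0;
        /* cyclic split: thread id handles iterations id, id+nthrds, id+2*nthrds, ... */
        for (long i = id; i < num_steps; i += nthrds) {
            x = (i + 0.5) * step;
            sum[id] += 4.0 / (1.0 + x * x);
        }
    }
    for (int i = 0; i < nthreads; i++)
        pi += step * sum[i];
    printf("pi = %lf\n", pi);
    return 0;
}

Note that the threads update adjacent elements of sum[], which is exactly the false-sharing problem discussed two slides below.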

Simple Parallel PI. The original serial pi program with 100000000 steps ran in 1.83 seconds. Poor scaling!!!

False Sharing. If independent data elements happen to sit on the same cache line, each update causes the cache line to slosh back and forth between threads. Hot fix: pad arrays so the elements each thread uses sit on distinct cache lines.

Padding the PI
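The slide's code is not in the transcript; below is a sketch of the padded SPMD version, assuming a 64-byte cache line so that PAD = 8 doubles pushes each thread's accumulator onto its own line (PAD and NUM_THREADS are assumptions):

#include <stdio.h>
#include <omp.h>

#define NUM_THREADS 4
#define PAD 8   /* assumption: 64-byte cache line = 8 doubles */

static long num_steps = 100000000;
double step;

int main() {
    int nthreads;
    double pi = 0.0, sum[NUM_THREADS][PAD];   /* only column 0 is used; the rest is padding */

    step = 1.0 / (double) num_steps;
    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int nthrds = omp_get_num_threads();
        double x;
        if (id == 0) nthreads = nthrds;
        sum[id][0] = 0.0;
        for (long i = id; i < num_steps; i += nthrds) {
            x = (i + 0.5) * step;
            sum[id][0] += 4.0 / (1.0 + x * x);
        }
    }
    for (int i = 0; i < nthreads; i++)
        pi += step * sum[i][0];
    printf("pi = %lf\n", pi);
    return 0;
}

With each accumulator on its own cache line, the threads no longer invalidate each other's caches on every update, and the program should scale much better.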

SPMD vs Worksharing. A parallel construct by itself creates an SPMD, or Single Program Multiple Data, program: each thread redundantly executes the same code. How do you split up pathways through the code between threads within a team? Worksharing: the OpenMP loop construct (not the only way to go). The loop worksharing construct splits up loop iterations among the threads in a team:

#pragma omp parallel
{
    #pragma omp for
    for (i = 0; i < N; i++) {
        NEAT_STUFF(i);
    }
}

The loop variable i is made private to each thread by default. You could do this explicitly with a private(i) clause.

Why should we use it (the loop construct)? The slide compares three versions of the same loop: a sequential version, an OpenMP parallel region where the programmer splits the iterations by hand, and an OpenMP parallel region with the worksharing construct; see the sketch below.
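The slide's snippets are not in the transcript; a sketch of the three versions, using a simple vector-add loop as the body (the loop body, the array names a and b, and the size N are assumptions):

#include <omp.h>

#define N 1000
double a[N], b[N];

/* Sequential */
void seq_version(void) {
    for (int i = 0; i < N; i++)
        a[i] = a[i] + b[i];
}

/* OpenMP parallel region: each thread computes its own block of iterations by hand */
void parallel_region_version(void) {
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int nthrds = omp_get_num_threads();
        int istart = id * N / nthrds;
        int iend = (id + 1) * N / nthrds;
        if (id == nthrds - 1) iend = N;   /* last thread takes any remainder */
        for (int i = istart; i < iend; i++)
            a[i] = a[i] + b[i];
    }
}

/* OpenMP parallel region plus worksharing: the runtime splits the iterations */
void worksharing_version(void) {
    #pragma omp parallel
    #pragma omp for
    for (int i = 0; i < N; i++)
        a[i] = a[i] + b[i];
}

The worksharing version is the shortest and leaves the distribution of iterations to the runtime, which is the argument the slide makes for the loop construct.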

Working with Loops. Basic approach: find compute-intensive loops; make the loop iterations independent, so they can safely execute in any order without loop-carried dependencies; place the appropriate OpenMP directive and test.
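As an illustration of making iterations independent (this example is not from the slides; MAX, A, and big() are placeholders): in the first loop below, each iteration depends on the j produced by the previous one, so it cannot be parallelized as written; rewriting j as a function of the loop index removes the dependence.

#define MAX 1000
int A[MAX];
int big(int j);   /* some expensive function of j */

void with_dependence(void) {
    int j = 5;
    for (int i = 0; i < MAX; i++) {
        j += 2;               /* loop-carried dependence on j */
        A[i] = big(j);
    }
}

void without_dependence(void) {
    #pragma omp parallel for
    for (int i = 0; i < MAX; i++) {
        int j = 5 + 2 * (i + 1);   /* j computed from i: iterations are independent */
        A[i] = big(j);
    }
}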

Reduction. OpenMP reduction clause: reduction(op : list). Inside a parallel or a worksharing construct: a local copy of each list variable is made and initialized depending on the op (e.g. 0 for +); updates occur on the local copy; local copies are reduced into a single value and combined with the original global value. The variables in list must be shared in the enclosing parallel region.
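For example, a minimal sketch (the array A and the averaging computation are illustrative, not from the slides): each thread accumulates into its own private copy of ave, and the copies are summed into the shared ave at the end of the loop.

double A[MAX];   /* assume MAX and A are defined and filled elsewhere */
double ave = 0.0;
#pragma omp parallel for reduction(+:ave)
for (int i = 0; i < MAX; i++)
    ave += A[i];
ave = ave / MAX;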

PI with loop construct
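The slide's code is not in the transcript; a sketch of the pi program using the loop construct and a reduction (the structure follows the sequential version above, so the details are an assumption):

#include <stdio.h>
#include <omp.h>

static long num_steps = 100000000;
double step;

int main() {
    double pi, sum = 0.0;

    step = 1.0 / (double) num_steps;
    #pragma omp parallel
    {
        double x;                       /* declared inside the region, hence private */
        #pragma omp for reduction(+:sum)
        for (long i = 0; i < num_steps; i++) {
            x = (i + 0.5) * step;
            sum += 4.0 / (1.0 + x * x);
        }
    }
    pi = step * sum;
    printf("pi = %lf\n", pi);
    return 0;
}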