COMP 633 Parallel Computing: SMM (2) OpenMP Programming Model


COMP 633 - Parallel Computing
Lecture 7, September 12, 2017
SMM (2) OpenMP Programming Model

Reading for next time: look through sections 7-9 of the OpenMP tutorial

Topics
- OpenMP shared-memory parallel programming model
  - loop-level parallel programming
- Characterizing performance
  - performance measurement of a simple program
  - how to monitor and present program performance
  - general barriers to performance in parallel computation

Loop-level shared-memory programming model
- Work-Time programming model
  - sequential programming language + forall
  - PRAM execution
    - synchronous
    - scheduling implicit (via Brent's theorem)
  - W-T cost model (work and steps)
- Loop-level parallel programming model
  - sequential programming language + directives to mark a for loop as forall
  - shared-memory multiprocessor execution
    - asynchronous execution of loop iterations by multiple threads in a single address space
    - must avoid dependence on the synchronous execution model
    - scheduling of work across threads is controlled via directives
    - implemented by the compiler and run-time system
  - cost model depends on the underlying shared memory architecture
    - can be difficult to quantify, but some general principles apply

OpenMP
- OpenMP: parallelization directives for mainstream performance-oriented sequential programming languages
  - Fortran 90, C/C++
  - directives are written as comments in the program text
    - ignored by non-OpenMP compilers
    - interpreted by OpenMP-compliant compilers in OpenMP mode
- Directives specify
  - parallel execution
    - create multiple threads; generally each thread runs on a separate core in a CC-NUMA machine
  - partitioning of variables
    - a variable is either shared between threads OR each thread maintains a private copy
  - work scheduling in loops
    - partitioning of loop iterations across threads
- C/C++ binding of OpenMP
  - directives take the form #pragma omp ...

OpenMP example

    printf("Start.\n");
    #pragma omp parallel for shared(a,b) private(i)
    for (i = 1; i < N-1; i++) {
        b[i] = (a[i-1] + a[i] + a[i+1]) / 3;
    }
    printf("Done.\n");

- The parallel directive indicates that the next statement should be executed by all threads
- The for directive indicates that the work in the loop should be partitioned across the threads
- The shared and private clauses indicate that arrays a and b are shared by all threads, while loop index i has a separate instance in each thread
  (the clauses are unnecessary in this case since this is the default behavior)
- Without OpenMP enabled, execution is sequential

OpenMP components
- Directives
  - specify parallel vs. sequential regions
  - specify shared vs. private variables in parallel regions
  - specify work sharing: distribution of loop iterations over threads
  - specify synchronization and serialization of threads
- Run-time library
  - obtain parallel processing resources
  - control dynamic aspects of work sharing
- Environment variables
  - external to the program
  - specification of resources available for a particular execution
  - enables a single compilation to run with different numbers of processors
(A small example combining all three components appears below.)
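As an illustration (not from the lecture), here is a minimal sketch that touches all three components: the number of threads is supplied externally through OMP_NUM_THREADS, the run-time library is queried with omp_get_max_threads / omp_get_thread_num / omp_get_num_threads, and a directive creates the parallel region.

    /* Build with an OpenMP-capable compiler, e.g.  gcc -fopenmp example.c
       Choose resources externally, e.g.            export OMP_NUM_THREADS=8  */
    #include <stdio.h>
    #include <omp.h>

    int main() {
        /* run-time library: query the resources granted for this execution */
        printf("max threads = %d\n", omp_get_max_threads());

        #pragma omp parallel            /* directive: create a team of threads */
        {
            printf("thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }
        return 0;
    }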

C/OpenMP concepts: parallel region

    #pragma omp parallel shared( ) private( )
    <single entry, single exit block>

- Fork-join model
  - the master thread forks a team of threads on entry to the block
  - variables in scope within the block are
    - shared among all threads
      - if declared outside of the parallel region
      - if explicitly declared shared in the directive
    - private to (replicated in) each thread
      - if declared within the parallel region
      - if explicitly declared private in the directive
      - if the variable is a loop index variable in a loop within the region
  - the team of threads has dynamic lifetime to the end of the block
    - statements are executed by all threads
  - the end of the block is a barrier synchronization that joins all threads
    - only the master thread proceeds thereafter
(A small example of these scoping rules follows.)
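A minimal sketch (not from the slides) of the scoping rules: n, declared before the region, is shared; tid, declared inside, is private; and only the master thread runs past the closing brace.

    #include <stdio.h>
    #include <omp.h>

    int main() {
        int n = 1000;                       /* declared outside the region: shared */

        #pragma omp parallel                /* master forks a team of threads here */
        {
            int tid = omp_get_thread_num(); /* declared inside the region: private */
            printf("thread %d sees shared n = %d\n", tid, n);
        }                                   /* barrier, then join                  */

        printf("only the master thread continues\n");
        return 0;
    }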

C/OpenMP concepts: work sharing

    #pragma omp for schedule( )
    for (<var> = <lb>; <var> <op> <ub>; <incr-expr>)
        <loop body>

- Work sharing
  - only has meaning inside a parallel region
  - the iteration space is distributed among the threads
  - several different scheduling strategies are available (see the sketch below)
- The loop construct must follow some restrictions
  - <var> has a signed integer type
  - <lb>, <ub>, <incr-expr> must be loop invariant
  - <op>, <incr-expr> are restricted to simple relational and arithmetic operations
- Implicit barrier at completion of the loop
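A small sketch of a work-shared loop (the loop bound of 16 and chunk size of 4 are arbitrary choices for illustration); printing the thread number makes the iteration distribution visible.

    #include <stdio.h>
    #include <omp.h>

    int main() {
        int i;
        #pragma omp parallel
        {
            /* iterations 0..15 are distributed over the team in chunks of 4 */
            #pragma omp for schedule(static, 4)
            for (i = 0; i < 16; i++)
                printf("iteration %2d executed by thread %d\n",
                       i, omp_get_thread_num());
        }   /* implicit barrier at the end of the for, then end of the region */
        return 0;
    }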

Complete C program (V1)

    #include <stdio.h>
    #include <omp.h>

    #define N     50000000
    #define NITER 100

    double a[N], b[N];

    int main() {
        double t1, t2, td;
        int i, t, max_threads;

        max_threads = omp_get_max_threads();
        printf("initializing: N = %d, max # threads = %d\n", N, max_threads);

        /* initialize arrays */
        for (i = 0; i < N; i++) {
            a[i] = 0.0;
            b[i] = 0.0;
        }
        a[0] = b[0] = 1.0;

Program, contd. (V1)

        /* time iterations */
        t1 = omp_get_wtime();
        for (t = 0; t < NITER; t = t + 2) {
            #pragma omp parallel for private(i)
            for (i = 1; i < N-1; i++)
                b[i] = (a[i-1] + a[i] + a[i+1]) / 3.0;

            #pragma omp parallel for private(i)
            for (i = 1; i < N-1; i++)
                a[i] = (b[i-1] + b[i] + b[i+1]) / 3.0;
        }
        t2 = omp_get_wtime();
        td = t2 - t1;
        /* cast avoids integer overflow in NITER * N */
        printf("time per element = %6.1f ns\n", td * 1E9 / ((double)NITER * N));
        return 0;
    }

Program, contd. (V2 - enlarging the scope of the parallel region)

        /* time iterations */
        t1 = omp_get_wtime();
        #pragma omp parallel private(i,t)
        for (t = 0; t < NITER; t = t + 2) {
            #pragma omp for
            for (i = 1; i < N-1; i++)
                b[i] = (a[i-1] + a[i] + a[i+1]) / 3.0;

            #pragma omp for
            for (i = 1; i < N-1; i++)
                a[i] = (b[i-1] + b[i] + b[i+1]) / 3.0;
        }
        t2 = omp_get_wtime();
        td = t2 - t1;
        printf("time per element = %6.1f ns\n", td * 1E9 / ((double)NITER * N));

Complete program (V3 - page binding)

    #include <stdio.h>
    #include <omp.h>

    #define N     50000000
    #define NITER 100

    double a[N], b[N];

    int main() {
        double t1, t2, td;
        int i, t, max_threads;

        max_threads = omp_get_max_threads();
        printf("initializing: N = %d, max # threads = %d\n", N, max_threads);

        #pragma omp parallel private(i,t)
        {   // start parallel region

            /* initialize arrays */
            #pragma omp for
            for (i = 1; i < N; i++) {
                a[i] = 0.0;
                b[i] = 0.0;
            }
            #pragma omp master
            a[0] = b[0] = 1.0;

Program, contd. (V3 page binding) } /* * time iterations */ #pragma omp master t1 = omp_getwtime(); for (t = 0; t < NITER; t = t + 2){ #pragma omp for for (i = 1; i < N-1; i++) b[i] = (a[i-1] + a[i] + a[i+1]) / 3.0; #pragma omp for for (i = 1; i < N-1; i++) a[i] = (b[i-1] + b[i] + b[i+1]) / 3.0; } } // end parallel region t2 = omp_get_wtime(); td = t2 t1; printf("time per element = %6.1f ns\n", td * 1E9 / (NITER * N)); 13

Effect of caches
- Time to update one element in sequential execution

      b[i] = (a[i-1] + a[i] + a[i+1]) / 3.0;

  depends on where the elements are found: L1 cache, L2 cache, or main memory
[Figure: time per element t/n (ns) vs. number of elements n (1,000 to 10,000,000), with distinct plateaus for L1 cache, L2 cache, and main memory]

How to present scaling of parallel programs?
- Independent variables
  - number of processors p
  - problem size n
- Dependent variable (choose one)
  - Time (secs)
  - Rate (opns/sec)
  - Speedup S = T_1 / T_p
  - Efficiency E = T_1 / (p T_p)
- Horizontal axis: independent variable (n or p)
- Vertical axis: dependent variable
  (different curves for values of the parameter not on the horizontal axis)
(A worked example of the speedup and efficiency definitions follows.)
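Restating the two ratio measures, with a hypothetical timing pair to make the arithmetic concrete (the 60 s and 10 s figures are invented for illustration):

    S_p = \frac{T_1}{T_p}, \qquad E_p = \frac{T_1}{p\,T_p} = \frac{S_p}{p}
    \quad\text{e.g. } T_1 = 60\,\mathrm{s},\; T_8 = 10\,\mathrm{s}
    \;\Rightarrow\; S_8 = 6,\; E_8 = 0.75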

Time
- Shortest time is our true goal
- But it is hard to judge improvements because values get very small at large p
[Figure: time per element (ns) vs. number of processors (1-24), for n = 1,000,000 and n = 10,000,000]

Execution rate (MFLOP / second)
- Shows work per unit time
  - easier to judge scaling
  - highest detail at large n, p
  - how to measure MFLOPS?
[Figure: Parallel performance, MFLOP/second vs. number of processors (1-24), for n = 1,000,000 and n = 10,000,000]

Speedup
- Speedup of run time relative to a single processor (T_1 / T_p)
- How to define T_1?
  - run time of the parallel algorithm at p = 1?
  - run time of the best serial algorithm?
- Superlinear speedup?
[Figure: Parallel speedup vs. number of processors (1-24), for n = 1,000,000 and n = 10,000,000, with the ideal line S = p]

OpenMP: scheduling loop iterations
- Scheduling a loop with n iterations using p threads
  - the unit of scheduling is a chunk of k iterations
- schedule(static,k)
  - chunks are mapped to threads in round-robin fashion at entry to the loop
  - default k = n/p
- schedule(dynamic,k)
  - chunks are handed out consecutively to the first available thread
  - default k = 1
- schedule(guided,k)
  - a chunk of size d is handed to the first available thread
  - d decreases exponentially from n/p down to k: d_{i+1} = (1 - 1/p) d_i, where d_0 = n/p
  - default k = 1
  (the chunk-size sequence is tabulated in the sketch below)
[Figure: for each schedule type, a diagram of the n iterations partitioned into chunks, where T_i marks the iterations executed by thread i]
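The guided chunk sizes can be tabulated directly from the recurrence above. The sketch below is just the formula, not the OpenMP runtime itself (actual implementations may differ in detail); n, p, and k are example values.

    #include <stdio.h>

    int main() {
        double n = 10000000.0, p = 8.0, k = 1000.0;  /* example values        */
        double d = n / p;                            /* d_0 = n/p             */
        double remaining = n;

        while (remaining > 0.0) {
            double chunk = (d > k) ? d : k;          /* never smaller than k  */
            if (chunk > remaining) chunk = remaining;
            printf("chunk of %8.0f iterations\n", chunk);
            remaining -= chunk;
            d = (1.0 - 1.0 / p) * d;                 /* d_{i+1} = (1-1/p) d_i */
        }
        return 0;
    }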

Varying scheduling strategy: diffusion problem
[Figure: Speedup by schedule type (n = 10,000,000); log-log plot of speedup (0.01 to 100) vs. number of processors (1 to 100), with curves for static, guided, dynamic,32, and dynamic schedules and the ideal line S = p]

Causes of poor parallel performance
Possible suspects:
- Low computational intensity
  - performance limited by memory performance
- Poor cache behavior
  - access pattern has poor locality
  - access pattern is a poor match to CC-NUMA
- Sequential overhead
  - Amdahl's law: a fraction f of serial work limits speedup to 1/f (written out below)
- Load imbalance
  - unequal distribution of work, or
  - unequal thread progress on equal work (busy machine, uncooperative OS, CC-NUMA issues)
- Bad luck
  - insufficient sampling - show timing variation on plots!
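The Amdahl bound mentioned above, written out: if a fraction f of the work is serial, then

    S_p \;=\; \frac{1}{\,f + (1-f)/p\,} \;\le\; \frac{1}{f}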

Cache-related mysteries
[Figure: Effective performance, MFLOP/second (0-3000) vs. number of processors (1-24), for n = 1,000,000 and n = 10,000,000]

Cache-related mysteries: speedup
[Figure: Parallel speedup (single parallel region) vs. number of processors (1-24), for n = 1,000,000 and n = 10,000,000, with the ideal line S = p]

OpenMP on CC-NUMA
Performance guidelines
- Shared data structures
  - use cache-line spatial locality
    - linear access patterns (read and write)
    - structs with components grouped by access
  - don't mix reads and writes to the same data on different processors
    - use phased updates
  - avoid false sharing
    - unrelated values sharing a cache line updated by multiple threads
- (for CC-NUMA) make sure data structures are physically distributed across memories
  - by parallel initialization
    - an artifact of the page placement policy under e.g. Linux
  - by explicit placement directives and page allocation policies
  (a sketch of parallel initialization and false-sharing avoidance follows)
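A sketch (not from the slides) of two of these guidelines under a first-touch page placement policy such as Linux's: the array is initialized with the same static schedule that the compute loop uses, so each thread's pages land in its local memory, and per-thread counters are padded to an assumed 64-byte cache line to avoid false sharing.

    #include <stdlib.h>
    #include <omp.h>

    #define CACHE_LINE 64                  /* assumed cache-line size (bytes)   */

    typedef struct {                       /* one counter per thread, padded so */
        long value;                        /* no two counters share a line      */
        char pad[CACHE_LINE - sizeof(long)];
    } padded_counter;

    int main() {
        int n = 50000000, i;
        double *a = malloc(n * sizeof(double));
        padded_counter count[64] = {{0}};  /* room for up to 64 threads         */
        if (a == NULL) return 1;

        /* first-touch placement: initialize with the same distribution
           that the compute loop below will use                           */
        #pragma omp parallel for schedule(static)
        for (i = 0; i < n; i++)
            a[i] = 0.0;

        #pragma omp parallel for schedule(static)
        for (i = 0; i < n; i++) {
            a[i] += 1.0;
            count[omp_get_thread_num()].value++;   /* padded: no false sharing */
        }

        free(a);
        return 0;
    }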

OpenMP on CC-(N)UMA
Other guidelines
- Enlarge the parallel region to retain processor data affinity
  - avoids the overhead of repeated entry to a parallel region in an inner loop
- Use an appropriate work distribution schedule
  - static, else guided, else dynamic with a large chunk size
  - a runtime-specified schedule involves relatively small overhead (see the sketch below)
- Don't use too many processors
  - OS scheduling of threads behaves erratically when the machine is oversubscribed
  - be aware of dynamic thread adjustment (OMP_DYNAMIC)
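A small sketch of the runtime-specified schedule and of turning off dynamic thread adjustment (the loop body is a placeholder):

    #include <stdio.h>
    #include <omp.h>

    int main() {
        int i;
        double x[1000];

        omp_set_dynamic(0);   /* disable dynamic adjustment of the team size */

        /* schedule(runtime): the strategy is taken from the OMP_SCHEDULE
           environment variable, e.g.  export OMP_SCHEDULE="guided,64"     */
        #pragma omp parallel for schedule(runtime)
        for (i = 0; i < 1000; i++)
            x[i] = 0.5 * i;

        printf("x[999] = %f\n", x[999]);
        return 0;
    }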

Reductions and critical statements
- A reduction loop does not have independent iterations

      for (i = 0; i < n; i++) {
          sum = sum + a[i];
      }

- The loop may be parallelized by inserting a critical section
  - the critical directive applies to a single statement or block

      #pragma omp parallel for
      for (i = 0; i < n; i++) {
          #pragma omp critical
          sum = sum + a[i];
      }

  - but this is a poor strategy!
- A reduction loop can be identified using a reduction clause

      #pragma omp parallel for reduction(+: sum)
      for (i = 0; i < n; i++) {
          sum = sum + a[i];
      }

(A complete, compilable version of the reduction loop appears below.)
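A complete, compilable version of the reduction loop (the array contents are arbitrary, chosen so the expected result is known):

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    static double a[N];

    int main() {
        double sum = 0.0;
        int i;

        for (i = 0; i < N; i++)
            a[i] = 1.0;                   /* arbitrary data: the sum should be N */

        /* each thread accumulates a private partial sum; the partial sums
           are combined into the shared sum at the end of the loop           */
        #pragma omp parallel for reduction(+: sum)
        for (i = 0; i < N; i++)
            sum = sum + a[i];

        printf("sum = %.1f (expected %d)\n", sum, N);
        return 0;
    }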