Lecture 14. Performance Programming with OpenMP


Announcements
Sluggish front-end nodes? Advice from NCSA: the login nodes are honest[1-4].ncsa.uiuc.edu. Log in to a specific node to target one that is not overloaded; don't rely on the round-robin assignment.

OpenMP programming
Simpler interface than pthreads: parallelization is handled via annotations for loop decomposition, serial sections, and barriers. OpenMP is a standard: http://www.openmp.org

Parallel loop:

#pragma omp parallel
{
    #pragma omp for private(i) shared(n)
    for (i = 0; i < n; i++)
        work(i);
}

Documentation
Documentation at software.intel.com/en-us/articles/:
  getting-started-with-openmp
  more-work-sharing-with-openmp
  advanced-openmp-programming
  32-openmp-traps-for-c-developers
Tutorial on OpenMP, with code examples:
  https://computing.llnl.gov/tutorials/openmp/exercise.html
More updates in the Abe document, in the OpenMP section:
  http://www.cse.ucsd.edu/classes/fa08/cse260/abe.html#openmp

Programming model
Start with a single root thread, then fan out a set of concurrently executing threads. The threads may or may not execute on different processors, and might be interleaved; scheduling behavior is specified separately.

#pragma omp parallel   // Begin parallel construct
{
    STATEMENTS
}                      // End of parallel construct; disband the team
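
As a minimal sketch of this fork-join model (not from the slides; the number of threads is whatever the runtime provides), each team member can query its own identity inside the parallel construct:

#include <stdio.h>
#include <omp.h>

int main(void) {
    printf("Serial: only the root thread runs here\n");

    #pragma omp parallel            // fork: create a team of threads
    {
        int id  = omp_get_thread_num();   // this thread's identity
        int nth = omp_get_num_threads();  // size of the team
        printf("Hello from thread %d of %d\n", id, nth);
    }                               // join: team synchronizes and disbands

    printf("Serial again: the team has been dissolved\n");
    return 0;
}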

Parallel Sections

#pragma omp parallel            // Begin a parallel construct: form a team
{                               // Each team member executes the same code
    #pragma omp sections        // Begin work sharing
    {
        #pragma omp section     // A unit of work
        { x = x + 1; }
        #pragma omp section     // Another unit
        { x = x + 1; }
    }                           // Wait until both units complete
}                               // End of parallel construct; disband the team
// continue serial execution

Critical Sections
Only one thread at a time may run the code in a critical section; mutual exclusion is used to implement critical sections.

#pragma omp parallel            // Begin a parallel construct: form a team
{
    // Replicated code
    #pragma omp critical        // Critical section
    { x = x + 1; }
    #pragma omp critical        // Another critical section
    { x = x + 1; }
    ...                         // More replicated code
    #pragma omp barrier         // Wait for all members to arrive
}                               // End of parallel construct; disband the team

Parallel for loop

#pragma omp parallel private(i) shared(n)
{
    #pragma omp for
    for (i = 0; i < n; i++)
        work(i);
}

Static workload assignment: each thread gets a unique range of indices (i = istart to iend) based on its identity. This assumes the calls to work(i) take the same time to compute.
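
A sketch (not the slides' code) of the block partition that the for directive performs implicitly under the default static schedule; the ceiling-division formula is one common choice, and work() is a user-supplied routine:

#include <omp.h>

extern void work(int i);

void static_partition(int n) {
    #pragma omp parallel
    {
        int id  = omp_get_thread_num();
        int nth = omp_get_num_threads();
        int chunk  = (n + nth - 1) / nth;              // ceiling division
        int istart = id * chunk;                       // this thread's first index
        int iend   = istart + chunk > n ? n : istart + chunk;
        for (int i = istart; i < iend; i++)
            work(i);
    }
}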

Shorthand
The parallel and for directives can be combined:

#pragma omp parallel for private(i) shared(n)
for (i = 0; i < n; i++)
    work(i);

Implementing the Game of Life with OpenMP (See $(PUB)/Examples/Life_omp)

Parallelization

#pragma omp parallel for shared(maxiter,population,gmesh,pmesh,nx,ny) private(i,j)
for (i = 1; i < nx-1; i++) {
    for (j = 1; j < ny-1; j++) {
        int nn = gmesh[i+1][j] + gmesh[i-1][j] + gmesh[i][j+1] + gmesh[i][j-1] +
                 gmesh[i+1][j+1] + gmesh[i-1][j-1] + gmesh[i-1][j+1] + gmesh[i+1][j-1];
        pmesh[i][j] = gmesh[i][j] ? (nn == 2 || nn == 3) : (nn == 3);
    }
}

Measured running times over four runs, and speedup Sp relative to one thread (best time):
NT = 1:  1.10   1.10   1.10   1.10
NT = 2:  0.565  0.582  0.584  0.585   Sp = 1.95
NT = 4:  0.283  0.283  0.305  0.283   Sp = 3.89
NT = 8:  0.219  0.184  0.208  0.338   Sp = 5.98

Nested parallelization

#pragma omp parallel for shared(maxiter,population,gmesh,pmesh,nx,ny) private(i,j) schedule(static)
for (i = 1; i < nx-1; i++) {
    #pragma omp parallel for shared(maxiter,population,gmesh,pmesh,nx,ny) private(i,j) schedule(static)
    for (j = 1; j < ny-1; j++) {
        int nn = gmesh[i+1][j] + gmesh[i-1][j] + gmesh[i][j+1] + gmesh[i][j-1] +
                 gmesh[i+1][j+1] + gmesh[i-1][j-1] + gmesh[i-1][j+1] + gmesh[i+1][j-1];
        pmesh[i][j] = gmesh[i][j] ? (nn == 2 || nn == 3) : (nn == 3);
    }
}

Enable nesting with: setenv OMP_NESTED 1

Moving between parallel and serial execution

#pragma omp parallel for shared(maxiter,population,gmesh,pmesh,nx,ny) private(i,j) schedule(static)
for (i = 1; i < nx-1; i++) {
    ...
}

X = 2;   // serial execution between the two parallel loops

#pragma omp parallel for shared(maxiter,population,gmesh,pmesh,nx,ny) private(i,j) schedule(static)
for (i = 1; i < nx-1; i++) {
    ...
}

How does it work under the hood?
When the master thread encounters a parallel construct, it creates a team of threads. The enclosed program statements are executed in parallel by all team members, including procedure calls. Statements enclosed lexically within a construct define the static extent of the construct. When we reach the end of the scope, the team of threads synchronizes, the team is dissolved, and only the master thread continues execution; the other threads in the team enter a wait state. Thread teams can be created and dissolved many times during program execution.
www.ncsa.uiuc.edu/userinfo/resources/software/intel/compilers/10.0/main_cls/mergedProjects/optaps_cls/whskin_homepage.htm

Reductions in OpenMP
OpenMP gives each thread a local accumulator, which it then combines globally when the loop is over:

#pragma omp parallel for reduction(+:sum)
for (int i = i0; i < i1; i++)
    sum += x[i];
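
A minimal sketch (not from the slides) of what the reduction clause does for us: each thread keeps a private partial sum, and the partial sums are combined once per thread, under mutual exclusion, after the loop. The function name, array, and bounds here are illustrative assumptions:

#include <omp.h>

double sum_with_manual_reduction(const double *x, int i0, int i1) {
    double sum = 0.0;
    #pragma omp parallel
    {
        double local = 0.0;                 // per-thread accumulator
        #pragma omp for
        for (int i = i0; i < i1; i++)
            local += x[i];
        #pragma omp critical                // combine partial sums, one thread at a time
        sum += local;
    }
    return sum;
}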

Dynamic workload assignment
Static decomposition assumes that all the iterations take the same time. In some applications this condition doesn't hold, so we can't use static decomposition. Motivating application: the Mandelbrot set computation, named after B. Mandelbrot.

A quick review of complex numbers
We define i = sqrt(-1). A complex number z = x + iy has real part x and imaginary part y. We associate each complex number with a point in the x-y plane. The magnitude of a complex number is the same as vector length: |z| = sqrt(x^2 + y^2). Squaring gives z^2 = (x + iy)(x + iy) = (x^2 - y^2) + 2xyi.

The Mandelbrot set
For which points c in the complex plane does the iteration z_{k+1} = z_k^2 + c remain bounded? Here c is a complex number and z_0 = 0. When c = 0, all points stay within the unit disk: |z| <= 1. If |z| >= 2, the iteration is guaranteed to diverge to infinity. Plot the rate at which points in a given region diverge: stop the iterations when |z_{k+1}| >= 2 or when k reaches some limit, and plot k at each position.
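
A sketch (not the course's code) of the iteration-count kernel just described; the image dimensions, the coordinate mapping to the region [-2,1] x [-1.5,1.5], and the maxiter limit are assumptions:

#include <omp.h>

void mandelbrot(int *counts, int width, int height, int maxiter) {
    #pragma omp parallel for schedule(dynamic)        // rows take unequal time
    for (int row = 0; row < height; row++) {
        for (int col = 0; col < width; col++) {
            double cr = -2.0 + 3.0 * col / width;     // real part of c
            double ci = -1.5 + 3.0 * row / height;    // imaginary part of c
            double zr = 0.0, zi = 0.0;
            int k = 0;
            while (k < maxiter && zr*zr + zi*zi < 4.0) {  // |z| < 2
                double tmp = zr*zr - zi*zi + cr;          // z = z^2 + c
                zi = 2.0*zr*zi + ci;
                zr = tmp;
                k++;
            }
            counts[row*width + col] = k;              // plot k at each position
        }
    }
}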

A load balancing problem
Some points iterate longer than others. If we use a uniform static decomposition, some processors finish later than others: we have a load imbalance.
[Bar chart: work per processor (1 through 8); vertical axis 0 to 8000.]

Dynamic scheduling
OpenMP provides a dynamic scheduling option: processors schedule themselves, sampling a shared counter or work queue to obtain work. The user tunes the work granularity ("chunk size") to trade off the overhead of sampling the queue against increased load imbalance.
[Plot: running time vs. increasing granularity; high overheads at fine granularity, load imbalance at coarse granularity.]
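
A sketch of the clause that expresses this in OpenMP; the chunk size of 10 and the work() routine are illustrative, not values from the lecture:

extern void work(int i);

void run_dynamic(int n) {
    #pragma omp parallel for schedule(dynamic, 10)  // each thread grabs 10 iterations at a time
    for (int i = 0; i < n; i++)
        work(i);       // iterations may take widely varying amounts of time
}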

Granularity tradeoff
We need to choose an appropriate task granularity. The finest granularity: each point is a separate task. The coarsest granularity: one block per processor. What is the tradeoff?

Simulations
Let's simulate the workload distribution. Assumptions: work distribution is instantaneous (zero cost), and the running time is proportional to k. Vary the chunk size b. Optimal running time on 16 processors: 3044.

Limits to performance
What happened when the chunk size was too large? Consider P = 8. The running times on the processors:
Processor:     1     2     3     4     5     6     7     8
Running time:  4771  5855  6018  6101  7129  6390  6470  5964
The optimal running time is 6088.
[Bar chart of the per-processor running times above.]

How to specify dynamic scheduling

#pragma omp parallel private(i,j) shared(unew, u, n)
{
    #pragma omp for schedule(dynamic)
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++) {
            unew[i][j] = Work(u[i][j]);
        }
}

setenv OMP_DYNAMIC 1

How does self scheduling work?
Each thread repeatedly grabs a chunk of iterations from a shared counter:

while (getNextChunk(&mymin, &mymax))
    for (int i = mymin; i < mymax; i++)
        work(i);

SelfScheduler S(n, P, CSize);

bool getNextChunk(int *mymin, int *mymax) {
    int k;
    #pragma omp critical
    {
        k = S.counter;                // claim the next chunk ...
        S.counter += S.chunkSize;     // ... and advance the shared counter
    }
    if (k >= S.n) return false;       // no work left
    *mymin = k;
    *mymax = k + S.chunkSize;
    return true;
}

Guided self-scheduling
Observation: we need many more pieces of work than there are processors (say 10 times as many) at the time we sample the counter. Adjust, or "guide", the task granularity in proportion to the amount of remaining work, for example g = N_remain / (10 P). When there's lots of work to do, we hand out large chunks; toward the end, we hand out smaller chunks.
See Polychronopoulos, C. D., and Kuck, D. J., "Guided self-scheduling: a practical scheduling scheme for parallel supercomputers," IEEE Trans. on Computers, Vol. C-36, No. 12, Dec. 1987, pp. 1425-1439.
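
OpenMP exposes a policy of this kind through schedule(guided): chunk sizes start large and shrink toward a minimum as the remaining work decreases. A sketch, where the minimum chunk size of 4 and the work() routine are illustrative assumptions:

extern void work(int i);

void run_guided(int n) {
    #pragma omp parallel for schedule(guided, 4)  // chunks shrink as work runs out
    for (int i = 0; i < n; i++)
        work(i);
}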