Parallel Computing. Lecture 17: OpenMP Last Touch

CSCI-UA.0480-003 Parallel Computing Lecture 17: OpenMP Last Touch Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Some slides here are adapted from: Yun (Helen) He and Chris Ding (Lawrence Berkeley National Laboratory) and Yao-Yuan Chuang.

Performance. It is easy to write an OpenMP program but hard to write an efficient one! The 5 main causes of poor performance are: sequential code, communication, load imbalance, synchronisation, and compiler (non-)optimisation.

Sequential code. Amdahl's law limits performance, so we need to find ways of parallelising the sequential parts. In OpenMP, all code outside of parallel regions, and code inside MASTER, SINGLE and CRITICAL directives, is sequential. This code should be as small as possible.
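
To make this concrete, here is a minimal sketch (not from the lecture) in which the sequential part is confined to a small SINGLE block while the bulk of the work stays in a worksharing loop; the array size N is an arbitrary placeholder.

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    static double a[N];

    int main(void) {
        double total = 0.0;

        #pragma omp parallel
        {
            /* sequential part: executed by one thread while the others wait
               at the implicit barrier of SINGLE -- keep it as small as possible */
            #pragma omp single
            printf("Running with %d threads\n", omp_get_num_threads());

            /* parallel part: the bulk of the work is shared among all threads */
            #pragma omp for reduction(+:total)
            for (int i = 0; i < N; i++) {
                a[i] = 0.5 * i;
                total += a[i];
            }
        }
        printf("total = %e\n", total);
        return 0;
    }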

Communication. On shared-memory machines, communication shows up as increased memory access costs: it takes longer to access data in main memory or in another processor's cache than in the local cache. Memory accesses are expensive! Unlike message passing, this communication is spread throughout the program, which makes it much harder to analyse and monitor.

Caches and coherency. Shared-memory programming assumes that a shared variable has a unique value at a given time. Caching means that multiple copies of a memory location may exist in the hardware. To avoid two processors caching different values of the same memory location, the caches must be kept coherent. Coherency operations are usually performed on cache lines in the level of cache closest to the shared (inclusive) cache or memory.

Coherency. A highly simplified view: each cache line can be in one of 3 states: Exclusive (the only valid copy in any cache), Read-only (a valid copy, but other caches may also contain it), or Invalid (out of date and unusable). A READ on an invalid or absent cache line is cached as read-only. A READ on an exclusive cache line changes it to read-only. A WRITE on a line not in the exclusive state marks all other copies invalid and the written line exclusive.

False sharing. Cache lines consist of several words of data. What happens when two processors write to different words on the same cache line? Each write invalidates the other processor's copy, causing lots of remote memory accesses. Symptoms: poor speedup; high, non-deterministic numbers of cache misses; mild, non-deterministic, unexpected load imbalance.
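
The classic illustration (a sketch, not from the slides) is an array of per-thread partial sums: adjacent elements sit on the same cache line, so every update by one thread invalidates the line in the other threads' caches. The sizes, MAX_THREADS and the PAD factor are placeholders.

    #include <stdio.h>
    #include <omp.h>

    #define N (1 << 24)
    #define MAX_THREADS 64
    #define PAD 8                 /* 8 doubles = 64 bytes, a typical cache line */

    static double sum_bad[MAX_THREADS];        /* neighbouring elements share a cache line */
    static double sum_pad[MAX_THREADS][PAD];   /* padding gives each thread its own line */

    int main(void) {
        #pragma omp parallel
        {
            int tid = omp_get_thread_num();

            /* false sharing: every update may invalidate the line in other caches */
            #pragma omp for
            for (int i = 0; i < N; i++)
                sum_bad[tid] += 1.0 / (i + 1.0);

            /* padded version: each thread updates a separate cache line */
            #pragma omp for
            for (int i = 0; i < N; i++)
                sum_pad[tid][0] += 1.0 / (i + 1.0);
        }
        printf("%f %f\n", sum_bad[0], sum_pad[0][0]);
        return 0;
    }

In practice a reduction clause (or a thread-private scalar) avoids the shared array, and hence the false sharing, altogether.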

Matrix-vector multiplication: run-times and efficiencies of matrix-vector multiplication (times are in seconds). [Table omitted; copyright 2010, Elsevier Inc. All rights reserved.]

Data affinity. Data is cached on the processors which access it, so cached data must be reused as much as possible. Write code with good data affinity: ensure the same thread accesses the same subset of program data as much as possible, and try to make these subsets large, contiguous chunks of data; this avoids false sharing and other problems. The manner in which memory is accessed by the individual threads has a major influence on performance. If each thread consistently accesses a distinct portion of the data throughout the program, the threads will probably make excellent use of memory, including good use of thread-local cache. Performance can often be increased by raising the fraction of data references that can be handled in cache.
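
A minimal sketch of this idea (assuming a matrix size N chosen for illustration): giving both loops the same static schedule means thread t initialises and later reuses the same rows, which also gives first-touch placement on NUMA systems.

    #include <stdio.h>

    #define N 2048

    static double a[N][N], b[N][N];

    int main(void) {
        #pragma omp parallel
        {
            /* same static schedule in both loops: thread t touches the same rows */
            #pragma omp for schedule(static)
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    a[i][j] = i + j;

            #pragma omp for schedule(static)
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    b[i][j] = 2.0 * a[i][j];   /* reuses rows already in thread t's cache */
        }
        printf("b[0][0] = %f\n", b[0][0]);
        return 0;
    }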

Load imbalance. Load imbalance can arise from both communication and computation. It is worth experimenting with different scheduling options; the runtime schedule clause is handy here. If none of them are appropriate, it may be best to do your own scheduling!
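
A sketch of the runtime clause in action (work() is a deliberately unbalanced placeholder): the schedule is chosen at run time through the OMP_SCHEDULE environment variable, e.g. OMP_SCHEDULE="static", OMP_SCHEDULE="dynamic,16" or OMP_SCHEDULE="guided", without recompiling.

    #include <stdio.h>
    #include <omp.h>

    /* placeholder: iteration cost grows with i, creating load imbalance */
    double work(int i) {
        double s = 0.0;
        for (int k = 0; k < i; k++)
            s += 1.0 / (k + 1.0);
        return s;
    }

    int main(void) {
        double total = 0.0;
        double t0 = omp_get_wtime();

        /* schedule chosen via OMP_SCHEDULE at run time */
        #pragma omp parallel for schedule(runtime) reduction(+:total)
        for (int i = 0; i < 20000; i++)
            total += work(i);

        printf("total = %f, time = %f s\n", total, omp_get_wtime() - t0);
        return 0;
    }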

Synchronisation. Barriers can be very expensive. Avoid them through careful use of the NOWAIT clause: a recommended strategy is to first ensure that the OpenMP program works correctly, and then carefully insert nowait clauses at points in the program where the implicit barrier is not actually needed. Parallelise at the outermost level possible; this may require re-ordering of loops or indices. The choice of CRITICAL / ATOMIC / lock routines may also impact performance.
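
A minimal sketch of nowait (arrays and sizes are placeholders): because a[] and b[] are independent, threads need not wait at the end of the first loop before starting the second; the barrier at the end of the parallel region still guarantees both loops are complete before the results are used.

    #include <stdio.h>

    #define N 1000000

    static double a[N], b[N];

    int main(void) {
        #pragma omp parallel
        {
            #pragma omp for nowait          /* no barrier after this loop */
            for (int i = 0; i < N; i++)
                a[i] = 2.0 * i;

            #pragma omp for                 /* independent of a[], safe to start early */
            for (int i = 0; i < N; i++)
                b[i] = 0.5 * i;
        }                                   /* implicit barrier of the parallel region */
        printf("%f %f\n", a[N-1], b[N-1]);
        return 0;
    }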

Compiler (non-)optimisation. Sometimes the addition of parallel directives can inhibit the compiler from performing sequential optimisations. Symptoms: 1-thread parallel code has a longer execution time and a higher instruction count than the sequential code. This can sometimes be cured by making shared data private, or local to a routine.

Performance tuning. My code is giving me poor speedup. I don't know why. What do I do now? Option A: declare this machine/language a heap of junk, give up, and go back to your laptop. Option B: try to classify and localise the sources of overhead (what type of problem is it, and where in the code does it occur?), fix the problems responsible for the largest overheads first, and iterate.

Performance Tuning: Timing the OpenMP Performance. A standard practice is to use an operating system command, for example: $ time ./a.out The real, user, and system times are printed after the program has finished, for example: real 5.4, user 3.2, sys 1.0. Here 'real' is the elapsed (wall-clock) time, while 'user' plus 'sys' gives the CPU time. These three numbers give initial information about the performance.

Performance Tuning: Timing the OpenMP Performance. A common cause for the wall-clock time (5.4 seconds here) exceeding the CPU time is that the program did not get a full processor, for example because of too high a load on the system. If sufficient processors are available (i.e., not being used by other users), the elapsed time of a multithreaded program should be less than its accumulated CPU time. The omp_get_wtime() function provided by OpenMP is useful for measuring the elapsed time of blocks of source code.
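
A small sketch of timing a block with omp_get_wtime() (the loop and array size are placeholders): take the wall-clock time before and after the region of interest and report the difference.

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define N 10000000

    int main(void) {
        double *a = malloc(N * sizeof(double));
        if (a == NULL) return 1;

        double t0 = omp_get_wtime();          /* wall-clock time before the block */

        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = (double) i / (i + 1.0);

        double elapsed = omp_get_wtime() - t0;
        printf("loop: %.3f s (timer resolution %.2e s)\n", elapsed, omp_get_wtick());

        free(a);
        return 0;
    }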

Performance Tuning: OpenMP Translation Overheads. The overheads incurred by creating parallel regions, sharing work between threads, and all kinds of synchronisation are a result of the OpenMP implementation used. The efficiency of a particular OpenMP implementation depends on its compiler and runtime library, the target platform, etc. The EPCC microbenchmarks were created to help programmers estimate the relative cost of using different OpenMP constructs; they measure the overheads of synchronisation, loop scheduling and array operations in the OpenMP runtime library.

Performance Tuning: Avoid Parallel Regions in Inner Loops. Another common technique to improve performance is to move parallel regions out of the innermost loops; otherwise we repeatedly incur the overhead of the parallel construct. By moving the parallel construct outside of the loop nest, that overhead is paid only once.
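
A sketch of this hoisting (update() is a hypothetical per-element kernel; N and STEPS are placeholders): the first version creates a parallel region every timestep, the second creates it once and only the worksharing construct repeats.

    #include <stdio.h>

    #define N     100000
    #define STEPS 1000

    static double x[N];

    /* placeholder per-element work */
    static double update(double v, int t) { return 0.5 * v + t * 1e-6; }

    int main(void) {
        /* costly: a parallel region is created and destroyed in every timestep */
        for (int t = 0; t < STEPS; t++) {
            #pragma omp parallel for
            for (int i = 0; i < N; i++)
                x[i] = update(x[i], t);
        }

        /* cheaper: one enclosing parallel region; only the worksharing
           loop (and its implicit barrier) repeats per timestep */
        #pragma omp parallel
        for (int t = 0; t < STEPS; t++) {
            #pragma omp for
            for (int i = 0; i < N; i++)
                x[i] = update(x[i], t);
        }

        printf("x[0] = %e\n", x[0]);
        return 0;
    }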

Performance Tuning: Overlapping Computation and I/O. This helps avoid having all but one processor wait while the I/O is handled. A general rule for MIMD parallelism is to overlap computation and communication so that the total time taken is less than the sum of the times to do each separately. However, this guideline is not always achievable.
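
One possible sketch of such an overlap in OpenMP (write_block() and compute() are placeholders, and prev/next are separate buffers so the write and the loop touch different data): one thread writes the previously computed block to disk while the remaining threads compute the next block.

    #include <stdio.h>

    #define N 1000000

    static double prev[N], next[N];

    static void write_block(const double *buf, FILE *fp) {
        fwrite(buf, sizeof(double), N, fp);       /* the I/O to be overlapped */
    }

    static double compute(int i) { return 1.0 / (i + 1.0); }

    int main(void) {
        FILE *fp = fopen("out.bin", "wb");
        if (fp == NULL) return 1;

        #pragma omp parallel
        {
            #pragma omp single nowait        /* one thread does the I/O ...      */
            write_block(prev, fp);

            #pragma omp for                  /* ... while the others compute     */
            for (int i = 0; i < N; i++)
                next[i] = compute(i);
        }                                    /* implicit barrier: both are done  */

        fclose(fp);
        return 0;
    }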

Hybrid OpenMP and MPI

MPI vs. OpenMP
Pure MPI pros: portable to distributed- and shared-memory machines; scales beyond one node; no race-condition problem.
Pure MPI cons: difficult to develop and debug; high latency, low bandwidth; explicit communication; large granularity; difficult load balancing.
Pure OpenMP pros: easy to implement parallelism; low latency, high bandwidth; implicit communication; coarse and fine granularity; dynamic load balancing.
Pure OpenMP cons: only on shared-memory machines; scales only within one node; possible race-condition problems; no specific thread order.

Why Hybrid? The hybrid MPI/OpenMP paradigm is the software trend for clusters of SMP nodes and for supercomputers (although GPUs are usually also used there). It is elegant in concept and architecture: use MPI across nodes and OpenMP within a node. It makes good use of the shared-memory system resources (memory, latency, and bandwidth) and avoids the extra communication overhead of using MPI within a node. OpenMP adds fine granularity and allows increased and/or dynamic load balancing. Some problems naturally have two-level parallelism. The hybrid approach can have better scalability than both pure MPI and pure OpenMP.

Why Is Mixed OpenMP/MPI Code Sometimes Slower? OpenMP has less scalability because its parallelism is implicit, whereas MPI parallelism is explicit. Thread creation overhead and cache coherence traffic add cost. Some problems naturally have only one level of parallelism. Pure OpenMP code can perform worse than pure MPI within a node. There may also be a lack of optimised OpenMP compilers and libraries.

Hybrid Parallelization Strategies. From sequential code: decompose with MPI first, then add OpenMP. From OpenMP code: treat it as serial code when adding the MPI decomposition. From MPI code: add OpenMP. The simplest and least error-prone way is to make MPI calls outside parallel regions and allow only the master thread to communicate between MPI tasks.
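
A sketch of that "least error-prone" pattern (the array size and the reduction are illustrative placeholders): request MPI_THREAD_FUNNELED and let only the master thread, outside any parallel region, make MPI calls, while the OpenMP threads do only local computation.

    #include <stdio.h>
    #include <mpi.h>
    #include <omp.h>

    #define N 1000000

    int main(int argc, char *argv[]) {
        int provided, rank, nproc;
        static double local[N];
        double local_sum = 0.0, global_sum = 0.0;

        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nproc);

        /* OpenMP threads: local computation only, no MPI calls in here */
        #pragma omp parallel for reduction(+:local_sum)
        for (int i = 0; i < N; i++) {
            local[i] = (double)(rank * N + i);
            local_sum += local[i];
        }

        /* MPI: called by the master thread, outside the parallel region */
        MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0) printf("global sum = %e\n", global_sum);
        MPI_Finalize();
        return 0;
    }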

Hybrid MPI + OpenMP programming: each MPI process spawns multiple OpenMP threads. [Diagram: mpirun launches MPI processes rank 0 and rank 1; each rank spawns its own team of OpenMP threads.]

Multi-dimensional array transpose: pure MPI vs. pure OpenMP within one node. OpenMP vs. MPI (16 CPUs): for a 64x512x128 array, 2.76 times faster; for 16x1024x256, 1.99 times faster.

Pure MPI and Hybrid MPI/OpenMP Across Nodes. With 128 CPUs, the hybrid MPI/OpenMP version with n_thrds=4 runs faster than the n_thrds=16 hybrid by a factor of 1.59, and faster than pure MPI by a factor of 4.44.

omphello.c

#include <stdio.h>
#include <omp.h>

int main(int argc, char *argv[]) {
    int iam = 0, np = 1;

    #pragma omp parallel default(shared) private(iam, np)
    {
#if defined (_OPENMP)
        np = omp_get_num_threads();
        iam = omp_get_thread_num();
#endif
        printf("Hello from thread %d out of %d\n", iam, np);
    }
    return 0;
}

mpihello.c

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int numprocs, rank, namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name, &namelen);

    printf("Process %d on %s out of %d\n", rank, processor_name, numprocs);

    MPI_Finalize();
    return 0;
}

Hybrid MPI + OpenMP: mixhello.c

#include <stdio.h>
#include "mpi.h"
#include <omp.h>

int main(int argc, char *argv[]) {
    int numprocs, rank, namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int iam = 0, np = 1;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name, &namelen);

    #pragma omp parallel default(shared) private(iam, np)
    {
        np = omp_get_num_threads();
        iam = omp_get_thread_num();
        printf("Hello from thread %d out of %d from process %d out of %d on %s\n",
               iam, np, rank, numprocs, processor_name);
    }

    MPI_Finalize();
    return 0;
}

Compile and Execution
$ mpicc -fopenmp mixhello.c
$ mpiexec -n X ./a.out     (X = number of MPI processes)

Hybrid MPI + OpenMP: Calculate π. Each MPI process integrates over a range of width 1/nproc, as a discrete sum of nbin bins, each of width step. Within each MPI process, nthreads OpenMP threads perform their part of the sum, just as in the OpenMP-only version.

omppi.c

#include <omp.h>

static long num_steps = 100000;
double step;
#define NUM_THREADS 2

int main() {
    int i;
    double x, pi, sum = 0.0;

    step = 1.0 / (double) num_steps;
    omp_set_num_threads(NUM_THREADS);

    #pragma omp parallel for reduction(+:sum) private(x)
    for (i = 1; i <= num_steps; i++) {
        x = (i - 0.5) * step;
        sum = sum + 4.0 / (1.0 + x * x);
    }
    pi = step * sum;
    return 0;
}

mpipi.c

#include <mpi.h>

static long num_steps = 100000;

int main(int argc, char *argv[]) {
    int i, my_id, numprocs;
    long my_steps;
    double x, pi, step, sum = 0.0;

    step = 1.0 / (double) num_steps;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

    my_steps = num_steps / numprocs;
    for (i = my_id * my_steps; i < (my_id + 1) * my_steps; i++) {
        x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }
    sum *= step;

    MPI_Reduce(&sum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}

mixpi.c  (get the MPI part done first, then add the OpenMP pragma)

#include <mpi.h>
#include <omp.h>

static long num_steps = 100000;

int main(int argc, char *argv[]) {
    int i, my_id, numprocs;
    long my_steps;
    double x, pi, step, sum = 0.0;

    step = 1.0 / (double) num_steps;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

    my_steps = num_steps / numprocs;

    #pragma omp parallel for private(x) reduction(+:sum)
    for (i = my_id * my_steps; i < (my_id + 1) * my_steps; i++) {
        x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }
    sum *= step;

    MPI_Reduce(&sum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}

Hybrid MPI + OpenMP: Calculate π

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

#define NBIN 100000
#define MAX_THREADS 8

int main(int argc, char **argv) {
    int nbin, myid, nproc, nthreads, tid;
    double step, sum[MAX_THREADS] = {0.0}, pi = 0.0, pig;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    nbin = NBIN / nproc;
    step = 1.0 / (nbin * nproc);

    #pragma omp parallel private(tid)
    {
        int i;
        double x;
        nthreads = omp_get_num_threads();
        tid = omp_get_thread_num();
        for (i = nbin * myid + tid; i < nbin * (myid + 1); i += nthreads) {
            x = (i + 0.5) * step;
            sum[tid] += 4.0 / (1.0 + x * x);
        }
        printf("rank tid sum = %d %d %e\n", myid, tid, sum[tid]);
    }

    for (tid = 0; tid < nthreads; tid++)
        pi += sum[tid] * step;

    MPI_Allreduce(&pi, &pig, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    if (myid == 0) printf("pi = %f\n", pig);
    MPI_Finalize();
    return 0;
}

Conclusions. Always keep in mind the 5 causes of poor performance. If you have a machine with several nodes and the problem at hand has two levels of parallelism, then consider hybrid OpenMP + MPI.