Cluster Computing: Performance and Debugging Issues in OpenMP


Topics
- Scalable speedup and data locality
- Parallelizing sequential programs
- Breaking data dependencies
- Avoiding synchronization overheads
- Achieving cache and page locality
- Debugging

Factors impacting performance
- performance of the single-threaded code
- percentage of the code that runs in parallel, and its scalability (see the note on Amdahl's bound below)
- CPU utilization, effective data sharing, data locality, and load balancing
- amount of synchronization and communication
- overhead to create, resume, manage, suspend, destroy, and synchronize threads
- memory conflicts due to shared memory or falsely shared memory
- performance limitations of shared resources, e.g., memory, bus bandwidth, CPU execution units

Scalable Speedup
- Most often, memory is the limit to the performance of a shared memory program.
- On scalable architectures, the latency and bandwidth of memory accesses depend on the locality of accesses.
- Data locality is therefore an essential element in achieving good speedup of a shared memory program.
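The role of the parallel fraction can be made concrete with Amdahl's bound (implied by the list above, though not stated on the slides): if a fraction p of the work runs in parallel on N threads, the best possible speedup is

    S(N) = 1 / ((1 - p) + p / N)

so with p = 0.9 the speedup can never exceed 10, no matter how many threads are used.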

What Determines Data Locality
- In a multi-node system, the initial data distribution determines on which node the memory is placed:
  - first-touch or round-robin system policies
  - data distribution directives
  - explicit page placement
- Work sharing, e.g., loop scheduling, determines which thread accesses which data.
- Cache friendliness determines how often main memory is accessed.

Cache Friendliness
For both serial loops and parallel loops:
- locality of references
  - spatial locality: use adjacent cache lines and all items in a cache line
  - temporal locality: reuse the same cache line; may employ techniques such as cache blocking
- low cache contention
  - avoid sharing of cache lines among different objects; may resort to array padding or increasing the rank of an array

Cache Friendliness (continued)
- Contention is an issue specific to parallel loops, e.g., false sharing of cache lines.
- Cache friendliness = high locality of references + low contention.

NUMA machines
- Memory hierarchies exist in single-CPU computers and Symmetric Multiprocessors (SMPs).
- Distributed shared memory (DSM) machines based on Non-Uniform Memory Architecture (NUMA) add levels to the hierarchy:
  - local memory has low latency
  - remote memory has high latency
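On first-touch systems, the data distribution can be influenced from OpenMP itself: if the initialization loop is parallelized with the same static schedule as the compute loop, each thread first touches, and therefore places locally, exactly the pages it will later use. A minimal sketch; the array names and sizes are chosen only for illustration:

    #include <stdio.h>
    #include <stdlib.h>
    #define N (1 << 24)

    int main(void) {
        double *a = malloc(N * sizeof(double));
        double *b = malloc(N * sizeof(double));

        // First-touch placement: same schedule as the compute loop below,
        // so each thread's pages end up on its own node.
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++) {
            a[i] = 0.0;
            b[i] = (double)i;
        }

        // Compute loop: each thread works on locally placed pages.
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++)
            a[i] = 2.0 * b[i];

        printf("a[N-1] = %g\n", a[N - 1]);
        free(a);
        free(b);
        return 0;
    }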

Origin2000 memory hierarchy

    Level                            Latency (cycles)
    register                         0
    primary cache                    2..3
    secondary cache                  8..10
    local main memory & TLB hit      75
    remote main memory & TLB hit     250
    main memory & TLB miss           2000
    page fault                       10^6

Page Level Locality
- An ideal application has full page locality: pages accessed by a processor are on the same node as the processor, and no page is accessed by more than one processor (no page sharing).
- Twofold benefit:
  - low memory latency
  - scalability of memory bandwidth

Performance Issues
- Load imbalance: idle threads do no useful work, so divide the work among threads as evenly as possible; threads should finish their parallel tasks at the same time.
- Synchronization may be necessary: minimize the time spent waiting for protected resources.

Load Imbalance
- Unequal work loads lead to idle threads and wasted time.
- (The slide's timeline figure shows the threads of a "#pragma omp parallel" / "#pragma omp for" loop alternating between Busy and Idle time.)
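When the iterations of a loop do unequal amounts of work, the schedule clause is the usual remedy. A minimal sketch, where the triangular loop and the chunk size are illustrative rather than taken from the slides:

    #include <stdio.h>
    #define N 2048

    double a[N][N], b[N];

    int main(void) {
        // Iteration i touches i+1 elements, so equal static chunks would
        // leave the threads holding the early rows idle at the end;
        // dynamic chunks of 16 rows keep all threads busy.
        #pragma omp parallel for schedule(dynamic, 16)
        for (int i = 0; i < N; i++) {
            double s = 0.0;
            for (int j = 0; j <= i; j++)
                s += a[i][j];
            b[i] = s;
        }
        printf("b[N-1] = %g\n", b[N - 1]);
        return 0;
    }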

Performance Tuning
- Profilers use sampling to provide performance data, but traditional profilers are of limited use for tuning OpenMP:
  - they measure CPU time, not wall clock time
  - they do not report contention for synchronization objects
  - they cannot report load imbalance
  - they are unaware of OpenMP constructs
- Programmers need profilers specifically designed for OpenMP.

Parallelizing Code (1)
- Optimize single-CPU performance:
  - maximize cache reuse
  - eliminate cache misses
- Parallelize as high a fraction of the work as possible:
  - preserve cache friendliness

Parallelizing Code (2)
- Avoid synchronization and scheduling overhead:
  - partition the code into few parallel regions
  - avoid reduction, single, and critical sections
  - make the code loop-fusion friendly
  - use static scheduling
- Partition the work to achieve load balancing.
- Check the correctness of the parallel code: run the OpenMP-compiled code first on one thread, then on several threads.

Synchronization
- Time is lost waiting for locks.
- (The slide's timeline figure shows the threads of a "#pragma omp parallel" region with a "#pragma omp critical" section dividing their time between Busy, Idle, and In Critical.)
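One common way to shrink the "In Critical" time is to do the per-iteration work outside the lock and protect only the shared update (where possible, an atomic update or a reduction is cheaper still, as the following slides show). A minimal sketch; work_item is a hypothetical stand-in for the real computation:

    #include <stdio.h>
    #include <math.h>

    // Hypothetical per-iteration work, not taken from the slides.
    static double work_item(int i) { return sqrt((double)i); }

    int main(void) {
        double total = 0.0;
        #pragma omp parallel for
        for (int i = 0; i < 100000; i++) {
            double local = work_item(i);  // computed outside the lock
            #pragma omp critical
            total += local;               // only the shared update is protected
        }
        printf("total = %g\n", total);
        return 0;
    }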

Synchronization Overhead
- Parallel regions, work sharing, and synchronization all incur overhead.
- The Edinburgh OpenMP Microbenchmarks, version 1.0, by J. Mark Bull, are used in the next slides to measure the cost of synchronization on a 32-processor Origin 2000 with 300 MHz R12000 processors; the benchmarks were compiled with the MIPSpro Fortran 90 compiler, version 7.3.1.1m.
- (Two slides of graphs follow, showing the measured synchronization overheads on the Origin 2000.)
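The flavor of such a measurement can be reproduced in a few lines of OpenMP. This rough sketch (not the EPCC benchmark code) times repeated barriers and ignores the loop overhead that the real benchmarks subtract out:

    #include <stdio.h>
    #include <omp.h>
    #define REPS 100000

    int main(void) {
        double t0 = omp_get_wtime();
        #pragma omp parallel
        {
            for (int r = 0; r < REPS; r++) {
                #pragma omp barrier
            }
        }
        double t1 = omp_get_wtime();
        printf("approx. cost per barrier: %.3f microseconds\n",
               (t1 - t0) / REPS * 1e6);
        return 0;
    }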

Insights
- cost(DO) ~ cost(barrier)
- cost(parallel DO) ~ 2 * cost(barrier)
- cost(parallel) > cost(parallel DO)
- atomic is less expensive than critical
- bad scalability for reduction
- mutual exclusion: critical, (un)lock, single

Overhead on 4-way Intel Xeon at 3.0 GHz (Intel compiler and runtime library)

    Construct            Cost (microsecs)
    parallel             1.5
    barrier              1.0
    schedule (static)    1.0
    schedule (guided)    6.0
    schedule (dynamic)   50
    ordered              0.5
    single               1.0
    reduction            2.5
    atomic               0.5
    critical             0.5
    lock/unlock
    Scalability: depends on datatype/hardware.

Overhead on Intel Quad Core Q6600 @ 2.40 GHz (dune), 4 threads (gcc compiler and gomp runtime library)

    Construct            Cost (microsecs)
    parallel             31.5
    barrier              21.1
    schedule (static)    29.9
    schedule (guided)    39.9
    schedule (dynamic)   361.1
    ordered              6.8
    single               23.6
    reduction            31.6
    atomic               0.62
    critical             3.2
    lock/unlock
    Scalability: depends on datatype/hardware.

Overhead on 2-processor Opteron 250 @ 2.40 GHz (strider), 2 threads (gcc compiler and gomp runtime library)

    Construct            Cost (microsecs)
    parallel             11.6
    barrier              6.7
    schedule (static)    19.9
    schedule (guided)    21.5
    schedule (dynamic)   44.3
    ordered              7.5
    single               6.0
    reduction            12.1
    atomic               0.14
    critical             4.6
    lock/unlock
    Scalability: depends on datatype/hardware.
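The atomic-versus-critical gap visible in the gcc/gomp numbers above is easy to observe directly. A minimal sketch (the iteration count is arbitrary; both counters must end up equal to ITERS):

    #include <stdio.h>
    #include <omp.h>
    #define ITERS 1000000

    int main(void) {
        long ca = 0, cc = 0;

        double t0 = omp_get_wtime();
        #pragma omp parallel for
        for (int i = 0; i < ITERS; i++) {
            #pragma omp atomic
            ca++;
        }
        double t1 = omp_get_wtime();
        #pragma omp parallel for
        for (int i = 0; i < ITERS; i++) {
            #pragma omp critical
            cc++;
        }
        double t2 = omp_get_wtime();

        printf("atomic: %.3f s, critical: %.3f s (counters %ld, %ld)\n",
               t1 - t0, t2 - t1, ca, cc);
        return 0;
    }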

Overhead on 2 x Dual-Core AMD Opteron 2220 @ 2.80 GHz (gandalf node13), 4 threads (gcc compiler and gomp runtime library)

    Construct            Cost (microsecs)
    parallel             27.3
    barrier              23.0
    schedule (static)    34.1
    schedule (guided)    24.9
    schedule (dynamic)   115.0
    ordered              4.8
    single               25.2
    reduction            27.2
    atomic               0.17
    critical             1.9
    lock/unlock
    Scalability: depends on datatype/hardware.

Overhead on 2 x Quad-Core AMD Opteron 2350 @ 2.0 GHz (gandalf node1), 8 threads (gcc compiler and gomp runtime library)

    Construct            Cost (microsecs)
    parallel             55.2
    barrier              78.8
    schedule (static)    83.5
    schedule (guided)    62.1
    schedule (dynamic)   327.8
    ordered              11.4
    single               84.3
    reduction            83.7
    atomic               0.36
    critical             5.1
    lock/unlock
    Scalability: depends on datatype/hardware.

Loop Parallelization
- Identify the loops that are the bottleneck to performance.
- Parallelize the loops and ensure that:
  - no data races are created
  - cache friendliness is preserved
  - page locality is achieved
  - synchronization and scheduling overheads are minimized

Hurdles to Loop Parallelization
- Data dependencies among iterations, caused by shared variables
- Input/output operations inside the loop
- Calls to thread-unsafe code, e.g., the intrinsic function rtc
- Branches out of the loop
- Insufficient work in the loop body (see the sketch below)
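For the last hurdle, insufficient work, OpenMP's if clause keeps the loop serial until the trip count makes parallel execution worthwhile. A minimal sketch; the arrays and the threshold are illustrative:

    #include <stdio.h>
    #define MAXN 100000

    double a[MAXN], b[MAXN], c[MAXN];

    void add(int n) {
        // The if clause keeps short loops serial, so the fork/join and
        // scheduling overhead is paid only when there is enough work.
        #pragma omp parallel for if(n > 10000)
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

    int main(void) {
        add(100);    // runs serially
        add(MAXN);   // runs in parallel
        printf("c[0] = %g\n", c[0]);
        return 0;
    }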

Data Races
- Parallelizing a loop with data dependencies causes data races: unordered or interfering accesses by multiple threads to shared variables, which make the values of these variables different from the values assumed in a serial execution.
- A program with data races produces unpredictable results, which depend on thread scheduling and speed.

Types of Data Dependencies
Reduction operations:

    const int n = 4096;
    int a[n], i, sum = 0;
    for (i = 0; i < n; i++)
        sum += a[i];

Easy to parallelize using reduction variables.
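To see why a reduction variable is needed, consider the naive parallelization first: with a bare parallel for, every thread updates the shared sum unsynchronized, so the result varies from run to run. A minimal sketch (the array is filled with ones so the correct answer, 4096, is obvious):

    #include <stdio.h>
    #define N 4096

    int main(void) {
        int a[N], i, sum = 0;
        for (i = 0; i < N; i++) a[i] = 1;

        // Data race: threads update the shared sum without synchronization,
        // so the printed value can change from run to run.
        #pragma omp parallel for
        for (i = 0; i < N; i++)
            sum += a[i];

        printf("sum = %d\n", sum);
        return 0;
    }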

Parallelizing the Recurrence Idea: Segregate even and odd indices #define N 16384 // Update even indices from odd int a[n], work[n+1]; #pragma omp parallel for for ( i = 0; i < N-1; i+=2) // Save border element work[n]= a[0]; a[i] = a[i+1]; // Save & shift even indices // Update odd indices with even #pragma omp parallel for #pragma omp parallel for for ( i = 2; i < N; i+=2) for ( i = 1; i < N-1; i+=2) work[i-1] = a[i]; a[i] = work[i]; // Set border element a[n-1] = work[n]; Performing Reduction The bad scalability of the reduction clause affects its usefulness, e.g., bad speedup when summing the elements of a matrix: #define N 1<<12 #define M 16 int i, j; double a[n][m], sum = 0.0; #pragma omp parallel for reduction(+:sum) for (i = 0; i < N; i++) for (j = 0; j < M; j++) sum += a[i][j]; 33 34 Parallelizing the Sum Sum and Product Speedup on SGI Idea: Use explicit partial sums and combine them atomically #define N 1<<12 #define M 16 int main() double a[n][m], sum = 0.0; int i, j = 0; // compute partial sum #pragma omp for nowait for (i = 0; i < N; i++) for (j = 0; j < M; i++) mysum += a[i][j]; #pragma omp parallel private(i,j) double mysum = 0.0; // initialization of a // not shown // each thread adds its // partial sum #pragma omp atomic sum += mysum; 35 36 9

Loop Fusion Recall that at the end of the parallel region, the threads are suspended and wait for the next parallel region, loop or section Suspend/resume operations lighter weight than create/terminate but still create overhead Loop Fusion fuses loops to increase the work in the loop body Better serial programs: fusion promotes software pipelining and reduces the frequency of branches Better OpenMP programs: fusion reduces synchronization and scheduling overhead Promoting Loop Fusion Loop fusion inhibited by statements between loops which may have dependencies with data accessed by the loops Promote fusion: reorder the code to get loops which are not separated by statements creating data dependencies Use one parallel do construct for several adjacent loops; may leave it to the compiler to actually perform fusion fewer parallel regions and work-sharing constructs 37 38 Fusion-friendly code Fusion-friendly code Unfriendly Friendly Unfriendly Friendly integer,parameter::n=4096 real :: sum, a(n) do i=1,n a(i) = sqrt(dble(i*i+1)) enddo sum = 0.d0 do i=1,n sum = sum + a(i) enddo integer,parameter::n=4096 real :: sum, a(n) sum = 0.d0 do i=1,n a(i) = sqrt(dble(i*i+1)) enddo do i=1,n sum = sum + a(i) enddo int n=4096; double sum, a[4096]; for (i=0;i<n; i++) a[i] = sqrt(double(i*i+1)); sum = 0.d0; for (i=0;i<n; i++) sum = sum + a[i]; int n=4096; double sum, a[4096]; sum = 0.d0; for (i=0;i<n; i++) a[i] = sqrt(double(i*i+1)); for (i=0;i<n; i++) sum = sum + a[i]; 39 40 10

Tradeoffs in Parallelization To increase parallel fraction of work when parallelizing loops, it is best to parallelize the outermost loop of a nested loop However, doing so may require loop transformations such as loop interchanges, which can destroy cache friendliness, e.g., defeat cache blocking Static loop scheduling in large chunks per thread promotes cache and page locality but may not achieve load balancing Dynamic and interleaved scheduling achieve good load balancing but cause poor locality of data references Tuning the Parallel Code Examine resource usage, e.g., execution time, number of floating point operations, primary, secondary, and TLB cache misses and identify the performance bottleneck the routines generating the bottleneck Correct the performance problem and verify the desired speedup. 41 42 The Future of OpenMP Debugging OpenMP programs Data placement directives will become part of OpenMP affinity scheduling may be a useful feature It is desirable to add parallel input/output to OpenMP Java binding of OpenMP Standard debuggers do not normally handle OpenMP approach : 1. use binary search to try to narrow down where the problem is by disabling OpenMP pragmas 2. Compile with fopenmp_stubs if available this lets one run a serial version. If the bug persists it is in the serial code so debug as a serial program 3. Compile with fopenmp and OMP_NUM_THREADS=1. If it still fails debug in single threaded mode. 4. Identify the errors with the lowest optimization possible 43 44 11

References
- Introduction to OpenMP, Lawrence Livermore National Laboratory: www.llnl.gov/computing/tutorials/workshops/workshop/openmp/main.html
- Ohio Supercomputing Center: oscinfo.osc.edu/training/openmp/big
- Minnesota Supercomputing Institute: www.msi.umn.edu/tutorials/shared_tutorials/openmp

References: OpenMP Benchmarks
- Edinburgh OpenMP Microbenchmarks: www.epcc.ed.ac.uk/research/openmpbench

SPMD Example
A single parallel region, no scheduling needed; each thread explicitly determines its work.

    program mat_init
      use omp_lib
      implicit none
      integer, parameter :: n = 1024
      real A(n,n)
      integer :: iam, np
      iam = 0
      np  = 1
    !$omp parallel private(iam,np)
      np  = omp_get_num_threads()
      iam = omp_get_thread_num()
      ! Each thread calls work
      call work(n, A, iam, np)
    !$omp end parallel
    end

    subroutine work(n, A, iam, np)
      integer n, iam, np
      real A(n,n)
      integer :: chunk, low, high, i, j
      chunk = (n + np - 1)/np
      low   = 1 + iam*chunk
      high  = min(n, (iam+1)*chunk)
      do j = low, high
        do i = 1, n
          A(i,j) = 3.14 + &
                   sqrt(real(i*i*i + j*j + i*j*j))
        enddo
      enddo
      return
    end
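For comparison, the same SPMD pattern in C (a rough analogue of the Fortran example, not from the slides; each thread takes a block of rows rather than columns to match C's row-major layout):

    #include <stdio.h>
    #include <math.h>
    #include <omp.h>
    #define N 1024

    static float A[N][N];

    int main(void) {
        // One parallel region; each thread computes its own block explicitly.
        #pragma omp parallel
        {
            int np    = omp_get_num_threads();
            int iam   = omp_get_thread_num();
            int chunk = (N + np - 1) / np;
            int low   = iam * chunk;
            int high  = (iam + 1) * chunk < N ? (iam + 1) * chunk : N;
            for (int i = low; i < high; i++)
                for (int j = 0; j < N; j++) {
                    double di = i, dj = j;
                    A[i][j] = 3.14f + (float)sqrt(di*di*di + dj*dj + di*dj*dj);
                }
        }
        printf("A[0][%d] = %f\n", N - 1, A[0][N - 1]);
        return 0;
    }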

Pros and Cons of SPMD
- Pros:
  - potentially higher parallel fraction than with loop parallelism
  - the fewer parallel regions, the less overhead
- Cons:
  - more explicit synchronization is needed than for loop parallelization
  - does not promote incremental parallelization, and requires manually assigning data subsets to threads

Message passing vs. multithreading
- Process versus thread address space:
  - threads have a shared address space, but the thread stack holds thread-private data
  - processes have separate address spaces
- In message-passing multiprocessing, e.g., MPI, all data is explicitly communicated; no data is shared.
- In OpenMP, threads in a parallel region reference both private and shared data.
- Synchronization is either explicit or embedded in the communication.

Too Many Threads
- If there are more threads than processors, round-robin scheduling is used, and the scheduling overhead degrades performance.
- Sources of overhead:
  - saving and restoring registers: negligible
  - saving and restoring cache state: when threads run out of cache, they tend to flush other threads' cached data
  - thrashing virtual memory
  - convoying: threads wait on a lock held by a thread whose timeslice has expired
- Solution: limit the number of threads to
  - the number of hardware threads (cores or hyper-threaded cores), or
  - the number of caches

Which threads cause overhead
- Only runnable threads cause overhead; blocked threads do not.
- It helps to separate compute and I/O threads:
  - compute threads run most of the time; their number should correspond to the number of cores, and they may feed from task queues
  - I/O threads may be blocked most of the time and are not a significant factor in having too many threads
- Useful hints: let OpenMP choose the number of threads (see the sketch below); use a thread pool.
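A minimal sketch of the "limit the threads to the hardware" advice, using the processor count reported by the OpenMP runtime (most runtimes already default to something equivalent, so this mainly matters when the count has been overridden):

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        // Cap the thread count at the number of processors the runtime
        // reports, so threads are not time-sliced round-robin.
        int nprocs = omp_get_num_procs();
        omp_set_num_threads(nprocs);

        #pragma omp parallel
        {
            #pragma omp single
            printf("running %d threads on %d processors\n",
                   omp_get_num_threads(), nprocs);
        }
        return 0;
    }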