Parallel Numerical Algorithms http://sudalab.is.s.u-tokyo.ac.jp/~reiji/pna16/ [9] Shared Memory Performance

PNA16 Lecture Plan
General Topics: 1. Architecture and Performance, 2. Dependency, 3. Locality, 4. Scheduling
MIMD / Distributed Memory: 5. MPI: Message Passing Interface, 6. Collective Communication, 7. Distributed Data Structure
MIMD / Shared Memory: 8. OpenMP, 9. Performance
SIMD / Shared Memory: 10. GPU and CUDA, 11. SIMD Performance
Special Lectures: 5/30 How to use FX10 (Prof. Ohshima), 6/6 Dynamic Parallelism (Prof. Peri)

Memory models. Distributed memory: each processor has its own memory, and processors are connected by a network. Shared memory: either Uniform Memory Access (UMA), where all processors access one shared memory, or Non-Uniform Memory Access (NUMA), where the memory is partitioned among the processors. [Figure: processor/memory organization for each model]

OpenMP. A frequently used API for shared-memory parallel computing in high performance computing. FX10 supports OpenMP version 3.0. Shared memory gives a global view: the program describes the whole data structure and the whole computation.
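To make the global view concrete, here is a minimal sketch (mine, not from the slides) of an OpenMP 3.0-style loop; the array, its size, and the reduction are illustrative assumptions. Compile with your compiler's OpenMP flag (e.g. -fopenmp).

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    static double a[N];

    int main(void) {
        double sum = 0.0;

        /* Whole array, whole loop: OpenMP splits the iterations among threads. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++) {
            a[i] = 1.0 / (i + 1);
            sum += a[i];
        }

        printf("threads available: %d, sum = %f\n", omp_get_max_threads(), sum);
        return 0;
    }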

Weak Consistency. The compiler can reorder operations as long as the reordering does not change the meaning of the sequential execution, and the hardware can reorder operations under the same condition. Under weak consistency, the order of operations is guaranteed only at special commands. In OpenMP, flush is that special command; it is usually used implicitly, as part of parallel, barrier, atomic, etc.

The solution:

    int data;
    #pragma omp parallel
    {
        if (producer) {
            data = produce_data();
            #pragma omp barrier
        } else {  /* consumer */
            #pragma omp barrier
            consume_data(data);
        }
    }

Flush alone is not enough: the flush of the producer must come earlier than the flush of the consumer, and the barrier enforces that ordering.

Barrier should be inserted: before writing data, after writing data, before reading data, and after reading data. [Figure: phases in which one thread writes the data alternate with phases in which any thread reads it, with a barrier between every pair of adjacent phases.]

Performance Issues: mutual exclusion, synchronization, load imbalance, memory access congestion, and more.

Mutual exclusion. Atomic operation (#pragma omp atomic): the operation is performed in an inseparable way; limited to a small set of operations; may be done in hardware, so possibly very fast. Critical section (#pragma omp critical): any block of code can be declared a critical section; while one thread is inside a critical section, no other thread can enter a critical section; software implementation, so slower. Lock (omp_set_lock(), omp_unset_lock(), etc.): while one thread holds the lock, no other thread can acquire the lock; software implementation, so slower.
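A minimal sketch (mine, not from the slides) of the OpenMP lock routines; the shared_sum accumulator is an illustrative assumption. In this particular case #pragma omp atomic or a reduction clause would be faster, as the next slides recommend; locks mainly pay off when the protected region is larger than a single update.

    #include <omp.h>

    static omp_lock_t lock;
    static double shared_sum = 0.0;

    void accumulate(const double *x, int n) {
        omp_init_lock(&lock);
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            double v = x[i] * x[i];   /* local work needs no lock */
            omp_set_lock(&lock);      /* only the shared update is serialized */
            shared_sum += v;
            omp_unset_lock(&lock);
        }
        omp_destroy_lock(&lock);
    }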

Atomic and Critical Section

    #pragma omp atomic
    x += a;
    #pragma omp atomic
    y *= b;

With atomic, x += a and y *= b can be done in parallel, and may be hardware supported.

    #pragma omp critical
    x += a;
    #pragma omp critical
    y *= b;

With critical, x += a and y *= b cannot be done in parallel (unnamed critical sections all exclude one another), and the implementation is probably in software. Use atomic if applicable.

Synchronization. #pragma omp barrier waits until all threads reach the barrier. [Figure: threads reaching a barrier at different times] A barrier takes on the order of 1 μs, which amounts to many thousands of arithmetic operations, and load imbalance produces idle time at the barrier.

Load balancing = equal time? Load balancing assigns the same amount of computation to each thread. When threads do the same computation, do they consume the same time? Actually not, on shared-memory processors: arbitration at atomic operations and critical sections, collisions at memory accesses, OS tasks, etc. introduce variation, and there is more fluctuation if all cores are used. Sometimes dynamic load balancing is better than perfect static load balancing.

Loop Scheduling clauses. The schedule clause in omp for: #pragma omp for schedule(kind [, chunk_size]). Kinds of OpenMP loop scheduling: static; dynamic; guided; auto (schedule decided by the compiler or the system); runtime (taken from the environment variable OMP_SCHEDULE or set with the function omp_set_schedule(kind, modifier)).

Loop Scheduling. 1. static: round-robin assignment of fixed-size chunks (block-cyclic, in distributed-computing terms). 2. dynamic: dynamic assignment of fixed-size chunks; an idle thread asks for a new chunk to compute. 3. guided: dynamic assignment starting with a big chunk size and shrinking toward the given chunk size. [Figure: example assignments of successive chunks to threads 0-3 under each schedule]
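A small sketch (mine, not from the slides) of the schedule clause on a loop with imbalanced iterations; the triangular inner loop is an assumed example of work that grows with i.

    void triangular(double *a, int n) {
        /* Iteration i costs O(i), so a plain static schedule would leave the
           threads holding the early iterations idle; dynamic (or guided)
           chunks of 16 iterations keep all threads busy. */
        #pragma omp parallel for schedule(dynamic, 16)
        for (int i = 0; i < n; i++) {
            double s = 0.0;
            for (int j = 0; j <= i; j++)
                s += 1.0 / (double)(i + j + 1);
            a[i] = s;            /* each iteration writes only its own element */
        }
    }

With schedule(runtime) the same choice could instead be deferred to the OMP_SCHEDULE environment variable or omp_set_schedule().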

Performance tips: use atomic rather than critical (if possible); reduce synchronization; balance the loads and consider dynamic load balancing; choose the best-performing loop scheduling.

Memory models, shared memory: Uniform Memory Access (UMA) versus Non-Uniform Memory Access (NUMA). [Figure: UMA with one shared memory vs. NUMA with per-processor memories] In NUMA, memory access costs (latency and bandwidth) differ between a core's own memory and other cores' memory.

First touch principle. In whose memory is the data allocated? At allocation time (malloc etc.), the physical position is not determined. At the first access to the allocated memory (which must be a write), the data is placed in the memory of the accessing core. Placement is usually done in units of pages (4 KB etc.). [Figure: pages of one array spread over the per-core memories]

Affinity. Use (mostly) static scheduling, so that each thread always accesses the same (or similar) memory area. Stop the OS from moving threads among cores. Use the same scheduling for initialization:

    #pragma omp parallel for schedule(static, 512)
    for (i = 0; i < n; i++)
        a[i] = 0.0;

If possible, design the program so that each thread uses an amount of memory that is a multiple of the page size, and align the starting address to a page boundary.
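A sketch (mine, not from the slides; the arrays and the daxpy-like update are assumptions) of pairing the first-touch initialization with a compute loop that uses the identical static schedule, so each thread keeps working on the pages it touched first. Keeping threads pinned to cores (for example via OMP_PROC_BIND or a system-specific mechanism) is also assumed.

    void numa_friendly(double *a, double *b, int n) {
        /* First touch: the same schedule(static, 512) here and below means
           thread t initializes, and later computes on, the same chunks, so
           those pages end up in thread t's local memory. */
        #pragma omp parallel for schedule(static, 512)
        for (int i = 0; i < n; i++) {
            a[i] = 0.0;
            b[i] = 1.0;
        }

        #pragma omp parallel for schedule(static, 512)
        for (int i = 0; i < n; i++)
            a[i] += 2.0 * b[i];
    }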

Consistency. A cache should contain a copy of main memory, but the main-memory data may be overwritten by other threads. There are several (~10) algorithms for cache consistency; the corresponding line in the other caches must be updated, or at least invalidated. [Figure: two caches holding copies of the same line of main memory]

False sharing. Happens when private variables of different threads reside on the same cache line: an update of one variable invalidates the cache line in the other threads' caches. [Figure: two threads' private variables sharing one cache line]
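A sketch (mine, not from the slides) of the classic pattern: per-thread partial sums packed next to each other falsely share a line, and padding each slot to a full line (64 bytes assumed) avoids it. In real code a reduction clause would be simpler; the point is only how padding separates the threads' data.

    #include <omp.h>

    #define MAX_THREADS 64
    #define LINE 64                       /* assumed cache-line size in bytes */

    /* Padded so that each thread's slot occupies its own cache line. */
    struct padded_sum { double value; char pad[LINE - sizeof(double)]; };

    double sum_squares(const double *x, int n) {
        static struct padded_sum partial[MAX_THREADS];
        int nthreads = 1;

        #pragma omp parallel
        {
            int t = omp_get_thread_num();
            #pragma omp single
            nthreads = omp_get_num_threads();

            partial[t].value = 0.0;
            #pragma omp for
            for (int i = 0; i < n; i++)
                partial[t].value += x[i] * x[i];
            /* With a plain double partial[MAX_THREADS], neighboring slots would
               share a cache line and every update would invalidate the others. */
        }

        double total = 0.0;
        for (int t = 0; t < nthreads; t++)
            total += partial[t].value;
        return total;
    }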

Performance tips. Remote memory access and false sharing share one solution: locality! Collect the data used by each thread into one place (block distribution).

Locality! On shared memory systems. [Figure: CPUs with private caches ($) attached to shared memories; execution alternates between computation and memory access]

Locality! Do your computations in cache; exploit locality (remember the 3rd lecture). Computational intensity (operations per data element): high for matrix-matrix multiply, O(m^0.5) (O(m^1.5) operations on m data); middle for stencil, O(k); relatively low for FFT, O(log m); low for matrix-vector multiply and reduction, O(1). (m: data size, k: number of iterations)

Remember the cache. Key parameters: data size (capacity), line size, associativity. [Figure: CPU, cache, main memory] Accessed data is automatically stored in the cache, and old data is evicted when the lines it would occupy are full.

Padding. Replace the array a[n][m] with a[n][m+2]: the last 2 elements in each row are not used, but the extra elements change how rows map onto the cache, avoiding the conflict misses that occur when all rows map to the same cache sets (e.g. for power-of-two row lengths).
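A small sketch of the idea as I read this slide (row length 1024 and the pad of 2 are assumed values): without the pad, walking down a column can make every access land in the same cache sets, depending on the cache geometry.

    #define NROWS 1024
    #define NCOLS 1024
    #define PAD 2                      /* the slide's 2 unused elements per row */

    static double a[NROWS][NCOLS + PAD];   /* instead of double a[NROWS][NCOLS] */

    void column_sweep(void) {
        /* Walks down column j. With a power-of-two row length and no padding,
           consecutive a[i][j] would be 8192 bytes apart and could all map to
           the same cache set; the 16-byte pad per row breaks that pattern. */
        for (int j = 0; j < NCOLS; j++)
            for (int i = 0; i < NROWS; i++)
                a[i][j] *= 2.0;
    }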

Tiling. Matrix-matrix multiply C = C + A*B:

    for i = 1 to n
        for j = 1 to n
            for k = 1 to n
                c_ij = c_ij + a_ik * b_kj

Tiled (blocked) version, viewing A as a block matrix (A_11, A_12, ..., A_44), and similarly for B and C:

    for si = 1 to n step b_i
        for sj = 1 to n step b_j
            for sk = 1 to n step b_k
                for i = si to si + b_i
                    for j = sj to sj + b_j
                        for k = sk to sk + b_k
                            c_ij = c_ij + a_ik * b_kj

Each block of the outer loops computes C_IJ = C_IJ + A_IK * B_KJ. Choose the block sizes so that C_IJ, A_IK and B_KJ can be stored in the cache at once.
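A runnable C sketch of the tiled loop (my own; the block size 64 and the row-major layout are assumptions, and the tile edges are clamped so that n need not be a multiple of the block size).

    #define BLOCK 64   /* pick so that three BLOCK x BLOCK tiles fit in cache */

    static int imin(int x, int y) { return x < y ? x : y; }

    /* C = C + A * B for n x n row-major matrices. */
    void matmul_tiled(int n, const double *A, const double *B, double *C) {
        for (int si = 0; si < n; si += BLOCK)
            for (int sj = 0; sj < n; sj += BLOCK)
                for (int sk = 0; sk < n; sk += BLOCK)
                    for (int i = si; i < imin(si + BLOCK, n); i++)
                        for (int j = sj; j < imin(sj + BLOCK, n); j++) {
                            double cij = C[i * n + j];
                            for (int k = sk; k < imin(sk + BLOCK, n); k++)
                                cij += A[i * n + k] * B[k * n + j];
                            C[i * n + j] = cij;
                        }
    }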

Oblivious Algorithm. Matrix-matrix multiply, with A viewed as a 2x2 block matrix (A_11, A_12; A_21, A_22), and likewise B and C:

    MMM(A, B, C) {
        if (small enough)
            compute directly;
        else {
            divide A, B, C into four submatrices;
            for i = 1, 2
                for j = 1, 2
                    for k = 1, 2
                        MMM(A_ik, B_kj, C_ij);
        }
    }

Reformulated as divide-and-conquer, some level of the recursion fits into the cache.

Array dimension / loop exchange. Compare the four combinations of loop order and index order:

    for i = 0 to n-1
        for j = 0 to n-1
            a[i][j] = ...;

    for i = 0 to n-1
        for j = 0 to n-1
            a[j][i] = ...;

    for j = 0 to n-1
        for i = 0 to n-1
            a[j][i] = ...;

    for j = 0 to n-1
        for i = 0 to n-1
            a[i][j] = ...;

The versions whose innermost loop runs over the last array index (a[i][j] with j innermost, a[j][i] with i innermost) access consecutive addresses and have higher locality; the other two stride through memory and have lower locality.

Array of structures / structure of arrays.

Array of structures:

    typedef struct { double x, y, z; } point;
    point p[n];

Structure of arrays:

    struct { double x[n]; double y[n]; double z[n]; } p;

Case 1: increase x of all elements by 1. Case 2: compute the norm sqrt(x*x + y*y + z*z) for each element.
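A sketch of the two cases (my reading, not stated on the slide): Case 1 touches only x, so the structure-of-arrays layout streams through one contiguous array, while Case 2 uses x, y and z of each element together, which suits the array-of-structures layout.

    #include <math.h>

    #define N 1000000

    typedef struct { double x, y, z; } point;
    static point aos[N];                               /* array of structures */
    static struct { double x[N], y[N], z[N]; } soa;    /* structure of arrays */

    void case1_soa(void) {            /* only x is touched: contiguous in SoA */
        for (int i = 0; i < N; i++)
            soa.x[i] += 1.0;
    }

    void case2_aos(double *norm) {    /* x, y, z of one element used together */
        for (int i = 0; i < N; i++)
            norm[i] = sqrt(aos[i].x * aos[i].x + aos[i].y * aos[i].y
                           + aos[i].z * aos[i].z);
    }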

Loop fusion / loop fission.

Two separate loops:

    for i = 0 to n-1
        compute1(i);
    for i = 0 to n-1
        compute2(i);

Fused loop:

    for i = 0 to n-1 {
        compute1(i);
        compute2(i);
    }

Fusion: if compute1(i) and compute2(i) access the same (or nearby) addresses, locality is improved; it may also remove a temporary array. Fission: reduces the working set size, which may then fit in the cache. Example (a candidate for fusion, since b is only a temporary between the two loops):

    for i = 0 to n-1
        b[i] = 2 * a[i];
    for i = 0 to n-1
        c[i] = sqrt(b[i]);

Tiled data / Space-filling curve Tiled data structure Space-filling curve (Z-curve)
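A small sketch (mine, not from the slides) of the Z-curve idea: the storage index of element (i, j) is formed by interleaving the bits of i and j, so elements that are close in 2D stay close in memory.

    #include <stdint.h>

    /* Interleave the low 16 bits of i and j: bit k of j goes to bit 2k and
       bit k of i goes to bit 2k+1 of the result (Morton / Z-order index). */
    static uint32_t z_index(uint32_t i, uint32_t j) {
        uint32_t z = 0;
        for (int k = 0; k < 16; k++) {
            z |= ((j >> k) & 1u) << (2 * k);
            z |= ((i >> k) & 1u) << (2 * k + 1);
        }
        return z;
    }

    /* Usage sketch: store a logically 2D array in Z order,
       accessing a_z[z_index(i, j)] instead of a[i][j]. */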

Debugging is hard! Debugging a shared-memory parallel program is harder than debugging a distributed-memory parallel program: unintentional data races happen, and wrong results appear non-deterministically. [Figure: alternating write and read phases separated by barriers, as before]

Hybrid Parallelization. Flat MPI model: one MPI rank per core, using only the distributed-memory model. Hybrid parallel programming: use both OpenMP and MPI. [Figure: a node filled with MPI ranks (flat MPI) vs. one MPI rank per node with OpenMP threads inside it (hybrid)]
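A minimal sketch (mine, not from the slides) of the hybrid style: one MPI rank per node with OpenMP threads inside it; the FUNNELED threading level is an assumption adequate when only the master thread calls MPI.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int provided, rank, nranks;
        /* FUNNELED: only the thread that called MPI_Init_thread makes MPI calls. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        double local = 0.0;
        /* OpenMP threads share this rank's memory. */
        #pragma omp parallel for reduction(+:local)
        for (int i = rank; i < 1000000; i += nranks)
            local += 1.0 / (i + 1);

        double global = 0.0;
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("sum = %f (threads per rank: %d)\n", global, omp_get_max_threads());

        MPI_Finalize();
        return 0;
    }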

Pros and Cons. Pros of the flat MPI model: simpler programming, less to learn; sometimes faster than hybrid. Cons of the flat MPI model: partially duplicated memory allocations; contention among messages (the network is shared); too many MPI ranks on today's supercomputers. Pros of the hybrid model: less duplicated memory; message contention can be avoided; fewer MPI ranks. Cons of the hybrid model: must learn both MPI and OpenMP; sometimes not faster than flat MPI. The hybrid model is recommended for high degrees of parallelism.
