Parallel Programming
Jin-Soo Kim (jinsookim@skku.edu)
Computer Systems Laboratory
Sungkyunkwan University
http://csl.skku.edu

Challenges
- Difficult to write parallel programs
  - Most programmers think sequentially
  - Performance vs. correctness tradeoffs
  - Missing good parallel abstractions
- Automatic parallelization by compilers
  - Works with some applications: loop parallelism, reduction
  - Unclear how to apply it to other, more complex applications

Concurrent vs. Parallel
- Concurrent program
  - A program containing two or more processes (threads)
- Parallel program
  - A concurrent program in which each process (thread) executes on its own processor, and hence the processes (threads) execute in parallel
- Concurrent programming has been around for a while, so why do we bother?
  - Logical parallelism (GUIs, asynchronous events, ...) vs. physical parallelism (for performance)

Performance
- "Think Parallel or Perish"
- "The free lunch is over!"
- [Figure: CPU trend over time, from the GHz era to the multi-core era]

Parallel Prog. Models (1)
- Shared address space programming model
  - Single address space for all CPUs
  - Communication through regular load/store (implicit)
  - Synchronization using locks and barriers (explicit)
  - Ease of programming
  - Complex hardware for cache coherence
- Examples: thread APIs (Pthreads, ...), OpenMP, Intel TBB (Threading Building Blocks)
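As an added illustration of this model (not part of the original slides), the minimal Pthreads sketch below shows implicit communication through a shared variable and explicit synchronization with a mutex; the variable name counter and the thread count are arbitrary choices.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    static long counter = 0;                  /* shared: visible to all threads */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);        /* explicit synchronization */
            counter++;                        /* implicit communication via load/store */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        printf("counter = %ld\n", counter);   /* 4 * 100000 = 400000 */
        return 0;
    }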

Parallel Prog. Models (2)
- Message passing programming model
  - Private address space per CPU
  - Communication through message send/receive over a network interface (explicit)
  - Synchronization using blocking messages (implicit)
  - Need to program explicit communication
  - Simple hardware: no cache-coherence hardware required
- Examples: RPC (Remote Procedure Call), PVM (obsolete), MPI (de facto standard)

Parallel Prog. Models (3)
- Parallel programming models vs. parallel computer architectures:
  - Single address space
    - Shared-memory architectures: Pthreads, OpenMP
    - Distributed-memory architectures: (CC-)NUMA, Software DSM
  - Message passing
    - Shared-memory architectures: multiple processes, MPI
    - Distributed-memory architectures: MPI

Speedup (1)
- Amdahl's Law:

    Speedup = 1 / ((1 - P) + P/n)

  - (1 - P): the fraction of time spent executing the serial portion
  - n: the number of processor cores
- A theoretical basis by which the speedup of parallel computations can be estimated
- The theoretical upper limit of speedup is limited by the serial portion of the code

Speedup (2)
- Reducing the serial portion (20% -> 10%)
- Doubling the number of cores (8 -> 16)
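To make the comparison concrete, here is a worked example added to this transcript (the 80%-parallel starting point is an assumption for illustration; figures are rounded):

    P = 0.8, n = 8:   Speedup = 1 / (0.2 + 0.8/8)  = 1 / 0.3    ~ 3.3
    P = 0.8, n = 16:  Speedup = 1 / (0.2 + 0.8/16) = 1 / 0.25   = 4.0
    P = 0.9, n = 8:   Speedup = 1 / (0.1 + 0.9/8)  = 1 / 0.2125 ~ 4.7

Under these assumptions, halving the serial portion helps more than doubling the number of cores.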

Speedup (3)
- Speedup in reality:

    Practical Speedup = 1 / ((1 - P) + P/n + H(n))

  - H(n): overhead
    - Thread management & scheduling
    - Communication & synchronization
    - Load balancing
    - Extra computation
    - Operating system overhead
- For better speedup:
  - Parallelize more (increase P)
  - Parallelize effectively (reduce H(n))

Calculating Pi (1)
- Since tan(pi/4) = 1, we have atan(1) = pi/4
- Therefore:

    4 * integral from 0 to 1 of 1/(1 + x^2) dx = 4 * (atan(1) - atan(0)) = 4 * (pi/4 - 0) = pi

- [Figure: plots of y = atan(x) and y = 1/(1 + x^2) on the interval [0, 1]]
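The sequential code on the next slide approximates this integral with the midpoint rule; the following restatement is added here for clarity, where N is the step count used in the code:

    pi ~ (1/N) * sum over i = 0 .. N-1 of 4 / (1 + x_i^2),   where x_i = (i + 0.5) / N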

Calculating Pi (2)
- Pi: sequential version

    #define N    20000000
    #define STEP (1.0 / (double) N)

    double compute ()
    {
        int i;
        double x;
        double sum = 0.0;

        for (i = 0; i < N; i++) {
            x = (i + 0.5) * STEP;
            sum += 4.0 / (1.0 + x * x);
        }
        return sum * STEP;
    }

- [Figure: the unit interval divided into N strips of width 1/N; each strip contributes 4 / (1 + ((i + 0.5)/N)^2) to the sum]

Threads API (1)
- Threads are the natural unit of execution for parallel programs on shared-memory hardware
  - Windows threading API
  - POSIX threads (Pthreads)
- Threads have their own:
  - Execution context
  - Stack
- Threads share:
  - Address space (code, data, heap)
  - Resources (PID, open files, etc.)

Threads API (2)
- Pi: Pthreads version

    #include <pthread.h>

    #define N        20000000
    #define STEP     (1.0 / (double) N)
    #define NTHREADS 8

    double pi = 0.0;
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    void *compute (void *arg)
    {
        int i;
        double x;
        double sum = 0.0;
        int id = (int)(long) arg;

        /* Each thread handles the iterations i = id, id + NTHREADS, id + 2*NTHREADS, ... */
        for (i = id; i < N; i += NTHREADS) {
            x = (i + 0.5) * STEP;
            sum += 4.0 / (1.0 + x * x);
        }
        pthread_mutex_lock (&lock);
        pi += (sum * STEP);
        pthread_mutex_unlock (&lock);
        return NULL;
    }

    int main ()
    {
        int i;
        pthread_t tid[NTHREADS];

        for (i = 0; i < NTHREADS; i++)
            pthread_create (&tid[i], NULL, compute, (void *)(long) i);
        for (i = 0; i < NTHREADS; i++)
            pthread_join (tid[i], NULL);
        return 0;
    }
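(Added note: with GCC or Clang such a program is typically built with the pthread flag, e.g. cc -pthread pi_pthreads.c; the file name is only illustrative.)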

Threads API (3)
- Pros
  - Flexible and widely available
  - The thread library gives you detailed control over the threads
  - Performance can be good
- Cons
  - YOU have to take detailed control over the threads
  - Relatively low level of abstraction
  - No easy code migration path from a sequential program
  - Lack of structure makes it error-prone

OpenMP (1)
- What is OpenMP?
  - A set of compiler directives and library routines for portable parallel programming on shared-memory systems
  - Fortran 77, Fortran 90, C, and C++
  - Multi-vendor support, for both Unix and Windows
- Standardizes loop-level (data) parallelism
- Combines serial and parallel code in a single source
- Incremental approach to parallelization
- http://www.openmp.org, http://www.compunity.org
  - Standard documents, tutorials, sample codes

OpenMP (2)
- Fork-join model
  - The master thread spawns a team of threads as needed (fork)
  - Threads synchronize when leaving the parallel region (join)
  - Only the master thread executes the sequential parts

OpenMP (3)
- Parallel construct

    #pragma omp parallel
    {
        task();
    }
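Each thread in the team executes the enclosed block once. The short example below is added for illustration (it is not from the original slides) and prints each thread's ID using the OpenMP runtime routines:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        #pragma omp parallel
        {
            /* Every thread in the team runs this block once. */
            printf("hello from thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }
        return 0;
    }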

OpenMP (4)
- Work-sharing construct: sections

    #pragma omp parallel sections
    {
        #pragma omp section
        phase1();
        #pragma omp section
        phase2();
        #pragma omp section
        phase3();
    }

OpenMP (5)
- Work-sharing construct: loop

    #pragma omp parallel for
    for (i = 0; i < 12; i++)
        c[i] = a[i] + b[i];

- The same loop inside an explicit parallel region:

    #pragma omp parallel
    {
        task ();

        #pragma omp for
        for (i = 0; i < 12; i++)
            c[i] = a[i] + b[i];
    }

OpenMP (6)
- Pi: OpenMP version

    #ifdef _OPENMP
    #include <omp.h>
    #endif

    #define N    20000000
    #define STEP (1.0 / (double) N)

    double compute ()
    {
        int i;
        double x;
        double sum = 0.0;

        #pragma omp parallel for private(x) reduction(+:sum)
        for (i = 0; i < N; i++) {
            x = (i + 0.5) * STEP;
            sum += 4.0 / (1.0 + x * x);
        }
        return sum * STEP;
    }

    int main ()
    {
        double pi;
        pi = compute();
        return 0;
    }
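(Added note: the loop is parallelized only when OpenMP compilation is enabled, e.g. cc -fopenmp pi_omp.c with GCC; the file name is illustrative. Without that flag the pragma is ignored and the code runs serially, which is why the omp.h include is guarded by #ifdef _OPENMP.)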

OpenMP (7)
- Environment
  - Dual quad-core Xeon (8 cores), 2.4 GHz
  - Intel compiler with OpenMP
- [Figure: speedup of the Pi program (OpenMP) vs. number of cores (1-8) for N = 2e6, 2e7, and 2e8, compared against ideal speedup]

OpenMP (8)
- Pros
  - Simple and portable
  - Single source code for both serial and parallel versions
  - Incremental parallelization
- Cons
  - Primarily for bounded loops over built-in types
  - Only for Fortran-style C/C++ programs (what about pointer-chasing loops?)
  - Performance may be degraded due to:
    - Synchronization overhead
    - Load imbalance
    - Thread management overhead

MPI (1)
- Message Passing Interface
  - A library standard defined by a committee of vendors, implementers, and parallel programmers
  - Used to create parallel programs based on message passing
  - Available on almost all parallel machines, in C and Fortran
- Two phases
  - MPI-1: traditional message passing
  - MPI-2: remote memory, parallel I/O, dynamic processes
- Tasks are typically created when the job is launched
- http://www.mpi-forum.org

MPI (2)
- Point-to-point communication: sending and receiving messages
- Process 0 sends four integers from A[4]; process 1 receives them into B[4]:

    /* Process 0 */
    MPI_Send (A, 4, MPI_INT, 1, 1000, MPI_COMM_WORLD);

    /* Process 1 */
    MPI_Recv (B, 4, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
              MPI_COMM_WORLD, MPI_STATUS_IGNORE);
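A minimal, self-contained version of this exchange is added below for illustration (the original slide shows only the two calls); it must be run with at least two processes:

    #include <mpi.h>
    #include <stdio.h>

    int main (int argc, char *argv[])
    {
        int rank, A[4] = {1, 2, 3, 4}, B[4];

        MPI_Init (&argc, &argv);
        MPI_Comm_rank (MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* Send 4 ints to rank 1 with tag 1000. */
            MPI_Send (A, 4, MPI_INT, 1, 1000, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Receive 4 ints from any sender, with any tag. */
            MPI_Recv (B, 4, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                      MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf ("rank 1 received %d %d %d %d\n", B[0], B[1], B[2], B[3]);
        }

        MPI_Finalize ();
        return 0;
    }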

MPI (3)
- Collective communication
- Broadcast: the root's buffer is copied to every process in the communicator

    MPI_Bcast (array, 100, MPI_INT, 0, comm);

- Reduce: each process's value is combined (here, summed) into a single result at the root

    MPI_Reduce (&localsum, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, comm);

- [Figure: broadcast replicating array to all processes; reduce summing each process's localsum into sum on process 0]

MPI (4)
- Pi: MPI version

    #include "mpi.h"
    #include <stdio.h>

    #define N    20000000
    #define STEP (1.0 / (double) N)
    #define f(x) (4.0 / (1.0 + (x)*(x)))

    int main (int argc, char *argv[])
    {
        int i, numprocs, myid, start, end;
        double sum = 0.0, mypi, pi;

        MPI_Init (&argc, &argv);
        MPI_Comm_size (MPI_COMM_WORLD, &numprocs);
        MPI_Comm_rank (MPI_COMM_WORLD, &myid);

        /* Each rank handles a contiguous block of the N steps. */
        start = (N / numprocs) * myid + 1;
        end = (myid == numprocs - 1) ? N : (N / numprocs) * (myid + 1);

        for (i = start; i <= end; i++)
            sum += f(STEP * ((double) i - 0.5));
        mypi = STEP * sum;

        MPI_Reduce (&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (myid == 0)
            printf ("pi = %.15f\n", pi);
        MPI_Finalize ();
        return 0;
    }
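(Added note: such a program is typically built and launched with the MPI wrapper tools, e.g. mpicc pi_mpi.c followed by mpirun -np 8 ./a.out; the file and executable names are illustrative.)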

MPI (5)
- Pros
  - Runs on either shared- or distributed-memory architectures
  - Can be used on a wider range of problems
  - Distributed-memory computers are less expensive than large shared-memory computers
- Cons
  - Requires more programming changes to go from the serial to the parallel version
  - Can be harder to debug
  - Performance is limited by the communication network between the nodes

Parallel Benchmarks (1)
- Linpack
  - Matrix linear algebra
  - Basis for ranking the Top500 supercomputing sites (http://www.top500.org)
- SPECrate
  - Parallel runs of SPEC CPU programs
  - Job-level (or task-level) parallelism
- SPLASH
  - Stanford Parallel Applications for Shared Memory
  - Mix of kernels and applications
  - Strong scaling: keeps the problem size fixed

Parallel Benchmarks (2)
- NAS parallel benchmark suite
  - By NASA Advanced Supercomputing (NAS)
  - Computational fluid dynamics kernels
- PARSEC suite
  - Princeton Application Repository for Shared Memory Computers
  - Multithreaded applications using Pthreads and OpenMP

Cache Coherence (1)
- Examples
- [Figure: two CPUs with private caches. In the first example, CPU1 writes X = 1 while CPU0 still caches X = 0, so CPU0 may read a stale value. In the second example, CPU0 and CPU1 each hold a different modified copy of X (1 and 2), so the caches and memory disagree on X's value.]
- Invalidate all copies before allowing a write to proceed
- Disallow more than one modified copy

Cache Coherence (2)
- Cache coherence protocol
  - A set of rules to maintain the consistency of data stored in the local caches as well as in main memory
  - Ensures multiple read-only copies and at most one exclusive modified copy
- Types
  - Snooping-based (or "snoopy")
  - Directory-based
- Write strategies
  - Invalidation-based
  - Update-based

Cache Coherence (3)
- Snooping protocol
  - All cache controllers monitor (or "snoop") the bus
  - All requests for data are sent to all processors
  - Processors snoop to see if they have a shared copy of the block
  - Requires broadcast, since caching information resides at the processors
  - Works well with a bus (natural broadcast medium)
  - Dominates for small-scale machines
- Cache coherence unit
  - The cache block (line) is the unit of management
  - False sharing is possible: two processors share the same cache line but not the actual word
  - Coherence miss: an invalidation can cause a miss for data that was read before

Cache Coherence (4)
- MESI protocol: invalidation-based
  - Modified (M): the cache has the only copy and it is modified; memory is not up-to-date
  - Exclusive (E): the cache has the only copy and it is unmodified (clean); memory is up-to-date
  - Shared (S): copies may exist in other caches and all are unmodified; memory is up-to-date
  - Invalid (I): not in the cache

Cache Coherence (5)
- MESI protocol state machine
- [Figure: state diagram over Invalid, Shared (read-only), Exclusive (read-only), and Modified (read/write). Transitions are driven by CPU reads and writes and by observed bus Read/ReadX transactions; a CPU write from Shared issues a bus ReadX to invalidate other copies, while hits in Exclusive or Modified generate no bus traffic. If a cache miss occurs, the cache writes back the modified block (flush).]

Cache Coherence (6)
- Examples revisited
- [Figure: the two examples from Cache Coherence (1), annotated with MESI states. In the first, CPU0's copy goes E -> S and is invalidated (I) when CPU1 writes X = 1 (M); CPU0's re-read forces a flush and both copies end up Shared with X = 1. In the second, CPU0's modified copy is flushed and invalidated before CPU1 is allowed to write X = 2, so at most one modified copy exists at any time.]
- Invalidate all copies before allowing a write to proceed
- Disallow more than one modified copy

Cache Coherence (7)
- False sharing

    struct thread_stat {
        int  count;
        char pad[PADS];
    } stat[8];

    double compute ()
    {
        int i;
        double x, sum = 0.0;

        #pragma omp parallel for private(x) reduction(+:sum)
        for (i = 0; i < N; i++) {
            int id = omp_get_thread_num();
            stat[id].count++;
            x = (i + 0.5) * STEP;
            sum += 4.0 / (1.0 + x * x);
        }
        return sum * STEP;
    }

- Each stat[i] element holds its counter plus PADS bytes of padding, so neighboring counters can be pushed onto different cache lines
- [Figure: speedup of the Pi program (OpenMP, N = 2e7) vs. number of cores for pad sizes of 0, 28, 60, and 124 bytes]
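An equivalent way to avoid the false sharing, added here as a sketch (assuming C11 and 64-byte cache lines), is to align each per-thread slot to a cache-line boundary instead of picking a pad size by hand:

    #include <stdalign.h>

    struct thread_stat {
        alignas(64) int count;   /* each array element starts on its own 64-byte line */
    } stat[8];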

Summary: Why Hard?
- Impediments to parallel computing
- Parallelization
  - Parallel algorithms that maximize concurrency
  - Eliminating the serial portion as much as possible
  - Lack of standardized APIs, environments, and tools
- Correct parallelization
  - Shared resource identification
  - Difficult to debug (data races, deadlock, ...)
  - Memory consistency
- Effective parallelization
  - Communication & synchronization overhead
  - Problem decomposition, granularity, load imbalance, affinity, ...
  - Architecture dependency