Module 4: Parallel Programming: Shared Memory and Message Passing
Lecture 7: Examples of Shared Memory and Message Passing Programming

The Lecture Contains:

Shared Memory Version
Mutual Exclusion
LOCK Optimization
More Synchronization
Message Passing
Major Changes
MPI-like Environment

Shared Memory Version

/* include files */
MAIN_ENV;
int P, n;
void Solve ();
struct gm_t {
   LOCKDEC (diff_lock);
   BARDEC (barrier);
   float **A, diff;
} *gm;

int main (int argc, char **argv)
{
   int i;
   MAIN_INITENV;
   gm = (struct gm_t *) G_MALLOC (sizeof (struct gm_t));
   LOCKINIT (gm->diff_lock);
   BARINIT (gm->barrier);
   n = atoi (argv[1]);
   P = atoi (argv[2]);
   gm->A = (float **) G_MALLOC ((n+2) * sizeof (float *));
   for (i = 0; i < n+2; i++) {
      gm->A[i] = (float *) G_MALLOC ((n+2) * sizeof (float));
   }
   Initialize (gm->A);
   for (i = 1; i < P; i++) {   /* starts at 1 */
      CREATE (Solve);
   }
   Solve ();
   WAIT_FOR_END (P-1);
   MAIN_END;
}

void Solve (void)
{
   int i, j, pid, done = 0;
   float temp, local_diff;
   GET_PID (pid);
   while (!done) {
      local_diff = 0.0;
      if (!pid) gm->diff = 0.0;
      BARRIER (gm->barrier, P);   /* why? */
      for (i = pid*(n/P); i < (pid+1)*(n/P); i++) {
         for (j = 0; j < n; j++) {
            temp = gm->A[i][j];
            gm->A[i][j] = 0.2*(gm->A[i][j] + gm->A[i][j-1] + gm->A[i][j+1]
                               + gm->A[i+1][j] + gm->A[i-1][j]);
            local_diff += fabs (gm->A[i][j] - temp);
         }   /* end for */
      }   /* end for */
      LOCK (gm->diff_lock);
      gm->diff += local_diff;
      UNLOCK (gm->diff_lock);
      BARRIER (gm->barrier, P);
      if (gm->diff/(n*n) < TOL) done = 1;
      BARRIER (gm->barrier, P);   /* why? */
   }   /* end while */
}
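To relate the macros above to real primitives, the following is a minimal sketch (not part of the original slides) of the same solver skeleton written with POSIX threads: LOCK/UNLOCK become a pthread mutex, BARRIER becomes pthread_barrier_wait, and CREATE/WAIT_FOR_END become pthread_create/pthread_join. The worker name, the simple zero initialization, and the loop bounds (shifted to stay inside the (n+2) x (n+2) boundary) are this sketch's own choices.

/* Sketch only: a pthreads rendering of the macro-based solver skeleton above. */
#include <math.h>
#include <pthread.h>
#include <stdlib.h>

#define TOL 1e-3f

struct gm_t {
    pthread_mutex_t diff_lock;    /* LOCKDEC / LOCKINIT */
    pthread_barrier_t barrier;    /* BARDEC / BARINIT   */
    float **A, diff;
} gm;

static int n, P;

static void *worker(void *arg)
{
    int pid = (int)(long)arg, done = 0, i, j;
    float temp, local_diff;

    while (!done) {
        local_diff = 0.0f;
        if (pid == 0) gm.diff = 0.0f;
        pthread_barrier_wait(&gm.barrier);        /* BARRIER: wait until diff is reset */
        for (i = 1 + pid*(n/P); i <= (pid+1)*(n/P); i++)
            for (j = 1; j <= n; j++) {
                temp = gm.A[i][j];
                gm.A[i][j] = 0.2f*(gm.A[i][j] + gm.A[i][j-1] + gm.A[i][j+1]
                                   + gm.A[i+1][j] + gm.A[i-1][j]);
                local_diff += fabsf(gm.A[i][j] - temp);
            }
        pthread_mutex_lock(&gm.diff_lock);        /* LOCK   */
        gm.diff += local_diff;
        pthread_mutex_unlock(&gm.diff_lock);      /* UNLOCK */
        pthread_barrier_wait(&gm.barrier);        /* all contributions are in */
        if (gm.diff/(n*n) < TOL) done = 1;
        pthread_barrier_wait(&gm.barrier);        /* nobody resets diff too early */
    }
    return NULL;
}

int main(int argc, char **argv)
{
    int i;
    pthread_t tid[64];                            /* assumes P <= 64 */

    n = (argc > 1) ? atoi(argv[1]) : 64;
    P = (argc > 2) ? atoi(argv[2]) : 4;
    pthread_mutex_init(&gm.diff_lock, NULL);
    pthread_barrier_init(&gm.barrier, NULL, P);
    gm.A = malloc((n+2) * sizeof(float *));
    for (i = 0; i < n+2; i++)
        gm.A[i] = calloc(n+2, sizeof(float));     /* stands in for Initialize() */

    for (i = 1; i < P; i++)                       /* CREATE: threads 1..P-1 */
        pthread_create(&tid[i], NULL, worker, (void *)(long)i);
    worker((void *)0L);                           /* main thread acts as pid 0 */
    for (i = 1; i < P; i++)                       /* WAIT_FOR_END */
        pthread_join(tid[i], NULL);
    return 0;
}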

Mutual Exclusion

Use LOCK/UNLOCK around critical sections.
Updates to the shared variable diff must be sequential.
Heavily contended locks may degrade performance.
Try to minimize the use of critical sections: they are sequential anyway and will limit speedup. This is the reason for accumulating into local_diff instead of accessing gm->diff on every update.
Also, minimize the size of the critical section: the longer a processor holds the lock, the longer the other processors wait at lock acquire.

LOCK Optimization

Naive version:

LOCK (gm->cost_lock);
if (my_cost < gm->cost) {
   gm->cost = my_cost;
}
UNLOCK (gm->cost_lock);
/* May lead to heavy lock contention if everyone tries to update at the same time */

Optimized version:

if (my_cost < gm->cost) {
   LOCK (gm->cost_lock);
   if (my_cost < gm->cost) {   /* make sure */
      gm->cost = my_cost;
   }
   UNLOCK (gm->cost_lock);
}
/* this works because gm->cost is monotonically decreasing */
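The re-check inside the lock is the essential part of the optimized version: between the unlocked test and the lock acquire, another processor may already have lowered the cost. A minimal sketch with a pthreads mutex, assuming a shared cost that only ever decreases; shared_cost, my_cost and update_cost are illustrative names, not from the lecture.

#include <pthread.h>
#include <stdio.h>

static float shared_cost = 1e30f;                 /* monotonically decreasing */
static pthread_mutex_t cost_lock = PTHREAD_MUTEX_INITIALIZER;

void update_cost(float my_cost)
{
    if (my_cost < shared_cost) {                  /* cheap read, no lock held */
        pthread_mutex_lock(&cost_lock);
        if (my_cost < shared_cost)                /* re-check under the lock: another
                                                     thread may have lowered it meanwhile */
            shared_cost = my_cost;
        pthread_mutex_unlock(&cost_lock);
    }
}

int main(void)
{
    update_cost(10.0f);                           /* lowers shared_cost to 10 */
    update_cost(25.0f);                           /* no effect: not an improvement */
    printf("shared_cost = %f\n", shared_cost);
    return 0;
}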

More Synchronization

Global synchronization
   Through barriers
   Often used to separate computation phases
Point-to-point synchronization
   A process directly notifies another about a certain event on which the latter was waiting
   Producer-consumer communication pattern
   Semaphores are used for concurrent programming on uniprocessors through the P and V functions
   On shared memory multiprocessors, this is normally implemented through flags (busy wait or spin); a C11 sketch of this hand-off appears after this slide:
      P0: A = 1; flag = 1;
      P1: while (!flag); use (A);

Message Passing

What is different from shared memory?
   No shared variables: communication is exposed through send/receive
   No lock or barrier primitives
   Synchronization must be implemented through send/receive

Grid solver example
   P0 allocates and initializes matrix A in its local memory, then sends the block rows, n, and P to each processor; e.g., P1 waits to receive rows n/P to 2n/P - 1, etc. (this is one-time).
   Within the while loop, the first thing every processor does is send its first and last rows to the upper and lower processors (corner cases need to be handled).
   Then each processor waits to receive the neighboring two rows from the upper and lower processors.
   At the end of the loop, each processor sends its local_diff to P0, and P0 sends back the accumulated diff so that each processor can locally compute the done flag.
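The P0/P1 flag example relies on the write to A becoming visible before the write to flag. As a rough sketch (not from the lecture), here is the same hand-off written with two pthreads and C11 atomics, using release/acquire ordering so the property holds even on weakly ordered hardware; the producer/consumer names are illustrative.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int A;
static atomic_int flag = 0;

static void *producer(void *arg)                  /* plays the role of P0 */
{
    (void)arg;
    A = 1;                                        /* produce the data */
    atomic_store_explicit(&flag, 1, memory_order_release);
    return NULL;
}

static void *consumer(void *arg)                  /* plays the role of P1 */
{
    (void)arg;
    while (!atomic_load_explicit(&flag, memory_order_acquire))
        ;                                         /* busy wait (spin) */
    printf("use(A): %d\n", A);                    /* safe to use A now */
    return NULL;
}

int main(void)
{
    pthread_t p0, p1;
    pthread_create(&p0, NULL, producer, NULL);
    pthread_create(&p1, NULL, consumer, NULL);
    pthread_join(p0, NULL);
    pthread_join(p1, NULL);
    return 0;
}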

Major Changes: Message Passing

This algorithm is deterministic.
It may converge to a different solution than the shared memory version if there are multiple solutions: why?
   There is a fixed, specific point in the program (at the beginning of each iteration) when the neighboring rows are communicated.
   This is not true for shared memory, where a processor may read a neighbor's row before or after it has been updated in the current iteration.

MPI-like Environment
Message Passing Grid Solver

MPI stands for Message Passing Interface: a C library that provides a set of message passing primitives (e.g., send, receive, broadcast) to the user.
PVM (Parallel Virtual Machine) is another well-known platform for message passing programming.
Background in MPI is not necessary for understanding this lecture. You only need to know:
   When you start an MPI program, every thread runs the same main function.
   We will assume that we pin one thread to one processor, just as we did in shared memory.
   Instead of using the exact MPI syntax, we will use some macros that call the MPI functions.

MAIN_ENV;

/* define message tags */
#define ROW 99
#define DIFF 98
#define DONE 97

int main (int argc, char **argv)
{
   int pid, P, done, i, j, N;
   float tempdiff, local_diff, temp, **A;

   MAIN_INITENV;
   GET_PID (pid);
   GET_NUMPROCS (P);
   N = atoi (argv[1]);
   tempdiff = 0.0;
   done = 0;
   A = (float **) malloc ((N/P+2) * sizeof (float *));
   for (i = 0; i < N/P+2; i++) {
      A[i] = (float *) malloc (sizeof (float) * (N+2));
   }
   initialize (A);

   while (!done) {
      local_diff = 0.0;
      if (pid) {   /* send my first row up */
         /* MPI_CHAR means raw byte format */
         SEND (&A[1][1], N*sizeof (float), MPI_CHAR, pid-1, ROW);
      }
      if (pid != P-1) {   /* recv last row */
         RECV (&A[N/P+1][1], N*sizeof (float), MPI_CHAR, pid+1, ROW);
      }
      if (pid != P-1) {   /* send last row down */
         SEND (&A[N/P][1], N*sizeof (float), MPI_CHAR, pid+1, ROW);
      }
      if (pid) {   /* recv first row from above */
         RECV (&A[0][1], N*sizeof (float), MPI_CHAR, pid-1, ROW);
      }
      for (i = 1; i <= N/P; i++)
         for (j = 1; j <= N; j++) {
            temp = A[i][j];
            A[i][j] = 0.2 * (A[i][j] + A[i][j-1] + A[i-1][j] + A[i][j+1] + A[i+1][j]);
            local_diff += fabs (A[i][j] - temp);
         }
      /* the loop continues on the next page: the local_diff values are accumulated
         at P0 and the done flag is sent back, using the DIFF and DONE tags */
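As a point of reference (not part of the lecture, which deliberately hides the MPI syntax behind macros), the boundary-row exchange inside the while loop might look as follows with real MPI calls. This sketch's own choices: MPI_Sendrecv instead of separate blocking SEND/RECV, so the exchange does not depend on the MPI implementation buffering the sends; MPI_FLOAT instead of raw bytes; and the name exchange_boundary_rows.

#include <mpi.h>
#include <stdlib.h>

#define ROW 99                               /* message tag, as on the slide */

void exchange_boundary_rows(float **A, int N, int pid, int P)
{
    int rows = N / P;                        /* interior rows owned by this process */
    int up   = (pid > 0)     ? pid - 1 : MPI_PROC_NULL;   /* MPI_PROC_NULL makes the */
    int down = (pid < P - 1) ? pid + 1 : MPI_PROC_NULL;   /* edge cases no-ops       */

    /* Send my first interior row up; receive the ghost row below my last row. */
    MPI_Sendrecv(&A[1][1],      N, MPI_FLOAT, up,   ROW,
                 &A[rows+1][1], N, MPI_FLOAT, down, ROW,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Send my last interior row down; receive the ghost row above my first row. */
    MPI_Sendrecv(&A[rows][1],   N, MPI_FLOAT, down, ROW,
                 &A[0][1],      N, MPI_FLOAT, up,   ROW,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

int main(int argc, char **argv)
{
    int pid, P, i, N = 64, rows;
    float **A;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);
    MPI_Comm_size(MPI_COMM_WORLD, &P);
    rows = N / P;

    A = malloc((rows + 2) * sizeof(float *));        /* local block + 2 ghost rows */
    for (i = 0; i < rows + 2; i++)
        A[i] = calloc(N + 2, sizeof(float));

    exchange_boundary_rows(A, N, pid, P);

    MPI_Finalize();
    return 0;
}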