EE/CSCI 451: Parallel and Distributed Computation


1 EE/CSCI 451: Parallel and Distributed Computation Lecture #15 3/7/2017 Xuehai Qian University of Southern California 1

2 Outline From last class: Data distribution; Mapping; Parallel algorithm models Today (Chapter 6): Message passing; Send and receive operations; Examples, performance issues 2

3 Message Passing Programming Model (1) Message passing One of the oldest parallel programming paradigms Widely used Key features Partitioned address space local data, remote data Explicit parallelization the user is responsible for specifying and managing concurrency Can be challenging 3

4 Message Passing Programming Model (2) Explicit communication Program 0 Program 1 Program p-1 Data local to program 0 Data local to program 1 Data local to program p-1 Program address space partitioned across the programs Communication - needs coordination among the communicating processes (and the host for the two processes) 4

5 Message Passing Program (1) Most general model: asynchronous [Figure: programs 0 through p−1, each issuing interleaved sends (S) and receives (R) before ending] No structure with respect to instructions, interactions No global clock Execution is asynchronous Programs 0, 1, …, p−1 can all be distinct Hard to write/debug 5

6 Message Passing Program (2) Loosely synchronous [Figure: programs 0 through 4 alternating local computation with common synchronization points at which data is received] Some structure Easier to reason about than the asynchronous execution model 6

7 Message Passing Program (3) SPMD (Single Program Multiple Data) Code is the same in all the processes except for initialization Restrictive model, easy to write and debug Widely used Correctness in all 3 models of concurrency: irrespective of the rate of execution of each program, the computation should produce the correct results for every input, as intended 7

8 Message Passing Program Specification User specifies: Processes Process layout Data layout Layouts: 1-D (processes 0, 1, …, p−1) or 2-D (processes (0,0) through (√p−1, √p−1)) Embedding onto the target platform: specified by the user, or the MPI system software finds the most appropriate mapping that reduces the cost of sending and receiving messages 8
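
Below is a minimal C/MPI sketch (not from the slides) of letting the MPI library choose and report a 2-D process layout through its Cartesian topology routines; the grid dimensions, periodicity, and the reorder flag used here are illustrative assumptions.

#include <mpi.h>
#include <stdio.h>

/* Sketch (illustrative): ask MPI for a 2-D process layout via a Cartesian
   topology; the library may remap ranks to reduce communication cost. */
int main(int argc, char *argv[]) {
    int p, grid_rank, dims[2] = {0, 0}, periods[2] = {1, 1}, coords[2];
    MPI_Comm grid;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    MPI_Dims_create(p, 2, dims);              /* factor p into a near-square grid */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1 /* reorder */, &grid);
    MPI_Comm_rank(grid, &grid_rank);          /* rank may differ after reordering */
    MPI_Cart_coords(grid, grid_rank, 2, coords);

    printf("process %d -> grid position (%d,%d) of %d x %d\n",
           grid_rank, coords[0], coords[1], dims[0], dims[1]);
    MPI_Finalize();
    return 0;
}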

9 Send and Receive (1) Send and Receive operations Send(sendbuf, size, dest), where dest is the destination process ID Receive(recvbuf, size, source), where source is the source process ID Example: send data from process 0 to process 1 Sent data = data at the beginning of the execution of the send Send and receive should be matched (e.g., using process IDs) Complications may arise due to the way the software and hardware implement the operation 9
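
The following is a minimal C/MPI sketch of the matched Send/Receive pair described above; the buffer contents, message size, and tag are illustrative.

#include <mpi.h>
#include <stdio.h>

/* Sketch (illustrative): process 0 sends a small buffer to process 1. */
int main(int argc, char *argv[]) {
    int rank, data[4] = {1, 2, 3, 4};
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Send(sendbuf, size, dest): dest is the destination process ID */
        MPI_Send(data, 4, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Receive(recvbuf, size, source): source is the sending process ID */
        MPI_Recv(data, 4, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 1 received %d %d %d %d\n",
               data[0], data[1], data[2], data[3]);
    }
    MPI_Finalize();
    return 0;
}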

10 Send and Receive (2) What data is sent? Buffered? Issues Sending process: wait until completion of communication? Overheads at sender, at receiver 10

11 Adding Using Message Passing (1) Start with adding on PRAM Input array: A(0), …, A(n−1) Output = A(0) + A(1) + … + A(n−1), stored in A(0) 11

12 Adding Using Message Passing (2) PRAM Algorithm Program in processor j, 0 ≤ j ≤ n−1 1. Do i = 0 to log2 n − 1 2. If j = k · 2^(i+1) for some k ∈ N: A(j) ← A(j) + A(j + 2^i) 3. End Note: A(0), …, A(n−1) is shared among all the processors Synchronous operation [e.g., all the processors execute instruction 2 during the same cycle; log2 n cycles] N = set of natural numbers = {0, 1, …} Parallel time = O(log n) cycles 12
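
As a rough shared-memory analogue of this PRAM pseudocode, the C/OpenMP sketch below emulates the lockstep cycles with one thread per element and a barrier after each step; the array size and data are placeholders, and it assumes the runtime actually grants all N threads.

#include <omp.h>
#include <stdio.h>

#define N 8   /* n, assumed here to be a power of two */

/* Sketch (illustrative): emulate the synchronous PRAM reduction with one
   OpenMP thread per element; a barrier stands in for each lockstep cycle. */
int main(void) {
    int A[N] = {1, 2, 3, 4, 5, 6, 7, 8};

    #pragma omp parallel num_threads(N)
    {
        int j = omp_get_thread_num();
        for (int step = 1; step < N; step *= 2) {   /* step = 2^i            */
            if (j % (2 * step) == 0)                /* j = k * 2^(i+1)       */
                A[j] = A[j] + A[j + step];          /* A(j) += A(j + 2^i)    */
            #pragma omp barrier                     /* end of cycle i        */
        }
    }
    printf("sum = %d (in A(0))\n", A[0]);
    return 0;
}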

13 Adding Using Message Passing (3) Message Passing Algorithm (SPMD model) Program in process j, 0 ≤ j ≤ n−1 1. Do i = 0 to log2 n − 1 2. If j = k · 2^(i+1) + 2^i for some k ∈ N: Send A(j) to process j − 2^i 3. Else if j = k · 2^(i+1) for some k ∈ N: Receive A(j + 2^i) from process j + 2^i; A(j) ← A(j) + A(j + 2^i) 4. Barrier 5. End Note: A(j) is local to process j N = set of natural numbers = {0, 1, …} Parallel time = O(log n) iterations 13
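
A possible C/MPI rendering of this SPMD reduction, with one array element per process, is sketched below; the initial values are placeholders and the explicit barrier simply mirrors the pseudocode above.

#include <mpi.h>
#include <stdio.h>

/* Sketch (illustrative): SPMD reduction with one element per process;
   A(j) is local to process j, and the result accumulates in process 0. */
int main(int argc, char *argv[]) {
    int j, p, a;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &j);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    a = j + 1;                                   /* placeholder value of A(j)   */
    for (int step = 1; step < p; step *= 2) {    /* step = 2^i                  */
        if (j % (2 * step) == step) {            /* j = k*2^(i+1) + 2^i: send   */
            MPI_Send(&a, 1, MPI_INT, j - step, 0, MPI_COMM_WORLD);
        } else if (j % (2 * step) == 0 && j + step < p) {
            int tmp;                             /* j = k*2^(i+1): receive, add */
            MPI_Recv(&tmp, 1, MPI_INT, j + step, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            a += tmp;
        }
        MPI_Barrier(MPI_COMM_WORLD);             /* the Barrier of the slide    */
    }
    if (j == 0) printf("sum = %d\n", a);
    MPI_Finalize();
    return 0;
}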

14 Adding using Message Passing (4) Communication between processes Power of 2 connections e.g. Hypercube Total amount of communication = O(n) 14

15 MM using Message Passing (1) C = A × B (Cannon's algorithm) n × n matrices, √p × √p processors P(i,j), 0 ≤ i, j < √p, 1 ≤ √p ≤ n Processor P(i,j) is assigned A(i,j), B(i,j), C(i,j): the (i,j)th blocks, each of size (n/√p) × (n/√p) 15

16 MM using Message Passing (2) [Figure: circular left shift along a row of processors 0, 1, …, √p−1, and circular up shift along a column of processors 0, …, √p−1] 16

17 MM using Message Passing (3) Initial data alignment For A: circular left shift of row i by i positions (0 ≤ i < √p) For B: circular up shift of column j by j positions (0 ≤ j < √p) Example: 4 × 4 matrix on a 4 × 4 processor array, A and B after initial alignment (processor (i,j) holds A(i,(i+j) mod 4) and B((i+j) mod 4, j)):
Row 0: A0,0 B0,0 | A0,1 B1,1 | A0,2 B2,2 | A0,3 B3,3
Row 1: A1,1 B1,0 | A1,2 B2,1 | A1,3 B3,2 | A1,0 B0,3
Row 2: A2,2 B2,0 | A2,3 B3,1 | A2,0 B0,2 | A2,1 B1,3
Row 3: A3,3 B3,0 | A3,0 B0,1 | A3,1 B1,2 | A3,2 B2,3
17

18 MM using Message Passing (4) Parallel algorithm (global view) 1. Initial data alignment 2. Repeat √p times (one super step each): Ø All processors P(i,j) perform an (n/√p) × (n/√p) matrix multiplication in parallel using local data Ø In parallel for all i, j: processor P(i,j) circular-left-shifts its block a by 1 position Ø In parallel for all i, j: processor P(i,j) circular-up-shifts its block b by 1 position End Note: a, b, c are (n/√p) × (n/√p) matrices, local to each processor Data alignment uses message passing (a permutation in each row and each column) 18

19 MM using Message Passing (5) Parallel algorithm (local view from P(i,j)) Repeat √p times (one super step each): Ø c ← c + a · b ((n/√p) × (n/√p) matrix multiplication) Ø a ← block read from the right neighbor (i, (j + 1) mod √p) Ø b ← block read from the neighbor below ((i + 1) mod √p, j) End 19
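
A sketch of this local view in C with MPI is shown below; it assumes p is a perfect square, uses a periodic Cartesian communicator and MPI_Sendrecv_replace for the circular shifts, and fills the local blocks with dummy data, so it illustrates the communication structure rather than a tuned implementation. The block size NB stands in for n/√p.

#include <mpi.h>
#include <math.h>
#include <stdlib.h>

#define NB 64   /* local block size n/sqrt(p): an assumed placeholder */

/* c += a * b on NB x NB blocks stored row-major */
static void local_mm(double *c, const double *a, const double *b) {
    for (int i = 0; i < NB; i++)
        for (int k = 0; k < NB; k++)
            for (int j = 0; j < NB; j++)
                c[i*NB + j] += a[i*NB + k] * b[k*NB + j];
}

/* Sketch (illustrative) of Cannon's algorithm: assumes p is a perfect square;
   local blocks hold dummy data, and MPI_Sendrecv_replace performs the shifts. */
int main(int argc, char *argv[]) {
    int p, rank, q, dims[2], periods[2] = {1, 1}, coords[2];
    int src, dst, left, right, up, down;
    MPI_Comm grid;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    q = (int)sqrt((double)p);
    while ((q + 1) * (q + 1) <= p) q++;               /* q = floor(sqrt(p))    */
    dims[0] = dims[1] = q;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &grid);
    if (grid == MPI_COMM_NULL) { MPI_Finalize(); return 0; }
    MPI_Comm_rank(grid, &rank);
    MPI_Cart_coords(grid, rank, 2, coords);

    double *a = calloc(NB*NB, sizeof *a);
    double *b = calloc(NB*NB, sizeof *b);
    double *c = calloc(NB*NB, sizeof *c);
    for (int i = 0; i < NB*NB; i++) { a[i] = 1.0; b[i] = 1.0; }  /* dummy data */

    /* Initial alignment: row i shifts A left by i, column j shifts B up by j */
    MPI_Cart_shift(grid, 1, -coords[0], &src, &dst);
    MPI_Sendrecv_replace(a, NB*NB, MPI_DOUBLE, dst, 0, src, 0, grid,
                         MPI_STATUS_IGNORE);
    MPI_Cart_shift(grid, 0, -coords[1], &src, &dst);
    MPI_Sendrecv_replace(b, NB*NB, MPI_DOUBLE, dst, 0, src, 0, grid,
                         MPI_STATUS_IGNORE);

    /* Neighbors for the per-super-step shifts by one position */
    MPI_Cart_shift(grid, 1, -1, &right, &left);       /* A: shift left by 1   */
    MPI_Cart_shift(grid, 0, -1, &down, &up);          /* B: shift up by 1     */

    for (int step = 0; step < q; step++) {            /* sqrt(p) super steps  */
        local_mm(c, a, b);                            /* compute on local data */
        MPI_Sendrecv_replace(a, NB*NB, MPI_DOUBLE, left, 0, right, 0, grid,
                             MPI_STATUS_IGNORE);      /* a <- from right neighbor */
        MPI_Sendrecv_replace(b, NB*NB, MPI_DOUBLE, up, 0, down, 0, grid,
                             MPI_STATUS_IGNORE);      /* b <- from neighbor below */
    }

    free(a); free(b); free(c);
    MPI_Finalize();
    return 0;
}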

20 MM using Message Passing (6) Illustration (4 × 4 matrix, 4 × 4 processor array), Cannon's algorithm [Figure: A and B block positions on the processor array after the initial alignment] Initial alignment; Super step 0: compute using local data, circular left shift A, circular up shift B 20

21 MM using Message Passing (7) Cannon's algorithm [Figure: A and B block positions after the super step 0 shifts] Super step 1: compute using local data, circular left shift A, circular up shift B 21

22 MM using Message Passing (8) Cannon's algorithm [Figure: A and B block positions after the super step 1 shifts] Super step 2: compute using local data, circular left shift A, circular up shift B 22

23 MM using Message Passing (9) Cannon's algorithm [Figure: A and B block positions after the super step 2 shifts] Super step 3: compute using local data 23

24 MM using Message Passing (10) Performance analysis Operations per super step in each PE: (n/√p)^3 multiplications and (n/√p)^3 additions Total number of super steps: √p Total number of operations (over all the PEs): (n/√p)^3 · √p super steps · p processors = n^3 multiplications, and likewise n^3 additions Total amount of data communicated (data received) = √p super steps · 2(n/√p)^2 words per process per super step · p processes = O(n^2 · √p) 24

25 MM using Shared Variable (1) C = A × B, n × n Each thread (i, j) is responsible for updating C(i, j), 0 ≤ i, j < n A and B are shared variables 25

26 MM using Shared Variable (2) Thread (i, j): C(i, j) ← 0 Do k from 0 to n − 1: C(i, j) ← C(i, j) + A(i, k) · B(k, j) End [Figure: threads running on top of a shared memory] 26
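
A C/OpenMP sketch of this shared-variable multiply follows; the matrix size and contents are placeholders, and the (i, j) thread of the slide becomes one iteration of a parallel loop nest.

#include <omp.h>
#include <stdio.h>

#define N 256   /* placeholder problem size */

static double A[N][N], B[N][N], C[N][N];   /* shared among all threads */

/* Sketch (illustrative): each (i, j) iteration plays the role of thread (i, j)
   on the slide and exclusively owns C[i][j], so no locking is needed. */
int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = 1.0; B[i][j] = 1.0; }

    #pragma omp parallel for collapse(2)
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }

    printf("C[0][0] = %.1f (expected %d)\n", C[0][0], N);
    return 0;
}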

27 Blocking Send/Receive Blocking semantics Data sent = data at the time the Send command was initiated To ensure correctness, the send operation blocks until some condition guaranteeing these semantics is met Blocking non-buffered send Block the sending process Send a request to the receiving process Wait for the receiving process to acknowledge (matched receive operation) Upon receiving the acknowledgement, start the transfer No buffers 27

28 Blocking Send/Receive Idling overheads [Figure: three request-to-send / okay-to-send / data timelines between a sending and a receiving process] (a) Sender comes first; idling at the sender (b) Sender and receiver come at about the same time; idling is minimized (c) Receiver comes first; idling at the receiver 28

29 Deadlock (1) Example (1) both processes blocked: P0: 1. send(&a, 1, 1); 2. receive(&b, 1, 1); P1: 1. send(&a, 1, 0); 2. receive(&b, 1, 0); With blocking sends, each process waits for the other's receive, which is never reached: deadlock Deadlocks arise very easily with blocking protocols 29

30 Deadlock (2) Example (2) avoiding the cyclic wait by ordering the operations: If myid is even: Send, then Receive If myid is odd: Receive, then Send E.g., P0: Send; Receive, while P1: Receive; Send 30
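
A minimal C/MPI sketch of this parity-based ordering is given below; the partner choice (neighboring ranks) and the message contents are illustrative.

#include <mpi.h>

/* Sketch (illustrative): neighboring ranks exchange one value; even ranks
   send first, odd ranks receive first, so no cyclic wait can form. */
int main(int argc, char *argv[]) {
    int myid, p, partner, a, b = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    a = myid;
    partner = myid ^ 1;                     /* pair rank 0 with 1, 2 with 3, ... */
    if (partner < p) {
        if (myid % 2 == 0) {                /* even: Send, then Receive */
            MPI_Send(&a, 1, MPI_INT, partner, 0, MPI_COMM_WORLD);
            MPI_Recv(&b, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {                            /* odd: Receive, then Send  */
            MPI_Recv(&b, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&a, 1, MPI_INT, partner, 0, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}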

31 Non-blocking Send/Receive (1) Non-blocking send/receive Fast send/receive (reduce overhead) Let the programmer manage semantic correctness Send: Perform simple initiation, setup Return control immediately User should not alter data immediately after issuing send. However, user can do other (useful) operations Status information available for user to check Example: check-status 31

32 Non-blocking Send/Receive (2) Non-blocking send/receive [Figure: the sending process issues a request to send, copies data into a buffer, and continues execution; until the copy finishes it is unsafe to update the sent data, afterwards it is safe; the receiving process posts its receive and the data is transferred after the okay-to-send handshake] 32
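
The sketch below shows the non-blocking pattern in C/MPI, including a status check before waiting; the message size, tag, and the placeholder for overlapped work are assumptions.

#include <mpi.h>
#include <stdio.h>

/* Sketch (illustrative): non-blocking send/receive with a completion check.
   The sender must not modify 'data' until the request completes. */
int main(int argc, char *argv[]) {
    int rank, data[1024] = {0}, flag = 0;
    MPI_Request req;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        data[0] = 42;
        MPI_Isend(data, 1024, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        /* control returns immediately: do other useful, independent work here */
        MPI_Test(&req, &flag, MPI_STATUS_IGNORE);   /* check-status             */
        MPI_Wait(&req, MPI_STATUS_IGNORE);          /* now safe to reuse 'data' */
        data[0] = 0;
    } else if (rank == 1) {
        MPI_Irecv(data, 1024, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);          /* data valid only after Wait */
        printf("received %d\n", data[0]);
    }
    MPI_Finalize();
    return 0;
}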

33 Summary of Blocking Send Non-buffered Data sent = data at the time the Send command was initiated Issue a send request and block the sending process Start the data transfer after receiving an acknowledgement from the receiving process Return control to the sending process after communication completion, e.g., the receiving process has received the entire data 33

34 Summary of Non-Blocking Send Data sent = data at the time the Send command was initiated Copy data into send buffer then return control to sending process immediately User can alter sent data after they have been copied into buffer 34

35 Additional Materials in Textbook These are not required for this class Non-blocking non-buffered send/receive operations with communication hardware support Non-blocking non-buffered send/receive operations without communication hardware support 35

36 OpenMP or MPI? [Figure: MPI used between nodes across the interconnection network; OpenMP used within a multicore shared-memory node] 36
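
A minimal hybrid sketch in C is shown below: MPI between processes (typically one per node) and OpenMP threads within each process; the requested thread-support level is an illustrative choice.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

/* Sketch (illustrative): MPI between processes (typically one per node),
   OpenMP threads inside each process on the node's shared memory. */
int main(int argc, char *argv[]) {
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        printf("MPI rank %d, OpenMP thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }
    MPI_Finalize();
    return 0;
}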

37 Summary Send and Receive operations Blocking / non-blocking Issues Overhead Performance Correctness Deadlock 37

38 Backup Slides 38

39 Protocols For Send/Receive
Buffered, blocking: sending process returns after data has been copied into the communication buffer
Buffered, non-blocking: sending process returns after initiating a DMA transfer to the buffer; the operation may not be complete on return
Non-buffered, blocking: sending process blocks until the matching receive operation has been encountered
Blocking operations: Send and Receive semantics are assured by the corresponding operation
Non-blocking operations: the programmer must explicitly ensure semantics by polling to verify completion
39
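
As one concrete instance of the buffered blocking row of this table, the C/MPI sketch below uses MPI_Bsend with a user-attached buffer; the buffer size and message are illustrative.

#include <mpi.h>
#include <stdlib.h>
#include <stdio.h>

/* Sketch (illustrative): buffered blocking send. MPI_Bsend returns as soon as
   the message is copied into the user-attached buffer, even if the matching
   receive has not been posted yet. */
int main(int argc, char *argv[]) {
    int rank, msg = 7, bufsize;
    char *buf;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        bufsize = sizeof(int) + MPI_BSEND_OVERHEAD;
        buf = malloc(bufsize);
        MPI_Buffer_attach(buf, bufsize);
        MPI_Bsend(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        msg = 0;                              /* safe: data already buffered      */
        MPI_Buffer_detach(&buf, &bufsize);    /* waits until buffered sends drain */
        free(buf);
    } else if (rank == 1) {
        MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("received %d\n", msg);
    }
    MPI_Finalize();
    return 0;
}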
