Synchronous Computations

1 Chapter 6 slides6-1 Synchronous Computations

2 Synchronous Computations slides6-2 In a (fully) synchronous application, all the processes are synchronized at regular points. Barrier: a basic mechanism for synchronizing processes, inserted at the point in each process where it must wait. All processes can continue from this point when all the processes have reached it (or, in some implementations, when a stated number of processes have reached this point).

3 Processes reaching barrier at different times slides6-3 (Figure: processes P0, P1, P2, ..., Pp-1 on a time axis; each is active until it reaches the barrier, then waits until all processes have arrived.)

4 In message-passing systems, barriers are provided with library routines. slides6-4 (Figure: processes P0, P1, ..., Pp-1 each call Barrier(); processes wait until all reach their barrier call.)

5 slides6-5 MPI: MPI_Barrier() - barrier routine with a named communicator as its only parameter. Called by each process in the group, blocking until all members of the group have reached the barrier call and only returning then. Other message-passing libraries provide a similar barrier routine used with a named group of processes.
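As a concrete illustration, a minimal MPI program using MPI_Barrier() might look as follows (a sketch only; work() is a hypothetical stand-in for whatever computation each process performs):

    #include <mpi.h>
    #include <stdio.h>

    /* Hypothetical placeholder for the computation each process performs. */
    static void work(int rank) { printf("process %d working\n", rank); }

    int main(int argc, char *argv[]) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        work(rank);                    /* each process computes independently */
        MPI_Barrier(MPI_COMM_WORLD);   /* blocks until every process in the communicator has called it */
        printf("process %d past barrier\n", rank);

        MPI_Finalize();
        return 0;
    }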

6 Barrier Implementation slides6-6 Centralized counter implementation (a linear barrier): (Figure: processes P0, P1, ..., Pp-1 each call Barrier(); a shared counter C is incremented at each call and checked against p.)

7 slides6-7 Good barrier implementations must take into account that a barrier might be used more than once in a process. It might be possible for a process to enter the barrier for a second time before previous processes have left the barrier for the first time.

8 slides6-8 Counter-based barriers often have two phases: a process enters the arrival phase and does not leave it until all processes have arrived in this phase; then processes move to the departure phase and are released. The two-phase structure handles the reentrant scenario.

9 slides6-9 Example code:
Master:
    for (i = 0; i < n; i++)    /* count slaves as they reach barrier */
        recv(Pany);
    for (i = 0; i < n; i++)    /* release slaves */
        send(Pi);
Slave processes:
    send(Pmaster);
    recv(Pmaster);

10 slides6-10 Barrier implementation in a message-passing system. (Figure: the master executes the arrival-phase loop for(i=0;i<n;i++) recv(Pany); followed by the departure-phase loop for(i=0;i<n;i++) send(Pi); each slave's barrier consists of send(Pmaster); followed by recv(Pmaster);.)
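A minimal MPI rendering of this two-phase master/slave barrier might look as follows (a sketch, assuming process 0 plays the role of the master; barrier2() is a hypothetical name, not an MPI routine):

    /* Hypothetical two-phase counter barrier: process 0 acts as master. */
    void barrier2(MPI_Comm comm) {
        int rank, p, i;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);
        if (rank == 0) {
            for (i = 1; i < p; i++)    /* arrival phase: count the other p-1 processes */
                MPI_Recv(NULL, 0, MPI_INT, MPI_ANY_SOURCE, 0, comm, MPI_STATUS_IGNORE);
            for (i = 1; i < p; i++)    /* departure phase: release them */
                MPI_Send(NULL, 0, MPI_INT, i, 1, comm);
        } else {
            MPI_Send(NULL, 0, MPI_INT, 0, 0, comm);                      /* announce arrival */
            MPI_Recv(NULL, 0, MPI_INT, 0, 1, comm, MPI_STATUS_IGNORE);   /* wait for release */
        }
    }

Separate message tags distinguish arrival messages (tag 0) from release messages (tag 1).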

11 Tree Implementation slides6-11 More efficient: O(log p) steps. Suppose 8 processes, P0, P1, P2, P3, P4, P5, P6, P7:
1st stage: P1 sends message to P0 (when P1 reaches its barrier); P3 sends message to P2 (when P3 reaches its barrier); P5 sends message to P4 (when P5 reaches its barrier); P7 sends message to P6 (when P7 reaches its barrier).
2nd stage: P2 sends message to P0 (when P2 and P3 have reached their barriers); P6 sends message to P4 (when P6 and P7 have reached their barriers).
3rd stage: P4 sends message to P0 (when P4, P5, P6, and P7 have reached their barriers); P0 terminates the arrival phase (when P0 reaches its barrier and has received the message from P4).
Release with a reverse tree construction.

12 Tree barrier slides6-12 (Figure: processes P0-P7; arrival at barrier, synchronizing messages passed up the tree, then departure from barrier down the reverse tree.)
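The arrival phase described above could be sketched in MPI as follows for p a power of two (an illustration only; the departure phase, not shown, would mirror it with the sends and receives reversed down the tree):

    /* Arrival phase of a tree barrier, p assumed to be a power of 2 (sketch). */
    void tree_barrier_arrival(int rank, int p, MPI_Comm comm) {
        int step;
        for (step = 1; step < p; step *= 2) {
            if (rank % (2 * step) == step) {
                /* this process has arrived: notify its partner lower in the tree, then stop */
                MPI_Send(NULL, 0, MPI_INT, rank - step, 0, comm);
                break;
            } else if (rank % (2 * step) == 0) {
                /* wait until the higher-ranked partner (and its subtree) has arrived */
                MPI_Recv(NULL, 0, MPI_INT, rank + step, 0, comm, MPI_STATUS_IGNORE);
            }
        }
    }

With p = 8 this reproduces the message pattern listed on the previous slide: P1->P0, P3->P2, P5->P4, P7->P6, then P2->P0 and P6->P4, and finally P4->P0.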

13 Butterfly Barrier slides6-13 Pairwise synchronizations at each stage (8 processes):
1st stage: P0-P1, P2-P3, P4-P5, P6-P7
2nd stage: P0-P2, P1-P3, P4-P6, P5-P7
3rd stage: P0-P4, P1-P5, P2-P6, P3-P7
(Figure: processes P0-P7 on a time axis showing the three stages of pairwise exchanges.)
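The regular structure of the butterfly makes the code short. A sketch in MPI (p assumed to be a power of two), where each process at stage s exchanges a zero-length message with the partner whose rank differs in bit s:

    /* Butterfly barrier sketch for p = power of 2. */
    void butterfly_barrier(int rank, int p, MPI_Comm comm) {
        int step;
        for (step = 1; step < p; step *= 2) {
            int partner = rank ^ step;                     /* partner differs in exactly one bit */
            MPI_Sendrecv(NULL, 0, MPI_INT, partner, 0,     /* send arrival notice ...           */
                         NULL, 0, MPI_INT, partner, 0,     /* ... and receive the partner's     */
                         comm, MPI_STATUS_IGNORE);
        }
    }

After log2(p) stages every process has, directly or indirectly, heard from every other process, so no separate departure phase is needed. The combined MPI_Sendrecv() routine used here is introduced on a later slide.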

14 Local Synchronization slides6-14 Suppose a process Pi needs to be synchronized and to exchange data with process Pi-1 and process Pi+1 before continuing:
Process Pi-1:      Process Pi:        Process Pi+1:
recv(Pi);          send(Pi-1);        recv(Pi);
send(Pi);          send(Pi+1);        send(Pi);
                   recv(Pi-1);
                   recv(Pi+1);
Not a perfect three-process barrier because process Pi-1 will only synchronize with Pi and continue as soon as Pi allows. Similarly, process Pi+1 only synchronizes with Pi.

15 slides6-15 Deadlock When a pair of processes each send to and receive from each other, deadlock may occur. Deadlock will occur if both processes perform the send first using synchronous routines (or blocking routines without sufficient buffering). Neither send will return, because each waits for a matching receive that is never reached.

16 slides6-16 A Solution Arrange for one process to receive first and then send, and the other process to send first and then receive. Example: in a linear pipeline, deadlock can be avoided by arranging for the even-numbered processes to perform their sends first and the odd-numbered processes to perform their receives first.
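A sketch of this even/odd ordering in MPI for a linear (non-circular) pipeline, where every interior process exchanges a value with both neighbours (exchange() is a hypothetical helper; rank, p and the communicator are assumed to have been set up in the usual way):

    /* Deadlock-free neighbour exchange: even ranks send first, odd ranks receive first (sketch). */
    void exchange(double x, double *left, double *right, int rank, int p, MPI_Comm comm) {
        if (rank % 2 == 0) {
            if (rank + 1 < p) {
                MPI_Send(&x, 1, MPI_DOUBLE, rank + 1, 0, comm);
                MPI_Recv(right, 1, MPI_DOUBLE, rank + 1, 0, comm, MPI_STATUS_IGNORE);
            }
            if (rank > 0) {
                MPI_Send(&x, 1, MPI_DOUBLE, rank - 1, 0, comm);
                MPI_Recv(left, 1, MPI_DOUBLE, rank - 1, 0, comm, MPI_STATUS_IGNORE);
            }
        } else {
            MPI_Recv(left, 1, MPI_DOUBLE, rank - 1, 0, comm, MPI_STATUS_IGNORE);
            MPI_Send(&x, 1, MPI_DOUBLE, rank - 1, 0, comm);
            if (rank + 1 < p) {
                MPI_Recv(right, 1, MPI_DOUBLE, rank + 1, 0, comm, MPI_STATUS_IGNORE);
                MPI_Send(&x, 1, MPI_DOUBLE, rank + 1, 0, comm);
            }
        }
    }

The ordering pairs every blocking send with a receive that its neighbour reaches without waiting on anything else, so there is no circular wait even with synchronous sends.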

17 Combined deadlock-free blocking sendrecv() routines slides6-17 Example:
Process Pi-1:      Process Pi:        Process Pi+1:
sendrecv(Pi);      sendrecv(Pi-1);    sendrecv(Pi);
                   sendrecv(Pi+1);
MPI provides MPI_Sendrecv() and MPI_Sendrecv_replace(). MPI_Sendrecv() actually has 12 parameters!
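For instance, a deadlock-free shift of data around a ring of processes can be written with a single call per process (a sketch; rank, p and comm are assumed to have been obtained in the usual way):

    /* Each process sends its value to the right neighbour and receives the
       left neighbour's value; MPI_Sendrecv handles the ordering internally. */
    int right = (rank + 1) % p;
    int left  = (rank - 1 + p) % p;
    double mine = (double)rank, from_left;
    MPI_Sendrecv(&mine,      1, MPI_DOUBLE, right, 0,   /* send: buf, count, type, dest, tag   */
                 &from_left, 1, MPI_DOUBLE, left,  0,   /* recv: buf, count, type, source, tag */
                 comm, MPI_STATUS_IGNORE);              /* plus communicator and status: 12 parameters in all */

MPI_Sendrecv_replace() does the same but reuses a single buffer for both the outgoing and the incoming message.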

18 Synchronized Computations slides6-18 Can be classified as fully synchronous or locally synchronous. In fully synchronous computations, all processes involved in the computation must be synchronized. In locally synchronous computations, processes only need to synchronize with a set of logically nearby processes, not all the processes involved in the computation.

19 slides6-19 Fully Synchronized Computation Examples Data Parallel Computations Same operation performed on different data elements simultaneously; i.e., in parallel. Particularly convenient because: Ease of programming (essentially only one program). Can scale easily to larger problem sizes. Many numeric and some non-numeric problems can be cast in a data parallel form.

20 slides6-20 Example To add the same constant to each element of an array:
    for (i = 0; i < n; i++)
        a[i] = a[i] + k;
The statement a[i] = a[i] + k; could be executed simultaneously by multiple processors, each using a different index i (0 <= i < n).

21 Data Parallel Computation slides6-21 (Figure: the instruction a[] = a[] + k; is broadcast to all processors; processor i executes a[i] = a[i] + k; on its own element, for i = 0, 1, ..., n-1.)

22 slides6-22 forall construct A special parallel construct in parallel programming languages to specify data parallel operations. Example:
    forall (i = 0; i < n; i++) {
        body
    }
states that n instances of the statements of the body can be executed simultaneously. One value of the loop variable i is valid in each instance of the body: the first instance has i = 0, the next i = 1, and so on.

23 slides6-23 To add k to each element of an array, a, we can write:
    forall (i = 0; i < n; i++)
        a[i] = a[i] + k;

24 slides6-24 Data parallel technique applied to multiprocessors and multicomputers. Example: to add k to the elements of an array:
    i = myrank;
    a[i] = a[i] + k;    /* body */
    barrier(mygroup);
where myrank is a process rank between 0 and n - 1.
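For comparison, an SPMD version of the same operation in MPI, where each of the n processes owns one element of the distributed array, might be (a sketch; the initial values are placeholders):

    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank;
        double a_local, k = 3.0;            /* this process's element and the constant */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        a_local = (double)rank;             /* placeholder initialisation of a[myrank] */
        a_local = a_local + k;              /* the data-parallel body */
        MPI_Barrier(MPI_COMM_WORLD);        /* plays the role of barrier(mygroup) above */

        MPI_Finalize();
        return 0;
    }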

25 slides6-25 Data Parallel Example - Prefix Sum Problem Given a list of numbers, x0, ..., xn-1, compute all the partial summations (i.e., x0 + x1; x0 + x1 + x2; x0 + x1 + x2 + x3; ...). For example, the prefix sums of 1, 2, 3, 4 are 1, 3, 6, 10. Can also be defined with associative operations other than addition. Widely studied, with practical applications in areas such as processor allocation, data compaction, sorting, and polynomial evaluation.

26 slides6-26 Data parallel method of adding all partial sums of 16 numbers

27 Data parallel prefix sum operation slides6-27 (Figure: 16 numbers x0 ... x15. In step 1 (j = 0) each xi with i >= 1 adds in x(i-1); in step 2 (j = 1) each xi with i >= 2 adds in x(i-2); in step 3 (j = 2) each xi with i >= 4 adds in x(i-4); in the final step (j = 3) each xi with i >= 8 adds in x(i-8).)

28 slides6-28 Sequential code:
    for (j = 0; j < log(n); j++)          /* at each step, add */
        for (i = 2^j; i < n; i++)         /* to accumulating sum */
            x[i] = x[i] + x[i - 2^j];
Parallel code:
    for (j = 0; j < log(n); j++)          /* at each step, add */
        forall (i = 0; i < n; i++)        /* to sum */
            if (i >= 2^j)
                x[i] = x[i] + x[i - 2^j];
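A plain-C rendering of the method, runnable on one processor, is sketched below. Note that the parallel forall semantics assume every addition in a step reads the values from before that step, so this sequential emulation keeps a copy of the previous step's values (16 sample numbers are used, matching the earlier figure):

    #include <stdio.h>
    #include <string.h>

    #define N 16

    int main(void) {
        double x[N], old[N];
        int i, step;
        for (i = 0; i < N; i++) x[i] = i + 1;          /* sample data: 1, 2, ..., 16 */

        for (step = 1; step < N; step *= 2) {          /* step = 2^j, j = 0, 1, ..., log2(N)-1 */
            memcpy(old, x, sizeof x);                  /* snapshot of the previous step's values */
            for (i = step; i < N; i++)                 /* these iterations are independent: the forall body */
                x[i] = old[i] + old[i - step];
        }

        for (i = 0; i < N; i++) printf("%g ", x[i]);   /* prints 1 3 6 10 ... 136 */
        printf("\n");
        return 0;
    }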
