CSE-40533 Introduction to Parallel Processing. Chapter 5: PRAM and Basic Algorithms


Dr. Izadi, CSE-40533 Introduction to Parallel Processing
Chapter 5: PRAM and Basic Algorithms

- Define PRAM and its various submodels
- Show PRAM to be a natural extension of the sequential computer (RAM)
- Develop five important parallel algorithms that can serve as building blocks

5.1 PRAM Submodels and Assumptions

Fig. 4.6: Conceptual view of a parallel random-access machine (PRAM): processors 0 through p − 1 connected to a shared memory with locations 0, 1, 2, ..., m − 1.

Processor i can do the following in the three phases of one cycle:
1. Fetch an operand from address s_i in shared memory
2. Perform computations on data held in local registers
3. Store a value into address d_i in shared memory

Submodels are defined by whether reads from the same location and writes to the same location are Exclusive or Concurrent (Fig. 5.1):
- EREW: least "powerful," most "realistic"
- ERCW: not useful
- CREW: default
- CRCW: most "powerful," further subdivided

CRCW PRAM is classified according to how concurrent writes are handled. These submodels are all different from each other and from EREW and CREW:

- Undefined: in case of multiple writes, the value written is undefined (CRCW-U)
- Detecting: a code representing a detected collision is written (CRCW-D)
- Common: multiple writes allowed only if all store the same value (CRCW-C); this is sometimes called the consistent-write submodel
- Random: the value written is randomly chosen from those offered (CRCW-R)
- Priority: the processor with the lowest index succeeds in writing (CRCW-P)
- Max/Min: the largest/smallest of the multiple values is written (CRCW-M)
- Reduction: the arithmetic sum (CRCW-S), logical AND (CRCW-A), logical XOR (CRCW-X), or another combination of the multiple values is written

One way to order these submodels is by their computational power:

EREW < CREW < CRCW-D < CRCW-C < CRCW-R < CRCW-P

Theorem 5.1: A p-processor CRCW-P (priority) PRAM can be simulated (emulated) by a p-processor EREW PRAM with a slowdown factor of Θ(log p).

5.2 Data Broadcasting

Broadcasting is built in for the CREW and CRCW models. EREW broadcasting: make p copies of the data in a broadcast vector B.

Making p copies of B[0] by recursive doubling:
  for k = 0 to ⌈log2 p⌉ − 1
    Processor j, 0 ≤ j < p, do
      Copy B[j] into B[j + 2^k]
  endfor

Fig. 5.2: Data broadcasting in EREW PRAM via recursive doubling.
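To make the data movement concrete, here is a minimal sequential simulation of the recursive-doubling broadcast in Python (a sketch, not from the original text; the function name and the choice to iterate only over the 2^k processors that already hold valid copies are assumptions of mine):

import math

def broadcast_recursive_doubling(value, p):
    B = [None] * p                  # broadcast vector in "shared memory"
    B[0] = value
    for k in range(math.ceil(math.log2(p))):
        # One synchronous PRAM step: processor j copies B[j] into B[j + 2^k].
        # Reads (indices < 2^k) and writes (indices >= 2^k) touch disjoint
        # cells, so the step is EREW-safe.
        for j in range(2 ** k):
            if j + 2 ** k < p:
                B[j + 2 ** k] = B[j]
    return B

print(broadcast_recursive_doubling(42, 10))    # ten copies of 42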

Fig. 5.3: EREW PRAM data broadcasting without redundant copying.

EREW PRAM algorithm for broadcasting by Processor i:
  Processor i write the data value into B[0]
  s := 1
  while s < p
    Processor j, 0 ≤ j < min(s, p − s), do
      Copy B[j] into B[j + s]
    s := 2s
  endwhile
  Processor j, 0 ≤ j < p, read the data value in B[j]

EREW PRAM algorithm for all-to-all broadcasting:
  Processor j, 0 ≤ j < p, write own data value into B[j]
  for k = 1 to p − 1
    Processor j, 0 ≤ j < p, do
      Read the data value in B[(j + k) mod p]
  endfor

Both of the preceding algorithms are time-optimal: shared memory is the only communication mechanism, and each processor can read but one value per cycle.
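A similar Python sketch of the all-to-all broadcast (illustrative names; the inner loop stands for the p processors acting in lockstep):

def all_to_all_broadcast(data):
    p = len(data)
    B = list(data)                          # processor j wrote B[j]
    received = [[B[j]] for j in range(p)]   # each starts with its own value
    for k in range(1, p):                   # p - 1 read rounds
        for j in range(p):
            # The skewed index (j + k) mod p puts the p simultaneous reads
            # of one round on p distinct cells: exclusive-read safe.
            received[j].append(B[(j + k) % p])
    return received

print(all_to_all_broadcast([3, 1, 4, 1]))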

In the following naive sorting algorithm, processor j determines the rank R[j] of its data element S[j] by examining all the other data elements; it then writes S[j] into element R[j] of the output (sorted) vector.

Naive EREW PRAM sorting algorithm (using all-to-all broadcasting):
  Processor j, 0 ≤ j < p, write 0 into R[j]
  for k = 1 to p − 1
    Processor j, 0 ≤ j < p, do
      l := (j + k) mod p
      if S[l] < S[j] or (S[l] = S[j] and l < j)
        then R[j] := R[j] + 1
      endif
  endfor
  Processor j, 0 ≤ j < p, write S[j] into S[R[j]]

This O(p)-time algorithm is far from optimal, given that p items can be sorted sequentially in O(p log p) time.
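A sequential Python simulation that mirrors the pseudocode above (a sketch; naive_pram_sort and the separate output list are illustrative choices):

def naive_pram_sort(S):
    p = len(S)
    R = [0] * p
    for k in range(1, p):                  # p - 1 comparison rounds
        for j in range(p):                 # processor j, in lockstep
            l = (j + k) % p
            # Ties are broken by index, so equal keys get distinct ranks
            # and the final writes go to distinct cells (exclusive write).
            if S[l] < S[j] or (S[l] == S[j] and l < j):
                R[j] += 1
    out = [None] * p
    for j in range(p):                     # processor j writes S[j] at rank R[j]
        out[R[j]] = S[j]
    return out

print(naive_pram_sort([5, 3, 5, 1, 2]))    # [1, 2, 3, 5, 5]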

5.3 Semigroup or Fan-in Computation

This computation is trivial for a CRCW PRAM of the reduction variety if the reduction operator happens to be the semigroup operation ⊗.

Fig. 5.4: Semigroup computation in EREW PRAM. (The figure shows the partial results i:j, i.e., x_i ⊗ ... ⊗ x_j, held in S after each doubling step.)

EREW PRAM semigroup computation algorithm:
  Processor j, 0 ≤ j < p, copy X[j] into S[j]
  s := 1
  while s < p
    Processor j, 0 ≤ j < p − s, do
      S[j + s] := S[j] ⊗ S[j + s]
    s := 2s
  endwhile
  Broadcast S[p − 1] to all processors

The preceding algorithm is time-optimal (CRCW can do better; see Problem 5.6).

Speedup = p / log2 p
Efficiency = Speedup / p = 1 / log2 p
Utilization = W(p) / (p T(p)) = [(p − 1) + (p − 2) + (p − 4) + ... + (p − p/2)] / (p log2 p) ≈ 1 − 1/log2 p
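A Python sketch of the recursive-doubling semigroup computation, with addition standing in for the generic operator ⊗ (the snapshot copy models one synchronous PRAM step; names are illustrative):

def semigroup(X, op=lambda a, b: a + b):
    p = len(X)
    S = list(X)
    s = 1
    while s < p:
        snapshot = list(S)                 # all reads precede all writes
        for j in range(p - s):             # processors 0 .. p-s-1 act at once
            S[j + s] = op(snapshot[j], snapshot[j + s])
        s *= 2
    return S[p - 1]                        # would then be broadcast to all

print(semigroup([1, 2, 3, 4, 5]))          # 15

Note that this doubling scheme in fact leaves the prefix result 0:j in S[j] for every j, which is exactly the observation exploited in Section 5.4.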

Semigroup computation with each processor holding n/p data elements:
- Each processor combines its own sublist: n/p steps
- Semigroup computation on the p partial results, including broadcasting the final value: 2 log2 p steps

Speedup(n, p) = n / (n/p + 2 log2 p) = p / (1 + (2p log2 p)/n)
Efficiency(n, p) = Speedup / p = 1 / (1 + (2p log2 p)/n)

For p = Θ(n), a sublinear speedup of Θ(n/log n) is obtained. The efficiency in this case is Θ(n/log n)/Θ(n) = Θ(1/log n).

Limiting the number of processors to p = O(n/log n) yields:
Speedup(n, p) = n/O(log n) = Ω(n/log n) = Ω(p)
Efficiency(n, p) = Θ(1)

Using fewer processors than tasks = parallel slack.

Fig. 5.5: Intuitive justification of why parallel slack helps improve efficiency: the degree of parallelism is lower near the root of the computation tree and higher near the leaves.
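A quick numeric check of these formulas (the values of n and p are illustrative, not from the original text):

import math

def speedup(n, p):
    return n / (n / p + 2 * math.log2(p))

def efficiency(n, p):
    return speedup(n, p) / p

n = 1024
print(speedup(n, 32), efficiency(n, 32))   # about 24.4 and 0.76
p = n // int(math.log2(n))                 # parallel slack: p = n/log2(n) = 102
print(speedup(n, p), efficiency(n, p))     # about 43.8 and 0.43, a constant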

5.4 Parallel Prefix Computation

Fig. 5.6: Parallel prefix computation in EREW PRAM via recursive doubling. The data movement is the same as in Fig. 5.4; after the last step, S[j] holds the prefix result 0:j for every j.

Two other solutions, based on divide and conquer:

Fig. 5.7: Parallel prefix computation using a divide-and-conquer scheme. Adjacent pairs of the p inputs are combined (0:1, 2:3, 4:5, ..., p−2:p−1), a parallel prefix computation of size p/2 is performed on the pair results, yielding the odd-indexed prefixes 0:1, 0:3, 0:5, ..., 0:p−1, and one final combining step produces the even-indexed prefixes 0:2, 0:4, ..., 0:p−2.
T(p) = T(p/2) + 2 = 2 log2 p

Fig. 5.8: Another divide-and-conquer scheme for parallel prefix computation. The p/2 even-indexed inputs (0, 2, 4, ..., p−2) and the p/2 odd-indexed inputs (1, 3, 5, ..., p−1) each undergo a parallel prefix computation of size p/2, and the two sets of results are combined in one final step.
T(p) = T(p/2) + 1 = log2 p
Requires commutativity.
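Here is a recursive Python sketch of the first divide-and-conquer scheme (Fig. 5.7); prefix_dc and the power-of-2 length assumption are mine:

def prefix_dc(x, op=lambda a, b: a + b):
    p = len(x)
    if p == 1:
        return list(x)
    # one parallel step: combine adjacent pairs
    pairs = [op(x[2 * i], x[2 * i + 1]) for i in range(p // 2)]
    half = prefix_dc(pairs, op)            # size-p/2 prefix problem
    out = [None] * p
    out[0] = x[0]
    # one more parallel step: odd-indexed prefixes come straight from half,
    # even-indexed ones need a single extra combine
    for i in range(p // 2):
        out[2 * i + 1] = half[i]
        if 2 * i + 2 < p:
            out[2 * i + 2] = op(half[i], x[2 * i + 2])
    return out

print(prefix_dc([1, 2, 3, 4, 5, 6, 7, 8]))  # [1, 3, 6, 10, 15, 21, 28, 36]

The two extra steps per level match the recurrence T(p) = T(p/2) + 2.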

5.5 Ranking the Elements of a Linked List

Fig. 5.9: Example linked list (head → C → F → A → E → B → D, with D the terminal element) and the ranks of its elements: 5, 4, 3, 2, 1, 0, i.e., distance from the terminal. Distances from the head run 1, 2, 3, 4, 5, 6.

Fig. 5.10: PRAM data structures (info, next, and rank arrays) representing the linked list and the ranking results.

List ranking appears to be hopelessly sequential; however, we can in fact use a recursive doubling scheme to determine the rank of each element in optimal time. There exist other problems that appear to be unparallelizable. This is why intuition can be misleading when it comes to determining which computations are or are not efficiently parallelizable (i.e., whether a computation is or is not in NC).

Fig. 5.11: Element ranks initially and after each of the three iterations: 1 1 1 1 1 0, then 2 2 2 2 1 0, then 4 4 3 2 1 0, then 5 4 3 2 1 0.

PRAM list ranking algorithm (via pointer jumping):
  Processor j, 0 ≤ j < p, do  {initialize the partial ranks}
    if next[j] = j
      then rank[j] := 0
      else rank[j] := 1
    endif
  while rank[next[head]] ≠ 0
    Processor j, 0 ≤ j < p, do
      rank[j] := rank[j] + rank[next[j]]
      next[j] := next[next[j]]
  endwhile

Which PRAM submodel is implicit in the preceding algorithm?
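The pointer-jumping algorithm can be simulated sequentially as follows (a sketch: a fixed-point termination test replaces the rank[next[head]] test of the pseudocode, and the snapshot copies model the synchronous steps):

def list_rank(next_):
    p = len(next_)
    nxt = list(next_)
    rank = [0 if nxt[j] == j else 1 for j in range(p)]
    changed = True
    while changed:                          # about log2(p) iterations
        old_rank, old_nxt = list(rank), list(nxt)
        for j in range(p):                  # all processors in lockstep
            # Once several pointers converge on the terminal element,
            # several processors read the same rank cell in one step;
            # a hint for the submodel question above.
            rank[j] = old_rank[j] + old_rank[old_nxt[j]]
            nxt[j] = old_nxt[old_nxt[j]]
        changed = rank != old_rank
    return rank

# List A -> B -> C -> D stored at indices 2, 1, 3, 0 (D is terminal):
print(list_rank([0, 3, 1, 0]))              # [0, 2, 3, 1]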

5.6 Matrix Multiplication

For m × m matrices, C = A × B means:

c_ij = Σ_{k=0}^{m−1} a_ik · b_kj

Sequential matrix multiplication algorithm:
  for i = 0 to m − 1 do
    for j = 0 to m − 1 do
      t := 0
      for k = 0 to m − 1 do
        t := t + a_ik b_kj
      endfor
      c_ij := t
    endfor
  endfor

Fig. 5.12: PRAM matrix multiplication using p = m^2 processors (processor (i, j) computes element c_ij from row i of A and column j of B).

PRAM matrix multiplication algorithm using m^2 processors:
  Processor (i, j), 0 ≤ i, j < m, do
  begin
    t := 0
    for k = 0 to m − 1 do
      t := t + a_ik b_kj
    endfor
    c_ij := t
  end
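A direct Python rendering of the m^2-processor algorithm; the two outer loops enumerate the processors that would run concurrently (names are illustrative):

def pram_matmul(A, B):
    m = len(A)
    C = [[0] * m for _ in range(m)]
    for i in range(m):                 # processor (i, j), all in parallel
        for j in range(m):
            t = 0
            for k in range(m):         # m sequential multiply-add steps
                t += A[i][k] * B[k][j]
            C[i][j] = t
    return C

print(pram_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))   # [[19, 22], [43, 50]]

Note that in step k all m processors of row i read a_ik at the same time, so the algorithm as stated needs concurrent reads (CREW); skewing the access pattern (processor (i, j) starting its inner loop at k = j, say) can avoid the read conflicts.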

PRAM matrix multiplication algorithm using m processors (processor i computes row i of C):
  for j = 0 to m − 1
    Processor i, 0 ≤ i < m, do
      t := 0
      for k = 0 to m − 1 do
        t := t + a_ik b_kj
      endfor
      c_ij := t
  endfor

Both of the preceding algorithms are efficient and provide linear speedup.

Using fewer than m processors: each processor computes m/p rows of C (see the sketch below).

The preceding solution is not efficient for NUMA parallel architectures:
- Each element of B is fetched m/p times by each processor
- For each such data access, only two arithmetic operations are performed
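The p < m case as a Python sketch (assumes p divides m; pram_matmul_rows is an illustrative name):

def pram_matmul_rows(A, B, p):
    m = len(A)
    C = [[0] * m for _ in range(m)]
    rows = m // p                          # m/p rows of C per processor
    for proc in range(p):                  # processors run in parallel
        for i in range(proc * rows, (proc + 1) * rows):
            for j in range(m):
                # every element of B is touched once per owned row,
                # i.e., m/p times per processor: poor locality on NUMA
                C[i][j] = sum(A[i][k] * B[k][j] for k in range(m))
    return C

A = [[1, 2], [3, 4]]
print(pram_matmul_rows(A, A, p=2))         # [[7, 10], [15, 22]]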

Block matrix multiplication:

Fig. 5.13: Partitioning the matrices for block matrix multiplication. Each matrix is viewed as a √p × √p array of q × q blocks, where q = m/√p; one processor computes those elements of C that it holds in local memory. Block-band i of A and block-band j of B combine to form Block (i, j) of C.

Each multiply-add computation on q × q blocks needs 2q^2 = 2m^2/p memory accesses to read the blocks and 2q^3 arithmetic operations. So, q arithmetic operations are done per memory access.

We assume that processor (i, j) has local memory to hold:
- Block (i, j) of the result matrix C (q^2 elements)
- One block-row of B: say, row kq + c of Block (k, j) of B
(Elements of A can be brought in one at a time.)
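A Python sketch of the blocked scheme (assumes √p divides m; the loop order mirrors the description above: each element of A, once fetched, is applied to a whole block-row of B):

def block_matmul(A, B, sqrt_p):
    m = len(A)
    q = m // sqrt_p                        # block size q = m / sqrt(p)
    C = [[0] * m for _ in range(m)]
    for bi in range(sqrt_p):               # processor (bi, bj), in parallel
        for bj in range(sqrt_p):
            for bk in range(sqrt_p):       # one q x q block multiply-add
                for i in range(bi * q, (bi + 1) * q):
                    for k in range(bk * q, (bk + 1) * q):
                        a = A[i][k]        # one element of A at a time
                        for j in range(bj * q, (bj + 1) * q):
                            C[i][j] += a * B[k][j]   # update block-row of C
    return C

A = [[1, 0, 2, 0], [0, 1, 0, 2], [3, 0, 1, 0], [0, 3, 0, 1]]
ref = [[sum(A[i][k] * A[k][j] for k in range(4)) for j in range(4)]
       for i in range(4)]
assert block_matmul(A, A, 2) == ref        # agrees with the plain triple loop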

For example, as the element in row iq + a of column kq + c in Block (i, k) of A is brought in, it is multiplied in turn by the q locally stored elements of B, and the results are added to the appropriate q elements of C.

Fig. 5.14: How Processor (i, j) operates on one element of A and one block-row of B (row kq + c of Block (k, j)) to update one block-row of C.

On the Cm* NUMA-type shared-memory multiprocessor, this block algorithm exhibited good, but sublinear, speedup: with p = 16 processors, a speedup of 5 in multiplying 24 × 24 matrices, improving to 9 (11) for larger 36 × 36 (48 × 48) matrices.

The improved locality of block matrix multiplication can also improve the running time on a uniprocessor, or on a distributed shared-memory multiprocessor with caches. Reason: higher cache hit rates.