CS256 Applied Theory of Computation

CS256 Applied Theory of Computation
Parallel Computation IV
John E Savage

Overview
- PRAM
- Work-time framework for parallel algorithms
- Prefix computations
- Finding roots of trees in a forest
- Parallel merging
- Parallel partitioning
- Sorting
- Simulating CRCW on EREW PRAM

Lect 22, Parallel Computation IV, CS256 © John E Savage

PRAM Model
The PRAM is an abstract programming model. Processors operate synchronously, reading from memory, executing locally, and writing to memory.

PRAM Model
Four types: EREW, ERCW, CREW, CRCW (R = read, W = write, E = exclusive, C = concurrent).
Can Boolean functions be computed quickly? How can a function be represented? Can we use concurrency to good advantage? Is this use of concurrency realistic?
Good source: An Introduction to Parallel Algorithms by JaJa.

Matrix Multiplication on PRAM
Input: n x n matrices A and B, n = 2^k. Output: n x n matrix C. Uses n^3 processors, indexed (i,j,l), with local variables C(i,j,l).

begin
  C(i,j,l) := A(i,l) * B(l,j)
  for h = 1 to log2 n do
    if l <= n/2^h then C(i,j,l) := C(i,j,2l-1) + C(i,j,2l)
  if l = 1 then C(i,j) := C(i,j,1)
end

Running time on the CREW PRAM is O(log n). Why is CREW necessary? Each entry A(i,l) is read concurrently by the n processors (i,j,l) for all j, and similarly for B(l,j).
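As a sanity check, the scheme above can be traced sequentially in Python (an illustration of mine, not the slide's code; it assumes n is a power of 2 and uses 0-indexed arrays):

```python
import math

def pram_matmul(A, B):
    n = len(A)
    # Step 1: every "processor" (i, j, l) computes one product in parallel.
    C3 = [[[A[i][l] * B[l][j] for l in range(n)] for j in range(n)]
          for i in range(n)]
    # Step 2: ceil(log2 n) rounds of pairwise tree summation over the l index.
    for h in range(1, math.ceil(math.log2(n)) + 1):
        width = n // 2 ** h
        for i in range(n):
            for j in range(n):
                for l in range(width):
                    C3[i][j][l] = C3[i][j][2 * l] + C3[i][j][2 * l + 1]
    # The surviving cell l = 0 holds C(i, j).
    return [[C3[i][j][0] for j in range(n)] for i in range(n)]
```

The triple loop over (i, j, l) corresponds to one synchronous PRAM step per round; only the reduction over l consumes the O(log n) rounds.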

Measuring Performance of Parallel Programs
Let T(n) be the time and P(n) the number of processors used by a parallel machine on a problem with n inputs.
Cost: C(n) = P(n)T(n) is the time-processor product, or work, for the problem on n inputs. An equivalent serial algorithm runs in time O(C(n)).
If p <= P(n) processors are available, we can implement the algorithm in time O(P(n)T(n)/p) = O(C(n)/p).

Advantages of PRAM
- A large body of algorithms exists for this shared-memory model.
- The model ignores the algorithmic details of synchronization and communication.
- It makes explicit the association between operations and processors.
- PRAM algorithms are robust: network-based algorithms can be derived from them.
- The PRAM is a MIMD model.

Work-Time Framework for Parallel Algorithms
An informal guideline to algorithm performance on the PRAM; the work-time framework exhibits parallelism.
Use "for l <= i <= u pardo" for parallel operations; serial straight-line and branching operations are also allowed.
W(n) (work) is the total number of operations on n inputs; T(n) is the running time of the algorithm.

Work-Time Framework: Sum
Input: n = 2^k values in array A[1..n]. Output: sum S = A(1) + A(2) + ... + A(n).

begin
  Copy A[1..n] to B[1..n]
  for h = 1 to log2 n do
    for 1 <= i <= n/2^h pardo
      B(i) := B(2i-1) + B(2i)
  S := B(1)
end
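A sequential Python trace of this tree sum (my sketch, not the slide's code; 0-indexed, n a power of 2):

```python
import math

def pram_sum(A):
    # Slide's B(i) := B(2i-1) + B(2i) becomes B[i] = B[2i] + B[2i+1] 0-indexed.
    B = list(A)
    n = len(B)
    for h in range(1, int(math.log2(n)) + 1):
        # On the PRAM these n/2^h updates happen in one parallel step; the
        # sequential order is safe here because the write lands at index i
        # while later iterations read only indices 2i and 2i+1.
        for i in range(n // 2 ** h):
            B[i] = B[2 * i] + B[2 * i + 1]
    return B[0]
```

The work is W(n) = n/2 + n/4 + ... + 1 = n - 1 = O(n) and the time is T(n) = O(log n), matching the framework's accounting.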

Work-Time Framework: Rescheduling
Generally, through rescheduling the following bounds can be achieved: if p processors are available, the time is O(W(n)/p + T(n)).
Cost: C(n) = p(W(n)/p + T(n)) = O(W(n) + p T(n)).

When Is a Parallel Computation Optimal?
If the smallest parallel work is about the same as the best serial time, the parallel machine cannot run substantially faster.

Prefix Computation
A powerful operation with many applications.
Input: (x_1,...,x_n), n = 2^k elements of S. Output: (s_1,...,s_n), where s_i = x_1 * ... * x_i and * is an associative operation.

begin
  if n = 1 then (s_1 := x_1; exit)
  for 1 <= i <= n/2 pardo
    y_i := x_{2i-1} * x_{2i}
  Compute the prefix (z_1,...,z_{n/2}) of (y_1,...,y_{n/2}) recursively
  for 1 <= i <= n pardo
    i even:       s_i := z_{i/2}
    i = 1:        s_1 := x_1
    i odd, i > 1: s_i := z_{(i-1)/2} * x_i
end
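The recursion can be rendered sequentially in Python (my sketch, not the slide's code; 0-indexed, n a power of 2, op any associative operation):

```python
def prefix(x, op):
    n = len(x)
    if n == 1:
        return [x[0]]
    # pardo round: pairwise combination y_i = x_{2i-1} * x_{2i}
    y = [op(x[2 * i], x[2 * i + 1]) for i in range(n // 2)]
    z = prefix(y, op)            # recurse on the half-size input
    s = [None] * n
    for i in range(n):           # pardo round: expand z back to full length
        if i == 0:
            s[i] = x[0]                      # 1-indexed case i = 1
        elif i % 2 == 1:
            s[i] = z[i // 2]                 # 1-indexed even positions
        else:
            s[i] = op(z[i // 2 - 1], x[i])   # 1-indexed odd positions > 1
    return s
```

Each level does O(n) work in O(1) parallel time, giving the W(n) = O(n), T(n) = O(log n) recurrences analyzed below.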

Prefix Computation
[Graphical illustration: the recursion on n = 8 inputs, from x_1..x_8 through pairwise products y_1..y_4 and recursive prefixes z_1..z_4 up to outputs s_1..s_8.]

Prefix Sum: Analysis
We show the algorithm does work W(n) = O(n) and uses time T(n) = O(log n).
The work and time to compute (z_1,...,z_{n/2}) from (y_1,...,y_{n/2}) are W(n/2) and T(n/2). The additional work to compute the prefix sum is at most bn for some constant b > 0; the additional time is a constant a > 0.
  W(n) = W(n/2) + bn
  T(n) = T(n/2) + a
These recurrences imply W(n) = O(n) and T(n) = O(log n).

Examples of Prefix Computations
- Integer addition and multiplication
- Max and min
- Boolean AND and OR
- copy-right: a @ b = a (broadcasts x_1 to every position)

Segmented Prefix Computation
Input: two n-tuples, elements (x_1,...,x_n) from S and flags (f_1,...,f_n) with f_i in B = {0,1}.
Output: y_i = x_i if f_i = 1, else y_i = y_{i-1} * x_i.
Alternate view of the input: pairs ((x_1,f_1),...,(x_n,f_n)).
New operation & : (S x B)^2 -> S x B.
& is associative, i.e. (x_1,f_1) & ((x_2,f_2) & (x_3,f_3)) = ((x_1,f_1) & (x_2,f_2)) & (x_3,f_3).
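One standard way to realize & (a sketch under my own conventions, not taken from the slides): pair each value with its flag and combine pairs with the operator below, which is associative, so the prefix algorithm above applies unchanged.

```python
def seg_op(a, b, op):
    # Associative operator on (value, flag) pairs for segmented scans.
    (xa, fa), (xb, fb) = a, b
    if fb:                        # b starts a new segment: discard a's value
        return (xb, 1)
    return (op(xa, xb), fa or fb)

def segmented_prefix(x, f, op):
    # Sequential left fold; the parallel version scans the same pairs with
    # seg_op via the O(log n) prefix algorithm. Assumes f[0] == 1 so the
    # first output is well defined.
    out, acc = [], None
    for xi, fi in zip(x, f):
        acc = (xi, fi) if acc is None else seg_op(acc, (xi, fi), op)
        out.append(acc[0])
    return out
```

Each flag f_i = 1 resets the running prefix, which is exactly the segmented behavior y_i = x_i on segment boundaries.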

Finding Roots of Trees in a Forest
Pointer doubling in a linked list: replace the pointer from node i to its successor P(i) by a pointer to the successor's successor, P(P(i)).
Given a forest of rooted trees with parent pointers P(i), find the root S(i) of each node i. A root r has S(r) = r.

begin
  for 1 <= i <= n pardo
    S(i) := P(i)
    while S(i) != S(S(i)) do
      S(i) := S(S(i))
end
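A sequential simulation of the pointer-doubling rounds (my sketch; the forest is a 0-indexed parent array P with P[r] == r at each root r):

```python
def find_roots(P):
    S = list(P)
    changed = True
    while changed:
        changed = False
        # One synchronous pardo round: every pointer is squared at once,
        # so all reads use the old S before any write takes effect.
        S_new = [S[S[i]] for i in range(len(S))]
        if S_new != S:
            S, changed = S_new, True
    return S
```

Because the distance to the root halves (at least) every round, all roots are found in O(log n) rounds.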

Parallel Merging
Two sorted sequences of length n can be merged in O(log n) time with O(n) operations.
Goal: compute rank(A : B), the ranks of the elements of A in B, for two sets of distinct elements.
Approach: rank the individual elements of A in B; rank each x in A by binary search for x in B.
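The ranking idea in Python (a sketch of mine, not the author's code): the final position of each element is its own index plus its rank in the other list, and every rank is an independent binary search, hence one O(log n) parallel step.

```python
import bisect

def rank_merge(A, B):
    out = [None] * (len(A) + len(B))
    for i, a in enumerate(A):      # pardo over elements of A
        out[i + bisect.bisect_left(B, a)] = a
    for j, b in enumerate(B):      # pardo over elements of B
        out[j + bisect.bisect_right(A, b)] = b
    return out
```

With distinct elements the target positions never collide, so all writes are exclusive.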

Parallel Partitioning on CREW
Given sorted arrays of distinct elements A = (a_1,...,a_n) and B = (b_1,...,b_m), where log2 m and k(m) = m/log2 m are integers, partition B into k(m) segments B_1,...,B_{k(m)} of length log2 m.
Partition A into k(m) contiguous segments A_1,...,A_{k(m)} such that the elements of A_i and B_i are less than all elements of A_{i+1} and B_{i+1} and greater than all elements of A_{i-1} and B_{i-1}.
Procedure: compute j(i) = rank(b_{i log2 m} : A) and use it to define A_i = (a_{j(i)+1},...,a_{j(i+1)}).

Parallel Partitioning on CREW
1. j(0) := 0; j(k(m)) := n
2. for 1 <= i <= k(m)-1 pardo
     j(i) := rank(b_{i log2 m} : A)   (by binary search)
3. for 0 <= i <= k(m)-1 pardo
     B_i := (b_{i log2 m + 1},...,b_{(i+1) log2 m})
     A_i := (a_{j(i)+1},...,a_{j(i+1)})
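A sequential sketch of the three steps (a hypothetical helper of mine, not the author's code): 0-indexed arrays, m a power of 2 so that log2(m) and m/log2(m) are integers, and all elements distinct.

```python
import bisect
import math

def partition(A, B):
    m = len(B)
    lg = int(math.log2(m))      # segment length log2(m)
    k = m // lg                 # number of segments k(m)
    j = [0] * (k + 1)
    j[0], j[k] = 0, len(A)      # step 1
    for i in range(1, k):       # step 2: independent binary searches (pardo)
        # B[i*lg - 1] is b_{i log2 m} in the slide's 1-indexed notation;
        # its rank in A is the number of elements of A below it.
        j[i] = bisect.bisect_left(A, B[i * lg - 1])
    # Step 3: form the segment pairs (A_i, B_i).
    return [(A[j[i]:j[i + 1]], B[i * lg:(i + 1) * lg]) for i in range(k)]
```

Each pair (A_i, B_i) can then be merged independently, which is what the merging slide below exploits.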

Parallel Partitioning Theorem
Theorem: Parallel partitioning of sorted lists A = (a_1,...,a_n) and B = (b_1,...,b_m) takes O(log n) time and O(n+m) operations.
Proof: Step 1 takes O(1) sequential time. Step 2 takes O(log n) parallel time and uses O((log n)(m / log2 m)) = O(n+m) operations. Step 3 takes O(1) parallel time and uses a linear number of operations.

Merging
Merging the corresponding segments A_i and B_i produces a sorted list. However, the length of an A_i may be large. To reduce the running time, partition the pairs (B_i, A_i) again, in parallel, so that all segments have length at most log2 n.
Merging two sorted lists of length n on the CREW PRAM takes O(log n) time and O(n) operations. The merging time can be reduced to O(log log n) on the CREW PRAM by clever use of ranking.

Sorting
Merge sort on lists of length 1, 2, 4, 8, etc. runs in time O(log^2 n) using the partitioning algorithm above. This can be improved to O(log n log log n) on the CREW PRAM.

Simulating CRCW on EREW PRAM
The CRCW PRAM can be simulated on the EREW PRAM with a slowdown proportional to the time to sort a list of p items, as we show.
Consider a reading cycle. Processor p_j puts into location M_j the pair (a_j, j), indicating that it wishes to read from location a_j. Sort the pairs by address. Then p_i reads location M_i in a first step and location M_{i-1} in a second. If the two hold different addresses, p_i reads from the address given in M_i and puts the value into M_i.

Simulating CRCW on EREW PRAM
Using a segmented copy-right prefix computation (see book), copy the value at each requested location a to all locations M_k that request it. This step takes O(log p) time on a p-processor PRAM; the sorting takes more time.
To write, repeat the procedure, except let p_j store in M_j the value v to be written, and let the first (arbitrary) value be the one written.