COMP 633 Parallel Computing. PRAM (1): The PRAM model and complexity measures


COMP 633 - Parallel Computing. Lecture 2, August 24, 2017: The PRAM model and complexity measures

First class summary
This course is about parallel computing to achieve higher performance on individual problems:
- start with the high-level PRAM model: study algorithms and asymptotic complexity
- subsequently focus on more practical models from an implementation point of view: shared memory, distributed memory, distributed computing
- study hardware organization, programming models, performance prediction and analysis
- examine various algorithms and case studies

Topics today
- PRAM model: execution model, programming model
- Work-Time model: programming model, complexity metrics
- Brent's theorem: translation to PRAM programs
- Parallel prefix algorithm: derivation, applications

PRAM model of parallel computation
PRAM = Parallel Random Access Machine
- p processors + shared memory
- each processor has a unique identity 1 ≤ i ≤ p
- SIMD operation (synchronous PRAM):
  - each processor may be active or inactive
  - each instruction is executed by all active processors
  - each instruction completes in unit time
[Figure: processors 1..p, each with an "active?" flag, connected to a shared memory]

PRAM program
A PRAM program is a sequential program in which:
- expressions involving the processor id i have a unique value in each processor
- i can be used as an array index: X[i] := i
- conditionals specify the active processors:
  if odd(i) then X[i] := X[i] + X[i+1] endif
  if i ≤ 2 then X[i] := 1 else X[i] := -1 endif
[Figure: X[1..4] updated by processors 1..4]

Concurrent memory access - Read
Concurrent reads (CR): all readers of a given location see the same value
- X[i] := y           (the value of y is read concurrently by all p processors)
- X[i] := B[⌈i/2⌉]    (some locations in B are read concurrently by two processors)
Eliminating bounded-degree concurrent reads: replace X[i] := B[⌈i/2⌉] with
  if odd(i)  then X[i] := B[⌈i/2⌉] endif
  if even(i) then X[i] := B[⌈i/2⌉] endif
Ex. p = 6: B = [1 2 3] yields X = [1 1 2 2 3 3]
The concurrent read is eliminated, but the number of steps is doubled.

Concurrent memory access - Write
Concurrent writes (CW): the stored value depends on the write arbitration policy:
- Arbitrary CW: nondeterministic choice among the values written
- Common CW: all processors that write a value must write the same value, else error
- Priority CW: the value written by the processor with the lowest processor id
- Combining CW: all values written are combined using a specified associative operation, e.g. +
Example p = 6, X = [10 20 30 40 50 60]:
  y := X[i]            (all p processors write concurrently to y)
  B[⌈i/2⌉] := X[i]     (two processors write concurrently to each location of B)
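As a concrete illustration of the arbitration policies above, here is a small Python sketch (the function name and the representation of writes are illustrative choices, not part of the model) that resolves one concurrent-write step to a single memory location:

```python
def write_step(writes, policy):
    """Resolve concurrent writes to one location under a PRAM CW policy.

    writes: list of (processor_id, value) pairs, all targeting one cell."""
    values = [v for _, v in writes]
    if policy == "arbitrary":
        return values[0]        # any one written value is a legal outcome
    if policy == "common":
        if len(set(values)) > 1:
            raise ValueError("Common CW: processors wrote different values")
        return values[0]
    if policy == "priority":
        return min(writes)[1]   # processor with lowest id wins
    if policy == "combining":
        return sum(values)      # combine with an associative op, here +
    raise ValueError("unknown policy: " + policy)
```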

Concurrent writes exercise:
Let B[1:p] be an array of boolean values and define c = B[1] ∨ B[2] ∨ ... ∨ B[p].
Use p processors and concurrent writes to compute c in a constant number of steps:
(a) with combining CW
(b) with a CW policy other than combining CW — which?
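A sketch of one possible answer for part (b), serialized in Python: with Common CW, initialize c := false, then every processor i with B[i] true concurrently writes the value true to c. All writers write the same value, so Common CW suffices, and only a constant number of parallel steps are used (the loop below stands in for the single concurrent-write step):

```python
def parallel_or(B):
    # Constant-step OR under Common CW: c starts false; each processor i
    # with B[i] true writes true to c. All writes agree, so no conflict.
    c = False
    for b in B:          # simulated single parallel write step
        if b:
            c = True
    return c
```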

Concurrent memory access - PRAM variants
EREW, CREW, ERCW, CRCW
- differ in performance, not expressive power: EREW < CREW < CRCW
- loosely reflect the difficulty of implementing the model
The following are considered EREW:
- references to processor id i, number of processors p, problem size n
- references to local variables: local h; h := 2*i + 1; X[h] := X[i]
- expression evaluation is synchronous, e.g. X[i] := X[i] + X[i+1] is EREW (all reads precede all writes)

A PRAM program
Simple problem: vector addition. Given V, W vectors of length n, compute Z = V + W.
The PRAM program is constructed to operate with an arbitrary problem size n ≥ number of processors p; the work to be performed must be explicitly scheduled across the processors.
Time complexity with p procs: T_c(n,p) = ⌈n/p⌉. PRAM model? (EREW suffices.)

  Input:  V[1:n], W[1:n] in shared memory
  Output: Z[1:n] in shared memory
  (n ≥ p)-processor PRAM; proc id i; local integer h, k
  for h := 1 to ⌈n/p⌉ do
      k := (h-1)*p + i
      if k ≤ n then Z[k] := V[k] + W[k] endif
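The blocked schedule above can be simulated sequentially in Python (a sketch: the inner loop over i stands in for the p processors executing one synchronous step):

```python
import math

def pram_vector_add(V, W, p):
    """Simulate the blocked PRAM schedule: processor i handles elements
    i, i+p, i+2p, ... over ceil(n/p) synchronous rounds."""
    n = len(V)
    Z = [None] * n
    for h in range(1, math.ceil(n / p) + 1):  # one parallel step per round
        for i in range(1, p + 1):             # simulated processors 1..p
            k = (h - 1) * p + i
            if k <= n:
                Z[k - 1] = V[k - 1] + W[k - 1]
    return Z
```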

Work-Time paradigm
W-T parallel programming model:
- a high-level PRAM programming model
- specifies available parallelism; no explicit scheduling of parallelism over processors
- simplifies algorithm presentation and analysis
- W-T programs can be mechanically translated to PRAM programs
A W-T program is a sequential program plus the forall construct (specification of available parallelism). The number of processors is not a parameter of the model!

W-T program for vector addition:
  Input:  V[1:n], W[1:n]
  Output: Z[1:n]
  forall i in 1:n do Z[i] := V[i] + W[i]

Programming notation for the W-T framework
Standard sequential programming notation:
- statements: assignment, statement composition, alternative construct (if ... then ... else ... endif), repetitive constructs (for, while)
- expressions: arithmetic and logical functions, variable references, recursive function and procedure invocation
forall statement:
  forall i in D do (statement T depending on i)
- specifies that T may be executed simultaneously for each value of i in D
- no restriction on T: it can be a sequence of statements and can invoke recursive functions

W-T complexity metrics
- Work complexity W(n): total number of operations performed, as a function of input size n
- Step complexity S(n): number of parallel steps required, as a function of input size n, assuming unbounded parallelism
Both are inductively defined over the constructs of the W-T programming notation.

W-T complexity measures: simple example
  forall i in 2:n-1 do
      R[i] := (R[i-1] + R[i] + R[i+1])/3

  for h := 1 to k do
      forall i in 2:n-1 do
          R[i] := (R[i-1] + R[i] + R[i+1])/3

[Figure: array R[1..n]]
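One parallel step of the smoothing forall above can be sketched in Python. The point the sketch makes is the synchronous semantics: in a W-T step all reads happen before any writes, which the snapshot copy models (function name and list representation are illustrative):

```python
def smooth_step(R):
    # One W-T parallel step: every interior element is replaced by the mean
    # of its 3-element neighborhood. All reads precede all writes, so we
    # read from a snapshot of the old values.
    n = len(R)
    new = R[:]                      # snapshot: the "read phase"
    for i in range(1, n - 1):       # forall i in 2:n-1 (1-based on the slide)
        new[i] = (R[i - 1] + R[i] + R[i + 1]) / 3
    return new
```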

Work and step complexity of the forall construct
How do we define the work and step complexity of the forall construct?
  P: forall i in D do (body T depending on i)
Assume we can determine W_T(i) and S_T(i) for each i in D. Then
  W(P) = Σ_{i in D} W_T(i)
  S(P) = max_{i in D} S_T(i)

W-T complexity measures: vector summation
Let n = 2^k.
  forall i in 1:n/2 do
      S[i] := S[2i-1] + S[2i]

  for h := 1 to k do
      forall i in 1:n/2^h do
          S[i] := S[2i-1] + S[2i]
[Figure: summation tree over S[1..n], shown for n = 4, k = 2]

W-T complexity measures: vector summation
Vector summation (sum-reduction): given V[1..n], n = 2^k, compute s = sum(V[1:n]).
Optimal sequential time: T_s(n) = Θ(n). Complexity: W(n) = Θ(n), S(n) = Θ(lg n).

  Input:  V[1:n] vector of integers, n = 2^k
  Output: s = sum(V[1:n])
  P1: forall i in 1:n do B[i] := V[i]
  P2: for h := 1 to k do
          forall i in 1:n/2^h do B[i] := B[2i-1] + B[2i]
  P3: s := B[1]

PRAM model needed? (EREW suffices.)
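The tree reduction P1-P3 above can be sketched in Python (sequential simulation, assuming n is a power of two as the slide does). Each round h halves the number of active elements, giving k = lg n rounds, so S(n) = O(lg n) and W(n) = n + n/2 + ... + 1 = O(n):

```python
def wt_vector_sum(V):
    # W-T sum-reduction sketch for n = 2**k.
    n = len(V)
    assert n & (n - 1) == 0, "length must be a power of two"
    B = V[:]                               # P1: copy the input
    k = n.bit_length() - 1                 # k = lg n
    for h in range(1, k + 1):              # P2: k parallel rounds
        for i in range(1, n // 2**h + 1):  # forall i in 1:n/2^h
            B[i - 1] = B[2 * i - 2] + B[2 * i - 1]
    return B[0]                            # P3: s := B[1]
```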

Brent's theorem
Brent's theorem schedules a W-T program for a p-processor PRAM.
Idea:
- simulate each parallel step of the W-T program using the p processors
- the work W_i(n) to be performed in step i can be completed using p processors in time ⌈W_i(n)/p⌉
- bound the concurrent runtime T_c(n,p) of the resulting PRAM program by summing over all S(n) steps:

  T_c(n,p) ≤ Σ_{i=1}^{S(n)} ⌈W_i(n)/p⌉ ≤ Σ_{i=1}^{S(n)} (W_i(n)/p + 1) = W(n)/p + S(n)
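A small Python check of the bound (the per-step work list [8, 4, 2, 1] is the copy step plus the three addition rounds of the n = 8 sum-reduction; the helper name is illustrative):

```python
import math

def brent_bound(step_work, p):
    # Brent's theorem: simulating a W-T step of work W_i on p processors
    # takes ceil(W_i / p) time, so T_c <= sum(ceil(W_i / p)) <= W/p + S.
    T = sum(math.ceil(w / p) for w in step_work)
    W, S = sum(step_work), len(step_work)
    assert T <= W / p + S          # the theorem's upper bound holds
    return T
```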

Scheduling the W-T vector summation algorithm

W-T vector summation algorithm:
  Input:  V[1:n] vector of integers, n = 2^k
  Output: s = sum(V[1:n])
  P1: forall i in 1:n do B[i] := V[i]
  P2: for h := 1 to k do
          forall i in 1:n/2^h do B[i] := B[2i-1] + B[2i]
  P3: s := B[1]

PRAM vector summation algorithm:
  Input:  V[1:n] vector of integers, n = 2^k
  Output: s = sum(V[1:n])
  p > 0 processor PRAM; processor index i; local integer j, r
  P1: for j := 1 to ⌈n/p⌉ do
          r := (j-1)*p + i
          if r ≤ n then B[r] := V[r] endif
  P2: for h := 1 to k do
          for j := 1 to ⌈(n/2^h)/p⌉ do
              r := (j-1)*p + i
              if r ≤ n/2^h then B[r] := B[2r-1] + B[2r] endif
  P3: if i = 1 then s := B[1] endif

Performance of the translated W-T program
Count the steps needed to perform the additions. Brent's theorem predicts
  T_c(n,p) ≤ (n-1)/p + O(lg n)
Counts for various p (n = 2^k):
  p = 1:            T_c(n,1) = n - 1
  p = √n (k even):  T_c(n,p) ≈ √n
  p = n:            T_c(n,n) = lg n
The upper bound is tight for this program, and the translation retains the EREW model.
(PRAM vector summation algorithm as on the previous slide.)

Parallel prefix sum
Inclusive prefix sum:
  Input:  sequence X of n = 2^k elements, binary associative operator +
  Output: sequence S of n = 2^k elements, with S_i = x_1 + ... + x_i
Example: X = [1, 4, 3, 5, 6, 7, 0, 1]
         S = [1, 5, 8, 13, 19, 26, 26, 27]
Optimal sequential time: T_s(n) = Θ(n).
Uses of prefix sum:
- efficient parallel implementation of a sequential scan through consecutive actions.
  Ex: given a series of bank transactions T[1:n], with each T[i] positive or negative and T[1] the opening deposit > 0 — was the account ever overdrawn?
- explicit or implicit component of many parallel algorithms
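The bank-transaction example reduces directly to a prefix sum: the balance after transaction i is T[1] + ... + T[i], so the account was overdrawn iff some prefix sum is negative. A Python sketch, with `itertools.accumulate` standing in for the parallel SCAN:

```python
from itertools import accumulate

def ever_overdrawn(T):
    # Running balances are exactly the inclusive prefix sums of T; the
    # account was overdrawn iff some balance dips below zero.
    return any(balance < 0 for balance in accumulate(T))
```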

Prefix sum algorithm
Recursive solution. X(i) stands for X[i], and X(i,j) stands for X[i] + X[i+1] + ... + X[j]:
  S: X(1,1) X(1,2) X(1,3) X(1,4) X(1,5) X(1,6) X(1,7) X(1,8)
  Z: X(1,2) X(1,4) X(1,6) X(1,8)    ← recursive prefix sum of Y
  Y: X(1,2) X(3,4) X(5,6) X(7,8)
  X: X(1) X(2) X(3) X(4) X(5) X(6) X(7) X(8)

Parallel prefix sum algorithm (W-T model)
  Input:  X[1..n] vector of integers
  Output: S[1..n]
  par_prefix_sum(X[1..n]) =
      var Y[1..n/2], Z[1..n/2], S[1..n]
      S[1] := X[1]
      if n > 1 then
          forall 1 ≤ i ≤ n/2 do Y[i] := X[2i-1] + X[2i]
          Z[1..n/2] := par_prefix_sum(Y[1..n/2])
          forall 2 ≤ i ≤ n do
              if even(i) then S[i] := Z[i/2]
              else S[i] := Z[(i-1)/2] + X[i] endif
      endif
      return S[1..n]
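The recursive algorithm above translates directly into Python (a sequential sketch assuming n is a power of two; the comprehensions stand in for the two forall steps):

```python
def par_prefix_sum(X):
    # Recursive W-T prefix sum: pair up adjacent elements, recursively
    # scan the half-length sequence, then expand back to full length.
    n = len(X)
    if n == 1:
        return X[:]
    Y = [X[2 * i] + X[2 * i + 1] for i in range(n // 2)]  # pairwise sums
    Z = par_prefix_sum(Y)                                 # recurse on n/2
    S = [0] * n
    S[0] = X[0]
    for i in range(2, n + 1):          # 1-based i, as on the slide
        if i % 2 == 0:
            S[i - 1] = Z[i // 2 - 1]                      # S[i] := Z[i/2]
        else:
            S[i - 1] = Z[(i - 1) // 2 - 1] + X[i - 1]     # Z[(i-1)/2] + X[i]
    return S
```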

Balanced trees in arrays
Balanced tree ascend/descend. Key idea:
- view the input data as a balanced binary tree
- sweep the tree up and/or down
The tree is not a data structure but a control structure (e.g., recursion).
Example: vector summation.
[Figure: in-place ascend on [1 2 3 4 5 6 7 8] → [1 3 3 7 5 11 7 15] → [1 3 3 10 5 11 7 26] → [1 3 3 10 5 11 7 36]]

In-place prefix sum
The ascend phase computes pairwise partial sums up the tree; the descend phase combines the retained values back down:
  ascend:  [1 2 3 4 5 6 7 8] → [1 3 3 7 5 11 7 15] → [1 3 3 10 5 11 7 26] → [1 3 3 10 5 11 7 36]
  descend: ... → [1 3 6 10 15 21 28 36]
S(n) = Θ(lg n), W(n) = Θ(n), space Θ(n) (in place). PRAM model? (EREW.)

In-place prefix-sum algorithm (W-T model)
  Input:  X[1..n] vector of values, n = 2^k
  Output: S[1..n] vector of prefix sums
  parallel_prefix_sum(X[1..n]) =
      forall i in 1:n do S[i] := X[i]
      for h := 1 to k do
          forall i in 1:n/2^h do
              S[2^h·i] := S[2^h·i - 2^(h-1)] + S[2^h·i]
      for h := k downto 1 do
          forall i in 2:n/2^(h-1) do
              if odd(i) then S[2^(h-1)·i] := S[2^(h-1)·i - 2^(h-1)] + S[2^(h-1)·i] endif
[Figure: ascend and descend on [1 2 3 4] → [1 3 3 10] → [1 3 6 10]]
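The ascend/descend index arithmetic is easy to get wrong, so here is a sequential Python sketch of the algorithm above (0-based array, n assumed to be a power of two):

```python
def inplace_prefix_sum(X):
    # In-place ascend/descend prefix sum for n = 2**k. The ascend phase
    # builds the reduction tree inside the array; the descend phase fills
    # in the remaining prefixes. EREW; W(n) = O(n), S(n) = O(lg n).
    S = X[:]
    n = len(S)
    k = n.bit_length() - 1
    for h in range(1, k + 1):                    # ascend
        for i in range(1, n // 2**h + 1):
            S[2**h * i - 1] += S[2**h * i - 2**(h - 1) - 1]
    for h in range(k, 0, -1):                    # descend
        for i in range(2, n // 2**(h - 1) + 1):
            if i % 2 == 1:                       # odd(i) only
                S[2**(h - 1) * i - 1] += S[2**(h - 1) * i - 2**(h - 1) - 1]
    return S
```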

Scan-based primitives
Scan (parallel prefix) operations can be used to implement many useful primitives.
Suppose we are given SCAN to compute prefix sums of integer sequences:
  seq<int> SCAN(seq<int>)
- step complexity is Θ(lg n)
- work complexity is Θ(n)
- PRAM model is EREW
The next three examples have the same complexity as SCAN.

COPY or DISTRIBUTE
  seq<int> COPY(int v, int n) {
      seq<int> V[1:n];
      V[1] := v;
      forall i in 2:n do V[i] := 0;
      return SCAN(V);
  }
Example: v = 5, n = 7
  V   = 5 0 0 0 0 0 0
  Res = 5 5 5 5 5 5 5

ENUMERATE
  seq<int> ENUMERATE(seq<bool> Flag) {
      seq<int> V[1:#Flag];
      forall i in 1:#Flag do V[i] := Flag[i] ? 1 : 0;
      return SCAN(V);
  }
Example:
  Flag = T T F T F F T
  V    = 1 1 0 1 0 0 1
  Res  = 1 2 2 3 3 3 4

PACK
  seq<T> PACK(seq<T> A, seq<bool> Flag) {
      seq<T> R[1:#A];
      P := ENUMERATE(Flag);
      forall i in 1:#Flag do
          if Flag[i] then R[P[i]] := A[i] endif;
      return R[1:P[#Flag]];
  }
Example:
  A    = ! @ # $ % ^ &
  Flag = T T F T F F T
  P    = 1 2 2 3 3 3 4
  R    = ! @ $ &
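The three primitives above can be sketched together in Python, with `itertools.accumulate` as a sequential stand-in for the parallel SCAN (the trailing-underscore names avoid clashing with Python built-ins and are otherwise illustrative):

```python
from itertools import accumulate

def scan(V):                      # sequential stand-in for parallel SCAN
    return list(accumulate(V))

def copy_(v, n):                  # COPY: broadcast v via a scan
    return scan([v] + [0] * (n - 1))

def enumerate_(flag):             # ENUMERATE: running rank of true flags
    return scan([1 if f else 0 for f in flag])

def pack(A, flag):                # PACK: keep flagged elements, in order
    P = enumerate_(flag)          # P[i] is the 1-based slot for A[i]
    R = [None] * P[-1]
    for i, f in enumerate(flag):
        if f:
            R[P[i] - 1] = A[i]
    return R
```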

Radix sort
  Input:     A[1:n] with b-bit integer elements
  Output:    A[1:n] sorted
  Auxiliary: FL[1:n], FH[1:n], BL[1:n], BH[1:n]
  for h := 0 to b-1 do
      forall i in 1:n do
          FL[i] := (bit h of A[i] == 0)
          FH[i] := (bit h of A[i] != 0)
      BL := PACK(A, FL)
      BH := PACK(A, FH)
      m := #BL
      forall i in 1:n do
          A[i] := if i ≤ m then BL[i] else BH[i-m] endif
S(n) = Θ(b lg n), W(n) = Θ(bn).
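A Python sketch of the same idea: b passes, each a stable split of the sequence on one bit, with list comprehensions standing in for PACK(A, FL) and PACK(A, FH). Stability of PACK is what makes the least-significant-digit ordering correct:

```python
def radix_sort(A, b):
    # LSD radix sort via PACK: pass h stably moves elements with bit h = 0
    # ahead of those with bit h = 1, preserving the order within each group.
    for h in range(b):
        low  = [x for x in A if (x >> h) & 1 == 0]   # PACK(A, FL)
        high = [x for x in A if (x >> h) & 1 == 1]   # PACK(A, FH)
        A = low + high                               # BL followed by BH
    return A
```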

Complexity measures for W-T algorithms
Asymptotic time complexity measures:
- optimal sequential time complexity T_s(n)
- parallel time complexity T_c(n,p)
Speedup:
- definition: SP(n,p) = T_s(n) / T_c(n,p)
- limitation: since T_c(n,p) ≥ W(n)/p and W(n) ≥ T_s(n), we have SP(n,p) ≤ p·T_s(n)/W(n) = O(p)
Average available parallelism:
- definition: AAP(n) = W(n) / S(n)

Objectives in the design of W-T algorithms
Goal 1: construct work-efficient algorithms.
A W-T algorithm is work efficient if W(n) = Θ(T_s(n)).
Work-inefficient parallel algorithms have limited appeal: on a PRAM with a fixed number of processors p,
  lim_{n→∞} SP(n,p) = lim_{n→∞} p·T_s(n)/W(n) = p · lim_{n→∞} T_s(n)/W(n) = 0
when T_s(n)/W(n) → 0.

Objectives in the design of W-T algorithms
Goal 2: minimize step complexity.
Get optimal speedup using p = AAP(n) = T_s(n)/S(n) processors (taking W(n) = T_s(n) for a work-efficient algorithm). By Brent's theorem,
  SP(n,p) = T_s(n)/T_c(n,p) ≥ T_s(n)/(W(n)/p + S(n)) = T_s(n)/(2·S(n)) = AAP(n)/2 = Θ(p)
When S(n) is decreased, AAP(n) is increased:
- with fixed problem size, more processors can be used to get greater speedup
- with a fixed number of processors, optimal speedup is reached at a smaller problem size

W-T model advantages
- Widely developed body of techniques
- Ignores scheduling, communication, and synchronization: the easiest form of parallel programming
- Source-level complexity metrics: work and step complexity are related to running time via Brent's theorem
- A good place to start: many real-world algorithms can be derived starting from W-T algorithms