COMP 633 - Parallel Computing
Lecture 2, August 24, 2017: The PRAM model and complexity measures
First class summary
This course is about parallel computing to achieve higher performance on individual problems.
- start with the high-level PRAM model
  - study algorithms and asymptotic complexity
- subsequently focus on more practical models from an implementation point of view
  - shared memory, distributed memory, distributed computing
  - study hardware organization, programming models, performance prediction and analysis
  - examine various algorithms and case studies
Topics today
- PRAM model
  - execution model
  - programming model
- Work-Time model
  - programming model
  - complexity metrics
- Brent's theorem: translation to PRAM programs
- Parallel prefix algorithm
  - derivation
  - applications
PRAM model of parallel computation
PRAM = Parallel Random Access Machine
- p processors, shared memory
- each processor has a unique identity 1 ≤ i ≤ p
- SIMD operation (synchronous PRAM)
  - each processor may be active or inactive
  - each instruction is executed by all active processors
  - each instruction completes in unit time
[Figure: processors 1, 2, ..., p, each with an "active?" flag, receive a common instruction stream and access one shared memory.]
PRAM program
- a PRAM program is a sequential program
- expressions involving the processor id i have a unique value in each processor
  - i can be used as an array index
      X[i] := i
- conditionals specify the active processors
      if odd(i) then X[i] := X[i] + X[i+1] endif
      if i ≤ 2 then X[i] := 1 else X[i] := -1 endif
[Figure: X[1..4] = 1 2 3 4]
Concurrent memory access - Read
Concurrent reads (CR)
- all readers of a given location see the same value
      X[i] := y           value of y is read concurrently by all p processors
      X[i] := B[⌈i/2⌉]    some locations in B are read concurrently by two processors
Eliminating bounded-degree concurrent reads
- replace X[i] := B[⌈i/2⌉] with
      if odd(i) then X[i] := B[⌈i/2⌉] endif
      if even(i) then X[i] := B[⌈i/2⌉] endif
- Ex. p = 6: B = 1 2 3 gives X = 1 1 2 2 3 3
- the concurrent read is eliminated, but the number of steps is doubled
Concurrent memory access - Write
Concurrent writes (CW)
The stored value depends on the write arbitration policy:
- Arbitrary CW: nondeterministic choice among the values written
- Common CW: all processors that write a value must write the same value, else error
- Priority CW: the value written by the processor with the lowest processor id
- Combining CW: all values are combined using a specified associative operation, e.g. +
Example (p = 6, X = 10 20 30 40 50 60)
      y := X[i]
      B[⌈i/2⌉] := X[i]
[Figure: the resulting values of y and B under each policy.]
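The four arbitration policies can be made concrete with a small simulation of one synchronous write step to a single shared cell. This is a minimal sketch, not part of the lecture: the function name `concurrent_write` and the policy strings are hypothetical, and "arbitrary" is modeled by simply picking the last value, which is only one of the legal outcomes.

```python
from functools import reduce
import operator


def concurrent_write(writes, policy, combine=operator.add):
    """Resolve one synchronous step of writes to a single shared location.

    writes: list of (processor_id, value) pairs that all target the same cell.
    policy: "arbitrary", "common", "priority" or "combining" (hypothetical names).
    """
    if not writes:
        raise ValueError("no processor wrote to this location")
    values = [value for _, value in writes]
    if policy == "common":
        # Every writer must agree on the value, otherwise the step is an error.
        if any(value != values[0] for value in values):
            raise ValueError("Common CW: processors wrote different values")
        return values[0]
    if policy == "priority":
        # Tuples compare by processor id first, so min() picks the lowest id.
        return min(writes)[1]
    if policy == "arbitrary":
        # Any written value is a legal result; this sketch picks the last one.
        return values[-1]
    if policy == "combining":
        return reduce(combine, values)
    raise ValueError(f"unknown policy: {policy}")


# Slide example: p = 6, X = 10 20 30 40 50 60, every processor executes y := X[i].
X = [10, 20, 30, 40, 50, 60]
writes_to_y = [(i, X[i - 1]) for i in range(1, 7)]
y_priority = concurrent_write(writes_to_y, "priority")
y_combining = concurrent_write(writes_to_y, "combining")
```

Under Common CW the same step is an error, because the six processors write six different values to y.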
Concurrent writes
Let B[1:p] be an array of boolean values and define c = B[1] ∨ B[2] ∨ ... ∨ B[p].
Use p processors and concurrent writes to compute c in a constant number of steps
(a) with combining CW
(b) with a CW policy other than combining CW (which?)
Concurrent memory access
PRAM variants: EREW, CREW, ERCW, CRCW
- differ in performance, not expressive power
      EREW < CREW < CRCW
- loosely reflect the difficulty of implementing the model
The following are considered EREW
- references to processor id i, number of processors p, problem size n
- references to local variables
      local h; h := 2*i + 1; X[h] := X[i]
- expression evaluation is synchronous, e.g. X[i] := X[i] + X[i+1] is EREW
A PRAM program
Simple problem: vector addition
- given V, W vectors of length n, compute Z = V + W
PRAM program
- constructed to operate with arbitrary problem size n and number of processors p
- the work to be performed must explicitly be scheduled across processors
- time complexity with p processors: T_C(n,p) = ?
- PRAM model = ?

      Input:  V[1:n], W[1:n] in shared memory
      Output: Z[1:n] in shared memory

      local integer h, k
      for h := 1 to ⌈n/p⌉ do
        k := (h-1)p + i
        if k ≤ n then Z[k] := V[k] + W[k] endif
      enddo

[Figure: V, W and Z laid out as ⌈n/p⌉ blocks of p elements each; processor id i selects the position within a block.]
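The explicit schedule can be checked by simulating the p processors sequentially. The sketch below is an illustration under stated assumptions: `pram_vector_add` is a hypothetical name, the inner loop over i stands in for processors that would act simultaneously on a real PRAM, and Python lists are 0-based, so index k on the slide becomes k - 1.

```python
def pram_vector_add(V, W, p):
    """Simulate the p-processor PRAM vector addition; returns (Z, parallel steps)."""
    n = len(V)
    Z = [None] * n
    steps = -(-n // p)                      # ceil(n / p) iterations of the h loop
    for h in range(1, steps + 1):           # one synchronous step per value of h
        for i in range(1, p + 1):           # processors 1..p, simulated one by one
            k = (h - 1) * p + i             # element assigned to processor i in step h
            if k <= n:                      # processor i is inactive when k > n
                Z[k - 1] = V[k - 1] + W[k - 1]
    return Z, steps


Z, steps = pram_vector_add([1, 2, 3, 4, 5], [10, 20, 30, 40, 50], p=2)
```

With n = 5 and p = 2 the simulation takes three steps, and in the last step processor 2 is inactive. No location is read or written by two processors in the same step.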
Work-Time paradigm
W-T parallel programming model
- high-level PRAM programming model
- specifies available parallelism
  - no explicit scheduling of parallelism over processors
- simplifies algorithm presentation and analysis
  - W-T programs can be mechanically translated to PRAM programs
W-T program
- sequential program
- forall construct: specification of available parallelism
- the number of processors is not a parameter of the model!
WT program for vector addition

      Input:  V[1:n], W[1:n]
      Output: Z[1:n]

      forall i in 1:n do
        Z[i] := V[i] + W[i]
Programming notation for the W-T framework
Standard sequential programming notation
- statements
  - assignment
  - statement composition
  - alternative construct (if ... then ... else ... endif)
  - repetitive construct (for, while)
- expressions
  - arithmetic and logical functions
  - variable reference
- recursive function and procedure invocation
forall statement

      forall i in D do
        statement T depending on i

- specifies that T may be executed simultaneously for each value of i in D
- no restriction on T: it can be a sequence of statements and can invoke recursive functions
W-T complexity metrics
- Work complexity W(n): total number of operations performed, as a function of input size n
- Step complexity S(n): number of parallel steps required, as a function of input size n, assuming unbounded parallelism
- Both are defined inductively over the constructs of the W-T programming notation
W-T complexity measures: simple example

      forall i in 2:n-1 do
        R[i] := (R[i-1] + R[i] + R[i+1])/3

      for h := 1 to k do
        forall i in 2:n-1 do
          R[i] := (R[i-1] + R[i] + R[i+1])/3

[Figure: array R[1:n]]
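A sequential simulation makes the two complexity measures of this example tangible. The sketch assumes that a forall evaluates all right-hand sides before any assignment takes effect (the synchronous reading used for PRAM expressions earlier), and it counts one unit of work per execution of the body. The names `smooth_once` and `smooth` are hypothetical.

```python
def smooth_once(R):
    """One forall step: R[i] for 2 <= i <= n-1 becomes the average of its neighborhood.

    Returns (work, steps) for this single forall.
    """
    old = list(R)                              # snapshot: all reads precede all writes
    for j in range(1, len(R) - 1):             # 0-based j corresponds to slide index i = j + 1
        R[j] = (old[j - 1] + old[j] + old[j + 1]) / 3
    work = max(len(R) - 2, 0)                  # one body execution per index in 2:n-1
    return work, 1                             # the whole forall is one parallel step


def smooth(R, k):
    """The for loop around the forall: k forall steps executed one after another."""
    total_work = total_steps = 0
    for _ in range(k):
        work, steps = smooth_once(R)
        total_work += work
        total_steps += steps
    return total_work, total_steps


R = [0.0, 3.0, 0.0, 3.0, 0.0]
work, steps = smooth_once(R)
```

Under this counting the single forall has work n - 2 and one step, while the loop version has work k(n - 2) and k steps: sequential composition adds both work and steps.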
Work and step complexity of the forall construct
How should the work and step complexity of the forall construct be defined?

      P: forall i in D do
           body T depending on i

- assume we can determine W(T(i)) and S(T(i)) for each i in D
- W(P) = ?
- S(P) = ?
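The slide leaves W(P) and S(P) open. One common way to complete the definitions, stated here as an assumption rather than as the lecture's own answer, is that work adds up over all instances of the body while the step count is set by the slowest instance:

```latex
\[
  W(P) \;=\; \sum_{i \in D} W\bigl(T(i)\bigr),
  \qquad
  S(P) \;=\; \max_{i \in D} S\bigl(T(i)\bigr).
\]
% Example: for the vector addition program, D = \{1,\dots,n\} and each body
% instance is a single assignment, so W(P) = n and S(P) = 1.
```

Some presentations add a constant to each quantity for the overhead of the forall itself; that does not change the asymptotic results.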
W-T complexity measures: vector summation
Let n = 2^k.

      forall i in 1:n/2 do
        S[i] := S[2i-1] + S[2i]

      for h := 1 to k do
        forall i in 1:n/2^h do
          S[i] := S[2i-1] + S[2i]

[Figure: array S for n = 4, k = 2]
W-T complexity measures: vector summation
Vector summation (sum-reduction)
- given V[1..n], n = 2^k
- compute s = sum(V[1:n])
- optimal sequential time T_S(n) = ?
Complexity
- W(n) = ?
- S(n) = ?
- PRAM model needed?

      Input:  V[1:n] vector of integers, n = 2^k
      Output: s = sum(V[1:n])

      P1: forall i in 1:n do
            B[i] := V[i]
      P2: for h := 1 to k do
            forall i in 1:n/2^h do
              B[i] := B[2i-1] + B[2i]
      P3: s := B[1]
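The work and step counts of this program can be tallied by instrumenting a direct transcription. The sketch below is illustrative only: `wt_vector_sum` is a hypothetical name, each assignment is counted as one operation, each forall (and the final assignment) as one step, and list comprehensions are used so that every right-hand side is evaluated before B is updated.

```python
def wt_vector_sum(V):
    """Transcription of P1..P3 that also returns (work, steps) under the stated counting."""
    n = len(V)
    k = n.bit_length() - 1
    if n == 0 or (1 << k) != n:
        raise ValueError("n must be a power of two")
    work = steps = 0

    B = list(V)                                  # P1: forall i in 1:n do B[i] := V[i]
    work, steps = work + n, steps + 1

    for h in range(1, k + 1):                    # P2: k rounds of pairwise additions
        m = n >> h                               # n / 2^h active indices in round h
        # 0-based form of B[i] := B[2i-1] + B[2i]; the right-hand side is built first.
        B[:m] = [B[2 * j] + B[2 * j + 1] for j in range(m)]
        work, steps = work + m, steps + 1

    s = B[0]                                     # P3: s := B[1]
    work, steps = work + 1, steps + 1
    return s, work, steps


s, work, steps = wt_vector_sum([1, 2, 3, 4, 5, 6, 7, 8])
```

For n = 8 this gives work 16 = 2n and 5 = lg n + 2 steps, which suggests W(n) = Θ(n) and S(n) = Θ(lg n) for the program as written.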
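Brent's theorem
Brent's theorem schedules a W-T program for a p-processor PRAM.
Idea
- simulate each parallel step of the W-T program using p processors
- the work $W_i(n)$ to be performed in step i can be completed using p processors in time $\lceil W_i(n)/p \rceil$
- bound the concurrent runtime $T_C(n,p)$ of the resulting PRAM program by summing over all $S(n)$ steps
Brent's theorem and $T_C(n,p)$
- lower bound:
  $$\frac{W(n)}{p} \;=\; \sum_{i=1}^{S(n)} \frac{W_i(n)}{p} \;\le\; \sum_{i=1}^{S(n)} \left\lceil \frac{W_i(n)}{p} \right\rceil \;=\; T_C(n,p)$$
- upper bound:
  $$T_C(n,p) \;=\; \sum_{i=1}^{S(n)} \left\lceil \frac{W_i(n)}{p} \right\rceil \;\le\; \sum_{i=1}^{S(n)} \left( \frac{W_i(n)}{p} + 1 \right) \;=\; \sum_{i=1}^{S(n)} \frac{W_i(n)}{p} + S(n) \;=\; \frac{W(n)}{p} + S(n)$$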
Scheduling the W-T vector summation algorithm
W-T vector summation algorithm

      Input:  V[1:n] vector of integers, n = 2^k
      Output: s = sum(V[1:n])

      P1: forall i in 1:n do
            B[i] := V[i]
      P2: for h := 1 to k do
            forall i in 1:n/2^h do
              B[i] := B[2i-1] + B[2i]
      P3: s := B[1]

PRAM vector summation algorithm

      Input:  V[1:n] vector of integers, n = 2^k
      Output: s = sum(V[1:n])
      p > 0 processor PRAM; processor index i

      local integer j, r;
      P1: for j := 1 to ⌈n/p⌉ do
            r := (j-1)p + i
            if r ≤ n then B[r] := V[r] endif
      P2: for h := 1 to k do
            for j := 1 to ⌈(n/2^h)/p⌉ do
              r := (j-1)p + i
              if r ≤ n/2^h then B[r] := B[2r-1] + B[2r] endif
      P3: if i ≤ 1 then s := B[1] endif
Performance of the translated W-T program
- Count the steps needed to perform the additions in the PRAM vector summation algorithm of the previous slide
- Brent's theorem predicts $T_C(n,p) = O(n/p + \lg n)$
- [Table: measured step counts $T_C(n,p)$ for several choices of p, including p = 1 and a case with n = 2^k, k even; the individual entries are not recoverable.]
- The upper bound is tight for this program
- The translation retains the EREW model
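The prediction can be compared with the actual schedule by counting the iterations of the inner loop of P2, which performs ⌈(n/2^h)/p⌉ steps in round h. The following is a sketch under stated assumptions: only the additions in P2 are counted, so W(n) = n - 1 and S(n) = lg n, and the helper names are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


def scheduled_addition_steps(n, p):
    """Steps spent in P2 on p processors: sum over h = 1..k of ceil((n / 2^h) / p)."""
    k = n.bit_length() - 1                 # n is assumed to be a power of two, n = 2^k
    return sum(ceil_div(n >> h, p) for h in range(1, k + 1))


def brent_bounds(n, p):
    """Lower and upper bounds W/p and W/p + S for the additions in P2."""
    work = n - 1                           # total number of additions
    steps = n.bit_length() - 1             # lg n rounds
    return work / p, work / p + steps


n = 1024
results = {p: scheduled_addition_steps(n, p) for p in (1, 2, 3, 7, 64, 1024)}
```

With p = 1 the count is n - 1 = 1023 steps, and with p = n it is lg n = 10 steps; for every p tried the count lies between W/p and W/p + S, as the two bounds require.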
Parallel prefix-sum
Inclusive prefix sum
- Input: sequence X of n = 2^k elements, binary associative operator +
- Output: sequence S of n = 2^k elements, with S_i = x_1 + ... + x_i
- Example:
      X = [1, 4, 3, 5, 6, 7, 0, 1]
      S = [1, 5, 8, 13, 19, 26, 26, 27]
- T_S(n) = ?
Uses of prefix sum
- efficient parallel implementation of a sequential scan through consecutive actions
  - ex: given a series of bank transactions T[1:n], with T[i] positive or negative and T[1] the opening deposit > 0, was the account ever overdrawn?
- explicit or implicit component of many parallel algorithms
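The bank example reduces to a prefix sum followed by a test on every balance. In the sketch below, `itertools.accumulate` is a sequential stand-in for the parallel prefix sum, and `ever_overdrawn` is a hypothetical name; the point is only that the running balances are exactly the inclusive prefix sums of the transactions.

```python
from itertools import accumulate


def ever_overdrawn(transactions):
    """True if the running balance drops below zero after some transaction."""
    balances = list(accumulate(transactions))   # inclusive prefix sums S_1..S_n
    return any(balance < 0 for balance in balances)


prefix = list(accumulate([1, 4, 3, 5, 6, 7, 0, 1]))
overdrawn = ever_overdrawn([100, -30, -80, 50])     # balances: 100, 70, -10, 40
not_overdrawn = ever_overdrawn([100, -30, -70, 5])  # balances: 100, 70, 0, 5
```

Once the balances are available, the per-element test is independent for each i, so it is a single forall followed by a reduction.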
Prefix sum algorithm
Recursive solution
- Xi stands for X[i], and Xij stands for X[i] + X[i+1] + ... + X[j]

      S:  X11  X12  X13  X14  X15  X16  X17  X18
      Z:       X12       X14       X16       X18     (recursive prefix sum of Y)
      Y:       X12       X34       X56       X78
      X:  X1   X2   X3   X4   X5   X6   X7   X8
Parallel prefix sum algorithm (WT model)

      Input:  X[1..n] vector of integers
      Output: S[1..n]

      par_prefix_sum(X[1..n]) =
        var Y[1..n/2], Z[1..n/2], S[1..n];
        S[1] := X[1];
        if n > 1 then
          forall 1 ≤ i ≤ n/2 do
            Y[i] := X[2i-1] + X[2i]
          Z[1..n/2] := par_prefix_sum(Y[1..n/2]);
          forall 2 ≤ i ≤ n do
            if even(i) then S[i] := Z[i/2]
            else S[i] := Z[(i-1)/2] + X[i]
            endif
        endif
        return S[1..n]

Illustration for n = 4:

      S:  X11  X12  X13  X14
      Z:       X12       X14     (recur)
      Y:       X12       X34
      X:  X1   X2   X3   X4
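A direct Python transcription helps check the index arithmetic of the recursion. This is a sequential sketch, not a parallel implementation: each forall becomes an ordinary loop or comprehension, n is assumed to be a power of two, and the function keeps the slide's 1-based index i while subtracting 1 when touching Python lists.

```python
def par_prefix_sum(X):
    """Recursive prefix sum following the W-T program; len(X) must be a power of two."""
    n = len(X)
    if n == 0 or n & (n - 1):
        raise ValueError("n must be a power of two")
    if n == 1:
        return [X[0]]
    # forall 1 <= i <= n/2: Y[i] := X[2i-1] + X[2i]   (0-based: X[2j] + X[2j+1])
    Y = [X[2 * j] + X[2 * j + 1] for j in range(n // 2)]
    Z = par_prefix_sum(Y)
    S = [X[0]] * n                      # S[1] := X[1]; the other entries are overwritten
    for i in range(2, n + 1):           # forall 2 <= i <= n, with 1-based i
        if i % 2 == 0:
            S[i - 1] = Z[i // 2 - 1]                      # S[i] := Z[i/2]
        else:
            S[i - 1] = Z[(i - 1) // 2 - 1] + X[i - 1]     # S[i] := Z[(i-1)/2] + X[i]
    return S


S = par_prefix_sum([1, 4, 3, 5, 6, 7, 0, 1])
```

Each level of the recursion performs two forall steps on a problem of half the size, so the recursion depth is lg n.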
Balanced trees in arrays
Balanced Tree Ascend / Descend
- Key idea
  - view the input data as a balanced binary tree
  - sweep the tree up and/or down
- The tree is not a data structure but a control structure (e.g., recursion)
Example: vector summation (array contents after each ascend level, input at the bottom)

      after level 3:   1   3   3  10   5  11   7  36
      after level 2:   1   3   3  10   5  11   7  26
      after level 1:   1   3   3   7   5  11   7  15
      input:           1   2   3   4   5   6   7   8
      index:           1   2   3   4   5   6   7   8
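In-place prefix sum
[Figure: in-place prefix sum of 1 2 3 4 5 6 7 8; the legend distinguishes "+" in the ascend phase, "+" in the descend phase, and retained values.
 Ascend phase:  3 7 11 15, then 10 26, then 36.
 Descend phase: 10 36, then 3 10 21 36, then 1 3 6 10 15 21 28 36.]
- S(n) = ?
- W(n) = ?
- Space = ?
- PRAM model = ?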
In-place prefix-sum algorithm (WT model)

      Input:  X[1..n] vector of values, n = 2^k
      Output: S[1..n] vector of prefix sums

      parallel_prefix_sum(X[1..n]) =
        forall i in 1:n do
          S[i] := X[i]
        for h = 1 to k do
          forall i in 1:n/2^h do
            S[2^h i] := S[2^h i - 2^(h-1)] + S[2^h i]
        for h = k downto 1
          forall i in 2:n/2^(h-1) do
            if odd(i) then
              S[2^(h-1) i] := S[2^(h-1) i - 2^(h-1)] + S[2^(h-1) i]
            endif

[Figure: n = 4 with input 1 2 3 4; ascend gives 3 7, then 10; descend retains 10, then 3 10, then 1 3 6 10.]
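The ascend and descend sweeps can be traced with a sequential transcription. In this sketch the array is padded with an unused slot 0 so that the slide's 1-based indices carry over unchanged; `inplace_prefix_sum` is a hypothetical name, and the inner loops may run sequentially because, within one level, no iteration reads a location that another iteration of the same level writes.

```python
def inplace_prefix_sum(X):
    """Ascend/descend prefix sum on an array of length n = 2^k (1-based internally)."""
    n = len(X)
    k = n.bit_length() - 1
    if n == 0 or (1 << k) != n:
        raise ValueError("n must be a power of two")
    S = [None] + list(X)                         # S[1..n]; slot 0 is unused padding

    for h in range(1, k + 1):                    # ascend phase
        stride = 1 << h                          # 2^h
        for i in range(1, n // stride + 1):      # forall i in 1:n/2^h
            S[stride * i] = S[stride * i - stride // 2] + S[stride * i]

    for h in range(k, 0, -1):                    # descend phase
        half = 1 << (h - 1)                      # 2^(h-1)
        for i in range(3, n // half + 1, 2):     # odd i in 2:n/2^(h-1)
            S[half * i] = S[half * i - half] + S[half * i]

    return S[1:]


S = inplace_prefix_sum([1, 2, 3, 4, 5, 6, 7, 8])
```

For the input 1..8 the ascend phase leaves 1 3 3 10 5 11 7 36 in the array, and the descend phase completes it to 1 3 6 10 15 21 28 36, matching the figure on the preceding slide.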
Scan-based primitives
Scan operations
- parallel prefix operations can be used to implement many useful primitives
Suppose we are given SCAN to compute the prefix sum of integer sequences

      seq<int> SCAN(seq<int>)

- step complexity is Θ(lg n)
- work complexity is Θ(n)
- PRAM model is EREW
The next three examples have the same complexity as SCAN.
COPY (or DISTRIBUTE)

      seq<int> COPY(int v, int n) {
        seq<int> V[1:n];
        V[1] = v;
        forall i in 2 : n do
          V[i] := 0;
        return SCAN(V);
      }

      v = 5, n = 7
      V   = 5 0 0 0 0 0 0
      Res = 5 5 5 5 5 5 5
ENUMERATE

      seq<int> ENUMERATE(seq<bool> Flag) {
        seq<int> V[1:#Flag];
        forall i in 1 : #Flag do
          V[i] := Flag[i] ? 1 : 0;
        return SCAN(V);
      }

      Flag = T T F T F F T
      V    = 1 1 0 1 0 0 1
      Res  = 1 2 2 3 3 3 4
PACK

      seq<T> PACK(seq<T> A, seq<bool> Flag) {
        seq<T> R[1:#A];
        P := ENUMERATE(Flag);
        forall i in 1 : #Flag do
          if Flag[i] then R[P[i]] := A[i] endif;
        return R[1:P[#Flag]];
      }

      A    = ! @ # $ % ^ &
      Flag = T T F T F F T
      P    = 1 2 2 3 3 3 4
      R    = ! @ $ &
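The three primitives can be exercised on the slide examples with a sequential stand-in for SCAN. The sketch below assumes `itertools.accumulate` in place of the parallel scan and uses Python-style names (`scan`, `copy`, `enumerate_flags`, `pack`) that are not from the lecture; `copy` assumes n ≥ 1.

```python
from itertools import accumulate


def scan(V):
    """Sequential stand-in for SCAN: inclusive prefix sum of an integer sequence."""
    return list(accumulate(V))


def copy(v, n):
    """COPY: v in the first slot, zeros elsewhere, then a scan spreads v (n >= 1)."""
    return scan([v] + [0] * (n - 1))


def enumerate_flags(flag):
    """ENUMERATE: running count of the True flags seen so far."""
    return scan([1 if f else 0 for f in flag])


def pack(A, flag):
    """PACK: keep A[i] where flag[i] is True, preserving order."""
    P = enumerate_flags(flag)
    R = [None] * len(A)
    for i, keep in enumerate(flag):          # forall i: independent writes to distinct slots
        if keep:
            R[P[i] - 1] = A[i]               # P is 1-based on the slide, lists are 0-based
    return R[:P[-1]] if P else []


flags = [True, True, False, True, False, False, True]
copied = copy(5, 7)
ranks = enumerate_flags(flags)
packed = pack(list("!@#$%^&"), flags)
```

Each primitive is one or two forall steps plus a scan, which is why the slides state that they share the complexity of SCAN.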
Radix Sort

      Input:     A[1:n] with b-bit integer elements
      Output:    A[1:n] sorted
      Auxiliary: FL[1:n], FH[1:n], BL[1:n], BH[1:n]

      for h := 0 to b-1 do
        forall i in 1:n do
          FL[i] := (A[i] bit h) == 0
          FH[i] := (A[i] bit h) != 0
        BL := PACK(A, FL)
        BH := PACK(A, FH)
        m := #BL
        forall i in 1:n do
          A[i] := if i ≤ m then BL[i] else BH[i - m] endif

- S(n) = ?
- W(n) = ?
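The radix sort can be tried out with a sequential stand-in for PACK. This sketch is self-contained and illustrative only: `pack` is rebuilt on `itertools.accumulate`, "A[i] bit h" is read as bit h of the integer A[i] (an assumption about the slide's notation), and the final forall is written as the concatenation BL + BH, which places BL[i] at positions i ≤ m and BH[i - m] after it.

```python
from itertools import accumulate


def pack(A, flag):
    """Keep A[i] where flag[i] is True, preserving order (sequential stand-in for PACK)."""
    P = list(accumulate(1 if f else 0 for f in flag))
    R = [None] * len(A)
    for i, keep in enumerate(flag):
        if keep:
            R[P[i] - 1] = A[i]
    return R[:P[-1]] if P else []


def radix_sort(A, b):
    """Sort non-negative integers of at most b bits, one stable split per bit."""
    A = list(A)
    for h in range(b):                                   # least significant bit first
        FL = [((a >> h) & 1) == 0 for a in A]            # bit h is 0
        FH = [((a >> h) & 1) != 0 for a in A]            # bit h is 1
        BL = pack(A, FL)
        BH = pack(A, FH)
        A = BL + BH                                      # zeros first, then ones
    return A


result = radix_sort([5, 3, 7, 0, 2, 6, 1, 4], b=3)
```

Because PACK preserves order, each pass is stable, which is what makes processing bits from least to most significant correct. With a scan of Θ(lg n) steps and Θ(n) work, the b passes suggest S(n) = Θ(b lg n) and W(n) = Θ(b n).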
Complexity measures for W-T algorithms
Asymptotic time complexity measures
- optimal sequential time complexity $T_S(n)$
- parallel time complexity $T_C(n,p)$
Speedup
- definition:
  $$SP(n,p) = \frac{T_S(n)}{T_C(n,p)}$$
- limitation:
  $$SP(n,p) = \frac{T_S(n)}{T_C(n,p)} \;\le\; \frac{T_S(n)}{W(n)/p} = \frac{p\,T_S(n)}{W(n)} = O(p)$$
Average available parallelism
- definition:
  $$AAP(n) = \frac{W(n)}{S(n)}$$
Objectives in the design of W-T algorithms
Goal 1: construct work-efficient algorithms
- a W-T algorithm is work efficient if $W(n) = \Theta(T_S(n))$
- work-inefficient parallel algorithms have limited appeal on a PRAM with a fixed number of processors p:
  $$\lim_{n\to\infty} SP(n,p) \;\le\; \lim_{n\to\infty} \frac{p\,T_S(n)}{W(n)} \;=\; p \lim_{n\to\infty} \frac{T_S(n)}{W(n)} \;=\; 0$$
Objectives in the design of W-T algorithms
Goal 2: minimize step complexity
- get optimal speedup using $AAP(n) = T_S(n)/S(n)$ processors:
  $$SP(n, AAP(n)) = \frac{T_S(n)}{T_C(n, AAP(n))} \;\ge\; \frac{T_S(n)}{\dfrac{T_S(n)}{AAP(n)} + S(n)} = \frac{T_S(n)}{S(n) + S(n)} = \frac{AAP(n)}{2}$$
- when $S(n)$ is decreased, $AAP(n)$ is increased
  - with fixed problem size, more processors can be used to get greater speedup
  - with a fixed number of processors, optimal speedup is reached at a smaller problem size
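As a worked instance of both goals, consider the W-T vector summation algorithm from earlier in the lecture. The counts below assume one operation per assignment (P1 contributes n, P2 contributes n - 1, P3 contributes 1) and one step per forall; other counting conventions change the constants but not the asymptotics.

```latex
\begin{align*}
  T_S(n) &= \Theta(n) && \text{(a sequential loop performing } n-1 \text{ additions)} \\
  W(n)   &= n + (n-1) + 1 = 2n = \Theta(T_S(n)) && \text{(work efficient, Goal 1)} \\
  S(n)   &= 1 + \lg n + 1 = \lg n + 2 \\
  AAP(n) &= \frac{W(n)}{S(n)} = \frac{2n}{\lg n + 2} = \Theta\!\left(\frac{n}{\lg n}\right) \\
  T_C(n,p) &\le \frac{W(n)}{p} + S(n) = \Theta(\lg n)
           && \text{for } p = \Theta\!\left(\frac{n}{\lg n}\right) \\
  SP(n,p)  &= \frac{T_S(n)}{T_C(n,p)} = \Theta\!\left(\frac{n}{\lg n}\right) = \Theta(p)
\end{align*}
```

Using more than about n / lg n processors cannot reduce the running time below the Θ(lg n) step complexity, which is why reducing S(n) is the second design goal.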
W-T model advantages
- Widely developed body of techniques
- Ignores scheduling, communication and synchronization
  - easiest parallel programming
- Source-level complexity metrics
  - work and step complexity are related to running time via Brent's theorem
- Good place to start
  - many real-world algorithms can be derived starting from W-T algorithms