Parallel Solutions of Indexed Recurrence Equations

Yosi Ben-Asher
Dep. of Math and CS, Haifa University, 31905 Haifa, Israel
yos@mathcshafaacl

Gadi Haber
IBM Science and Technology, 31905 Haifa, Israel
haber@hafascvnetbmcom

Abstract

A new type of recurrence equations called indexed recurrences (IR) is defined, in which the common notion of $X[i] = op(X[i], X[i-1])$, $i = 1 \ldots n$, is generalized to

$$X[g(i)] = op(X[f(i)], X[h(i)]), \qquad f, g, h : \{1 \ldots n\} \mapsto \{1 \ldots m\}.$$

This enables us to model sequential loops of the form

    for i = 1 to n do X[g(i)] := op(X[f(i)], X[h(i)]);

as IR equations. Thus, a parallel algorithm that solves a set of IR equations is in fact a way to transform sequential loops into parallel ones. Note that the circuit value problem (CVP) can also be expressed as a set of IR equations. Therefore an efficient parallel solution to the general IR problem is not likely to be found, as such a solution would also solve the CVP, showing that $P \subseteq NC$. In this paper we introduce parallel algorithms for two variants of the IR equations problem:

- An O(log n) greedy algorithm for solving IR equations where $g(i)$ is distinct and $h(i) = g(i)$, using O(n) processors.
- An O(log n) algorithm with no restriction on $f$, $g$ or $h$, using up to O(n ) processors.

However, we show that for general IR, $op$ must be commutative so that a parallel computation can be used.

Introduction

We consider a certain generalization of ordinary recurrence equations called indexed recurrence (IR) equations. Given an initialized array $A[1..m]$, a set of $n$ IR equations has the form

$$A[g(i)] := op(A[f(i)], A[h(i)]), \qquad i = 1 \ldots n,$$

which can be represented by a sequential loop of the form

    for i = 1 to n do A[g(i)] := op(A[f(i)], A[h(i)]);

where $op$ is a binary associative operator and where the index functions $f, g, h : \{1..n\} \mapsto \{1..m\}$ do not include references to elements of the array $A[]$ itself. The goal is to use the parallel solutions of these IR equations in order to parallelize sequential loops whose execution can be simulated by a set of IR equations. This is similar to the way that parallel solutions of linear recurrences ($A[i] = op(A[i-1], A[i])$) are used to parallelize sequential loops [] of the form:

    for i = 1 to n do A[i] := op(A[i-1], A[i]);

In our work, we
analyzed the well known Livermore Loops [1] and checked how many of them fit into the general frame of IR equations, compared with ordinary recurrence equations. There are 24 loops in this code, which is often used as a benchmark for parallelizing compilers and contains typical code for scientific computing. Out of the 24 loops we found that: loops ,7,8,,5,6, do not contain recurrences of any type; loops ,5,,9 contain linear recurrences; all other loops (except for ,0,) contain indexed recurrences.

Ordinary Indexed Recurrences

This section describes the parallel algorithm for computing a set of IR equations where $g(i)$ is distinct and $h(i) = g(i)$. This case is simpler than the general one, and the parallel algorithm we obtain is more efficient than the one for the general case and uses O(n) processors. It is easiest to begin with the sequential algorithm, namely the following loop:

    Array A[1..m] with initial values;
    for i = 1 to n do A[g(i)] := A[f(i)] ⊙ A[g(i)];

For convenience, we have replaced the notation $op(x, y)$ with $x \odot y$, where $\odot$ is the suitable binary and associative operation. Note that $op$ is not necessarily a commutative operation, therefore our algorithm should preserve the order of the multiplications (i.e. the order of operations).
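The sequential Ordinary IR loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `ordinary_ir_sequential`, the 0-based indexing, and the small index tables `f_idx` and `g_idx` are hypothetical and are not taken from the paper. String concatenation stands in for the operator, since it is associative but not commutative, so the result makes the order of each trace visible.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def ordinary_ir_sequential(
    a: Sequence[T],
    f: Sequence[int],
    g: Sequence[int],
    op: Callable[[T, T], T],
) -> List[T]:
    """Run A[g(i)] := op(A[f(i)], A[g(i)]) for i = 0..n-1 (0-based here).

    Assumes g is distinct, as in the Ordinary IR case. The input is copied,
    so the initial array is left untouched.
    """
    result = list(a)
    for fi, gi in zip(f, g):
        # The left operand is the current value of A[f(i)], which may already
        # have been updated by an earlier iteration j < i with g(j) = f(i).
        result[gi] = op(result[fi], result[gi])
    return result


# Hypothetical example: six initial values, three IR equations.
initial = ["a", "b", "c", "d", "e", "f"]
f_idx = [0, 1, 2]
g_idx = [1, 2, 4]

final = ordinary_ir_sequential(initial, f_idx, g_idx, lambda x, y: x + y)
print(final)  # ['a', 'ab', 'abc', 'd', 'abce', 'f']
```

In this example the element at index 4 ends up holding the product of the initial values at indices 0, 1, 2 and 4, in that order, while elements that are never written (indices 0, 3 and 5) keep their initial values.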
[Figure 1: An example of an Ordinary IR loop, showing the loop and the resulting values of $A'$ after 8 iterations, e.g. $A'[5] = A[5]$, $A'[7] = A[7]$ and $A'[8] = A[5] \odot A[8]$.]

The above loop can be viewed as a function for computing a new value $A' = F(A, n, f, g)$ (also denoted as $OrdinaryIR(A, f, g, \odot)$), where $A$ is the initial array and $A'$ is the array after executing the loop. We therefore need to find a parallel algorithm that computes $F(A, n, f, g)$ in fewer than $n$ steps. This is analogous to the way in which prefix-sum [] is used to solve ordinary recurrence equations:

$$F(A, op(x, y)) = \text{prefix-sum}(A, op(x, y)).$$

The value of $A'[g(i)]$ is a product of a subset of elements in $A$. As an example consider the loop in fig. 1. Some of the elements preserve their initial values, e.g. $A'[7] = A[7]$, since there is no $i \le n$ such that $g(i) = 7$. The traces of other elements contain the product of several elements, e.g. $A'[6]$ is a product of three elements ending in $A[6]$: the iteration $i$ with $g(i) = 6$ multiplies in $A[f(i)]$, and an earlier iteration $j < i$ with $g(j) = f(i)$ has already multiplied $A[f(j)]$ into $A[f(i)]$. Finally, $A[f(j)]$ is the last item in the trace of $A'[6]$, since there is no $j' < j$ such that $g(j') = f(j)$. The sequence of multiplications of every element in $A'[]$ (also called the trace of $A[g(i)]$) is given by the following lemma:

Lemma 1. Let $A'[]$ denote the value of $A[]$ after the execution of the loop

    for i = 1, ..., n do A[g(i)] := A[f(i)] ⊙ A[g(i)];

then for all $i = 1 \ldots n$:

$$A'[g(i)] = A[f(j_k)] \odot \cdots \odot A[f(j_1)] \odot A[g(i)]$$

such that:

- $j_1 = i$;
- for $t = 2 \ldots k$ the indices $j_t$ satisfy $j_t < j_{t-1}$ and $g(j_t) = f(j_{t-1})$;
- $j_k$ is the last index for which $g(j_t) = f(j_{t-1})$, i.e., there is no $j_{k+1} < j_k$ such that $g(j_{k+1}) = f(j_k)$.

Lemma 1 suggests a simple method for computing $A'[g(i)]$ in parallel. Let $A^{-t}[g(i)]$ denote the sub-trace with the $t + 1$ rightmost elements in the trace of $A'[g(i)]$, i.e.,

$$A^{-t}[g(i)] = A[f(j_{k-t})] \odot \cdots \odot A[f(i)] \odot A[g(i)].$$

Consider the concatenation (or multiplication) of two successive sub-traces:

$$A^{-(t_1+t_2)}[g(i)] = A^{-t_1}[g(j)] \odot A^{-t_2}[g(i)]$$

where $g(j) = f(j_{k-t_2})$ and $A[f(j_{k-t_2})]$ is the last element in $A^{-t_2}[g(i)]$. Note that $A[g(j)]$ is multiplied
twice, once as $A[g(j)]$ and once as $A[f(j_{k-t_2})]$. This can be corrected by taking the trace of its predecessor $A^{-t_1}[f(j)]$, so that

$$A^{-(t_1+t_2)}[g(i)] = A^{-t_1}[f(j)] \odot A^{-t_2}[g(i)]$$
$$= A[f(j'_{k-t_1})] \odot \cdots \odot A[f(j)] \odot A[f(j'_{k-t_2})] \odot \cdots \odot A[g(i)]$$
$$= A[f(j'_{k-t_1})] \odot \cdots \odot A[g(j')] \odot A[g(j)] \odot \cdots \odot A[g(i)]$$

where $j' < j$ and $j'$ is the iteration number in which $A[f(j)]$ is last updated in the loop.

The proposed algorithm is a simple greedy algorithm that keeps iterating until all traces are completed, where in each iteration all possible concatenations of successive sub-traces are computed in parallel. Thus, initially, we can compute in parallel the first product of each trace, $A[g(i)] = A[f(i)] \odot A[g(i)]$ (for all $i = 1, \ldots, n$). The concatenation operation of two successive sub-traces $A^{-t_1}[g(j)]$, $A^{-t_2}[g(i)]$ can be implemented using the following facts:

- the value of a sub-trace $A^{-t}[g(i)]$ is stored in its array element $A[g(i)]$;
- a pointer $N[g(i)]$ points to the sub-trace $A^{-t_1}[g(j)]$ to be concatenated to $A^{-t_2}[g(i)]$ (to form $A^{-(t_1+t_2)}[g(i)]$). Hence, $A[N[g(i)]]$ contains the value of the sub-trace $A^{-t_1}[g(j)]$.

Initially all traces are of length 2, and can be computed in parallel. The code for a concatenation step of future iterations is therefore as follows:

- multiplication: $A[g(i)] = A^{-(t_1+t_2)}[g(i)] = A[N[g(i)]] \odot A[g(i)]$;
- pointer updating: $N[g(i)]^{-(t_1+t_2)} = N[N[g(i)]]$,

where $N[1, \ldots, m]$ is initialized as follows:

$$N[x] = \begin{cases} f(i) & \exists i,\ 1 \le i \le n \text{ and } g(i) = x \\ 0 & \text{otherwise} \end{cases}$$

Since we start with traces of length 2, then for each $i = 1..n$ we set $N[g(i)] = N[N[g(i)]]$. The way in which the concatenation operation works is depicted in fig. 2, showing two parallel concatenations of sub-traces. The operation $N[g(i)]^{-(t_1+t_2)} = N[N[g(i)]]$ is depicted by the fact that the next-pointer of a new trace is taken to be that of the joined trace.

The algorithm performs $\log n$ iterations. In each iteration, the above concatenation operation (the multiplication followed by the pointer updating) is performed in parallel for all traces $A'[g(i)]$. As a result, in each iteration, either a trace is fully computed or the number
of elements in the product of a trace is doubled due to the multiplication $A^{-(t_1+t_2)}[g(i)] = A^{-t_1}[g(j)] \odot A^{-t_2}[g(i)]$. Hence, $\log n$ iterations are sufficient.

[Figure 2: The concatenation operation of two traces.]

Clearly, once a trace has been completed (fully computed) we must not continue to concatenate any more traces to it. It therefore remains to determine when the computation of a trace has completed. In general, in every iteration and for every trace stored in $A[g(i)]$, the algorithm must determine:

- the existence of $A[g(j)]$ such that its trace can be concatenated to the trace of $A[g(i)]$;
- whether the computation of the trace of $A[g(i)]$ is completed, in which case no more redundant traces should be added to it.

A more efficient version of the algorithm, which forks only up to $P$ processes at the same time, was programmed and tested on the SimParC [5] simulator. Hence, the complexity of this version is $T(n, P) = \frac{n \log n}{P}$. Figure 3 shows the results obtained for an array of size $n = 50{,}000$ and for $P$ = #processors $\ll n$. The Y axis represents the complexity in units of assembly instructions. The code of the algorithm is given in the full paper.

[Figure 3: The results of running the OrdinaryIR algorithm for n = 50,000. Y axis: instructions; X axis: processors; curves: Parallel IR Solution, Original IR Loop.]

Useful Application for the Ordinary IR Solution

Consider the following recurrence:

    for i = 1 to n do X[g(i)] := (A[i] X[f(i)] + B[i]) / (C[i] X[f(i)] + D[i]);

This is not an ordinary IR recurrence due to the nonassociative nature of the operators $f_i(x) = \frac{a_i x + b_i}{c_i x + d_i}$ (where $i = 1, 2, \ldots, n$). However, we can transform the recurrence into an ordinary IR problem by exploiting a useful quality of these operators, as shown in the following lemma.

Lemma 2. Let there be two sets of functions $f_i(x)$ and $g_i(x)$ defined as follows:

$$f_i(x) = \frac{a_i x + b_i}{c_i x + d_i}, \qquad g_i(x) = \frac{e_i x + f_i}{g_i x + h_i};$$

then $f_i(g_i(x)) = \frac{k_i x + l_i}{m_i x + n_i}$, where

$$\begin{pmatrix} k_i & l_i \\ m_i & n_i \end{pmatrix} = \begin{pmatrix} a_i & b_i \\ c_i & d_i \end{pmatrix} \otimes \begin{pmatrix} e_i & f_i \\ g_i & h_i \end{pmatrix}.$$

The operation $\otimes$ is defined as follows for $2 \times 2$ matrices:

$$A \otimes B = \begin{cases} A & \text{if } \det(A) = 0 \\ AB & \text{otherwise} \end{cases}$$

From lemma 2, also
known as the Moebius Transformation, it follows that the values $X[g(1)] \ldots X[g(n)]$ of the recurrence shown above can be computed by the following steps:

1. Initialize all matrices with the appropriate coefficients:

    $M_{1..m}$ initialized to $\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$;
    forall $i \in \{1..n\}$ do in parallel $M_{g(i)} := \begin{pmatrix} A[i] & B[i] \\ C[i] & D[i] \end{pmatrix}$

2. Multiply the matrices:

    for i = 1 to n do $M_{g(i)} := M_{f(i)} \otimes M_{g(i)}$

3. Calculate the values of $X[g(1)] \ldots X[g(n)]$:

    forall $i \in \{1..n\}$ do in parallel $X[g(i)] := \frac{m_{11} S[g(i)] + m_{12}}{m_{21} S[g(i)] + m_{22}}$, where $\begin{pmatrix} m_{11} & m_{12} \\ m_{21} & m_{22} \end{pmatrix} = M_{g(i)}$

Note that since step 2 is an ordinary IR, we can replace it with a call to $OrdinaryIR(M, f, g, \otimes)$ (where $\otimes$ is the modified matrix multiplication operation from lemma 2). Thus we have transformed the recurrence into an ordinary IR problem, which we already know how to solve.

We can also produce a parallel solution to a slightly more complicated recurrence of the following form:

    for i = 1 to n do X[g(i)] := X[g(i)] + (A[i] X[f(i)] + B[i]) / (C[i] X[f(i)] + D[i]);

Since $g(i)$ is distinct, we can rewrite the above recurrence by replacing the variable $X[g(i)]$ on the right hand side of the ':=' sign with its initial value $S[g(i)]$, without affecting the final values of $X[g(1)] \ldots X[g(n)]$. This is allowed since the distinctness property of $g(i)$ guarantees that each assignment to $X[g(i)]$ is the first one, and therefore each reference to $X[g(i)]$ is a reference to its initial value. Thus we can bring the loop to its Moebius form as follows:

    for i = 1 to n do X[g(i)] := ((S[g(i)] C[i] + A[i]) X[f(i)] + (S[g(i)] D[i] + B[i])) / (C[i] X[f(i)] + D[i])
producing the following Moebius matrices:

$$\forall i: \quad M_{g(i)} = \begin{pmatrix} S[g(i)] \cdot C[i] + A[i] & S[g(i)] \cdot D[i] + B[i] \\ C[i] & D[i] \end{pmatrix}$$

As an example consider the recurrence taken from loop number 23 of the Livermore Loops benchmark [1]. The loop is a 2-D Implicit Hydrodynamics fragment:

    X[1..n, 1..7] initialized to S
    for j = 2 to 6 do
      for i = 2 to n do
        X[i, j] := X[i, j] + 0.175d0 * (Y[i] + X[i-1, j] * Z[i, j]);

The inner loop can be viewed as an ordinary IR problem $OrdinaryIR(M, f, g, \otimes)$ where $g(i) = 7(i-1) + j$, $f(i) = 7(i-2) + j$,

$$\forall i: \quad M_{g(i)} = \begin{pmatrix} 0.175 \cdot Z[i, j] & S[g(i)] + 0.175 \cdot Y[i] \\ 0 & 1 \end{pmatrix}$$

and where $\otimes$ is the operator from lemma 2. Thus, without using any data dependence analysis techniques, we managed to parallelize the loop so that it is calculated in O(log n) steps.

General Indexed Recurrences

We now consider a more general case of IR equations (called GIR), which can be modeled by the loop:

    for i = 1 to n do begin A[g(i)] := A[f(i)] ⊙ A[h(i)]; end

The greedy method used for the IR case (where $g(i) = h(i)$) is not suitable for GIR. Essentially, this is due to the difference in the structure of the trace $A'[g(i)]$ in the two cases. As depicted in fig. 4, $A'[g(i)]$ in the GIR case is a binary tree, whereas in the IR case $A'[g(i)]$ is a list.

[Figure 4: Tree structure versus list structure of the trace (panels: GIR, IR).]

The tree structure of the trace implies that the operator $\odot$ must be a commutative one. Clearly, the multiplication of trace values can be done either from the left or from the right end of a current trace value. The other problem that a GIR loop presents us with is that traces can have an exponential length. For example consider the loop 'for i = 2..n A[i] := A[i-1] ⊙ A[i-2]', where $A[0] = A[1] = a$. In this example the trace $A'[n]$ is a power of $a$ whose exponent, and hence the number of multiplications, grows exponentially in $n$. Therefore, in order for the parallelization of GIR loops to be efficient, the computation of a power ($A[i]^k$) must be regarded as an atomic operation. This assumption can also be found in previous works (e.g. []), where the multiplication operation was used in order to solve recurrences of additions. The GIR
algorithm must therefore gather all identical elements of a trace and then, using the power operation, compute their product in a single operation. As an example, consider the above loop ($A[i] := A[i-1] \odot A[i-2]$), where $A[0]$ and $A[1]$ have different initial values. After the execution of the loop the trace is a multiplication of two powers,

$$A'[i] = A[0]^{fib(i-2)} \odot A[1]^{fib(i-1)},$$

where $fib(i)$ is the $i$-th Fibonacci number (with $fib(0) = fib(1) = 1$). This trace is thus best computed by first counting the powers of $A[0]$ and $A[1]$ in every trace separately (see figure 5). Indeed, counting powers is sufficient to compute the traces not only for the above loop, but for any GIR loop as well.

[Figure 5: The expansion of the recurrence $X_i = X_{i-1} \odot X_{i-2}$.]

Counting all powers of the elements of $A[]$ can be done using an initial dependence graph $G = \langle V, E \rangle$, showing dependences among the final values of the elements of $A[]$. The proposed algorithm computes the power of some element $A[j]$ in a trace $A'[i]$ by counting the number of different paths between the corresponding nodes $j$ and $i$ in $G$. Intuitively, each edge $\langle i, j \rangle \in E$ of the dependence graph $G$ indicates that $A[j]$ is an operand in the assignment statement to $A[i]$ of the GIR-loop. Thus, the power of $A[j]$ in the trace of $A'[i]$ is in fact the number of different paths leading from $j$ to $i$ in $G$. Computing all powers in every trace is therefore equivalent to counting all paths (CAP) between the nodes of $G$. The particular variant of CAP needed for GIR-loops is defined as follows:

Definition 1. Let $S$ be the set of nodes with in-degree 0 (the leaves or bottom nodes) of a DAG $G = \langle V, E \rangle$. Counting all the paths, $CAP(G)$, is an operation that returns a labeled graph $G' = \langle V, E' \rangle$ such that an edge $\langle i, j \rangle^{[x]}$, $i \in V \setminus S$, $j \in S$, with the label $[x]$ belongs to $G'$ iff there are exactly $x$ paths from $j$ to $i$ in $G$.

For example let $G$ be a double chain of $n$ nodes $v_1 \to v_2 \to \cdots \to v_n$, such that there are two edges from $v_i$ to $v_{i+1}$. In this case $G' = CAP(G)$ is a DAG such
that there is a single edge between $v_1$ and every $v_i$, labeled with the number of paths between them, $2^{i-1}$.

In order to solve a GIR loop we first create the dependence graph $G$, and then compute all the paths in $G$ in parallel: $G' = CAP(G)$. $G$ is constructed such that an edge $\langle i, j \rangle^{[x]} \in E'$ iff the power of $A[j]$ in the trace $A'[i]$ is exactly $x$. Finally, the trace of every element $A'[i]$ is obtained by computing

$$A'[i] = A[j_1]^{x_1} \odot \cdots \odot A[j_k]^{x_k}, \qquad \text{where } \langle i, j_l \rangle^{[x_l]} \in CAP(G),\ l = 1, \ldots, k.$$

Thus, once we have the powers $x_1, \ldots, x_k$, the trace can be computed in parallel in $\log k$ steps.

The dependence graph $G = \langle V, E \rangle$ induced by a GIR-loop is defined as follows:

$$V = \{ g(1), \ldots, g(n),\ f(1)', \ldots, f(n)',\ h(1)'', \ldots, h(n)'' \}$$

where $f(i)'$ (or $h(i)''$) represent initial values of $A[]$ that will form the trace of the $g(i)$ nodes. The edges in $E$ include, for $i = 1..n$:

- $\langle g(i), f(i) \rangle^{[1]}$ if there exists $j$, $j < i$, such that $g(j) = f(i)$;
- $\langle g(i), h(i) \rangle^{[1]}$ if there exists $j$, $j < i$, such that $g(j) = h(i)$;
- $\langle g(i), f(i)' \rangle^{[1]}$ if there is no $j$, $j < i$, such that $g(j) = f(i)$;
- $\langle g(i), h(i)'' \rangle^{[1]}$ if there is no $j$, $j < i$, such that $g(j) = h(i)$.

For example, $G$ of the loop $A[i] = A[i-1] \odot A[i-2]$ is given in fig. 6.

[Figure 6: The dependence graph produced by the recurrence $A_i = A_{i-1} \odot A_{i-2}$.]

Our algorithm for computing $CAP(G)$ uses $\log n$ iterations ($t = 1, \ldots, \log n$), where in each iteration we update the edges of the current graph $G_{t-1} = \langle V, E_{t-1} \rangle$ to form $G_t = \langle V, E_t \rangle$ as follows:

- $E_t = E_{t-1}$.
- Paths multiplication: for each $\langle v_i, v_k \rangle^{[x]} \in E_t$ and a successive edge $\langle v_k, v_j \rangle^{[y]} \in E_t$, we add a new edge $\langle v_i, v_j \rangle^{[x \cdot y]}$ to $E_t$ and mark $\langle v_k, v_j \rangle^{[y]}$ to be deleted.

[Figure 7: Paths multiplication.]

- Deleting marked edges: remove each marked edge from $E_t$. This step prevents us from recounting edges that were already taken into consideration in previous steps.
- Paths addition: for each node $v_i$ replace all double edges $\langle v_i, v_j \rangle^{[x_1]}, \ldots, \langle v_i, v_j \rangle^{[x_k]} \in E_t$ with a single edge labeled by their sum, $\langle v_i, v_j \rangle^{[\sum_{l=1}^{k} x_l]}$.

[Figure 8: Summing double edges.]

Two separate examples of the operation of the above algorithm are given in fig. 9. The new edges added (by path multiplication and path addition) in every iteration are denoted by dashed lines.

[Figure 9: Iterations of two graphs.]

The full algorithm, along with a version which avoids spawning unnecessary processes and a method for handling GIR with non-distinct $g$, is described in the full paper.

References

[1] John T. Feo, "An analysis of the computational and parallel complexity of the Livermore Loops", Journal of Parallel Computing, No. 7, 1988, pp. 163-185.
[2] H. S. Stone, "An Efficient Parallel Algorithm for the Solution of a Tridiagonal Linear System of Equations", J. ACM 20, 27 (1973).
[3] J. JaJa, "An Introduction to Parallel Algorithms", Addison-Wesley Publishing Company, 1992.
[4] P. M. Kogge, H. S. Stone, "A Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equations", IEEE Transactions on Computers, C-22(8):786-793, August 1973.
[5] G. Haber, Y. Ben-Asher, "On the Usage of simulators to detect inefficiency of parallel programs caused by "bad" schedulings: the SIMPARC approach", accepted for publication in the Journal of Systems and Software.