Parallel Solutions of Indexed Recurrence Equations

Yosi Ben-Asher
Dep. of Math and CS, Haifa University, 31905 Haifa, Israel
yosi@mathcs.haifa.ac.il

Gadi Haber
IBM Science and Technology, 31905 Haifa, Israel
haber@haifasc.vnet.ibm.com

Abstract

A new type of recurrence equations called indexed recurrences (IR) is defined, in which the common notion of X[i] = op(X[i], X[i-1]), i = 1...n, is generalized to X[g(i)] = op(X[f(i)], X[h(i)]), where f, g, h : {1...n} → {1...m}. This enables us to model sequential loops of the form

    for i = 1 to n do X[g(i)] := op(X[f(i)], X[h(i)]);

as IR equations. Thus, a parallel algorithm that solves a set of IR equations is in fact a way to transform sequential loops into parallel ones. Note that the circuit evaluation problem (CP) can also be expressed as a set of IR equations. Therefore an efficient parallel solution to the general IR problem is not likely to be found, as such a solution would also solve the CP, showing that P ⊆ NC. In this paper we introduce parallel algorithms for two variants of the IR equations problem:

- An O(log n) greedy algorithm for solving IR equations where g(i) is distinct and h(i) = g(i), using O(n) processors.
- An O(log n) algorithm with no restriction on f, g or h, using up to O(n ) processors.

However, we show that for general IR, op must be commutative so that a parallel computation can be used.

1 Introduction

We consider a certain generalization of ordinary recurrence equations called indexed recurrence (IR) equations. Given an initialized array A[1..m], a set of n IR equations has the form

    A[g(i)] := op(A[f(i)], A[h(i)]),  i = 1...n

which can be represented by a sequential loop of the form

    for i = 1 to n do A[g(i)] := op(A[f(i)], A[h(i)]);

where op is a binary associative operator and where f, g, h : {1..n} → {1..m} do not include references to elements of the A[] array itself. The goal is to use the parallel solutions of these IR equations in order to parallelize sequential loops whose execution can be simulated by a set of IR equations. This is similar to the way that parallel solutions of linear recurrences (A[i] = op(A[i-1], A[i])) are used to parallelize sequential loops [] of the form:

    for i = 1 to n do A[i] := op(A[i-1], A[i]);

In our work, we analyzed the well-known Livermore Loops [1] and checked how many of them fit into the general frame of IR equations compared with ordinary recurrence equations. There are 24 loops in this code, which is often used as a benchmark for parallelizing compilers and contains typical code for scientific computing. Out of the 24 loops we found that: loops ,7,8,,5,6, do not contain recurrences of any type; loops ,5,,9 contain linear recurrences; all other loops (except for ,0,) contain indexed recurrences.

2 Ordinary Indexed Recurrences

This section describes the parallel algorithm for computing a set of IR equations where g(i) is distinct and h(i) = g(i). This case is simpler than the general one, and the parallel algorithm we obtain is more efficient than the one for the general case, using O(n) processors. It is easiest to begin with the sequential algorithm, namely the following loop:

    Array A[1..m] with initial values;
    for i = 1 to n do A[g(i)] := A[f(i)] · A[g(i)];

For convenience, we have replaced the notation op(x, y) with x · y, where · is the suitable binary and associative operation. Note that op is not necessarily a commutative operation, therefore our algorithm should preserve the order of the multiplications (i.e. the order of operations).
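The sequential loop above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function name `ordinary_ir_sequential`, the index functions f(i) = i + 1 and g(i) = 2i, and the use of string concatenation as the operator are all assumptions chosen for the example. Concatenation is associative but not commutative, so the resulting strings show the order of the operands in each trace.

```python
def ordinary_ir_sequential(A, f, g, n, op):
    """Run A[g(i)] := op(A[f(i)], A[g(i)]) for i = 1..n and return the new array."""
    A = list(A)  # work on a copy; index 0 is unused padding for 1-based indexing
    for i in range(1, n + 1):
        A[g(i)] = op(A[f(i)], A[g(i)])
    return A


# Symbolic initial values a1..a8 (hypothetical data for the illustration).
initial = [None] + [f"a{k}" for k in range(1, 9)]

# Assumed index functions: g(i) = 2i is distinct, f(i) = i + 1.
result = ordinary_ir_sequential(
    initial,
    f=lambda i: i + 1,
    g=lambda i: 2 * i,
    n=4,
    op=lambda x, y: x + y,  # string concatenation: associative, not commutative
)

for k in range(1, 9):
    print(f"A'[{k}] = {result[k]}")
```

Cells that are never written keep their initial value, while a cell such as A'[6] accumulates the operands in the order in which the loop applied them.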

    for i = 1 to n do A[2i] := A[i + 1] · A[2i];

    After 8 iterations:
    A'[1] = A[1]          A'[5] = A[5]
    A'[2] = A[2]·A[2]     A'[6] = A[3]·A[4]·A[6]
    A'[3] = A[3]          A'[7] = A[7]
    A'[4] = A[3]·A[4]     A'[8] = A[5]·A[8]

Figure 1. An example of an Ordinary IR loop.

The above loop can be viewed as a function for computing a new value A' = F(A, n, f, g) (also denoted as OrdinaryIR(A, f, g, ·)), where A is the initial array and A' is the array after executing the loop. We therefore need to find a parallel algorithm that computes F(A, n, f, g) in fewer than n steps. This is analogous to the way in which prefix-sum [] is used to solve ordinary recurrence equations: F(A, op(x, y)) = prefix-sum(A, op(x, y)).

The value of A'[g(i)] is a product of a subset of the elements of A. As an example consider the loop in fig. 1, where in every iteration A[2i] is updated by A[i + 1] · A[2i]. Some of the elements A'[i] preserve their initial values, e.g. A'[7] = A[7] (since there is no i ≤ n such that g(i) = 7), while the traces of other elements contain the multiplication of several elements, e.g. A'[6] = A[3]·A[4]·A[6] (since g(3) = 6, f(3) = 4, and then g(2) = f(3) and f(2) = 3). Finally, A[3] is the last item in the trace of A'[6], since there is no i < 2 such that g(i) = f(2). The sequence of multiplications of every element in A'[] (also called the trace of A[g(i)]) is given by the following lemma:

Lemma 1. Let A'[i] denote the value of A[i] after the execution of the loop

    for i = 1,...,n do A[g(i)] = A[f(i)] · A[g(i)]

Then for all i = 1...n:

$$A'[g(i)] = A[f(j_k)] \cdots A[f(j_1)] \cdot A[g(i)]$$

such that:

- $j_1 = i$;
- for $t = 2 \ldots k$ the indices $j_t$ satisfy $j_t < j_{t-1}$ and $g(j_t) = f(j_{t-1})$;
- $j_k$ is the last index for which $g(j_t) = f(j_{t-1})$, i.e., there is no $j_{k+1} < j_k$ such that $g(j_{k+1}) = f(j_k)$.

Lemma 1 suggests a simple method for computing A'[g(i)] in parallel. Let $A^{-t}[g(i)]$ denote the sub-trace with the t + 1 rightmost elements in the trace of A'[g(i)], i.e.,

$$A^{-t}[g(i)] = A[f(j_{k-t})] \cdots A[f(i)] \cdot A[g(i)]$$

Consider the concatenation (or multiplication) of two successive sub-traces:

$$A^{-(t_1+t_2)}[g(i)] = A^{-t_2}[g(j)] \cdot A^{-t_1}[g(i)]$$

where $g(j) = f(j_{k-t_1})$ and $A[f(j_{k-t_1})]$ is the last element in $A^{-t_1}[g(i)]$. Note that A[g(j)] is multiplied twice, once as A[g(j)] and once as $A[f(j_{k-t_1})]$. This can be corrected by taking the trace of its predecessor $A^{-t_2}[f(j)]$, so that

$$A^{-(t_1+t_2)}[g(i)] = A^{-t_2}[f(j)] \cdot A^{-t_1}[g(i)]$$
$$= A[f(j'_{k-t_2})] \cdots A[f(j)] \cdot A[f(j_{k-t_1})] \cdots A[g(i)]$$
$$= A[f(j'_{k-t_2})] \cdots A[g(j')] \cdot A[g(j)] \cdots A[g(i)]$$

where j' < j and j' is the iteration number in which A[f(j)] is last updated in the loop.

The proposed algorithm is a simple greedy algorithm that keeps iterating until all traces are completed, where in each iteration all possible concatenations of successive sub-traces are computed in parallel. Thus, initially, we can compute in parallel the first product of each trace, A[g(i)] = A[f(i)] · A[g(i)] (for all i = 1,...,n). The concatenation operation of two successive sub-traces $A^{-t}[g(j)]$, $A^{-t}[g(i)]$ can be implemented using the following:

- the value of a sub-trace $A^{-t}[g(i)]$ is stored in its array element A[g(i)];
- a pointer N[g(i)] points to the sub-trace $A^{-t}[g(j)]$ to be concatenated to $A^{-t}[g(i)]$ (to form $A^{-(t+t)}[g(i)]$).

Hence, A[N[g(i)]] contains the value of the sub-trace $A^{-t}[g(j)]$. Initially all traces are of length 2, and can be computed in parallel. The code for a concatenation step of future iterations is therefore as follows:

- multiplication: $A[g(i)] = A^{-(t+t)}[g(i)] = A[N[g(i)]] \cdot A[g(i)]$
- pointer updating: $N[g(i)]^{-(t+t)} = N[N[g(i)]]$

where N[1,...,m] is initialized as follows:

$$N[x] = \begin{cases} f(i) & \exists\, i \le n \text{ such that } g(i) = x \\ 0 & \text{otherwise} \end{cases}$$

Since we start with traces of length 2, for each i = 1..n we set N[g(i)] = N[N[g(i)]].

The way in which the concatenation operation works is depicted in fig. 2, showing two parallel concatenations of sub-traces. The operation $N[g(i)]^{-(t+t)} = N[N[g(i)]]$ is depicted by the fact that the next-pointer of a new trace is taken to be that of the joined trace.

Figure 2. The concatenation operation of two traces.

The algorithm performs log n iterations. In each iteration, the above concatenation operation (the multiplication followed by the pointer updating) is performed in parallel for all traces A'[g(i)]. As a result, in each iteration, either a trace is fully computed or the number of elements in the product of a trace is doubled due to the multiplication $A^{-(t_1+t_2)}[g(i)] = A^{-t_2}[g(j)] \cdot A^{-t_1}[g(i)]$. Hence, log n iterations are sufficient. Clearly, once a trace has been completed (fully computed) we must not continue to concatenate any more traces to it. It therefore remains to determine when the computation of a trace has completed. In general, in every iteration and for every trace stored in A[g(i)], the algorithm must determine:

- the existence of A[g(j)] such that its trace can be concatenated to the trace of A[g(i)];
- whether the computation of the trace of A[g(i)] is completed, in which case no more redundant traces should be added to it.

A more efficient version of the algorithm, which forks only up to P processes at the same time, was programmed and tested on the SimParC [5] simulator. The complexity of this version is $T(n, P) = \frac{n \log n}{P}$. Figure 3 shows the results obtained for an array of size n = 50,000 and for P = #processors, $P \ll n$. The Y axis represents the complexity in units of assembly instructions. The code of the algorithm is given in the full paper.

Figure 3. The results of running the OrdinaryIR algorithm for n = 50,000 (Y axis: instructions; X axis: processors; curves: Parallel IR Solution, Original IR Loop).

3 Useful Application for the Ordinary IR Solution

Consider the following recurrence:

    for i = 1 to n do X[g(i)] := (A[i]·X[f(i)] + B[i]) / (C[i]·X[f(i)] + D[i]);

This is not an ordinary IR recurrence due to the nonassociative nature of the operators $f_i(x) = \frac{a_i x + b_i}{c_i x + d_i}$ (where i = 1, 2, ..., n). However, we can transform the recurrence into an ordinary IR problem by exploiting a useful quality of these operators, as shown in the following lemma.

Lemma 2. Let there be two sets of functions $f_i(x)$ and $g_i(x)$ defined as follows:

$$f_i(x) = \frac{a_i x + b_i}{c_i x + d_i}, \qquad g_i(x) = \frac{e_i x + f_i}{g_i x + h_i}$$

Then $f_i(g_i(x)) = \frac{k_i x + l_i}{m_i x + n_i}$, where

$$\begin{pmatrix} k_i & l_i \\ m_i & n_i \end{pmatrix} = \begin{pmatrix} a_i & b_i \\ c_i & d_i \end{pmatrix} \otimes \begin{pmatrix} e_i & f_i \\ g_i & h_i \end{pmatrix}$$

The operation ⊗ is defined as follows for 2×2 matrices:

$$A \otimes B = \begin{cases} A & \text{if } \det(A) = 0 \\ A \cdot B & \text{otherwise} \end{cases}$$

From lemma 2, also known as the Moebius Transformation, it follows that the values X[g(1)] ... X[g(n)] of the recurrence shown above can be computed by the following steps:

1. Initialize all matrices with the appropriate coefficients:

       M_{1..m} initialized to $\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$
       forall i ∈ {1..n} do in parallel $M_{g(i)} := \begin{pmatrix} A[i] & B[i] \\ C[i] & D[i] \end{pmatrix}$

2. Multiply the matrices:

       for i = 1 to n do $M_{g(i)} := M_{f(i)} \otimes M_{g(i)}$

3. Calculate the values of X[g(1)] ... X[g(n)]:

       forall i ∈ {1..n} do in parallel $X[g(i)] := \frac{m_{11} S[g(i)] + m_{12}}{m_{21} S[g(i)] + m_{22}}$, where $\begin{pmatrix} m_{11} & m_{12} \\ m_{21} & m_{22} \end{pmatrix} = M_{g(i)}$

Note that since step 2 is an ordinary IR, we can replace it with a call to OrdinaryIR(M, f, g, ⊗) (where ⊗ is the modified matrix multiplication operation from lemma 2). Thus we have transformed the recurrence into an ordinary IR problem, which we already know how to solve.

We can also produce a parallel solution to a slightly more complicated recurrence of the following form:

    for i = 1 to n do X[g(i)] := X[g(i)] + (A[i]·X[f(i)] + B[i]) / (C[i]·X[f(i)] + D[i]);

Since g(i) is distinct, we can rewrite the above recurrence by replacing the variable X[g(i)] on the right-hand side of the ':=' sign with its initial value S[g(i)], without affecting the final values of X[g(1)] ... X[g(n)]. This is allowed since the distinctness property of g(i) guarantees that each assignment to X[g(i)] is the first one, and therefore each reference to X[g(i)] is a reference to its initial value. Thus we can bring the loop to its Moebius form as follows:

    for i = 1 to n do X[g(i)] := ((S[g(i)]·C[i] + A[i])·X[f(i)] + (S[g(i)]·D[i] + B[i])) / (C[i]·X[f(i)] + D[i]);
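To make the matrix view of these fractional-linear operators concrete, the sketch below checks numerically that composing two such functions agrees with applying the product of their 2×2 coefficient matrices, including the degenerate case det(A) = 0 handled by the modified product. The helper names (`mobius`, `matmul2`, `modified_product`) and the sample coefficients are hypothetical, chosen only for this illustration; exact rational arithmetic from the standard `fractions` module avoids rounding issues.

```python
from fractions import Fraction


def mobius(M, x):
    """Apply x -> (a*x + b) / (c*x + d) for the coefficient matrix M = ((a, b), (c, d))."""
    (a, b), (c, d) = M
    x = Fraction(x)
    return (a * x + b) / (c * x + d)


def matmul2(P, Q):
    """Ordinary product of two 2x2 matrices given as nested tuples."""
    (a, b), (c, d) = P
    (e, f), (g, h) = Q
    return ((a * e + b * g, a * f + b * h),
            (c * e + d * g, c * f + d * h))


def modified_product(P, Q):
    """Modified product: return P itself when det(P) = 0, otherwise the ordinary product."""
    (a, b), (c, d) = P
    if a * d - b * c == 0:
        return P  # P encodes a constant function, so the inner function is irrelevant
    return matmul2(P, Q)


F = ((2, 1), (1, 3))       # f(x) = (2x + 1) / (x + 3)       (sample coefficients)
G = ((1, 4), (2, 5))       # g(x) = (x + 4) / (2x + 5)
CONST = ((2, 4), (1, 2))   # (2x + 4) / (x + 2) = 2, determinant 0
x = Fraction(7, 2)

composed = mobius(modified_product(F, G), x)   # evaluate via the matrix product
direct = mobius(F, mobius(G, x))               # evaluate f(g(x)) directly
print(composed, direct)

degenerate = mobius(modified_product(CONST, G), x)
print(degenerate, mobius(CONST, mobius(G, x)))
```

Because the product of coefficient matrices is associative, a chain of such updates can be handed to any solver for associative operators, which is the property the reduction to an ordinary IR problem relies on.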

Bringing the loop to its Moebius form produces the following Moebius matrices:

$$\forall i:\quad M_{g(i)} = \begin{pmatrix} S[g(i)]\,C[i] + A[i] & S[g(i)]\,D[i] + B[i] \\ C[i] & D[i] \end{pmatrix}$$

As an example consider the recurrence taken from loop number 23 of the Livermore Loops benchmark [1]. The loop is a 2-D Implicit Hydrodynamics fragment:

    X[1..n, 1..7] initialized to S
    for j = 2 to 6 do
        for i = 2 to n do
            X[i, j] := X[i, j] + 0.175d0 (Y[i] + X[i-1, j]·Z[i, j]);

The inner loop can be viewed as an ordinary IR problem OrdinaryIR(M, f, g, ⊗) where g(i) = 7(i-1) + j, f(i) = 7(i-2) + j,

$$\forall i:\quad M_{g(i)} = \begin{pmatrix} 0.175\,Z[i,j] & S[g(i)] + 0.175\,Y[i] \\ 0 & 1 \end{pmatrix}$$

and where ⊗ is the operator from lemma 2. Thus, without using any data dependence analysis techniques, we managed to parallelize the loop so that it is calculated in O(log n) steps.

4 General Indexed Recurrences

We now consider a more general case of IR equations (called GIR) which can be modeled by the loop:

    for i = 1 to n do begin A[g(i)] := A[f(i)] · A[h(i)];

The greedy method used for the IR case (where g(i) = h(i)) is not suitable for GIR. Essentially, this is due to the difference in the structure of the trace A'[g(i)] in the two cases. As depicted in fig. 4, A'[g(i)] in the GIR case is a binary tree, whereas in the IR case A'[g(i)] is a list.

Figure 4. Tree structure versus list structure of the trace (GIR: g(i) = i, f(i) = i-1, h(i) = i-2, A[i] = A[i-1]·A[i-2]; IR: g(i) = i, f(i) = i-1, A[i] = A[i-1]·A[i]).

The tree structure of the trace implies that the operator must be a commutative one. Clearly, the multiplication of trace values can then be done either from the left or from the right end of a current trace value. The other problem that a GIR loop presents us with is that traces can have an exponential length. For example consider the loop 'for i = 2..n A[i] := A[i-1] · A[i-2]', where A[0] = A[1] = a. In this example the trace A'[n] is a power of a whose exponent, and hence the number of multiplications, grows exponentially with n. Therefore, in order for the parallelization of GIR loops to be efficient, the computation of a power ($A[i]^k$) must be regarded as an atomic operation. This assumption can also be found in previous works (e.g. []) where the multiplication operation was used in order to solve recurrences of additions.

The GIR algorithm must therefore gather all identical elements of a trace and then, using the power operation, compute their product in a single operation. As an example, consider the above loop (A[i] := A[i-1]·A[i-2]), where A[0] and A[1] have different initial values. After the execution of the loop the trace is a multiplication of two powers, $A'[i] = A[0]^{fib(i-2)} \cdot A[1]^{fib(i-1)}$, where fib(i) is the i-th Fibonacci number (with fib(0) = fib(1) = 1). This trace is thus best computed by first counting the powers of A[0] and A[1] in every trace separately (see figure 5). Indeed, counting powers is sufficient to compute the traces not only for the above loop, but for any GIR loop as well.

Figure 5. The expansion of the recurrence $X_i = X_{i-1} \cdot X_{i-2}$ for n = 4, giving $A'[4] = A[0]^2 \cdot A[1]^3$.

Counting all powers of the elements of A[] can be done using an initial dependence graph $G = \langle V, E \rangle$, showing dependences among the final values of the elements of A[]. The proposed algorithm computes the power of some element A[j] in a trace A'[i] by counting the number of different paths between the corresponding nodes j and i in G. Intuitively, each edge $\langle i, j \rangle \in E$ of the dependence graph G indicates that A[j] is an operand in the assignment statement to A[i] of the GIR loop. Thus, the power of A[j] in the trace of A'[i] is in fact the number of different paths leading from j to i in G. Computing all powers in every trace is therefore equivalent to counting all paths (CAP) between the nodes of G. The particular variant of CAP needed for GIR loops is defined as follows:

Definition 1. Let S be the set of nodes with in-degree 0 (the leaves or bottom nodes) of a DAG $G = \langle V, E \rangle$. Counting all the paths, CAP(G), is an operation that returns a labeled graph $G' = \langle V, E' \rangle$ such that an edge $\langle i, j \rangle^{[x]}$, $i \notin S$, $j \in S$, with the label [x] belongs to G' iff there are exactly x paths from j to i in G.

For example let G be a double chain of n nodes $v_1 \rightarrow v_2 \rightarrow \cdots \rightarrow v_n$, such that there are two edges from $v_i$ to $v_{i+1}$. In this case G' = CAP(G) is a DAG such that there is a single edge from $v_1$ to every $v_i$ of the form $\langle v_1, v_i \rangle^{[2^{i-1}]}$.

In order to solve a GIR loop we first create the dependence graph G, and then compute all the paths in G in parallel, G' = CAP(G). G is constructed such that an edge $\langle i, j \rangle^{[x]} \in E'$ iff the power of A[j] in the trace A'[i] is exactly x. Finally, the trace of every element A'[i] is obtained by computing

$$A'[i] = A[j_1]^{x_1} \cdots A[j_k]^{x_k}, \quad \text{where } \langle i, j_l \rangle^{[x_l]} \in CAP(G),\ l = 1, \ldots, k$$

Thus, once we have the powers $x_1, \ldots, x_k$ the trace can be computed in parallel in log k steps.

The dependence graph $G = \langle V, E \rangle$ induced by a GIR loop is defined as follows:

$$V = \{ g(1), \ldots, g(n),\ f(1)', \ldots, f(n)',\ h(1)'', \ldots, h(n)'' \}$$

where f(i)' (or h(i)'') represent initial values of A[] that will form the trace of the g(i) nodes. The edges in E include, for i = 1..n:

- $\langle g(i), f(i) \rangle^{[1]}$ if there exists j, j < i, such that g(j) = f(i);
- $\langle g(i), h(i) \rangle^{[1]}$ if there exists j, j < i, such that g(j) = h(i);
- $\langle g(i), f(i)' \rangle^{[1]}$ if there is no j, j < i, such that g(j) = f(i);
- $\langle g(i), h(i)'' \rangle^{[1]}$ if there is no j, j < i, such that g(j) = h(i).

For example, G of the loop A[i] = A[i-1] · A[i-2] is given in fig. 6.

Figure 6. The dependence graph produced by the recurrence $A_i = A_{i-1} \cdot A_{i-2}$ for three consecutive values of i.

Our algorithm for computing CAP(G) uses log n iterations (t = 1,...,log n), where in each iteration we update the edges of the current graph $G_{t-1} = \langle V, E_{t-1} \rangle$ to form $G_t = \langle V, E_t \rangle$ as follows:

- $E_t = E_{t-1}$.
- Paths multiplication: for each $\langle v_i, v_k \rangle^{[x]} \in E_t$ and a successive edge $\langle v_k, v_j \rangle^{[y]} \in E_t$, we add a new edge $\langle v_i, v_j \rangle^{[xy]}$ to $E_t$ and mark $\langle v_k, v_j \rangle^{[y]}$ to be deleted (fig. 7).
- Deleting marked edges: remove each marked edge from $E_t$. This step prevents us from recounting edges that were already taken into consideration in previous steps.
- Paths addition: for each node $v_i$ replace all double edges $\langle v_i, v_j \rangle^{[x_1]}, \ldots, \langle v_i, v_j \rangle^{[x_k]} \in E_t$ with a single edge (labeled by their sum) $\langle v_i, v_j \rangle^{[\sum_{l=1}^{k} x_l]}$ (fig. 8).

Figure 7. Paths multiplication.

Figure 8. Summing double edges.

Two separate examples of the operation of the above algorithm are given in fig. 9. The new edges added (by path multiplication and path addition) in every iteration are denoted by dashed lines.

Figure 9. Iterations of two graphs.

The full algorithm, along with a version which avoids spawning unnecessary processes and a method for handling GIR with non-distinct g, is described in the full paper.

References

[1] John T. Feo, "An analysis of the computational and parallel complexity of the Livermore Loops", Journal of Parallel Computing, No. 7, 1988, pp. 163-185.
[2] H. S. Stone, "An efficient Parallel Algorithm for the Solution of a Tridiagonal Linear System of equations", J. ACM 20, 27 (1973).
[3] J. JaJa, "An Introduction to parallel algorithms", Addison-Wesley publishing company, 1992.
[4] P. M. Kogge, H. S. Stone, "A Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equations", IEEE Transactions on Computers, C-22(8):786-793, August 1973.
[5] G. Haber, Y. Ben-Asher, "On the Usage of simulators to detect inefficiency of parallel programs caused by "bad" schedulings: the SIMPARC approach", accepted for publication in the Journal of Systems and Software.
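As an illustration of the path-counting idea from the section on general indexed recurrences, the sketch below computes, for every assigned cell, how many dependence paths lead to each initial element, and then rebuilds the final values from those powers. It is a plain sequential dynamic program over the dependence DAG, not the parallel log n edge-doubling procedure described in the paper; the function name `gir_powers`, the assumption that g is distinct, and the sample values A[0] = 2, A[1] = 3 are all choices made for this example.

```python
from collections import Counter


def gir_powers(first, last, f, g, h):
    """Count dependence paths for the loop A[g(i)] := A[f(i)] * A[h(i)], i = first..last.

    Returns {cell: Counter({initial_cell: power})}; g is assumed to be distinct.
    """
    trace = {}  # cell written so far -> path counts to the initial (leaf) elements
    for i in range(first, last + 1):
        counts = Counter()
        for operand in (f(i), h(i)):
            if operand in trace:
                counts.update(trace[operand])  # edge to an earlier result: add its path counts
            else:
                counts[operand] += 1           # edge to a leaf holding an initial value
        trace[g(i)] = counts
    return trace


# Example loop: A[i] := A[i-1] * A[i-2] for i = 2..8.
n = 8
powers = gir_powers(2, n, f=lambda i: i - 1, g=lambda i: i, h=lambda i: i - 2)

# Direct sequential execution with ordinary (commutative) integer multiplication.
A = {0: 2, 1: 3}
for i in range(2, n + 1):
    A[i] = A[i - 1] * A[i - 2]

# Rebuild every value from its powers, treating exponentiation as one operation.
rebuilt = {i: 2 ** powers[i][0] * 3 ** powers[i][1] for i in range(2, n + 1)}
print(dict(powers[4]), A[4], rebuilt[4])
```

The exponents follow the Fibonacci pattern, which is why treating a power as a single operation keeps the work manageable even though the fully expanded trace grows exponentially.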