Effects of Parallelism Degree on Run-Time Parallelization of Loops


Chengzhong Xu
Department of Electrical and Computer Engineering, Wayne State University, Detroit, MI

Abstract

Due to the overhead for exploiting and managing parallelism, run-time loop parallelization techniques that aim to maximize parallelism do not necessarily lead to the best performance. In this paper, we present two parallelization techniques that exploit different degrees of parallelism for loops with dynamic cross-iteration dependences. The DOALL approach exploits iteration-level parallelism. It restructures the loop into a sequence of do-parallel loops, separated by barrier operations; iterations of a do-parallel loop are run in parallel. By contrast, the DOACROSS approach exposes fine-grained reference-level parallelism. It allows dependent iterations to be run concurrently by inserting point-to-point synchronization operations to preserve dependences. The DOACROSS approach has variants that identify different amounts of parallelism among consecutive reads to the same memory location. We evaluate the algorithms for loops with various structures, memory access patterns, and computational workloads on symmetric multiprocessors. The algorithms are scheduled using block cyclic decomposition strategies. The experimental results show that the DOACROSS technique outperforms the DOALL, even though the latter is widely used in compile-time parallelization of loops. Of the DOACROSS variants, the algorithm allowing partially concurrent reads performs best because it incurs only slightly more overhead than the algorithm disallowing concurrent reads. The benefit from allowing fully concurrent reads is significant for small loops that do not have enough parallelism. However, it is likely to be outweighed by its cost for large loops or loops with light workload.

1 Introduction

Loop parallelization exploits parallelism among instruction sequences or loop iterations. Techniques for exploiting instruction-level parallelism are prevalent in today's microprocessors. On multiprocessors, loop parallelization techniques focus on loop-level parallelism. They partition and allocate loop iterations among processors with respect to cross-iteration dependences. Their primary objective is to expose enough parallelism to keep processors busy all the time while minimizing synchronization overheads.

On multiprocessors, there are two important parallelization techniques that exploit different degrees of parallelism. The DOALL technique takes loop iterations as the basic scheduling and execution units [2, 3]. It decomposes the iterations into a sequence of subsets, called wavefronts. Iterations within the same wavefront are run in parallel, and a barrier synchronization operation is used to preserve cross-iteration dependences between two wavefronts. The DOALL technique reduces the run-time scheduling overhead at the sacrifice of a certain amount of parallelism. By contrast, the DOACROSS technique exploits fine-grained reference-level parallelism. It allows dependent iterations to be run concurrently by inserting point-to-point synchronization operations to preserve dependences among memory references. The DOACROSS technique maximizes parallelism at the expense of frequent synchronization. Note that in the literature the terms DOACROSS and DOALL often refer to loops with and without cross-iteration dependences, respectively.
We borrow the terms as names of parallelization techniques in this paper because the DOALL technique essentially restructures a loop into a sequence of DOALL loops. Both the DOALL and DOACROSS techniques are used to parallelize DOACROSS loops. Chen and Yew [5] studied programs from the PERFECT benchmark suite and revealed the significant advantages of parallelizing DOACROSS loops. DOACROSS loops can be characterized as static or dynamic in terms of the time when cross-iteration dependence information becomes available (at compile-time or at run-time). Figure 1 shows an example of a dynamic loop, due to the presence of indirect access patterns on the data array X. Dynamic loops appear frequently in scientific and engineering applications [16]. Examples include SPICE for circuit simulation, CHARMM and DISCOVER for molecular dynamics simulation of organic systems, and FIDAP for modeling complex fluid flows [4].

    for i = 1 to n do
        ... = X[v[i]] + ...
        X[u[i]] = ...
    endfor

Figure 1: A general form of loops with indirect access patterns, where u and v are input-dependent functions.

For parallelizing static loops, the DOALL technique plays a dominant role because it employs a simple execution model after exploiting parallelism at compile-time [8]. For dynamic loops, however, this may not be the case. Since parallelism in a dynamic loop has to be identified at run-time, the cost of building wavefronts in the DOALL technique becomes a major source of run-time overhead.
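A C rendition of the Figure 1 loop may help make the problem concrete. The array names and sizes below are illustrative assumptions; the point is only that u and v are filled in at run time, so the cross-iteration dependences cannot be resolved by a compiler.

    #include <stdio.h>
    #include <stdlib.h>

    #define N 1024              /* number of iterations (illustrative)      */
    #define M 512               /* number of elements of X (illustrative)   */

    static double X[M];
    static int u[N], v[N];      /* index arrays, known only at run time     */

    static void target_loop(void) {
        for (int i = 0; i < N; i++) {
            /* Whether iteration i depends on an earlier iteration k hinges on
               whether u[k] equals v[i] or u[i], or v[k] equals u[i]; none of
               this can be decided until u and v have been read. */
            X[u[i]] = X[v[i]] + 1.0;
        }
    }

    int main(void) {
        for (int i = 0; i < N; i++) {   /* stand-in for reading u, v from input */
            u[i] = rand() % M;
            v[i] = rand() % M;
        }
        target_loop();
        printf("X[0] = %f\n", X[0]);
        return 0;
    }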

The DOACROSS technique incurs run-time overhead for analyzing reference-wise dependences, but it provides more flexibility to subsequent scheduling. This paper compares the DOALL and DOACROSS approaches for parallelizing loops at run-time, focusing on the effect of parallelism degree on the performance of parallelization. In [18], we devised two DOACROSS algorithms exploiting different amounts of parallelism and demonstrated their effectiveness on symmetric multiprocessors. This paper presents a DOALL algorithm that allows parallel construction of the wavefronts and compares this algorithm with the DOACROSS algorithms, focusing on the influence of parallelism degree. We show that the DOACROSS algorithms have advantages over the DOALL, even though the latter is preferred for compile-time parallelization. Of the DOACROSS variants, the algorithm that allows fully concurrent reads may over-expose parallelism for large loops; its benefits are outweighed by the run-time cost of exploiting and managing the extra amount of parallelism for large loops or loops with light workload.

The rest of the paper is organized as follows. Section 2 reviews run-time parallelization techniques and qualitatively compares the DOACROSS and DOALL techniques. Section 3 briefly presents three DOACROSS algorithms that expose different amounts of parallelism. Section 4 presents a DOALL algorithm. Section 5 evaluates the algorithms, focusing on the effects of parallelism degree and granularity. Section 6 concludes the paper with a summary of evaluation results.

2 Run-time Parallelization Techniques

In the past, many run-time parallelization algorithms have been developed for different types of loops on both shared-memory and distributed-memory machines [6, 9, 14]. Most of the algorithms follow a so-called INSPECTOR/EXECUTOR approach. With this approach, a loop under consideration is transformed at compile-time into an inspector routine and an executor routine. At run-time, the inspector detects cross-iteration dependences and produces a parallel schedule; the executor performs the actual loop operations in parallel based on the dependence information exposed by the inspector. The keys to success with this approach are to shorten the time spent on dependence analysis without losing valuable parallelism and to reduce the synchronization overhead in the executor.

An alternative to the INSPECTOR/EXECUTOR approach is a speculative execution scheme that was recently proposed by Rauchwerger, Amato, and Padua [13]. In the speculative execution scheme, the target loop is first handled as a doall regardless of its inherent parallelism degree. If a subsequent test at run-time finds that the loop was not fully parallel, the whole computation is rolled back and executed sequentially. Although speculative execution yields good results when the loop is in fact executable as a doall, it fails in most applications that have only partially parallel loops.

The INSPECTOR/EXECUTOR scheme provides a run-time parallelization framework and leaves strategies for dependence analysis and scheduling unspecified. The scheme can also be restructured to decouple the scheduling function from the inspector and to merge it with the executor. The scheduling function can even be extracted to serve as a stand-alone routine between the inspector and the executor. There are many run-time parallelization algorithms belonging to the INSPECTOR/EXECUTOR scheme.
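As a concrete illustration of this division of labor, the following is a minimal, sequential C sketch of an INSPECTOR/EXECUTOR pair for the loop of Figure 1. It is a deliberately conservative toy, not any of the algorithms discussed in this paper: the inspector serializes all accesses to the same element, including read-read pairs that the algorithms below overlap, and the executor runs each wavefront sequentially where a real executor would run it on multiple threads with a barrier between wavefronts. All names and sizes are illustrative.

    #include <stdio.h>
    #include <stdlib.h>

    #define N 16                 /* iterations (illustrative)             */
    #define M 8                  /* elements of X (illustrative)          */

    static double X[M];
    static int u[N], v[N];
    static int wf[N];            /* wavefront level of each iteration     */

    /* Inspector: one sequential pass over the references.  It serializes every
       access to the same element, so the schedule is conservative but safe.  */
    static void inspector(void) {
        int last[M];             /* level of the last access to element e */
        for (int e = 0; e < M; e++) last[e] = -1;
        for (int i = 0; i < N; i++) {
            int lv = last[v[i]] > last[u[i]] ? last[v[i]] : last[u[i]];
            wf[i] = lv + 1;      /* one level after the latest conflict    */
            last[v[i]] = wf[i];
            last[u[i]] = wf[i];
        }
    }

    static void body(int i) {    /* the original loop body */
        X[u[i]] = X[v[i]] + 1.0;
    }

    /* Executor: one pass per wavefront.  Iterations of a wavefront are mutually
       independent; a parallel executor runs them on different threads and puts
       a barrier where the comment indicates. */
    static void executor(void) {
        int depth = 0;
        for (int i = 0; i < N; i++) if (wf[i] > depth) depth = wf[i];
        for (int k = 0; k <= depth; k++) {
            for (int i = 0; i < N; i++)
                if (wf[i] == k) body(i);
            /* barrier between consecutive wavefronts */
        }
    }

    int main(void) {
        for (int i = 0; i < N; i++) { u[i] = rand() % M; v[i] = rand() % M; }
        inspector();
        executor();
        for (int i = 0; i < N; i++) printf("wf[%d] = %d\n", i, wf[i]);
        return 0;
    }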
They differ from each other mainly in their structures and in the strategies used in each routine, in addition to the type of target loops considered. Pioneering work on using the INSPECTOR/EXECUTOR scheme for run-time parallelization is due to Saltz and his colleagues [15]. They considered loops without output dependences (i.e., the indexing function used in the assignments of the loop body is an identity function) and proposed an effective DOALL INSPECTOR/EXECUTOR scheme. Its inspector partitions the set of iterations into a number of wavefronts, which maintain cross-iteration flow dependences. Iterations within the same wavefront can be executed concurrently, but those in different wavefronts must be processed in order. The executor of the DOALL scheme enforces anti-flow dependences during the execution of iterations in the same wavefront. The DOALL INSPECTOR/EXECUTOR scheme has been shown to be effective in many real applications. It is applicable, however, only to loops without output dependences. The basic scheme was recently generalized by Leung and Zahorjan to allow general cross-iteration dependences [10]. In their algorithm, the inspector generates a wavefront-based schedule and maintains output and anti-flow dependences as well as flow dependences; the executor simply performs the loop operations according to the wavefronts of iterations. Note that the inspector in the above scheme is sequential: it requires time commensurate with that of a serial loop execution. Parallelization of the inspector loop was also investigated by Saltz et al. [15] and by Leung and Zahorjan [9]. Their techniques respect flow dependences, but ignore anti-flow and output dependences. Most recently, Rauchwerger, Amato, and Padua presented a parallel inspector algorithm for a general form of loops [14]. They extracted the scheduling function and explicitly presented an inspector/scheduler/executor scheme.

DOALL INSPECTOR/EXECUTOR schemes take a loop iteration as the basic scheduling unit in the inspector and the basic synchronization object in the executor. An alternative is the family of DOACROSS INSPECTOR/EXECUTOR parallelization techniques, which take a memory reference in the loop body as the basic unit of scheduling and synchronization. Processors running the executor are assigned iterations in a wrapped manner, and each spin-waits as needed for the operations that are prerequisites of its own execution. An early study of DOACROSS run-time parallelization techniques was conducted by Zhu and Yew [20]. They proposed a scheme that integrates the functions of dependence analysis and scheduling into a single executor. Later, the scheme was improved by Midkiff and Padua to allow concurrent reads of the same array element by several iterations [12]. Even though the integrated scheme allows concurrent analysis of cross-iteration dependences, the tight coupling of the dependence analysis and the executor causes high synchronization overhead in the executor. Most recently, Chen et al. developed the DOACROSS technique further by decoupling the dependence analysis from the executor [6]. We refer to their technique as the CTY algorithm. Separation of the inspector and executor not only reduces synchronization overhead in the executor, but also provides the possibility of reusing the dependence information developed in the inspector across multiple invocations of the same loop.

Their inspector is parallel, at the sacrifice of concurrent reads to the same array element. Their algorithm was recently improved further by Xu and Chaudhary by allowing concurrent reads of the same array element in different iterations and by increasing the overlap of dependent iterations [18].

DOALL and DOACROSS are two competing techniques for run-time loop parallelization. DOALL parallelizes loops at the iteration level, while DOACROSS supports parallelism at a fine-grained memory-access level. Consider the loop and index arrays shown in Figure 2. The first two iterations can be either independent (when exp(1) is false and exp(2) is true), flow dependent (when exp(1) is true and exp(2) is false), anti-flow dependent (when both exp(1) and exp(2) are true), or output dependent (when both exp(1) and exp(2) are false). The nondeterministic cross-iteration dependences are due to control dependences between statements in the loop body. We call such dependences conditional cross-iteration dependences.

    for i = 1 to n do
        if ( exp(i) )
            X[u1[i]] = F(X[v1[i]], ...)
        else
            X[u2[i]] = F(X[v2[i]], ...)
    endfor

Figure 2: An example of loops with conditional cross-iteration dependences, where F is an arbitrary operator and u1, u2, v1, and v2 are index arrays.

Control dependences can be converted into data dependences by an if-conversion technique at compile-time [1]. The compile-time technique, however, may not be helpful for loop-carried dependence analysis at run-time. With the DOALL technique, loops with conditional cross-iteration dependences must be handled sequentially. The DOACROSS technique, however, can handle this class of loops easily: at run-time, the executor, upon testing a branch condition, may mark all operands in the non-taken branch as available so as to release processors waiting for those operands. Furthermore, the DOACROSS technique overlaps dependent iterations. The first two iterations in Figure 2 have an anti-flow dependence when both exp(1) and exp(2) are true. The read in the second iteration, however, can be overlapped with the execution of iteration 1 without destroying the anti-flow dependence. The DOACROSS INSPECTOR/EXECUTOR parallelization technique thus provides the potential to exploit fine-grained parallelism across loop iterations. Fine-grained parallelism, however, does not necessarily lead to overall performance gains without an efficient implementation of the executor. One main contribution of this paper is to show that multi-threaded implementations favor the DOACROSS technique.

3 The Time-Stamp DOACROSS Algorithms

This section briefly presents three DOACROSS algorithms that feature parallel dependence analysis and scheduling. They expose different amounts of parallelism among consecutive reads to the same memory location. For more details, please see [18]. Consider the general form of loops in Figure 1. It defines a two-dimensional iteration-reference space. The inspector of a time-stamp algorithm examines the memory references in a loop and constructs a dependence chain for each data array element of the loop. In addition to the precedence order, the inspector also assigns a stamp to each reference in a dependence chain, which indicates its earliest access time relative to the other references in the chain. A reference can be activated if and only if the preceding references have finished. The executor schedules the references of a chain through a logical clock: a reference is allowed to proceed only when the clock of its chain has reached its stamp. Dependence chains are associated with clocks ticking at different speeds.
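To fix ideas, the fragment below sketches what a worker thread might execute for a single memory reference under such a clocking scheme: it spins until the chain's clock has reached the reference's stamp, performs the access, and then advances the clock. The stamp and the clock advance are assumed to have been produced by an inspector; the exact stamping and clock-advance rules of the CTY, PCR, and FCR algorithms are given below and in [18], so this is only the general shape, not any of those algorithms verbatim.

    #include <stdatomic.h>

    #define M 8                          /* number of elements, i.e., chains */

    static atomic_int chain_clock[M];    /* one logical clock per chain      */

    /* Executed by a worker thread for one memory reference.  Reads of the same
       group share a stamp, so several of them may pass the wait at once and
       overlap; a write waits until all of its predecessors have advanced the
       clock to its stamp. */
    void do_reference(int elem, int stamp, int advance, void (*access)(int)) {
        while (atomic_load(&chain_clock[elem]) < stamp)
            ;                            /* spin: predecessors not finished   */
        access(elem);                    /* perform the read or write         */
        atomic_fetch_add(&chain_clock[elem], advance);
    }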
Assume the stamps of references are discrete integers. The stamps are stored in a two-dimensional array stamp. Let (i, j) denote the j-th access of the i-th iteration; stamp[i][j] represents the stamp of reference (i, j). By scanning through the iteration space sequentially, processors can easily construct a time-stamp array with the following features: the stamp difference between any two directly connected references is one, except for pairs of write-after-read and read-after-read references; both reads in a read-after-read pair have the same stamp; and in a write-after-read pair, the difference is the size of the read group minus one. Figure 3 shows an example derived from the target loop for given index arrays u and v.

Figure 3: Sequentially constructed dependence chains labeled by array elements; the numbers in parentheses are the stamps of the references.

Based on such time-stamp arrays, a simple clock rule in the executor can preserve all dependences and allow consecutive reads to be run in parallel.

3.1 A Parallel Algorithm Allowing Partially Concurrent Reads (PCR)

Building dependence chains requires examining all references in the loop at least once. To construct the time-stamp array in parallel, one key issue is how to stamp the references of a dependence chain that crosses iteration regions on different processors. Since no processor (except the first) has knowledge about the references in previous regions, it is unable to stamp its local references in a local chain without knowing the assignment of the chain's head. To allow processors to continue examining the other references in their local regions in parallel, the time-stamp inspector uses a conservative approach: it assigns a conservative number to the second reference of a local chain and leaves the first one to be decided in a subsequent global analysis.

Figure 4: A fully stamped dependence chain labeled by array element; the numbers in parentheses are the stamps of the references.

Using this conservative approach, most of the stamp table can be constructed in parallel. Upon completion of the local analysis, processors communicate with each other to determine the stamps of the undecided references in the stamp table. Figure 4 shows the complete dependence chains associated with three array elements. Processor 3 temporarily assigns 26 to the reference (12, 1), assuming that all 24 accesses in regions 0 to 2 are in the same dependence chain. In the subsequent cross-processor analysis, processor 2 sets stamp[8][1] after communicating with its preceding processors (processor 1 records no reference to the same location). At the same time, processor 3 communicates with processor 2, but gets an undecided stamp for the reference (8, 1), and hence assigns another conservative number, 16 plus 1, to reference (12, 0), assuming that all accesses in regions 0 and 1 are in the same dependence chain; the extra one is due to the dependent references in region 2. Note that the communications from processor 3 to processor 2 and from processor 2 to processor 1 proceed in parallel; processor 2 can provide processor 3 only the number of references in its local region until its own communication with processor 1 has finished.

Accordingly, the time-stamp algorithm uses a special clocking rule that sets the clock of a dependence chain to n + 2 if an activated reference in region r is a local head, where n is the total number of accesses from region 0 to region r-1. For example, the reference (2, 0) in Figure 4 first triggers the reference (4, 1). Activation of reference (4, 1) sets the clock to 10, because there are 8 accesses in the first region, which consequently triggers the following two reads.

Note that this parallel inspector algorithm only allows consecutive reads in the same region to be performed in parallel. Read operations in different regions must be performed sequentially even though they are totally independent of each other. In the dependence chain of Figure 4, for example, the reads (9, 0) and (10, 0) are activated after the reads (6, 0) and (7, 0). We could assign reads (9, 0) and (10, 0) the same stamp as reads (6, 0) and (7, 0) and stamp the following write accordingly; such a chain, however, would destroy the anti-flow dependences from (6, 0) and (7, 0) to that write in the executor if reference (9, 0) or (10, 0) started earlier than one of the reads in region 1.

3.2 A Parallel Algorithm Allowing Fully Concurrent Reads (FCR)

The basic idea of the algorithm is to treat write operations and groups of consecutive reads as macro-references. For a write reference or the first read operation of a read group in a dependence chain, the inspector stamps the reference with the total number of macro-references ahead of it; the other accesses of a read group are assigned the same stamp as the first read. Correspondingly, in the executor, the clock of a dependence chain is incremented by one time unit on a write reference and by a fraction of a time unit on a read operation. The magnitude of the increment on a read operation is the reciprocal of its read-group size.
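The arithmetic of this fractional clock can be seen in a few lines. The chain below (a write, a group of four reads, another write) is made up for illustration; it simply shows that a group of g reads, each advancing the clock by 1/g, moves the chain forward by exactly one unit, after which the trailing write may proceed.

    #include <stdio.h>

    struct ref { char type; int group; };   /* 'W' or 'R'; group = read-group size */

    int main(void) {
        /* One write, a group of four reads, and a trailing write. */
        struct ref chain[] = { {'W', 0}, {'R', 4}, {'R', 4}, {'R', 4}, {'R', 4}, {'W', 0} };
        double clk = 0.0;                   /* logical clock of this chain */
        for (unsigned k = 0; k < sizeof chain / sizeof chain[0]; k++) {
            clk += (chain[k].type == 'W') ? 1.0 : 1.0 / chain[k].group;
            printf("after reference %u (%c): clock = %.2f\n", k, chain[k].type, clk);
        }
        return 0;   /* the four reads together advance the clock by exactly 1.0 */
    }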
Figure 5 presents sequentially stamped dependence chains; in addition to the stamp, each read reference is also associated with an extra data item recording its read-group size. In an implementation, the variable for the read-group size can be combined with the variable for the stamp; for simplicity of presentation, however, they are declared as two separate integers. Consider the dependence chain containing the reference (4, 1): this write triggers four subsequent reads, (6, 0), (7, 0), (9, 0), and (10, 0), simultaneously. Activation of each of these reads increments the clock by 1/4; after all of them have finished, the clock has advanced by one full unit, which in turn activates the next write in the chain. The details of the algorithm follow.

Figure 5: Sequentially constructed dependence chains labeled by array element; the numbers in parentheses are the stamps of the references and, for reads, the read-group sizes.

As in the PCR algorithm, the inspector first partitions the iteration space of a target loop into a number of regions, and each region is assigned to a different processor. Each processor first stamps its local references, except the head macro-references at the beginning of its local dependence chains. References next to head macro-references are stamped with conservative numbers using the same conservative approach as in the PCR algorithm. Processors then communicate with each other to merge consecutive read groups and to stamp the undecided macro-references in the time table.

The basis of the algorithm is a three-dimensional stamp table: stamp[i][j][0] records the stamp of reference (i, j) in the iteration-reference space, and stamp[i][j][1] stores the size of the read group to which reference (i, j) belongs. If reference (i, j) is a write, stamp[i][j][1] is not used. The local inspector at each processor passes once through its iteration-reference region, forms read groups, and assigns appropriate stamps to its inner references. Inner references are those whose stamps can be determined locally using the conservative approach: all write operations except the heads of local dependence chains are inner references, and all reads of a read group are inner references if the group is neither the head nor the tail of any local dependence chain. Figure 6 presents a partially stamped dependence chain constructed in this way. From the figure, it can be seen that region 1 establishes a right-open single-read group, region 2 establishes a right-open group with two members, and region 3 builds a group open at both ends. The local inspector stamps all inner references of each region.

A subsequent global inspector merges consecutive read groups that are open to each other and assigns appropriate stamps to the newly formed closed read groups and to the undecided writes. Figure 7 is a fully stamped chain evolved from Figure 6.

Figure 6: A partially stamped dependence chain constructed in parallel.

Figure 7: A fully stamped dependence chain constructed in parallel.

The executor uses an extra clocking rule to allow concurrent reads while still preserving anti-flow dependences. That is, if the reference is a read in region r, the clock of its dependence chain is incremented by 1/b if the read is not a head of region r; otherwise, it is set to n + 1 + 1/b + frac, where b is the read-group size and frac is the fractional part of the current clock value. For example, consider the dependence chain of Figure 7. Activation of reference (2, 0) increments the chain's clock by one because it forms a read group of size one. Reference (4, 1), a write, then advances the clock and triggers all four subsequent reads simultaneously. Suppose the four reads are performed in the order (6, 0), (9, 0), (7, 0), (10, 0). Reference (6, 0) increments the clock by 1/4. Reference (9, 0), however, is the head read of its region and therefore sets the clock by the head rule. The subsequent two reads add 1/4 each. Upon completion of all the reads, their subsequent write is activated. The purpose of the fractional part of the clock is to record the number of activated reads in a group.

4 A DOALL algorithm

In this section, we present a DOALL INSPECTOR/EXECUTOR algorithm for run-time parallelization. The algorithm breaks the parallelization down into three routines: inspector, scheduler, and executor. The inspector examines the memory references in a loop and constructs a reference-wise dependence chain for each data element accessed in the loop. The scheduler then derives more restrictive iteration-wise dependence relations from the reference-wise dependence chains. Finally, the executor restructures the loop into a sequence of wavefronts and executes iterations accordingly. Iteration-wise dependence information is usually represented by a vector of integers, denoted by wf. Each element of the vector, wf[i], indicates the earliest invocation time of iteration i. The wavefront vector bridges the scheduler and the executor, being their output and input, respectively.

The reference-wise dependence chains can be represented in different ways, and different data structures lead to different inspector and scheduler algorithms. Desirable features of the structure are low memory overhead, simple parallel construction, and easy generation of wavefront vectors by the scheduler. In [14], Rauchwerger et al. presented an algorithm (RAP, for short) that uses a reference array R to collect all the references to an element in iteration order and a hierarchy vector H to record the index of the first reference of each wavefront level in the array. For example, for the loop in Figure 1 and the memory access pattern defined by u and v in Section 3, the reference array of one element and its hierarchy vector are shown in Figure 8: each entry H[k] gives the index in R of the first reference of the k-th wavefront. The RAP scheduler uses these two data structures as look-up tables for determining the predecessors and successors of all the references.

Figure 8: The reference array (a) and hierarchy vector (b) of an element for the loop in Figure 1 with the memory access pattern defined by u and v; each entry of R records the iteration, the reference type (read/write), and the wavefront level.

Since the reference array in the RAP algorithm stores the reference levels of a dependence chain, it is hard to construct in parallel using the ideas of the DOACROSS algorithms: we cannot assign conservative levels to references, because their levels will be used to derive consecutive iteration-wise wavefront levels. In the following, we present a new algorithm that simply uses a dependence table, as shown in Figure 9, to record the memory-reference dependences. The table differs from the stamp array of the time-stamp algorithms. First, each dependence chain of the table reflects only the order of precedence of its memory references; there are no levels associated with the references. Second, not all memory references play a role in the dependence chains. Since the DOALL approach concerns only iteration-wise dependences, a read operation can be ignored if there is another write in the same iteration that accesses the same memory location. For example, the read in iteration 12 is overruled by the write in the same iteration.

Figure 9: Dependence table for the loop in Figure 1 with the memory access pattern defined by u and v.

Each cell of the table is defined as follows:

    struct Cell {
        int iter;        /* current iteration index             */
        int elm;         /* element being referenced            */
        char type;       /* reference type: RD/WR               */
        int *xleft;      /* (xleft, yleft) points to the cell   */
        int *yleft;      /* of its predecessor in the chain     */
    };

For parallel construction of the dependence table in the inspector phase, each processor maintains a pair of pointers (head[i], tail[i]) for each memory location i, which point to the head and tail of its local portion of the associated dependence chain, respectively. As in the DOACROSS algorithms, the inspector needs cross-processor communication to connect the adjacent local dependence chains associated with the same memory location. Processor k can find the predecessor (successor) of its local dependence chain for location i by scanning tail[i] (head[i]) of processors from k-1 down to 0 (from k+1 up to N-1).

Based on the dependence table, processors then construct a wavefront vector in parallel by synchronizing accesses to each element of the wavefront vector. Specifically, for iteration i, a processor determines the dependence levels of all its references and sets wf[i] to the highest level. The dependence level of a reference r of iteration i is calculated according to the following rules:

(S1) if the reference r is a write and its immediate predecessor is a read, the processor examines all the consecutive reads immediately preceding it and sets the reference level to max{wf[k]} + 1, where k ranges over the iterations containing those reads;

(S2) if the reference r is a write and its immediate predecessor is a write (say in iteration k), the reference level is set to wf[k] + 1;

(S3) if the reference r is a read, the processor backtracks along the reference's dependence chain until it meets a write (say in iteration k); the reference level is then set to wf[k] + 1.

Notice that synchronization of accesses to the wavefront-vector elements requires a blocking read if the target still holds an invalid value; the waiting reads associated with an element are woken up by each write to that element. Applying the rules to the dependence table in Figure 9 yields the wavefront vector for the example loop.

Taking the wavefront vector as input, the executor can be described as follows:

    for k = 0 to d-1 do
        forall i such that wf[i] = k
            perform iteration i
        endfor
        barrier
    endfor

where d is the number of wavefronts in a schedule.
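To make rules (S1)-(S3) concrete, the following sequential sketch derives the wavefront vector for the two-reference loop of Figure 1 directly from the access order. It keeps, per element, the wavefront of the last writer and the highest wavefront among the reads issued since that write, which are the quantities the rules consult through the dependence table; the parallel scheduler described above computes the same levels with synchronized accesses to wf[]. Everything else here (array sizes, the index-array generator) is an illustrative assumption.

    #include <stdio.h>
    #include <stdlib.h>

    #define N 16
    #define M 8

    static int u[N], v[N];
    static int wf[N];

    static void schedule(void) {
        int last_write[M], reads_since[M];
        for (int e = 0; e < M; e++) { last_write[e] = -1; reads_since[e] = -1; }

        for (int i = 0; i < N; i++) {
            int lvl = 0;
            if (v[i] != u[i]) {
                /* (S3): a read waits for the last write to the same element;
                   a read overruled by a write in the same iteration is ignored. */
                lvl = last_write[v[i]] + 1;
            }
            /* (S1)/(S2): a write waits for the last write to its element and
               for all reads issued since that write. */
            int w = last_write[u[i]];
            if (reads_since[u[i]] > w) w = reads_since[u[i]];
            if (w + 1 > lvl) lvl = w + 1;

            wf[i] = lvl;                      /* iteration level = max reference level */
            if (v[i] != u[i] && wf[i] > reads_since[v[i]]) reads_since[v[i]] = wf[i];
            last_write[u[i]] = wf[i];
            reads_since[u[i]] = -1;           /* a new write closes the read group     */
        }
    }

    int main(void) {
        for (int i = 0; i < N; i++) { u[i] = rand() % M; v[i] = rand() % M; }
        schedule();
        for (int i = 0; i < N; i++) printf("wf[%d] = %d\n", i, wf[i]);
        return 0;
    }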
5 Performance Evaluations

We implemented the DOACROSS and DOALL run-time parallelization algorithms on a Sun Enterprise server, a symmetric multiprocessor configured with four 167 MHz UltraSPARC processors and 512 MB of memory. Each processor module has an external cache of 1 MB. Each DOACROSS algorithm was implemented as a sequence of three routines: a local inspector, a global inspector, and an executor. The implementation of the DOALL algorithm has an extra scheduler between its global inspector and executor. All routines are separated by barrier synchronization operations. They were programmed in the Single-Program-Multiple-Data (SPMD) paradigm as multi-threaded codes. At the beginning, threads are created in bound mode so that they are bound to different CPUs and run to completion.

The performance of a run-time parallelization algorithm depends on a number of factors. One is the structure of the target loops. This experiment targets the same synthetic loop structure used in [6, 18]: a sequence of interleaved reads and writes in the loop body. Each iteration executes a delay() function, reflecting the workload caused by its memory references. In [18], another, so-called multiple-reads-single-write, loop structure was considered in a preliminary evaluation of the DOACROSS algorithms; it was found that the FCR and PCR algorithms gain few benefits from the extra reads in that loop structure.

Another major factor affecting the overall performance is the memory access pattern defined by the index arrays. Uniform and non-uniform memory access patterns were considered. A uniform access pattern (UNIFORM, for short) assumes that all array elements have the same probability of being accessed by a memory reference. A non-uniform access pattern (NUNIFORM, for short) refers to a pattern where 90% of the references go to 10% of the array elements. Non-uniform access patterns reflect hot reference spots and result in long dependence chains.

In addition to the loop structure and memory access pattern, the performance of a parallelization algorithm is also critically affected by its implementation strategies. One of the major issues is loop distribution. For the local and global inspectors, an iteration-wise block decomposition is straightforward because it not only ensures a balanced workload among processors but also allows processors to run in parallel. Loops in the scheduler and executor can be distributed in different ways: cyclic, block, or block cyclic. Their effects on the performance are evaluated as well.

The experiments measure the overall run-time execution time for a given loop and memory access pattern. Each data point is the average of five runs, using different seeds for the generation of pseudo-random numbers. Since a seed determines a pseudo-random sequence, the algorithms can be evaluated on the same test instances.

5.1 Impact of access patterns and loop sizes

Figure 10 presents speedups of the different parallelization algorithms over serial loops ranging from 16 to 16384 iterations. Assume each iteration has four memory references and a fixed workload in the delay() function.

Loops in the scheduler and executor are decomposed in a cyclic way; Section 5.3 will show that the cyclic decomposition yields the best performance for all algorithms.

Figure 10: Speedups of the parallelization algorithms over serial loops of different sizes; (a) uniform access patterns, (b) non-uniform access patterns.

Overall, Figure 10 shows that the DOACROSS algorithms significantly outperform the DOALL algorithm, even though both techniques are capable of gaining speedups for loops that exhibit uniform and non-uniform memory access patterns. The gap between their speedup curves indicates the importance of exploiting fine-grained reference-level parallelism at run-time. Reference-level parallelism is especially important for small loops because they normally have very limited degrees of iteration-level parallelism. We refer to the degree of iteration-level parallelism as the number of iterations in a wavefront of the DOALL algorithm. Figure 11 plots the average degree of parallelism in loops of different sizes. As can be observed, the degree of parallelism is proportional to the loop size. A loop that exhibits non-uniform access patterns has at most four degrees of parallelism until its loop size reaches beyond 512. This implies that iteration-level parallelization techniques for such small loops will not gain any benefit on a system with four processors. This is in agreement with the speedup curve of the DOALL algorithm in Figure 10(b). Similar observations can be made for loops with uniform access patterns from the plot of parallelism degrees in Figure 11 and the speedup curve in Figure 10(a).

Figure 11: Average degree of parallelism exploited by the DOALL algorithm.

It is worth noting that higher degrees of parallelism do not necessarily lead to better performance. From Figure 10(a), it can be seen that the speedup of the DOALL algorithm starts to drop when the loop size grows beyond 4096. The average degree of parallelism at that point is 128, which is evidently excessive on a system with only four processors. The performance degradation is due to the cost of barrier synchronizations.

The DOACROSS technique delivers high speedups for small loops because it is able to exploit enough parallelism to keep all processors busy all the time. Expectedly, the amount of fine-grained parallelism quickly becomes excessive as the loop size increases. An interesting feature of the DOACROSS algorithms is that their speedups stabilize at the highest level when parallelism is over-exposed, rather than declining as with the DOALL technique. This is because the cost of the point-to-point synchronizations used in the DOACROSS executor is independent of the parallelism degree; consequently, the execution time of the executor is proportional to the loop size.

Of the DOACROSS variants, the FCR and PCR algorithms are superior to the CTY for small loops, while the CTY algorithm is preferred for large ones. The FCR and PCR algorithms improve on the CTY by allowing consecutive reads to the same location to be run simultaneously. The extra amount of parallelism does benefit small loops. For large loops, however, the benefit can be outweighed by the cost of exploiting and managing the parallelism. Since the FCR algorithm incurs more overhead than the CTY in the global inspector, it obtains lower speedups for large loops.
In contrast, the PCR algorithm obtains almost the same speedup as the CTY because it incurs only slightly more overhead in the local inspector.

To better understand the relative performance of the algorithms, we break down their execution into a number of principal tasks and present their percentages of the overall execution time in Figure 12. The initialization curve indicates the cost of memory allocation, initialization, and thread creation. The cost spent in the local inspector, the global inspector, or the scheduler is indicated by the range between the curve of its predecessor and its own curve. The remainders above the global inspector and scheduler curves are due to the executors of the DOACROSS and DOALL algorithms, respectively. From the figure, it can be seen that for large loops both the CTY and PCR algorithms spend only a small fraction of their time in initialization and in their local and global inspectors. The FCR algorithm spends a noticeably larger share of its time in initialization because it uses a three-dimensional stamp table (instead of the two-dimensional stamp array of the CTY and PCR algorithms) and different auxiliary data structures, head and tail, all of which must be allocated at run-time. The FCR algorithm also spends more time in the global inspector to exploit concurrent reads across different regions. Compared with the CTY and PCR algorithms, the FCR reduces the time spent in the executor significantly. In cases where the dependence analysis can be reused across multiple loop invocations, the FCR algorithm is expected to achieve even better performance. Figure 12(d) shows that the DOALL algorithm spends a high percentage of its time in the scheduler, generating iteration-wise dependences from the reference-wise dependence information. The percentage decreases as the loop size increases because the cost of the barrier synchronization operations in the executor increases.

5.2 Impact of loop workload

It is known that the cost of a run-time parallelization algorithm in the inspector and scheduler is independent of the workload of the iterations. The time spent in the executor, however, is proportional to the amount of loop workload: the larger the workload, the smaller the cost percentage of the inspector and scheduler. The experiments above assumed a relatively heavy workload at each iteration. Figure 13 presents speedups of the algorithms under a lighter iteration workload instead. From the figure, it can be seen that all algorithms lose a certain amount of speedup. However, the relative performance of the algorithms remains the same as revealed by Figure 10. The performance gap between the FCR algorithm and the CTY and PCR algorithms is enlarged because the relative cost of the global analysis in the FCR algorithm increases as the workload decreases.

5.3 Impact of loop distributions

Generally, a loop iteration space can be assigned to threads in either a static or a dynamic approach. Static approaches assign iterations to threads prior to their execution, while dynamic approaches make decisions at run-time. Their advantages and disadvantages were discussed in [19] in the general context of task mapping and load balancing. In this experiment, we tried three simple static assignment strategies, cyclic, block, and block-cyclic, because their simplicity lends itself to efficient implementation at run-time. Let b denote the block size.

Figure 12: Breakdown of execution time, in percentages, over the different stages (initialization, local analysis, global analysis, scheduler, executor) for (a) the CTY algorithm, (b) the PCR algorithm, (c) the FCR algorithm, and (d) the DOALL algorithm.

Figure 13: Speedups of the algorithms over serial code for various loops under the lighter workload; (a) uniform access patterns, (b) non-uniform access patterns.

Figure 14: The effect of block cyclic decompositions on total execution time as a function of block size; (a) uniform dependence patterns, (b) non-uniform dependence patterns.

A block cyclic distribution algorithm assigns iteration i to thread ⌊(i mod bM)/b⌋, where M is the number of threads. It reduces to a cyclic distribution if b = 1 and to a block distribution if b = N/M, where N is the number of iterations. Figure 14 shows the effect of block cyclic decompositions on the execution time of a loop with 1024 iterations. The figure shows that the DOACROSS algorithms are very sensitive to the block size. They prefer cyclic or small block-cyclic distributions because these lead to good load balance among processors. Each plot of the DOACROSS algorithms has a knee, beyond which its execution time increases sharply. The FCR and PCR algorithms have the largest knees, reflecting the fact that these algorithms exploit the largest degree of parallelism; given more processors, they are projected to perform better than the CTY. The DOALL algorithm uses a block cyclic decomposition in the scheduler, and its overall execution time is insensitive to the block size. In the case of non-uniform access patterns, a large block-cyclic decomposition is slightly superior to the cyclic distribution.
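A few lines suffice to check the degenerate cases of this mapping; the function below is an illustrative sketch of the assignment formula, not code from the implementation.

    #include <stdio.h>

    /* owner(i) = floor((i mod b*M) / b): the thread that executes iteration i
       under a block cyclic distribution with block size b and M threads.      */
    static int owner(int i, int b, int M) {
        return (i % (b * M)) / b;
    }

    int main(void) {
        int N = 16, M = 4;
        for (int b = 1; b <= N / M; b *= 2) {     /* b = 1 (cyclic) ... N/M (block) */
            printf("b = %d:", b);
            for (int i = 0; i < N; i++) printf(" %d", owner(i, b, M));
            printf("\n");
        }
        return 0;
    }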

6 Conclusions

In this paper, we considered two run-time loop parallelization techniques, DOALL and DOACROSS, which expose different granularities and degrees of parallelism in a loop. The DOALL technique exploits iteration-level parallelism: it restructures the loop into a sequence of doall loops separated by barrier operations. The DOACROSS technique supports fine-grained reference-level parallelism: it allows dependent iterations to be run concurrently by inserting synchronization operations that preserve dependences. The DOACROSS technique has variants, CTY, PCR, and FCR, which expose different amounts of parallelism among concurrent reads to the same memory location. Both approaches follow the so-called INSPECTOR/EXECUTOR scheme. We evaluated the performance of the algorithms on a symmetric multiprocessor for loops with different structures, memory access patterns, and computational workloads. In the executor, loops are distributed among processors in a block cyclic way. The experimental results showed that:

- The DOACROSS technique outperforms the DOALL even though the latter plays a dominant role in compile-time loop parallelization. This is because the DOALL algorithm spends a high percentage of its time in the scheduler routine for the construction of iteration-wise wavefronts. Hot reference spots have more negative effects on the DOALL algorithm due to its limited iteration-level parallelism. The DOACROSS technique identifies fine-grained reference parallelism at the cost of frequent synchronization in the executor; multithreaded implementations reduce the synchronization overhead of the algorithms.

- Of the DOACROSS variants, the PCR algorithm performs best because it incurs only slightly more overhead than the CTY algorithm. The FCR algorithm improves on the CTY algorithm for small loops that do not have enough parallelism. For large loops or loops with light workload, its benefits are likely to be outweighed by its extra run-time cost.

- Loops that are to be executed repeatedly favor the DOALL and FCR algorithms, because these spend a high percentage of their time in dependence analysis and scheduling, and that time can be saved in subsequent loop invocations.

- The DOACROSS algorithms are sensitive to the block size of the loop distribution. Cyclic and small block-cyclic distributions yield better performance.

Future work includes examining the issues of load balancing and locality when parallelizing loops that are executed repeatedly, and evaluating the algorithms on loops from real applications.

Acknowledgements

This work was supported in part by a startup grant from Wayne State University. The author would like to thank Sumit Roy for his help with the experiments and Vipin Chaudhary for his insights into the experimental data. Thanks also go to Loren Schwiebert for his advice on the presentation.

References

[1] J. R. Allen, K. Kennedy, C. Porterfield, and J. Warren. Conversion of control dependence to data dependence. In Proc. of the 10th ACM Symposium on Principles of Programming Languages, Jan. 1983.
[2] D. F. Bacon, S. L. Graham, and O. J. Sharp. Compiler transformations for high-performance computing. ACM Computing Surveys, 26(4), Dec. 1994.
[3] U. Banerjee, R. Eigenmann, A. Nicolau, and D. A. Padua. Automatic program parallelization. Proceedings of the IEEE, 81(2), Feb. 1993.
[4] W. J. Camp, S. J. Plimpton, B. A. Hendrickson, and R. W. Leland. Massively parallel methods for engineering and science problems. Comm. ACM, 37(4), April 1994.
[5] D. K. Chen and P. C. Yew. An empirical study on DOACROSS loops. In Proc. of Supercomputing '91.
[6] D. K. Chen, P. C. Yew, and J. Torrellas. An efficient algorithm for the run-time parallelization of doacross loops. In Proc. of Supercomputing 1994, Nov. 1994.
[7] R. Cytron. DOACROSS: Beyond vectorization for multiprocessors. In Proc. of the International Conference on Parallel Processing, 1986.
[8] J. Ju and V. Chaudhary. Unique sets oriented partitioning of nested loops with non-uniform dependences. In Proc. of the International Conference on Parallel Processing.
[9] S.-T. Leung and J. Zahorjan. Improving the performance of runtime parallelization. In Proc. of the 4th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, May 1993.
[10] S.-T. Leung and J. Zahorjan. Extending the applicability and improving the performance of runtime parallelization. Technical Report, Department of Computer Science, University of Washington, 1995.
[11] J. T. Lim, A. R. Hurson, K. Kavi, and B. Lee. A loop allocation policy for DOACROSS loops. In Proc. of the Symposium on Parallel and Distributed Processing, 1996.
[12] S. Midkiff and D. Padua. Compiler algorithms for synchronization. IEEE Trans. on Computers, C-36(12), December 1987.
[13] L. Rauchwerger and D. Padua. The LRPD test: Speculative run-time parallelization of loops with privatization and reduction parallelization. In ACM SIGPLAN Conf. on Programming Language Design and Implementation, June 1995.
[14] L. Rauchwerger, N. M. Amato, and D. A. Padua. Run-time methods for parallelizing partially parallel loops. Technical Report, UIUC, 1995.
[15] J. Saltz, R. Mirchandaney, and K. Crowley. Run-time parallelization and scheduling of loops. IEEE Trans. Comput., 40(5), May 1991.
[16] Z. Shen, Z. Li, and P. C. Yew. An empirical study on array subscripts and data dependencies. In Proc. of the International Conference on Parallel Processing.
[17] SunSoft. Multithreaded Programming Guide.
[18] C. Xu and V. Chaudhary. Time-stamping algorithms for parallelization of loops at run-time. In Int. Symposium on Parallel Processing.
[19] C. Xu and F. C. M. Lau. Load Balancing in Parallel Computers: Theory and Practice. Kluwer Academic Publishers.
[20] C. Zhu and P. C. Yew. A scheme to enforce data dependence on large multiprocessor systems. IEEE Trans. Softw. Eng., 13(6):726-739, 1987.


More information

Garbage Collection (2) Advanced Operating Systems Lecture 9

Garbage Collection (2) Advanced Operating Systems Lecture 9 Garbage Collection (2) Advanced Operating Systems Lecture 9 Lecture Outline Garbage collection Generational algorithms Incremental algorithms Real-time garbage collection Practical factors 2 Object Lifetimes

More information

FILTER SYNTHESIS USING FINE-GRAIN DATA-FLOW GRAPHS. Waqas Akram, Cirrus Logic Inc., Austin, Texas

FILTER SYNTHESIS USING FINE-GRAIN DATA-FLOW GRAPHS. Waqas Akram, Cirrus Logic Inc., Austin, Texas FILTER SYNTHESIS USING FINE-GRAIN DATA-FLOW GRAPHS Waqas Akram, Cirrus Logic Inc., Austin, Texas Abstract: This project is concerned with finding ways to synthesize hardware-efficient digital filters given

More information

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need?? Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross

More information

Implementation Techniques

Implementation Techniques V Implementation Techniques 34 Efficient Evaluation of the Valid-Time Natural Join 35 Efficient Differential Timeslice Computation 36 R-Tree Based Indexing of Now-Relative Bitemporal Data 37 Light-Weight

More information

An Empirical Performance Study of Connection Oriented Time Warp Parallel Simulation

An Empirical Performance Study of Connection Oriented Time Warp Parallel Simulation 230 The International Arab Journal of Information Technology, Vol. 6, No. 3, July 2009 An Empirical Performance Study of Connection Oriented Time Warp Parallel Simulation Ali Al-Humaimidi and Hussam Ramadan

More information

An Introduction to Parallel Programming

An Introduction to Parallel Programming An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe

More information

A Level-wise Priority Based Task Scheduling for Heterogeneous Systems

A Level-wise Priority Based Task Scheduling for Heterogeneous Systems International Journal of Information and Education Technology, Vol., No. 5, December A Level-wise Priority Based Task Scheduling for Heterogeneous Systems R. Eswari and S. Nickolas, Member IACSIT Abstract

More information

Parallel Discrete Event Simulation

Parallel Discrete Event Simulation Parallel Discrete Event Simulation Dr.N.Sairam & Dr.R.Seethalakshmi School of Computing, SASTRA Univeristy, Thanjavur-613401. Joint Initiative of IITs and IISc Funded by MHRD Page 1 of 8 Contents 1. Parallel

More information

UNIT I (Two Marks Questions & Answers)

UNIT I (Two Marks Questions & Answers) UNIT I (Two Marks Questions & Answers) Discuss the different ways how instruction set architecture can be classified? Stack Architecture,Accumulator Architecture, Register-Memory Architecture,Register-

More information

Principles of Parallel Algorithm Design: Concurrency and Mapping

Principles of Parallel Algorithm Design: Concurrency and Mapping Principles of Parallel Algorithm Design: Concurrency and Mapping John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 3 17 January 2017 Last Thursday

More information

Lecture 10 Midterm review

Lecture 10 Midterm review Lecture 10 Midterm review Announcements The midterm is on Tue Feb 9 th in class 4Bring photo ID 4You may bring a single sheet of notebook sized paper 8x10 inches with notes on both sides (A4 OK) 4You may

More information

Simulating ocean currents

Simulating ocean currents Simulating ocean currents We will study a parallel application that simulates ocean currents. Goal: Simulate the motion of water currents in the ocean. Important to climate modeling. Motion depends on

More information

Hierarchical Pointer Analysis for Distributed Programs

Hierarchical Pointer Analysis for Distributed Programs Hierarchical Pointer Analysis for Distributed Programs Amir Kamil Computer Science Division, University of California, Berkeley kamil@cs.berkeley.edu April 14, 2006 1 Introduction Many distributed, parallel

More information

The Encoding Complexity of Network Coding

The Encoding Complexity of Network Coding The Encoding Complexity of Network Coding Michael Langberg Alexander Sprintson Jehoshua Bruck California Institute of Technology Email: mikel,spalex,bruck @caltech.edu Abstract In the multicast network

More information

Performance Study of the MPI and MPI-CH Communication Libraries on the IBM SP

Performance Study of the MPI and MPI-CH Communication Libraries on the IBM SP Performance Study of the MPI and MPI-CH Communication Libraries on the IBM SP Ewa Deelman and Rajive Bagrodia UCLA Computer Science Department deelman@cs.ucla.edu, rajive@cs.ucla.edu http://pcl.cs.ucla.edu

More information

The LRPD Test: Speculative Run Time Parallelization of Loops with Privatization and Reduction Parallelization

The LRPD Test: Speculative Run Time Parallelization of Loops with Privatization and Reduction Parallelization The LRPD Test: Speculative Run Time Parallelization of Loops with Privatization and Reduction Parallelization Lawrence Rauchwerger and David Padua University of Illinois at Urbana-Champaign Abstract Current

More information

Enabling Loop Parallelization with Decoupled Software Pipelining in LLVM: Final Report

Enabling Loop Parallelization with Decoupled Software Pipelining in LLVM: Final Report Enabling Loop Parallelization with Decoupled Software Pipelining in LLVM: Final Report Ameya Velingker and Dougal J. Sutherland {avelingk, dsutherl}@cs.cmu.edu http://www.cs.cmu.edu/~avelingk/compilers/

More information

Design of Parallel Algorithms. Models of Parallel Computation

Design of Parallel Algorithms. Models of Parallel Computation + Design of Parallel Algorithms Models of Parallel Computation + Chapter Overview: Algorithms and Concurrency n Introduction to Parallel Algorithms n Tasks and Decomposition n Processes and Mapping n Processes

More information

Language and Compiler Support for Out-of-Core Irregular Applications on Distributed-Memory Multiprocessors

Language and Compiler Support for Out-of-Core Irregular Applications on Distributed-Memory Multiprocessors Language and Compiler Support for Out-of-Core Irregular Applications on Distributed-Memory Multiprocessors Peter Brezany 1, Alok Choudhary 2, and Minh Dang 1 1 Institute for Software Technology and Parallel

More information

UvA-SARA High Performance Computing Course June Clemens Grelck, University of Amsterdam. Parallel Programming with Compiler Directives: OpenMP

UvA-SARA High Performance Computing Course June Clemens Grelck, University of Amsterdam. Parallel Programming with Compiler Directives: OpenMP Parallel Programming with Compiler Directives OpenMP Clemens Grelck University of Amsterdam UvA-SARA High Performance Computing Course June 2013 OpenMP at a Glance Loop Parallelization Scheduling Parallel

More information

Data Flow Graph Partitioning Schemes

Data Flow Graph Partitioning Schemes Data Flow Graph Partitioning Schemes Avanti Nadgir and Harshal Haridas Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania 16802 Abstract: The

More information

Optimizing Closures in O(0) time

Optimizing Closures in O(0) time Optimizing Closures in O(0 time Andrew W. Keep Cisco Systems, Inc. Indiana Univeristy akeep@cisco.com Alex Hearn Indiana University adhearn@cs.indiana.edu R. Kent Dybvig Cisco Systems, Inc. Indiana University

More information

Parallel Programming Patterns Overview CS 472 Concurrent & Parallel Programming University of Evansville

Parallel Programming Patterns Overview CS 472 Concurrent & Parallel Programming University of Evansville Parallel Programming Patterns Overview CS 472 Concurrent & Parallel Programming of Evansville Selection of slides from CIS 410/510 Introduction to Parallel Computing Department of Computer and Information

More information

More on Conjunctive Selection Condition and Branch Prediction

More on Conjunctive Selection Condition and Branch Prediction More on Conjunctive Selection Condition and Branch Prediction CS764 Class Project - Fall Jichuan Chang and Nikhil Gupta {chang,nikhil}@cs.wisc.edu Abstract Traditionally, database applications have focused

More information

Principles of Parallel Algorithm Design: Concurrency and Mapping

Principles of Parallel Algorithm Design: Concurrency and Mapping Principles of Parallel Algorithm Design: Concurrency and Mapping John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 3 28 August 2018 Last Thursday Introduction

More information

Quantitative study of data caches on a multistreamed architecture. Abstract

Quantitative study of data caches on a multistreamed architecture. Abstract Quantitative study of data caches on a multistreamed architecture Mario Nemirovsky University of California, Santa Barbara mario@ece.ucsb.edu Abstract Wayne Yamamoto Sun Microsystems, Inc. wayne.yamamoto@sun.com

More information

Reduction of Periodic Broadcast Resource Requirements with Proxy Caching

Reduction of Periodic Broadcast Resource Requirements with Proxy Caching Reduction of Periodic Broadcast Resource Requirements with Proxy Caching Ewa Kusmierek and David H.C. Du Digital Technology Center and Department of Computer Science and Engineering University of Minnesota

More information

Optimization solutions for the segmented sum algorithmic function

Optimization solutions for the segmented sum algorithmic function Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code

More information

Performance Evaluations for Parallel Image Filter on Multi - Core Computer using Java Threads

Performance Evaluations for Parallel Image Filter on Multi - Core Computer using Java Threads Performance Evaluations for Parallel Image Filter on Multi - Core Computer using Java s Devrim Akgün Computer Engineering of Technology Faculty, Duzce University, Duzce,Turkey ABSTRACT Developing multi

More information

DATA PREFETCHING AND DATA FORWARDING IN SHARED MEMORY MULTIPROCESSORS (*)

DATA PREFETCHING AND DATA FORWARDING IN SHARED MEMORY MULTIPROCESSORS (*) DATA PREFETCHING AND DATA FORWARDING IN SHARED MEMORY MULTIPROCESSORS (*) David K. Poulsen and Pen-Chung Yew Center for Supercomputing Research and Development University of Illinois at Urbana-Champaign

More information

A Data Centered Approach for Cache Partitioning in Embedded Real- Time Database System

A Data Centered Approach for Cache Partitioning in Embedded Real- Time Database System A Data Centered Approach for Cache Partitioning in Embedded Real- Time Database System HU WEI CHEN TIANZHOU SHI QINGSONG JIANG NING College of Computer Science Zhejiang University College of Computer Science

More information

Low-Power FIR Digital Filters Using Residue Arithmetic

Low-Power FIR Digital Filters Using Residue Arithmetic Low-Power FIR Digital Filters Using Residue Arithmetic William L. Freking and Keshab K. Parhi Department of Electrical and Computer Engineering University of Minnesota 200 Union St. S.E. Minneapolis, MN

More information

Implementation and Evaluation of Prefetching in the Intel Paragon Parallel File System

Implementation and Evaluation of Prefetching in the Intel Paragon Parallel File System Implementation and Evaluation of Prefetching in the Intel Paragon Parallel File System Meenakshi Arunachalam Alok Choudhary Brad Rullman y ECE and CIS Link Hall Syracuse University Syracuse, NY 344 E-mail:

More information

Heckaton. SQL Server's Memory Optimized OLTP Engine

Heckaton. SQL Server's Memory Optimized OLTP Engine Heckaton SQL Server's Memory Optimized OLTP Engine Agenda Introduction to Hekaton Design Consideration High Level Architecture Storage and Indexing Query Processing Transaction Management Transaction Durability

More information

Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs

Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs B. Barla Cambazoglu and Cevdet Aykanat Bilkent University, Department of Computer Engineering, 06800, Ankara, Turkey {berkant,aykanat}@cs.bilkent.edu.tr

More information

Test Set Compaction Algorithms for Combinational Circuits

Test Set Compaction Algorithms for Combinational Circuits Proceedings of the International Conference on Computer-Aided Design, November 1998 Set Compaction Algorithms for Combinational Circuits Ilker Hamzaoglu and Janak H. Patel Center for Reliable & High-Performance

More information

Summary: Open Questions:

Summary: Open Questions: Summary: The paper proposes an new parallelization technique, which provides dynamic runtime parallelization of loops from binary single-thread programs with minimal architectural change. The realization

More information

An In-place Algorithm for Irregular All-to-All Communication with Limited Memory

An In-place Algorithm for Irregular All-to-All Communication with Limited Memory An In-place Algorithm for Irregular All-to-All Communication with Limited Memory Michael Hofmann and Gudula Rünger Department of Computer Science Chemnitz University of Technology, Germany {mhofma,ruenger}@cs.tu-chemnitz.de

More information

A Linear-Time Heuristic for Improving Network Partitions

A Linear-Time Heuristic for Improving Network Partitions A Linear-Time Heuristic for Improving Network Partitions ECE 556 Project Report Josh Brauer Introduction The Fiduccia-Matteyses min-cut heuristic provides an efficient solution to the problem of separating

More information

Profiling Dependence Vectors for Loop Parallelization

Profiling Dependence Vectors for Loop Parallelization Profiling Dependence Vectors for Loop Parallelization Shaw-Yen Tseng Chung-Ta King Chuan-Yi Tang Department of Computer Science National Tsing Hua University Hsinchu, Taiwan, R.O.C. fdr788301,king,cytangg@cs.nthu.edu.tw

More information

THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS

THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS Computer Science 14 (4) 2013 http://dx.doi.org/10.7494/csci.2013.14.4.679 Dominik Żurek Marcin Pietroń Maciej Wielgosz Kazimierz Wiatr THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT

More information

A Compiler-Directed Cache Coherence Scheme Using Data Prefetching

A Compiler-Directed Cache Coherence Scheme Using Data Prefetching A Compiler-Directed Cache Coherence Scheme Using Data Prefetching Hock-Beng Lim Center for Supercomputing R & D University of Illinois Urbana, IL 61801 hblim@csrd.uiuc.edu Pen-Chung Yew Dept. of Computer

More information

Latency Hiding on COMA Multiprocessors

Latency Hiding on COMA Multiprocessors Latency Hiding on COMA Multiprocessors Tarek S. Abdelrahman Department of Electrical and Computer Engineering The University of Toronto Toronto, Ontario, Canada M5S 1A4 Abstract Cache Only Memory Access

More information

Dynamic Broadcast Scheduling in DDBMS

Dynamic Broadcast Scheduling in DDBMS Dynamic Broadcast Scheduling in DDBMS Babu Santhalingam #1, C.Gunasekar #2, K.Jayakumar #3 #1 Asst. Professor, Computer Science and Applications Department, SCSVMV University, Kanchipuram, India, #2 Research

More information

Chapter 12: Indexing and Hashing. Basic Concepts

Chapter 12: Indexing and Hashing. Basic Concepts Chapter 12: Indexing and Hashing! Basic Concepts! Ordered Indices! B+-Tree Index Files! B-Tree Index Files! Static Hashing! Dynamic Hashing! Comparison of Ordered Indexing and Hashing! Index Definition

More information

A Data Centered Approach for Cache Partitioning in Embedded Real- Time Database System

A Data Centered Approach for Cache Partitioning in Embedded Real- Time Database System A Data Centered Approach for Cache Partitioning in Embedded Real- Time Database System HU WEI, CHEN TIANZHOU, SHI QINGSONG, JIANG NING College of Computer Science Zhejiang University College of Computer

More information

Compiling for GPUs. Adarsh Yoga Madhav Ramesh

Compiling for GPUs. Adarsh Yoga Madhav Ramesh Compiling for GPUs Adarsh Yoga Madhav Ramesh Agenda Introduction to GPUs Compute Unified Device Architecture (CUDA) Control Structure Optimization Technique for GPGPU Compiler Framework for Automatic Translation

More information

Predicated Software Pipelining Technique for Loops with Conditions

Predicated Software Pipelining Technique for Loops with Conditions Predicated Software Pipelining Technique for Loops with Conditions Dragan Milicev and Zoran Jovanovic University of Belgrade E-mail: emiliced@ubbg.etf.bg.ac.yu Abstract An effort to formalize the process

More information

Lecture 9 Basic Parallelization

Lecture 9 Basic Parallelization Lecture 9 Basic Parallelization I. Introduction II. Data Dependence Analysis III. Loop Nests + Locality IV. Interprocedural Parallelization Chapter 11.1-11.1.4 CS243: Parallelization 1 Machine Learning

More information

Lecture 9 Basic Parallelization

Lecture 9 Basic Parallelization Lecture 9 Basic Parallelization I. Introduction II. Data Dependence Analysis III. Loop Nests + Locality IV. Interprocedural Parallelization Chapter 11.1-11.1.4 CS243: Parallelization 1 Machine Learning

More information

Multiprocessor scheduling

Multiprocessor scheduling Chapter 10 Multiprocessor scheduling When a computer system contains multiple processors, a few new issues arise. Multiprocessor systems can be categorized into the following: Loosely coupled or distributed.

More information

3/24/2014 BIT 325 PARALLEL PROCESSING ASSESSMENT. Lecture Notes:

3/24/2014 BIT 325 PARALLEL PROCESSING ASSESSMENT. Lecture Notes: BIT 325 PARALLEL PROCESSING ASSESSMENT CA 40% TESTS 30% PRESENTATIONS 10% EXAM 60% CLASS TIME TABLE SYLLUBUS & RECOMMENDED BOOKS Parallel processing Overview Clarification of parallel machines Some General

More information

Techniques for Efficient Processing in Runahead Execution Engines

Techniques for Efficient Processing in Runahead Execution Engines Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt Depment of Electrical and Computer Engineering University of Texas at Austin {onur,hyesoon,patt}@ece.utexas.edu

More information

Massively Parallel Computation for Three-Dimensional Monte Carlo Semiconductor Device Simulation

Massively Parallel Computation for Three-Dimensional Monte Carlo Semiconductor Device Simulation L SIMULATION OF SEMICONDUCTOR DEVICES AND PROCESSES Vol. 4 Edited by W. Fichtner, D. Aemmer - Zurich (Switzerland) September 12-14,1991 - Hartung-Gorre Massively Parallel Computation for Three-Dimensional

More information

Reading Assignment. Lazy Evaluation

Reading Assignment. Lazy Evaluation Reading Assignment Lazy Evaluation MULTILISP: a language for concurrent symbolic computation, by Robert H. Halstead (linked from class web page Lazy evaluation is sometimes called call by need. We do an

More information

Parallelizing While Loops for Multiprocessor Systems

Parallelizing While Loops for Multiprocessor Systems CSRD Technical Report 1349 Parallelizing While Loops for Multiprocessor Systems Lawrence Rauchwerger and David Padua Center for Supercomputing Research and Development University of Illinois at Urbana-Champaign

More information

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the

More information

Chapter 12: Indexing and Hashing

Chapter 12: Indexing and Hashing Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B+-Tree Index Files B-Tree Index Files Static Hashing Dynamic Hashing Comparison of Ordered Indexing and Hashing Index Definition in SQL

More information

SAS Meets Big Iron: High Performance Computing in SAS Analytic Procedures

SAS Meets Big Iron: High Performance Computing in SAS Analytic Procedures SAS Meets Big Iron: High Performance Computing in SAS Analytic Procedures Robert A. Cohen SAS Institute Inc. Cary, North Carolina, USA Abstract Version 9targets the heavy-duty analytic procedures in SAS

More information

Chapter 11: Indexing and Hashing

Chapter 11: Indexing and Hashing Chapter 11: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files B-Tree Index Files Static Hashing Dynamic Hashing Comparison of Ordered Indexing and Hashing Index Definition in SQL

More information

Speculative Parallelization Technology s only constant is CHANGE. Devarshi Ghoshal Sreesudhan

Speculative Parallelization Technology s only constant is CHANGE. Devarshi Ghoshal Sreesudhan Speculative Parallelization Technology s only constant is CHANGE Devarshi Ghoshal Sreesudhan Agenda Moore s law What is speculation? What is parallelization? Amdahl s law Communication between parallely

More information

Increasing Parallelism of Loops with the Loop Distribution Technique

Increasing Parallelism of Loops with the Loop Distribution Technique Increasing Parallelism of Loops with the Loop Distribution Technique Ku-Nien Chang and Chang-Biau Yang Department of pplied Mathematics National Sun Yat-sen University Kaohsiung, Taiwan 804, ROC cbyang@math.nsysu.edu.tw

More information

Parallel Algorithms for the Third Extension of the Sieve of Eratosthenes. Todd A. Whittaker Ohio State University

Parallel Algorithms for the Third Extension of the Sieve of Eratosthenes. Todd A. Whittaker Ohio State University Parallel Algorithms for the Third Extension of the Sieve of Eratosthenes Todd A. Whittaker Ohio State University whittake@cis.ohio-state.edu Kathy J. Liszka The University of Akron liszka@computer.org

More information

An Analysis of the Amount of Global Level Redundant Computation in the SPEC 95 and SPEC 2000 Benchmarks

An Analysis of the Amount of Global Level Redundant Computation in the SPEC 95 and SPEC 2000 Benchmarks An Analysis of the Amount of Global Level Redundant Computation in the SPEC 95 and SPEC 2000 s Joshua J. Yi and David J. Lilja Department of Electrical and Computer Engineering Minnesota Supercomputing

More information

Speed-up of Parallel Processing of Divisible Loads on k-dimensional Meshes and Tori

Speed-up of Parallel Processing of Divisible Loads on k-dimensional Meshes and Tori The Computer Journal, 46(6, c British Computer Society 2003; all rights reserved Speed-up of Parallel Processing of Divisible Loads on k-dimensional Meshes Tori KEQIN LI Department of Computer Science,

More information

SPM Management Using Markov Chain Based Data Access Prediction*

SPM Management Using Markov Chain Based Data Access Prediction* SPM Management Using Markov Chain Based Data Access Prediction* Taylan Yemliha Syracuse University, Syracuse, NY Shekhar Srikantaiah, Mahmut Kandemir Pennsylvania State University, University Park, PA

More information

Postprint. This is the accepted version of a paper presented at MCC13, November 25 26, Halmstad, Sweden.

Postprint.   This is the accepted version of a paper presented at MCC13, November 25 26, Halmstad, Sweden. http://www.diva-portal.org Postprint This is the accepted version of a paper presented at MCC13, November 25 26, Halmstad, Sweden. Citation for the original published paper: Ceballos, G., Black-Schaffer,

More information

An Efficient Technique of Instruction Scheduling on a Superscalar-Based Multiprocessor

An Efficient Technique of Instruction Scheduling on a Superscalar-Based Multiprocessor An Efficient Technique of Instruction Scheduling on a Superscalar-Based Multiprocessor Rong-Yuh Hwang Department of Electronic Engineering, National Taipei Institute of Technology, Taipei, Taiwan, R. O.

More information