Effects of Parallelism Degree on Run-Time Parallelization of Loops


Chengzhong Xu
Department of Electrical and Computer Engineering, Wayne State University, Detroit, MI

Abstract

Due to the overhead for exploiting and managing parallelism, run-time loop parallelization techniques that aim to maximize parallelism do not necessarily lead to the best performance. In this paper, we present two parallelization techniques that exploit different degrees of parallelism for loops with dynamic cross-iteration dependences. The DOALL approach exploits iteration-level parallelism. It restructures the loop into a sequence of do-parallel loops, separated by barrier operations; iterations of a do-parallel loop are run in parallel. By contrast, the DOACROSS approach exposes fine-grained reference-level parallelism. It allows dependent iterations to be run concurrently by inserting point-to-point synchronization operations to preserve dependences. The DOACROSS approach has variants that identify different amounts of parallelism among consecutive reads to the same memory location. We evaluate the algorithms for loops with various structures, memory access patterns, and computational workloads on symmetric multiprocessors. The algorithms are scheduled using block cyclic decomposition strategies. The experimental results show that the DOACROSS technique outperforms the DOALL, even though the latter is widely used in compile-time parallelization of loops. Of the DOACROSS variants, the algorithm allowing partially concurrent reads performs best because it incurs only slightly more overhead than the algorithm disallowing concurrent reads. The benefit from allowing fully concurrent reads is significant for small loops that do not have enough parallelism. However, it is likely to be outweighed by its cost for large loops or loops with light workload.

1 Introduction

Loop parallelization exploits parallelism among instruction sequences or loop iterations. Techniques for exploiting instruction-level parallelism are prevalent in today's microprocessors. On multiprocessors, loop parallelization techniques focus on loop-level parallelism. They partition and allocate loop iterations among processors with respect to cross-iteration dependences. Their primary objective is to expose enough parallelism to keep processors busy all the time while minimizing synchronization overheads.

On multiprocessors, there are two important parallelization techniques that exploit different degrees of parallelism. The DOALL technique takes loop iterations as the basic scheduling and execution units [2, 3]. It decomposes the iterations into a sequence of subsets, called wavefronts. Iterations within the same wavefront are run in parallel, and a barrier synchronization operation is used to preserve cross-iteration dependences between two wavefronts. The DOALL technique reduces the run-time scheduling overhead at the sacrifice of a certain amount of parallelism. By contrast, the DOACROSS technique exploits fine-grained reference-level parallelism. It allows dependent iterations to be run concurrently by inserting point-to-point synchronization operations to preserve dependences among memory references. The DOACROSS technique maximizes parallelism at the expense of frequent synchronization. Note that in the literature the terms DOACROSS and DOALL often refer to loops with and without cross-iteration dependences, respectively.
We borrow the terms as names of parallelization techniques in this paper because the DOALL technique essentially restructures a loop into a sequence of DOALL loops. Both the DOALL and DOACROSS techniques are used to parallelize DOACROSS loops. Chen and Yew [5] studied programs from the PERFECT benchmark suite and revealed the significant advantages of parallelizing DOACROSS loops. DOACROSS loops can be characterized as static or dynamic in terms of the time when cross-iteration dependence information becomes available (at compile-time or at run-time). Figure 1 shows an example of a dynamic loop, due to the presence of indirect access patterns on the data array X. Dynamic loops appear frequently in scientific and engineering applications [16]. Examples include SPICE for circuit simulation, CHARMM and DISCOVER for molecular dynamics simulation of organic systems, and FIDAP for modeling complex fluid flows [4].

    for i = 1 to n do
        ... = X[v[i]] + ...
        X[u[i]] = ...
    endfor

Figure 1: A general form of loops with indirect access patterns, where u and v are input-dependent functions.

For parallelizing static loops, the DOALL technique plays a dominant role because it employs a simple execution model after exploiting parallelism at compile-time [8]. For dynamic loops, however, this may not be the case. Since parallelism in a dynamic loop has to be identified at run-time, the cost of building wavefronts in the DOALL technique becomes a major source of run-time overhead.
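A C rendition of the Figure 1 loop may help make the problem concrete. The array names and sizes below are illustrative assumptions; the point is only that u and v are filled in at run time, so the cross-iteration dependences cannot be resolved by a compiler.

    #include <stdio.h>
    #include <stdlib.h>

    #define N 1024              /* number of iterations (illustrative)      */
    #define M 512               /* number of elements of X (illustrative)   */

    static double X[M];
    static int u[N], v[N];      /* index arrays, known only at run time     */

    static void target_loop(void) {
        for (int i = 0; i < N; i++) {
            /* Whether iteration i depends on an earlier iteration k hinges on
               whether u[k] equals v[i] or u[i], or v[k] equals u[i]; none of
               this can be decided until u and v have been read. */
            X[u[i]] = X[v[i]] + 1.0;
        }
    }

    int main(void) {
        for (int i = 0; i < N; i++) {   /* stand-in for reading u, v from input */
            u[i] = rand() % M;
            v[i] = rand() % M;
        }
        target_loop();
        printf("X[0] = %f\n", X[0]);
        return 0;
    }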

The DOACROSS technique incurs run-time overhead for analyzing reference-wise dependences, but it provides more flexibility to subsequent scheduling. This paper compares the DOALL and DOACROSS approaches for parallelizing loops at run-time, focusing on the effect of parallelism degree on the performance of parallelization. In [18], we devised two DOACROSS algorithms exploiting different amounts of parallelism and demonstrated their effectiveness on symmetric multiprocessors. This paper presents a DOALL algorithm that allows parallel construction of the wavefronts and compares this algorithm with the DOACROSS algorithms, focusing on the influence of parallelism degree. We show that the DOACROSS algorithms have advantages over the DOALL, even though the latter is preferred for compile-time parallelization. Of the DOACROSS variants, the algorithm that allows fully concurrent reads may over-expose parallelism for large loops; its benefits are outweighed by the run-time cost of exploiting and managing the extra amount of parallelism for large loops or loops with light workload.

The rest of the paper is organized as follows. Section 2 reviews run-time parallelization techniques and qualitatively compares the DOACROSS and DOALL techniques. Section 3 briefly presents three DOACROSS algorithms that expose different amounts of parallelism. Section 4 presents a DOALL algorithm. Section 5 evaluates the algorithms, focusing on the effects of parallelism degree and granularity. Section 6 concludes the paper with a summary of evaluation results.

2 Run-time Parallelization Techniques

In the past, many run-time parallelization algorithms have been developed for different types of loops on both shared-memory and distributed-memory machines [6, 9, 14]. Most of the algorithms follow a so-called INSPECTOR/EXECUTOR approach. With this approach, a loop under consideration is transformed at compile-time into an inspector routine and an executor routine. At run-time, the inspector detects cross-iteration dependences and produces a parallel schedule; the executor performs the actual loop operations in parallel based on the dependence information exposed by the inspector. The keys to success with this approach are to shorten the time spent on dependence analysis without losing valuable parallelism and to reduce the synchronization overhead in the executor.

An alternative to the INSPECTOR/EXECUTOR approach is a speculative execution scheme that was recently proposed by Rauchwerger, Amato, and Padua [13]. In the speculative execution scheme, the target loop is first handled as a doall regardless of its inherent parallelism degree. If a subsequent test at run-time finds that the loop was not fully parallel, the whole computation is rolled back and executed sequentially. Although speculative execution yields good results when the loop is in fact executable as a doall, it fails in most applications that have only partially parallel loops.

The INSPECTOR/EXECUTOR scheme provides a run-time parallelization framework and leaves strategies for dependence analysis and scheduling unspecified. The scheme can also be restructured to decouple the scheduling function from the inspector and to merge it with the executor. The scheduling function can even be extracted to serve as a stand-alone routine between the inspector and the executor. There are many run-time parallelization algorithms belonging to the INSPECTOR/EXECUTOR scheme.
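As a concrete illustration of this division of labor, the following is a minimal, sequential C sketch of an INSPECTOR/EXECUTOR pair for the loop of Figure 1. It is a deliberately conservative toy, not any of the algorithms discussed in this paper: the inspector serializes all accesses to the same element, including read-read pairs that the algorithms below overlap, and the executor runs each wavefront sequentially where a real executor would run it on multiple threads with a barrier between wavefronts. All names and sizes are illustrative.

    #include <stdio.h>
    #include <stdlib.h>

    #define N 16                 /* iterations (illustrative)             */
    #define M 8                  /* elements of X (illustrative)          */

    static double X[M];
    static int u[N], v[N];
    static int wf[N];            /* wavefront level of each iteration     */

    /* Inspector: one sequential pass over the references.  It serializes every
       access to the same element, so the schedule is conservative but safe.  */
    static void inspector(void) {
        int last[M];             /* level of the last access to element e */
        for (int e = 0; e < M; e++) last[e] = -1;
        for (int i = 0; i < N; i++) {
            int lv = last[v[i]] > last[u[i]] ? last[v[i]] : last[u[i]];
            wf[i] = lv + 1;      /* one level after the latest conflict    */
            last[v[i]] = wf[i];
            last[u[i]] = wf[i];
        }
    }

    static void body(int i) {    /* the original loop body */
        X[u[i]] = X[v[i]] + 1.0;
    }

    /* Executor: one pass per wavefront.  Iterations of a wavefront are mutually
       independent; a parallel executor runs them on different threads and puts
       a barrier where the comment indicates. */
    static void executor(void) {
        int depth = 0;
        for (int i = 0; i < N; i++) if (wf[i] > depth) depth = wf[i];
        for (int k = 0; k <= depth; k++) {
            for (int i = 0; i < N; i++)
                if (wf[i] == k) body(i);
            /* barrier between consecutive wavefronts */
        }
    }

    int main(void) {
        for (int i = 0; i < N; i++) { u[i] = rand() % M; v[i] = rand() % M; }
        inspector();
        executor();
        for (int i = 0; i < N; i++) printf("wf[%d] = %d\n", i, wf[i]);
        return 0;
    }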
They differ from each other mainly in their structures and in the strategies used in each routine, in addition to the type of target loops considered. Pioneering work on using the INSPECTOR/EXECUTOR scheme for run-time parallelization is due to Saltz and his colleagues [15]. They considered loops without output dependences (i.e., the indexing function used in the assignments of the loop body is an identity function) and proposed an effective DOALL INSPECTOR/EXECUTOR scheme. Its inspector partitions the set of iterations into a number of wavefronts, which maintain cross-iteration flow dependences. Iterations within the same wavefront can be executed concurrently, but those in different wavefronts must be processed in order. The executor of the DOALL scheme enforces anti-flow dependences during the execution of iterations in the same wavefront. The DOALL INSPECTOR/EXECUTOR scheme has been shown to be effective in many real applications. It is applicable, however, only to loops without output dependences. The basic scheme was recently generalized by Leung and Zahorjan to allow general cross-iteration dependences [10]. In their algorithm, the inspector generates a wavefront-based schedule and maintains output and anti-flow dependences as well as flow dependences; the executor simply performs the loop operations according to the wavefronts of iterations. Note that the inspector in the above scheme is sequential: it requires time commensurate with that of a serial loop execution. Parallelization of the inspector loop was also investigated by Saltz et al. [15] and by Leung and Zahorjan [9]. Their techniques respect flow dependences, but ignore anti-flow and output dependences. Most recently, Rauchwerger, Amato, and Padua presented a parallel inspector algorithm for a general form of loops [14]. They extracted the scheduling function and explicitly presented an inspector/scheduler/executor scheme.

DOALL INSPECTOR/EXECUTOR schemes take a loop iteration as the basic scheduling unit in the inspector and the basic synchronization object in the executor. An alternative is the family of DOACROSS INSPECTOR/EXECUTOR parallelization techniques, which take a memory reference in the loop body as the basic unit of scheduling and synchronization. Processors running the executor are assigned iterations in a wrapped manner, and each spin-waits as needed for the operations that are prerequisites of its own execution. An early study of DOACROSS run-time parallelization techniques was conducted by Zhu and Yew [20]. They proposed a scheme that integrates the functions of dependence analysis and scheduling into a single executor. Later, the scheme was improved by Midkiff and Padua to allow concurrent reads of the same array element by several iterations [12]. Even though the integrated scheme allows concurrent analysis of cross-iteration dependences, the tight coupling of the dependence analysis and the executor causes high synchronization overhead in the executor. Most recently, Chen et al. developed the DOACROSS technique further by decoupling the dependence analysis from the executor [6]. We refer to their technique as the CTY algorithm. Separation of the inspector and executor not only reduces synchronization overhead in the executor, but also provides the possibility of reusing the dependence information developed in the inspector across multiple invocations of the same loop.

Their inspector is parallel, at the sacrifice of concurrent reads to the same array element. Their algorithm was recently improved further by Xu and Chaudhary by allowing concurrent reads of the same array element in different iterations and by increasing the overlap of dependent iterations [18].

DOALL and DOACROSS are two competing techniques for run-time loop parallelization. DOALL parallelizes loops at the iteration level, while DOACROSS supports parallelism at a fine-grained memory-access level. Consider the loop and index arrays shown in Figure 2. The first two iterations can be either independent (when exp(1) is false and exp(2) is true), flow dependent (when exp(1) is true and exp(2) is false), anti-flow dependent (when both exp(1) and exp(2) are true), or output dependent (when both exp(1) and exp(2) are false). The nondeterministic cross-iteration dependences are due to control dependences between statements in the loop body. We call such dependences conditional cross-iteration dependences.

    for i = 1 to n do
        if ( exp(i) )
            X[u1[i]] = F(X[v1[i]], ...)
        else
            X[u2[i]] = F(X[v2[i]], ...)
    endfor

Figure 2: An example of loops with conditional cross-iteration dependences, where F is an arbitrary operator and u1, u2, v1, and v2 are index arrays.

Control dependences can be converted into data dependences by an if-conversion technique at compile-time [1]. The compile-time technique, however, may not be helpful for loop-carried dependence analysis at run-time. With the DOALL technique, loops with conditional cross-iteration dependences must be handled sequentially. The DOACROSS technique, however, can handle this class of loops easily: at run-time, the executor, upon testing a branch condition, may mark all operands in the non-taken branch as available so as to release processors waiting for those operands. Furthermore, the DOACROSS technique overlaps dependent iterations. The first two iterations in Figure 2 have an anti-flow dependence when both exp(1) and exp(2) are true. The read in the second iteration, however, can be overlapped with the execution of iteration 1 without destroying the anti-flow dependence. The DOACROSS INSPECTOR/EXECUTOR parallelization technique thus provides the potential to exploit fine-grained parallelism across loop iterations. Fine-grained parallelism, however, does not necessarily lead to overall performance gains without an efficient implementation of the executor. One main contribution of this paper is to show that multi-threaded implementations favor the DOACROSS technique.

3 The Time-Stamp DOACROSS Algorithms

This section briefly presents three DOACROSS algorithms that feature parallel dependence analysis and scheduling. They expose different amounts of parallelism among consecutive reads to the same memory location. For more details, please see [18]. Consider the general form of loops in Figure 1. It defines a two-dimensional iteration-reference space. The inspector of a time-stamp algorithm examines the memory references in a loop and constructs a dependence chain for each data array element of the loop. In addition to the precedence order, the inspector also assigns a stamp to each reference in a dependence chain, which indicates its earliest access time relative to the other references in the chain. A reference can be activated if and only if the preceding references have finished. The executor schedules the references of a chain through a logical clock: a reference is allowed to proceed only when the clock of its chain has reached its stamp. Dependence chains are associated with clocks ticking at different speeds.
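To fix ideas, the fragment below sketches what a worker thread might execute for a single memory reference under such a clocking scheme: it spins until the chain's clock has reached the reference's stamp, performs the access, and then advances the clock. The stamp and the clock advance are assumed to have been produced by an inspector; the exact stamping and clock-advance rules of the CTY, PCR, and FCR algorithms are given below and in [18], so this is only the general shape, not any of those algorithms verbatim.

    #include <stdatomic.h>

    #define M 8                          /* number of elements, i.e., chains */

    static atomic_int chain_clock[M];    /* one logical clock per chain      */

    /* Executed by a worker thread for one memory reference.  Reads of the same
       group share a stamp, so several of them may pass the wait at once and
       overlap; a write waits until all of its predecessors have advanced the
       clock to its stamp. */
    void do_reference(int elem, int stamp, int advance, void (*access)(int)) {
        while (atomic_load(&chain_clock[elem]) < stamp)
            ;                            /* spin: predecessors not finished   */
        access(elem);                    /* perform the read or write         */
        atomic_fetch_add(&chain_clock[elem], advance);
    }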
Assume the stamps of references are discrete integers. The stamps are stored in a two-dimensional array stamp. Let (i, j) denote the j-th access of the i-th iteration; stamp[i][j] represents the stamp of reference (i, j). By scanning through the iteration space sequentially, processors can easily construct a time-stamp array with the following features: the stamp difference between any two directly connected references is one, except for pairs of write-after-read and read-after-read references; both reads in a read-after-read pair have the same stamp; and in a write-after-read pair, the difference is the size of the read group minus one. Figure 3 shows an example derived from the target loop for given index arrays u and v.

Figure 3: Sequentially constructed dependence chains labeled by array elements; the numbers in parentheses are the stamps of the references.

Based on such time-stamp arrays, a simple clock rule in the executor can preserve all dependences and allow consecutive reads to be run in parallel.

3.1 A Parallel Algorithm Allowing Partially Concurrent Reads (PCR)

Building dependence chains requires examining all references in the loop at least once. To construct the time-stamp array in parallel, one key issue is how to stamp the references of a dependence chain that crosses iteration regions on different processors. Since no processor (except the first) has knowledge about the references in previous regions, it is unable to stamp its local references in a local chain without knowing the assignment of the chain's head. To allow processors to continue examining the other references in their local regions in parallel, the time-stamp inspector uses a conservative approach: it assigns a conservative number to the second reference of a local chain and leaves the first one to be decided in a subsequent global analysis.

Figure 4: A fully stamped dependence chain labeled by array element; the numbers in parentheses are the stamps of the references.

Using this conservative approach, most of the stamp table can be constructed in parallel. Upon completion of the local analysis, processors communicate with each other to determine the stamps of the undecided references in the stamp table. Figure 4 shows the complete dependence chains associated with three array elements. Processor 3 temporarily assigns 26 to the reference (12, 1), assuming that all 24 accesses in regions 0 to 2 are in the same dependence chain. In the subsequent cross-processor analysis, processor 2 sets stamp[8][1] after communicating with its preceding processors (processor 1 records no reference to the same location). At the same time, processor 3 communicates with processor 2, but gets an undecided stamp for the reference (8, 1), and hence assigns another conservative number, 16 plus 1, to reference (12, 0), assuming that all accesses in regions 0 and 1 are in the same dependence chain; the extra one is due to the dependent references in region 2. Note that the communications from processor 3 to processor 2 and from processor 2 to processor 1 proceed in parallel; processor 2 can provide processor 3 only the number of references in its local region until its own communication with processor 1 has finished.

Accordingly, the time-stamp algorithm uses a special clocking rule that sets the clock of a dependence chain to n + 2 if an activated reference in region r is a local head, where n is the total number of accesses from region 0 to region r-1. For example, the reference (2, 0) in Figure 4 first triggers the reference (4, 1). Activation of reference (4, 1) sets the clock to 10, because there are 8 accesses in the first region, which consequently triggers the following two reads.

Note that this parallel inspector algorithm only allows consecutive reads in the same region to be performed in parallel. Read operations in different regions must be performed sequentially even though they are totally independent of each other. In the dependence chain of Figure 4, for example, the reads (9, 0) and (10, 0) are activated after the reads (6, 0) and (7, 0). We could assign reads (9, 0) and (10, 0) the same stamp as reads (6, 0) and (7, 0) and stamp the following write accordingly; such a chain, however, would destroy the anti-flow dependences from (6, 0) and (7, 0) to that write in the executor if reference (9, 0) or (10, 0) started earlier than one of the reads in region 1.

3.2 A Parallel Algorithm Allowing Fully Concurrent Reads (FCR)

The basic idea of the algorithm is to treat write operations and groups of consecutive reads as macro-references. For a write reference or the first read operation of a read group in a dependence chain, the inspector stamps the reference with the total number of macro-references ahead of it; the other accesses of a read group are assigned the same stamp as the first read. Correspondingly, in the executor, the clock of a dependence chain is incremented by one time unit on a write reference and by a fraction of a time unit on a read operation. The magnitude of the increment on a read operation is the reciprocal of its read-group size.
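The arithmetic of this fractional clock can be seen in a few lines. The chain below (a write, a group of four reads, another write) is made up for illustration; it simply shows that a group of g reads, each advancing the clock by 1/g, moves the chain forward by exactly one unit, after which the trailing write may proceed.

    #include <stdio.h>

    struct ref { char type; int group; };   /* 'W' or 'R'; group = read-group size */

    int main(void) {
        /* One write, a group of four reads, and a trailing write. */
        struct ref chain[] = { {'W', 0}, {'R', 4}, {'R', 4}, {'R', 4}, {'R', 4}, {'W', 0} };
        double clk = 0.0;                   /* logical clock of this chain */
        for (unsigned k = 0; k < sizeof chain / sizeof chain[0]; k++) {
            clk += (chain[k].type == 'W') ? 1.0 : 1.0 / chain[k].group;
            printf("after reference %u (%c): clock = %.2f\n", k, chain[k].type, clk);
        }
        return 0;   /* the four reads together advance the clock by exactly 1.0 */
    }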
Figure 5 presents sequentially stamped dependence chains; in addition to the stamp, each read reference is also associated with an extra data item recording its read-group size. In an implementation, the variable for the read-group size can be combined with the variable for the stamp; for simplicity of presentation, however, they are declared as two separate integers. Consider the dependence chain containing the reference (4, 1): this write triggers four subsequent reads, (6, 0), (7, 0), (9, 0), and (10, 0), simultaneously. Activation of each of these reads increments the clock by 1/4; after all of them have finished, the clock has advanced by one full unit, which in turn activates the next write in the chain. The details of the algorithm follow.

Figure 5: Sequentially constructed dependence chains labeled by array element; the numbers in parentheses are the stamps of the references and, for reads, the read-group sizes.

As in the PCR algorithm, the inspector first partitions the iteration space of a target loop into a number of regions, and each region is assigned to a different processor. Each processor first stamps its local references, except the head macro-references at the beginning of its local dependence chains. References next to head macro-references are stamped with conservative numbers using the same conservative approach as in the PCR algorithm. Processors then communicate with each other to merge consecutive read groups and to stamp the undecided macro-references in the time table.

The basis of the algorithm is a three-dimensional stamp table: stamp[i][j][0] records the stamp of reference (i, j) in the iteration-reference space, and stamp[i][j][1] stores the size of the read group to which reference (i, j) belongs. If reference (i, j) is a write, stamp[i][j][1] is not used. The local inspector at each processor passes once through its iteration-reference region, forms read groups, and assigns appropriate stamps to its inner references. Inner references are those whose stamps can be determined locally using the conservative approach: all write operations except the heads of local dependence chains are inner references, and all reads of a read group are inner references if the group is neither the head nor the tail of any local dependence chain. Figure 6 presents a partially stamped dependence chain constructed in this way. From the figure, it can be seen that region 1 establishes a right-open single-read group, region 2 establishes a right-open group with two members, and region 3 builds a group open at both ends. The local inspector stamps all inner references of each region.

A subsequent global inspector merges consecutive read groups that are open to each other and assigns appropriate stamps to the newly formed closed read groups and to the undecided writes. Figure 7 is a fully stamped chain evolved from Figure 6.

Figure 6: A partially stamped dependence chain constructed in parallel.

Figure 7: A fully stamped dependence chain constructed in parallel.

The executor uses an extra clocking rule to allow concurrent reads while still preserving anti-flow dependences. That is, if the reference is a read in region r, the clock of its dependence chain is incremented by 1/b if the read is not a head of region r; otherwise, it is set to n + 1 + 1/b + frac, where b is the read-group size and frac is the fractional part of the current clock value. For example, consider the dependence chain of Figure 7. Activation of reference (2, 0) increments the chain's clock by one because it forms a read group of size one. Reference (4, 1), a write, then advances the clock and triggers all four subsequent reads simultaneously. Suppose the four reads are performed in the order (6, 0), (9, 0), (7, 0), (10, 0). Reference (6, 0) increments the clock by 1/4. Reference (9, 0), however, is the head read of its region and therefore sets the clock by the head rule. The subsequent two reads add 1/4 each. Upon completion of all the reads, their subsequent write is activated. The purpose of the fractional part of the clock is to record the number of activated reads in a group.

4 A DOALL algorithm

In this section, we present a DOALL INSPECTOR/EXECUTOR algorithm for run-time parallelization. The algorithm breaks the parallelization down into three routines: inspector, scheduler, and executor. The inspector examines the memory references in a loop and constructs a reference-wise dependence chain for each data element accessed in the loop. The scheduler then derives more restrictive iteration-wise dependence relations from the reference-wise dependence chains. Finally, the executor restructures the loop into a sequence of wavefronts and executes iterations accordingly. Iteration-wise dependence information is usually represented by a vector of integers, denoted by wf. Each element of the vector, wf[i], indicates the earliest invocation time of iteration i. The wavefront vector bridges the scheduler and the executor, being their output and input, respectively.

The reference-wise dependence chains can be represented in different ways, and different data structures lead to different inspector and scheduler algorithms. Desirable features of the structure are low memory overhead, simple parallel construction, and easy generation of wavefront vectors by the scheduler. In [14], Rauchwerger et al. presented an algorithm (RAP, for short) that uses a reference array R to collect all the references to an element in iteration order and a hierarchy vector H to record the index of the first reference of each wavefront level in the array. For example, for the loop in Figure 1 and the memory access pattern defined by u and v in Section 3, the reference array of one element and its hierarchy vector are shown in Figure 8: each entry H[k] gives the index in R of the first reference of the k-th wavefront. The RAP scheduler uses these two data structures as look-up tables for determining the predecessors and successors of all the references.

Figure 8: The reference array (a) and hierarchy vector (b) of an element for the loop in Figure 1 with the memory access pattern defined by u and v; each entry of R records the iteration, the reference type (read/write), and the wavefront level.

Since the reference array in the RAP algorithm stores the reference levels of a dependence chain, it is hard to construct in parallel using the ideas of the DOACROSS algorithms: we cannot assign conservative levels to references, because their levels will be used to derive consecutive iteration-wise wavefront levels. In the following, we present a new algorithm that simply uses a dependence table, as shown in Figure 9, to record the memory-reference dependences. The table differs from the stamp array of the time-stamp algorithms. First, each dependence chain of the table reflects only the order of precedence of its memory references; there are no levels associated with the references. Second, not all memory references play a role in the dependence chains. Since the DOALL approach concerns only iteration-wise dependences, a read operation can be ignored if there is another write in the same iteration that accesses the same memory location. For example, the read in iteration 12 is overruled by the write in the same iteration.

Figure 9: Dependence table for the loop in Figure 1 with the memory access pattern defined by u and v.

Each cell of the table is defined as follows:

    struct Cell {
        int iter;        /* current iteration index             */
        int elm;         /* element being referenced            */
        char type;       /* reference type: RD/WR               */
        int *xleft;      /* (xleft, yleft) points to the cell   */
        int *yleft;      /* of its predecessor in the chain     */
    };

For parallel construction of the dependence table in the inspector phase, each processor maintains a pair of pointers (head[i], tail[i]) for each memory location i, which point to the head and tail of its local portion of the associated dependence chain, respectively. As in the DOACROSS algorithms, the inspector needs cross-processor communication to connect the adjacent local dependence chains associated with the same memory location. Processor k can find the predecessor (successor) of its local dependence chain for location i by scanning tail[i] (head[i]) of processors from k-1 down to 0 (from k+1 up to N-1).

Based on the dependence table, processors then construct a wavefront vector in parallel by synchronizing accesses to each element of the wavefront vector. Specifically, for iteration i, a processor determines the dependence levels of all its references and sets wf[i] to the highest level. The dependence level of a reference r of iteration i is calculated according to the following rules:

(S1) if the reference r is a write and its immediate predecessor is a read, the processor examines all the consecutive reads immediately preceding it and sets the reference level to max{wf[k]} + 1, where k ranges over the iterations containing those reads;

(S2) if the reference r is a write and its immediate predecessor is a write (say in iteration k), the reference level is set to wf[k] + 1;

(S3) if the reference r is a read, the processor backtracks along the reference's dependence chain until it meets a write (say in iteration k); the reference level is then set to wf[k] + 1.

Notice that synchronization of accesses to the wavefront-vector elements requires a blocking read if the target still holds an invalid value; the waiting reads associated with an element are woken up by each write to that element. Applying the rules to the dependence table in Figure 9 yields the wavefront vector for the example loop.

Taking the wavefront vector as input, the executor can be described as follows:

    for k = 0 to d-1 do
        forall i such that wf[i] = k
            perform iteration i
        endfor
        barrier
    endfor

where d is the number of wavefronts in a schedule.
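To make rules (S1)-(S3) concrete, the following sequential sketch derives the wavefront vector for the two-reference loop of Figure 1 directly from the access order. It keeps, per element, the wavefront of the last writer and the highest wavefront among the reads issued since that write, which are the quantities the rules consult through the dependence table; the parallel scheduler described above computes the same levels with synchronized accesses to wf[]. Everything else here (array sizes, the index-array generator) is an illustrative assumption.

    #include <stdio.h>
    #include <stdlib.h>

    #define N 16
    #define M 8

    static int u[N], v[N];
    static int wf[N];

    static void schedule(void) {
        int last_write[M], reads_since[M];
        for (int e = 0; e < M; e++) { last_write[e] = -1; reads_since[e] = -1; }

        for (int i = 0; i < N; i++) {
            int lvl = 0;
            if (v[i] != u[i]) {
                /* (S3): a read waits for the last write to the same element;
                   a read overruled by a write in the same iteration is ignored. */
                lvl = last_write[v[i]] + 1;
            }
            /* (S1)/(S2): a write waits for the last write to its element and
               for all reads issued since that write. */
            int w = last_write[u[i]];
            if (reads_since[u[i]] > w) w = reads_since[u[i]];
            if (w + 1 > lvl) lvl = w + 1;

            wf[i] = lvl;                      /* iteration level = max reference level */
            if (v[i] != u[i] && wf[i] > reads_since[v[i]]) reads_since[v[i]] = wf[i];
            last_write[u[i]] = wf[i];
            reads_since[u[i]] = -1;           /* a new write closes the read group     */
        }
    }

    int main(void) {
        for (int i = 0; i < N; i++) { u[i] = rand() % M; v[i] = rand() % M; }
        schedule();
        for (int i = 0; i < N; i++) printf("wf[%d] = %d\n", i, wf[i]);
        return 0;
    }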
5 Performance Evaluations

We implemented the DOACROSS and DOALL run-time parallelization algorithms on a Sun Enterprise server, a symmetric multiprocessor configured with four 167 MHz UltraSPARC processors and 512 MB of memory. Each processor module has an external cache of 1 MB. Each DOACROSS algorithm was implemented as a sequence of three routines: a local inspector, a global inspector, and an executor. The implementation of the DOALL algorithm has an extra scheduler between its global inspector and executor. All routines are separated by barrier synchronization operations. They were programmed in the Single-Program-Multiple-Data (SPMD) paradigm as multi-threaded codes. At the beginning, threads are created in bound mode so that they are bound to different CPUs and run to completion.

The performance of a run-time parallelization algorithm depends on a number of factors. One is the structure of the target loops. This experiment targets the same synthetic loop structure used in [6, 18]: a sequence of interleaved reads and writes in the loop body. Each iteration executes a delay() function, reflecting the workload caused by its memory references. In [18], another, so-called multiple-reads-single-write, loop structure was considered in a preliminary evaluation of the DOACROSS algorithms; it was found that the FCR and PCR algorithms gain few benefits from the extra reads in that loop structure.

Another major factor affecting the overall performance is the memory access pattern defined by the index arrays. Uniform and non-uniform memory access patterns were considered. A uniform access pattern (UNIFORM, for short) assumes that all array elements have the same probability of being accessed by a memory reference. A non-uniform access pattern (NUNIFORM, for short) refers to a pattern where 90% of the references go to 10% of the array elements. Non-uniform access patterns reflect hot reference spots and result in long dependence chains.

In addition to the loop structure and memory access pattern, the performance of a parallelization algorithm is also critically affected by its implementation strategies. One of the major issues is loop distribution. For the local and global inspectors, an iteration-wise block decomposition is straightforward because it not only ensures a balanced workload among processors but also allows processors to run in parallel. Loops in the scheduler and executor can be distributed in different ways: cyclic, block, or block cyclic. Their effects on the performance are evaluated as well.

The experiments measure the overall run-time execution time for a given loop and memory access pattern. Each data point is the average of five runs, using different seeds for the generation of pseudo-random numbers. Since a seed determines a pseudo-random sequence, the algorithms can be evaluated on the same test instances.

5.1 Impact of access patterns and loop sizes

Figure 10 presents speedups of the different parallelization algorithms over serial loops ranging from 16 to 16384 iterations. Assume each iteration has four memory references and a fixed workload in the delay() function.

Loops in the scheduler and executor are decomposed in a cyclic way; Section 5.3 will show that the cyclic decomposition yields the best performance for all algorithms.

Figure 10: Speedups of the parallelization algorithms over serial loops of different sizes; (a) uniform access patterns, (b) non-uniform access patterns.

Overall, Figure 10 shows that the DOACROSS algorithms significantly outperform the DOALL algorithm, even though both techniques are capable of gaining speedups for loops that exhibit uniform and non-uniform memory access patterns. The gap between their speedup curves indicates the importance of exploiting fine-grained reference-level parallelism at run-time. Reference-level parallelism is especially important for small loops because they normally have very limited degrees of iteration-level parallelism. We refer to the degree of iteration-level parallelism as the number of iterations in a wavefront of the DOALL algorithm. Figure 11 plots the average degree of parallelism in loops of different sizes. As can be observed, the degree of parallelism is proportional to the loop size. A loop that exhibits non-uniform access patterns has at most four degrees of parallelism until its loop size reaches beyond 512. This implies that iteration-level parallelization techniques for such small loops will not gain any benefit on a system with four processors. This is in agreement with the speedup curve of the DOALL algorithm in Figure 10(b). Similar observations can be made for loops with uniform access patterns from the plot of parallelism degrees in Figure 11 and the speedup curve in Figure 10(a).

Figure 11: Average degree of parallelism exploited by the DOALL algorithm.

It is worth noting that higher degrees of parallelism do not necessarily lead to better performance. From Figure 10(a), it can be seen that the speedup of the DOALL algorithm starts to drop when the loop size grows beyond 4096. The average degree of parallelism at that point is 128, which is evidently excessive on a system with only four processors. The performance degradation is due to the cost of barrier synchronizations.

The DOACROSS technique delivers high speedups for small loops because it is able to exploit enough parallelism to keep all processors busy all the time. Expectedly, the amount of fine-grained parallelism quickly becomes excessive as the loop size increases. An interesting feature of the DOACROSS algorithms is that their speedups stabilize at the highest level when parallelism is over-exposed, rather than declining as with the DOALL technique. This is because the cost of the point-to-point synchronizations used in the DOACROSS executor is independent of the parallelism degree; consequently, the execution time of the executor is proportional to the loop size.

Of the DOACROSS variants, the FCR and PCR algorithms are superior to the CTY for small loops, while the CTY algorithm is preferred for large ones. The FCR and PCR algorithms improve on the CTY by allowing consecutive reads to the same location to be run simultaneously. The extra amount of parallelism does benefit small loops. For large loops, however, the benefit can be outweighed by the cost of exploiting and managing the parallelism. Since the FCR algorithm incurs more overhead than the CTY in the global inspector, it obtains lower speedups for large loops.
In contrast, the PCR algorithm obtains almost the same speedup as the CTY because it incurs only slightly more overhead in the local inspector.

To better understand the relative performance of the algorithms, we break down their execution into a number of principal tasks and present their percentages of the overall execution time in Figure 12. The initialization curve indicates the cost of memory allocation, initialization, and thread creation. The cost spent in the local inspector, the global inspector, or the scheduler is indicated by the range between the curve of its predecessor and its own curve. The remainders above the global inspector and scheduler curves are due to the executors of the DOACROSS and DOALL algorithms, respectively. From the figure, it can be seen that for large loops both the CTY and PCR algorithms spend only a small fraction of their time in initialization and in their local and global inspectors. The FCR algorithm spends a noticeably larger share of its time in initialization because it uses a three-dimensional stamp table (instead of the two-dimensional stamp array of the CTY and PCR algorithms) and different auxiliary data structures, head and tail, all of which must be allocated at run-time. The FCR algorithm also spends more time in the global inspector to exploit concurrent reads across different regions. Compared with the CTY and PCR algorithms, the FCR reduces the time spent in the executor significantly. In cases where the dependence analysis can be reused across multiple loop invocations, the FCR algorithm is expected to achieve even better performance. Figure 12(d) shows that the DOALL algorithm spends a high percentage of its time in the scheduler, generating iteration-wise dependences from the reference-wise dependence information. The percentage decreases as the loop size increases because the cost of the barrier synchronization operations in the executor increases.

5.2 Impact of loop workload

It is known that the cost of a run-time parallelization algorithm in the inspector and scheduler is independent of the workload of the iterations. The time spent in the executor, however, is proportional to the amount of loop workload: the larger the workload, the smaller the cost percentage of the inspector and scheduler. The experiments above assumed a relatively heavy workload at each iteration. Figure 13 presents speedups of the algorithms under a lighter iteration workload instead. From the figure, it can be seen that all algorithms lose a certain amount of speedup. However, the relative performance of the algorithms remains the same as revealed by Figure 10. The performance gap between the FCR algorithm and the CTY and PCR algorithms is enlarged because the relative cost of the global analysis in the FCR algorithm increases as the workload decreases.

5.3 Impact of loop distributions

Generally, a loop iteration space can be assigned to threads in either a static or a dynamic approach. Static approaches assign iterations to threads prior to their execution, while dynamic approaches make decisions at run-time. Their advantages and disadvantages were discussed in [19] in the general context of task mapping and load balancing. In this experiment, we tried three simple static assignment strategies, cyclic, block, and block-cyclic, because their simplicity lends itself to efficient implementation at run-time. Let b denote the block size.

Figure 12: Breakdown of execution time, in percentages, over the different stages (initialization, local analysis, global analysis, scheduler, executor) for (a) the CTY algorithm, (b) the PCR algorithm, (c) the FCR algorithm, and (d) the DOALL algorithm.

Figure 13: Speedups of the algorithms over serial code for various loops under the lighter workload; (a) uniform access patterns, (b) non-uniform access patterns.

Figure 14: The effect of block cyclic decompositions on total execution time as a function of block size; (a) uniform dependence patterns, (b) non-uniform dependence patterns.

A block cyclic distribution algorithm assigns iteration i to thread ⌊(i mod bM)/b⌋, where M is the number of threads. It reduces to a cyclic distribution if b = 1 and to a block distribution if b = N/M, where N is the number of iterations. Figure 14 shows the effect of block cyclic decompositions on the execution time of a loop with 1024 iterations. The figure shows that the DOACROSS algorithms are very sensitive to the block size. They prefer cyclic or small block-cyclic distributions because these lead to good load balance among processors. Each plot of the DOACROSS algorithms has a knee, beyond which its execution time increases sharply. The FCR and PCR algorithms have the largest knees, reflecting the fact that these algorithms exploit the largest degree of parallelism; given more processors, they are projected to perform better than the CTY. The DOALL algorithm uses a block cyclic decomposition in the scheduler, and its overall execution time is insensitive to the block size. In the case of non-uniform access patterns, a large block-cyclic decomposition is slightly superior to the cyclic distribution.
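A few lines suffice to check the degenerate cases of this mapping; the function below is an illustrative sketch of the assignment formula, not code from the implementation.

    #include <stdio.h>

    /* owner(i) = floor((i mod b*M) / b): the thread that executes iteration i
       under a block cyclic distribution with block size b and M threads.      */
    static int owner(int i, int b, int M) {
        return (i % (b * M)) / b;
    }

    int main(void) {
        int N = 16, M = 4;
        for (int b = 1; b <= N / M; b *= 2) {     /* b = 1 (cyclic) ... N/M (block) */
            printf("b = %d:", b);
            for (int i = 0; i < N; i++) printf(" %d", owner(i, b, M));
            printf("\n");
        }
        return 0;
    }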

6 Conclusions

In this paper, we considered two run-time loop parallelization techniques, DOALL and DOACROSS, which expose different granularities and degrees of parallelism in a loop. The DOALL technique exploits iteration-level parallelism: it restructures the loop into a sequence of doall loops separated by barrier operations. The DOACROSS technique supports fine-grained reference-level parallelism: it allows dependent iterations to be run concurrently by inserting synchronization operations that preserve dependences. The DOACROSS technique has variants, CTY, PCR, and FCR, which expose different amounts of parallelism among concurrent reads to the same memory location. Both approaches follow the so-called INSPECTOR/EXECUTOR scheme. We evaluated the performance of the algorithms on a symmetric multiprocessor for loops with different structures, memory access patterns, and computational workloads. In the executor, loops are distributed among processors in a block cyclic way. The experimental results showed that:

- The DOACROSS technique outperforms the DOALL even though the latter plays a dominant role in compile-time loop parallelization. This is because the DOALL algorithm spends a high percentage of its time in the scheduler routine for the construction of iteration-wise wavefronts. Hot reference spots have more negative effects on the DOALL algorithm due to its limited iteration-level parallelism. The DOACROSS technique identifies fine-grained reference parallelism at the cost of frequent synchronization in the executor; multithreaded implementations reduce the synchronization overhead of the algorithms.

- Of the DOACROSS variants, the PCR algorithm performs best because it incurs only slightly more overhead than the CTY algorithm. The FCR algorithm improves on the CTY algorithm for small loops that do not have enough parallelism. For large loops or loops with light workload, its benefits are likely to be outweighed by its extra run-time cost.

- Loops that are to be executed repeatedly favor the DOALL and FCR algorithms, because these spend a high percentage of their time in dependence analysis and scheduling, and that time can be saved in subsequent loop invocations.

- The DOACROSS algorithms are sensitive to the block size of the loop distribution. Cyclic and small block-cyclic distributions yield better performance.

Future work includes examining the issues of load balancing and locality when parallelizing loops that are executed repeatedly, and evaluating the algorithms on loops from real applications.

Acknowledgements

This work was supported in part by a startup grant from Wayne State University. The author would like to thank Sumit Roy for his help with the experiments and Vipin Chaudhary for his insights into the experimental data. Thanks also go to Loren Schwiebert for his advice on the presentation.

References

[1] J. R. Allen, K. Kennedy, C. Porterfield, and J. Warren. Conversion of control dependence to data dependence. In Proc. of the 10th ACM Symposium on Principles of Programming Languages, Jan. 1983.
[2] D. F. Bacon, S. L. Graham, and O. J. Sharp. Compiler transformations for high-performance computing. ACM Computing Surveys, 26(4), Dec. 1994.
[3] U. Banerjee, R. Eigenmann, A. Nicolau, and D. A. Padua. Automatic program parallelization. Proceedings of the IEEE, 81(2), Feb. 1993.
[4] W. J. Camp, S. J. Plimpton, B. A. Hendrickson, and R. W. Leland. Massively parallel methods for engineering and science problems. Comm. ACM, 37(4), April 1994.
[5] D. K. Chen and P. C. Yew. An empirical study on DOACROSS loops. In Proc. of Supercomputing '91.
[6] D. K. Chen, P. C. Yew, and J. Torrellas. An efficient algorithm for the run-time parallelization of doacross loops. In Proc. of Supercomputing 1994, Nov. 1994.
[7] R. Cytron. DOACROSS: Beyond vectorization for multiprocessors. In Proc. of the International Conference on Parallel Processing, 1986.
[8] J. Ju and V. Chaudhary. Unique sets oriented partitioning of nested loops with non-uniform dependences. In Proc. of the International Conference on Parallel Processing.
[9] S.-T. Leung and J. Zahorjan. Improving the performance of runtime parallelization. In Proc. of the 4th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, May 1993.
[10] S.-T. Leung and J. Zahorjan. Extending the applicability and improving the performance of runtime parallelization. Technical Report, Department of Computer Science, University of Washington, 1995.
[11] J. T. Lim, A. R. Hurson, K. Kavi, and B. Lee. A loop allocation policy for DOACROSS loops. In Proc. of the Symposium on Parallel and Distributed Processing, 1996.
[12] S. Midkiff and D. Padua. Compiler algorithms for synchronization. IEEE Trans. on Computers, C-36(12), December 1987.
[13] L. Rauchwerger and D. Padua. The LRPD test: Speculative run-time parallelization of loops with privatization and reduction parallelization. In ACM SIGPLAN Conf. on Programming Language Design and Implementation, June 1995.
[14] L. Rauchwerger, N. M. Amato, and D. A. Padua. Run-time methods for parallelizing partially parallel loops. Technical Report, UIUC, 1995.
[15] J. Saltz, R. Mirchandaney, and K. Crowley. Run-time parallelization and scheduling of loops. IEEE Trans. Comput., 40(5), May 1991.
[16] Z. Shen, Z. Li, and P. C. Yew. An empirical study on array subscripts and data dependencies. In Proc. of the International Conference on Parallel Processing.
[17] SunSoft. Multithreaded Programming Guide.
[18] C. Xu and V. Chaudhary. Time-stamping algorithms for parallelization of loops at run-time. In Int. Symposium on Parallel Processing.
[19] C. Xu and F. C. M. Lau. Load Balancing in Parallel Computers: Theory and Practice. Kluwer Academic Publishers.
[20] C. Zhu and P. C. Yew. A scheme to enforce data dependence on large multiprocessor systems. IEEE Trans. Softw. Eng., 13(6):726-739, 1987.


More information

Garbage Collection (2) Advanced Operating Systems Lecture 9

Garbage Collection (2) Advanced Operating Systems Lecture 9 Garbage Collection (2) Advanced Operating Systems Lecture 9 Lecture Outline Garbage collection Generational algorithms Incremental algorithms Real-time garbage collection Practical factors 2 Object Lifetimes

More information

FILTER SYNTHESIS USING FINE-GRAIN DATA-FLOW GRAPHS. Waqas Akram, Cirrus Logic Inc., Austin, Texas

FILTER SYNTHESIS USING FINE-GRAIN DATA-FLOW GRAPHS. Waqas Akram, Cirrus Logic Inc., Austin, Texas FILTER SYNTHESIS USING FINE-GRAIN DATA-FLOW GRAPHS Waqas Akram, Cirrus Logic Inc., Austin, Texas Abstract: This project is concerned with finding ways to synthesize hardware-efficient digital filters given

More information

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need?? Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross

More information

Implementation Techniques

Implementation Techniques V Implementation Techniques 34 Efficient Evaluation of the Valid-Time Natural Join 35 Efficient Differential Timeslice Computation 36 R-Tree Based Indexing of Now-Relative Bitemporal Data 37 Light-Weight

More information

An Empirical Performance Study of Connection Oriented Time Warp Parallel Simulation

An Empirical Performance Study of Connection Oriented Time Warp Parallel Simulation 230 The International Arab Journal of Information Technology, Vol. 6, No. 3, July 2009 An Empirical Performance Study of Connection Oriented Time Warp Parallel Simulation Ali Al-Humaimidi and Hussam Ramadan

More information

An Introduction to Parallel Programming

An Introduction to Parallel Programming An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe

More information

A Level-wise Priority Based Task Scheduling for Heterogeneous Systems

A Level-wise Priority Based Task Scheduling for Heterogeneous Systems International Journal of Information and Education Technology, Vol., No. 5, December A Level-wise Priority Based Task Scheduling for Heterogeneous Systems R. Eswari and S. Nickolas, Member IACSIT Abstract

More information

Parallel Discrete Event Simulation

Parallel Discrete Event Simulation Parallel Discrete Event Simulation Dr.N.Sairam & Dr.R.Seethalakshmi School of Computing, SASTRA Univeristy, Thanjavur-613401. Joint Initiative of IITs and IISc Funded by MHRD Page 1 of 8 Contents 1. Parallel

More information

UNIT I (Two Marks Questions & Answers)

UNIT I (Two Marks Questions & Answers) UNIT I (Two Marks Questions & Answers) Discuss the different ways how instruction set architecture can be classified? Stack Architecture,Accumulator Architecture, Register-Memory Architecture,Register-

More information

Principles of Parallel Algorithm Design: Concurrency and Mapping

Principles of Parallel Algorithm Design: Concurrency and Mapping Principles of Parallel Algorithm Design: Concurrency and Mapping John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 3 17 January 2017 Last Thursday

More information

Lecture 10 Midterm review

Lecture 10 Midterm review Lecture 10 Midterm review Announcements The midterm is on Tue Feb 9 th in class 4Bring photo ID 4You may bring a single sheet of notebook sized paper 8x10 inches with notes on both sides (A4 OK) 4You may

More information

Simulating ocean currents

Simulating ocean currents Simulating ocean currents We will study a parallel application that simulates ocean currents. Goal: Simulate the motion of water currents in the ocean. Important to climate modeling. Motion depends on

More information

Hierarchical Pointer Analysis for Distributed Programs

Hierarchical Pointer Analysis for Distributed Programs Hierarchical Pointer Analysis for Distributed Programs Amir Kamil Computer Science Division, University of California, Berkeley kamil@cs.berkeley.edu April 14, 2006 1 Introduction Many distributed, parallel

More information

The Encoding Complexity of Network Coding

The Encoding Complexity of Network Coding The Encoding Complexity of Network Coding Michael Langberg Alexander Sprintson Jehoshua Bruck California Institute of Technology Email: mikel,spalex,bruck @caltech.edu Abstract In the multicast network

More information

Performance Study of the MPI and MPI-CH Communication Libraries on the IBM SP

Performance Study of the MPI and MPI-CH Communication Libraries on the IBM SP Performance Study of the MPI and MPI-CH Communication Libraries on the IBM SP Ewa Deelman and Rajive Bagrodia UCLA Computer Science Department deelman@cs.ucla.edu, rajive@cs.ucla.edu http://pcl.cs.ucla.edu

More information

The LRPD Test: Speculative Run Time Parallelization of Loops with Privatization and Reduction Parallelization

The LRPD Test: Speculative Run Time Parallelization of Loops with Privatization and Reduction Parallelization The LRPD Test: Speculative Run Time Parallelization of Loops with Privatization and Reduction Parallelization Lawrence Rauchwerger and David Padua University of Illinois at Urbana-Champaign Abstract Current

More information

Enabling Loop Parallelization with Decoupled Software Pipelining in LLVM: Final Report

Enabling Loop Parallelization with Decoupled Software Pipelining in LLVM: Final Report Enabling Loop Parallelization with Decoupled Software Pipelining in LLVM: Final Report Ameya Velingker and Dougal J. Sutherland {avelingk, dsutherl}@cs.cmu.edu http://www.cs.cmu.edu/~avelingk/compilers/

More information

Design of Parallel Algorithms. Models of Parallel Computation

Design of Parallel Algorithms. Models of Parallel Computation + Design of Parallel Algorithms Models of Parallel Computation + Chapter Overview: Algorithms and Concurrency n Introduction to Parallel Algorithms n Tasks and Decomposition n Processes and Mapping n Processes

More information

Language and Compiler Support for Out-of-Core Irregular Applications on Distributed-Memory Multiprocessors

Language and Compiler Support for Out-of-Core Irregular Applications on Distributed-Memory Multiprocessors Language and Compiler Support for Out-of-Core Irregular Applications on Distributed-Memory Multiprocessors Peter Brezany 1, Alok Choudhary 2, and Minh Dang 1 1 Institute for Software Technology and Parallel

More information

UvA-SARA High Performance Computing Course June Clemens Grelck, University of Amsterdam. Parallel Programming with Compiler Directives: OpenMP

UvA-SARA High Performance Computing Course June Clemens Grelck, University of Amsterdam. Parallel Programming with Compiler Directives: OpenMP Parallel Programming with Compiler Directives OpenMP Clemens Grelck University of Amsterdam UvA-SARA High Performance Computing Course June 2013 OpenMP at a Glance Loop Parallelization Scheduling Parallel

More information

Data Flow Graph Partitioning Schemes

Data Flow Graph Partitioning Schemes Data Flow Graph Partitioning Schemes Avanti Nadgir and Harshal Haridas Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania 16802 Abstract: The

More information

Optimizing Closures in O(0) time

Optimizing Closures in O(0) time Optimizing Closures in O(0 time Andrew W. Keep Cisco Systems, Inc. Indiana Univeristy akeep@cisco.com Alex Hearn Indiana University adhearn@cs.indiana.edu R. Kent Dybvig Cisco Systems, Inc. Indiana University

More information

Parallel Programming Patterns Overview CS 472 Concurrent & Parallel Programming University of Evansville

Parallel Programming Patterns Overview CS 472 Concurrent & Parallel Programming University of Evansville Parallel Programming Patterns Overview CS 472 Concurrent & Parallel Programming of Evansville Selection of slides from CIS 410/510 Introduction to Parallel Computing Department of Computer and Information

More information

More on Conjunctive Selection Condition and Branch Prediction

More on Conjunctive Selection Condition and Branch Prediction More on Conjunctive Selection Condition and Branch Prediction CS764 Class Project - Fall Jichuan Chang and Nikhil Gupta {chang,nikhil}@cs.wisc.edu Abstract Traditionally, database applications have focused

More information

Principles of Parallel Algorithm Design: Concurrency and Mapping

Principles of Parallel Algorithm Design: Concurrency and Mapping Principles of Parallel Algorithm Design: Concurrency and Mapping John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 3 28 August 2018 Last Thursday Introduction

More information

Quantitative study of data caches on a multistreamed architecture. Abstract

Quantitative study of data caches on a multistreamed architecture. Abstract Quantitative study of data caches on a multistreamed architecture Mario Nemirovsky University of California, Santa Barbara mario@ece.ucsb.edu Abstract Wayne Yamamoto Sun Microsystems, Inc. wayne.yamamoto@sun.com

More information

Reduction of Periodic Broadcast Resource Requirements with Proxy Caching

Reduction of Periodic Broadcast Resource Requirements with Proxy Caching Reduction of Periodic Broadcast Resource Requirements with Proxy Caching Ewa Kusmierek and David H.C. Du Digital Technology Center and Department of Computer Science and Engineering University of Minnesota

More information

Optimization solutions for the segmented sum algorithmic function

Optimization solutions for the segmented sum algorithmic function Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code

More information

Performance Evaluations for Parallel Image Filter on Multi - Core Computer using Java Threads

Performance Evaluations for Parallel Image Filter on Multi - Core Computer using Java Threads Performance Evaluations for Parallel Image Filter on Multi - Core Computer using Java s Devrim Akgün Computer Engineering of Technology Faculty, Duzce University, Duzce,Turkey ABSTRACT Developing multi

More information

DATA PREFETCHING AND DATA FORWARDING IN SHARED MEMORY MULTIPROCESSORS (*)

DATA PREFETCHING AND DATA FORWARDING IN SHARED MEMORY MULTIPROCESSORS (*) DATA PREFETCHING AND DATA FORWARDING IN SHARED MEMORY MULTIPROCESSORS (*) David K. Poulsen and Pen-Chung Yew Center for Supercomputing Research and Development University of Illinois at Urbana-Champaign

More information

A Data Centered Approach for Cache Partitioning in Embedded Real- Time Database System

A Data Centered Approach for Cache Partitioning in Embedded Real- Time Database System A Data Centered Approach for Cache Partitioning in Embedded Real- Time Database System HU WEI CHEN TIANZHOU SHI QINGSONG JIANG NING College of Computer Science Zhejiang University College of Computer Science

More information

Low-Power FIR Digital Filters Using Residue Arithmetic

Low-Power FIR Digital Filters Using Residue Arithmetic Low-Power FIR Digital Filters Using Residue Arithmetic William L. Freking and Keshab K. Parhi Department of Electrical and Computer Engineering University of Minnesota 200 Union St. S.E. Minneapolis, MN

More information

Implementation and Evaluation of Prefetching in the Intel Paragon Parallel File System

Implementation and Evaluation of Prefetching in the Intel Paragon Parallel File System Implementation and Evaluation of Prefetching in the Intel Paragon Parallel File System Meenakshi Arunachalam Alok Choudhary Brad Rullman y ECE and CIS Link Hall Syracuse University Syracuse, NY 344 E-mail:

More information

Heckaton. SQL Server's Memory Optimized OLTP Engine

Heckaton. SQL Server's Memory Optimized OLTP Engine Heckaton SQL Server's Memory Optimized OLTP Engine Agenda Introduction to Hekaton Design Consideration High Level Architecture Storage and Indexing Query Processing Transaction Management Transaction Durability

More information

Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs

Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs B. Barla Cambazoglu and Cevdet Aykanat Bilkent University, Department of Computer Engineering, 06800, Ankara, Turkey {berkant,aykanat}@cs.bilkent.edu.tr

More information

Test Set Compaction Algorithms for Combinational Circuits

Test Set Compaction Algorithms for Combinational Circuits Proceedings of the International Conference on Computer-Aided Design, November 1998 Set Compaction Algorithms for Combinational Circuits Ilker Hamzaoglu and Janak H. Patel Center for Reliable & High-Performance

More information

Summary: Open Questions:

Summary: Open Questions: Summary: The paper proposes an new parallelization technique, which provides dynamic runtime parallelization of loops from binary single-thread programs with minimal architectural change. The realization

More information

An In-place Algorithm for Irregular All-to-All Communication with Limited Memory

An In-place Algorithm for Irregular All-to-All Communication with Limited Memory An In-place Algorithm for Irregular All-to-All Communication with Limited Memory Michael Hofmann and Gudula Rünger Department of Computer Science Chemnitz University of Technology, Germany {mhofma,ruenger}@cs.tu-chemnitz.de

More information

A Linear-Time Heuristic for Improving Network Partitions

A Linear-Time Heuristic for Improving Network Partitions A Linear-Time Heuristic for Improving Network Partitions ECE 556 Project Report Josh Brauer Introduction The Fiduccia-Matteyses min-cut heuristic provides an efficient solution to the problem of separating

More information

Profiling Dependence Vectors for Loop Parallelization

Profiling Dependence Vectors for Loop Parallelization Profiling Dependence Vectors for Loop Parallelization Shaw-Yen Tseng Chung-Ta King Chuan-Yi Tang Department of Computer Science National Tsing Hua University Hsinchu, Taiwan, R.O.C. fdr788301,king,cytangg@cs.nthu.edu.tw

More information

THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS

THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS Computer Science 14 (4) 2013 http://dx.doi.org/10.7494/csci.2013.14.4.679 Dominik Żurek Marcin Pietroń Maciej Wielgosz Kazimierz Wiatr THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT

More information

A Compiler-Directed Cache Coherence Scheme Using Data Prefetching

A Compiler-Directed Cache Coherence Scheme Using Data Prefetching A Compiler-Directed Cache Coherence Scheme Using Data Prefetching Hock-Beng Lim Center for Supercomputing R & D University of Illinois Urbana, IL 61801 hblim@csrd.uiuc.edu Pen-Chung Yew Dept. of Computer

More information

Latency Hiding on COMA Multiprocessors

Latency Hiding on COMA Multiprocessors Latency Hiding on COMA Multiprocessors Tarek S. Abdelrahman Department of Electrical and Computer Engineering The University of Toronto Toronto, Ontario, Canada M5S 1A4 Abstract Cache Only Memory Access

More information

Dynamic Broadcast Scheduling in DDBMS

Dynamic Broadcast Scheduling in DDBMS Dynamic Broadcast Scheduling in DDBMS Babu Santhalingam #1, C.Gunasekar #2, K.Jayakumar #3 #1 Asst. Professor, Computer Science and Applications Department, SCSVMV University, Kanchipuram, India, #2 Research

More information

Chapter 12: Indexing and Hashing. Basic Concepts

Chapter 12: Indexing and Hashing. Basic Concepts Chapter 12: Indexing and Hashing! Basic Concepts! Ordered Indices! B+-Tree Index Files! B-Tree Index Files! Static Hashing! Dynamic Hashing! Comparison of Ordered Indexing and Hashing! Index Definition

More information

A Data Centered Approach for Cache Partitioning in Embedded Real- Time Database System

A Data Centered Approach for Cache Partitioning in Embedded Real- Time Database System A Data Centered Approach for Cache Partitioning in Embedded Real- Time Database System HU WEI, CHEN TIANZHOU, SHI QINGSONG, JIANG NING College of Computer Science Zhejiang University College of Computer

More information

Compiling for GPUs. Adarsh Yoga Madhav Ramesh

Compiling for GPUs. Adarsh Yoga Madhav Ramesh Compiling for GPUs Adarsh Yoga Madhav Ramesh Agenda Introduction to GPUs Compute Unified Device Architecture (CUDA) Control Structure Optimization Technique for GPGPU Compiler Framework for Automatic Translation

More information

Predicated Software Pipelining Technique for Loops with Conditions

Predicated Software Pipelining Technique for Loops with Conditions Predicated Software Pipelining Technique for Loops with Conditions Dragan Milicev and Zoran Jovanovic University of Belgrade E-mail: emiliced@ubbg.etf.bg.ac.yu Abstract An effort to formalize the process

More information

Lecture 9 Basic Parallelization

Lecture 9 Basic Parallelization Lecture 9 Basic Parallelization I. Introduction II. Data Dependence Analysis III. Loop Nests + Locality IV. Interprocedural Parallelization Chapter 11.1-11.1.4 CS243: Parallelization 1 Machine Learning

More information

Lecture 9 Basic Parallelization

Lecture 9 Basic Parallelization Lecture 9 Basic Parallelization I. Introduction II. Data Dependence Analysis III. Loop Nests + Locality IV. Interprocedural Parallelization Chapter 11.1-11.1.4 CS243: Parallelization 1 Machine Learning

More information

Multiprocessor scheduling

Multiprocessor scheduling Chapter 10 Multiprocessor scheduling When a computer system contains multiple processors, a few new issues arise. Multiprocessor systems can be categorized into the following: Loosely coupled or distributed.

More information

3/24/2014 BIT 325 PARALLEL PROCESSING ASSESSMENT. Lecture Notes:

3/24/2014 BIT 325 PARALLEL PROCESSING ASSESSMENT. Lecture Notes: BIT 325 PARALLEL PROCESSING ASSESSMENT CA 40% TESTS 30% PRESENTATIONS 10% EXAM 60% CLASS TIME TABLE SYLLUBUS & RECOMMENDED BOOKS Parallel processing Overview Clarification of parallel machines Some General

More information

Techniques for Efficient Processing in Runahead Execution Engines

Techniques for Efficient Processing in Runahead Execution Engines Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt Depment of Electrical and Computer Engineering University of Texas at Austin {onur,hyesoon,patt}@ece.utexas.edu

More information

Massively Parallel Computation for Three-Dimensional Monte Carlo Semiconductor Device Simulation

Massively Parallel Computation for Three-Dimensional Monte Carlo Semiconductor Device Simulation L SIMULATION OF SEMICONDUCTOR DEVICES AND PROCESSES Vol. 4 Edited by W. Fichtner, D. Aemmer - Zurich (Switzerland) September 12-14,1991 - Hartung-Gorre Massively Parallel Computation for Three-Dimensional

More information

Reading Assignment. Lazy Evaluation

Reading Assignment. Lazy Evaluation Reading Assignment Lazy Evaluation MULTILISP: a language for concurrent symbolic computation, by Robert H. Halstead (linked from class web page Lazy evaluation is sometimes called call by need. We do an

More information

Parallelizing While Loops for Multiprocessor Systems

Parallelizing While Loops for Multiprocessor Systems CSRD Technical Report 1349 Parallelizing While Loops for Multiprocessor Systems Lawrence Rauchwerger and David Padua Center for Supercomputing Research and Development University of Illinois at Urbana-Champaign

More information

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the

More information

Chapter 12: Indexing and Hashing

Chapter 12: Indexing and Hashing Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B+-Tree Index Files B-Tree Index Files Static Hashing Dynamic Hashing Comparison of Ordered Indexing and Hashing Index Definition in SQL

More information

SAS Meets Big Iron: High Performance Computing in SAS Analytic Procedures

SAS Meets Big Iron: High Performance Computing in SAS Analytic Procedures SAS Meets Big Iron: High Performance Computing in SAS Analytic Procedures Robert A. Cohen SAS Institute Inc. Cary, North Carolina, USA Abstract Version 9targets the heavy-duty analytic procedures in SAS

More information

Chapter 11: Indexing and Hashing

Chapter 11: Indexing and Hashing Chapter 11: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files B-Tree Index Files Static Hashing Dynamic Hashing Comparison of Ordered Indexing and Hashing Index Definition in SQL

More information

Speculative Parallelization Technology s only constant is CHANGE. Devarshi Ghoshal Sreesudhan

Speculative Parallelization Technology s only constant is CHANGE. Devarshi Ghoshal Sreesudhan Speculative Parallelization Technology s only constant is CHANGE Devarshi Ghoshal Sreesudhan Agenda Moore s law What is speculation? What is parallelization? Amdahl s law Communication between parallely

More information

Increasing Parallelism of Loops with the Loop Distribution Technique

Increasing Parallelism of Loops with the Loop Distribution Technique Increasing Parallelism of Loops with the Loop Distribution Technique Ku-Nien Chang and Chang-Biau Yang Department of pplied Mathematics National Sun Yat-sen University Kaohsiung, Taiwan 804, ROC cbyang@math.nsysu.edu.tw

More information

Parallel Algorithms for the Third Extension of the Sieve of Eratosthenes. Todd A. Whittaker Ohio State University

Parallel Algorithms for the Third Extension of the Sieve of Eratosthenes. Todd A. Whittaker Ohio State University Parallel Algorithms for the Third Extension of the Sieve of Eratosthenes Todd A. Whittaker Ohio State University whittake@cis.ohio-state.edu Kathy J. Liszka The University of Akron liszka@computer.org

More information

An Analysis of the Amount of Global Level Redundant Computation in the SPEC 95 and SPEC 2000 Benchmarks

An Analysis of the Amount of Global Level Redundant Computation in the SPEC 95 and SPEC 2000 Benchmarks An Analysis of the Amount of Global Level Redundant Computation in the SPEC 95 and SPEC 2000 s Joshua J. Yi and David J. Lilja Department of Electrical and Computer Engineering Minnesota Supercomputing

More information

Speed-up of Parallel Processing of Divisible Loads on k-dimensional Meshes and Tori

Speed-up of Parallel Processing of Divisible Loads on k-dimensional Meshes and Tori The Computer Journal, 46(6, c British Computer Society 2003; all rights reserved Speed-up of Parallel Processing of Divisible Loads on k-dimensional Meshes Tori KEQIN LI Department of Computer Science,

More information

SPM Management Using Markov Chain Based Data Access Prediction*

SPM Management Using Markov Chain Based Data Access Prediction* SPM Management Using Markov Chain Based Data Access Prediction* Taylan Yemliha Syracuse University, Syracuse, NY Shekhar Srikantaiah, Mahmut Kandemir Pennsylvania State University, University Park, PA

More information

Postprint. This is the accepted version of a paper presented at MCC13, November 25 26, Halmstad, Sweden.

Postprint.   This is the accepted version of a paper presented at MCC13, November 25 26, Halmstad, Sweden. http://www.diva-portal.org Postprint This is the accepted version of a paper presented at MCC13, November 25 26, Halmstad, Sweden. Citation for the original published paper: Ceballos, G., Black-Schaffer,

More information

An Efficient Technique of Instruction Scheduling on a Superscalar-Based Multiprocessor

An Efficient Technique of Instruction Scheduling on a Superscalar-Based Multiprocessor An Efficient Technique of Instruction Scheduling on a Superscalar-Based Multiprocessor Rong-Yuh Hwang Department of Electronic Engineering, National Taipei Institute of Technology, Taipei, Taiwan, R. O.

More information