Loop Scheduling and Partitions for Hiding Memory Latencies


Fei Chen, Edwin Hsing-Mean Sha
Dept. of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN

Abstract

Partition Scheduling with Prefetching (PSP) is a memory latency hiding technique that combines loop pipelining with data prefetching. In PSP, the iteration space is first divided into regular partitions. Then two parts of the schedule, the ALU part and the memory part, are produced and balanced to yield an overall schedule with high throughput. The two parts are executed simultaneously, so the remote memory latency is overlapped with computation. We study the optimal partition shape and size so that a well-balanced overall schedule can be obtained. Experiments on DSP benchmarks show that the proposed methodology consistently produces optimal or near-optimal solutions.

1 Introduction

Because CPU speed has increased dramatically compared with memory speed, slow memory hinders overall system performance. A well-planned data prefetching scheme can reduce the memory miss penalty by overlapping processor computations with memory access operations, achieving high-throughput computation. Multi-dimensional (MD) problems are of particular interest. These problems, which include a large number of DSP applications, are characterized by nested loops with uniform data dependencies. Loop pipelining techniques are widely used to expose instruction-level parallelism so that a good schedule with high throughput can be obtained.

In this paper, we develop a methodology called the Partition Scheduling with Prefetching (PSP) algorithm, which combines loop pipelining with data prefetching. The technique can be used in computation-intensive applications (especially multi-dimensional DSP applications) on systems with a two-level memory hierarchy. The two levels are abstracted as the local memory and the remote memory.
We assume that accessing the remote memory takes longer than accessing the local memory. We also assume a processor contains multiple ALUs and multiple memory units. The ALUs perform the computations. The memory units are special hardware we introduce to prefetch data from the remote memory into the local memory.

(This work was partially supported by NSF MIP grants.)

The partition (tiling) technique is incorporated into the PSP algorithm. We partition the whole iteration space and execute one partition at a time. The benefit is that data locality improves within a partition, which reduces the number of prefetch operations. We study the legal partition shapes and provide formulas for determining an optimized partition size that guarantees a balanced overall schedule. Furthermore, we estimate the local memory size required to execute a partition, which gives designers a good indication of the local memory requirement.

Traditional prefetching schemes [2, 5, 6] can be hardware or software based. They either make only dynamic prefetching decisions or do not give complete static schedules, and none of them considers partitioning the iteration space. Several multi-dimensional loop pipelining techniques have been proposed. For example, Passos and Sha proved that in the multi-dimensional case (nested loops), full parallelism for MDFGs can always be achieved by using multi-dimensional retiming [4]. However, none of these research efforts includes prefetching in the loop scheduling algorithm.

Experiments are conducted on many DSP benchmarks, and the results are compared with other scheduling algorithms, such as list scheduling and the PBS algorithm [1]. The experiments show that the average schedule length obtained by PSP is 26.7% of that obtained by list scheduling and roughly 60% of that obtained by PBS.
Since partitioning is not used in PBS, our experimental results also show that partitioning the iteration space is very important for optimizing the overall schedule.

2 Algorithm framework

Modeling the ALU computation

A nested loop can be represented by a multi-dimensional data flow graph (MDFG) [4]. An MDFG $G = (V, E, d, t)$ is a node- and edge-weighted directed graph, where $V$ is the set of computation nodes, $E \subseteq V \times V$ is the set of dependence edges, $d$ is a function from $E$ to $Z^n$ representing the multi-dimensional inter-iteration dependency (delay) between two nodes, $n$ is the number of dimensions, and $t$ is a function from $V$ to the positive integers representing the computation time of each node.

    for (y = 0; y <= m; y++)
      for (x = 0; x <= n; x++) {
        v1[x][y] = v3[x-1][y-1] + 5;
        v2[x][y] = v3[x+1][y-1] * .2;
        v3[x][y] = v1[x][y] + v2[x][y] * 2;
      }

Figure 1: (a) A code of nested loops; (b) the corresponding MDFG; (c) the retimed MDFG.

Figure 2: (a) An illegal partition of the iteration space; (b) a legal partition.

The execution of all nodes in $V$ one time represents an iteration. An iteration is identified by a vector $\hat{j}$, which is equivalent to a multi-dimensional index starting from $(0, \ldots, 0)$. Inter-iteration dependencies are represented by vector-weighted edges. For any iteration $\hat{j}$, an edge $e$ from $u$ to $v$ ($u \xrightarrow{e} v$) with delay vector $d(e)$ means that the computation of node $v$ at iteration $\hat{j}$ depends on the execution of node $u$ at iteration $\hat{j} - d(e)$. An edge with delay vector $(0, \ldots, 0)$ represents a data dependency within the same iteration. Figure 1(a) is an example code of a nested loop, with the corresponding MDFG in Figure 1(b).

A legal MDFG $G = (V, E, d, t)$ is realizable, or implementable, if there exists a schedule vector $s$ for the MDFG such that $s \cdot d(e) \ge 0$ for any $e \in E$ [4]. The schedule vector $s$ is the normal vector of a set of parallel equitemporal hyperplanes; the iterations in the same hyperplane are executed in sequence. For example, in Figure 1(b), $s = (0, 1)$. The multi-dimensional retiming technique [4] is used in our algorithm. In our study, the legal retiming vector $r$ is chosen as the base vector orthogonal to $s$. Using Figure 1(b) as an example, since $s = (0, 1)$, the base retiming vector $r$ can be $(1, 0)$.
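As a rough illustration of the realizability condition, the following Python sketch encodes an MDFG as a list of delay-weighted edges and tests whether a candidate schedule vector $s$ satisfies $s \cdot d(e) \ge 0$ on every edge. The edge list is an assumption: it is one plausible reading of the loop in Figure 1(a), with delays (1, 1) and (-1, 1) on the edges leaving v3. The names `EDGES`, `is_legal_schedule_vector`, and `producer_iteration` are hypothetical.

```python
from typing import List, Tuple

Vec = Tuple[int, int]
Edge = Tuple[str, str, Vec]  # (source node, destination node, delay vector)

# Assumed encoding of the loop in Figure 1(a): v1 and v2 read v3 from earlier
# iterations, and v3 reads v1 and v2 within the same iteration (zero delay).
EDGES: List[Edge] = [
    ("v3", "v1", (1, 1)),
    ("v3", "v2", (-1, 1)),
    ("v1", "v3", (0, 0)),
    ("v2", "v3", (0, 0)),
]


def dot(a: Vec, b: Vec) -> int:
    """Inner product of two 2-D integer vectors."""
    return a[0] * b[0] + a[1] * b[1]


def is_legal_schedule_vector(edges: List[Edge], s: Vec) -> bool:
    """True when s . d(e) >= 0 holds for every edge e."""
    return all(dot(s, delay) >= 0 for _, _, delay in edges)


def producer_iteration(j: Vec, delay: Vec) -> Vec:
    """Iteration j - d(e) whose result is consumed at iteration j."""
    return (j[0] - delay[0], j[1] - delay[1])
```

Under these assumed delays, `is_legal_schedule_vector(EDGES, (0, 1))` holds, while the candidate `(1, 0)` is rejected because of the edge with delay (-1, 1).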
The graph after MD retiming is shown in Figure 1(c).

Partitioning the iteration space

Instead of executing the whole iteration space in order by rows or columns, we first partition it and then execute the partitions one by one. The two boundaries of a partition are called the partition vectors, denoted by $P_x$ and $P_y$. Due to the dependencies in the MDFG, partition vectors cannot be chosen arbitrarily. For example, consider the iteration space in Figure 2, where dots represent iterations and vectors represent the inter-iteration dependencies. If we partition the iteration space into rectangles, as shown in Figure 2(a), the partition is illegal because there are forward dependencies from one partition to its neighbor (the thin vectors) and backward dependencies in the opposite direction (the bold vectors). Due to these two-way dependencies between partitions, neither partition can be executed first. This partition is therefore not implementable and is illegal. In contrast, consider the alternative partition shown in Figure 2(b). Since there are no two-way dependencies, a feasible partition execution sequence exists: for example, execute Partition 1 first, then Partition 2, and so on. Therefore, it is a legal partition.

Architecture model

We assume a processor contains multiple ALUs and multiple special hardware units called memory units. Associated with the processor is a small local memory, which is fast to access. There is also a large remote memory, which is slow to access. The goal of our technique is to prefetch the operands of all computations into the local memory before the actual computations take place. These prefetching operations are performed by the memory units. Two types of prefetching instructions, prefetch and keep, are supported by the memory units. The prefetch instruction fetches data from the remote memory into the local memory; the keep instruction keeps data in the local memory for the execution of the next partition.
Both are issued to ensure that data about to be referenced is already in the local memory.

PSP algorithm framework

The schedule generated by PSP consists of two parts: the ALU part and the memory part. For the ALU part, we first use the multi-dimensional rotation scheduling algorithm [3] to create the ALU schedule for one iteration. We then duplicate this one-iteration ALU schedule and append the copies consecutively to form the $n$-iteration ALU schedule, where $n$ is the number of iterations in the partition. The memory part of the schedule is executed at the same time as the ALU part. It gives the global schedule of the memory operations that prefetch all the operands needed by the next partition into the local memory.

Figure 3: The overall schedule of a partition, assuming four ALU functional units and five memory units.

We call the partition being executed the current partition and the one to be executed next the next partition. All partitions that have not been executed, other than the next one, are called other partitions (see Figure 4(c)). In the memory part scheduling, if a non-zero delay edge passes from the current partition into other partitions, a prefetch operation is needed. Each directed edge from the current partition to the next partition corresponds to a keep operation. The framework of our algorithm is given in Algorithm PSP.

Figure 3 gives an example of an overall schedule, with both the ALU part and the memory part. There are four iterations in one partition. In the ALU part, each iteration takes 4 control steps (CS) to finish, and hence all four iterations take $len(ALU) = 16$ control steps. In the memory part, all prefetch operations are scheduled from the top, followed by the keep operations. The length of the memory part of the schedule is $len(mem) = 17$. Since the two parts are executed simultaneously, the overall schedule length is the maximum of the two, $len(overall) = \max(len(ALU), len(mem)) = 17$. Dividing the overall schedule length by the number of iterations in the partition gives the average schedule length, $ave\_len(overall) = 17/4 = 4.25$.

3 PSP scheduling

We use the multi-dimensional rotation scheduling algorithm to schedule the ALU part for each iteration. Multi-dimensional rotation scheduling is a loop pipelining technique that implicitly uses multi-dimensional retiming as a heuristic for scheduling cyclic graphs. Rotation scheduling is described in detail in [3].
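The way the ALU part is assembled from one-iteration copies, and how the overall and average lengths follow from the two part lengths, can be sketched in Python. This is only an illustration under assumptions: the node names are placeholders, `replicate_alu_schedule` and `overall_lengths` are hypothetical helpers, and the lengths 16 and 17 in the usage note are the values of the Figure 3 example.

```python
from typing import List, Tuple

Step = List[str]  # node names issued in one control step, one per ALU


def replicate_alu_schedule(one_iter: List[Step], n_iter: int) -> List[Step]:
    """Append n_iter copies of a one-iteration schedule, tagging nodes as n(i)."""
    part: List[Step] = []
    for i in range(1, n_iter + 1):
        for step in one_iter:
            part.append([f"{node}({i})" for node in step])
    return part


def overall_lengths(len_alu: int, len_mem: int, n_iter: int) -> Tuple[int, float]:
    """Overall length is the longer of the two parts; also return the average."""
    overall = max(len_alu, len_mem)
    return overall, overall / n_iter


# Placeholder one-iteration schedule: two control steps on two ALUs.
ONE_ITER: List[Step] = [["a", "b"], ["c", "d"]]
ALU_PART = replicate_alu_schedule(ONE_ITER, 4)
```

Here `ALU_PART` has eight control steps, and `overall_lengths(16, 17, 4)` gives an overall length of 17 and an average of 4.25 control steps per iteration, matching the Figure 3 example.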
Given an initial schedule, the rotation technique repeatedly transforms the ALU part of the schedule into a more compact one under the resource constraints.

Algorithm PSP (Partition Scheduling with Prefetching)
Input: Initial MDFG; ALU and memory constraints; execution times for ALU and memory operations.
Output: An optimal partition and the optimized ALU and memory schedules for executing the partition.
    /* use list scheduling to obtain the ALU schedule */
    S <- initial ALU part schedule for one iteration
    repeat
        /* reduce the length of the one-iteration ALU part of the schedule by MD rotation scheduling */
        S <- rotate the current schedule S
        G_r <- retimed MDFG
        /* decide the optimal partition shape and size */
        Obtain the legal partition directions P_x, P_y according to G_r
        Obtain the partition size so that the balancing property (Theorem 3) is satisfied
        /* produce the overall schedule */
        Number the iterations
        Entire ALU part scheduling
        Memory part scheduling
        /* evaluation */
        Calculate the average length of the overall schedule
        Calculate the local memory requirement
    until the average schedule length cannot be reduced

Figure 4: (a) The original MDFG of the Wave Digital Filter; (b) the retimed MDFG after rotation; (c) solid edges represent prefetch operations, dashed edges represent keep operations, and dotted edges are data dependencies inside the partition, for which no memory operation is needed.

Consider an example of MD rotation scheduling performed on the Wave Digital Filter shown in Table 1, with the corresponding MDFG in Figure 4(a). The notation $n(i)$ in the table denotes computation node $n$ in the original $i$-th iteration of the partition. According to the data dependencies in the original MDFG shown in Figure 4(a), we have an initial schedule of length three, shown in the left part of Table 1.
During the rotation, the computation node in the first control step (CS) is rotated, and the corresponding node in the MDFG is retimed by the base retiming vector $r = (1, 0)$. The schedule length is reduced to 2 after the rotation. In the PSP scheduling algorithm, the ALU part then applies the same schedule pattern to each iteration in the partition. Iterations are executed one after another in the ALU part of the schedule.

Table 1: The ALU part of the schedule (the initial schedule and the schedule after rotation, each on ALU1 and ALU2).

Scheduling of the memory part consists of several steps. First, given the retimed MDFG produced by the MD rotation, we decide the directions of the legal partition vectors. Second, the iterations in the partition are numbered so that they can be scheduled in that order. Third, and most important, we calculate the optimal partition size to ensure a balanced schedule. Fourth and last, we create the memory part of the schedule. These steps are explained in detail below.

Figure 5: (a) The CW and CCW regions relative to vector $p$; (b) the extreme CW and CCW vectors of vectors $\delta_1, \delta_2, \ldots, \delta_k$ and the partition vectors $P_x$ and $P_y$.

Figure 6: (a) Iterations are executed from left to right in the $P_x$ direction and then proceed to the next hyperplane along the direction perpendicular to $P_x$; (b) iteration order in the partition.

Among all the delay vectors in an MDFG, two extreme vectors, clockwise (CW) and counterclockwise (CCW), are the most important for deciding the directions of the legal partition vectors. They are given by the following definition.

Definition 1 The extreme (outermost) clockwise vector CW of a vector set $D = \{\delta_1, \delta_2, \ldots, \delta_k\}$ satisfies two conditions: (1) $CW \in D$; (2) all the vectors in $D - \{CW\}$ are in the counterclockwise region of CW. The definition of the CCW vector is similar.

Figure 5(a) illustrates the clockwise and counterclockwise regions relative to a vector $p$. The magnitude of the cross product of two vectors $p_1$ and $p_2$, denoted by $p_1 \times p_2$, is used to determine the relative position of $p_1$ and $p_2$. If $p_1 = (p_1.x, p_1.y)$ and $p_2 = (p_2.x, p_2.y)$, then $p_1 \times p_2 = p_1.x \, p_2.y - p_2.x \, p_1.y$. If $p_1 \times p_2$ is positive, then $p_1$ is clockwise from $p_2$ with respect to the origin $(0, 0)$; if the cross product is negative, then $p_1$ is counterclockwise from $p_2$. Legal partition vectors can only lie outside of CW and CCW or be aligned with them. For example, in Figure 5(b), we choose $P_y$ to be aligned with CCW, and $P_x$ to be aligned with the x-axis, which is outside of CW. This is a legal choice of partition vectors.
In the PSP algorithm, we assume the y elements of the delay vectors of the input MDFG are always non-negative, which is often the case in real applications with nested loops. Therefore, the vector $s = (0, 1)$ is always a legal scheduling vector. After choosing the base retiming vector $r$ as $(1, 0)$, the positive x-axis is always a legal direction for a partition vector. In our algorithm, the direction of the counterclockwise partition vector, $P_y$, is chosen to be aligned with the vector CCW, while the direction of the clockwise partition vector, $P_x$, is aligned with the positive x-axis. For convenience, we use $\bar{P}_x$ and $\bar{P}_y$ to denote the base partition vectors showing these two directions (the elements of a base partition vector have no common divisors). The actual partition vectors are then $P_x = f_x \bar{P}_x$ and $P_y = f_y \bar{P}_y$, where $f_x$ and $f_y$ are called partition factors and determine the size of the partition.

The next step is to number the iterations within the partition so that they can be scheduled in that order. The iterations are numbered from left to right in the $P_x$ direction, as illustrated in Figure 6(a), and then proceed to the next hyperplane along the direction perpendicular to $P_x$. Figure 6(b) shows an example of the iteration order: the black dots represent the iterations in the partition, and the numbers give the order. The numbering can be done by sorting the iteration indices: an iteration with a smaller y element, or with the same y element but a smaller x element, gets the smaller number. We discuss how to obtain the optimal partition size in Section 4.

Table 2: The overall schedule (ALU part on ALU1 and ALU2, memory part on MEM1 and MEM2) with respect to the MDFG of the Wave Digital Filter in Figure 4.

After obtaining the partition directions and size, we can start to schedule the memory part.
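The cross-product test and the iteration numbering rule lend themselves to a short Python sketch. It assumes two-dimensional delay vectors that all lie within a half plane, so that a single extreme vector exists on each side. The function names and the sample delay set are hypothetical and are not taken from the benchmarks.

```python
from typing import Iterable, List, Tuple

Vec = Tuple[int, int]


def cross(p1: Vec, p2: Vec) -> int:
    """p1 x p2; positive when p1 is clockwise from p2."""
    return p1[0] * p2[1] - p2[0] * p1[1]


def extreme_vectors(delays: Iterable[Vec]) -> Tuple[Vec, Vec]:
    """Return (CW, CCW) for the non-zero delay vectors."""
    ds = [d for d in delays if d != (0, 0)]
    # CW is clockwise from (or aligned with) every other vector in the set.
    cw = next(c for c in ds if all(cross(c, d) >= 0 for d in ds))
    # CCW is counterclockwise from (or aligned with) every other vector.
    ccw = next(c for c in ds if all(cross(c, d) <= 0 for d in ds))
    return cw, ccw


def number_iterations(iterations: Iterable[Vec]) -> List[Vec]:
    """Order iterations by smaller y first, then by smaller x."""
    return sorted(iterations, key=lambda it: (it[1], it[0]))


# Hypothetical delay set and a small skewed partition of four iterations.
DELAYS = [(1, 1), (-1, 1), (0, 1), (2, 0)]
CW, CCW = extreme_vectors(DELAYS)
ORDER = number_iterations([(0, 1), (1, 0), (-1, 1), (0, 0)])
```

For this sample set the sketch picks (2, 0) as CW and (-1, 1) as CCW, so a $P_y$ aligned with (-1, 1) and a $P_x$ along the positive x-axis would be a legal pair of directions in the sense described above.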
Prefetch operations are scheduled as early as possible, because they do not have any data dependencies. Keep operations have data dependencies on the ALU part. Therefore, a keep operation must be scheduled after the computation that produces the corresponding data instance has finished. For each keep, we define the earliest starting time (EST) as the control step at which the corresponding value has finished computing. Then, starting from the EST, we schedule the keep operation at the earliest available place in the memory part.

Figure 7: (a) The $f_y$ restriction; (b) the $f_x$ restriction; (c) the number of iterations (#iter) inside one partition, $\#iter = f_x f_y (\bar{P}_x \times \bar{P}_y) = f_x f_y \cdot base\_area$.

Table 2 is the overall schedule of the Wave Digital Filter shown in Figure 4. Here we assume two ALUs and two memory units. The ALU part of the schedule duplicates the schedule shown in Table 1 for the four iterations. In the memory part, the notation $Pn(i)(x, y)$ means prefetching the data instance that corresponds to the delay vector $(x, y)$ from node $n$ in the $i$-th iteration. For example, an entry beginning with $P2(3)$ prefetches data produced by node $2(3)$, that is, node 2 in the third iteration. Similarly, $Kn(i)(x, y)$ denotes a keep operation. We assume in Table 2 that each prefetch operation takes two time units, that is, $T_{pre} = 2$; the down arrows in the table represent the continuation of a prefetch operation. We assume each keep operation takes one time unit, i.e., $T_{keep} = 1$. In this example, the length of the overall schedule is $L = 8$. Since there are 4 iterations in the partition, the average length of the overall schedule, denoted by $L_{ave}$, is $L/4 = 2$, which is equal to the lower bound.

4 Partition size and memory size

In the previous section, we decided the partition directions, denoted by $\bar{P}_x$ and $\bar{P}_y$. Here we determine the two partition factors $f_x$ and $f_y$ so that a balanced schedule can be achieved. First, we require $f_y$ to be large enough that no delay can pass through the entire partition along the direction of $P_y$. For example, the partition vector $P_y$ in Figure 7(a) is not large enough, because the delay vector crosses both the bottom and the top boundaries of the partition.
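The memory-part scheduling rules given above (prefetches packed from the top, each keep placed at the earliest free slot at or after its EST) can be sketched as a small greedy scheduler in Python. This is an illustrative sketch under assumptions, not the authors' implementation: control steps are counted from 0, the ESTs are supplied by the caller, and the function name and the sample numbers are hypothetical.

```python
from typing import List, Tuple

Slot = Tuple[int, int]  # (memory unit index, start control step)


def schedule_memory_part(n_prefetch: int, keep_est: List[int], n_mem: int,
                         t_pre: int, t_keep: int) -> Tuple[List[Slot], List[Slot], int]:
    """Greedy memory-part schedule; returns (prefetch slots, keep slots, length)."""
    free_at = [0] * n_mem  # first idle control step of each memory unit
    prefetches: List[Slot] = []
    for _ in range(n_prefetch):
        # Prefetches have no data dependencies: take the unit that frees up first.
        unit = min(range(n_mem), key=lambda u: free_at[u])
        prefetches.append((unit, free_at[unit]))
        free_at[unit] += t_pre
    keeps: List[Slot] = []
    for est in sorted(keep_est):
        # A keep cannot start before its EST; pick the unit with the earliest start.
        unit = min(range(n_mem), key=lambda u: max(free_at[u], est))
        start = max(free_at[unit], est)
        keeps.append((unit, start))
        free_at[unit] = start + t_keep
    return prefetches, keeps, max(free_at)


# Hypothetical partition: 4 prefetches, 4 keeps, 2 memory units, T_pre=2, T_keep=1.
PRE, KEEP, LENGTH = schedule_memory_part(4, [2, 4, 6, 8], 2, 2, 1)
```

With these sample numbers the prefetches occupy both units for the first four control steps, the keeps follow at or after their ESTs, and the memory part ends after 9 control steps. Processing the keeps in EST order means a later keep can never fit into an idle gap left before an earlier one, so the greedy choice matches the "earliest available place" rule in this simplified setting.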
Denoting the set of all the non-zero delay vectors in the MDFG as $D$, the above restriction can be represented by the inequality

$$f_y \, \bar{P}_y.y \ge \delta.y, \quad \forall \delta = (\delta.x, \delta.y) \in D.$$

The partition vector $P_x$ is restricted so that no delay starting from the current partition can reach two partitions later. In other words, in Figure 7(b), delay edges starting from Partition I cannot reach Partition III. Therefore, we have

$$NM = \frac{|\delta| \sin(\alpha + \beta)}{\sin\alpha} \le f_x |\bar{P}_x|.$$

This gives us

$$f_x \ge \frac{|\delta| \sin(\alpha + \beta)}{|\bar{P}_x| \sin\alpha}, \quad \forall \delta = (\delta.x, \delta.y) \in D.$$

Since $f_x$ is an integer, this inequality is equivalent to

$$f_x \ge \left\lceil \frac{|\delta| \sin(\alpha + \beta)}{|\bar{P}_x| \sin\alpha} \right\rceil, \quad \forall \delta = (\delta.x, \delta.y) \in D.$$

Figure 8: (a), (b) Calculating the number of delay edges crossing the boundary of the current partition and entering the next partition, with $area(UVW) = l \cdot h = l \cdot (f_y \bar{P}_y.y - \delta.y)$; (c), (d) calculating the number of delay edges crossing the boundary of the current partition and entering other partitions, with $area(PQRS) = |P_x| \cdot \delta.y = f_x |\bar{P}_x| \cdot \delta.y$.

Below we derive the conditions for a balanced schedule. Lemma 1 shows how to calculate the length of the ALU part of the schedule, referring to Figure 7(c).

Lemma 1 The length of the ALU part of the schedule is $L_{ALU} = \#iter \cdot \ell_{ALU} = f_x f_y (\bar{P}_x \times \bar{P}_y) \, \ell_{ALU}$, where $\ell_{ALU}$ denotes the length of the one-iteration ALU part of the schedule and #iter is the number of iterations in the partition.

Then we estimate how many memory operations are needed by calculating the areas of the two shaded regions in Figure 8. Given a delay vector $\delta = (\delta.x, \delta.y)$, region UVW in the current partition, shown in Figure 8(b), is the region from which $\delta$ enters the next partition. Similarly, region PQRS is the region from which $\delta$ enters other partitions. We denote the areas of these two regions by $A_{goto\_next}(\delta)$ and $A_{goto\_others}(\delta)$, respectively.

Lemma 2 Given a delay vector $\delta = (\delta.x, \delta.y)$,

$$A_{goto\_next}(\delta) = \frac{|\delta| \sin(\alpha + \beta)}{\sin\alpha} \, (f_y \bar{P}_y.y - \delta.y) \quad \text{and} \quad A_{goto\_others}(\delta) = f_x |\bar{P}_x| \, \delta.y.$$

Note that the number of delay edges entering the next partition, i.e., the number of keep operations for $\delta$, is very close to the area of UVW. Summing these areas over every distinct $\delta$ gives the total number of keep operations,

$$\#keep = \sum_{\delta \in D} A_{goto\_next}(\delta) = \sum_{\delta \in D} \frac{|\delta| \sin(\alpha + \beta)}{\sin\alpha} \, (f_y \bar{P}_y.y - \delta.y).$$

Similarly, the total number of prefetch operations is

$$\#pre = \sum_{\delta \in D} A_{goto\_others}(\delta) = \sum_{\delta \in D} area(PQRS) = \sum_{\delta \in D} |P_x| \, \delta.y = \sum_{\delta \in D} f_x |\bar{P}_x| \, \delta.y.$$

Theorem 3 gives the conditions for what we call a balanced schedule. The idea is to schedule the prefetch operations from the top of the memory part of the schedule and the keep operations from the bottom. The left-hand side of Inequality (1) is the estimated length of the memory part of the schedule, and we allow it to be at most $T_{keep}$ control steps longer than the ALU part, as shown on the right-hand side. The $T_{keep}$ steps are set aside to make room for keep operations that correspond to computation nodes in the last control step of the ALU part. Corollary 4 concerns the average overall schedule length.

Figure 9: (a) One memory location is needed for delay $(1, 0)$; (b) $|P_x| \cdot \delta.y$ memory locations are needed for delay $(\delta.x, \delta.y)$ when $\delta.y > 0$.

Theorem 3 Assume that $N_{ALU} \le N_{mem}$, $T_{ALU} \ge T_{keep}$, and Inequality (1) is satisfied:

$$\frac{\#pre \cdot T_{pre}}{N_{mem}} + \frac{\#keep \cdot T_{keep}}{N_{mem}} \le \ell_{ALU} \cdot \#iter + T_{keep} \qquad (1)$$

Then the length of the memory part of the schedule is at most $T_{keep}$ control steps longer than that of the ALU part.

Corollary 4 If the partition satisfies the conditions of Theorem 3, the average length of the overall schedule is at most $T_{keep} / \#iter$ plus the average length of the ALU part of the schedule.

Experiments show that rotation scheduling in most cases generates an ALU part of the schedule that achieves the lower bound, i.e., $\ell_{ALU} = bound(ALU)$. Therefore, the overall schedule either reaches its lower bound or is very close to it; the difference is at most $T_{keep} / (P_x \times P_y)$.

Now we estimate the local memory size for executing the partition. We classify the memory usage into two categories: basic memory for the working set, and reserved memory for prefetch and keep operations. The former corresponds to all the internal delay edges in the partition. The delay edge $(1, 0)$ in Figure 9(a) indicates a data instance produced in one iteration and consumed in the next. Only one memory location is needed to hold this data, because the same location can be reused for later iterations. In general, we need $\delta.x$ memory locations for each $\delta = (\delta.x, 0)$. However, when $\delta = (0, 1)$, as shown in Figure 9(b), a whole row of intermediate values needs to be kept.
Thus a total of $|P_x|$ memory locations are needed. In general, for each $\delta = (\delta.x, \delta.y)$ where $\delta.y > 0$, $|P_x| \cdot \delta.y$ memory locations are needed. Summarizing the above, the size of the basic memory for the working set is

$$Size_{ws} = \sum_{\delta = (\delta.x, \delta.y)} \begin{cases} \delta.x & \text{when } \delta.y = 0 \\ |P_x| \cdot \delta.y & \text{when } \delta.y > 0 \end{cases}$$

Now let us consider the second category: reserved memory for prefetch and keep operations. These operations represent the data instances pre-loaded or pre-occupied in the local memory before we execute the partition. Each of them needs a reserved memory location. The total number of these locations is twice the total number of memory operations (one set for the data pre-loaded for the current partition, the other for the newly generated data for the next partition). Therefore, the size of this part of the memory is $Size_{reserve} = 2(\#pre + \#keep)$. Finally, the local memory needed to execute the partition is $Local\_size = Size_{ws} + Size_{reserve}$.

5 Experimental Results

In this section, the effectiveness of the PSP algorithm is evaluated by running a set of simulations on DSP benchmarks. Table 3 and Table 4 show our scheduling results. The first column lists the benchmark names. The second to fourth columns are the parameters of the input MDFG: the second column shows the number of nodes, and the third and fourth columns show the ALU and memory unit resource constraints. The partition generated by the algorithm is shown in the fifth to seventh columns. The final schedule is shown in the next three columns: Column L gives the length of the overall schedule, and Column $L_{ave}$ is the average length, $L / \#iter$. To compare our results with the lower bound and with the results of other algorithms, we calculate the lower bound of the schedule length, $N / N_{alu}$, and put it in Column LB. We also ran the same set of benchmarks using list scheduling and Prefetch Balanced rotation Scheduling (PBS).
The results are shown in Columns List and PBS, respectively, where the sub-column len is the schedule length and the sub-column ratio compares the PSP schedule length with that of list scheduling or PBS, i.e., $ratio = L_{ave} / len$. The benchmark abbreviations WDF, IIR, DPCM, 2D, and Floyd stand for Wave Digital Filter, Infinite Impulse Response filter, Differential Pulse-Code Modulation device, Two-Dimensional filter, and Floyd-Steinberg algorithm, respectively. In Table 3, we assume that each ALU operation takes 1 time unit, each keep operation also takes 1 time unit, and each prefetch takes 2 time units; in Table 4, we assume a longer prefetch time. In the PBS experiments in Table 4, the graphs are first unfolded by a factor of 2 × 2 before performing PBS scheduling.

Table 3: Experimental results on DSP filter benchmarks assuming $T_{prefetch} = 2$. Columns: Benchmark; Parameters ($N$, $N_{alu}$, $N_{mem}$); Partition ($P_x$, $P_y$, #iter); PSP Schedule ($L$, $L_{ave}$, LB); List (len, ratio); PBS (len, ratio). Rows: WDF(1), WDF(2), IIR, DPCM, 2D(1), 2D(2), two MDFG examples, and Floyd.

Table 4: Experimental results on DSP filter benchmarks assuming a longer $T_{prefetch}$, with the same columns and rows as Table 3; for PBS the graphs are unfolded by 2 × 2.

As we can see, list scheduling rarely achieves the optimal schedule length; its schedules are often dominated by a long memory part. In other words, the list schedules are not well balanced. Although PBS is better than list scheduling, it too becomes less effective at generating a balanced schedule, especially when $T_{prefetch}$ is large. Moreover, PBS needs to explicitly unfold by large factors in order to generate good schedules, which can add a lot of computation (for example, after unfolding by a factor of 2 × 2, the total number of nodes is 4 times that of the original). The PSP algorithm consistently produces optimal or near-optimal schedules, as shown by the bold figures in the tables. Even with long memory latency, when $T_{prefetch}$ is large, the algorithm still gives good overall schedules without any unfolding. Almost all of the resulting schedules are very close to optimal. In Table 3, the average ratios of the schedule length from the PSP algorithm to those from list scheduling and PBS are 63.2% and 84.9%, respectively; in Table 4, they are 26.7% and roughly 60%, respectively.
Moreover, since we do not unfold the graph, the computation time of the algorithm is very small. Almost all the experiments finish in less than two to three seconds. Comparing Tables 3 and 4, we also see that when the memory latency is increased, the PSP algorithm tends to create a larger partition to compensate for the longer latency. The larger the partition, the closer the average schedule length is to the lower bound, because the overhead of $T_{keep}$ control steps is amortized over more iterations.

References

[1] F. Chen, S. Tongsima, and E. H.-M. Sha. Loop scheduling optimization with data prefetching based on multi-dimensional retiming. In Proc. ISCA International Conference on Parallel and Distributed Computing Systems, 1998.
[2] F. Dahlgren and M. Dubois. Sequential hardware prefetching in shared-memory multiprocessors. IEEE Transactions on Parallel and Distributed Systems, Vol. 6, No. 7, Jul. 1995.
[3] N. L. Passos and Edwin H.-M. Sha. Scheduling of uniform multi-dimensional systems under resource constraints. To appear in IEEE Transactions on VLSI Systems.
[4] N. L. Passos and Edwin H.-M. Sha. Achieving full parallelism using multi-dimensional retiming. IEEE Transactions on Parallel and Distributed Systems, Vol. 7, No. 11, pages 1150-1163, Nov. 1996.
[5] J. Skeppstedt and M. Dubois. Hybrid compiler/hardware prefetching for multiprocessors using low-overhead cache miss traps. In Proceedings of the International Conference on Parallel Processing, 1997.
[6] M. K. Tcheun, H. Yoon, and S. R. Maeng. An adaptive sequential prefetching scheme in shared-memory multiprocessors. In Proceedings of the International Conference on Parallel Processing, pages 306-313, 1997.
