An Efficient Technique of Instruction Scheduling on a Superscalar-Based Multiprocessor

Rong-Yuh Hwang
Department of Electronic Engineering, National Taipei Institute of Technology, Taipei, Taiwan, R.O.C.
E-mail: ryh@en.tit.edu.tw
(This work is supported in part by the National Science Council under grant No. NSC 85-2213-E-027-004.)

Abstract

An instruction scheduling approach is proposed for performance enhancement on a superscalar-based multiprocessor. The traditional list scheduling approach is not suitable for this environment because it does not consider the effect of synchronization operations. According to the LBD loop theorem, system performance depends strongly on the position of the synchronization operations, so scheduling them is given the highest priority in this technique. The instruction scheduling approach enhances performance in two ways: 1) converting LBD into LFD, and 2) reducing the damage caused by LBD. Experimental results show that the enhancement is significant.

1. Introduction

Parallelism has been exploited at various processing levels; the lower the level, the finer the granularity of the software processes. In general, the execution of a program may exploit several levels, among which loop-level and instruction-level parallelism are two important issues in the study of parallelizing compilers and architecture. The DO loop is an important source of loop-level parallelism. However, statistical results [2, 21] show that most loops are Doacross loops; the Doall loop is, in fact, the ideal case. Because a Doacross loop carries one or more loop-carried dependences across iterations, synchronization operations are inserted into the original program so that the Doacross loop can be executed in parallel. Synchronization operation insertion consists of two actions: 1) generate a send statement immediately following the dependence source S in the program text: Send_Signal(S); and 2) generate a wait statement immediately before the dependence sink S': Wait_Signal(S, i - d), where d is the dependence distance and i - d identifies the iteration whose signal is awaited [6, 11]. An example of a DO loop is shown in Fig. 1(a). There are two dependences in this figure: array elements A[I-2] and A[I-1], the dependence sinks, in statements S1 and S2 depend on array element A[I], the dependence source, in statement S3. In order to execute this loop in parallel, two pairs of synchronization operations are inserted into the loop. The loop after synchronization operation insertion is shown in Fig. 1(b). Several synchronization schemes have been proposed to maintain the correct dependence order while executing a Doacross loop in parallel [1, 5, 10, 20]. On the other hand, several hardware and software techniques have been proposed to exploit instruction-level parallelism [7, 12, 13]. Butler et al. have shown that single-instruction-stream parallelism is greater than two [9]. Wall finds that the average parallelism at the instruction level is around five, rarely exceeding seven, in an ordinary program [3]. Superscalar processors are designed to exploit more instruction-level parallelism in user programs; only independent instructions can be executed in parallel without causing a wait state. In the past, loop-level and instruction-level parallelism were studied independently. However, some new problems must be resolved if both levels are exploited on a superscalar-based multiprocessor, where loop-level parallelism is exploited by executing one iteration on each superscalar processor.
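
To make the two insertion actions described above concrete, here is a minimal sketch (not from the paper; the list-of-labels representation and the helper name insert_sync are assumptions for illustration) that places a Wait_Signal before each dependence sink and a Send_Signal after each dependence source, reproducing the insertion shown later in Fig. 1(b):

```python
# Sketch: insert synchronization operations into a Doacross loop body.
# 'deps' lists loop-carried flow dependences as (source_stmt, sink_stmt, distance).

def insert_sync(statements, deps):
    """Return the loop body with Wait_Signal/Send_Signal operations inserted.

    statements: ordered list of statement labels, e.g. ["S1", "S2", "S3"].
    deps: list of (src, snk, distance) tuples; src/snk are statement labels.
    """
    body = []
    for stmt in statements:
        # Action 2: a wait immediately before every dependence sink.
        for src, snk, dist in deps:
            if snk == stmt:
                body.append(f"Wait_Signal({src}, I-{dist})")
        body.append(stmt)
        # Action 1: a send immediately after every dependence source.
        if any(src == stmt for src, _, _ in deps):
            body.append(f"Send_Signal({stmt})")
    return body

# The loop of Fig. 1: S3 is the source; S1 and S2 are sinks at distances 2 and 1.
deps = [("S3", "S1", 2), ("S3", "S2", 1)]
for line in insert_sync(["S1", "S2", "S3"], deps):
    print(line)
```
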
In this environment, if we perform instruction scheduling, some problems arise. After instruction scheduling, the instruction sequence is no longer the same as the original one. A dependence sink may therefore be scheduled before its corresponding Wait_Signal, which can cause stale data to be accessed [18]. Similarly, the effect is the same if a dependence source is moved behind its corresponding Send_Signal. On the other hand, according to the LBD loop theorem [17], the parallel execution time depends strongly on the number of instructions between a Send_Signal and its corresponding Wait_Signal. The major goal of traditional instruction scheduling is to find more instruction parallelism, so those scheduling approaches are not suitable for the superscalar-based environment. In this paper, we propose a new instruction scheduling technique that schedules instructions without accessing stale data and enhances system performance. The rest of this paper is organized as follows: the preliminary knowledge for this problem is described in Section 2; the new instruction scheduling technique is proposed in Section 3; experimental results are presented in Section 4; finally, a conclusion is drawn in Section 5.

(a) A DO loop:
  DO I = 1, N
    S1: B[I] = A[I-2] + E[I+1];
    S2: G[I-3] = A[I-1] * E[I+2];
    S3: A[I] = B[I] + C[I+3];
  ENDDO

(b) Synchronization operation insertion for (a):
  DOACROSS I = 1, N
    Wait_Signal(S3, I-2);
    S1: B[I] = A[I-2] + E[I+1];
    Wait_Signal(S3, I-1);
    S2: G[I-3] = A[I-1] * E[I+2];
    S3: A[I] = B[I] + C[I+3];
    Send_Signal(S3);
  END_DOACROSS

Fig. 1 An example of synchronization operation insertion

2. Preliminary knowledge

In this section, we describe the related problem and the notation used in this paper. First, some notation is defined [18]. Src: dependence source; Snk: dependence sink; Sig: the synchronization instruction "Send_Signal"; Wat: the synchronization instruction "Wait_Signal". Let Si and Sj be arbitrary statements.

Definition: Si bef Sj iff Si occurs textually before Sj.
Definition: Si δ Sj means that statement Sj is dependent on statement Si.
Definition: In a loop, a dependence Si δ Sj is said to be forward iff Si bef Sj. Any dependence that is not forward is a backward dependence.

Data dependences include flow dependence, anti-dependence, and output dependence [18]. Alternatively, data dependences can be separated into two types: forward and backward dependences. LFD (Lexically Forward Dependence) and LBD (Lexically Backward Dependence) denote the forward and backward dependences respectively. In order to maintain the correct dependence order between a Sig (Wat) and its corresponding Src (Snk), the synchronization conditions are as follows [18]: (1) a Sig cannot precede its corresponding Src; (2) a Wat cannot be behind its corresponding Snk. A synchronization marker method has been proposed to maintain these synchronization conditions [18]. For an LFD, the Send_Signal is issued before its corresponding Wait_Signal, so the parallel execution time equals the time of executing one iteration. For an LBD, however, the Send_Signal is issued after its corresponding Wait_Signal, so the parallel execution time equals (n/d) * (i - j) + l, where i, j, d, and l are the position of the Send_Signal, the position of the Wait_Signal, the dependence distance, and the number of instructions in an iteration respectively, and n is the number of iterations [15, 17, 19]. The parallel execution time of an LFD loop is always smaller than that of an LBD loop. Therefore, performance is enhanced if an LBD can be converted into an LFD. Performance is also enhanced if a Send_Signal and its corresponding Wait_Signal are placed as close together as possible. In the next section, this knowledge is used to propose a new instruction scheduling technique.
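
The timing model above can be checked with a small sketch (an illustration, not the paper's simulator; the helper names and the ceiling rounding of n/d are assumptions), using the path lengths that appear later in the example of Section 3.2:

```python
import math

def parallel_time_lfd(l):
    """LFD loop: the Send_Signal precedes its Wait_Signal, so the loop
    finishes in roughly the time of one iteration, l."""
    return l

def parallel_time_lbd(n, d, dist, l):
    """LBD loop: (n/d) * (i - j) + l, where dist = i - j is the number of
    instructions from the Wait_Signal to its corresponding Send_Signal."""
    return math.ceil(n / d) * dist + l

# With n = 100 iterations and l = 13 instructions per iteration:
# list scheduling leaves a 12-instruction path with dependence distance 1,
# the new scheduling leaves a 7-instruction path with dependence distance 2.
print(parallel_time_lbd(100, d=1, dist=12, l=13))   # 12*100 + 13 = 1213
print(parallel_time_lbd(100, d=2, dist=7, l=13))    # 50*7 + 13 = 363
```
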
3. New instruction scheduling technique

3.1. Construction of Sig, Wat, and Sigwat graphs

According to the discussion in the last section, the instruction sequence after scheduling is not the same as the original one. If a Send_Signal(S) is scheduled before its corresponding dependence source, stale data may be accessed; similarly, a Wait_Signal cannot be placed behind its corresponding dependence sink. To enforce the synchronization conditions, an extra dependence arc is inserted between each synchronization operation and its corresponding dependence event. The three-address code of Fig. 1(b) is shown in Fig. 2. Observing Fig. 2, the three-address instructions corresponding to array elements A[I], A[I-1], and A[I-2] are instructions 26, 16, and 5 respectively. To avoid accessing stale data, extra flow dependences must be created between instructions 1 and 5, 11 and 16, and 26 and 27. Fig. 3 is the data flow graph of Fig. 2, with the extra dependence arcs inserted. In this figure, an up-triangle represents a Send_Signal and a down-triangle represents a Wait_Signal. The number attached to each triangle indicates which pair of synchronization operations it belongs to. For example, array element A[I] is involved in two pairs of synchronization operations, with array elements A[I-1] and A[I-2]; therefore, all triangles in Fig. 3 carry the same number i.
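
A minimal sketch of this arc insertion, assuming a plain adjacency-set representation of the data flow graph (the representation and helper name are illustrative, not from the paper); the pairs correspond to instructions (1, 5), (11, 16), and (26, 27) of Fig. 2:

```python
# Sketch: add the extra synchronization arcs to a data-flow graph.
# The adjacency-set representation (node -> set of successors) is assumed.

def add_sync_arcs(dfg, wait_to_sink, src_to_send):
    """Insert extra flow-dependence arcs so that no dependence sink can be
    scheduled above its Wait_Signal and no Send_Signal above its source."""
    for wat, snk in wait_to_sink:
        dfg.setdefault(wat, set()).add(snk)   # Wat must precede Snk
    for src, sig in src_to_send:
        dfg.setdefault(src, set()).add(sig)   # Src must precede Sig
    return dfg

# Pairs read off Fig. 2: Wait_Signal 1 guards load 5, Wait_Signal 11 guards
# load 16, and store 26 must precede Send_Signal 27.
dfg = {}   # the ordinary data-flow arcs of Fig. 3 are omitted here
print(add_sync_arcs(dfg, wait_to_sink=[(1, 5), (11, 16)], src_to_send=[(26, 27)]))
```
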

Next, we give some definitions for a data flow graph with synchronization operations.

Definition: A Sig (Wat) graph is a subgraph of the data flow graph. All of its nodes are contiguous and it contains one or more Send_Signal (Wait_Signal) instructions.
Definition: A Sigwat graph is a subgraph of the data flow graph. All of its nodes are contiguous and it contains one or more Send_Signal and Wait_Signal instructions.

Thus a Sig (Wat) graph represents statements each of which contains at least one dependence source (sink) and its corresponding Send_Signal (Wait_Signal). A Sigwat graph is formed in several cases: 1) if a Src and a Snk are in the same statement, that statement forms a Sigwat graph; 2) two statements are in the same Sigwat graph if there is a loop-independent dependence between them (a dependence within the same iteration) or they use the same array index expression, and synchronization operations consisting of a Sig and a Wat lie in the two statements. For example, statements S3 and S1 in Fig. 1 have a loop-independent dependence through B[I]; A[I] in S3 is a Src, and A[I-2] in S1 is a Snk. Therefore, statements S3 and S1 form the Sigwat graph shown in Fig. 3. Similarly, statement S2 and Wait_Signal(S3, I-1) form the Wat graph shown in Fig. 3.

 1: Wait_Signal(S3, I-2);    15: t12 ← 4*t11;
 2: t1 ← 4*I;                16: t13 ← A[t12];
 3: t2 ← I-2;                17: t14 ← I+2;
 4: t3 ← 4*t2;               18: t15 ← 4*t14;
 5: t4 ← A[t3];              19: t16 ← E[t15];
 6: t5 ← I+1;                20: t17 ← t13*t16;
 7: t6 ← 4*t5;               21: G[t9] ← t17;
 8: t7 ← E[t6];              22: t18 ← B[t1];
 9: t8 ← t4+t7;              23: t19 ← I+3;
10: B[t1] ← t8;              24: t20 ← 4*t19;
11: Wait_Signal(S3, I-1);    25: t21 ← C[t20];
12: t9 ← I-3;                26: A[t1] ← t18+t21;
13: t10 ← 4*t9;              27: Send_Signal(S3);
14: t11 ← I-1;

Fig. 2 The three-address code of Fig. 1(b)

3.2. New instruction scheduling technique

According to the definitions above, the Wat corresponding to a Sig graph lies in a Sigwat or Wat graph, and the Sig corresponding to a Wat graph lies in a Sigwat or Sig graph. The Sig, Wat, and Sigwat graphs do not depend on each other, so every dependence relation whose synchronization operations are in Sig or Wat graphs can be converted into LFD by scheduling Sig (Wat) graphs before (after) all Sigwat graphs. A path that starts from a Wat node and ends at its corresponding Sig node in a Sigwat graph is called a synchronization path SP(Wat, Sig). An LBD loop cannot be converted into an LFD loop if such a synchronization path exists in a Sigwat graph; in that case, we need to reduce the damage of the LBD loop. The synchronization path is the shortest distance from a Wait_Signal to its corresponding Send_Signal, so the scheduling rule for a Sigwat graph is that the nodes on a synchronization path must be scheduled contiguously. Under this policy, the parallel execution time of the LBD is minimal. We now propose a new scheduling approach that benefits the parallel execution time. First, we find all synchronization paths SP(Wat_i, Sig_i) in the Sigwat graph and sort them by the value (n/d_i) * |SP(Wat_i, Sig_i)| in descending order, where d_i is the dependence distance of that dependence relation and n is the number of iterations in the loop. If some synchronization paths share nodes (i.e., the intersection of the node sets of two synchronization paths is not empty), they must be scheduled simultaneously; otherwise, the distance between a Sig and its corresponding Wat grows and the parallel execution time increases.

Fig. 3 Sigwat and Wat graph for Fig. 2 (the data flow graph of the 27 instructions with the extra synchronization arcs; the Sigwat graph contains nodes 1-10 and 22-27, and the Wat graph contains nodes 11-21)

Consider two synchronization paths SP(Wat_i, Sig_i) and SP(Wat_j, Sig_j). If SP(Wat_i, Sig_i) ∩ SP(Wat_j, Sig_j) ≠ ∅ and (n/d_i) * |SP(Wat_i, Sig_i)| > (n/d_j) * |SP(Wat_j, Sig_j)|, then the parallel execution time T_SPi is larger than T_SPj. In this case, SP(Wat_i, Sig_i) and SP(Wat_j, Sig_j) must be scheduled together following the scheduling rule, and SP(Wat_i, Sig_i) is scheduled first because its parallel execution time is larger than that of SP(Wat_j, Sig_j). After all the synchronization paths have been scheduled, the remaining nodes of the Sigwat graph are scheduled following the dependence order of the data flow graph. After the Sigwat graph has been scheduled, the Sig graph is scheduled.
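
A sketch of the ordering rule just described (illustrative only; the data structures and the simple overlap handling are assumptions): synchronization paths are taken in decreasing (n/d_i) * |SP_i| order and the nodes of each path are emitted contiguously.

```python
def order_sigwat_nodes(sync_paths, distances, n):
    """Order the synchronization-path nodes of a Sigwat graph.

    sync_paths: one list of node ids per path, from a Wait_Signal to its
                corresponding Send_Signal.
    distances:  dependence distance d_i of each path.
    n:          number of loop iterations.
    Paths are taken in decreasing (n / d_i) * |SP_i| order; paths that share
    nodes end up grouped, and each path's nodes stay contiguous.  The other
    Sigwat nodes follow in data-flow order (not shown).
    """
    weighted = sorted(zip(sync_paths, distances),
                      key=lambda p: (n / p[1]) * len(p[0]), reverse=True)
    ordered, placed = [], set()
    for path, _ in weighted:
        for node in path:                     # keep path nodes contiguous
            if node not in placed:
                ordered.append(node)
                placed.add(node)
    return ordered

# The Sigwat graph of Fig. 3 has one synchronization path, nodes 1, 5, 9, 10,
# 22, 26, 27, with dependence distance 2; these nodes are placed contiguously
# in the schedule of Fig. 4(b).
print(order_sigwat_nodes([[1, 5, 9, 10, 22, 26, 27]], [2], n=100))
```
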

For scheduling a Sig graph, we find the position of the corresponding Wait_Signal and then schedule each Send_Signal in the Sig graph immediately before its corresponding Wait_Signal; the other nodes in the Sig graph are scheduled by list scheduling. Similarly, to obtain an LFD in a Wat graph, the Wait_Signals in the Wat graph must be scheduled after their corresponding Send_Signals. Therefore, we find the position of the corresponding Send_Signal and schedule each Wait_Signal after it; the remaining nodes in the Wat graph are then scheduled. Finally, the nodes that are not in any Sig, Wat, or Sigwat graph are scheduled.

(a) List scheduling:          (b) Our scheduling result:
 (1, 2, 3, -)   Wat1           (6, -, -, -)
 (4, 6, 11, -)  Wat2           (3, 7, -, -)
 (5, 7, 12, -)                 (1, 4, 8, 23)  Wat1
 (8, 13, 14, -)                (5, 24, -, -)
 (9, 15, -, -)                 (9, 2, 25, -)
 (10, 17, -, -)                (10, -, -, -)
 (16, 18, 23, -)               (22, 17, -, -)
 (19, 24, -, -)                (26, -, -, -)
 (20, 22, -, -)                (27, 14, 18, -) Sig
 (21, -, -, -)                 (11, 15, 19, -) Wat2
 (25, -, -, -)                 (16, -, -, -)
 (26, -, -, -)                 (20, 12, -, -)
 (27, -, -, -)  Sig            (21, 13, -, -)

Fig. 4 Scheduling result for Fig. 3

The main advantage of this scheduling algorithm is that it never degrades system performance; on the contrary, it improves it significantly. We now illustrate the scheduling algorithm with an example. In Fig. 3, assume a 4-issue superscalar machine with the following function units: load/store, adder, shifter, multiplier, and divider. The result of list scheduling with 4-issue for Fig. 3 is listed in Fig. 4(a). In this figure, nodes 1, 2, 3, 6, 11, 12, 14, 17, and 23 are initially available, so nodes 1, 2, and 3 are arranged in one instruction; node 6 conflicts with node 3 because they use the same function unit. Following the resource constraints and the data flow graph, we obtain the scheduled code shown in Fig. 4(a). Observing Fig. 4(a), there are two LBDs. The longest distance, from Sig back to Wat2, spans 12 instructions; since the dependence distance for Wat2 is 1, the parallel execution time is 12*N + 13, where N is the number of iterations. The scheduling result of our approach is listed in Fig. 4(b). First, we find the synchronization path in the Sigwat graph; it contains nodes 1, 5, 9, 10, 22, 26, and 27, so we schedule these nodes in contiguous instructions and then schedule the remaining nodes of the Sigwat graph. To obtain an LFD, the Wait_Signal of the Wat graph, node 11, must be scheduled after its corresponding Send_Signal, node 27; the remaining nodes of the Wat graph are scheduled likewise. Observing Fig. 4(b), only one LBD remains, and the parallel execution time is (N/2)*7 + 13, far smaller than 12*N + 13.

4. Experimental results

4.1. Benchmarks and statistical model

In order to evaluate the performance of instruction scheduling, we use the Perfect benchmarks to calculate the total execution time of a loop [4, 8, 14]. The Perfect Benchmarks (PERFormance Evaluation for Cost-effective Transformations) is a suite of FORTRAN 77 programs. They represent applications in a number of areas of engineering and scientific computing; in many cases they are codes that are currently used by computational research and development groups. In this section, we use five Perfect benchmarks to gather statistics on the parallel execution time of LBD loops.
To calculate the parallel execution time, we make the following assumptions: (1) the shared-memory multiprocessor has n processors; (2) there are n iterations in a loop; and (3) each instruction executes in one time unit. Chen and Yew [2] measured the contributions of individual restructuring techniques. The transformations examined are recurrence replacement, synchronized DOACROSS loops, induction variable substitution, scalar expansion, forward substitution, reduction statements, loop interchanging, and strip-mining. Surprisingly, most of the restructuring techniques have little effect on the speed of the codes, with a few exceptions: scalar expansion had the greatest effect on the performance of the codes, and reduction replacement also significantly improves the performance of several codes. Following their measurements, we use scalar expansion, reduction replacement, and induction variable substitution to convert a DO loop into a DOACROSS loop. The statistical model is shown in Fig. 5. First, we use the Parafrase compiler to convert a benchmark into parallel source code. Then, we extract the DO loops that cannot be parallelized by Parafrase from the parallel source code. Next, we convert each such DO loop into a DOACROSS loop by inserting synchronization operations so that it can be executed in parallel. In practice, we merge the loops, which have already been compiled to DLX machine code, with the synchronization operations, which cannot be compiled by the DLX compiler. Then we translate the merged code into an internal form, the input format of our simulator. Finally, the simulator calculates the total execution time of the traditional list scheduling and of our scheduling approach for each loop. According to the statistics in [14], there are six types of DOACROSS loop: control dependence, anti/output dependence, induction variable, reduction operation, simple subscript expression, and others not included in the above categories. We gather statistics using types 3 (induction variable), 4 (reduction operation), 5 (simple subscript expression), and part of type 6.

Fig. 5 Statistical model (flowchart: Benchmark (*.c) -> Parafrase Compiler (*.out) -> Extract DOACROSS Loop -> Insert Synchronization Operation (*.gen) -> DLX Compiler -> Merge DLX code & synchronization operation -> Internal Form (*.ooo) -> Simulator -> Statistic Data (*.dat))

4.2. Statistical results

The characteristics of these Perfect benchmarks are listed in Table 1. In this table, we find that for benchmarks FLQ52, QCD, and TRACK all the dependences are LBDs, and almost all LBDs are flow dependences [19]. We compared our instruction scheduling technique with list scheduling [16]. The function units of the superscalar processor are assumed to include a load/store unit, an integer unit, a floating-point unit, a multiplier, a divider, and a shifter. The multiplier and divider take 3 and 6 cycles respectively; the other function units take one cycle. There are 100 iterations in each loop. Let T_x-y-z be the parallel execution time, where x is "a" or "b" for list scheduling or the new instruction scheduling respectively, y is 2 (4) for 2-issue (4-issue), and z is 1 (2) when the number of each function unit is 1 (2). The statistics are divided into four cases. Case 1: 2-issue with one of each function unit, giving T_a-2-1 and T_b-2-1 for each benchmark. Case 2: 2-issue with two of each function unit, giving T_a-2-2 and T_b-2-2. Case 3: 4-issue with one of each function unit, giving T_a-4-1 and T_b-4-1. Case 4: 4-issue with two of each function unit, giving T_a-4-2 and T_b-4-2. The statistical results are shown in Table 2. In this table, the performance enhancement for each benchmark is significant. Some observations from this table follow. 1) The parallel execution time of the new instruction scheduling is much the same in every case. The reason is that the new instruction scheduling yields the shortest synchronization path in every case, and the dependences in the DLX code are serious; therefore, the parallel execution time does not differ significantly between 2-issue and 4-issue. 2) For list scheduling, T_a-2-i of some benchmarks is smaller than the corresponding T_a-4-i. The reason is that the synchronization operations (Wat) do not depend on anything except their corresponding dependence events; to increase instruction-level parallelism, list scheduling moves synchronization operations freely, which lengthens the synchronization path. Therefore, the parallel execution time of list scheduling with 4-issue can increase. 3) The improvement in parallel execution time of the new instruction scheduling over list scheduling is shown in Table 3. The improvement is significant because the distance from a Wat to its corresponding Snk is large.
For an assignment statement, the Snk is almost always the first element on the right-hand side of ':=', and the precedence associated with the Snk gives it a low scheduling priority. In addition, the delayed-load technique is employed to use the limited registers effectively, and the Snk is in fact a Load instruction; because of delayed load, the Snk is placed near the end of the assignment statement while its corresponding Wat stays far away from it. This is why a significant improvement is possible. In summary, the new instruction scheduling achieves about 83.37% and 85.1% performance enhancement for 2-issue and 4-issue respectively.

5. Conclusion

A new instruction scheduling technique has been proposed in this paper. The issues it addresses have not been considered by previous instruction scheduling algorithms, and the performance enhancement obtained with the new technique is significant. Two major contributions are made in this paper: 1) the synchronization conditions are maintained correctly during instruction scheduling, and 2) system performance is enhanced by converting LBD loops into LFD loops or by reducing the damage of LBD.
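
The per-benchmark percentages in Table 3 below and the 83.37%/85.1% summary figures can be reproduced from the Table 2 timings; a short sketch of the calculation, assuming improvement = (T_a - T_b) / T_a:

```python
# Reproduce Table 3 and the summary figures from the Table 2 timings.
table2 = {  # Ta-2-1, Tb-2-1, Ta-2-2, Tb-2-2, Ta-4-1, Tb-4-1, Ta-4-2, Tb-4-2
    "FLQ52": (25316, 2726, 21392, 2704, 26508, 2720, 24184, 2695),
    "QCD":   (2844, 1850, 1846, 1840, 4143, 1849, 3538, 1844),
    "MDG":   (57427, 6375, 45440, 6075, 61689, 6372, 54080, 6053),
    "TRACK": (18406, 1814, 13288, 1796, 20204, 1812, 17679, 1787),
    "ADM":   (116165, 20949, 93652, 19664, 119855, 20845, 106559, 19336),
}

def improvement(ta, tb):
    """Percentage reduction of parallel execution time: (Ta - Tb) / Ta."""
    return 100.0 * (ta - tb) / ta

for name, row in table2.items():          # Table 3: one value per column pair
    cases = [improvement(row[k], row[k + 1]) for k in range(0, 8, 2)]
    print(name, [f"{c:.2f}%" for c in cases])

# Overall 2-issue and 4-issue enhancement (about 83.4% and 85.1%).
sum_a2 = sum(r[0] + r[2] for r in table2.values())
sum_b2 = sum(r[1] + r[3] for r in table2.values())
sum_a4 = sum(r[4] + r[6] for r in table2.values())
sum_b4 = sum(r[5] + r[7] for r in table2.values())
print(f"{improvement(sum_a2, sum_b2):.2f}%  {improvement(sum_a4, sum_b4):.2f}%")
```
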

Table 1 Characteristics of the Perfect benchmarks

  Items \ Benchmarks                 FLQ52    QCD     MDG     TRACK   ADM     TOTAL
  lines parsed by Parafrase          1794     1664    835     1483    3809    9585
  total no. of loops                 82       81      20      51      148     382
  no. of Doall loops                 27       10      1       22      25      85
  lines generated by DLX compiler    719      173     2926    537     2328    6683
  total no. of LFD                   0        0       6       0       6       12
  total no. of LBD                   7        2       32      4       14      59

Table 2 Statistical results

  Benchmark   Ta-2-1   Tb-2-1   Ta-2-2   Tb-2-2   Ta-4-1   Tb-4-1   Ta-4-2   Tb-4-2
  FLQ52       25316    2726     21392    2704     26508    2720     24184    2695
  QCD         2844     1850     1846     1840     4143     1849     3538     1844
  MDG         57427    6375     45440    6075     61689    6372     54080    6053
  TRACK       18406    1814     13288    1796     20204    1812     17679    1787
  ADM         116165   20949    93652    19664    119855   20845    106559   19336
  Total       220158   33714    175618   32079    232399   33598    206040   31715

Table 3 Improvement percentage for the statistics

  Benchmark   2-issue (#FU=1)   2-issue (#FU=2)   4-issue (#FU=1)   4-issue (#FU=2)
  FLQ52       89.23%            87.36%            89.74%            88.86%
  QCD         34.95%            0.32%             55.37%            47.88%
  MDG         88.89%            86.63%            89.67%            88.8%
  TRACK       90.14%            86.48%            91.03%            89.89%
  ADM         81.97%            79%               82.6%             81.85%

References

1. C. Q. Zhu and P. C. Yew, "A scheme to enforce data dependence on large multiprocessor systems," IEEE Trans. Software Engineering, June 1987, pp. 726-739.
2. D. K. Chen and P. C. Yew, "An empirical study on DOACROSS loops," Proc. of Supercomputing, November 1991, pp. 18-22.
3. D. W. Wall, "Limits of instruction-level parallelism," Proc. 4th International Conference on Architectural Support for Programming Languages and Operating Systems, 1991, pp. 176-188.
4. G. Cybenko, L. Kipp, L. Pointer, and D. Kuck, "Supercomputer performance evaluation and the Perfect Benchmarks," Proc. ICS, March 1990, pp. 45-53.
5. H. M. Su and P. C. Yew, "On data synchronization for multiprocessors," Proc. 16th Annual International Symposium on Computer Architecture, May 1989, pp. 416-423.
6. H. Zima and B. Chapman, Supercompilers for Parallel and Vector Computers, Addison-Wesley, Reading, MA, 1990.
7. J. R. Goodman and W. C. Hsu, "Code scheduling and register allocation in large basic blocks," Proc. 1988 International Conference on Supercomputing, St. Malo, July 1988.
8. M. Berry, D. Chen, P. Koss, D. Kuck, L. Pointer, Y. Pang, R. Roloff, and A. Sameh, "The Perfect Club benchmarks: effective performance evaluation of supercomputers," International Journal of Supercomputing Applications, Fall 1991, pp. 5-40.
9. M. Butler, T.-Y. Yeh, Y. Patt, M. Alsup, H. Scales, and M. Shebanow, "Single instruction stream parallelism is greater than two," Proc. 18th International Symposium on Computer Architecture, 1991, pp. 276-286.
10. M. Wolfe, "Multiprocessor synchronization for concurrent loops," IEEE Software, January 1988, pp. 34-42.
11. M. Wolfe, High Performance Compilers for Parallel Computing, Addison-Wesley, Reading, MA, 1995.
12. N. P. Jouppi and D. W. Wall, "Available instruction-level parallelism for superscalar and superpipelined machines," Proc. 3rd International Conference on Architectural Support for Programming Languages and Operating Systems, 1989, pp. 272-282.
13. P. P. Chang, S. A. Mahlke, W. Y. Chen, N. J. Warter, and W. W. Hwu, "IMPACT: an architectural framework for multiple-instruction-issue processors," Proc. 18th International Symposium on Computer Architecture, 1991, pp. 266-275.
14. R. Eigenmann, J. Hoeflinger, Z. Li, and D. Padua, "Experiences in the automatic parallelization of four Perfect-benchmark programs," Proc. 4th Workshop on Languages and Compilers for Parallel Computing, August 1991, pp. 87-95.
15. Rong-Yuh Hwang, "Synchronization migration for performance enhancement in a DOACROSS loop," International EURO-PAR '95 Conference on Parallel Processing, August 1995, pp. 339-350.
16. Rong-Yuh Hwang and F.-J. Fang, "A simulator for a new instruction scheduling and list scheduling on a superscalar-based multiprocessor," Technical Report NTITEN 96-002, National Taipei Institute of Technology, January 1996.
17. Rong-Yuh Hwang and Feipei Lai, "An intelligent code migration technique for synchronization operations on a multiprocessor," IEE Proceedings-E Computers and Digital Techniques, March 1995, pp. 107-116.
18. Rong-Yuh Hwang and Feipei Lai, "Guiding instruction scheduling with synchronization markers on a superscalar-based multiprocessor," International Symposium on Parallel Architectures, Algorithms, and Networks, December 1994, pp. 121-127.
19. Rong-Yuh Hwang, "Synchronization migration for performance enhancement in a DOACROSS loop," International EURO-PAR '95 Conference on Parallel Processing, August 1995, pp. 339-350.
20. S. P. Midkiff and D. A. Padua, "Compiler algorithms for synchronization," IEEE Trans. Computers, C-36(12), December 1987, pp. 1485-1495.

21. Z. Shen, Z. Li, and P. C. Yew, "An empirical study of FORTRAN programs for parallelizing compilers," IEEE Trans. Parallel and Distributed Systems, July 1990, pp. 356-364.