PPS: A Pipeline Path-based Scheduler


Maher Rahmouni, Ahmed A. Jerraya
Laboratoire TIMA/INPG, 46, Avenue Felix Viallet, Grenoble Cedex, France
rahmouni@verdon.imag.fr

Abstract

This paper presents a scheduling algorithm that improves on other approaches when dealing with the synthesis of control-flow dominated behavioral descriptions. It achieves this through the use of a constraint-driven path-based scheduling algorithm. The suboptimality of the original path-based algorithms when dealing with loops is overcome through a new technique for pipelining different loop iterations during execution path generation. Results show that the algorithm always generates the fastest solution in terms of clock cycles.

1 Introduction

Path-based scheduling (PBS) algorithms have proved themselves to be much more efficient than classical approaches when dealing with descriptions of control-flow dominated circuits. The first application of PBS to synthesis was made by Camposano [,,] and was based on algorithms first proposed for microcode compaction []. PBS generates an As Fast As Possible (AFAP) schedule for a description containing many different possible execution paths. This is achieved through a complex clique covering technique that identifies the minimum number of cuts necessary for all paths in order to satisfy the constraints (user-imposed or data-dependent). This approach, however, tends to be sub-optimal when the input description contains many loops. The problem is that, in Camposano's approach, all loop feedback edges are broken, so no advantage can be taken of the fact that different loop iterations can be pipelined, which would expose potential parallelism beyond loop boundaries. Two other path-based approaches, namely DLS [] and LDS [8], attempt to overcome this problem by leaving loop feedback edges intact. However, their approach is rather simplistic as they only consider one iteration.
In addition, they do not cut the generated paths in an optimal way, as they use an As Late As Possible (ALAP) scheduling technique to do this. Nevertheless, results published for these algorithms show that, by treating loops more efficiently, improvements on the original path-based approach can be made, even though the path cuts are not optimal.

In this paper, we propose a path-based scheduling algorithm that uses a clique covering technique to cut the paths in an optimal fashion while, at the same time, using a new technique for pipelining loop iterations in order to identify any parallelism that may exist beyond loop boundaries. This loop pipelining goes beyond that proposed in both DLS and LDS. The algorithm assumes that a loop may execute 0 times (for example, when the loop condition is false), once, or two or more times, and generates different paths for each of these cases. Thus potential parallelism over several iterations of a loop can be identified. The algorithm, called Pipelined Path-based Scheduling (PPS), is presented in the rest of this paper.

The paper is organized as follows. The basic concepts are reviewed in section 2. In section 3, we analyze two previous path-based algorithms, PBS and DLS, to deduce our new algorithm, PPS. The algorithm is illustrated in section 4. In section 5, we show some experimental results that indicate the improvements that may be obtained using PPS. Finally, conclusions and perspectives are summarized in section 6.

2 Basic Concepts

In this section, we briefly outline some of the basic concepts necessary for developing the algorithm.

2.1 Path-based Concepts

Control flow graphs are the most suitable representation for modeling control-flow dominated designs containing many (possibly nested) loops, global exceptions, multiple waits on events and procedure calls, in other words, features that reflect the inherent properties of controllers. A control flow graph (CFG) is a graph $G = (V, E)$, where the nodes $V$ represent operations such as assignments, additions and procedure calls, and the edges $E$ represent the precedence relation. An edge $(v_i, v_j) \in E$ means that $v_j$ is executed after $v_i$. If an operation represented by $v$ has more than one successor, represented by $(v_1, \ldots, v_n)$, one of them is executed. The selection of the successor depends on the conditions attached to, and indicated on, the edges $(v, v_1), \ldots, (v, v_n)$.

Paths in path-based scheduling represent sequences of nodes $(v_1, \ldots, v_n)$ such that all these nodes can be executed in the same control step. Each path has a header, which is the first node of the sequence, and a successor, which follows the last node in the sequence. Paths with the same header are merged into the same state. A transition is made between states $S_i$ and $S_j$ if and only if there is a path $P$ in $S_i$ whose successor is the header of all the paths in $S_j$.

2.2 Cost Function

Data-dependent loops and loops with non-static bounds in control flow graphs introduce a major problem for the cost function of the scheduling result, because the number of iterations of each loop is unknown. The number of states or transitions generated by the schedule does not reflect the real total execution time of an algorithm. The right way to evaluate these algorithms is to define a new metric representing the expected number of clock cycles of a schedule [8]. This metric includes:

Branch Probability: a branch probability is associated with each edge in the control flow graph.
These probabilities are computed by simulating the given behavioral description with a large set of different possible inputs.

Path Probability: Let $P = (v_1, \ldots, v_n)$ and $v_m = \mathrm{Succ}(P)$. The probability of executing such a path is:

$$\mathrm{Prob}(P) = \left( \prod_{i=1}^{n-1} p(v_i, v_{i+1}) \right) p(v_n, v_m) \qquad (1)$$

State Transition Probability: Let $(P_i, \ldots, P_j)$ be the paths having $S_k$ as entry state and $S_l$ as destination state. The probability of state transition from $S_k$ to $S_l$ is:

$$p_s(k, l) = \sum_{r=i}^{j} \mathrm{Prob}(P_r) \qquad (2)$$

Expected Number of Clock Cycles of a Schedule: Let $S = (S_1, \ldots, S_n)$ be the set of states resulting from the schedule. The expected number of clock cycles needed to execute the corresponding input behavioral description is:

$$X_{sch} = \sum_{i=1}^{n} X_i \qquad (3)$$

where $X_i$ is a random variable representing the expected number of times the state $S_i$ is executed during an execution of the behavioral description. Computing $X_i$, $\forall i \in (1, \ldots, n)$, is equivalent to solving the following set of linear equations:

$$X_1 = 1 \qquad (4)$$

$$X_i = \sum_{j} X_j \, p_s(j, i) \qquad (5)$$

$\forall j$ such that $\exists$ a path $P$ from $S_j$ to $S_i$.

3 Scheduling of Control Dominated Descriptions

For control-flow dominated circuit descriptions, the most suitable scheduling techniques are those based on optimizing the different execution paths that may occur. The classic Path-based Scheduling (PBS) algorithm uses a minimum clique covering technique to generate an As Fast As Possible (AFAP) schedule. Other path-based algorithms, such as DLS, use a simpler ALAP schedule, preferring to concentrate on the optimization of loop execution. Both algorithms dramatically improve on the performance of data-flow dominated scheduling techniques for control-dominated circuits, but their results differ depending on the input description. Take, for example, the control-flow graph of figure 1(a). This represents a simple algorithm containing two possible execution paths starting at node and terminating at node. We assume that nodes and are synchronization nodes.
In other words, they represent a statement such as "wait until S='';" in VHDL. The treatment of such synchronization statements is different in PBS and DLS. The result of scheduling the description of figure 1(a) using PBS is shown in figure 1(b). We can see that, under certain conditions, the entire algorithm can be executed in a single state (S). Depending on the evaluation of the conditional of node, either state or S will be executed if the corresponding synchronization statement is false. Using the statistical analysis described in section 2.2 and the probability of taking each branch as shown in figure 1(b), we deduce that the execution time of this algorithm in terms of clock cycles will be.0. For DLS, the execution paths will always be broken when a synchronization statement is encountered. Thus, DLS will generate a circuit that executes according to the state diagram of figure 1(c). This diagram also has three states, but note that the algorithm will always take at least two states to execute. Again according to our statistical analysis, the execution time of this circuit will be.. Thus, PBS outperforms DLS in this case in terms of execution time (the numbers of states and transitions are equal).

Figure 1: (a) Simple CFG input representation (b) Result of PBS scheduling (c) Result of DLS scheduling

A more realistic control-flow dominated circuit description contains many (possibly nested) loops. In such cases, DLS tends to produce better results than PBS. Suppose we take the input graph representation shown in figure 2(a). For the sake of clarity, we limit this graph to a single loop. We assume that there is a constraint violation between nodes and. As we are dealing with behavioral descriptions, it is not imperative to have a synchronization statement within a loop body. The result of scheduling this representation using PBS is shown in figure 2(b). PBS breaks all loops in the input description by removing feedback edges. This implies that each loop will require at least one state (control step) to execute. Using statistical analysis, the number of clock cycles necessary to execute this schedule is 00. DLS does not break loop feedback edges and can therefore potentially execute the loop in a single state. Even if a loop requires several states, the fact that we can pipeline different iterations of the loop means that the execution time of an algorithm containing loops scheduled with DLS will be faster than for PBS. This is validated in figure 2(c) (even for such a small example). We see that, using the same statistics as for PBS, the execution time of the algorithm using DLS is 0 clock cycles. In the next section, we present an algorithm that combines the advantages of both of these approaches and generates schedules that always execute in the minimum number of cycles.

Figure 2: (a) CFG with a single loop (b) Result of PBS scheduling (c) Result of DLS scheduling

4 Pipeline Path-based Scheduling

In order to benefit from the advantages offered by both PBS and DLS, we have developed a new algorithm that uses clique covering to generate an AFAP schedule while at the same time using loop feedback edges to pipeline different iterations of a loop. In order to describe this algorithm, we will use the input graph representation shown in figure 2(a). The algorithm, known as Pipelined Path-based Scheduling (PPS), considers that a given loop will be executed 0, 1, or 2 or more times. If there are no

loops, the paths are generated in accordance with the PBS approach. Thus, for the example presented in figure 1, PPS will generate the same results as PBS. If, on the other hand, loops exist, paths will be generated assuming the loop executes 0 times, once and twice. This technique is not unlike that of loop unrolling []. Unrolling loops twice allows us to detect any interdependencies that may exist between different iterations of the same loop. As there will always be a dependency between different iterations of the same node, which implies a path interval, it is unnecessary to unroll loops more than twice. Of course, if the loop bounds are known at compile time, we can fully unroll the loop.

The different paths for the input graph of figure 2(a) are shown in figure 3. The subscripts correspond to the loop iteration from which the node is taken. Thus, for a node $v$, $v_0$ is node $v$ for zero loop iterations (the loop condition is false), $v_1$ is node $v$ of the first iteration of the loop and $v_2$ is node $v$ of the second iteration. We still assume that a constraint violation is present between nodes and, implying that all paths must be broken between these nodes for all loop iterations. Other path intervals are due to constraint violations (data dependencies) between different iterations of the same node (nodes 0 and for example). Nevertheless, we can still benefit from loop pipelining, as shown in figure 3 by the intervals generated using clique covering.

Figure 3: (a) Paths before and after cuts generated using PPS scheduling on the input graph of figure (c), (b) Corresponding state diagram

Figure 4: Benchmark Results (designs: GCD, PREFETCH, MMULT; columns: Design, Constraints (adders), Loops, S/L Paths, States, Trans, Expck)

The path {,,,} is deduced from Path. This is due to the fact that the loop is unrolled twice. The operations and of iteration i can be executed in the same state as operations and of iteration i.
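As a rough illustration of the path generation described in this section, the following Python sketch enumerates the execution paths of a single-loop description for 0, 1 and 2 iterations and lists the node pairs between which a cut is required. The node labels, the function names (`unroll_path`, `generate_paths`, `violation_intervals`, `label`) and the convention that the first loop node is the loop test are assumptions made for this example, not part of the published algorithm. The clique covering step that selects the actual cuts is not shown.

```python
from typing import List, Optional, Tuple

# A path element is (node label, loop iteration). The iteration is None for
# nodes outside the loop and 0 for the loop test when the loop is skipped.
PathNode = Tuple[str, Optional[int]]


def unroll_path(pre: List[str], loop: List[str], post: List[str],
                iterations: int) -> List[PathNode]:
    """Build the execution path for a fixed number of loop iterations.

    Assumption of this sketch: loop[0] is the loop test, and for zero
    iterations only that test appears on the path (with subscript 0).
    """
    if iterations == 0:
        middle: List[PathNode] = [(loop[0], 0)]
    else:
        middle = [(n, it) for it in range(1, iterations + 1) for n in loop]
    return [(n, None) for n in pre] + middle + [(n, None) for n in post]


def generate_paths(pre: List[str], loop: List[str], post: List[str],
                   max_iterations: int = 2) -> List[List[PathNode]]:
    """One path per assumed iteration count: 0, 1, ..., max_iterations."""
    return [unroll_path(pre, loop, post, k) for k in range(max_iterations + 1)]


def violation_intervals(path: List[PathNode], first: str,
                        second: str) -> List[Tuple[int, int]]:
    """Index pairs (i, j) where path[i] is `first` and path[j] is the next
    `second`; at least one cut must be placed between positions i and j."""
    intervals = []
    start = None
    for idx, (name, _) in enumerate(path):
        if name == first:
            start = idx
        elif name == second and start is not None:
            intervals.append((start, idx))
            start = None
    return intervals


def label(node: PathNode) -> str:
    """Render a node with its iteration subscript, e.g. ('b', 2) -> 'b_2'."""
    name, it = node
    return name if it is None else f"{name}_{it}"


# Hypothetical single-loop description: a; while t: b; c; then d.
# A constraint violation is assumed between b and c.
for p in generate_paths(pre=["a"], loop=["t", "b", "c"], post=["d"]):
    print(" ".join(label(n) for n in p), violation_intervals(p, "b", "c"))
```

Each printed line is one candidate path together with the intervals in which a cut must fall. A full implementation would additionally add intervals for dependencies between different iterations of the same node and pass all intervals to the clique covering step.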
5 Experimental Results

We have executed our algorithm on several benchmarks and other published examples. The results show that PPS always finds the best solution under the same constraints when compared with the two other path-based approaches, represented by [] and []. Three benchmarks were chosen: PREFETCH [], GCD (Greatest Common Divisor) and MMULT [9], which represents the computation of the function ab mod n, given 0 a; b n, and lg(n). For each of them, figure 4 shows the results produced by PPS and the two other algorithms. Loops is the number of loops, S/L Paths gives the shortest/longest path, States the number of states and Trans the number of transitions. Expck is the expected number of clock cycles needed to execute the algorithm. The results in figure 4 show that PPS always produces a result at least as good as those of the two other algorithms. Furthermore, in the small examples, the clique covering technique and ALAP sometimes generate the same cuts (the case of GCD), thus producing the same results.

Of more interest is the performance of PPS when treating large realistic examples. Figure 5 shows how PPS performed when dealing with two relatively complex designs: a send process of the X.[8] and a telephone answering machine []. Both of these examples contain complex control structures. ANSWER is a very large example, making it difficult to compute the probabilities for every possible execution path and to solve the linear equations to produce the expected number of clock cycles. Nonetheless, with PPS, we save states compared to and 9 compared to.

Figure 5: PPS performance using large examples (designs: X., ANSWER; columns: Design, Constraints, Chaining, Loops, States, Trans, Expck)

6 Conclusions

In this paper, we have presented a new algorithm which combines clique covering techniques and a new loop pipelining technique to optimally schedule control dominated designs. Our method optimizes both the cuts (and therefore the number of states) and the loop execution time by taking into account any parallelism that may exist beyond iteration boundaries. Experimental results obtained by executing PPS on several examples show that the new technique is a considerable improvement on previous path-based approaches.

Acknowledgements

The authors would like to thank Dr Kevin O'Brien and Jean Frehel for their help during this work.

References

[] J. A. Fisher, "Trace Scheduling: A Technique for Global Microcode Compaction", IEEE T. CAD, Vol C-0, July 98.
[] R. Camposano, "Path-Based Scheduling for Synthesis", IEEE T. CAD, Vol 0(), pp8-9, January 99.
[] R.A. Bergamaschi, R. Camposano, M. Payer, "Area and Performance Optimizations in Path-Based Scheduling", Proc. EDAC'9, pp0-0, Amsterdam, February 99.
[] R. Camposano, R.A. Bergamaschi, "Synthesis Using Path-Based Scheduling: Algorithms and Exercises", Proc. 7th DAC, pp0-, Orlando, June 990.
[] G. Goossens, J. Vandewalle, H. D. Man, "Loop Optimization in Register-Transfer Scheduling for DSP-Systems", Proc. th DAC, pp8-8, June 989.
[] K. O'Brien, M. Rahmouni, A.A. Jerraya, "A VHDL-Based Scheduling Algorithm for Control-Flow Dominated Design", th Intl. Workshop on High-Level Synthesis, California, November 99.
[7] K. O'Brien, M. Rahmouni, A.A. Jerraya, "DLS: A Scheduling Algorithm for High-Level Synthesis in VHDL", Proc. EDAC'9, Paris, France, February 99.
[8] S. Bhattacharya, S. Dey, F. Brglez, "Performance Analysis and Optimization of Schedules for Conditional and Loop Intensive Specifications", st Design Automation Conference, pp. 9-9, 99.
[9] Howard Trickey, "Compiling Pascal Programs into Silicon", PhD Thesis, Department of Computer Science, Stanford University, July 98.
