Incremental Tree Height Reduction For High Level Synthesis *

Alexandru Nicolau+    Roni Potasman++
+ Information and Computer Science Department
++ Dept. of Electrical and Computer Engineering
University of California, Irvine, CA

* This work was supported in part by NSF grant CCR and ONR grant N K.

Abstract

A new local and incremental Tree Height Reduction (THR) technique for parallelization of application programs is presented. Although THR was introduced many years ago, it has not been widely used in HLS scheduling systems. The two main reasons for that were the inability of most systems to compact beyond basic blocks of the program, thus limiting the strength of THR, and the fact that traditionally THR required a global view of the program, which made it either inefficient or impossible to integrate into local transformations. THR has several interesting properties: while known compaction techniques yield a constant factor of speed-up (even with unlimited resources), THR has a speed-up of O(n/log n). Furthermore, THR is able to compact programs when other techniques fail (due to data dependency between operations). The capability of our system to integrate THR with beyond-basic-blocks compaction and with loop pipelining means that more operations may be considered for THR, which may yield much more aggressive compaction.

1 Introduction

Tree Height Reduction is a well known technique for reducing the height of an expression tree from O(n) to O(log n) by balancing its subtrees. The height of the tree is the number of steps needed to compute the expression. Suppose the following schedule is given and the hardware constraints allow the use of only 2 adders.

cycle 1: r1 := r0 + c1;
cycle 2: r2 := r1 + c2;
cycle 3: r3 := r2 + c3;
cycle 4: r4 := r3 + r2; r5 := r3 + c4;

Without any semantic transformation four steps are required to execute this program. However, using THR the schedule may be rewritten as:

cycle 1: r1 := r0 + c1; t1 := c2 + c3;
cycle 2: r2 := r1 + c2; r3 := r1 + t1;
cycle 3: r4 := r3 + r2; r5 := r3 + c4;

which produces the same results under the same hardware constraints but reduces the time needed to execute it from 4 to 3 steps. In general, the potential speed-up factor of THR (when the tree is fully balanced) is O(n/log n), limited only by the resources available. Although THR was introduced many years ago [KuMuCh72, Ku78], it has not been widely used in HLS or other scheduling systems. Two main reasons contributed to that: first, THR is effective only when there is a long enough chain of operations which are data dependent. Unfortunately, the original THR was only applicable to operations within basic blocks. Since the average number of operations within a basic block is 4-5 [TjFl70] (and not all of these will always form a chain), potential speed-up in basic blocks is limited. Second, the traditional implementation of THR required global information about the whole expression to be reduced. This prevented integration of THR into any of the existing local and incremental parallelizing transformations (List Scheduling, Trace Scheduling, Percolation Scheduling, etc.) which can be used for HLS.
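The rewriting above can be sketched mechanically. The following Python fragment is our illustration, not the authors' implementation: it balances a chain of '+' leaves pairwise and reports the resulting tree height, reproducing the drop from 3 addition steps to 2 for the chain feeding r3.

import math

def chain_height(n_leaves):
    # A left-leaning chain ((r0 + c1) + c2) + c3 needs one add per extra leaf.
    return n_leaves - 1

def balanced_height(n_leaves):
    # After full balancing the adds form a binary tree of logarithmic depth.
    return math.ceil(math.log2(n_leaves))

def balance(leaves):
    # Build a balanced '+' tree (nested tuples) by pairing subtrees level by level.
    level = list(leaves)
    while len(level) > 1:
        nxt = [('+', level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # an odd leaf is carried to the next level
            nxt.append(level[-1])
        level = nxt
    return level[0]

print(chain_height(4), balanced_height(4))   # 3 2: the saved step of the example
print(balance(['r0', 'c1', 'c2', 'c3']))     # ('+', ('+', 'r0', 'c1'), ('+', 'c2', 'c3'))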
In [PLNG90] we presented a global (beyond basic blocks) approach to scheduling. By designing a set of incremental transformations for THR that integrate into our system of local transformations we overcome the previous problems associated with THR. In this context incremental and local THR has some important advantages: it is less ad hoc than the global one, it has more general application, it is easier to implement, and it interfaces very well with other local parallelizing transformations and enables better control of resources. Furthermore, the local and incremental aspect of our technique will exploit potential opportunities wherever they are interspersed in the program; so even in a program that is not as regular as the above example, we may still benefit from local opportunities interspersed throughout the program. The application of our THR is controlled by the resources available such that it only 'fills' unused resources. Thus, the traditional concern that THR may degrade performance by generating redundant code that cannot translate into speed-up (due to limited resources) is completely eliminated by our incremental approach.
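Since THR must only 'fill' unused resources, every candidate rewrite is guarded by a per-cycle resource check. A minimal sketch of such a guard, under our own assumed encoding of an operation as an (op-type, def, uses) tuple:

def has_free_unit(node_ops, op_type, resources):
    # node_ops:  (op_type, def, uses) tuples already scheduled in this cycle.
    # resources: functional units available per cycle, e.g. {'ADD': 2, 'MUL': 1}.
    used = sum(1 for (t, _d, _u) in node_ops if t == op_type)
    return used < resources.get(op_type, 0)

cycle = [('ADD', 'r1', ('r0', 'c1'))]            # one of two adders is busy
print(has_free_unit(cycle, 'ADD', {'ADD': 2}))   # True: THR may insert here

THR inserts its extra operations only where such a check succeeds on every node it touches (condition 5 of Section 4.3), which is what rules out code growth that cannot pay for itself.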

In performing THR care must be taken not to violate numerical and other properties of the code. However, in most cases this process can be applied without detrimental side effects. Although our implementation of THR can handle pipelined operations, for simplicity we assume throughout the algorithm description that all operations are one-cycle operations. The extension of the incremental THR algorithm to pipelined operations is straightforward. To our knowledge, this is the first published local and incremental THR algorithm working across basic blocks.

2 Previous work

The use of THR in HLS systems is relatively rare. In Flamel [Tr87] an idea similar to THR is implemented by an algorithm called level compression. This algorithm picks (heuristically) a node about halfway along the critical path and tries to move it up so that the height of the path may be reduced. As height reduction proceeds, certain nodes are frozen to prevent them from taking part in further reductions. Nodes chosen for later moving must be about halfway along non-frozen portions of critical paths. This height reduction procedure usually cannot do better than approximately halve the original height. The moving process stops when there is no further height reduction or when resource bounds are exceeded at some level. In this algorithm care must be taken to avoid doing transformations that increase the graph's height.

3 THR Applications

Although [Ku78] claims that applying THR to multi-operation machines would be quite disappointing, we found a large span of applications of THR in the context of HLS. Obviously, if one considers only basic blocks, the chain of dependencies is not long enough to expose the strength of THR, but by looking at the global RTL-level parallelism we are able to go past conditional jumps and have a longer chain of operations, which improves the potential parallelism. This is particularly noticeable when combined with loop pipelining, where operations from different iterations make this chain even longer. While space limitations prevent a detailed example using pipelining with THR, the benchmark results reflect this. The two most attractive applications of THR are digital filters and array computations. Digital filters are potential candidates for THR since they have chains of additions (resulting from the different delay elements) and since they usually implement loops which may be pipelined, so that many more operations may be exposed to THR. The most frequent array computations suitable for THR are sums of vector elements, dot products and simple recurrences, where one can find a chain of dependent operations. Although these chains of dependencies are simple, they prevent any reduction to parallel form without an algorithmic change. Given enough resources, THR will reduce the computation time for all these examples from O(n) to O(log n).

4 Algorithm Description

4.1 Background

The idea behind tree height reduction is to try to compact a program at the expense of additional computation. In a design where execution of more than one operation per cycle is possible, it is natural to utilize all available (unused) resources in order to increase performance. Hence, THR adds more operations to the program, which can be executed by these free resources, such that the total execution time of the program is reduced.
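For a feel of the pay-off, the vector-sum case from Section 3 can be evaluated in the pairwise order that a fully applied THR produces. This sketch (ours, assuming unlimited adders) counts the parallel addition steps:

def pairwise_sum(values):
    # Sum a vector in O(log n) parallel steps instead of an O(n) chain.
    steps, level = 0, list(values)
    while len(level) > 1:
        # All additions at one level are mutually independent: one parallel step.
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])
        level, steps = nxt, steps + 1
    return level[0], steps

print(pairwise_sum(list(range(16))))   # (120, 4): 16 elements in 4 steps, not 15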
THR takes advantage of the associativity and distributivity properties of arithmetic operations. For simplicity, we only present the algorithm with addition, multiplication and subtraction. It can be extended easily to programs with divisions and logical (AND, OR) operations as well.

4.2 Definitions

In this section we define some notations used later in the algorithm:

Program: The program is represented by a control flow graph (CFG). The vertices (nodes) correspond to operations executed in each cycle. The edges represent flow of control from one node to its successor. Initially all nodes contain a single operation. Making a program more parallel involves compaction of several operations into one node while preserving the semantics of the sequential program.

Operation: Each operation has a type (op-type) and variables which are called uses variables (for operands read) and a def variable (for the operand written). For the operation a := b * c the def is a and the uses are b and c. The op-type is multiplication.

Current-op: The operation currently being examined (or the operation we are trying to schedule earlier than its current cycle).

Selected-path: The path selected for THR.

Later-definer and earlier-definer: The operations defining the uses of current-op. In a := b * c, the operations defining b and c are called the definers of a. Suppose the following program is given:

cycle (k):   b := d + e; h := d * g;
cycle (k+1): c := h - e;
cycle (k+2): a := b * c;

We call the operation (c := h - e) the later-definer of operation (a := b * c), while the operation (b := d + e) is called the earlier-definer of the current-op.

Available variable: A variable is said to be available in cycle (k) if it is defined at cycle (k-1) or earlier. In the example above, c is available in cycle (k+2) while b is available in cycle (k+1).

Percolate Operations: Scheduling operations as soon as possible using the Percolation Scheduling (PS) transformations. PS does the actual operation motion between nodes, ensuring that program correctness is preserved. Thus, PS would check for data dependency preservation on all paths through the affected code and modify the schedule accordingly. While this is extremely simple in the absence of conditional jumps, the general transformations are nontrivial. Detailed discussion and description of the PS transformations are found in [Ni85, PLNG90].
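To make these definitions concrete, here is one possible encoding in Python; the class and query names are ours, not the paper's:

from dataclasses import dataclass

@dataclass
class Operation:
    op_type: str        # 'ADD', 'SUB', 'MUL', ...
    defv: str           # the def variable (operand written)
    uses: tuple         # the uses variables (operands read)
    cycle: int          # the node (cycle) currently holding the operation

def definers(current_op, ops):
    # Operations defining the uses of current_op, later-definer first.
    ds = [op for op in ops if op.defv in current_op.uses]
    return sorted(ds, key=lambda op: op.cycle, reverse=True)

def available(var, at_cycle, ops):
    # A variable is available in cycle k if defined at cycle k-1 or earlier.
    return any(op.defv == var and op.cycle <= at_cycle - 1 for op in ops)

prog = [Operation('ADD', 'b', ('d', 'e'), 1), Operation('MUL', 'h', ('d', 'g'), 1),
        Operation('SUB', 'c', ('h', 'e'), 2), Operation('MUL', 'a', ('b', 'c'), 3)]
later, earlier = definers(prog[3], prog)[:2]
print(later.defv, earlier.defv)                           # c b
print(available('b', 2, prog), available('c', 2, prog))   # True False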

4.3 Algorithm in detail

Our local and incremental THR algorithm can be invoked in one of two ways. First, if during the incremental process, in which PS is trying to move an operation up from a node to its predecessor, a dependency is encountered, then THR is invoked to incrementally change the code to allow the motion. Alternatively, incremental THR can be invoked in the final phase of the compaction process, after all data-independent operations have moved up as high as possible and there are still unused resources to fill.

When activated for a particular operation, the algorithm checks whether it could be scheduled earlier than its current cycle by introduction of a new operation which can be performed early enough so it can be used to eliminate the dependency on the later-definer and advance the schedule of the current-op. Since each node may have more than one predecessor node (several incoming paths), incremental THR should be performed with respect to selected paths in the program. On different paths, each operation may have a different later-definer and a different earlier-definer, thus each path should be considered separately. Although it is usually sufficient to check only adjacent nodes in the program, and hence preserve locality, it turns out that in order to achieve optimality (in the presence of sufficient resources) the whole chain of operations on the path has to be checked. This process is not needed when the resources are limited. The following algorithmic description refers to the optimal reduction on each path.

The algorithm analyzes two cases differently: the first is when the associativity property of operations is used, which happens whenever the current-op and its later-definer constitute one of the following pairs: ADD/ADD, ADD/SUB, SUB/ADD, SUB/SUB and MUL/MUL. The other case is when current-op is MUL and its later-definer is either ADD or SUB, where the distributivity property is used. In either case we try to hoist current-op from its current node (cycle) to a predecessor node, which eventually may reduce the length of the program.

Necessary and sufficient conditions for an operation to be hoisted:

1. One of its definers must be available at least two cycles earlier than itself on the path selected.
2. Current-op's later-definer has a definer which is available at least two cycles earlier than current-op's cycle on that path.
3. If current-op is ADD or SUB then the later-definer has to be either ADD or SUB. (These legal combinations constitute a legal chain.) If, on the other hand, current-op is MUL, the later-definer may be either MUL, ADD or SUB.
4. Both current-op and its definers have two uses variables.
5. All relevant nodes on the path (into which new operations are added) have free resources.

This does not mean that the algorithm needs to consider all paths; we may simply concentrate on only one or several important paths. Due to the incremental nature of the transformations we can stop at any point in the process and still have correct code.

Procedures

procedure THRAnalysis(selected-path)
    for each node n in the selected-path do
        reset back-track flag
        for each operation in n do
            if current-op meets the conditions then begin
                switch   /* find which case it is */
                    case associativity:  AssociativityAnalysis(current-op)
                    case distributivity: DistributivityAnalysis(current-op)
                percolate operations on the path
        if back-track is set
            recheck predecessor node
        else
            check next node
end (THRAnalysis)

The back-track flag causes backtracking to the previous node. This node has to be rechecked due to the possible creation of new legal chains of operations following the pushing of multiplications upward. These chains may create further THR opportunities. See Example 1.
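A Python rendering of this driver, under the encoding sketched in Section 4.2; the five callables are stubs standing in for the paper's condition check, the two case analyses and Percolation Scheduling:

def thr_analysis(path, meets_conditions, is_associative_case,
                 assoc, distrib, percolate):
    # path: list of nodes, each node a mutable list of operations.
    i = 0
    while i < len(path):
        back_track = False
        for op in list(path[i]):                 # copy: the cases edit the node
            if not meets_conditions(op, path):
                continue
            if is_associative_case(op):          # ADD/ADD, ..., MUL/MUL pairs
                assoc(op, path)
            else:                                # MUL over ADD/SUB
                distrib(op, path)
                back_track = True                # new MULs may form new chains
            percolate(path)                      # hoist operations as early as legal
        # Each distributivity rewrite consumes its current-op, so the
        # back-tracking recheck of the predecessor node terminates.
        i = max(i - 1, 0) if back_track else i + 1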
procedure AssociativityAnalysis(current-op)
    if current-op is SUB AND later-definer is its subtrahend
        set sign-flag
    earliest-op = Find-Highest-Avail-Op(current-op)
    if succeeded to find such an operation then begin
        /* add new operations recursively to path */
        Climb-Up(modified op-type, earliest-op's earlier-definer,
                 current-op's later-definer)
        remove current-op from list
end (AssociativityAnalysis)

Sign-flag controls the correct addition of SUB operations into the program. We need to flip the operands whenever we find a SUB whose later-definer is its subtrahend.

procedure DistributivityAnalysis(current-op)
    /* the procedure is called when current-op is of the form d := a * (b + c).
       In this case we do not try to hoist d, but rather use the distributivity
       property and convert d into d := a * b + a * c. */
    /* add first additive (a * b) */
    add new operation with (MUL type, later-definer's earlier-definer,
        current-op's earlier-definer) into later-definer's node
    /* add second additive (a * c) */
    add new operation with (MUL type, later-definer's later-definer,
        current-op's earlier-definer) into later-definer's node
    /* add modified current-op (d) */
    add new operation with (later-definer's type, first-additive,
        second-additive) into current-op's node
    remove current-op from list
    set back-track flag
end (DistributivityAnalysis)
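Concretely, for current-op d := a * (b + c) the distributivity case manufactures the two products and rebuilds d. A self-contained sketch on (op-type, def, uses) tuples, with temporary names supplied by the caller (all naming is ours):

def distribute(current, later, fresh):
    # current = ('MUL', d, (x, a)) where x is later's def;
    # later   = (ty, x, (b, c)) with ty in {'ADD', 'SUB'}.
    _mul, d, uses = current
    ty, x, (b, c) = later
    a = uses[1] if uses[0] == x else uses[0]   # the MUL's non-chain operand
    t1, t2 = fresh(), fresh()
    first  = ('MUL', t1, (b, a))               # a * b, into later-definer's node
    second = ('MUL', t2, (c, a))               # a * c, into later-definer's node
    rebuilt = (ty, d, (t1, t2))                # d := t1 +/- t2; set back-track
    return [first, second], rebuilt

names = iter(['t1', 't2'])
print(distribute(('MUL', 'd', ('x', 'a')), ('ADD', 'x', ('b', 'c')),
                 lambda: next(names)))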

procedure Find-Highest-Avail-Op(selected-path)
    /* searches along the selected-path for the earliest operation which meets
       the conditions of Section 4.3. For correctness preservation, each time
       a SUB is found and its later-definer is the subtrahend, the operation's
       sign is flipped. */
end (Find-Highest-Avail-Op)

procedure Climb-Up(type, first-op, second-op)
    /* adds new operations into selected-path once the earliest operation that
       meets the conditions has been found by the previous procedure. Calls
       itself recursively until it reaches the later-definer of current-op.
       The addition of the modified current-op is done by the higher-level
       calling procedure. */
    add new operation with (type, first-op, second-op)
    if (didn't reach current-op's later-definer)
        Climb-Up(first-op's type, first-op's later-definer,
                 the newly added operation)
end (Climb-Up)

5 Examples

In this section we present two examples, the first to clarify the algorithm and the second to show its application.

Example 1: Suppose the following program is given, and assume that a0 and all the c's are available at the first cycle.

cycle 1: a1 := a0 * c1;
cycle 2: a2 := a1 * c2;
cycle 3: a3 := a2 - c3;
cycle 4: a4 := a3 * c4;
cycle 5: a5 := a4 - c5;

Step 1: Let us begin, for example, with the third operation (a3 := a2 - c3). Its earlier-definer is not defined in the previous instruction, so execute AssociativityAnalysis(). The op-type is SUB, so set sign-flag and call Find-Highest-Avail-Op(). But since current-op violates condition 3, quit the procedure.

Step 2: Current-op is (a4 := a3 * c4). Its type is MUL and its later-definer is SUB, so DistributivityAnalysis() is called. Three operations are added into the tree: (1) a MUL operation whose uses are the later-definer's earlier-definer (c3) and current-op's earlier-definer (c4); this operation gets a new def (t1) and is inserted into the later-definer's cycle. (2) Another MUL whose uses are (a2) and (c4) and whose def is t2; it is inserted into the later-definer's cycle. (3) The reconstruction of current-op with the type of the later-definer (SUB) and with uses which are the operations just added; its def is current-op's def. The back-track flag is set. After this step and percolation, the code contains:

t1 := c3 * c4; t2 := a2 * c4; a4 := t2 - t1;

Step 3: Since the back-track flag is set, cycle 3 is rechecked. (a3 := a2 - c3) cannot be hoisted for the same reason mentioned in step 1 above, so check the next operation in this cycle (t2 := a2 * c4). The operation is MUL and its later-definer (a2) is MUL, hence Find-Highest-Avail-Op() finds the highest such operation, which is (a2 := a1 * c2). Now, using Climb-Up(), operations are added as follows: a MUL (t3) operation whose uses are the highest op's earlier-definer (c2) and current-op's earlier-definer (c4) is added. Then another MUL, whose uses are the later-definer's later-definer (a1) and t3, is added. After this step and percolation we get:

t1 := c3 * c4; t3 := c2 * c4; t2 := a1 * t3; a4 := t2 - t1;

Step 4: Consider (a5 := a4 - c5) as the current-op. Since it is SUB we use AssociativityAnalysis() and get:

t1 := c3 * c4; t3 := c2 * c4; t2 := a1 * t3; t4 := t1 + c5; a4 := t2 - t1; a5 := t2 - t4;

Note that if resources didn't allow one of the steps (e.g., if only two subtractors were available per cycle), the incremental THR would have stopped without allowing a5 to move up, but would still produce a one-cycle gain.
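The net effect of Example 1 can be checked numerically; this small test (ours) evaluates the original chain and the reduced code on random integers and confirms that a4 and a5 agree:

import random

def original(a0, c):
    a1 = a0 * c[1]; a2 = a1 * c[2]; a3 = a2 - c[3]
    a4 = a3 * c[4]; a5 = a4 - c[5]
    return a4, a5

def reduced(a0, c):
    # Final code after steps 1-4; the t's depend only on constants.
    t1 = c[3] * c[4]; t3 = c[2] * c[4]; t4 = t1 + c[5]
    t2 = (a0 * c[1]) * t3
    return t2 - t1, t2 - t4        # a4, a5

c = [random.randint(-9, 9) for _ in range(6)]
a0 = random.randint(-9, 9)
assert original(a0, c) == reduced(a0, c)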
Example 2: This example shows how incremental THR works across basic blocks. Suppose the following program:

[Figure: control-flow graph of the original 8-cycle program, a chain of additions (r1 := r0 + c0; r2 := r1 + c1; ...; r4 := r3 + c3; ...) split into three basic blocks by two conditional jumps (one of them testing r3 > 0), with liveness annotations such as "{r3, r4, r5, r6} are dead here" and "{r5} live here".]

This program segment has 3 basic blocks separated by conditional jumps. Conventional THR (within basic block boundaries) on this program fails since there are not enough operations in each of these 3 chains to produce any speed-up. But applying our incremental THR beyond the conditionals yields a significant compaction (from 8 cycles to 3), as shown below. We assume that the two (independent) conditionals can be executed in one cycle. If there is no hardware support for the execution of two conditional jumps in one cycle, the second conditional will be deferred by one cycle, yielding a 4-cycle schedule.

cycle 1: r1 := r0 + c0; t1 := c1 + c2; t2 := c3 + c4
cycle 2: r2 := r1 + c1; r3 := r1 + t1; t3 := t2 + c5
cycle 3: r4 := r3 + c3; r5 := r3 + t2; r6 := r3 + t3

This example clarifies that the ability to move operations beyond basic blocks extends the potential length of chains of operations, which is crucial for the applicability of THR.

6 Experiments

This section details the results obtained by applying the incremental THR on the fifth order elliptic filter example [PaKn89] and the Sehwa example presented in [PaPa88]. In the following tables FDS and FDLS stand for Force Directed (List) Scheduling, PBS for Percolation Based Synthesis [PLNG90] and PBST for PBS with THR.

[Tables 1-3, giving the schedule lengths obtained by each method, are not reproduced in this transcription.]

A. Fifth Order Elliptic Filter: Table 1 refers to the non-pipelined case, where the model assumes that the execution unit has to be flushed before the succeeding operation can be issued. Table 2 is for the pipelined case, where the functional units can accept a new input each cycle. The results for the elliptic filter show that even though incremental THR is powerful when applied to the loop body, it may yield further parallelization when combined with loop pipelining.

B. Sehwa: The Sehwa example is an implementation of a digital filter with 16 points. Using the same semantics as [PaPa88], our system reduces the schedule from 6 time steps to 5. Using structural pipelining rather than functional pipelining (see [PLNG90]), incremental THR reduces the schedule from 10 steps to 8, as shown in Table 3.

7 Discussion and conclusion

With the advance in optimizing compiler techniques, and especially those with local transformations (like Percolation Scheduling), it is shown that by using THR there is a real possibility to compact programs across basic block limits even when conventional dependency analysis would appear to preclude further speed-up. There is one possible drawback to using THR: the notion of numerical stability. The problem is illustrated by the code sequence a := b - c followed by d := a * e. This may be transformed (during THR) into d := b * e - c * e. If the values of b or c are too large but the value of their difference is still small, the order in which the expression is evaluated may be significant. We argue that the algorithm presented here may be used selectively: in cases where numerical stability is violated, the user may disallow THR.
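The instability is easy to reproduce in floating point. In this small demonstration (ours), the difference b - c is exact, but distributing the multiplication forces c * e to round:

b, c, e = 1.0e16, 1.0e16 - 2.0, 3.0
d_before = (b - c) * e     # d := a * e with a := b - c: exactly 6.0
d_after  = b * e - c * e   # THR-distributed form: c * e rounds away the answer
print(d_before, d_after)   # 6.0 8.0 with IEEE-754 doubles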
References

[AlKe82] J. R. Allen and K. Kennedy. PFC: A program to convert Fortran to parallel form. Technical Report MASC TR 82-6, Rice University, 1982.

[Ku78] D. J. Kuck. The Structure of Computers and Computations, Vol. I. New York: Wiley, 1978.

[KuMuCh72] D. J. Kuck, Y. Muraoka and S. C. Chen. On the number of operations simultaneously executable in Fortran-like programs and their resulting speedup. IEEE Trans. on Computers, C-21, 12, December 1972.

[Ni85] A. Nicolau. Uniform Parallelism Exploitation in Ordinary Programs. Proceedings of the 1985 International Conference on Parallel Processing, 1985.

[PaKn89] P. G. Paulin and J. P. Knight. Force-Directed Scheduling for the Behavioral Synthesis of ASIC's. IEEE Trans. on CAD, Vol. 8, No. 6, June 1989.

[PaPa88] N. Park and A. C. Parker. Sehwa: A Software Package for Synthesis of Pipelines from Behavioral Specifications. IEEE Trans. on CAD, Vol. 7, No. 3, March 1988.

[PLNG90] R. Potasman, J. Lis, A. Nicolau and D. Gajski. Percolation Based Synthesis. Proc. of the ACM/IEEE 27th Design Automation Conference, June 1990.

[TjFl70] G. S. Tjaden and M. J. Flynn. Detection and parallel execution of independent instructions. IEEE Trans. on Computers, Vol. 19, No. 10, October 1970.

[Tr87] H. Trickey. Flamel: A High-Level Hardware Compiler. IEEE Trans. on CAD, Vol. 6, No. 2, March 1987.
